Database Management Systems

Raghu Ramakrishnan, Johannes Gehrke

Mentioned 7

Database Management Systems provides comprehensive and up-to-date coverage of the fundamentals of database systems. Coherent explanations and practical examples have made this one of the leading texts in the field. The third edition continues in this tradition, enhancing it with more practical material. The new edition has been reorganized to allow more flexibility in the way the course is taught. Now, instructors can easily choose whether they would like to teach a course which emphasizes database application development or a course that emphasizes database systems issues. New overview chapters at the beginning of parts make it possible to skip other chapters in the part if you don't want the detail. More applications and examples have been added throughout the book, including SQL and Oracle examples. The applied flavor is further enhanced by the two new database applications chapters.

More on Amazon.com

Mentioned in questions and answers.

I'm a little confused about data consistency and data integrity. From Oracle Database Concepts:

data integrity
--------------
Business rules that dictate the standards for acceptable data. These rules
are applied to a database by using integrity constraints and triggers to
prevent invalid data entry.

From Wikipedia

Consistency
-----------
Consistency states that only valid data will be written to the database.

So what's the difference between data consistency and data integrity?

Thanks in advance.

From here

Data consistency

Execution of transaction in isolation preserves the consistency of the data. The data value stored in the database must satisfy certain consistency constraints. For example the balance on an account should not fall below $100. The DDL commands provide the facilities to specify such constraints. The database systems check these constraints every- time the database is updated. In general it can be an arbitrary predicate pertaining to the database. To check the consistency of database most of programmer depends on the constraints and these are usually costly to test.

Data integrity

The integrity of data refers to as "whole data at one place". The data is in the form of tables and each table has some attributes. These are the cases where we wish to ensure that a value that appears in one relation for given set of attributes should also appear for a certain set of attributes in another relation (referential integrity). Database modification can cause violation of integrity. When the integrity constraint is violated, the normal procedure rejects the action that has actually caused the violation.

enter image description here

You can read first chapter for ACID.

In this book I'm currently reading while following a course on databases, the following example of an illegal query using an aggregate operator is given:

Find the name and age of the oldest sailor.

Consider the following attempt to answer this query:

SELECT S.sname, MAX(S.age)
FROM Sailors S

The intent is for this query to return not only the maximum age but also the name of the sailors having that age. However, this query is illegal in SQL--if the SELECT clause uses an aggregate operation, then it must use only aggregate operations unless the query contains a GROUP BY clause!

Some time later while doing an exercise using MySQL, I faced a similar problem, and made a mistake similar to the one mentioned. However, MySQL didn't complain and just spit out some tables which later turned out not to be what I needed.

Is the query above really illegal in SQL, but legal in MySQL, and if so, why is that? In what situation would one need to make such a query?

Further elaboration of the question:

The question isn't about whether or not all attributes mentioned in a SELECT should also be mentioned in a GROUP BY. It's about why the above query, using atributes together with aggregate operations on attributes, without any GROUP BY is legal in MySQL.

Let's say the Sailors table looked like this:

+----------+------+
| sname    | age  |
+----------+------+
| John Doe |   30 |
| Jane Doe |   50 |
+----------+------+

The query would then return:

+----------+------------+
| sname    | MAX(S.age) |
+----------+------------+
| John Doe |         50 |
+----------+------------+

Now who would need that? John Doe ain't 50, he's 30! As stated in the citation from the book, this is a first attempt to get the name and age of the oldest sailor, in this example, Jane Doe at the age of 50.

SQL would say this query is illegal, but MySQL just proceeds and spits out "garbage". Who would need this kind of result? Why does MySQL allow this little trap for newcomers?

Based on a link which a_horse_with_no_name provided in a comment, I have arrived at my own answer:

It seems that the MySQL way of using GROUP BY differs from the SQL way, in order to permit leaving out columns, from the GROUP BY clause, when they are functionally dependant on other included columns anyways.

Lets say we have a table displaying the activity of a bank account. It's not a very thought-out table, but it's the only one we have, and that will have to do. Instead of keeping track of an amount, we imagine an account starts at '0', and all transactions to it is recorded instead, so the amount is the sum of the transactions. The table could look like this:

+------------+----------+-------------+
| costumerID | name     | transaction |
+------------+----------+-------------+
|       1337 | h4x0r    |         101 |
|         42 | John Doe |         500 |
|       1337 | h4x0r    |        -101 |
|         42 | John Doe |        -200 |
|         42 | John Doe |         500 |
|         42 | John Doe |        -200 |
+------------+----------+-------------+

It is clear that the 'name' is functionally dependant on the 'costumerID'. (The other way around would also be possible in this example.)

What if we wanted to know the costumerID, name and current amount of each customer?

In such a situation, two very similar queries would return the following right result:

+------------+----------+--------+
| costumerID | name     | amount |
+------------+----------+--------+
|         42 | John Doe |    600 |
|       1337 | h4x0r    |      0 |
+------------+----------+--------+

This query can be executed in MySQL, and is legal according to SQL.

SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID, name

This query can be executed in MySQL, and is NOT legal according to SQL.

SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID

The following line would make the query return and error instead, since it would now have to follow the SQL way of using aggregation operations and GROUP BY:

SET sql_mode = 'ONLY_FULL_GROUP_BY';

The argument for allowing the second query in MySQL, seems to be that it is assumed that all columns mentioned in SELECT, but not mentioned in GROUP BY, are either used inside an aggregate operation, (the case with 'transaction'), or are functionally dependent on other included columns, (the case with 'name'). In the case of 'name', we can be sure that the correct 'name' is chosen for all group entries, since it is functionally dependant on 'costumerID', and therefore there is only one possibly name for each group of costumerID's.

This way of using GROUP BY seems flawed tough, since it doesn't do any further checks on what is left out from the GROUP BY clause. People can pick and choose columns from their SELECT statement to put in their GROUP BY clause as they see fit, even if it makes no sense to include or leave out any particular column.

The Sailor example illustrates this flaw very well. When using aggregation operators (possibly in conjunction with GROUP BY), each group entry in the returned set has only one value for each of its columns. In the case of Sailors, since the GROUP BY clause is left out, the whole table is put into one single group entry. This entry needs a name and a maximum age. Choosing a maximum age for this entry is a no-brainer, since MAX(S.age) only returns one value. In the case of S.sname though, wich is only mentioned in SELECT, there are now as many choices as there are unique sname's in the whole Sailor table, (in this case two, John and Jane Doe). MySQL doens't have any clue which to choose, we didn't give it any, and it didn't hit the brakes in time, so it has to just pick whatever comes first, (Jane Doe). If the two rows were switched, it would actually give "the right answer" by accident. It just seems plain dumb that something like this is allowed in MySQL, that the result of a query using GROUP BY could potententially depend on the ordering of the table, if something is left out in the GROUP BY clause. Apparently, that's just how MySQL rolls. But still couldn't it at least have the courtesy of warning us when it has no clue what it's doing because of a "flawed" query? I mean, sure, if you give the wrong instructions to a program, it probably wouldn't (or shouldn't) do as you want, but if you give unclear instructions, I certainly wouldn't want it to just start guessing or pick whatever comes first... -_-'

how does indexing increases the performance of data retrieval?

How indexing works?

Database products (RDMS) such as Oracle, MySQL builds their own indexing system, they give some control to the database administrators however nobody exactly knows what happens on the background except people makes research in that area, so why indexing :

Put simply, database indexes help speed up retrieval of data. The other great benefit of indexes is that your server doesn't have to work as hard to get the data. They are much the same as book indexes, providing the database with quick jump points on where to find the full reference (or to find the database row).

There are many indexing techiques for example :

  • Primary indexing, secondary indexing
  • B-trees and variants (B+-trees,B*-trees)
  • Hashing and variants (linear hashing, spiral etc.)

for example, just think that you have a database with the primary keys are sorted (simply) and these all data is stored in blocks (in hdd) so everytime you want to access the data you don't want to increase the access time (sometimes called transaction time or i/o time) the indexing helps you which data is stored in which block by using these primary keys. Alice (primary key is names, not good example but just give an idea)

Alice
...
...
AZ...
Bob
Bri
...
Bza
...

Now you have an index in this index you only store Alice and Bob and the blocks they point, with this way users can access the data faster.The RDMS deals with the details.

I don't give the details but if you want to delve these topics, i offer you take an Database course or look at this popular book which is taught most of the universities.

Database Management Systems Ramakrishn CGherke

alt text

EDIT: Problem solved. It was my reading that was incorrect, not the SQL :p

Hi!

I was just reading in my database book about the LIKE SQL statement. It said this:

SELECT whatever FROM whereever WHERE 'Jeff' LIKE 'Jeff';

It continued to say that the statement 'Jeff' LIKE 'Jeff' would always return false. The book did not tell me why, nor can I find this anywhere else. Does this then mean that the following SQL also would return null?

SELECT W.name FROM whereever W WHERE w.name LIKE 'Jeff';

For reference, the book is: Database Management Systems: Ramakrishnan - Gehrke From McGRAW - Hill. International edition. ISBN 0-07-123151-X PAGE 140.

I took a look at that page (Amazon "search inside this book") and the key thing that you're missing is that the author is making a point there about whitespace. The book actually says

Thus, 'Jeff'='Jeff ' is true, while 'Jeff' LIKE 'Jeff ' is false.

Note the spaces!

Just to make it clear why the mistake occurred, here's the text:

enter image description here

Since the space is at the end of the line, it's hard to see. But as I say below, the quote mark wouldn't wrap to the next line if there were no space there.

I'm studying B+trees for indexing and I try to understand more than just memorizing the structure. As far as I understand the inner nodes of a B+tree forms an index on the leaves and the leaves contains pointers to where the data is stored on disk. Correct? Then how are lookups made? If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?

I read the wikipedia article on B+ trees and I understand the structure but not how an actual lookup is performed. Could you guide me perhaps with some link to reading material?

What are some other uses of B+ trees besides database indexing?

B+ trees are better than binary tree all the dbms use them,

a lookup in B+Tree is LOGF N being F the base of LOG and the fan out. The lookup is performed exactly like in a binary tree but with a bigger fan out and lower height thats why it is way better.

B+Tree are usually known for having the data in the leaf(if they are unclustered probably not), this means you dont have to make another jump to the disk to get the data, you just take it from the leaf.

B+Tree is used almost everywhere, Operating Systems use them, datawarehouse (not so much here but still), lots of applications.

B+Tree are perfect for range queries, and are used whenever you have unique values, like a primary key, or any field with low cardinality.

If you can get this book http://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638 its one of the best. Its basically the bible for any database guy.

this may be a dumb question or asked many times, i searched for it but did not find a proper answer.

What is exactly going on when we type on SQL engine

"Create table xxxx" 
how to implement this one in c++, i mean how it creates a variable dynamicalyy "xxxx" and store the data in it. If i queried
"select * from xxxx" 
how it go to variable "xxxx' and get the values. I already seen SQLite and Postgres, i am not able to understand those complex things. Can any one explain it in a simpler way. If possible with a little example. (Note: what ever the sql engine it is).

Databases use various disk-based data-structures, such as B-trees, Linear-Hash-Tables, and Heap Files, to store data on disk. It is an incredibly complex topic that requires a lot of study, but essentially depending on the type of data you want to store (whether it is fixed length or variable length), a common strategy used by many databases is to store variable length data on pages stored in a heap file.

The heap file is basically a sequence of fixed-length pages, each page containing slots for different records. Pages are stored on disk, but are loaded into memory using a paging sub-system, similar to the way a file-system works. In order to facilitate quick lookup of records stored on pages, an index structure (usually a B-tree or B+tree, but also possibly a hash table) is used to find the correct page and record number of the record you're looking for stored in the heap file.

The SQL query-engine is layered on top of the underlying data structures. The query engine parses the query, and then looks up the appropriate records using indexes. (There's many other facets involved here, but that's the gist of it.)

Needless to say, implementing a relational database in C++ (or any language) requires an enormous amount of work. If you're really interested, I would recommend reading Database Management Systems.

i am using transaction scope in my project. Isolation level is IsolationLevel.ReadCommitted
i Also tried IsolationLevel.RepeatableRead

using (DBOperations dboperation = new DBOperations())
{
  GetAccountVoucherNuber();
}

i am calling this method from Transaction Scope.

private void GetAccountVoucherNuber() {

    var _dB0010013 = _dB0010013Repository.GetAll().Where(e = > e.Category == "D" && e.CategoryType == "TSNO" && e.Branch == Branch && EntityFunctions.TruncateTime(e.OnDate) == EntityFunctions.TruncateTime(TranInit.EntryDate) && e.CodeOne == BatchCode).FirstOrDefault();
    if (_dB0010013 == null) {
        _dB0010013 = new DB0010013();
        _dB0010013.Branch = Branch;
        _dB0010013.Category = "D";
        _dB0010013.CategoryType = "TSNO";
        _dB0010013.CodeOne = BatchCode;
        _dB0010013.CodeTwo = "";
        _dB0010013.CodeThree = "";
        _dB0010013.OnDate = TranInit.EntryDate;
        _dB0010013.Note = "";
        _dB0010013.LastNo = 1;

        var _operationStatus = _dB0010013Repository.AddAndSave(_dB0010013, false);
    } else {
        _dB0010013.LastNo += 1;
        var _operationStatus = _dB0010013Repository.UpdateAndSave(_dB0010013, false);
    }

}


when two or more user submits the page at same time.
i am getting same no for two transaction.
eg. suppose user1 got lastNo =85 +=1 =86
at the same time user2 also get lastNo =85 +=1 =86
so the same no is applied for two different transactions.
How Can i Lock Select Statement after getting values. or what is way to handle it.
i googled a lot but didnt find any solution.

My DbOperataion Class added for reference.

public class DBOperations:IDisposable
    {
        private TransactionScope transactionscope;

        public void Dispose()
        {
            if (this.transactionscope != null)
                this.transactionscope.Dispose();
        }
        public DBOperations()
        {
            this.Initialize();
        }
        public void Initialize()
        {
            try
            {
                this.Dispose();

                TransactionOptions transactionoption = new TransactionOptions();
                transactionoption.Timeout = new TimeSpan(1, 0, 0);
               transactionoption.IsolationLevel = System.Transactions.IsolationLevel.ReadCommitted;

                this.transactionscope = new TransactionScope(TransactionScopeOption.Required, transactionoption);

            }
            catch (Exception ex)
            {
                throw new Exception("Data Error " + ex.Message);
            }
        }
}

I prefer to set up Transaction by TransactionOptions and TransactionScope classes. From Isolation Level on Wiki, the RepeatableRead level should be enough in your case.

using System.Transactions;
....

TransactionOptions options = new TransactionOptions();
options.IsolationLevel = IsolationLevel.RepeatableRead;
// Set timeout to avoid dead lock conditions.
options.Timeout = new TimeSpan(0, 5, 0); 

using (TransactionScope trans = new TransactionScope(TransactionScopeOption.Required, options)) {

    // Execute this operation in the transaction.
    GetAccountVoucherNuber();

    // Call `Complete()` explicitly about transaction commit. Or, it will do rollback..... 
    trans.Complete();
}

I suppose to you SHOULD call Complete() on your TransactionScope .

UPDATE

Read uncommitted will do read-lock and write-lock in the transaction. In other words, your question is about to avoid Non-repeatable reads, and the Read uncommitted or Serializable isolation level could avoid the problem. If you need to avoid Phantoms problem, then you need to use Serializable isolation level. Different isolation level will have different lock behaviors. About database locks, I am prefer this book, database management systems.

After transaction commit or rollback, it will release all resources it holds, including read-locks and write-locks.