Shi Wu Chu Li Gai Nian Yu Ji Shu

Jim Gray, Andreas Reuter

Mentioned 12

A comprehensive presentation of the key concepts and techniques of transaction processing. The authors provide a description of the transaction concepts and how it fits in a distributed computing environment, as well as a thorough discussion of the complex issues related to transaction recovery. The book will be invaluable to anyone interested in using or implementing distributed systems or client server systems.

More on

Mentioned in questions and answers.

I have a question about SQL and locking strategies. As an example, suppose I have a view counter for the images on my website. If I have a sproc or similar to perform the following statements:

UPDATE images SET counter=counter+1 WHERE image_id=some_parameter;

Assume that the counter for a specific image_id has value '0' at time t0. If two sessions updating the same image counter, s1 and s2, start concurrently at t0, is there any chance that these two sessions both read the value '0', increase it to '1' and both try to update the counter to '1', so the counter will get value '1' instead of '2'?

s1: begin
s1: begin
s1: read counter for image_id=15, get 0, store in temp1
s2: read counter for image_id=15, get 0, store in temp2
s1: write counter for image_id=15 to (temp1+1), which is 1 
s2: write counter for image_id=15 to (temp2+1), which is also 1
s1: commit, ok
s2: commit, ok

End result: incorrect value '1' for image_id=15, should have been 2.

My questions are:

  1. Is this scenario possible?
  2. If so, does the transaction isolation level matter?
  3. Is there a conflict resolver which would detect such a conflict as an error?
  4. Can one use any special syntax in order to avoid a problem (something like Compare And Swap (CAS) or explicit locking techniques)?

I'm interested in a general answer, but if there are none I'm interested in MySql and InnoDB-specific answers, since I'm trying to use this technique to implement sequences on InnoDB.

EDIT: The following scenario could also be possible, resulting in the same behavior. I'm assuming that we are in isolation level READ_COMMITED or higher, so that s2 gets the value from the start of the transaction although s1 already wrote '1' to the counter.

s1: begin
s1: begin
s1: read counter for image_id=15, get 0, store in temp1
s1: write counter for image_id=15 to (temp1+1), which is 1 
s2: read counter for image_id=15, get 0 (since another tx), store in temp2
s2: write counter for image_id=15 to (temp2+1), which is also 1
s1: commit, ok
s2: commit, ok

If the locking is not done properly it certainly is possible to get this type race condition, and the default locking mode (read committed) does allow it. In this mode, the reads only place a shared lock on the record, so they can both see 0, increment it and write 1 out to the database.

In order to avoid this race condition, you need to set an exclusive lock on the read operation. 'Serializable' and 'Repeatable Read' concurrency modes will do this, and for an operation on a single row they are pretty much equivalent.

To make it completely atomic you have to:

  • Set an appropriate transaction isolation level such as Serializable. Normally you can do this from a client library or explicilty in SQL.
  • Begin the transaction
  • Read the data
  • Update it
  • Commit the transaction.

You can also force an exclusive lock on the read with a HOLDLOCK (T-SQL) or equivalent hint, depending on your SQL dialect.

A single update query will do this atomically but you can't split the operation (perhaps to read the value and return it to the client) without ensuring that the reads take out an exclusive lock. You will need to get the value out atomically in order to implement a sequence, so the update by itself is probably not quite all you need. Even with the atomic update, you still have a race condition to read the value after the update. The read will still have to take place within a transaction (storing what it got in a variable) and issue an exclusive lock during the read.

Note that to do this without creating a hot spot your database needs to have proper support for autonomous (nested) transactions within a stored procedure. Note that sometimes 'nested' is used to refer to chaining transactions or save points, so the term can be a bit confusing. I've edited this to refer to autonomous transactions.

Without autonomous transactions your locks are inherited by the parent transaction, which can roll back the whole lot. This means they will be held until the parent transaction commits, which can turn your sequence into a hot spot that serialises all transactions using that sequence. Anything else trying to use the sequence will block until the whole parent transaction commits.

IIRC Oracle supports autonomous transactions but DB/2 didn't until fairly recently and SQL Server doesn't. Off the top of my head I don't know whether InnoDB supports them, but Grey and Reuter go on at some length about how difficult they are to implement. In practice I'd guess it's quite likely that it might not. YMMV.

I am trying to understand isolation/locks in SQL Server.

I have following scenario in READ COMMITTED isolation level(Default)

We have a table.

create table Transactions(Tid int,amt int)

with some records

insert into Transactions values(1, 100)
insert into Transactions values(2, -50)
insert into Transactions values(3, 100)
insert into Transactions values(4, -100)
insert into Transactions values(5, 200)

Now from msdn i understood

When a select is fired shared lock is taken so no other transaction can modify data(avoiding dirty read).. Documentation also talks about row level, page level, table level lock. I thought of following scenarion

Begin Transaction

select * from Transactions

some buisness logic which takes 5 minutes



What I want to understand is for what duration of time shared lock would be acquired and which (row, page, table).

Will lock will be acquire only when statement select * from Transactions is run or would it be acquire for whole 5+ minutes till we reach COMMIT.

You are asking the wrong question, you are concerned about the implementation details. What you should think of and be concerned with are the semantics of the isolation level. Kendra Little has a nice poster explaining them: Free Poster! Guide to SQL Server Isolation Levels.

Your question should be rephrased like:

select * from Items

Q: What Items will I see?
A: All committed Items

Q: What happens if there are uncommitted transactions that have inserted/deleted/update Items?
A: your SELECT will block until all uncommitted Items are committed (or rolled back).

Q: What happens if new Items are inserted/deleted/update while I run the query above?
A: The results are undetermined. You may see some of the modifications, won't see some other, and possible block until some of them commit.

READ COMMITTED makes no promise once your statement finished, irrelevant of the length of the transaction. If you run the statement again you will have again exactly the same semantics as state before, and the Items you've seen before may change, disappear and new one can appear. Obviously this implies that changes can be made to Items after your select.

Higher isolation levels give stronger guarantees: REPEATABLE READ guarantees that no item you've selected the first time can be modified or deleted until you commit. SERIALIZABLE adds the guarantee that no new Item can appear in your second select before you commit.

This is what you need to understand, no how the implementation mechanism works. After you master these concepts, you may ask the implementation details. They're all described in Transaction Processing: Concepts and Techniques.

reading about NoSQL (, a movement aimed at encouraging the dropping of traditional relational databases in favor of custom, application-suited storage systems.

Intrigued by the idea of trying to write a small personal storage system (for the .net framework) as a learning pet project, what are you suggestions or useful links? Where to start? How to balance what's on the hard drive and what's in memory?

I think this could be an interesting opportunity to learn the insides of database inner work, but I really lack the most basic theory of it. Thanks.

The NoSQL movement is aimed at huge scale systems, at sizes where the relational model truly breaks. Before you start writing your own storage I highly recommend understanding the relational model, as is one of the best documented and well understood domains in CS. Start with the Gray's and Reuter's Transaction Processing, this book explains everything there is to know about implementing a classic RDBMS. Next on your list should be the Readings in Database Systems, this is a collection of the most relevant scientific papers and articles.

Well it all depends on the app you are building.

For example, if your app just needs to persist a few hundred objects and cut through them in a few ways and doesn't care if stuff gets corrupt once in a while. You could potentially just use LINQ to query a List and persist the List to disk once in a while.

If you need anything that has the magic ACID properties, well its going to take tons of work.

If you need something that supports Transactions, its going to take tons of work.

If you need something that understands the ANSI-SQL, you are going to have to write a parser, which is lots of work.

Before embarking on writing any kind of database I think you should understand a lot of database theory, get a book, read it.

Does anyone know of a resource that will tell me the sequence of locks that will be taken out on a table/page/row/index during a select/insert/update/delete in SQL Server 2005 AND how different table hints and isolation levels will impact the locks taken?

I know I am asking a lot here, but surely this information must be documented somewhere?

Thanks in advance,


SQL Server locking is based on the concepts in Transaction Processing: Concepts and Techniques. This book explains in great detail how locks are to be acquired, what locks are needed and why things must be the way they are.

The resources Marc linked are good coverage on the topic, but the details are scattered and you need to know where to look. Here is a primer to start you up:

The transaction isolation levels only affect read locks. Under normal read committed when reading a row an S-lock is acquired that is released immediately after the read. If the isolation level is elevated to repeatable read then the S-locks are held until the transaction ends. On higher serializable level range locks are placed instead of simple row locks, and they are held until the transaction commits. The snapshot modes are different in that they don't necessarily affect the type of lock, but the source of the read: rows are retrieved from the version store instead.

Lock order/hierarchy is always the same:

  • an Sch-S lock is placed on the metadata at the start of any DML operation. DDL operations require Sch-M locks and thus conflict, so DML can be assured of the 'stability' of the schema on which it operates (object schema, not database schema...).
  • The lock hierarchy path to a row is table-page-row. The actual granularity decided by the engine is dynamic. Typically is row.
  • No matter the granularity, the path to the actual locked resource is protected with intent locks. Ie. to S-lock a row, the reader must acquire IS-lock on the table and the page. To S-lock a page, it needs an IS-lock on table.
  • Single partition operations acquiring more that 5000 locks on a scan may trigger lock escalation. Escalation is always an attempt (ie. will never block if failed). Escalation in practice goes always from row-level locking to table (partition in 2008) level locking.

The lock hints can never change the order of locks, they can only change:

  • the type of lock (U-lock or X-lock when an S-lock would be required)
  • the granularity (enforce table, or page or row)
  • the duration (hold S-locks)
  • the blocking behavior (readpast to skip incompatible rows).

I did not talk too much about insert/update/deletes since they are quite uninteresting: they require X locks, that's it. The only interesting thing about it is the way update works because it first acquire an U-lock that is later converted to an X-lock. This behavior is needed to leverage the U-lock asymmetry that allows pending S-locks to drain before the update proceeds.

With this I hope you can go and find all the details left out from the articles and books linked.

How about these:

UPDATE: how about these more on transaction isolation levels and query hints:

If you're interested in these rather advanced topics, I'd strongly recommend you get the SQL Server 2008 Internals book by Kalen Delaney (and others) which has all these nitty gritty details in them - even in this book, the "locking" topic only begin on pages 610 and up :-)

alt text


I have located many resources on the web giving general overviews of MVCC (multi-version concurrency control) concepts, but no detailed technical references on exactly how it should work or be implemented. Are there any documents online, or books offline, that contain enough theory (and a bit of practical help, ideally) on which to base an implementation? I wish to emulate more or less what PostgreSQL does.

(For info I will be implementing it in SAS using SAS/Share - which provides some locking primitives and concurrent read/write access to the underlying data store, but nothing in the way of transaction isolation or proper DBMS features. If anyone is familiar with SAS/Share and thinks this is an impossible task, please shout!)

Is there a really good book that covers in depth: two phase commit, paxos as well as limitations in achieving consistency, availability, partition tolerance.

Browsing Amazon it is amazing to see the number of distributed systems books that don't even cover paxos.

The thing with the Paxos algorithm is that it's fairly recent (published as a journal article in 1998 according to and classic books like Readings in Database Systems or Transaction Processing cover the other topics you care about, but actually pre-date Paxos!

Background to question: I'm looking to implement a caching system for my website. Currently we're exploring memcache as a means of doing this. However, I am looking to see if something similar exists for SQL Server. I understand that MySQL has query cache which although is not distributed works as a sort of 'stop gap' measure. Is MySQL query cache equivalent to the buffer cache in SQL Server?

So here are my questions:

  1. Is there a way to know is currently stored in the buffer cache?
  2. Follow up to this, is there a way to force certain tables or result sets into the cache
  3. How much control do I have over what goes on in the buffer and procedure cache? I understand there used to be a DBCC PINTABLE command but that has since been discontinued.
  4. Slightly off topic: Should the caching even exists on the database layer? Or it is more prudent to manage caches using Velocity/Memcache? Is so, why? It seems like cache invalidation is something of a pain when handling many objects with overlapping triggers.


SQL Server implements a buffer pool same way every database product under the sun does (more or less) since System R showed the way. The gory details are explain in Transaction Processing: Concepts and Techniques. I addition it has a caching framework used by the procedure cache, permission token cache and many many other caching classes. This framework is best described in Clock Hands - what are they for.

But this is not the kind of caching applications are usually interested in. The internal database cache is perfect for scale-up scenarios where a more powerfull back end database is able to respond faster to more queries by using these caches, but the modern application stack tends to scale out the web servers and the real problem is caching the results of query interogations in a cache used by the web farm. Ideally, this cache should be shared and distributed. Memcached and Velocity are examples of such application caching infrastructure. Memcache has a long history by now, its uses and shortcommings are understood, there is significant know-how around how to use it, deploy it, manage it and monitor it.

The biggest problem with caching in the application layer, and specially with distributed caching, is cache invalidation. How to detect the changes that occur in the back end data and mark cached entries invalid so that new requests don't use stale data.

The simplest (for some definition of simple...) alternative is proactive invalidation from the application. The code knows when it changes an entity in the database, and after the change occurs it takes the extra step to mark the cached entries invalid. This has several short commings:

  • Is difficult to know exactly which cached entries are to be invalidated. Dependencies can be quite complex, things are always more that just a simple table/entry, there are aggregate queries, joins, partitioned data etc etc.
  • Code discipline is required to ensure all paths that modify data also invalidate the cache.
  • Changes to the data that occur outside the application scope are not detected. In practice, there are always changes that occur outside the application scope: other applications using the same data, import/export and ETL jobs, manual intervention etc etc.

A more complicated alternative is a cache that is notified by the database itself when changes occur. Not many technologies are around to support this though, it cannot work without an active support from the database. SQL Server has Query Notifications for such scenarios, you can read more about it at The Mysterious Notification. Implementing QN based caching in a standalone application is fairly complicated (and often done badly) but it works fine when implemented correctly. Doing so in a shared scaled out cache like Memcached is quite a feats of strength, but is doable.

HP NonStop systems (previously known as "Tandem") are known for their high availability and reliability, and higher price.

How do Linux or Unix based clusters compare with them, in these respects and others?

On a fault-tolerant machine the fault tolerance is handled directly in hardware and transparent to the application. Programming a cluster requires you to explicitly handle the fault tolerance in the application.

In practice, a clustered application architecture is much more complex to build and error prone than an application built for a fault-tolerant platform such as NonStop. This means that there is a far greater scope for unreliability driven by application bugs, as the London Stock Exchange found out the hard way. They had an incumbent Tandem-based system, which was quite a common architecture for stock exchange trading applications. Their new CEO had the bright idea that Microsoft was the way forward and had a big-5 consultancy build a .Net system based on a cluster of 120 servers.

The problem with clustered applications is that the failures can be correlated. If an application or configuration bug exists in the system it will typically be replicated on all of the nodes. This means that you can get a single situation or event that can take out the whole cluster. The additional complexity of clustered applications makes them more error-prone to develop and deploy, which raises the odds of this happening. A clustered system built on (for example) Linux and J2EE is vulnerable to the same types of failure modes.

IMHO this is a major advantage of older-style mainframe architectures. Several vendors (IBM, HP, DEC and probably several others I can't think of) made fault tolerant systems. The underlying programming model for this type of system is somewhat simpler than a clustered n-tier application server. This means that there is comparatively little to go wrong and for a given amount of effort you can achieve a more reliable system. A surprising number of older architectures are still alive and well and living quite comfortably in their market niches. IBM still sell plenty of Z and I series machines; Unisys still makes the A Series and 2200 series; VMS and NonStop are still allive within HP. The sales of these systems are not all to existing clients - for example a Commercial Underwriting system (GENIUS) runs on the ISeries and is still a market leader in this niche with new rollouts going on as I write this. The application has survived two attempts to rewrite it (1 in in Java and 1 in .Net) that I am aware of and the 'Old School' platform doesn't really seem to be cramping its style.

I wouldn't go shorting any screen-scraper vendors just yet ...

Gray & Reuter's Transaction Processing: Concepts and Techniques is somewhat dry and academic, but has a good treatment of fault-tolerant systems architecture. One of the authors was a key player in the design of Tandem's systems.

I need to design some data structures for something that's roughly a cross between a database and a file system. In fact there's a large overlap in the design of file systems and databases. I've done some work on both, not very in-depth, but still, I have some idea about the topics. However I need more information and I am wondering if there's a good overview of design concepts out there. Things like overviews of algorithms, data structures and the like. The advantages of different types of trees or hash tables for key lookup. Perhaps some information about data ordering an alignment. In short, the experiences of other people in implementing filesystems and databases that would help the next person avoid the same mistakes.

Gray wrote a book titled "Transaction Processing: Concepts and Techniques" - - which covers a great deal of what you would need to build your own database.

One starting point for file systems would be - but I suspect you would have to chase down individual papers or Linux source code to keep up with the variety of different file systems created since then.

Is there a good C++ framework to implement XA distributed transactions?

With the term "good" I mean usable, simple (doesn't imply "easy"), well-structured.

Due to study reasons, at the moment I'm proceeding with a personal implementation, following X/Open XA specification.

Thank you in advance.

I am not aware of an open-source or free transaction monitor that has any degree of maturity, although This link does have some fan-out. The incumbent commercial ones are BEA's Tuxedo, Tibco's Enterprise Message Service (really a transactional message queue manager like IBM's MQ) and Transarc's Encina (now owned by IBM). These systems are all very expensive.

If you want to make your own (and incidentally make a bit of a name for yourself by filling a void in the open-source software space) get a copy of Grey and Reuter. This is the definitive work on transaction processing systems architecture, written by two of the foremost experts in the field.

Interestingly, they claim that one can implement a working TP monitor in around 10,000 lines of C. This actually sounds quite reasonable, as what it does is not all that complex. On occasion I have been tempted to try.

Essentially you need to make a distributed transaction coordinator that runs as a daemon process. You will need to get the resource manager protocol working from it, so starting with this as a prototype is probably a good start. If you can get it to independently roll back or commit a transaction you have the basis of the TM-RM interface.

The XA API as defined in the spec is the API to control the transaction manager. Strictly speaking, you don't need to make a 3-tier architecture to use distributed transactions of this sort, but they are more or less pointless without a TP monitor. How you communicate from the front-end to the middle-tier can be left as an exercise for the reader. You are probably best off using an existing ORB, of which there are several good open-source implementations available.

Depending on whether you want to make the DTC and the app server separate processes (which is possibly desirable for stability but not strictly necessary) you could also use ACE as a basis for the DTC server.

If you want to make a high-performance middle-tier server, check out Douglas Schmidt's ACE framework. This comes with an ORB called TAO, and is flexible enough to allow you to use more or less any threading model that takes your fancy. Using this is a trade-off between learning it and the effort of writing your own and debugging all the synchronisation and concurrancy issues.

I’ve lately come across a rather frustrating situation where SQL server refuses to issue locks only against a primary key when a statement like this select * from table with (rowlock updlock) where key=value is executed against it. Now don’t get me wrong here, it does lock the row but it goes one step farther and locks the table too.

I’ve read up about SQL lock escalation and I’ve looked at using specific indexes in the lock hints but you see, this just isn’t practical when there are numerous indexes in a table with millions of records and concurrent updates that need to happen on those records. For a small table and a specific query it’s possible to get the desired behaviour, but when the table has a large width (many columns) and there are numerous processes using the data, this method melts down and can become a real point of contention.

What I’d like to see added is a new lockhint suck as PKLock (which would stand for Primary Key Lock) which would issue a lock against the primary key of a row and anytime an index, a table scan or other method is used to get the row, it would check this lock and honour it instead of locking the entire table.

As such a table lock would not need to be issued and this would greatly increase the capacity for parallel execution of code against the DB.

Please weigh in on this idea and point out any flaws which it might have, ways that it could be improved or other elements that should be added to resolve my dilemma.



If I execute this query

begin transaction
select lockname from locks  where lockname='A'
update Locks Set locked=1 where lockname='A'

and then this query:

begin transaction
select lockname from locks  where lockname='A'

Row A is returned in both examples before committing the transactions. This is reading behind the update, not blocking.

A successful solution should do the following without specifying indexes to use:

  1. With Query 1: Read and lock A, update A
  2. With Query 2: Read and lock B, update B, Commit query 2
  3. With Query 2: Read B and be blocked until Locks on A are released
  4. With Query 1: Commit Query 1
  5. With Query 2 :Read and lock A, update A, Commit query 2

You have asked this question before and were given the answer: fix your schema and your code. In that post the lock conflict was an IX lock, and a conflict on intent locks indicates high granularity locks, which in turn indicate table scans. You don't need lock hints, you just need an index and decent queries. Take for instance your other question, Why does row level locking not appear to work correctly in SQL server? where answer is trivial: Locks table needs to be organized by a clustered index on LockName:

CREATE TABLE [dbo].[Locks]( 
    [LockName] [varchar](50) NOT NULL, 
    [Locked] [bit] NOT NULL, 

insert into Locks (LockName, Locked) values ('A', 0);
insert into Locks (LockName, Locked) values ('B', 0);

On one session do this:

begin transaction
update Locks 
  set Locked=1 
  output inserted.*
  where LockName = 'A';

On the other session do this:

begin transaction
update Locks 
  set Locked=1 
  output inserted.*
  where LockName = 'B';

There is no update conflict, no blocking, no need for (wrong) hints, nothing. Just good ole' correct schema and query design.

As a side note, the lock you describe here already exists and are called key-locks. They are the default, implicit mode SQL Server operates. Just how in the world do you imagine SQL Server can publish TPC-C benchmark numbers of 16000 tpc transaction per second? You have all the parallelism capacity you need in the server, you just need to read a book or two to understand how to use it. There is plenty of literature on the subject, you can start with Transaction Processing: Concepts and Techniques.


begin transaction 
select lockname from locks  where lockname='A' 
update Locks Set locked=1 where lockname='A'

This will never work, no matter how many/diverse lock hints you try. This is why you have the update with output syntax:

begin transaction 
update Locks 
 Set locked=1 
 output inserted.*
 where lockname='A'

this ensures that you first update, then return what you've updated. This technique is fairly common in databases for exactly the semantics you seek: resource acquisition. In fact this technique is the cornerstone of the resource acquisition poster child: queue processing. See Queues paragraph in OUTPUT Clause. In queues you have a table of resources to be processed, and each thread grabs one, locks it and start processing:

create table Resources (
   id int identity(1,1) not null,
   enqueue_time datetime not null default getutcdate(),
   is_processing bit not null default 0,
   payload xml);

create clustered index cdxResources on Resources 
  (is_processing, enqueue_time);

-- enqueue:
insert into Resources (payload) values ('<do>This</do>');
insert into Resources (payload) values ('<do>That</do>');
insert into Resources (payload) values ('<do>Something</do>');
insert into Resources (payload) values ('<do>Anything</do>');

Now from separate sessions, run this:

begin transaction;
with cte as (
  select top(1) *
  from Resources with(readpast)
  where is_processing = 0
  order by enqueue_time)
update cte
  set is_processing = 1
  output inserted.*;   

You'll see each session grabs it's own resource, locks it and skipps everything locked by everybody else. It so happens I have in production a system that runs exactly like this, with over 5M resources int he table (they are web service payment processing requests), and dequeueing and processing around 50 per second, from 100 concurent processors (takes about 2sec. per call to process). On a piece of junk hardware. So it absolutely is possible.

I'm doing this paper work for university about DataWarehouse, more specifically about OLTP. I couldn't find much information on the web. I find general and superficial summaries, but nothing that coould give me the possibility to do a more detailed work.

I would really apreciate any help about finding that kind of information, which im a bit lost right now.

So, is there any pdf ou e-book or something which i can use to base my work?

Thanks in advanced, John

PS. about OLTP, datawarehouse i can find enought information

Thats true, there isn't so much papers about. Books are the best source for this topic.
I don't have so much information, but some links may help. Tell me if this is what you are looking for.
What kind of transaction are you interesting: cards, stock market?

Algorithms: Message passing, subscribe, AMQP.
Enterprise standards: PCI
Protocols: ISO 8583, FIX
A famous OLTP product: FIS
Books: a, b, 1, 2