SQL Tuning

Dan Tow

Mentioned 9 times.

Explains how to tune SQL (Structured Query Language) queries for optimal performance regardless of the database platform, covering topics such as how to understand and control SQL execution plans and how to diagram SQL queries to find the best execution plan, in a tutorial that includes hands-on exercises to reinforce the concepts. (Advanced)


Mentioned in questions and answers.

I have been using MySQL for years, mainly on smaller projects, until the last year or so. I'm not sure if it's the nature of the language or my lack of real tutorials, but I'm never sure whether what I'm writing is the proper way to do things for optimization and scaling purposes.

While self-taught in PHP, I'm very sure of myself and the code I write; I can easily compare it to others' work, and so on.

With MySQL, I'm not sure whether (and in what cases) an INNER JOIN or a LEFT JOIN should be used, nor am I aware of the large amount of functionality MySQL has. While I've written code for databases that handled tens of millions of records, I don't know if it's optimal. I often find that a small tweak will make a query take less than 1/10 of the original time... but how do I know that my current query isn't also slow?

I would like to become completely confident in this field -- in my ability to optimize databases and make them scale. Usage is not a problem; I use MySQL on a daily basis in a number of different ways.

So, the question is, what's the path? Reading a book? Website/tutorials? Recommendations?

As others have said, time and practice are the only real approach.

More practically, I found that EXPLAIN worked wonders for me personally. Learning to read the output of that was probably the biggest single leap I made in being able to write efficient queries.
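
If you haven't used it before, a minimal sketch of what that looks like in MySQL (the table and index names here are made up for illustration):

-- Hypothetical table for illustration.
CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT,
    created_at  DATETIME,
    KEY idx_customer (customer_id)
);

-- EXPLAIN shows how MySQL will execute the query: which index
-- it uses (the key column), roughly how many rows it expects to
-- examine (the rows column), and warning signs such as "Using
-- filesort" or "Using temporary" in the Extra column.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;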

The second thing I found really helpful was SQL Tuning by Dan Tow, which describes a fairly formal methodology for extracting performance. It's a bit involved, but works well in lots of situations. And if nothing else, it will give you a much better understanding of the way joins are processed.

I have two insert statements, almost exactly the same, which run in two different schemas on the same Oracle instance. What the insert statement looks like doesn't matter - I'm looking for a troubleshooting strategy here.

Both schemas have 99% the same structure; a few columns have slightly different names, but other than that they're the same. The insert statements are almost exactly the same. The explain plan on one gives a cost of 6, the explain plan on the other gives a cost of 7. The tables involved in both sets of insert statements have exactly the same indexes. Statistics have been gathered for both schemas.

One insert statement inserts 12,000 records in 5 seconds.

The other insert statement inserts 25,000 records in 4 minutes 19 seconds.

The number of records being inserted is correct. It's the vast disparity in execution times that confuses me. Given that nothing stands out in the explain plan, how would you go about determining what's causing this disparity in runtimes?

(I am using Oracle 10.2.0.4 on a Windows box).

Edit: The problem ended up being an inefficient query plan, involving a Cartesian merge that didn't need to be done. Judicious use of index hints and a hash join hint solved the problem. It now takes 10 seconds. SQL Trace / TKPROF pointed me in the right direction, as it showed me how many seconds each step in the plan took and how many rows were being generated. TKPROF showed me:

Rows     Row Source Operation
-------  ---------------------------------------------------
  23690  NESTED LOOPS OUTER (cr=3310466 pr=17 pw=0 time=174881374 us)
  23690   NESTED LOOPS  (cr=3310464 pr=17 pw=0 time=174478629 us)
2160900    MERGE JOIN CARTESIAN (cr=102 pr=0 pw=0 time=6491451 us)
   1470     TABLE ACCESS BY INDEX ROWID TBL1 (cr=57 pr=0 pw=0 time=23978 us)
   8820      INDEX RANGE SCAN XIF5TBL1 (cr=16 pr=0 pw=0 time=8859 us)(object id 272041)
2160900     BUFFER SORT (cr=45 pr=0 pw=0 time=4334777 us)
   1470      TABLE ACCESS BY INDEX ROWID TBL1 (cr=45 pr=0 pw=0 time=2956 us)
   8820       INDEX RANGE SCAN XIF5TBL1 (cr=10 pr=0 pw=0 time=8830 us)(object id 272041)
  23690    MAT_VIEW ACCESS BY INDEX ROWID TBL2 (cr=3310362 pr=17 pw=0 time=235116546 us)
  96565     INDEX RANGE SCAN XPK_TBL2 (cr=3219374 pr=3 pw=0 time=217869652 us)(object id 272084)
      0   TABLE ACCESS BY INDEX ROWID TBL3 (cr=2 pr=0 pw=0 time=293390 us)
      0    INDEX RANGE SCAN XIF1TBL3 (cr=2 pr=0 pw=0 time=180345 us)(object id 271983)

Notice the rows where the operations are MERGE JOIN CARTESIAN and BUFFER SORT. The things that keyed me into looking at this were the number of rows generated (over 2 million!) and the amount of time spent on each operation compared to the other operations.
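
For reference, a sketch of how such a trace is produced (the commands are standard Oracle; the trace file name below is illustrative):

-- Enable SQL trace for the current session (Oracle 10g):
ALTER SESSION SET timed_statistics = TRUE;
ALTER SESSION SET sql_trace = TRUE;

-- ... run the slow statement here, then disable tracing:
ALTER SESSION SET sql_trace = FALSE;

-- The raw trace file is written to the server's user_dump_dest
-- directory and formatted with TKPROF from the OS shell, e.g.:
--   tkprof ora10g_ora_1234.trc report.txt sys=no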

Another good reference that presents a general technique for query tuning is the book SQL Tuning by Dan Tow.

We are trying to optimize some of our queries.

One query is doing the following:

SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
INTO [#Gadget]
FROM task t

SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID) as Client 
FROM [#Gadget]
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END , Date ASC

DROP TABLE [#Gadget]

(I have removed the complex subquery. I don't think it's relevant other than to explain why this query has been done as a two stage process.)

I thought it would be far more efficient to merge this down into a single query using subqueries as:

SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID)
FROM
(
    SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
    FROM task t
) as sub    
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END , Date ASC

This would give the optimizer better information to work out what was going on and avoid any temporary tables. I assumed it would be faster.

But it turns out it is a lot slower. 8 seconds vs. under 5 seconds.

I can't work out why this would be the case, as all my knowledge of databases implies that subqueries should always be faster than using temporary tables.

What am I missing?

Edit --

From what I have been able to see from the query plans, both are largely identical, except that the temp-table version has an extra "Table Insert" operation with a cost of 18%.

Obviously, since the temp-table method is split into two queries, the cost of the Sort Top N in its second query is a lot higher than the cost of the Sort in the subquery method, so it is difficult to make a direct comparison of the costs.

Everything I can see from the plans would indicate that the subquery method would be faster.

Obviously, SQL Server is choosing the wrong query plan. Yes, that can happen; I've hit exactly the same scenario as you a few times.

The problem is that optimizing a query (you mention a "complex subquery") is a non-trivial task: if you have n tables, there are roughly n! possible join orders -- and that's just the beginning. So it's quite plausible that running (a) your inner query first and (b) your outer query second is a good way to go, but SQL Server cannot deduce this information in reasonable time.

What you can do is to help SQL Server. As Dan Tow writes in his great book "SQL Tuning", the key is usually the join order, going from the most selective to the least selective table. Using common sense (or the method described in his book, which is a lot better), you could determine which join order would be most appropriate and then use the FORCE ORDER query hint.
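
As a hedged sketch of that hint (the table and column names here are hypothetical; the point is that FORCE ORDER makes the FROM-clause order the join order):

-- List the most selective table first; with OPTION (FORCE ORDER),
-- SQL Server joins the tables in the order they appear.
SELECT o.OrderID, c.Name
FROM Customers c                               -- most selective first
INNER JOIN Orders o ON o.CustomerID = c.CustomerID
WHERE c.Name = 'Acme'
OPTION (FORCE ORDER);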

Anyway, every query is unique; there is no "magic button" to make SQL Server faster. If you really want to find out what is going on, you need to look at (or show us) the query plans of your queries. Other interesting data is shown by SET STATISTICS IO, which will tell you how much (costly) disk access your query produces.
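
For example (the SET switches are standard T-SQL; the query in the middle is just a placeholder for the statement under test):

SET STATISTICS IO ON;    -- logical/physical reads per table
SET STATISTICS TIME ON;  -- parse/compile and execution times

SELECT TOP 500 * FROM [#Gadget];  -- the query under test

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;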

What are some of the methods/techniques experienced SQL developers use to determine whether a particular SQL query will scale well as load increases, rows in associated tables increase, etc.?

I strongly recommend reading some reference material on this. The book linked below is probably a pretty good one to look into. Make sure to look under "Selectivity", among other topics.

SQL Tuning - Dan Tow

Let's say I have a poorly performing query. How do you usually go about SQL optimization? What are the first things I should look at in the query execution plan? Is there a good article or book about this?

The Performance Tuning Guide is a great place to start, but Jonathan Lewis's Cost-Based Oracle Fundamentals is the canonical reference on what the optimizer is doing and why. Depending on the complexity of the problem, CBO Fundamentals may be radical overkill, though.

As a first pass, if you look at the estimated cardinality of each step in the plan, it's useful to ask whether that cardinality is reasonable. If the cardinalities are reasonable, it's likely that Oracle is picking the most efficient plan and you need to look for other approaches to tuning the query. If the cardinalities are wildly incorrect, on the other hand, it's likely that Oracle has chosen a poorly performing query plan and that something about the statistics needs to be adjusted.
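
A quick way to see those estimated cardinalities (both statements are standard Oracle; the query itself is a placeholder):

-- Generate the estimated plan without running the query; the
-- Rows column of the output is the cardinality estimate per step.
EXPLAIN PLAN FOR
SELECT * FROM emp WHERE deptno = 10;   -- placeholder query

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);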

In order to make use of the execution plan (which is a description of how the database will implement your request), one has to understand what options are available to it (the database) and make a judgement on whether the choice made by the optimizer was the correct one. This requires a lot of knowledge and experience.

For the work I'm doing (ETL processing), performance "problems" usually fall into one of two categories:

  1. The query takes a long time because reading lots of data takes lots of time :)
  2. The optimizer made a mistake and picked the wrong execution plan

For (1), I have to decide whether I can restructure the data so that I scan less of it. Indexes are seldom of use, since I'm interested in subsets large enough to make indexed access slower than a full table scan. For example, I might store a horizontal subset of the data (the last 30 days), a vertical subset of the data (10 columns instead of 80), or an aggregate of the data. Either way reduces the size of the data and increases processing speed. Of course, if the data is used only once, I've just moved the problem elsewhere.
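
A sketch of the subset idea (the table and column names are made up):

-- Materialize a smaller working copy so the ETL scans less data.
CREATE TABLE sales_recent AS
SELECT sale_id, customer_id, amount, sale_date  -- vertical subset
FROM   sales
WHERE  sale_date >= SYSDATE - 30;               -- horizontal subset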

For (2), I usually start by checking the cardinality (num rows) on the top line of the xplan. If I know that my query returns 5,000,000 rows but it says 500, I can be pretty sure the optimizer messed up somewhere. If the total cardinality is in the right ballpark, I start from the other end instead and check each step of the plan until I find the first big estimation error. If the cardinality is wrong, the join method between that table and the next is probably wrong as well, and the error will cascade through the rest of the plan.
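
On 10g and later you can compare estimates against reality directly (the hint and the DBMS_XPLAN call are real; the table and column names are placeholders):

-- Run the statement once with runtime statistics collection...
SELECT /*+ gather_plan_statistics */ *
FROM   big_table
WHERE  some_col = 42;

-- ...then display the plan with estimated (E-Rows) next to actual
-- (A-Rows) cardinalities; a big discrepancy marks the first
-- estimation error to investigate.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));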

Google for "Tuning by cardinality feedback", and you will find a paper written by Wolfgang Breitling who describes (in a much better way) the abobve approach. It's a really good read!

Also, be sure to hang around Jonathan Lewis's blog. If there is something about the Oracle optimizer he doesn't know, it's not worth knowing. He has written the best book on the subject as well: check out Cost-Based Oracle Fundamentals. If I could send one book back in time to myself, this would be it.

Expert Oracle Database Architecture by Tom Kyte (the man behind "Ask Tom") is an awesome read as well. My first read of this book was a disappointment, because I was looking for "tuning tips" and found none. On my second read I realized that by knowing how the database works, you can eliminate entire classes of performance problems by "designing for performance" from the start instead of "adding" performance at the end :)

SQL Tuning by Dan Tow is another awesome read on the fundamentals of how exactly one determines the optimal execution plan. That knowledge can be used to troubleshoot an execution plan (or to force the optimizer into doing it your way).

If you made it this far, now would be a good time to fiddle around with hints. Not before.

I've been pondering which of these two statements might have higher performance (and why):

select * from formelement 
where formid = (select id from form where name = 'Test')

or

select * 
from formelement fe 
inner join form f on fe.formid = f.id 
where f.name = 'Test'

One form contains several form elements, one form element is always part of one form.

Thanks,

Dennis

The performance depends on the query plan chosen by the SQL Server engine. The query plan depends on a lot of factors, including (but not limited to) the SQL itself, the exact table structure, the statistics of the tables, available indexes, etc.

Since your two queries are quite simple, my guess would be that they result in the same (or a very similar) execution plan, thus yielding comparable performance.

(For large, complicated queries, the exact wording of the SQL can make a difference, the book SQL Tuning by Dan Tow gives a lot of great advice on that.)
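
If you want to check for yourself, you can compare the estimated plans (SET SHOWPLAN_TEXT is standard T-SQL; repeat for each variant):

-- Show the estimated plan instead of executing the statement.
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM formelement
WHERE formid = (SELECT id FROM form WHERE name = 'Test');
GO
SET SHOWPLAN_TEXT OFF;
GO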

I am trying to learn SQL query optimization using SQL Server 2005. However, I have not found any practical examples where performance improves by tweaking the query alone. Can you please list some example queries, before and after optimization?

It has to be query tuning – not adding an index, not creating a covering index, not making any alteration to the table. No change is to be made to tables or indexes. [I understand that indexes are important. This is only for learning purposes.]

It would be great if you could explain using temp tables instead of referring to a sample database like AdventureWorks.

Expecting your support...

Thanks

Lijo

I suggest reading a good book, such as SQL Tuning.

I want to know what happens when I use a WHERE clause together with a JOIN.

In fact, what I want to know is the order of execution. For instance, consider the code below:

select * from tablename1 t1 inner join tablename2 t2 on t1.id=t2.id where t1.id='10'

What happens first in the above code?

Does SQL select all the rows whose id is 10 and then run the join? Or does the join run first, and then the WHERE clause?

To answer your question: the join runs first. But if you want to filter early, try this query:

select *
from (
  select *
  from tablename1
  where id = '10'
) t1,
(
  select *
  from tablename2
  where id = '10'
) t2

Update: As others suggest, checking the execution plan will help you understand how the query performs, but if you want to be certain, test it yourself.

Update 2: Thanks to @GiorgiNakeuri for his points. As mentioned in the comments, after this change to the inner queries an INNER JOIN is no longer needed; a simple Cartesian join is enough.

Read The Art of SQL for more info on SQL optimization.

For more advanced knowledge, try SQL Tuning.

I have been looking for some general tips and tricks for query optimization in SQL Server 2008.

I came across the advice below, and I'm quite unsure what the consequences of using views / triggers are:

  • Try to replace views with the original source table
  • Triggers should not be used if possible; incorporate the logic of the trigger into a stored procedure

Can someone shed light on these, explaining the pros and cons, and also point me to a list of general tips and tricks for query optimization?

A view is a virtual table formed from a query such as 'SELECT a, b FROM tableC'. This creates a 'virtual table' of a and b. A source table is the table itself, and it is far more efficient to 'SELECT * FROM tableC' so that a virtual table is not created.
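
To make the distinction concrete (the view name vAB is made up for this example):

-- The view stores only the query definition, not the data.
CREATE VIEW vAB AS
SELECT a, b FROM tableC;

SELECT * FROM vAB;     -- goes through the view definition
SELECT * FROM tableC;  -- reads the source table directly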

Triggers are actions that fire automatically when a specified event happens, such as a particular statement against a table. Stored procedures are compiled and can do the same work as a trigger, except they are far more efficient. In short, do not use triggers.
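
A minimal sketch of folding trigger logic into a procedure (all table, column, and procedure names here are hypothetical):

-- Instead of:  CREATE TRIGGER trOrderAudit ON Orders AFTER INSERT ...
-- put the trigger's logic directly in the procedure that does the insert:
CREATE PROCEDURE InsertOrder
    @CustomerID INT,
    @Amount     MONEY
AS
BEGIN
    INSERT INTO Orders (CustomerID, Amount)
    VALUES (@CustomerID, @Amount);

    -- What the audit trigger would have done:
    INSERT INTO OrderAudit (CustomerID, Amount, ChangedAt)
    VALUES (@CustomerID, @Amount, GETDATE());
END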

I would seriously consider this book: http://www.amazon.com/SQL-Tuning-Dan-Tow/dp/0596005733/ref=sr_1_1?ie=UTF8&qid=1345679033&sr=8-1&keywords=sql+optimization