Relational Database Index Design and the Optimizers

Tapio Lahdenmaki, Mike Leach

Mentioned 5

Improve the performance of relational databases with indexes designed for today's hardware Over the last few years, hardware and software have advanced beyond all recognition, so it's hardly surprising that relational database performance now receives much less attention. Unfortunately, the reality is that the improved hardware hasn't kept pace with the ever-increasing quantity of data processed today. Although disk packing densities have increased enormously, making storage costs extremely low and sequential read very fast, random reads are still painfully slow. Many of the old design recommendations are therefore no longer valid-the optimal point of indexing has come a long way. Consequently many of the old problems haven't actually gone away-they have simply changed their appearance. This book provides an easy but effective approach to the design of indexes and tables. Using lots of examples and case studies, the authors describe how the DB2, Oracle, and SQL Server optimizers determine how to access data, and how CPU and response times for the resulting access paths can be quickly estimated. This enables comparisons to be made of the various designs, and helps you choose available choices for the most appropriate design. This book is intended for anyone who wants to understand the issues of SQL performance or how to design tables and indexes effectively. With this title, readers with many years of experience of relational systems will be able to better grasp the implications that have been brought into play by the introduction of new hardware. An Instructor's Manual presenting detailed solutions to all the problems in the book is available online from the Wiley editorial department. An Instructor Support FTP site is also available.

More on Amazon.com

Mentioned in questions and answers.

I would like to know if there are general rules for creating an index or not. How do I choose which fields I should include in this index or when not to include them?

I know its always depends on the environment and the amount of data, but I was wondering if we could make some globally accepted rules about making indexes in Oracle.

Wow, that's just such a huge topic, it's hard to answer in this format. I srtongly recommend this book.

Relational Database Index Design and the Optimizers by Tapio Lahdenmaki

You don't just use indexes to make table access faster, sometimes you make indexes to avoid table access altogether. Something not mentioned yet but vital.

There's a whole science to this if you really want to make your database perform maximally.

Ah, one specific optimization to Oracle is building reverse key indexes. If you have a PK index of a monoatomically increasing value, like a sequence, and you have highly concurrent inserts and don't plan to range scan that column then make it a reverse key index.

See how specific these optimizations can be?

What are the things that you would consider when defining indexes, clustered and non-clustered, for SQL Server? Are there any anti-patterns that DB newbies should be aware of? Please explain the "Why" or provide references if possible.

Consider reading Relational Database Index Design and the Optimizers. It will give you a lot of ideas and the reasons why they are good.

How many records should there be before I consider indexing my sql tables?

If there's a unique constraint on the table (and there should be at least one), then that will usually be enforced by a unique index.

Otherwise, you add indexes when the query performance is bad and adding the index will demonstrably improve the performance. There are books on the subject of how to create good sets of indexes on tables, including Relational Database Index Design and the Optimizers. It will give you a lot of ideas and the reasons why they are good.

See also:

and, no doubt, a host of others.

Tabe1 has around 10 Lack records (1 Million) and does not contain any primary key. Retrieving the data by using SELECT command ( With a specific WHERE condition) is taking large amount of time. Can we reduce the time of retrieval by adding a primary key to the table or do we need to follow any other ways to do the same. Kindly help me.

It depends on the SELECT statement, and the size of each row in the table, the number of rows in the table, and whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the data columns that are needed are all present in a single index), and on whether the rows must be sorted.

If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.

Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:

I have identified the query constructs my users normally use. Would it make sense for me to create composite indexes to support those constructs and provide FIRST_ROWS capability?

If I migrate from SE to IDS, I will lose the ability to write low-level functions with C-ISAM calls, but gain FIRST_ROWS along with other goodies like: SET-READS for index scans (onconfig USE_[KO]BATCHEDREAD), optimizer directives, parallel queries, etc.


Information from Comments

Pawnshop production tables are queried by: customer.name char(30) using wildcards (LASSURF* to find LASTNAME SURNAME, FIRSTNAME) or queried by pawns.ticket_number INT. Customer and pawns are joined by: customer.name = pawns.name, not customer.serial = pawns.fk. Pawns with trx date older than 1 year are moved to historical table (>500K nrows) in a different database, on another hard disk. Index on historical is by trx_date descending. This is where the ad-hoc composite query constructs come into play.

Once a customer's pawn transaction is found, the row is updated when an intrest or redeem pymt is made by the customer. If customers don't make a pymt in 90 days, users will mananually update which pawns they will forfeit. pawns.status changes to inactive when a customer redeems a pawn or forfeits it for lack of pymt. inactives are moved out of pawns table into historical table when their trx dates are older than 1 year, so no mass-updating occurs in this app. Pawnshops run this proc every morning before opening business.

{ISQL 2.10.06E (SE-DOS16M protected mode) pawns table optimization - 
 once-daily, before start of business, procedure}

 unload to "U:\UNL\ACTIVES.UNL"
    select * from pawns where pawns.status = "A"
  order by pawns.cust_name, pawns.trx_date;

 unload to "U:\UNL\INACTIVE.UNL"
    select * from pawns
     where pawns.status <> "A"
       and pawns.trx_date >= (today - 365)
  order by pawns.cust_name, pawns.trx_date desc;

 unload to "U:\UNL\HISTORIC.UNL"
    select * from pawns
     where pawns.status <> "A"
       and pawns.trx_date < (today - 365)
  order by pawns.trx_date desc;

 drop table pawns;

 create table pawns
 (
     trx_num serial,
     cust_name char(30),
     status char(1),
     trx_date date,
 . . . ) in "S:\PAWNSHOP.DBS\PAWNS";

 load from "U:\UNL\ACTIVES.UNL" insert into pawns;         {500:600 nrows avg.}
 load from "U:\UNL\INACTIVE.UNL" insert into pawns;        {6500:7000 nrows avg.}
 load from "U:\UNL\HISTORIC.UNL" insert into dss:historic; {>500K nrows}

 create cluster index pa_cust_idx on pawns (cust_name);

 {this groups each customers pawns together, actives in
  oldest trx_date order first, then inactive pawns within the last year in most 
  recent trx_date order. inactives older than 1 year are loaded into historic 
  table in a separate database, on a separate hard disk. historic table 
  optimization is done on a weekly basis for DSS queries.}

 create unique index pa_trx_num_idx on pawns (trx_num);
 create index pa_trx_date_idx on pawns (trx_date);
 create index pa_status_idx on pawns (status);

 {grant statements...}

 update statistics;

There isn't a simple yes/no answer - it is a balancing act, as with so many performance issues.

There are two main costs associated with indexes which must be balanced against the benefits.

  1. Indexes must be maintained as rows are added, deleted, modified in the table. The cost is not huge, but neither is it negligible.
  2. Indexes occupy disk space.

There is also a small overhead when queries are optimized simply because there are more indexes to consider.

The primary benefit of good indexes is vastly improved performance on selecting data when the index can be used to good effect.

If your tables are not very volatile and are frequently searched with criteria where the indexes can help, then it probably makes sense to create the composite indexes, assuming that disk space is not an issue.

If your tables are very volatile, or if a specific index will seldom be used (but is beneficial on those few occasions when it is used), then you should perhaps weigh the almost one-off cost of a slower query against the cost of storing and maintaining the index for those few occasions when it can be used.

There is a quite good book on the subject of index design: Relational Database Index Design and the Optimizers by Lahdenmäki and Leach (it is also fairly expensive).


In the latest comment, Frank says:

[L]ooking for a couple of things. As its already been said, the simplest thing to do is to allow Informix to start returning rows once it has them. (Oracle does this by default.) The larger picture to what Frank is asking for is something similar to what Google has. Ok it really goes back to Alta Vista and the 90's when talking about search indexes on the web. The idea is that you can do a quick search, pick up the first n things while reporting a "number" of rows returned in the search. (As if the number reported by Google is accurate.)

This additional comment from Frank makes more sense in the context of the question for which this is a continuation.

Obviously, unless the SQL statement forces Informix to do a sort, it makes results available as soon as it has them; it always has. The FIRST_ROWS optimization hint indicates to IDS that if it has a choice of two query plans and one will let it produce the first rows more quickly than the other, then it should prefer the one that produces the first rows quickly, even if it is more expensive overall than the alternative. Even in the absence of the hint, IDS still tries to make the data available as quickly as possible - it just also tries to do it as efficiently as possible too.

When the query is prepared, you get an estimate of how many rows may be returned - you could use that as an indicator (a few, quite a lot, very many). Separately, you can quickly and independently discover the number of rows in the main table you are searching. Given this metadata, you can certainly use a technique with a scroll cursor to give you a backing store in the database that contains the primary key values of the rows you are interested in. At any time, you can load an array with the display data for a set of interesting rows for display to the user. On user request, you can arrange to display another page full of information. At some point in the proceedings, you will find that you've reached the end of the data in the scroll cursor. Clearly, if you do FETCH LAST, you force that to happen. If you just do a few more FETCH NEXTs, then you will eventually get a NOTFOUND condition.

All of this has been possible with Informix (IDS and its prior incarnations, OnLine, Turbo, SE, plus I4GL) since the late 80s. The FIRST_ROWS optimization is more recent; it is still just a hint to the optimizer, and usually makes little difference to what the optimizer does.