Joe Celko's SQL Programming Style

Joe Celko

Mentioned 12

Are you an SQL programmer that, like many, came to SQL after learning and writing procedural or object-oriented code? Or have switched jobs to where a different brand of SQL is being used, or maybe even been told to learn SQL yourself? If even one answer is yes, then you need this book. A "Manual of Style" for the SQL programmer, this book is a collection of heuristics and rules, tips, and tricks that will help you improve SQL programming style and proficiency, and for formatting and writing portable, readable, maintainable SQL code. Based on many years of experience consulting in SQL shops, and gathering questions and resolving his students' SQL style issues, Joe Celko can help you become an even better SQL programmer. + Help you write Standard SQL without an accent or a dialect that is used in another programming language or a specific flavor of SQL, code that can be maintained and used by other people. + Enable you to give your group a coding standard for internal use, to enable programmers to use a consistent style. + Give you the mental tools to approach a new problem with SQL as your tool, rather than another programming language - one that someone else might not know!

More on Amazon.com

Mentioned in questions and answers.

Convention has it that table names should be the singular of the entity that they store attributes of.

I dislike any T-SQL that requires square brackets around names, but I have renamed a Users table to the singular, forever sentencing those using the table to sometimes have to use brackets.

My gut feel is that it is more correct to stay with the singular, but my gut feel is also that brackets indicate undesirables like column names with spaces in them etc.

Should I stay, or should I go?

In his book "SQL Programming Style," Joe Celko suggests that a collection (e.g. a table) should be named in the plural, while a scalar data element (e.g. a column) should be named in the singular.

He cites ISO-11179-4 as a standard for metadata naming, which supports this guideline.

Let me argue why this makes sense.

  1. Table is a set. Every row in the table is an object (columns = fields). In programming, you name the collections using plural names (Arrays and collections should have plural names to indicate that they are collections of objects rather than single objects), e.g. args in Java program.
  2. Java arguments are args in plural exactly because they can hold 0, 1, and millions of arguments.
  3. You iterate them for every student in students, not for every s in student. You select a subset of students from a set of students. You do not select a subset of students from a student. That is conceptual reason for the proper naming.

Debunking the fallacies. The most pupular answer says

Reason 1 (Concept). You can think of bag containing apples like "AppleBag", it doesn't matter if contains 0, 1 or a million apples, it is always the same bag. Tables are just that, containers, the table name must describe what it contains, not how much data it contains. Additionally, the plural concept is more about a spoken language one (actually to determine whether there is one or more), a table is not intended to be read by a human.

Lets convert AppleBag => BagOfApples. Now, the same "conceptual" argument says somehting opposite to itself, we see Apples and therefore answer must be plural!

This is because there is nothing conceptual in this ****. It fails even to reason about English language, the simple logic. In English, a BagOfApples != AnApple!

The argument "Bag contains 0,1, .. millions" of apples just proves that the collection must be called "Apples", likewise Java arguments or stackoverflow.com/posts. Somehow, the enemies of civilization manage to conclude that singular must be used here. The posts is just a folder. Why should it be plural?

Let's teach mr. Artuwood some logic: folder is a folder and must describe what it contains, not how much data it contains.

In fact, if you start thinking you realize that apple contains apple describes nonsense. The true concept is that Bag contains Apples. Either we name our container Bag(of specific kind) or think what it contains, the Apples (those dirty bastards tried to reduce that to the numbering problem. But we are after apples, not their number.). Yes, if we abstract the wrapper away, we realize that we are after Apples, what is contained in it. We fetch and iterate either a Bag or Apples, not apple.

Reason 2. (Convenience). it is easier come out with singular names, than with plural ones. Objects can have irregular plurals or not plural at all, but will always have a singular one (with few exceptions like News).

Customer Order User Status News

Which convenience is he talking about? Let's just point out that that plural come out simpler than singulars.

Reason 3. (Aesthetic and Order).

Aestetics is on our side. It is simple, as he did. We just claim that and that is all what is needed to turn the table.

Reason 4 (Simplicity). Put all together, Table Names, Primary Keys, Relationships, Entity Classes... is better to be aware of only one name (singular) instead of two (singular class, plural table, singular field, singular-plural master-detail...)

Am I alone seeing that for simplicity, everything must become singular? Yes, forget the plural using English language. It will make things simpler.

In fact, in many languages, sex is attached to all things, not only boys and girls. This is stupid and makes language unnecessarily complex. But, I have noticed that it simplifies the referencing. When I say "fox, dog and wolf" in Russian and then say "he", it unambiguisly means that I am referencing the "wolf" because "fox" and "dog" are "she". Now you say that distinction that helps to reduce ambiguity creates it. How so?

Probably, the logic is "let's keep arbitrary crap in our language and enforce the disorder by removing the rules which make sense". Yes, the proposal to use singular where logic demands plural while keeping to attach inappropriate attributes like sex to inappropriate items is a pursuit of nonsense in our crazy world.

As of

SELECT activity.name reads better than SELECT activities.name

You probably need to SELECT student.name FROM students as student

Ok, here probably is the argument: how is table's alias is singular if table is plural? Ok, this makes some sense. But saying that column (object's attribute) is singular, therefore, the object collection must be singular as well, is nonsense.

Reason 5. (Globalization). The world is getting smaller, you may have a team of different nationalities, not everybody has English as native language. Would be easier for a non-native English language programmer to think of "Repository" than of "Repositories", or avoid them type "Statuses" instead of "Status". Having singular names can lead to less errors caused by typos, save time by avoid spending extra seconds to think "is it Child or Children?", hence improving productivity.

This bigot will break his mind, no doubt. But why other non-native speakers, like me, should buy making nonsense "for simplicity and internationalization"? Why not improve our English and keep the logic unraped? It is much easier to understand the language in this case, after all.

Otherwise, we should also abandon the verbs, the articles, the adjectives and speak only in nouns, for simplicity. It would be simpler (and more productive) if we limit our speech to moooo and nothing else, for the animals to be able to speak English as well. This way we achieve much wider internationalization.

Reason 6. (Why not?). It can even save you writing time, save you disk space, and even make your computer keyboard lasts more!

This argument supports why we should abandon the plurals in our human communaction as well. No, instead of moooo, mo moo, moo moo, we will say I, III, III, II. Much shorter than with singular proposal.


The bottom line is that all the "arguments" against plural is pure bullshit. If you think about it, it says against singular. When all you have to support your position is bullshit, it means that you need bullshit to support your view, you are on a wrong side.

The only meaningful argument for the singular is that table is a very special set of objects. Table is a Class! Yes, it contains all objects of specific type and we use singular for class names. Class Dog contains all dogs. What is dog1? It is a Dog. On the other hand, users deal with tables as collections. They add/remove items into collection and we use plural for collections.

Often I will end up with very complex SQL statements and I wondered if there was style guideline out there that dictates a common way of laying various aspects of a query out.

I am look for something descriptive along the lines of Python's PEP8 or Zend Frameworks guidelines and not code by example.

Most of my queries are written for MySQL.

So do you know of a universal style guide? Perhaps you have written one yourself. Please share your guidelines.

Two guides I know of are Joe Celko's SQL Programming Style and the venerable Code Complete.

There's also the SQL-92 standard. It doesn't contain a style section, but you might consider it's style to be implicitly canonical.

Is there such a thing as Design Patterns in SQL ???

Yes. :)

SQL Design Patterns: Expert Guide to SQL Programming

(Not even sure if this is a recommended book.. just illustrating that "proper design" exists).

SQL seems to be the most neglected language when it comes to formatting nicely and readably... And as SQL statements can be incredibly detailed and complex, it makes it extremely hard to work with. But I find that when I try to format my SQL code in the best way possible I'm sometimes unsure of how to do it. I know the standards for Java, C#, Python, etc.... but when it comes to SQL I haven't seen too many guidelines or accepted practices. What are tips/rules for formatting SQL so that it's clear, legible, and logical? Can you give example code to illustrate? What have you found to be the most standard, accepted way of formatting SQL?

You could try checking out Joe Celko's book SQL Programming Style. I'm sure that there are a lot of people who disagree with his style, but it's a good start.

Some of my own "rules"

  • SQL keywords are always all upper-case
  • Table names are "proper" case, while columns and variables are all lower-case
  • Each "major" clause in a statement is at the start of a line
  • JOIN and WHERE criteria appear beneath and are indented and aligned
  • Nested items are indented further
  • I use aliases for all tables and views

For example:

SELECT
    column_1,
    column_2,
    CASE
        WHEN column_5 = 'Blah' THEN 1
        WHEN column_6 = 'Blah' THEN 2
        ELSE 3
    END AS column_alias
FROM
    My_Table MT
INNER JOIN My_Other_Table MOT ON
    MOT.column_1 = MT.column_1
WHERE
    MT.column_2 = 'Some Value' AND
    (
        MT.column_3 = 'Some other value' OR
        MT.column_4 = 'Some other value'
    )

We put common prefixes on related tables to assure they display next to each other in our DB management software (Toad, Enterprise Manager, etc).

So for example, all user tables start with the word User:

  • User
  • UserEvent
  • UserPurchase

Ideally, in honor of the three great virtues of a programmer these tables should be named User, Event, Purchase respectively to save some typing, agreed?

Is this naming convention the best (only?) practice for grouping related tables together naturally?

I wouldn't use a naming convention for purposes of alphabetizing table names. It's nice when it works out that way, but this shouldn't be by design.

Read Joe Celko's book "SQL Programming Style." His first chapter in that book is about naming conventions, guided by the ISO 11179 standard for metadata naming. One of his recommendations is to avoid unnecessary prefixes in your naming convention.

I work at a company that has some very non-standardized SQL conventions (They were written by Delphi Developers years ago). Where is the best place that I can find SQL industry standard convention definitions that are most commonly used?

In his book "SQL Programming Style," Joe Celko suggests a number of conventions, for example that a collection (e.g. a table) should be named in the plural, while a scalar data element (e.g. a column) should be named in the singular.

He cites ISO-11179-4 as a standard for metadata naming, which supports this guideline.

i have many to many database and i'm there is a query that i just got stuck with and cant do it.

i have 4 tables Artists, Tracks, Albums, Clips

now i'm on the artist page so i need to get them by the artist page, i already got all of them, but not the way i want them.

because some tracks, albums, clips belong to other artists as well (duet) and i need to display their name.

but the problem is that i'm selecting using the artist id so my GROUPC_CONCAT function wont work here is the query that gets the artist albums.

   SELECT  al.album_name, GROUP_CONCAT(a.artist_name SEPARATOR ', ') AS 'artist_name'

   FROM 
     Albums al
   LEFT JOIN 
     ArtistAlbums art ON art.album_id = al.album_id
   LEFT JOIN 
     Artists a on a.artist_id = art.artist_id
   WHERE 
     a.artist_id = 10
   GROUP BY 
     al.album_id

one of the albums have two artists attached to it, but it does not get the other artist name.

when i select by the album_id i get the two artists.

please note that i'm new to mysql and i did not find any answers on this particular problem almost no resources on many-to-many querying.

how can i tackle this problem.?

any resources or books on many-to-many that show how to deal with the database on the application layer will be much appreciated, thanks in advance.

Think of table aliases as really being row aliases. That is, for purposes of expressions in the WHERE clause and the select-list, the alias refers to a single row at a time.

Since you've created a condition such that a.artist_id = 10, then only rows for that artist match the condition. What you really want is to match all artists on an album given that one artist is artist_id = 10.

For that, you need another join, so that the row where artist_id = 10 is matched to all the rows for that respective album.

SELECT  al.album_name, GROUP_CONCAT(a2.artist_name SEPARATOR ', ') AS `artist_name`
FROM 
  Albums al
INNER JOIN 
  ArtistAlbums art ON art.album_id = al.album_id
INNER JOIN 
  Artists a on a.artist_id = art.artist_id
INNER JOIN 
  ArtistAlbums art2 ON art2.album_id = al.album_id
INNER JOIN 
  Artists a2 on a2.artist_id = art2.artist_id
WHERE 
  a.artist_id = 10
GROUP BY 
  al.album_id

P.S.: I've also replaced your use of LEFT JOIN with INNER JOIN. There's no reason you needed LEFT JOIN, since you're explicitly looking for albums that have a matching artist, not all albums whether or not it has artist 10. Review the meaning of different types of join.


Re your followup question:

I don't know of a book specifically about many-to-many queries. I'd recommend books like:

I have something that looks like this:

if ( $_SESSION['username'] ) 
        $owner = $_SESSION['username'];

    $q = "SELECT * FROM projects where owners='$owner' ORDER BY created DESC";

The problem is that the 'owners' column may have multiple entries separated by commas. I need a way to cycle through all the entries and select any table that has $owner in the owners columns. Whats the best way to go about this?

The proper design for a relational database is to store only one value in a given "cell" (i.e. in a given column on a given row -- but please do not use spreadsheet terminology with relational databases, you'll make Joe Celko cry. :-).

Here's an example:

CREATE TABLE Projects (
 project_id INT PRIMARY KEY,
 ...
);

CREATE TABLE Users (
 user_id INT PRIMARY KEY,
 ...
);

CREATE TABLE Project_Owners (
  project_id INT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY (project_id, user_id),
  FOREIGN KEY (project_id) REFERENCES Projects(project_id),
  FOREIGN KEY (user_id) REFERENCES Users(user_id)
);

You insert one row to Project_Owners table for each owner of a given project. So you can easily have a project with multiple owners, and a user who works on multiple projects.

We say that Projects has a many-to-many relationship to Users.

So when you want to get projects associated with a given owner, you'd need to look up the owner in the users table, and then join that to projects via the many-to-many table:

<?php

$owner = mysql_real_escape_string($_SESSION["username"]);

$sql = "SELECT p.* FROM Projects
INNER JOIN Project_Owners o USING (project_id)
INNER JOIN Users u USING (user_id)
WHERE u.name = '$owner'";

Or if you want to simplify it, you can use integers in your session data. Then you can skip a join:

<?php

$owner = (int) $_SESSION["username"];

$sql = "SELECT p.* FROM Projects
INNER JOIN Project_Owners o USING (project_id)
WHERE o.user_id = $owner";

The concept of storing just one value in a cell is part of the basic definition of First Normal Form. Any book on SQL should cover this, even the Manga Guide to Databases.

Should variable names in PL/SQL use underscores or capital letters to separate words?

this_is_my_variable or thisIsMyVariable?

I can't find a clear answer to this and I've seen both.

I would strongly recommend investing in Joe Celko's SQL Programming Style.

Basically echoing @user1598390, you should avoid Camel Case and jaming everything together.

Camel Case has numerous issues, primarily:

  1. There has to be an agreed format : upcCode, UpcCode, UPCCode. Causes confusion and you end up with multiple versions of the same variable.
  2. Studies have found that it slows reading and writing.

No Spaces:

  1. This is a hold over from punchcards where you reduced space due to the limited number of columns.
  2. We don't use punchcards.

I don't know how to name this column because my english is bad. Now, I name it "sorting", is it correct?

| id | name | sorting |
-----------------------
| 1  | foo  |    0    | 
| 2  | bar  |    1    |

I would take a page from Joe Celko's excellent SQL Programming Style. Section 1.2.4 takes you through developing standardized postfixes. In your case, the following excerpt applies:

“_seq” = sequence, ordinal numbering. This is not the same thing as a tag number, because it cannot have gaps.

So something like sort_seq or sorting_seq could apply here.

I have a table with entries for Items as being 'lost' and 'found'. Each row has a date for the event. Im hoping to build a query with matching pairs of 'itemid', 'lost date', 'found date' by joining the table to itself.

This works to a point: unfortunately if there are multiple lost and found pairs for a given item each 'lost date' will be joined with all the 'found dates' that follow it.

Still with me?

The query goes something like:

select c0.ItemId, c0.ChangeDate, c1.ChangeDate from Changes c0
join Changes c1 on 
   c0.ItemId = c1.ItemId and c1.ChangeDate >= c0.ChangeDate
where c0.ChangeType = 9 (lost) and c1.ChangeType = 10 (found);

What Im hoping to achieve is some form of a given 'lost date' paired with only the next 'found date' in sequence (or NULL if no 'found date' exists). Im (pretty) sure this is possible but Im not seeing the path.

I was wondering about putting a sub-select in the first join and using a LIMIT 1 to get only one record but I don't see how to join this to the appropriate row in the main part of the select. MySQL tells me it doesn't exist. Fair enough.

Generally when dealing with pairs of dates (e.g. start/end for scheduling) the advice is don't put them on separate rows. Put them in two columns of the same row. See Joe Celko's SQL Programming Style.

But that said, you can solve it with your current schema by searching doing another self-join to search for a ChangeDate between the two. If none is found (that is, if c2.* is null because of the outer join), then c0 and c1 are "adjacent."

select c0.ItemId, c0.ChangeDate, c1.ChangeDate 
from Changes c0
inner join Changes c1 on 
   c0.ItemId = c1.ItemId and c1.ChangeDate > c0.ChangeDate
left outer join Changes c2 on
   c0.ItemId = c2.ItemId and c2.ChangeDate > c0.ChangeDate 
                         and c2.ChangeDate < c1.ChangeDate 
                         and c2.ChangeType IN (9,10) -- edit
where c0.ChangeType = 9 (lost) and c1.ChangeType = 10 (found)
   and c2.ItemId IS NULL;

In the above example, I've assumed that ChangeDate is unique, and I changed the >= to >. If ChangeDate is not unique, you'll have to come up with some other expression to test for c2 "between" c0 and c1.

Say you have a table named Product and it has an auto-number/identity column. Do you name it simply Id or do you name it ProductId or Product_Id? Please explain why.

There is no clear answer because it depends on your other tools, your existing naming conventions, and even personal preference. For example, you might use an ORM that expects "ID" on every table. Personally, I would go with "ProductID" because it makes it easier to read queries (for me) and I don't like underscores.

One argument that is completely wrong is that "ID" saves typing time: you spend a lot more time reading code than writing it, so you should always prioritize readable code over saving keystrokes.

If you want some 'real' references, check out Joe Celko's book on SQL programming style and/or ISO 11179.