August 8th, 2010 by depesz | Tags: , , , | 14 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

Yesterday (August, 7th), Tom Lane committed:

Log Message:
-----------
Recognize functional dependency on primary keys. This allows a table's
other columns to be referenced without listing them in GROUP BY, so long as
the primary key column(s) are listed in GROUP BY.
 
Eventually we should also allow functional dependency on a UNIQUE constraint
when the columns are marked NOT NULL, but that has to wait until NOT NULL
constraints are represented in pg_constraint, because we need to have
pg_constraint OIDs for all the conditions needed to ensure functional
dependency.
 
Peter Eisentraut, reviewed by Alex Hunsaker and Tom Lane

One of the most common sources of problems when people move to PostgreSQL from MySQL, is that such query:

select field_a, field_b, count(*)
from table
group by field_a

is legal and working in MySQL, but not working in PostgreSQL.

Now, with this new patch, subset of such queries will work in PostgreSQL.

First of all – why subset? Simple – if field_a is PRIMARY KEY of given table, and we group by it – it is understandable that you can't have more than 1 value of field_b for it. But if it's not PRIMARY KEY – it can be that you have many different values for field_b for given field_a – and in such case, the query will still not work.

Let's see it on example:

CREATE TABLE people (
    id serial PRIMARY KEY,
    firstname TEXT,
    lastname TEXT
);
 
CREATE TABLE visits (
    id serial PRIMARY KEY,
    person_id INT4 NOT NULL references people (id),
    when_logged timestamptz
);
 
INSERT INTO people (firstname, lastname)
VALUES
    ('Bill', 'Hicks'),
    ('George', 'Carlin'),
    ('Louis', 'C.K.'),
    ('Robin', 'Williams'),
    ('Zach', 'Galifianakis');
 
INSERT INTO visits (person_id, when_logged)
    SELECT
        1 + floor(random() * 5),
        now() - '1 year'::INTERVAL * random()
    FROM
        generate_series(1,1000);

Now, let's assume we want list of people with count of visits. Previously one would need to:

SELECT
    p.id,
    count(*)
FROM
    people p
    JOIN visits v on p.id = v.person_id
GROUP BY p.id;

or:

SELECT
    p.id,
    p.firstname,
    p.lastname,
    count(*)
FROM
    people p
    JOIN visits v on p.id = v.person_id
GROUP BY p.id, p.firstname, p.lastname;

(or some trick with min(firstname)).

But now, we can:

SELECT
    p.id,
    p.firstname,
    p.lastname,
    count(*)
FROM
    people p
    JOIN visits v on p.id = v.person_id
GROUP BY p.id;

And it just works. Nice.

  1. 14 comments

  2. Aug 10, 2010

    It’s about time that PostgreSQL conformed to this part of the 1999 SQL standard. I reported it as a bug several years ago, but was told that it was an ‘optional’ part of the standard, not a ‘requirement’, so was deliberately left out.

  3. Aug 10, 2010

    It IS optional, and so are hundreds of other parts of the official SQL standard. I don’t know of any database that implements this part of the standard. But it’s great to see nevertheless. PostgreSQL keeps getting more and more sophisticated.

  4. Aug 11, 2010

    It is *NOT* optional. The 1992 standard states
    [quote]If T is a grouped table, then each in each that references a column of T shall reference a grouping column or be specified within a .[/quote]This requirement was removed from the 1999 standard, therefore it was no longer a “requirement”.

    If you don’t know of any other database that implements this part of the standard then take a look at MySQL. It specifically had to add the ONLY_FULL_GROUP_BY option to “turn off” the change in the SQL standard. This is explained fully in http://www.oreillynet.com/databases/blog/2007/05/debunking_group_by_myths.html

  5. Aug 11, 2010

    @Tony:
    Sorry, but providing MySQL as a proof of correctness isn’t going to fly well. Not until they will raise correct error in such case:

    create table x (i int4, z int4);

    insert into x (i, z) values (1,2);

    insert into x (i, z) values (1,3);

    insert into x (i, z) values (2,4);

    select i, z from x group by i;

  6. Aug 13, 2010

    You test is irrelevant as your table does not contain a primary key, therefore it does not have anything to do with functional dependencies.

    At least the MySQL developers made an attempt to implement the standard, and not ignore it completely.

  7. Aug 13, 2010

    @Tony:
    are you serious? Sorry, I just can’t believe someone might say something like this seriously.

    Horribly broken group by semantics in MySQL (which basically lead to choosing *random* elements) you say “they made an attempt to implement the standard?

    I will never say PostgreSQL is perfect. But at the very least I can depends on the results its providing, that it will not choose random values, or execute half of *single command*.

  8. Aug 14, 2010

    Yes, I *AM* serious. You claim that MySQL’s implementation of GROUP BY fails on a statement that will never be used in the real world, which shows just how desperate you are to throw dirt at MySQL.

    Just face it, PostgreSQL’s implememtation of GROUP BY has been flawed since the 1999 standard, but you’e too chicken to admit it.

    BTW, If MySQL is so-o-o bad, then why does it have far more users than your favourite product???

  9. # Thom Brown
    Aug 14, 2010

    I think it is fair to say that MySQL’s implementation of GROUP BY does cause unpredictable results.

    For example:

    mysql> create table monkeys (
    id mediumint not null auto_increment,
    country varchar(64),
    name varchar(64),
    PRIMARY KEY (id)
    );

    mysql> insert into monkeys (country, name) values (‘UK’, ‘Ooky’);
    Query OK, 1 row affected (0.06 sec)

    mysql> insert into monkeys (country, name) values (‘UK’, ‘Booja’);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into monkeys (country, name) values (‘Brazil’, ‘Huga’);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into monkeys (country, name) values (‘Brazil’, ‘Kola’);
    Query OK, 1 row affected (0.00 sec)

    mysql> insert into monkeys (country, name) values (‘Brazil’, ‘Dave’);
    Query OK, 1 row affected (0.00 sec)

    mysql> select country, name from monkeys group by country;
    +———+——+
    | country | name |
    +———+——+
    | Brazil | Huga |
    | UK | Ooky |
    +———+——+
    2 rows in set (0.01 sec)

    mysql> — And to demonstrate the order by only affects the output of retrieved results and not the actual selection…
    mysql> select country, name from monkeys group by country order by name desc;
    +———+——+
    | country | name |
    +———+——+
    | UK | Ooky |
    | Brazil | Huga |
    +———+——+
    2 rows in set (0.00 sec)

    mysql> — And to confuse results further…
    mysql> select country, name, count(*) from monkeys group by country;
    +———+——+———-+
    | country | name | count(*) |
    +———+——+———-+
    | Brazil | Huga | 3 |
    | UK | Ooky | 2 |
    +———+——+———-+
    2 rows in set (0.00 sec)

    So if a GROUP BY is used on a non-unique column, MySQL will only pick whichever row for each of those values it finds first. I don’t understand how this is useful.

    But if you do have a use-case for this functionality, I’m interested to know what it is. (please don’t troll, just explain your reasoning)

  10. Aug 15, 2010

    The article at http://www.oreillynet.com/databases/blog/2007/05/debunking_group_by_myths.html explains why it is possible to obtain some unexpected results with GROUP BY in MySQL. Can these results be duplicated in PostgreSQL?

    You are still avoiding the important question, which is where it states in the standard that functional dependency is optional in GROUP BY.

  11. Aug 15, 2010

    @Tony:
    I find further discussion impossible – it seems that we’ll have to agree to disagree.

    You have your idea what is correct, and I’m happy for you. I have another one, and I’m quite happy myself. If you’re happy using MySQL – that’s great. Please use it all the time you want. Have a nice life.

  12. # Thom Brown
    Aug 15, 2010

    “explains why it is possible to obtain some unexpected results with GROUP BY in MySQL. Can these results be duplicated in PostgreSQL?”

    As it happens, yes:

    test=# select distinct on (country) country, name from monkeys;
    country | name
    ———+——
    Brazil | Huga
    UK | Ooky
    (2 rows)

    With the added ability to provide a predictable result:

    test=# select distinct on (country) country, name from monkeys order by country, name;
    country | name
    ———+——-
    Brazil | Dave
    UK | Booja
    (2 rows)

    The article actually says: “On the other hand, we cannot expect the database management system to know what we were thinking when we included the birth in the SELECT list in the first place. So, despite that the error message may seem a bit puzzling, it is still preferable over silently returning non-sense data.”

    He’s basically saying that such cases (the same as I gave examples of yesterday) aren’t desirable. He then goes on to talk about MySQL’s solution to this being “ONLY_FULL_GROUP_BY” to correctly raise an error in such cases. But cautions: “Alas, MySQL’s ONLY_FULL_GROUP_BY is not as clever as it should be, and there are particular cases where ONLY_FULL_GROUP_BY is too restrictive in enforcing only full GROUP BY clauses.” So it was a flawed implementation. I suspect (and hope) this has been fixed by now.

    If the user understands their data, they don’t even need this feature. It’s really a shortcut. But the next step is for it to work for UNIQUE columns (with a NOT NULL constraint), and won’t require any special system setting to get it to behave.

    So MySQL does do the job correctly, but not by default. I’m not that acquainted with the MySQL community, but generally, do most users know of the ONLY_FULL_GROUP_BY feature? If so, it won’t be a common problem anyway.

    But anyone adequately familiar with how GROUP BY works will be able to avoid such problems in any RDBMS.

  13. # Frank
    Sep 9, 2010

    Too bad Tony Marston thinks he owns the truth but at the same time has no idea how to do a proper test. MySQL fails bigtime when it comes to GROUP BY, at least 80% of all queries including GROUP BY executed on MySQL have bugs. Use ONLY_FULL_GROUP_BY to find these queries and than start playing with the data to see what happens when the data is different than expected.

    Tony, start testing before barking.

  14. Sep 9, 2010

    I think this debate can be summed up as follows:

    - Some versions of the SQL standard are strict on this issue, some of them try to be smart
    - MySQL is too lax: it leaves it up to the SQL author to be smart, and allows less smart developers to make mistakes with unpredictable consequences; its default behaviour does not match any version of the standard
    - PostgreSQL (and just about every other DBMS) is too strict: it doesn’t try to be smart, so the SQL author has no chance to do certain things which are logically sound
    - MySQL in ONLY_FULL_GROUP_BY mode is, presumably, strict in the same way
    - PostgreSQL 9.1 will be a lot smarter, allowing the SQL author to do more while still preventing them from creating unpredictable queries.

    The MySQL approach would be fine if all MySQL programmers were trained professionals who understood the issues and therefore didn’t need the “hand-holding” provided by other DBMSs. I will leave it to your judgement whether this is the case.

    The other open question is whether users of PostgreSQL 9.1 will understand *why* they no longer need to specify certain columns in their Group By clauses. Functional dependency may not be something often considered in day-to-day use of a database.

  15. Nov 18, 2010

    Hi All!

    seems I’m late to the party overhere :)

    I think most things that could be said in this thread have been said. I just want to point out 2 things:

    1) @ROWAN COLLINS: excellent summary!
    2) @THOM BROWN: I understand the sentiment, but I don’t think of it as a shortcut – to me it seems only logical.

    Personally, I think the standard practice of learning how to write a GROUP BY – “copy all non aggregate columns from the SELECT-list to the GROUP BY claise” – or more clever – “put aggregate functions like MAX or MIN around your non-aggregate columns in the SELECT list” is actually the shortcut: it’s a form of copy-paste programming.

    I can certainly live without a more advanced 1999/2003 kind of GROUP BY, but to me it has the merit of reducing (and deduplicating) code and typically this means less opportunity for programming errors.

    Anyway – it’s thrilling to see work is being done to implement this in PostgreSQL: Kudos! I can’t name another RDBMS that implements this (but maybe this is my ignorance, for example I don’t have any experience with DB2 or Teradata, maybe they have it? Anyone?)

    kind regards,

    Roland Bouman

Leave a comment