Waiting for 9.6 – Remove GROUP BY columns that are functionally dependent on other columns.

On 11th of February, Tom Lane committed patch:

Remove GROUP BY columns that are functionally dependent on other columns.
 
If a GROUP BY clause includes all columns of a non-deferred primary key,
as well as other columns of the same relation, those other columns are
redundant and can be dropped from the grouping; the pkey is enough to
ensure that each row of the table corresponds to a separate group.
Getting rid of the excess columns will reduce the cost of the sorting or
hashing needed to implement GROUP BY, and can indeed remove the need for
a sort step altogether.
 
This seems worth testing for since many query authors are not aware of
the GROUP-BY-primary-key exception to the rule about queries not being
allowed to reference non-grouped-by columns in their targetlists or
HAVING clauses.  Thus, redundant GROUP BY items are not uncommon.  Also,
we can make the test pretty cheap in most queries where it won't help
by not looking up a rel's primary key until we've found that at least
two of its columns are in GROUP BY.
 
David Rowley, reviewed by Julien Rouhaud

Continue reading Waiting for 9.6 – Remove GROUP BY columns that are functionally dependent on other columns.

Waiting for 9.5 – Support GROUPING SETS, CUBE and ROLLUP.

On 16th of May, Andres Freund committed patch:

Support GROUPING SETS, CUBE and ROLLUP.
 
This SQL standard functionality allows to aggregate data by different
GROUP BY clauses at once. Each grouping set returns rows with columns
grouped by in other sets set to NULL.
 
This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.
 
The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.
 
The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.
 
Instead of, as in an earlier versions of the patch, representing the
chain of sort and aggregation steps as full blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage.  The optimizer still builds Sort/Agg node to describe each
phase, but they're not part of the plan tree, but instead additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting.  The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains) and easily allows to
avoid the sorting step if the underlying data is sorted by other means.
 
A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, neither has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.
 
Needs a catversion bump because stored rules may change.
 
Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
    Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com

Continue reading Waiting for 9.5 – Support GROUPING SETS, CUBE and ROLLUP.

Waiting for 9.4 – Implement the FILTER clause for aggregate function calls.

On 17th of July, Noah Misch committed patch:

Implement the FILTER clause for aggregate function calls.
 
This is SQL-standard with a few extensions, namely support for
subqueries and outer references in clause expressions.
 
catversion bump due to change in Aggref and WindowFunc.
 
David Fetter, reviewed by Dean Rasheed.

Continue reading Waiting for 9.4 – Implement the FILTER clause for aggregate function calls.

Grouping data into time ranges

Today some guy on IRC asked question, which I didn't fully understand, but which could (probably) be summarized: how to group data into 5 minute intervals, based on some timestamp column.

Well, it seems trivial (as long as you know how to do it), but since he clearly didn't know how to do it (or I misunderstood his problem), here's the explanation:

Continue reading Grouping data into time ranges

How to group messages into chats?

My jabber server had the feature, that it logs all messages that got sent through it.

This is pretty cool, and useful. And now, i got asked to use it to create list of conversations.

What exactly is this? Whenever I send (or receive) something there is record in database with information about which local user, communication type (send/recv), correspondent, when it happened, and what is the body of message.

And based on this, we want to list messages into chats. How?

Continue reading How to group messages into chats?

Waiting for 9.1 – Recognize functional dependency on primary keys.

Yesterday (August, 7th), Tom Lane committed:

Log Message:
-----------
Recognize functional dependency on primary keys.  This allows a table's
other columns to be referenced without listing them in GROUP BY, so long as
the primary key column(s) are listed in GROUP BY.
 
Eventually we should also allow functional dependency on a UNIQUE constraint
when the columns are marked NOT NULL, but that has to wait until NOT NULL
constraints are represented in pg_constraint, because we need to have
pg_constraint OIDs for all the conditions needed to ensure functional
dependency.
 
Peter Eisentraut, reviewed by Alex Hunsaker and Tom Lane

Continue reading Waiting for 9.1 – Recognize functional dependency on primary keys.

Getting unique elements

Let's assume you have some simple database with “articles" – each article can be in many “categories". And now you want to get list of all articles in given set of categories.

Standard approach:

SELECT
    a.*
FROM
    articles AS a
    JOIN articles_in_categories AS aic ON a.id = aic.article_id
WHERE
    aic.category_id IN (14,62,70,53,138)

Will return duplicated article data if given article is in more than one from listed categories. How to remove redundant rows?

Continue reading Getting unique elements

Tips N’ Tricks – count of all and just some

Let's assume you have very simple table with users:

# \d users
                           Table "public.users"
  Column   |  Type   |                     Modifiers
-----------+---------+----------------------------------------------------
 id        | integer | not null default nextval('users_id_seq'::regclass)
 username  | text    | not null
 is_active | boolean | not null default true
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_username_key" UNIQUE, btree (username)

And you'd like to count all users, and know how many of them are active …

Continue reading Tips N' Tricks – count of all and just some