Comments on: Getting list of unique elements

By: Thomas

Thomas — Sat, 25 Jul 2009 20:59:26 +0000

re 1) yes, I am aware of that, but in your example you said the table was not updated very often. If “not very” is something like once a week, then the histogram can probably be used without problems. Autovacuum should take care of that.

re 2) right. I was thinking about an absolute number (not more than 100).

By: depesz

depesz — Sat, 25 Jul 2009 20:04:27 +0000

@Thomas:
great idea, with 2 small problems:
1. statistics can be (and usually are) out of date
2. it would require the number of values to be *really* low, in terms of absolute numbers. the method I showed in the post works well for a low number of values, but low in relation to the number of rows in the table, i.e. it will work quite well for 10000 values in a 1 million row table.

By: Thomas

Thomas — Sat, 25 Jul 2009 20:00:08 +0000

If you have only a few distinct values, wouldn’t it be most efficient to query pg_stats for the histogram of the column?

That requires no additional coding and should be quite accurate, assuming the statistics target (default_statistics_target, or the per-column setting) is high enough.

Thomas
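
A minimal sketch of the pg_stats lookup suggested above, assuming a hypothetical table `transactions` with a low-cardinality column `brand` (neither name comes from the post). For a column with only a few distinct values, the values normally appear in `most_common_vals` rather than in `histogram_bounds`, and both only reflect the state at the last ANALYZE:

```sql
-- Hypothetical table and column names; adjust to your schema.
-- Refresh the statistics first so the result is as current as possible.
ANALYZE transactions;

SELECT n_distinct,        -- estimated distinct count (a negative value is a fraction of the row count)
       most_common_vals,  -- the values themselves, as far as they fit within the statistics target
       histogram_bounds   -- only covers values that are not listed in most_common_vals
  FROM pg_stats
 WHERE schemaname = 'public'
   AND tablename  = 'transactions'
   AND attname    = 'brand';
```

Because the statistics are built from a sample, rare values can be missing from the result, so this is an estimate and not a guaranteed complete list.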

By: depesz

depesz — Thu, 16 Jul 2009 05:50:42 +0000

@Jeff Davis:
I’m not sure. Wouldn’t UPDATE obtain a lock on the row? So the concurrent addition would have to wait for the transaction to end.

By: Jeff Davis

Jeff Davis — Wed, 15 Jul 2009 00:34:58 +0000

The function remove_from_dictionary() appears unsafe. After “tmpint” is set, and before the DELETE is executed, the item may be added by some concurrent process. You may be able to make it safe by adding a “WHERE element_count = 0” to the DELETE.
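
The body of remove_from_dictionary() is not shown in this thread, so the following is only a sketch of the suggested guard, assuming a hypothetical table `dictionary` with columns `element` and `element_count`. The idea is that the DELETE re-checks the counter itself instead of trusting a value read earlier into a variable:

```sql
-- Hypothetical table "dictionary" (element, element_count).
-- The row is removed only if its counter is still zero at the moment
-- the DELETE runs, not when the counter was read earlier.
DELETE FROM dictionary
 WHERE element = 'some value'
   AND element_count = 0;
```

If a concurrent process has incremented the counter in the meantime, the DELETE matches no row and the entry stays in place.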

By: depesz

depesz — Mon, 13 Jul 2009 19:18:17 +0000

@alvherre:
I would *love* to see it in the optimizer, but I’m definitely not the right person to ask about putting it on the TODO list – my C skills are next to none.

By: alvherre

alvherre — Mon, 13 Jul 2009 19:16:22 +0000

I think what you implemented in plpgsql in your last solution is called “skip scan” or something like that. I think this is something that should be considered in the optimizer — TODO for 8.5?
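
The plpgsql function from the post is not reproduced in this thread. As a rough illustration of the same "jump to the next distinct value" idea, here is a sketch that uses a recursive CTE instead (available from PostgreSQL 8.4), assuming a hypothetical table `transactions` with a btree index on `brand`:

```sql
-- Hypothetical table "transactions" with a btree index on "brand".
WITH RECURSIVE walk AS (
    -- start with the smallest value in the column
    (SELECT brand FROM transactions ORDER BY brand LIMIT 1)
    UNION ALL
    -- then repeatedly jump to the next value greater than the previous one;
    -- the subquery returns NULL once there is no greater value
    SELECT (SELECT brand
              FROM transactions
             WHERE brand > walk.brand
             ORDER BY brand
             LIMIT 1)
      FROM walk
     WHERE walk.brand IS NOT NULL
)
SELECT brand FROM walk WHERE brand IS NOT NULL;
```

Each step is a single index lookup, so the cost grows with the number of distinct values and not with the number of rows. NULL values in the column are not returned by this sketch.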

By: Mac

Mac — Fri, 10 Jul 2009 21:22:44 +0000

Yeah, the table approach would work… but it’s not backwards compatible with legacy code, it requires a code rewrite… and it’s just cumbersome. And it makes the whole DB schema more complex for not much benefit.

By: Scott Bailey

Scott Bailey — Fri, 10 Jul 2009 15:48:34 +0000

Yeah I was thinking about this yesterday. I was reading that in 8.4 the optimizer can use bitmap indexes internally but you still can’t create a bitmap index on a table. The bitmap index is ideal for these high volume, low cardinality tables.

But I think we could actually mimic the behavior with a table that stored the index name and an array of values. As each row is indexed, it would look up the position of the value in the array and append it to the array if not found. (I’m guessing this is pretty close to how the enum type works internally.)

You could use that approach to make the above mentioned AutoEnum type (DynaEnum sounds like it might blow up on you) and for bitmap indexes.
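
A rough sketch of the dictionary part of this idea only (not of bitmap indexes themselves), assuming a hypothetical table `value_dictionary` and a made-up index name. Note that `array_position` requires PostgreSQL 9.5 or later, so it was not available in 8.4:

```sql
-- Hypothetical dictionary table: one row per "index", values kept in an array.
CREATE TABLE value_dictionary (
    index_name text   PRIMARY KEY,
    vals       text[] NOT NULL DEFAULT '{}'
);

INSERT INTO value_dictionary (index_name) VALUES ('transactions_brand');

-- Append the value only if it is not in the array yet.
UPDATE value_dictionary
   SET vals = array_append(vals, 'AMEX')
 WHERE index_name = 'transactions_brand'
   AND NOT ('AMEX' = ANY (vals));

-- Look up the 1-based position of a value in the array.
SELECT array_position(vals, 'AMEX') AS pos
  FROM value_dictionary
 WHERE index_name = 'transactions_brand';
```

The position could then stand in for the value, much like an enum label. Concurrent writers would serialize on the single dictionary row, which is one of the trade-offs of this layout.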

By: Mac

Mac — Fri, 10 Jul 2009 11:37:20 +0000

Thanks for this hint.

What I’m really looking for is a DynaEnum or AutoEnum datatype which supports the same features as an enum, plus:

– Behaves strictly like a varchar in all aspects…
– Automatically creates new values when needed – up to a limited number of new values (a few hundreds / thousands would probably be OK)
– Takes much less space
– Gives access to a quick function to count the distinct values

For instance in the case of Credit Card transactions processing I would have the following:

CREATE TABLE CreditCardTransaction (brand AUTOENUM, […], transactionOutcome AUTOENUM);

INSERT INTO CreditCardTransaction VALUES ('VISA', …, 'OK');
INSERT INTO CreditCardTransaction VALUES ('VISA', …, 'OK');
INSERT INTO CreditCardTransaction VALUES ('AMEX', …, 'DENIED');

-> this would automatically create 2 AutoEnums, one with ('VISA', 'AMEX') and the other one with ('OK', 'DENIED').

The storage need for those 2 enums would probably be really small.

I have many tables which would benefit from this, especially tables which are used to log events, where the number of possible values is small but not necessarily known beforehand (typically error messages).

By: gregj

gregj — Fri, 10 Jul 2009 10:42:10 +0000

About the first example: doing a count like that at the default transaction isolation level is dangerous.
In reality you would need all transactions on that table to be serializable, which is true for any concurrent math operations.