<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Getting list of unique elements</title>
	<atom:link href="http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/</link>
	<description></description>
	<lastBuildDate>Thu, 29 Jul 2010 21:40:44 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
	<item>
		<title>By: Thomas</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27931</link>
		<dc:creator>Thomas</dc:creator>
		<pubDate>Sat, 25 Jul 2009 20:59:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27931</guid>
		<description>re 1) yes, I am aware of that, but in your example you said the table was not updated very often. If &quot;not very&quot; is something like once a week, then the histogram can probably be used without problems. Autovacuum should take care of that.

re 2) right. I was thinking about an absolute number (not more than 100)</description>
		<content:encoded><![CDATA[<p>re 1) yes, I am aware of that, but in your example you said the table was not updated very often. If &#8220;not very&#8221; is something like once a week, then the histogram can probably be used without problems. Autovacuum should take care of that.</p>
<p>re 2) right. I was thinking about an absolute number (not more than 100)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: depesz</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27929</link>
		<dc:creator>depesz</dc:creator>
		<pubDate>Sat, 25 Jul 2009 20:04:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27929</guid>
		<description>@Thomas:
great idea, with 2 small problems:
1. statistics can be (and usually are) out of date
2. it would require the number of values to be *really* low, in terms of absolute numbers. The method I showed in the post works well for a low number of values relative to the number of rows in the table, i.e. it will work quite well for 10000 values in a 1 million row table.</description>
		<content:encoded><![CDATA[<p>@Thomas:<br />
great idea, with 2 small problems:<br />
1. statistics can be (and usually are) out of date<br />
2. it would require the number of values to be *really* low, in terms of absolute numbers. The method I showed in the post works well for a low number of values relative to the number of rows in the table, i.e. it will work quite well for 10000 values in a 1 million row table.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thomas</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27928</link>
		<dc:creator>Thomas</dc:creator>
		<pubDate>Sat, 25 Jul 2009 20:00:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27928</guid>
		<description>If you have only a few distinct values, wouldn&#039;t it be most efficient to query pg_stats for the histogram of the column? 

That requires no additional coding and should be quite accurate assuming statistics_target is high enough.

Thomas</description>
		<content:encoded><![CDATA[<p>If you have only a few distinct values, wouldn&#8217;t it be most efficient to query pg_stats for the histogram of the column? </p>
<p>That requires no additional coding and should be quite accurate assuming statistics_target is high enough.</p>
<p>Thomas</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: depesz</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27901</link>
		<dc:creator>depesz</dc:creator>
		<pubDate>Thu, 16 Jul 2009 05:50:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27901</guid>
		<description>@Jeff Davis:
I&#039;m not sure. Wouldn&#039;t UPDATE obtain a lock on the row? Then the concurrent addition would have to wait for the end of the transaction.</description>
		<content:encoded><![CDATA[<p>@Jeff Davis:<br />
I&#8217;m not sure. Wouldn&#8217;t UPDATE obtain a lock on the row? Then the concurrent addition would have to wait for the end of the transaction.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff Davis</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27896</link>
		<dc:creator>Jeff Davis</dc:creator>
		<pubDate>Wed, 15 Jul 2009 00:34:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27896</guid>
		<description>The function remove_from_dictionary() appears unsafe. After &quot;tmpint&quot; is set, and before the DELETE is executed, the item may be added by some concurrent process. You may be able to make it safe by adding a &quot;WHERE element_count = 0&quot; to the DELETE.</description>
		<content:encoded><![CDATA[<p>The function remove_from_dictionary() appears unsafe. After &#8220;tmpint&#8221; is set, and before the DELETE is executed, the item may be added by some concurrent process. You may be able to make it safe by adding a &#8220;WHERE element_count = 0&#8221; to the DELETE.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: depesz</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27891</link>
		<dc:creator>depesz</dc:creator>
		<pubDate>Mon, 13 Jul 2009 19:18:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27891</guid>
		<description>@alvherre:
I would *love* to see it in optimizer, but I&#039;m definitely not the right person to ask about being in TODO - my C skills are next to none.</description>
		<content:encoded><![CDATA[<p>@alvherre:<br />
I would *love* to see it in optimizer, but I&#8217;m definitely not the right person to ask about being in TODO &#8211; my C skills are next to none.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: alvherre</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27890</link>
		<dc:creator>alvherre</dc:creator>
		<pubDate>Mon, 13 Jul 2009 19:16:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27890</guid>
		<description>I think what you implemented in plpgsql in your last solution is called “skip scan” or something like that.  I think this is something that should be considered in the optimizer -- TODO for 8.5?</description>
		<content:encoded><![CDATA[<p>I think what you implemented in plpgsql in your last solution is called “skip scan” or something like that.  I think this is something that should be considered in the optimizer &#8212; TODO for 8.5?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mac</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27879</link>
		<dc:creator>Mac</dc:creator>
		<pubDate>Fri, 10 Jul 2009 21:22:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27879</guid>
		<description>Yeah, the approach of a table would work... but it&#039;s not backwards compatible with legacy code, it requires a code rewrite... and it&#039;s just cumbersome. And it makes the whole DB schema more complex for not much benefit.</description>
		<content:encoded><![CDATA[<p>Yeah, the approach of a table would work&#8230; but it&#8217;s not backwards compatible with legacy code, it requires a code rewrite&#8230; and it&#8217;s just cumbersome. And it makes the whole DB schema more complex for not much benefit.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Scott Bailey</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27876</link>
		<dc:creator>Scott Bailey</dc:creator>
		<pubDate>Fri, 10 Jul 2009 15:48:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27876</guid>
		<description>Yeah I was thinking about this yesterday. I was reading that in 8.4 the optimizer can use bitmap indexes internally but you still can&#039;t create a bitmap index on a table. The bitmap index is ideal for these high volume, low cardinality tables. 

But I think we could actually mimic the behavior with a table that stored the index name and an array of values. As each row is indexed, it would look up the position of the value in the array and append it to the array if not found. (I&#039;m guessing this is pretty close to how the enum type works internally.) 

You could use that approach to make the above mentioned AutoEnum type (DynaEnum sounds like it might blow up on you) and for bitmap indexes.</description>
		<content:encoded><![CDATA[<p>Yeah I was thinking about this yesterday. I was reading that in 8.4 the optimizer can use bitmap indexes internally but you still can&#8217;t create a bitmap index on a table. The bitmap index is ideal for these high volume, low cardinality tables. </p>
<p>But I think we could actually mimic the behavior with a table that stored the index name and an array of values. As each row is indexed, it would look up the position of the value in the array and append it to the array if not found. (I&#8217;m guessing this is pretty close to how the enum type works internally.) </p>
<p>You could use that approach to make the above mentioned AutoEnum type (DynaEnum sounds like it might blow up on you) and for bitmap indexes.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mac</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27874</link>
		<dc:creator>Mac</dc:creator>
		<pubDate>Fri, 10 Jul 2009 11:37:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27874</guid>
		<description>Thanks for this hint.

What I&#039;m really looking for is a DynaEnum or AutoEnum datatype which supports the same features as an enum, plus:

- Behaves strictly like a varchar in all aspects...
- Automatically creates new values when needed - up to a limited number of new values (a few hundreds / thousands would probably be OK)
- Takes much less space
- Gives access to a quick function to count the distinct values


For instance in the case of Credit Card transactions processing I would have the following:

CREATE TABLE CreditCardTransaction (brand AUTOENUM, [...], transactionOutcome AUTOENUM);

INSERT INTO CreditCardTransaction VALUES (&#039;VISA&#039;, ..., &#039;OK&#039;);
INSERT INTO CreditCardTransaction VALUES (&#039;VISA&#039;, ..., &#039;OK&#039;);
INSERT INTO CreditCardTransaction VALUES (&#039;AMEX&#039;, ..., &#039;DENIED&#039;);

-&gt; this would automatically create 2 AutoEnums, one with (&#039;VISA&#039;, &#039;AMEX&#039;) and the other one with (&#039;OK&#039;, &#039;DENIED&#039;);

The storage need for those 2 enums would probably be really small.

I have many tables which would benefit from this, especially tables which are used to log events, where the number of possible values is small but not necessarily known beforehand (typically error messages).</description>
		<content:encoded><![CDATA[<p>Thanks for this hint.</p>
<p>What I&#8217;m really looking for is a DynaEnum or AutoEnum datatype which supports the same features as an enum, plus:</p>
<p>- Behaves strictly like a varchar in all aspects&#8230;<br />
- Automatically creates new values when needed &#8211; up to a limited number of new values (a few hundreds / thousands would probably be OK)<br />
- Takes much less space<br />
- Gives access to a quick function to count the distinct values</p>
<p>For instance in the case of Credit Card transactions processing I would have the following:</p>
<p>CREATE TABLE CreditCardTransaction (brand AUTOENUM, [...], transactionOutcome AUTOENUM);</p>
<p>INSERT INTO CreditCardTransaction VALUES (&#8216;VISA&#8217;, &#8230;, &#8216;OK&#8217;);<br />
INSERT INTO CreditCardTransaction VALUES (&#8216;VISA&#8217;, &#8230;, &#8216;OK&#8217;);<br />
INSERT INTO CreditCardTransaction VALUES (&#8216;AMEX&#8217;, &#8230;, &#8216;DENIED&#8217;);</p>
<p>-&gt; this would automatically create 2 AutoEnums, one with (&#8216;VISA&#8217;, &#8216;AMEX&#8217;) and the other one with (&#8216;OK&#8217;, &#8216;DENIED&#8217;);</p>
<p>The storage need for those 2 enums would probably be really small.</p>
<p>I have many tables which would benefit from this, especially tables which are used to log events, where the number of possible values is small but not necessarily known beforehand (typically error messages).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gregj</title>
		<link>http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/comment-page-1/#comment-27873</link>
		<dc:creator>gregj</dc:creator>
		<pubDate>Fri, 10 Jul 2009 10:42:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.depesz.com/?p=1455#comment-27873</guid>
		<description>About the first example: doing a count like that at the default transaction isolation level is dangerous.
In reality you would require all transactions on that table to be serializable, which is true for any concurrent math operations.</description>
		<content:encoded><![CDATA[<p>About the first example: doing a count like that at the default transaction isolation level is dangerous.<br />
In reality you would require all transactions on that table to be serializable, which is true for any concurrent math operations.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
