Joining BTree and GIN/GiST indexes

Today, I'd like to show you how you can use the same index for two different types of conditions: one that uses normal BTree indexing (equal, less than, greater than), and one that uses a GIN/GiST index for full text searching.

Before I go any further: I will be showing EXPLAIN ANALYZE plans, but please don't compare the times. The dataset is small (it fits entirely in memory), and the conditions are picked to select a small part of the table, so the timings will not be particularly impressive.

The point of this post is to show you (in case you didn't know) that certain things are possible, so that you can test them in your environment and see how they perform with your data, on your hardware.

With the disclaimer done, let's go. First, we need some test data.

I created a simple table:

                            Table "public.test"
    Column    |  Type   |                     Modifiers                     
--------------+---------+---------------------------------------------------
 id           | integer | not null default nextval('test_id_seq'::regclass)
 file_path    | text    | 
 file_size    | integer | 
 file_content | text    | 
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

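In case you'd like to follow along, a table like this can be created with something along these lines (a sketch reconstructed from the \d output above; serial creates the test_id_seq sequence and its default for you):

$ CREATE TABLE test (
    id            serial PRIMARY KEY,
    file_path     text,
    file_size     integer,
    file_content  text
);
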
Into this table, I loaded some data about man/readme/perldoc files from my system.

Content looks like this:

$ SELECT * FROM test WHERE id = 10762;
-[ RECORD 1 ]+-------------------------------------------------------
 id           | 10762
 file_path    | /usr/share/doc/texlive-doc/latex/pgf-umlsd/README
 file_size    | 100
 file_content | Some LaTeX macros for UML Sequence Diagrams.          +
              | Home page of project: http://pgf-umlsd.googlecode.com/+
              |

Of course this is a rather short readme. There are others. Some quick stats:

  • Total number of records: 11974
  • Table size: 58MB
  • Minimal size of content: 1 character
  • Average size of content: 10,096 characters
  • Maximal size of content: 837,009 characters
  • Total size of content: 120,899,970 characters
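
If you want similar numbers for your own table, queries along these lines will do the trick (a sketch, not from the original post; pg_total_relation_size() also counts indexes and TOAST, so it's only an approximation of "table size"):

$ SELECT count(*),
         min(length(file_content)),
         round(avg(length(file_content))),
         max(length(file_content)),
         sum(length(file_content))
    FROM test;
$ SELECT pg_size_pretty(pg_total_relation_size('test'));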

The data size is not really impressive, but that's not the point.

First, let's see some sample queries:

-- number of documents containing the word "jpeg":
$ explain analyze select count(*) from test where file_content ~* 'jpeg';
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1423.68..1423.69 rows=1 width=0) (actual time=975.556..975.556 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..1423.67 rows=1 width=0) (actual time=3.504..975.486 rows=168 loops=1)
         Filter: (file_content ~* 'jpeg'::text)
         Rows Removed by Filter: 11806
 Planning time: 0.915 ms
 Execution time: 975.578 ms
(6 rows)
-- number of documents smaller than 700 characters:
$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE file_size < 700;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1426.39..1426.40 rows=1 width=0) (actual time=1.938..1.938 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..1423.67 rows=1086 width=0) (actual time=0.004..1.851 rows=1089 loops=1)
         Filter: (file_size < 700)
         Rows Removed by Filter: 10885
 Planning time: 0.041 ms
 Execution time: 1.956 ms
(6 rows)

And finally, the most interesting query:

-- number of documents smaller than 700 characters, and containing the word "jpeg"
$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE file_size < 700 AND file_content ~* 'jpeg';
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1453.61..1453.62 rows=1 width=0) (actual time=5.160..5.160 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..1453.61 rows=1 width=0) (actual time=1.569..5.158 rows=2 loops=1)
         Filter: ((file_size < 700) AND (file_content ~* 'jpeg'::text))
         Rows Removed by Filter: 11972
 Planning time: 0.907 ms
 Execution time: 5.177 ms
(6 rows)

OK. All times were taken by running each query 3 times and picking the best one.

Anyway. Let's assume we want the size-limiting queries to run fast. Obviously, we should create an index on file_size:

$ CREATE INDEX i1 ON test (file_size);
CREATE INDEX
 
$ vacuum analyze test;
VACUUM

Now, the queries that use file_size should be faster:

$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE file_size < 700;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=38.00..38.01 rows=1 width=0) (actual time=0.626..0.626 rows=1 loops=1)
   ->  Index Only Scan using i1 on test  (cost=0.29..35.29 rows=1086 width=0) (actual time=0.039..0.415 rows=1089 loops=1)
         Index Cond: (file_size < 700)
         Heap Fetches: 0
 Planning time: 0.068 ms
 Execution time: 0.657 ms
(6 rows)
 
$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE file_size < 700 AND file_content ~* 'jpeg';
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1320.78..1320.79 rows=1 width=0) (actual time=4.422..4.422 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=24.43..1320.77 rows=1 width=0) (actual time=1.163..4.419 rows=2 loops=1)
         Recheck Cond: (file_size < 700)
         Filter: (file_content ~* 'jpeg'::text)
         Rows Removed by Filter: 1087
         Heap Blocks: exact=596
         ->  Bitmap Index Scan on i1  (cost=0.00..24.43 rows=1086 width=0) (actual time=0.127..0.127 rows=1089 loops=1)
               Index Cond: (file_size < 700)
 Planning time: 1.012 ms
 Execution time: 4.446 ms
(10 rows)

Again, please disregard the fact that the time differences are very small. It's a very small dataset, and I just want to show "a thing", not necessarily how great it is, as that will depend directly on your data size.

What about the "jpeg" queries? Well, we can use tsearch indexes.

For now, let's assume I'll just use GIN indexing:

$ CREATE INDEX i2 ON test USING gin (to_tsvector('english', file_content));
...
CREATE INDEX
 
$ vacuum analyze test;
...
VACUUM

Now, the query that searches for jpeg will use the index, after a slight modification. Note that the condition is not exactly equivalent: ~* matches the substring anywhere, while the tsquery matches whole words, which is why the row counts differ (168 earlier vs. 119 below):

$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE to_tsvector('english', file_content) @@ 'jpeg';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=215.43..215.44 rows=1 width=0) (actual time=0.178..0.178 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=16.47..215.28 rows=60 width=0) (actual time=0.063..0.151 rows=119 loops=1)
         Recheck Cond: (to_tsvector('english'::regconfig, file_content) @@ '''jpeg'''::tsquery)
         Heap Blocks: exact=104
         ->  Bitmap Index Scan on i2  (cost=0.00..16.45 rows=60 width=0) (actual time=0.043..0.043 rows=119 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, file_content) @@ '''jpeg'''::tsquery)
 Planning time: 0.185 ms
 Execution time: 0.210 ms
(8 rows)
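
Side note: the bare 'jpeg' literal above is simply cast to tsquery. For anything more complex than a single word you would normally build the query explicitly, for example (an illustration, not something from the original post):

$ SELECT COUNT(*) FROM test
    WHERE to_tsvector('english', file_content) @@ to_tsquery('english', 'jpeg & png');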

But what will happen if I try to use both conditions?

$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE to_tsvector('english', file_content) @@ 'jpeg' AND file_size < 700;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=60.29..60.30 rows=1 width=0) (actual time=0.330..0.330 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=41.13..60.28 rows=5 width=0) (actual time=0.327..0.327 rows=1 loops=1)
         Recheck Cond: ((to_tsvector('english'::regconfig, file_content) @@ '''jpeg'''::tsquery) AND (file_size < 700))
         Heap Blocks: exact=1
         ->  BitmapAnd  (cost=41.13..41.13 rows=5 width=0) (actual time=0.323..0.323 rows=0 loops=1)
               ->  Bitmap Index Scan on i2  (cost=0.00..16.45 rows=60 width=0) (actual time=0.044..0.044 rows=119 loops=1)
                     Index Cond: (to_tsvector('english'::regconfig, file_content) @@ '''jpeg'''::tsquery)
               ->  Bitmap Index Scan on i1  (cost=0.00..24.43 rows=1086 width=0) (actual time=0.251..0.251 rows=1089 loops=1)
                     Index Cond: (file_size < 700)
 Planning time: 0.197 ms
 Execution time: 0.365 ms
(11 rows)

Both indexes are used, but some additional logic (the BitmapAnd node) is needed to "join" the results from them.

It would be better to have a single, multi-column index that satisfies this query. Unfortunately, you can't create one directly:

$ CREATE INDEX i3 ON test USING gin (file_size, to_tsvector('english', file_content));
ERROR:  data type integer has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

or:

$ CREATE INDEX i3 ON test USING gin (to_tsvector('english', file_content), file_size);
ERROR:  data type integer has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

Luckily, there are extensions. One of them is btree_gin (there is also btree_gist, if your "base" index has to be GiST).

What does it do? It adds support for "btree" style operators (equality and comparisons) to GIN indexes. So we can:

CREATE extension btree_gin;
CREATE EXTENSION
CREATE INDEX i3 ON test USING gin (file_size, to_tsvector('english', file_content));
...
CREATE INDEX
analyze test;
...
ANALYZE
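
If you're curious what the extension actually added, you can list the GIN operator classes now available (just a sanity check, not part of the original post):

$ SELECT opcname
    FROM pg_opclass o
    JOIN pg_am am ON am.oid = o.opcmethod
   WHERE am.amname = 'gin'
   ORDER BY opcname;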

And now:

EXPLAIN analyze SELECT COUNT(*) FROM test WHERE to_tsvector('english', file_content) @@ 'jpeg' AND file_size < 700;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=59.22..59.23 rows=1 width=0) (actual time=0.510..0.511 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=40.06..59.21 rows=5 width=0) (actual time=0.508..0.508 rows=1 loops=1)
         Recheck Cond: ((to_tsvector('english'::regconfig, file_content) @@ '''jpeg'''::tsquery) AND (file_size < 700))
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on i3  (cost=0.00..40.06 rows=5 width=0) (actual time=0.503..0.503 rows=1 loops=1)
               Index Cond: ((to_tsvector('english'::regconfig, file_content) @@ '''jpeg'''::tsquery) AND (file_size < 700))
 Planning time: 0.241 ms
 Execution time: 0.549 ms
(8 rows)

As you can see, the timing is not as good now (worse than the 0.365 ms we got earlier with two separate indexes), but that could be due to a number of reasons.

The important part is: with btree_gin, GIN indexes can be used for standard comparisons, so you can make a single index that handles both conditions on integers and full text searches. The same goes for GiST, but then the extension is named btree_gist.
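
For completeness, the GiST variant would look something like this (a sketch only; I haven't benchmarked it here, and the index name i4 is just an example):

$ CREATE EXTENSION btree_gist;
$ CREATE INDEX i4 ON test USING gist (file_size, to_tsvector('english', file_content));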

I just thought that you might have a use-case for it, and perhaps you never knew about this feature. Have fun 🙂
