Using JSON: json vs. jsonb, pglz vs. lz4, key optimization, parsing speed?

Recently(ish) I had a conversation on one of PostgreSQL support chats (IRC, Slack, or Discord) about efficient storage of JSON data, which compression to use, which datatype.

Unrelated to this, some people (at least two over the last year or so) said they weren't sure whether PostgreSQL optimizes storage within a column – for example, storing attribute names once per column, and not once per value.

Decided to investigate…

First, let's look at compression. By default my PostgreSQL is using:

=$ show default_toast_compression ;
 default_toast_compression
───────────────────────────
 pglz
(1 row)

Which is the older compression format. Since PostgreSQL 14, I can change it to use LZ4.
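
If I wanted lz4 everywhere, I could switch the default for the session (or set it in postgresql.conf). A quick sketch:

=$ set default_toast_compression = 'lz4';
=$ show default_toast_compression ;
 default_toast_compression
───────────────────────────
 lz4
(1 row)

For this test, though, I'll set the compression method per column instead.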

Simple table for testing:

=$ create table compression_test (test_name text, text_pglz text, text_lz4 text);
=$ alter table compression_test alter column text_pglz set compression pglz;
=$ alter table compression_test alter column text_lz4 set compression lz4;
 
=$ \d+ compression_test
                                     Table "public.compression_test"
  Column   | Type | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
-----------+------+-----------+----------+---------+----------+-------------+--------------+-------------
 test_name | text |           |          |         | extended |             |              | 
 text_pglz | text |           |          |         | extended | pglz        |              | 
 text_lz4  | text |           |          |         | extended | lz4         |              | 
Access method: heap

In there I loaded:

  • four random strings, each consisting of characters: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz, with lengths of: 5, 505, 1005, and 1505 characters.
  • ten documents from PostgreSQL documentation, picked so that they will differ in size by at least 400 characters, starting with legalnotice.html (2065 characters), and ending with pgfreespacemap.html (6779 characters).

Let's see how much disk space is used by each of these:

=$ SELECT
    test_name,
    length( text_pglz ) AS text_length,
    pg_column_compression( text_pglz ) AS pglz_comp,
    pg_column_size( text_pglz ) AS pglz_size,
    format(
        '%7s %%',
        round( ( 100.0 * pg_column_size( text_pglz ) ) / length( text_pglz ), 2 )
    ) AS pglz_ratio,
    pg_column_compression( text_lz4 ) AS lz4_comp,
    pg_column_size( text_lz4 ) AS lz4_size,
    format(
        '%6s %%',
        round( ( 100.0 * pg_column_size( text_lz4 ) ) / length( text_lz4 ), 2 )
    ) AS lz4_ratio
FROM
    compression_test
ORDER BY
    text_length;
            test_name            | text_length | pglz_comp | pglz_size | pglz_ratio | lz4_comp | lz4_size | lz4_ratio 
---------------------------------+-------------+-----------+-----------+------------+----------+----------+-----------
 random:5                        |           5 | [null]    |         6 |  120.00 % | [null]   |         6 | 120.00 %
 random:505                      |         505 | [null]    |       509 |  100.79 % | [null]   |       509 | 100.79 %
 random:1005                     |        1005 | [null]    |      1005 |  100.00 % | [null]   |      1009 | 100.40 %
 random:1505                     |        1505 | [null]    |      1505 |  100.00 % | [null]   |      1509 | 100.27 %
 legalnotice.html                |        2065 | [null]    |      2071 |  100.29 % | lz4      |      1630 |  78.93 %
 release-prior.html              |        2655 | pglz      |      1214 |   45.73 % | lz4      |      1302 |  49.04 %
 sql-dropuser.html               |        3188 | pglz      |      1462 |   45.86 % | lz4      |      1624 |  50.94 %
 pltcl-procnames.html            |        3670 | pglz      |      1811 |   49.35 % | lz4      |      1970 |  53.68 %
 spi-spi-getvalue.html           |        4216 | pglz      |      1771 |   42.01 % | lz4      |      2020 |  47.91 %
 catalog-pg-db-role-setting.html |        4715 | pglz      |      1811 |   38.41 % | lz4      |      2088 |  44.28 %
 supported-platforms.html        |        5235 | pglz      |      2584 |   49.36 % | lz4      |      2839 |  54.23 %
 datatype-money.html             |        5749 | pglz      |      2814 |   48.95 % | lz4      |      3066 |  53.33 %
 sql-dropsubscription.html       |        6289 | pglz      |      2838 |   45.13 % | lz4      |      3056 |  48.59 %
 pgfreespacemap.html             |        6779 | pglz      |      3134 |   46.23 % | lz4      |      3248 |  47.91 %
(14 rows)

Couple of interesting notes:

  • For small texts, compression doesn't kick in. This is because with the default (extended) storage, values are only compressed once they are large enough to trigger TOAST (roughly 2kB).
  • The size discrepancy for the pglz column when pglz is not actually used (pglz_comp is null) is simply the overhead of storing the length of the string: 1 byte for strings of up to 126 bytes, and 4 bytes afterwards.
  • There is an interesting discrepancy for legalnotice.html – it's not compressed, but the size difference is 6 bytes (2071 – 2065). Why? Not sure, to be honest. I thought it was because of switching to external storage, but that had already happened for random:1005. It's not a big difference, so I'd say it doesn't really matter.
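
The varlena header overhead from the second bullet is easy to see directly. A quick sketch (the values have to be stored in a table for the short 1-byte header to apply, and at these sizes no compression happens):

=$ create temp table header_demo (v text);
=$ insert into header_demo values ( repeat('x', 100) ), ( repeat('x', 200) );
=$ select length(v), pg_column_size(v) from header_demo;
-- should show 100 → 101 (1-byte header) and 200 → 204 (4-byte header)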

OK. So we see compression details. Specifically, I see that pglz compresses a bit better, but I know that lz4 is simply faster.

How much faster? Let's see. Loaded 16MB of text (all the HTML files from PostgreSQL docs, concatenated) into my test table. Then ran:

=$ select length(text_pglz) from compression_test ;
=$ select length(text_lz4) from compression_test ;

with \timing, multiple times, and took the best time. Results:

  • pglz: 37.705 ms ( 47% slower than lz4 )
  • lz4: 25.560 ms ( 32% faster than pglz )

OK. Compression is covered. Now, how about storing the data? The first check is trivial: does PostgreSQL optimize away simple things like key names? Let's make two simple tables:

=$ create table json_key_optimization_test (
    test_json json
);
=$ create table jsonb_key_optimization_test (
    test_jsonb jsonb
);

Now, I need to generate some random, non-trivially sized json objects. Will use this approach:

=$ CREATE temp TABLE sample_keys as
SELECT
    get_random_string(2) as k1,
    get_random_string(2) as k2,
    get_random_string(2) as k3,
    get_random_string(2) as k4,
    get_random_string(2) as k5;
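
get_random_string() is not a built-in – it's a helper I have lying around. A minimal sketch of such a function (assuming it simply returns a random alphanumeric string of the given length):

=$ create function get_random_string(len int4) returns text language sql as $$
    SELECT string_agg(
        substr(
            '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
            1 + floor(random() * 62)::int4,
            1
        ),
        ''
    )
    FROM generate_series(1, len)
$$;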
 
=$ CREATE temp TABLE sample_jsons as
SELECT
    json_build_object(k1, k1, k2, k2, k3, k3, k4, k4, k5, k5) AS json_val,
    jsonb_build_object(k1, k1, k2, k2, k3, k3, k4, k4, k5, k5) AS jsonb_val
FROM
    sample_keys;
SELECT 1
 
=$ select * from sample_jsons \gx
-[ RECORD 1 ]----------------------------------------------------------------
json_val  | {"Mn" : "Mn", "ff" : "ff", "fh" : "fh", "Tw" : "Tw", "k9" : "k9"}
jsonb_val | {"Mn": "Mn", "Tw": "Tw", "ff": "ff", "fh": "fh", "k9": "k9"}

Will just make the strings longer – let's say 500 characters each. That should be enough.

Let's see table sizes after inserting a single such row:

=$ SELECT
    pg_table_size('json_key_optimization_test'::regclass) as json_table_size,
    pg_table_size('jsonb_key_optimization_test'::regclass) as jsonb_table_size;
 json_table_size | jsonb_table_size 
-----------------+------------------
           73728 |            81920
(1 row)

Tested also sizes of generated values:

  • JSON: length(test_json::text) is 5045 bytes, and pg_column_size is 2878
  • JSONB: length(test_jsonb::text) is 5040 bytes, and pg_column_size is 2865
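
These numbers came from queries along these lines:

=$ select length(test_json::text), pg_column_size(test_json) from json_key_optimization_test;
=$ select length(test_jsonb::text), pg_column_size(test_jsonb) from jsonb_key_optimization_test;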

Which shows that jsonb seems to be marginally smaller for such an extreme case.

Great. Now, for the fun part. Will insert 19 more rows, gathering sizes after each insert. Results (all sizes in bytes):

rows | JSON size | JSON per row | JSONB per row | JSONB size
   1 |    32,768 |       32,768 |        32,768 |     32,768
   2 |    32,768 |       16,384 |        16,384 |     32,768
   3 |    65,536 |       21,845 |        21,845 |     65,536
   4 |    65,536 |       16,384 |        16,384 |     65,536
   5 |    65,536 |       13,107 |        13,107 |     65,536
   6 |    73,728 |       12,288 |        12,288 |     73,728
   7 |    73,728 |       10,532 |        10,532 |     73,728
   8 |    73,728 |        9,216 |         9,216 |     73,728
   9 |    81,920 |        9,102 |         9,102 |     81,920
  10 |    81,920 |        8,192 |         8,192 |     81,920
  11 |    90,112 |        8,192 |         8,192 |     90,112
  12 |    90,112 |        7,509 |         7,509 |     90,112
  13 |    90,112 |        6,931 |         6,931 |     90,112
  14 |    98,304 |        7,021 |         7,021 |     98,304
  15 |    98,304 |        6,553 |         6,553 |     98,304
  16 |    98,304 |        6,144 |         6,144 |     98,304
  17 |   106,496 |        6,264 |         6,264 |    106,496
  18 |   106,496 |        5,916 |         5,916 |    106,496
  19 |   114,688 |        6,036 |         6,036 |    114,688
  20 |   114,688 |        5,734 |         5,734 |    114,688
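
The insert-and-measure loop itself can be sketched like this (a DO block is just one way to do it):

=$ do $$
begin
    for i in 2..20 loop
        insert into json_key_optimization_test  select json_val  from sample_jsons;
        insert into jsonb_key_optimization_test select jsonb_val from sample_jsons;
        raise notice 'rows: %  json: %  jsonb: %', i,
            pg_table_size('json_key_optimization_test'),
            pg_table_size('jsonb_key_optimization_test');
    end loop;
end
$$;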

So, the size increase is not linear, which might seem to point to some optimization. But the average size per row is still significantly larger than the size of a single value, so I assume there are no cross-row optimizations, and the nonlinear size scaling simply comes from uneven filling of table/toast pages.

Now, let's look at space usage of JSON vs. JSONB. In the example above we've seen that jsonb values are slightly smaller, both on disk and as length(val::text). Is it always the case?

Let's start with very simple json value:

=$ create table size_comparison (json_val json, jsonb_val jsonb);
 
=$ with s as materialized ( select 1::int4 as v )
insert into size_comparison select to_json(v), to_jsonb(v) from s;
 
=$ SELECT
    *,
    pg_column_size( json_val ) AS json_size,
    pg_column_size( jsonb_val ) AS jsonb_size
FROM
    size_comparison;
 json_val | jsonb_val | json_size | jsonb_size 
----------+-----------+-----------+------------
 1        | 1         |         2 |         17
(1 row)

As you can see, in case of a single-digit integer, jsonb uses MUCH more space. How about different values?

Value                                                                                        | JSON size (bytes) | JSONB size (bytes)
["Joe", "Elon", "Jeff", "Mia", "Jenna", "Louis"]                                             |                49 |                 53
[12, 56, 64, 123, 2, 489]                                                                    |                26 |                 77
{"isActive": true, "createdAt": "2022-05-15T08:30:00Z", "lastLogin": "2025-01-10T12:45:00Z"} |                93 |                 95
~ 1.5kB json                                                                                 |               828 |                976
~ 4.5kB json                                                                                 |              1666 |               2066
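
Each of the rows above can be reproduced with the same pattern as the int4 example, e.g.:

=$ truncate size_comparison;
=$ with s as materialized ( select '[12, 56, 64, 123, 2, 489]'::json as v )
insert into size_comparison select v, v::jsonb from s;
=$ select pg_column_size( json_val ) as json_size, pg_column_size( jsonb_val ) as jsonb_size from size_comparison;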

So, it looks like JSON values use slightly less space. Personally, I think it's a cost worth paying, given jsonb's enormous functionality increase.

Finally, one last thing I want to touch on: speed of parsing values. Recently someone somewhere commented that parsing JSON values was a huge part of inserting data into tables. In their case, if I recall correctly, they were inserting json values around 50kB in size.

So, let's try. Downloaded a huge JSON, and extracted 50kB out of it. Then made a simple file that contained the same json, as a single line, repeated 4100 times, so that the full dataset was ~ 200MB.

Then I made six separate tables:

=$ create table copy_text_pglz  (v text  compression pglz );
=$ create table copy_json_pglz  (v json  compression pglz );
=$ create table copy_jsonb_pglz (v jsonb compression pglz );
=$ create table copy_text_lz4   (v text  compression lz4 );
=$ create table copy_json_lz4   (v json  compression lz4 );
=$ create table copy_jsonb_lz4  (v jsonb compression lz4 );

And then ran, for each table:

truncate copy_text_pglz;
checkpoint;
\timing on
\copy copy_text_pglz from test-200mb.json
checkpoint;
\timing off
select pg_table_size('copy_text_pglz'::regclass);

Then I summed the time used by \copy and the immediate checkpoint afterwards.

Each test was repeated 6 times, and best time picked. Results:

Datatype | Compression | Time to load (ms) | Table size (MB)
text     | lz4         |           779.105 |              54
json     | lz4         |         1,155.428 |              54
jsonb    | lz4         |         2,186.738 |              65
text     | pglz        |         2,498.587 |              65
json     | pglz        |         2,917.188 |              67
jsonb    | pglz        |         4,418.037 |              73

It clearly shows that while jsonb has some non-trivial overhead over json (and json, in turn, over plain text), the overhead of pglz is much greater.

So, these are writes.

How about reads? First, let's just assume I want to export the whole thing, using:

=$ select v from some_table;

Tested it using explain (analyze, serialize binary), and compared execution time (which includes serialization). Results are somewhat surprising:

Datatype | Compression | Time to select (ms)
text     | lz4         |              74.374
json     | lz4         |              90.509
text     | pglz        |             120.669
json     | pglz        |             131.659
jsonb    | lz4         |             609.387
jsonb    | pglz        |             656.079
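
For reference, each of those timings came from a run of this shape (the SERIALIZE option of explain is available since PostgreSQL 17):

=$ explain (analyze, serialize binary) select v from copy_text_lz4;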

So, if you want to store json data in the database and export it as one big data dump, and assuming you trust your inputs, just use the text datatype with lz4 compression.

This doesn't look like a sensible use of json in a database, though. Normally people want to select rows based on some parts of the jsons, or extract information, or aggregate, or whatever. And for this – jsonb has no real competition.

To give you some perspective, I timed a query that returns the number of elements in the stored array.

Queries looked like this:

=$ select jsonb_array_length(v) from copy_jsonb_lz4;
=$ select jsonb_array_length(v) from copy_jsonb_pglz;
=$ select jsonb_array_length(v::jsonb) from copy_text_lz4;
=$ select jsonb_array_length(v::jsonb) from copy_text_pglz;
=$ select json_array_length(v) from copy_json_lz4;
=$ select json_array_length(v) from copy_json_pglz;
=$ select json_array_length(v::json) from copy_text_lz4;
=$ select json_array_length(v::json) from copy_text_pglz;

For text, I first had to cast to json or jsonb, so there are two tests for each text-based table.

Results aren't surprising:

Datatype    | Compression | Time to select (ms)
jsonb       | lz4         |              57.420
jsonb       | pglz        |             112.459
json        | lz4         |             396.273
json        | pglz        |             432.536
text::json  | lz4         |             730.255
text::json  | pglz        |             785.524
text::jsonb | lz4         |           1,600.397
text::jsonb | pglz        |           1,640.635

Uff, that took a while to research, test, and write up. Hope you'll find it useful in some way 🙂
