Recently(ish) I had a conversation on one of PostgreSQL support chats (IRC, Slack, or Discord) about efficient storage of JSON data, which compression to use, which datatype.
Unrelated to this, some people (at least two over the last year or so) said they weren't sure whether PostgreSQL optimizes storage across rows within a column, for example by storing JSON attribute names once per column instead of once per value.
Decided to investigate…
First, let's look at compression. By default my PostgreSQL is using:
```
=$ show default_toast_compression;
 default_toast_compression
───────────────────────────
 pglz
(1 row)
```
This is the older compression format. Since PostgreSQL 14 I can change it to use LZ4.
Simple table for testing:
```
=$ create table compression_test (test_name text, text_pglz text, text_lz4 text);
=$ alter table compression_test alter column text_pglz set compression pglz;
=$ alter table compression_test alter column text_lz4 set compression lz4;
=$ \d+ compression_test
                                  Table "public.compression_test"
  Column    | Type | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
------------+------+-----------+----------+---------+----------+-------------+--------------+-------------
 test_name  | text |           |          |         | extended |             |              |
 text_pglz  | text |           |          |         | extended | pglz        |              |
 text_lz4   | text |           |          |         | extended | lz4         |              |
Access method: heap
```
In there I loaded:
- four random strings, each consisting of characters: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz, with lengths of: 5, 505, 1005, and 1505 characters.
- ten documents from the PostgreSQL documentation, picked so that they differ in size by at least 400 characters, starting with legalnotice.html (2065 characters) and ending with pgfreespacemap.html (6779 characters).
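For reference, the random strings can be generated with a short sketch like this (a Python equivalent of what I used; only the alphabet and the lengths come from the description above):

```python
import random
import string

# 62-character alphabet: digits + upper- and lower-case letters
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def random_string(length: int) -> str:
    """Random string of `length` characters from the 62-char alphabet."""
    return "".join(random.choice(ALPHABET) for _ in range(length))

# The four test strings loaded into compression_test
samples = {f"random:{n}": random_string(n) for n in (5, 505, 1005, 1505)}
```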
Let's see how much disk space is used by each of these:
```
=$ SELECT test_name,
          length(text_pglz) AS text_length,
          pg_column_compression(text_pglz) AS pglz_comp,
          pg_column_size(text_pglz) AS pglz_size,
          format('%7s %%', round((100.0 * pg_column_size(text_pglz)) / length(text_pglz), 2)) AS pglz_ratio,
          pg_column_compression(text_lz4) AS lz4_comp,
          pg_column_size(text_lz4) AS lz4_size,
          format('%6s %%', round((100.0 * pg_column_size(text_lz4)) / length(text_lz4), 2)) AS lz4_ratio
   FROM compression_test
   ORDER BY text_length;
            test_name            | text_length | pglz_comp | pglz_size | pglz_ratio | lz4_comp | lz4_size | lz4_ratio
---------------------------------+-------------+-----------+-----------+------------+----------+----------+-----------
 random:5                        |           5 | [null]    |         6 |  120.00 %  | [null]   |        6 | 120.00 %
 random:505                      |         505 | [null]    |       509 |  100.79 %  | [null]   |      509 | 100.79 %
 random:1005                     |        1005 | [null]    |      1005 |  100.00 %  | [null]   |     1009 | 100.40 %
 random:1505                     |        1505 | [null]    |      1505 |  100.00 %  | [null]   |     1509 | 100.27 %
 legalnotice.html                |        2065 | [null]    |      2071 |  100.29 %  | lz4      |     1630 |  78.93 %
 release-prior.html              |        2655 | pglz      |      1214 |   45.73 %  | lz4      |     1302 |  49.04 %
 sql-dropuser.html               |        3188 | pglz      |      1462 |   45.86 %  | lz4      |     1624 |  50.94 %
 pltcl-procnames.html            |        3670 | pglz      |      1811 |   49.35 %  | lz4      |     1970 |  53.68 %
 spi-spi-getvalue.html           |        4216 | pglz      |      1771 |   42.01 %  | lz4      |     2020 |  47.91 %
 catalog-pg-db-role-setting.html |        4715 | pglz      |      1811 |   38.41 %  | lz4      |     2088 |  44.28 %
 supported-platforms.html        |        5235 | pglz      |      2584 |   49.36 %  | lz4      |     2839 |  54.23 %
 datatype-money.html             |        5749 | pglz      |      2814 |   48.95 %  | lz4      |     3066 |  53.33 %
 sql-dropsubscription.html       |        6289 | pglz      |      2838 |   45.13 %  | lz4      |     3056 |  48.59 %
 pgfreespacemap.html             |        6779 | pglz      |      3134 |   46.23 %  | lz4      |     3248 |  47.91 %
(14 rows)
```
Couple of interesting notes:
- For small texts, compression doesn't kick in. This is because of the default storage settings, which don't try to compress small values.
- The size discrepancy in the pglz column when compression is not actually used (pglz_comp is null) is simply the overhead of storing the string length: a 1-byte header for short values (up to 126 bytes of data), switching to a 4-byte header after that.
- There is an interesting discrepancy for legalnotice.html: it's not compressed, yet the size difference is 6 bytes (2071 - 2065). Why? Not sure, to be honest. I thought it was because of switching to external storage, but that had already happened for random:1005. It's not a big difference, so I'd say it doesn't really matter.
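The length-header overhead can be sketched numerically. This is a simplified model (it assumes no compression and ignores TOAST and alignment details, and the 126-byte short-header limit is my assumption), but it matches the uncompressed rows above:

```python
def stored_size(length: int) -> int:
    """Approximate on-disk size of an uncompressed text value:
    a 1-byte varlena header for short values, 4 bytes otherwise."""
    SHORT_LIMIT = 126  # assumed max data bytes that fit the 1-byte header
    return length + (1 if length <= SHORT_LIMIT else 4)

# Matches the pglz_size column for the uncompressed random strings:
print(stored_size(5))    # 6
print(stored_size(505))  # 509
```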
OK. So we see compression details. Specifically, I see that pglz compresses a bit better, but I know that lz4 is simply faster.
How much faster? Let's see. I loaded 16MB of text (all of the html files from the PostgreSQL docs, concatenated) into my test table. Then ran:
```
=$ select length(text_pglz) from compression_test;
=$ select length(text_lz4) from compression_test;
```
with \timing, multiple times, and took the best time. Results:
- pglz: 37.705 ms (+47% relative to lz4)
- lz4: 25.560 ms (-32% relative to pglz)
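The percentages are simply the two best times compared against each other; a quick arithmetic check:

```python
pglz_ms = 37.705
lz4_ms = 25.560

pglz_overhead = (pglz_ms / lz4_ms - 1) * 100  # how much slower pglz is
lz4_saving = (lz4_ms / pglz_ms - 1) * 100     # how much faster lz4 is

print(round(pglz_overhead, 1))  # 47.5
print(round(lz4_saving, 1))     # -32.2
```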
OK. Compression is covered. Now, how about data storage? The first thing to check is trivial: does PostgreSQL optimize away simple things like key names? Let's make two simple tables:
```
=$ create table json_key_optimization_test ( test_json json );
=$ create table jsonb_key_optimization_test ( test_jsonb jsonb );
```
Now, I need to generate some random, non-trivially sized json objects. Will use this approach:
```
=$ CREATE temp TABLE sample_keys AS
       SELECT get_random_string(2) AS k1,
              get_random_string(2) AS k2,
              get_random_string(2) AS k3,
              get_random_string(2) AS k4,
              get_random_string(2) AS k5;
=$ CREATE temp TABLE sample_jsons AS
       SELECT json_build_object(k1, k1, k2, k2, k3, k3, k4, k4, k5, k5) AS json_val,
              jsonb_build_object(k1, k1, k2, k2, k3, k3, k4, k4, k5, k5) AS jsonb_val
       FROM sample_keys;
SELECT 1
=$ select * from sample_jsons \gx
-[ RECORD 1 ]-----------------------------------------------------------------
json_val  | {"Mn" : "Mn", "ff" : "ff", "fh" : "fh", "Tw" : "Tw", "k9" : "k9"}
jsonb_val | {"Mn": "Mn", "Tw": "Tw", "ff": "ff", "fh": "fh", "k9": "k9"}
```
For the real test, will just make the strings longer, let's say 500 characters each. That should be enough.
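For illustration, here is a Python sketch of the same construction: five random 500-character keys, each used as its own value, mirroring json_build_object(k1, k1, …, k5, k5) above (assuming get_random_string draws from the 62-character alphabet used earlier):

```python
import json
import random
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def random_string(length: int) -> str:
    return "".join(random.choice(ALPHABET) for _ in range(length))

# Five random 500-character keys, each used as its own value
keys = [random_string(500) for _ in range(5)]
doc = {k: k for k in keys}

# 5 * (500 + 500 + 4 quotes + 2 for ': ') + 4 * 2 for ', ' + 2 braces = 5040
print(len(json.dumps(doc)))  # 5040
```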
Let's see table sizes after inserting single such row:
```
=$ SELECT pg_table_size('json_key_optimization_test'::regclass) AS json_table_size,
          pg_table_size('jsonb_key_optimization_test'::regclass) AS jsonb_table_size;
 json_table_size | jsonb_table_size
-----------------+------------------
           73728 |            81920
(1 row)
```
Tested also sizes of generated values:
- JSON: length(test_json::text) is 5045 bytes, and pg_column_size is 2878
- JSONB: length(test_jsonb::text) is 5040 bytes, and pg_column_size is 2865
Which shows that jsonb seems to be marginally smaller in such an extreme case.
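The 5-byte difference between the two text lengths (5045 vs 5040) appears to come purely from spacing: json keeps json_build_object's output verbatim, which puts `" : "` between key and value, while jsonb's canonical text form uses `": "`. That's one extra space per pair, times five pairs. A quick check (json.dumps' default separators happen to match jsonb's canonical form):

```python
import json

doc = {"k%d" % i: "v%d" % i for i in range(5)}

jsonb_text = json.dumps(doc)                           # '": "' between key and value
json_text = json.dumps(doc, separators=(", ", " : "))  # '" : "', like json_build_object

print(len(json_text) - len(jsonb_text))  # 5
```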
Great. Now, for the fun part. Will insert 19 more rows, gathering sizes after each insert. Results (all sizes in bytes):
| rows | JSON size | JSON per row | JSONB size | JSONB per row |
|---|---|---|---|---|
| 1 | 32,768 | 32,768 | 32,768 | 32,768 |
| 2 | 32,768 | 16,384 | 32,768 | 16,384 |
| 3 | 65,536 | 21,845 | 65,536 | 21,845 |
| 4 | 65,536 | 16,384 | 65,536 | 16,384 |
| 5 | 65,536 | 13,107 | 65,536 | 13,107 |
| 6 | 73,728 | 12,288 | 73,728 | 12,288 |
| 7 | 73,728 | 10,532 | 73,728 | 10,532 |
| 8 | 73,728 | 9,216 | 73,728 | 9,216 |
| 9 | 81,920 | 9,102 | 81,920 | 9,102 |
| 10 | 81,920 | 8,192 | 81,920 | 8,192 |
| 11 | 90,112 | 8,192 | 90,112 | 8,192 |
| 12 | 90,112 | 7,509 | 90,112 | 7,509 |
| 13 | 90,112 | 6,931 | 90,112 | 6,931 |
| 14 | 98,304 | 7,021 | 98,304 | 7,021 |
| 15 | 98,304 | 6,553 | 98,304 | 6,553 |
| 16 | 98,304 | 6,144 | 98,304 | 6,144 |
| 17 | 106,496 | 6,264 | 106,496 | 6,264 |
| 18 | 106,496 | 5,916 | 106,496 | 5,916 |
| 19 | 114,688 | 6,036 | 114,688 | 6,036 |
| 20 | 114,688 | 5,734 | 114,688 | 5,734 |
So, the size increase is not linear, which might suggest some optimization. But the average size per row is still significantly larger than the size of a single value. So I assume there are no cross-row optimizations, and the nonlinear scaling simply comes from uneven filling of table/TOAST pages.
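A quick sanity check on these numbers: after 20 rows, the average cost per row is still roughly twice the pg_column_size of a single json value measured earlier, so clearly nothing is being shared across rows:

```python
total_size = 114_688    # table size after 20 rows, from the table above
rows = 20
single_value = 2_878    # pg_column_size of one json value, measured earlier

per_row = total_size / rows
print(round(per_row))                    # 5734
print(round(per_row / single_value, 2))  # 1.99
```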
Now, let's look at space usage of JSON vs. JSONB. In the example above we've seen that jsonb values are slightly smaller, both on disk and as length(val::text). Is it always the case?
Let's start with very simple json value:
```
=$ create table size_comparison (json_val json, jsonb_val jsonb);
=$ with s as materialized ( select 1::int4 as v )
   insert into size_comparison
   select to_json(v), to_jsonb(v) from s;
=$ SELECT *,
          pg_column_size( json_val ) AS json_size,
          pg_column_size( jsonb_val ) AS jsonb_size
   FROM size_comparison;
 json_val | jsonb_val | json_size | jsonb_size
----------+-----------+-----------+------------
 1        | 1         |         2 |         17
(1 row)
```
As you can see, in the case of a single-digit integer, jsonb uses MUCH more space. How about different values?
| Value | JSON size (bytes) | JSONB size (bytes) |
|---|---|---|
| `["Joe", "Elon", "Jeff", "Mia", "Jenna", "Louis"]` | 49 | 53 |
| `[12, 56, 64, 123, 2, 489]` | 26 | 77 |
| `{"isActive": true, "createdAt": "2022-05-15T08:30:00Z", "lastLogin": "2025-01-10T12:45:00Z"}` | 93 | 95 |
| ~ 1.5kB json | 828 | 976 |
| ~ 4.5kB json | 1666 | 2066 |
So, it looks like JSON values use slightly less space. Personally, I think jsonb's extra space is a cost worth paying, given the enormous functionality increase.
Finally, one last thing I want to touch on: speed of parsing values. Recently someone somewhere commented that parsing JSON values was a huge part of inserting data into tables. In their case, if I recall correctly, they were inserting json values of ~50kB.
So, let's try. I downloaded a huge JSON, and extracted 50kB out of it. Then I made a simple file that contained the same json, as a single line, repeated 4100 times, so that the full dataset was ~200MB.
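Building such a test file is trivial; here's a sketch (the repeat count matches the description; the function name and the ~50kB document itself are placeholders, byte counts assume ASCII json):

```python
def build_test_file(doc: str, path: str, repeats: int = 4100) -> int:
    """Write `doc` as one line, `repeats` times; return total bytes written."""
    line = doc.strip() + "\n"
    with open(path, "w", encoding="utf-8") as out:
        for _ in range(repeats):
            out.write(line)
    return len(line) * repeats

# A ~50kB line repeated 4100 times gives roughly 200MB:
# 50_000 * 4100 = 205_000_000 bytes
```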
Then I made six separate tables:
```
=$ create table copy_text_pglz  (v text  compression pglz);
=$ create table copy_json_pglz  (v json  compression pglz);
=$ create table copy_jsonb_pglz (v jsonb compression pglz);
=$ create table copy_text_lz4   (v text  compression lz4);
=$ create table copy_json_lz4   (v json  compression lz4);
=$ create table copy_jsonb_lz4  (v jsonb compression lz4);
```
And then ran, for each table:
```
truncate copy_text_pglz;
checkpoint;
\timing on
\copy copy_text_pglz from test-200mb.json
checkpoint;
\timing off
select pg_table_size('copy_text_pglz'::regclass);
```
Then I summed time used by \copy and the immediate checkpoint afterwards.
Each test was repeated 6 times, and best time picked. Results:
| Datatype: | Compression: | Time to load (ms): | Table size (MB): |
|---|---|---|---|
| text | lz4 | 779.105 | 54 |
| json | lz4 | 1,155.428 | 54 |
| jsonb | lz4 | 2,186.738 | 65 |
| text | pglz | 2,498.587 | 65 |
| json | pglz | 2,917.188 | 67 |
| jsonb | pglz | 4,418.037 | 73 |
It clearly shows that, while jsonb has some non-trivial overhead over json (and json, in turn, over plain text), the overhead of pglz is much greater.
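To quantify that from the table: for plain text, loading with pglz is over 3x slower than with lz4, while the json-parsing overhead (json vs. text, both with lz4) is comparatively modest:

```python
# Best load times from the table above, in milliseconds
times_ms = {
    ("text", "lz4"): 779.105,
    ("json", "lz4"): 1155.428,
    ("text", "pglz"): 2498.587,
}

pglz_penalty = times_ms[("text", "pglz")] / times_ms[("text", "lz4")]
json_penalty = times_ms[("json", "lz4")] / times_ms[("text", "lz4")]

print(round(pglz_penalty, 1))  # 3.2
print(round(json_penalty, 1))  # 1.5
```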
So, these are writes.
How about reads? First, let's just assume I want to export the whole thing, using:
=$ select v from some_table;
Tested it using explain (analyze, serialize binary), and compared execution time (which includes serialization). Results are somewhat surprising:
| Datatype: | Compression: | Time to select (ms): |
|---|---|---|
| text | lz4 | 74.374 |
| json | lz4 | 90.509 |
| text | pglz | 120.669 |
| json | pglz | 131.659 |
| jsonb | lz4 | 609.387 |
| jsonb | pglz | 656.079 |
So, to store json data in the database and export it as one big data dump, assuming you trust your inputs, just use the text datatype with lz4 compression.
This doesn't look like a sensible use of json in a database, though. Normally people want to select rows based on some parts of the json, or extract information, or aggregate, or whatever. And for this, jsonb has no real competition.
To give you some perspective, I timed a query that returns the number of elements in the stored array.
Queries looked like this:
```
=$ select jsonb_array_length(v) from copy_jsonb_lz4;
=$ select jsonb_array_length(v) from copy_jsonb_pglz;
=$ select jsonb_array_length(v::jsonb) from copy_text_lz4;
=$ select jsonb_array_length(v::jsonb) from copy_text_pglz;
=$ select json_array_length(v) from copy_json_lz4;
=$ select json_array_length(v) from copy_json_pglz;
=$ select json_array_length(v::json) from copy_text_lz4;
=$ select json_array_length(v::json) from copy_text_pglz;
```
For text, I had to first cast to json or jsonb, so there are two tests for each text-based table.
Results aren't surprising:
| Datatype: | Compression: | Time to select (ms): |
|---|---|---|
| jsonb | lz4 | 57.420 |
| jsonb | pglz | 112.459 |
| json | lz4 | 396.273 |
| json | pglz | 432.536 |
| text::json | lz4 | 730.255 |
| text::json | pglz | 785.524 |
| text::jsonb | lz4 | 1,600.397 |
| text::jsonb | pglz | 1,640.635 |
Uff, that took a while to research, test, and write up. Hope you'll find it useful in some way 🙂