Waiting for 9.4 – Introduce jsonb, a structured format for storing json.

Portuguese Brazil Version

On 23rd of March, Andrew Dunstan committed patch:

Introduce jsonb, a structured format for storing json.
 
The new format accepts exactly the same data as the json type. However, it is
stored in a format that does not require reparsing the orgiginal text in order
to process it, making it much more suitable for indexing and other operations.
Insignificant whitespace is discarded, and the order of object keys is not
preserved. Neither are duplicate object keys kept - the later value for a given
key is the only one stored.
 
The new type has all the functions and operators that the json type has,
with the exception of the json generation functions (to_json, json_agg etc.)
and with identical semantics. In addition, there are operator classes for
hash and btree indexing, and two classes for GIN indexing, that have no
equivalent in the json type.
 
This feature grew out of previous work by Oleg Bartunov and Teodor Sigaev, which
was intended to provide similar facilities to a nested hstore type, but which
in the end proved to have some significant compatibility issues.
 
Authors: Oleg Bartunov,  Teodor Sigaev, Peter Geoghegan and Andrew Dunstan.
Review: Andres Freund

After it got committed, it got covered pretty well, but I decided to write about it too, with some examples.

First, let's see how it works.

I'll start with some test values:

{"a":"abc","d":"def","z":[1,2,3]}
{"a":"abc","d";"def","z":[1x2,3]}
{
    "a": "abc",
    "d": "def",
    "z": [1, 2, 3]
}

and

{"a":"abc","d":"def","z":[1,2,3],"d":"overwritten"}

First, let's see what will happen after casting these to json and jsonb:

SELECT '{"a":"abc","d":"def","z":[1,2,3]}'::json;
               json                
-----------------------------------
 {"a":"abc","d":"def","z":[1,2,3]}
(1 ROW)
 
SELECT '{"a":"abc","d":"def","z":[1,2,3]}'::jsonb;
                  jsonb                   
------------------------------------------
 {"a": "abc", "d": "def", "z": [1, 2, 3]}
(1 ROW)

All fine in here, but jsonb output is reformatted. In here it didn't do much, but it did add some whitespaces. That's nice.

SELECT '{"a":"abc","d";"def","z":[1x2,3]}'::json;
ERROR:  invalid INPUT syntax FOR TYPE json
LINE 1: SELECT '{"a":"abc","d";"def","z":[1x2,3]}'::json;
               ^
DETAIL:  Token ";" IS invalid.
CONTEXT:  JSON DATA, line 1: {"a":"abc","d";...
 
SELECT '{"a":"abc","d";"def","z":[1x2,3]}'::jsonb;
ERROR:  invalid INPUT syntax FOR TYPE json
LINE 1: SELECT '{"a":"abc","d";"def","z":[1x2,3]}'::jsonb;
               ^
DETAIL:  Token ";" IS invalid.
CONTEXT:  JSON DATA, line 1: {"a":"abc","d";...

In both cases it correctly reported error, but in the jsonb case it said
“invalid input syntax for type json". It's probably due to cast order, and
should be OK normally. And anyway – JSON and JSONB are similar enough not to
cause problems.

SELECT '{
    "a": "abc",
    "d": "def",
    "z": [1, 2, 3]
}'::json;
        json        
--------------------
 {                 +
     "a": "abc",   +
     "d": "def",   +
     "z": [1, 2, 3]+
 }
(1 ROW)
 
SELECT '{
    "a": "abc",
    "d": "def",
    "z": [1, 2, 3]
}'::jsonb;
                  jsonb                   
------------------------------------------
 {"a": "abc", "d": "def", "z": [1, 2, 3]}
(1 ROW)

In here we see the whitespace removal. I'd say it's pretty cool. Of course not if (for whatever reason) you want to retain the spaces, but these shouldn't be meaningful, so depending on them being there doesn't sound wise.

SELECT '{"a":"abc","d":"def","z":[1,2,3],"d":"overwritten"}'::json;
                        json                         
-----------------------------------------------------
 {"a":"abc","d":"def","z":[1,2,3],"d":"overwritten"}
(1 ROW)
 
SELECT '{"a":"abc","d":"def","z":[1,2,3],"d":"overwritten"}'::jsonb;
                      jsonb                       
--------------------------------------------------
 {"a": "abc", "d": "overwritten", "z": [1, 2, 3]}
(1 ROW)

And the value being overwritten. I'd say that it's great.

As for disk space usage – it will definitely depend on a case, but quick test shows that jsonb can use significantly more disk space:

$ SELECT pg_column_size('{"a":"abc","d":"def","z":[1,2,3]}'::jsonb);
 pg_column_size 
----------------
             84
(1 ROW)
 
$ SELECT pg_column_size('{"a":"abc","d":"def","z":[1,2,3]}'::json);
 pg_column_size 
----------------
             37
(1 ROW)

On the other hand for this JSON:

{"widget": {
    "debug": "on",
    "window": {
        "title": "Sample Konfabulator Widget",
        "name": "main_window",
        "width": 500,
        "height": 500
    },
    "image": { 
        "src": "Images/Sun.png",
        "name": "sun1",
        "hOffset": 250,
        "vOffset": 250,
        "alignment": "center"
    },
    "text": {
        "data": "Click Here",
        "size": 36,
        "style": "bold",
        "name": "text1",
        "hOffset": 250,
        "vOffset": 100,
        "alignment": "center",
        "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
    }
}}

JSON value takes 605 bytes, and JSONb – 524.

What's more…

For jsonb we have more operators. For example – equality:

$ SELECT '{"a":1, "b":2}'::jsonb = '{"b":2, "a":1}'::jsonb;
 ?COLUMN? 
----------
 t
(1 ROW)

More operators are described the docs.

What's more – the new datatype can use indexes to search for elements.

I made test table:

CREATE TABLE test (v jsonb);

and loaded there 100k rows, which look like:

{"i": 42, "s": "ryzdaoop"}

Some of the rows (~ 1%) have additional value in the json – key “r" with value 1.

Now, I can create index on this:

CREATE INDEX whatever ON test USING gin (v);
CREATE INDEX

and now:

EXPLAIN analyze SELECT * FROM test WHERE v ? 'r';
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=16.77..307.23 ROWS=100 width=42) (actual TIME=0.554..2.670 ROWS=1024 loops=1)
   Recheck Cond: (v ? 'r'::text)
   Heap Blocks: exact=644
   ->  Bitmap INDEX Scan ON whatever  (cost=0.00..16.75 ROWS=100 width=0) (actual TIME=0.416..0.416 ROWS=1024 loops=1)
         INDEX Cond: (v ? 'r'::text)
 Planning TIME: 0.475 ms
 Total runtime: 2.788 ms
(7 ROWS)

I could have also did:

EXPLAIN analyze SELECT * FROM test WHERE v @> '{"i":42}';
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=28.77..319.23 ROWS=100 width=42) (actual TIME=1.132..1.707 ROWS=103 loops=1)
   Recheck Cond: (v @> '{"i": 42}'::jsonb)
   Heap Blocks: exact=99
   ->  Bitmap INDEX Scan ON whatever  (cost=0.00..28.75 ROWS=100 width=0) (actual TIME=1.089..1.089 ROWS=103 loops=1)
         INDEX Cond: (v @> '{"i": 42}'::jsonb)
 Planning TIME: 0.482 ms
 Total runtime: 1.783 ms
(7 ROWS)

Or even:

EXPLAIN analyze SELECT * FROM test WHERE v @> '{"i":42, "r":1}';
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=52.77..343.23 ROWS=100 width=42) (actual TIME=1.171..1.191 ROWS=3 loops=1)
   Recheck Cond: (v @> '{"i": 42, "r": 1}'::jsonb)
   Heap Blocks: exact=3
   ->  Bitmap INDEX Scan ON whatever  (cost=0.00..52.75 ROWS=100 width=0) (actual TIME=1.143..1.143 ROWS=3 loops=1)
         INDEX Cond: (v @> '{"i": 42, "r": 1}'::jsonb)
 Planning TIME: 0.530 ms
 Total runtime: 1.256 ms
(7 ROWS)

Now, that's really cool.

Unfortunately you can't use the indexes to search “deep". What does it mean?

Let's assume you have JSON value like:

{"a": [1,2,3,4]}

You can index it, and search for " value ? ‘a' “, or " value @> ‘{“a":[1,2,3,4]}', but you can't search (using index) for “rows, which have 3 in the array that is under key “a" in the json".

Of course – you can work around it by creating index on ((value -> ‘a')) if that's what you really need.

Anyway – I really like it. It seems to work nicely, and I hope to get more features in there in the future.

Thanks guys.

8 thoughts on “Waiting for 9.4 – Introduce jsonb, a structured format for storing json.”

  1. “You can index it, and search for ” value ? ‘a’ “, or ” value @> ‘{“a”:[1,2,3,4]}’, but you can’t search for “rows, which have 3 in the array that is under key “a” in the json”.”

    Actually, yes, you can, and if containment is tested the query may use a non-expression index:

    [local]/postgres=# create table w as select ‘{“a”: [1,2,3,4]}’::jsonb t;
    SELECT 1
    [local]/postgres=# select * from w where t @> ‘{“a”:[3]}’;
    t
    ———————
    {“a”: [1, 2, 3, 4]}
    (1 row)

    What’s really cool about the jsonb_hash_ops opclass is that it allows for really flexible and effective indexing when the jsonb column is directly indexed, without ending up with gigantic index.

  2. @Peter:
    I meant that you can’t search using the index. Corrected. Searching without usage of index is of course possible.

  3. Such queries can use an index:

    [local]/postgres=# \d+ w
    Table “public.w”
    Column | Type | Modifiers | Storage | Stats target | Description
    ——–+——-+———–+———-+————–+————-
    t | jsonb | | extended | |
    Indexes:
    “ff” gin (t jsonb_hash_ops)
    Has OIDs: no

    [local]/postgres=# explain analyze select * from w where t @> ‘{“a”:[1]}’;
    QUERY PLAN
    ———————————————————————————————————–
    Bitmap Heap Scan on w (cost=8.00..12.01 rows=1 width=32) (actual time=0.053..0.056 rows=1 loops=1)
    Recheck Cond: (t @> ‘{“a”: [1]}’::jsonb)
    Heap Blocks: exact=1
    -> Bitmap Index Scan on ff (cost=0.00..8.00 rows=1 width=0) (actual time=0.027..0.027 rows=1 loops=1)
    Index Cond: (t @> ‘{“a”: [1]}’::jsonb)
    Planning time: 0.162 ms
    Total runtime: 0.124 ms
    (7 rows)

  4. I understand your confusion. Perhaps that could stand to be clearer.

    The nesting restriction referred to is really just that you cannot query against a “sub-document” extracted using the -> operator, for example, unless you have an appropriate expression index. This might be a particular problem if you were checking existence in your predicate (the ‘?’ operator), because that simply doesn’t work past the first nesting level (those are the semantics, but it would be hard to make “nested existence” checking work with the current approach to GIN indexing, plus that mostly isn’t too compelling anyway). There may be some minor advantages to testing “existence” rather than “containment”. Perhaps even the term “containment” is misleading, since it’s a kind of nested containment, but that’s a holdover from hstore.

    We want to discourage people from indexing everything, which is more or less what a straight GIN index on a jsonb column represents. People should continue to make informed decisions about what to index in a way consistent with actual querying patterns, since presumably in general it isn’t all that useful to be able to use an index scan to find *any* one thing.

    Having said all that, if you’re using a jsonb_hash_ops GIN index, this approach actually works surprisingly well.

  5. No doubt that JSONB will be useful to web developers, but it seems useless or annoying to everyone else as a regular document store. One of the attractions to PostgreSQL is its comprehensive data type support, as well as its user defined data types. If it requires casting dates and others into strings or integers, then those types have to be cast back for each query that reads them. Besides being very tedious to code around, this negates any benefit from being stored as binary.

    I suppose if one wants a hierarchical data store, the xml type may be preferable as XML can be used in various ways to deal with rich data types. But that is still a pain. I would rather have something like hstore2, but with all data types supported. That way, the I/O through the database driver is just like any column in the database.

Comments are closed.