Text comparisons that do automatic trim()

SoftNum asked on irc:

< SoftNum> does postgresql have a config option to automatically trim (both ' ' from blah) on string compares?

So, can you?

Of course there is no such option, but maybe there is a way to tell PostgreSQL to do this trim for a given field? Sure there is 🙂

To do it, we will need another datatype. Don't worry – it is not complicated, and requires only some copy/paste abilities.

So, first let's create a simple domain:

CREATE DOMAIN trimmed_text AS TEXT;

Doesn't look scary, does it?

Now we will need to add some operators. Basically the only ones that are important for me now are "=" and "<>".

Since people usually use the text datatype (or something else that can be cast to text), I will need to create 6 separate operators:

text = trimmed_text
text <> trimmed_text
trimmed_text = text
trimmed_text <> text
trimmed_text = trimmed_text
trimmed_text <> trimmed_text

Luckily, it is very simple and requires only the SQL language. First operator:

CREATE FUNCTION trimmed_text_req(TEXT, trimmed_text) RETURNS bool AS $$
    SELECT btrim($1) = btrim($2);
$$ LANGUAGE SQL immutable;
 
CREATE OPERATOR = (
    leftarg = text,
    rightarg = trimmed_text,
    negator = <>,
    PROCEDURE = trimmed_text_req
);

I hope the code is self-explanatory. If not, please feel free to ask questions in comments. Or just bug me on freenode #postgresql.
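
A quick way to convince yourself that the new operator is really in play is to compare two literals with explicit casts. This is only a sketch: the literals are made up, and it assumes the domain and the operator above have already been created.

```sql
-- Left side is plain text, right side is trimmed_text,
-- so the custom text = trimmed_text operator is chosen.
SELECT '  depesz '::text = 'depesz'::trimmed_text AS trimmed_equal;

-- For contrast: the built-in text = text operator still sees the spaces.
SELECT '  depesz '::text = 'depesz'::text AS plain_equal;
```

The first query should return true and the second false.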

Using this as a template, I quickly wrote the rest of the operators:

CREATE FUNCTION trimmed_text_rne(TEXT, trimmed_text) RETURNS bool AS $$
    SELECT btrim($1) <> btrim($2);
$$ LANGUAGE SQL immutable;
 
CREATE OPERATOR <> (
    leftarg = text,
    rightarg = trimmed_text,
    negator = =,
    PROCEDURE = trimmed_text_rne
);

CREATE FUNCTION trimmed_text_leq(trimmed_text, TEXT) RETURNS bool AS $$
    SELECT btrim($1) = btrim($2);
$$ LANGUAGE SQL immutable;
 
CREATE OPERATOR = (
    leftarg = trimmed_text,
    rightarg = text,
    negator = <>,
    PROCEDURE = trimmed_text_leq
);

CREATE FUNCTION trimmed_text_lne(trimmed_text, TEXT) RETURNS bool AS $$
    SELECT btrim($1) <> btrim($2);
$$ LANGUAGE SQL immutable;
 
CREATE OPERATOR <> (
    leftarg = trimmed_text,
    rightarg = text,
    negator = =,
    PROCEDURE = trimmed_text_lne
);

CREATE FUNCTION trimmed_text_beq(trimmed_text, trimmed_text) RETURNS bool AS $$
    SELECT btrim($1) = btrim($2);
$$ LANGUAGE SQL immutable;
 
CREATE OPERATOR = (
    leftarg = trimmed_text,
    rightarg = trimmed_text,
    negator = <>,
    PROCEDURE = trimmed_text_beq
);

CREATE FUNCTION trimmed_text_bne(trimmed_text, trimmed_text) RETURNS bool AS $$
    SELECT btrim($1) <> btrim($2);
$$ LANGUAGE SQL immutable;
 
CREATE OPERATOR <> (
    leftarg = trimmed_text,
    rightarg = trimmed_text,
    negator = =,
    PROCEDURE = trimmed_text_bne
);

All done.
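
To double check that nothing was missed in the copy/paste, the system catalog can be asked which operators involve the new domain. This is a sketch that assumes only the operators from this post were created for trimmed_text; the column aliases are mine.

```sql
-- List every operator that has trimmed_text on either side,
-- together with the function that implements it.
SELECT oprleft::regtype  AS left_type,
       oprname           AS op,
       oprright::regtype AS right_type,
       oprcode           AS func
FROM pg_catalog.pg_operator
WHERE 'trimmed_text'::regtype IN (oprleft, oprright)
ORDER BY oprname, oprleft, oprright;
```

Under that assumption it should list six rows: three for = and three for <>.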

Now, let's test if it really works. To do it, I will need a test table:

CREATE TABLE test (
    id serial PRIMARY KEY,
    val trimmed_text
);

With some test data:

INSERT INTO test (val) VALUES
    ('depesz'), (' depesz'), ('depesz '), (' depesz '), ('NOT depesz');

This is how it looks:

# SELECT id, '[' || val || ']' FROM test;
 id |   ?column?
----+--------------
  1 | [depesz]
  2 | [ depesz]
  3 | [depesz ]
  4 | [ depesz ]
  5 | [NOT depesz]
(5 rows)

I added [ and ] to show the spaces.

So, let's check if a simple select works:

# SELECT id, '[' || val || ']' FROM test WHERE val = 'depesz';
 id |  ?column?
----+------------
  1 | [depesz]
  2 | [ depesz]
  3 | [depesz ]
  4 | [ depesz ]
(4 rows)

YES! Works.
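
The negated operator can be checked the same way. A small sketch, reusing the test table and data from above:

```sql
-- The literal is treated as trimmed_text, so the custom
-- trimmed_text <> trimmed_text operator compares btrim()'ed values.
SELECT id, '[' || val || ']' FROM test WHERE val <> 'depesz';
```

With only the five sample rows loaded, this should return just row 5, the 'NOT depesz' one.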

But what about indexing? Will an index on the field work? Let's test.

First, I'll need more data. 100,000 new records should be enough:

INSERT INTO test (val) SELECT i::TEXT FROM generate_series(1,100000) i;

Now, let's create an index, and analyze the table:

# CREATE INDEX q ON test (val);
CREATE INDEX
 
# vacuum analyze test;
VACUUM

OK. So, let's check if the index will be used:

# EXPLAIN analyze SELECT * FROM test WHERE val = 'depesz';
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..2240.09 rows=500 width=9) (actual time=0.032..189.363 rows=4 loops=1)
   Filter: (btrim((val)::text) = btrim(('depesz'::text)::text))
 Total runtime: 189.437 ms
(3 rows)

Unfortunately, it doesn't use the index. But, as you can see in the Filter line, PostgreSQL is smart enough to see what we do to the field: the SQL function got inlined, so the planner works directly with the call to btrim().

Knowing this, perhaps another index can help us …

# DROP INDEX q;
DROP INDEX
 
# CREATE INDEX q ON test (btrim(val));
CREATE INDEX
 
# vacuum analyze test;
VACUUM

And, how about index usage now?

# EXPLAIN analyze SELECT * FROM test WHERE val = 'depesz';
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Index Scan using q on test  (cost=0.00..8.28 rows=1 width=9) (actual time=0.114..0.124 rows=4 loops=1)
   Index Cond: (btrim((val)::text) = btrim(('depesz'::text)::text))
 Total runtime: 0.223 ms
(3 rows)
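
Why does this work? The operator's function is a simple SQL function, so its body is inlined into the query, and what is left is an ordinary text comparison on btrim(val), which is exactly the expression the index was built on. As a sketch of that reasoning, here is the same condition written by hand, without relying on the custom operator:

```sql
-- Hand-written equivalent of what "val = 'depesz'" becomes after inlining.
-- btrim() returns text, so this uses the built-in text = text operator.
EXPLAIN SELECT * FROM test WHERE btrim(val) = btrim('depesz');
```

I would expect this to be able to use the same index, though the exact plan depends on the PostgreSQL version and table statistics.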

Great. To sum it all up:

query doesn't have to be modified
datatype conversion is mostly painless
indexing works

Did I mention that I love PostgreSQL?

Of course the same method can be applied to create case insensitive text fields, or even fields which do some more advanced things – like format normalization.
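
For illustration, a case insensitive variant could start like the sketch below. The names ci_text and ci_text_beq are hypothetical, picked only for this example, and just one of the six operators is shown; the rest would follow the same template as for trimmed_text.

```sql
CREATE DOMAIN ci_text AS TEXT;

-- Compare lower-cased values instead of btrim()'ed ones.
CREATE FUNCTION ci_text_beq(ci_text, ci_text) RETURNS bool AS $$
    SELECT lower($1) = lower($2);
$$ LANGUAGE SQL immutable;

CREATE OPERATOR = (
    leftarg = ci_text,
    rightarg = ci_text,
    PROCEDURE = ci_text_beq
);

-- Should return true.
SELECT 'DePeSz'::ci_text = 'depesz'::ci_text AS ci_equal;
```

By analogy with the btrim() case, the index to go with it would presumably be a functional one on lower(val).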

5 thoughts on “Text comparisons that do automatic trim()”

siooh says:

2008-10-17 at 00:48

Great post. I’ve never thought about OPERATORs in this way that they can solve such ‘problem’. PG rules ;). I love it.
depesz says:

2008-10-17 at 07:06

@siooh:
have you seen this? https://www.depesz.com/2007/11/05/encrypted-passwords-in-database/
Vincenzo Romano says:

2008-10-18 at 12:03

It’s another great post, Depesz!
As usual!
How does your last remark compare to the new CITEXT feature?
depesz says:

2008-10-18 at 13:39

@Vincenzo Romano:
citext will be faster, but creating your own datatype with domains and SQL functions doesn't require compilation of external modules.
Also, you can create your own datatype in 8.3 and previous versions, while citext will be available as contrib from 8.4.
Eugene says:

2008-10-21 at 19:21

I am seaching for some idea to write in my blog… somehow come to your blog. best of luck. Eugene
