PostgreSQL + Perl + Unicode == confusion. Why?

Yesterday I had an interesting discussion on irc.

A guy wanted to know why his Perl script was causing problems when dealing with Pg
and unicode characters.

The discussion went sideways, I got (a bit) upset, and had to leave anyway, so
I didn't finish it. But it did bother me, because to me the cause of the problem
seems obvious, yet the person I talked with was adamant that I had the
whole thing wrong.

So, I figured I'll use my blog to elaborate a bit…

Let's start with the original problem (edited a bit to work on my machine,
remove identifiable names, and make it shorter):

#!/usr/bin/perl
 
use strict;
use warnings;
 
use DBI;
use Test::More;
 
my $dbh = DBI->connect(
    "dbi:Pg:dbname=depesz;host=localhost;port=5930",
    'depesz',
    undef,
    {
        pg_enable_utf8 => 1,
        pg_server_prepare => 0,
    }
);
 
my $sth = $dbh->prepare( "SELECT length('\x{e2}\x{98}\x{83}') AS snowman" );
$sth->execute;
is( $sth->fetchrow_hashref->{ 'snowman' }, 1 );
 
$sth = $dbh->prepare( "SELECT length('\x{e9}') AS snowman" );
$sth->execute;
is( $sth->fetchrow_hashref->{ 'snowman' }, 1 );
 
done_testing;

What happens when you run it?

ok 1
DBD::Pg::st execute failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe9 0x27 0x29 at z.pl line 23.
DBD::Pg::st fetchrow_hashref failed: no statement executing at z.pl line 24.
Can't use an undefined value as a HASH reference at z.pl line 24.
# Tests were run but no plan was declared and done_testing() was not seen.

The first test succeeds (i.e. the length of the string showed as 1), but the second fails with an invalid byte sequence error.

Let's see how it was logged in Pg logs:

2014-10-19 14:32:49.380 CEST depesz@depesz 14032 ::1(60683) LOG:  connection authorized: user=depesz database=depesz
2014-10-19 14:32:49.381 CEST depesz@depesz 14032 ::1(60683) LOG:  duration: 0.270 ms  statement: SELECT length('☃') AS snowman
2014-10-19 14:32:49.382 CEST depesz@depesz 14032 ::1(60683) ERROR:  invalid byte sequence for encoding "UTF8": 0xe9 0x27 0x29
2014-10-19 14:32:49.382 CEST depesz@depesz 14032 ::1(60683) LOG:  disconnection: session time: 0:00:00.002 user=depesz database=depesz host=::1 port=60683

So, what has happened?

Well, there are actually many things wrong here. What doesn't work (the 2nd test) doesn't work because of a faulty assumption, and what works, works just "accidentally".

Based on the irc discussion, the author is/was under the impression that \x{NUM} generates the character with Unicode codepoint NUM. But this is clearly not true.

In perldoc perluniintro one can find:

Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{…}", and "chr(…)" for arguments less than 0x100 (decimal 256) generate an eight-bit character for backward compatibility with older Perls. For arguments of 0x100 or more, Unicode characters are always produced. If you want to force the production of Unicode characters regardless of the numeric value, use "pack("U", …)" instead of "\x..", "\x{…}", or "chr()".

This is interesting. "\x{2603}" will give you back the snowman character. But "\x{e9}" gives you basically a byte (I know it's not really that simple, but I'm simplifying a bit to skip over uninteresting things) with a value of 233 (hex("e9")).

This can be verified by checking variable internals:

=$ perl -le 'use Devel::Peek; Dump( "\x{e9}" )'
SV = PV(0x1c04cb0) at 0x1c2d098
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK)
  PV = 0x1c32000 "\351"\0
  CUR = 1
  LEN = 16
 
=$ perl -le 'use Devel::Peek; Dump( "\x{2603}" )'
SV = PV(0x1b94cb0) at 0x1bbd098
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x1bc2000 "\342\230\203"\0 [UTF8 "\x{2603}"]
  CUR = 3
  LEN = 16
 
=$ perl -le 'use Devel::Peek; Dump( pack("U", hex("e9") ) )'
SV = PV(0x23f5cb0) at 0x241e0c0
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x23fa270 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 16

Please note the PV values.

In the case of the literal \x{e9} we get "PV = 0x1c32000 "\351"\0". But when we provide the character properly, either as a "high" codepoint (snowman) or as a properly pack'ed single low-codepoint Unicode character, the PV is very different, and contains the UTF8 representation of the string.
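If you don't want to stare at Devel::Peek output, utf8::is_utf8() (core Perl, no module to load) exposes the same internal flag; a minimal check (the flag is an implementation detail, but it illustrates the difference here):

=$ perl -le 'print utf8::is_utf8("\x{e9}") ? "flag on" : "flag off"'
flag off
 
=$ perl -le 'print utf8::is_utf8("\x{2603}") ? "flag on" : "flag off"'
flag on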

So, what happened in the original problem was that the query:

$sth = $dbh->prepare( "SELECT length('\x{e9}') AS snowman" );

sent to the database a query with a single 0xe9 byte inside the apostrophes, and since Pg was expecting UTF8 it got disoriented, as that is not a valid UTF8 sequence (0xe9, apostrophe, closing paren).
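Those are exactly the three bytes from the error message: the \x{e9}, the closing apostrophe (0x27), and the closing paren (0x29). A quick check:

=$ perl -le 'print unpack "H*", "\x{e9}\x{27}\x{29}"'
e92729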

More interesting is the not-so-subtle bug in the first example:

my $sth = $dbh->prepare( "SELECT length('\x{e2}\x{98}\x{83}') AS snowman" );

It worked: Pg returned the value 1, i.e. it understood that it is a single character, and not 3. But let's see something weird:

=$ perl -le 'print length("\x{e2}\x{98}\x{83}")'
3

Perl doesn't seem to think that it's one character; instead, it thinks this string has 3 characters (or, really, bytes).

Why? And why did Pg get the length right?

Well, \x{..} with 2 hex digits creates bytes. So the string that was created contained 3 bytes, which just so happen to be the UTF-8 representation of the snowman. When the value was put into the query, Pg got it, and since it was expecting utf8, it understood it as a single utf8 character. But Perl doesn't know what encoding the string is in, so it treats it as bytes.
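If you're starting from raw UTF-8 bytes like these, the core Encode module can turn them into a real character string; a minimal sketch:

=$ perl -le 'use Encode qw( decode ); print length( decode( "UTF-8", "\x{e2}\x{98}\x{83}" ) )'
1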

Alternatively, to get the proper value directly in the source, we'd take the Unicode codepoint (0x2603) and use it:

=$ perl -le 'print length("\x{2603}")'
1

Of course, using \x{2603} will also work just fine with Pg:

=$ cat z.pl 
#!/usr/bin/perl
 
use strict;
use warnings;
 
use DBI;
use Test::More;
 
my $dbh = DBI->connect(
    "dbi:Pg:dbname=depesz;host=localhost;port=5930",
    'depesz',
    undef,
    {
        pg_enable_utf8 => 1,
        pg_server_prepare => 0,
    }
);
 
my $sth = $dbh->prepare( "SELECT length('\x{e2}\x{98}\x{83}') AS snowman" );
$sth->execute;
is( $sth->fetchrow_hashref->{ 'snowman' }, 1 );
 
$sth = $dbh->prepare( "SELECT length('\x{2603}') AS snowman" );
$sth->execute;
is( $sth->fetchrow_hashref->{ 'snowman' }, 1 );
 
# $sth = $dbh->prepare( "SELECT length('\x{e9}') AS snowman" );
# $sth->execute;
# is( $sth->fetchrow_hashref->{ 'snowman' }, 1 );
 
done_testing;
 
=$ perl z.pl 
ok 1
ok 2
1..2
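Side note: in real code it's usually better to skip interpolating strings into the SQL altogether and pass them as bind parameters; a sketch, assuming the same $dbh as above:

my $sth = $dbh->prepare( 'SELECT length(?) AS snowman' );
$sth->execute( "\x{2603}" );
is( $sth->fetchrow_hashref->{ 'snowman' }, 1 );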

Interestingly, using \x{2603} leads to a curious warning if you'd like to print debug messages to the terminal:

=$ perl -le 'print "\x{e2}\x{98}\x{83}"'
☃
 
=$ perl -le 'print "\x{2603}"'
Wide character in print at -e line 1.
☃

The first (bad) example works great. The correct one complains. Why?

Well, the first one works for the same reason the Pg example worked: the console is expecting utf8, so it gets 3 bytes and shows them as a single Unicode character. Perl doesn't care, because as far as it's concerned the string just has 3 bytes, so it doesn't have to (and even can't) do any kind of recoding.
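A quick check that those 3 bytes really are the UTF-8 encoding of the snowman (0xe2 0x98 0x83):

=$ perl -le 'print unpack "H*", "\x{e2}\x{98}\x{83}"'
e29883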

But when the string contains a proper Unicode character, as in the \x{2603} case, the situation is problematic. Perl doesn't know what encoding the terminal is using. It guesses utf8, but it's not sure. So it shows a warning. How to get rid of it? Make sure Perl knows your terminal is expecting UTF8. Like this:

=$ perl -le 'binmode STDOUT, ":utf8"; print "\x{2603}"'

Now Perl knows, for a fact, that it should encode the string to utf8 (and not, for example, to latin1 or GB18030), so when it encounters Unicode characters it knows what to do with them, without any ambiguity, and it works.
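The same effect can be applied script-wide with the open pragma instead of a per-handle binmode; a sketch:

=$ perl -le 'use open qw( :std :utf8 ); print "\x{2603}"'
☃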

Side note: if you think "well, I don't care about length, I just care that it shows correctly, so I can use \x{..}\x{..} notation and be happy with it", consider this example:

=$ perl -le '$x="\x{c4}\x{85}"; print $x; print uc($x)'
ą
ą
 
=$ perl -le 'binmode STDOUT, ":utf8"; $x="\x{105}"; print $x; print uc($x)'
ą
Ą

As you can see, uc/lc will not really work if your string is made of bytes that happen to be utf8. Perl has to understand that these are characters. Of course, for some characters (basically plain ASCII letters) it will work even with "bytes", but don't count on your strings always containing just those.
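And if you're stuck with utf8 bytes, decoding them first makes uc() behave; a minimal sketch:

=$ perl -le 'use Encode qw( decode ); binmode STDOUT, ":utf8"; my $x = decode( "UTF-8", "\x{c4}\x{85}" ); print $x; print uc($x)'
ą
Ą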

It's complicated. For a long time I thought I understood how encodings work. Then I found some corner cases, and lost my understanding. After some more time I regained faith that one day I would understand it. At the moment, I think I have some grasp of what it's all about. The single most helpful thing was reading this stackoverflow answer and doing some experiments that showed exactly the problem tchrist was describing.

Now, all my non-trivial perl scripts contain this boilerplate:

#!/usr/bin/env perl
use strict;
use warnings;
use warnings qw( FATAL utf8 );    # make encoding problems fatal, not silently warned
use utf8;                         # the source code of this script is itself UTF-8
use open qw( :std :utf8 );        # STDIN/STDOUT/STDERR (and new handles) use UTF-8
use Unicode::Normalize qw( NFC );
use Unicode::Collate;
use Encode qw( decode );
if ( grep /\P{ASCII}/ => @ARGV ) {
    # command line arguments arrive as bytes; decode them into characters
    @ARGV = map { decode( 'UTF-8', $_ ) } @ARGV;
}

In some cases it's overkill. But at least I can say that I'm generally well protected in the most common and semi-common cases.

6 thoughts on “PostgreSQL + Perl + Unicode == confusion. Why?”

  1. Three years ago I spent a few days on this issue, then I found a solution and learned that pg_enable_utf8 just doesn't work. But good news: a new version of DBI, where pg_enable_utf8 is deprecated, should handle UTF8 well (but I didn't test it).

  2. My solution for the older driver is based on the function utf8::upgrade, which enforces the utf8 bit on a string. Just take the utf8 string from the database and enforce the utf8 bit.

  3. Best remark:
    "For a long time I thought I understood how encodings work. Then I found some corner cases, and lost my understanding."
    That's how I feel all the time 🙂
