March 7th, 2010 by depesz | Tags: , , , , , | 19 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

One common problem that a lot of people seem to have is when they encounter error message like this:

# \i test.sql
psql:test.sql:1: ERROR: invalid byte sequence for encoding "UTF8": 0xb3

Why it happens? What can be done about it? Let's see.

First, we need some theory, so you will understand background.

As you perhaps know, computers generally use bytes, not characters. And each byte is simply integer value in range 0-255.

There are no letters. Just numbers. To be able to understand what letter is, we need some way to convert letters to number, and the other way around.

First (at least first important to us, now) such conversion has been made by ‘American Standards Association' and is generally known as ASCII.

This conversion (code) states that (for example) – letter ‘A' has code 65, letter ‘B' 66, and so on. Lower case letters have values being value of upper case letter + 32.

So, upper case ‘D' is 68, so ‘d' will be 32 + 68 = 100. Which we can easily check:

=$ perl -le 'print ord("d")'
100

There is one big problem with ASCII. It deals only with first 128 values (integer numbers from 0 to 127). Which was fine for all of you living in countries which do not have any accented letters. Or simply – do not have their own alphabet, but use Latin one (a-z).

This made extremely big problem in all of the countries that happen to have accented letters or different alphabets.

Why? Because (at some point in time) there were a lot of different standards of coding these letters to bytes.

In Poland, we had:

  • ISO-8859-2
  • Windows-1250
  • Mazovia
  • AmigaPL
  • TAG
  • CSK

And that's not all!

Fortunately, till recently only 2 of them really stayed alive ISO-8859-2 (also known as Latin2), and Windows-1250 (developed, and used basically “only" on Windows).

What does it mean in practice? Well. Let's take the word ‘dźwięcznością' as example.

If we'll encode the word according to ISO8859-2, we will get the data like this:

=$ echo -n 'dźwięcznością' | iconv -f utf-8 -t iso8859-2 | hexdump -C
00000000 64 bc 77 69 ea 63 7a 6e 6f b6 63 69 b1 |d.wi.czno.ci.|
0000000d

If you're not familiar with hexdump program – please notice the line which starts with '00000000′, and then there are 4 pairs of hex digits. Each pair is single byte.

I use iconv, as my terminal works in UTF, so I need to convert it to other encodings when I want to demonstrate something.

Now, the same word in Windows-1250 look like this:

=$ echo -n 'dźwięcznością' | iconv -f utf-8 -t windows-1250 | hexdump -C
00000000 64 9f 77 69 ea 63 7a 6e 6f 9c 63 69 b9 |d.wi.czno.ci.|
0000000d

Please notice that ‘ź', ‘ś' and ‘ą' have different codes in both encodings, while ‘ę' has the same value.

This leads to very important problem: Given bytes:

“64 9f 77 69 ea 63 7a 6e 6f 9c 63 69 b9″

computer cannot know what I mean. It can be either:

=$ perl -e 'print "\x64\x9f\x77\x69\xea\x63\x7a\x6e\x\x6f\x9c\x63\x69\xb9"' | iconv -f iso8859-2 -t utf8
dwięcznociš

or:

=$ perl -e 'print "\x64\x9f\x77\x69\xea\x63\x7a\x6e\x\x6f\x9c\x63\x69\xb9"' | iconv -f windows-1250 -t utf8
dźwięcznością

Or basically anything else – it's all just a matter of encoding.

And why would computer need to know which byte(s) is which letter?

Reason is pretty simple – if program doesn't know that given byte is (for example) letter ‘ą' – it can't make upper/lower case of it. Which means – it can't do anything like case insensitive comparison.

It also doesn't know whether it is letter, digit, or some other class of characters. Or perhaps it's not even a character, but broken part of multi-byte character?

Some databases, are very relaxed about it. They accept whatever you will pass to them, and then try to do something about it.

For example, let's assume we have database, which got configured that it's working as ISO8859-2. But we did put there Windows-1250 text. What will happen if we'll try to uppercase work ‘dźwięcznością'? We would want it to be ‘DŹWIĘCZNOŚCIĄ'. But we'll get:

DWIĘCZNOCIŠ.

Which is pretty bad.

Now, let's get back to PostgreSQL.

PostgreSQL required every database, to have specified encoding. This is to know which sequence of bytes, is what.

And PostgreSQL is actually very strict about it. If you say that your database is UTF8 (which you should, as it is the best way to handle encoding), and you will feed it something that is not UTF8 – it will complain.

Wait a minute. Couple of paragraphs above, I said that computer cannot know whether given stream of bytes is Windows-1250 or iso8859-2, or anything else. So, how can it know that it's not UTF8?

Well, it's pretty simple. UTF-8 has some internal structure (I will not go into details, if you need them – there are much better sources than my blog to get more info about it), which lets us know whether given stream of bytes is correct UTF-8 or not.

Correct doesn't mean that it makes any sense. For example the word ‘DWIĘCZNOCIŠ' it technically correct, but it doesn't make sense (at least in Polish).

So, what should you do, if you're trying to load some data to UTF-8 PostgreSQL database, and it complains? Well, tell PostgreSQL what encoding it is.

For example.

I have file, which looks like this (I will show it via hexdump, as it contains non-utf characters):

=$ cat test2.sql | hexdump -C
00000000 49 4e 53 45 52 54 20 49 4e 54 4f 20 74 65 73 74 |INSERT INTO test|
00000010 20 28 73 6f 6d 65 5f 74 65 78 74 29 20 56 41 4c | (some_text) VAL|
00000020 55 45 53 20 28 27 52 6f 7a 6d 61 77 69 61 b3 61 |UES ('Rozmawia.a|
00000030 20 67 ea b6 20 7a 20 70 72 6f 73 69 ea 63 69 65 | g.. z prosi.cie|
00000040 6d 27 29 3b 0a |m');.|

When I'll try to load this file to UTF-8 database, I will get:

# \i test2.sql
psql:test2.sql:1: ERROR: invalid byte sequence for encoding "UTF8": 0xb3

Now. I happen to know that this particular file is in ISO-8859-2. So what can/should I do? It's pretty simple:

# set client_encoding = 'iso8859-2';
SET
 
# \i test2.sql
INSERT 0 1

Great. It loaded. Now, let's see how it looks:

# select * from test;
some_text
-----------------------------
 
(1 row)

That's weird. Or is it? Well, the problem now is that my psql session has still “client_encoding" being iso8859-2, but my terminal works in utf8. And psql simply doesn't show the data (most likely to prevent problems).

So, let's set client_encoding back to real value:

# set client_encoding = 'utf-8';
SET
 
# select * from test;
some_text
-----------------------------
Rozmawiała gęś z prosięciem

This means that also conversions should work:

# select upper(some_text), lower(some_text) from test;
upper | lower
-----------------------------+-----------------------------
ROZMAWIAŁA GĘŚ Z PROSIĘCIEM | rozmawiała gęś z prosięciem
(1 row)

Sweet.

As you can see – all you need to do, is to let PostgreSQL know what encoding the data really are. And it (PostgreSQL) will convert it to server encoding transparently.

The problem might be when you don't know what encoding it is in.

In such case, you can use iconv and some common sense to find out.

For example – assuming that you're using utf-8 console, like you should :), you can think of several possible encodings for language you're trying to import. For Polish, I would most likely check iso8859-2, windows-1250, utf-8 and utf-16.

For all of them you can do something like this:

=$ for a in windows-1250 iso8859-2 utf8 utf16
-$ do
-$ echo $a
-$ cat test2 | iconv -f $a -t utf8
-$ done
windows-1250
INSERT INTO test (some_text) VALUES ('Rozmawiała gę¶ z prosięciem');
iso8859-2
INSERT INTO test (some_text) VALUES ('Rozmawiała gęś z prosięciem');
utf8
INSERT INTO test (some_text) VALUES ('Rozmawiaiconv: illegal input sequence at position 46
utf16
义䕓呒䤠呎⁏整瑳⠠潳敭瑟硥⥴嘠䱁䕕⁓✨潒浺睡慩憳朠뛪稠瀠潲楳揪敩❭㬩iconv: incomplete character or shift sequence at end of buffer

And from here, I can see that it's most likely iso8859-2.

Of course – all above is valid not only for Polish. Just recently I have been given task of loading the data to database, when the dump contained some Spanish words with accented letters, and I had exactly the same problem. With this is was even worse, as some specific byte was decoded to very similar (optically), but different letters in windows-1252, and iso-8859-1. Luckily we were able to find someone who could read Spanish, and he told us which letter is correct – so we were able to find out what is the correct encoding.

Now, for the last part – several times during this blogpost I mentioned that You should be using UTF8. There is a really good reason for it.

Encodings (like windows-1250 and iso8859-2) differ not only in what numbers are assigned to which characters, but also – which characters are at all possible to write in it.

For example:

Character: ± exists in windows-1250, and has code 177. But it doesn't exist at all in iso8859-2!

For virtually any pair of encodings that could be used for given language, there will be practically always some characters that exist only in one of them, and some others that exist only in the second encoding.

This means, for example – that if you create your database in ISO-8859-2 (LATIN2 in PostgreSQL), and somebody will load Windows-1250 text, which will contain the ± (plus/minus) character – data will not load (if he/she set correctly client_encoding), or will load, but the data will get damaged – i.e. PostgreSQL will think that you mean iso's character #177 (which is polish ą letter). So, if somebody meant to say “Value is 100 ± 3″, you will get “Value is 100 ą 3″.

The good thing about UTF-8, is that it's representation of Unicode (I will skip details, not important at the moment), which is superset of all other charactersets. That is: every character from every other characterset/encoding (there is difference between those terms, but it's not crucial at the moment), it can be represented using UTF-8.

And even if you application “speak" only using encoding “xxx" – make the database UTF8, and just make the application set correct client_encoding.

Uff. I'm not sure it the above information is understandable, or not. I hope that it will help whoever will stumble on the “Invalid byte sequence for encoding" error. In case you wouldn't understand something – please let me know in comments, so I can fix or add some more details.

  1. 19 comments

  2. Mar 7, 2010

    Some time ago I had a problem with the dump and restore. I don’t remember details. After dump almost all data was encoded propertly – UTF8, but the other not. After dump I got an UTF8 encoded file with same characters in Latin2 or Windows-1250. The restore procedure fails. I had to corerct the dump file manualy.

  3. Mar 8, 2010

    > The problem might be when you don’t know what encoding it is in.

    you can use enca (http://gitorious.org/enca) or chardet (http://chardet.feedparser.org/)

    on Debian: enca command/package or chardet command from python-chardet package

  4. # alec
    Mar 8, 2010

    Depesz, I like new look of your blog, but css for code listings is bad. Please, work on better contrast in them.

  5. Mar 8, 2010

    @ALEC:
    Any suggestions? My own aesthetic sense doesn’t work at all.

  6. # alec
    Mar 8, 2010

    If you like console style, we need at least darker background and brighter font, so:
    code { font-family: Courier New, Courier, monospace; background-color: #4b4a4a; }

  7. Mar 8, 2010

    @ALEC:
    lepiej?

  8. # alec
    Mar 8, 2010

    Tak, lepiej, ale Courier New jest lepiej widoczne w tym wypadku, dlatego odwróciłem kolejność, w font-family.

  9. Mar 8, 2010

    @ALEC:
    nie widzę różnicy, ale proszę Cię bardzo. Courier New pierwszy.

  10. # alec
    Mar 8, 2010

    Dzięki, dużo lepiej.

  11. # Nicklas Avén
    Jan 23, 2011

    Thanks a lot

    I think this is complicated things even if I have been struggling and solved it before. But this blog post cleared a lot of things so maybe I will not have to scratch my head in hours next time :-)
    The hard thing is to understand when you tell the system (OS or database) what you have, and when you tell what you want to get.

    Thanks
    Nicklas

  12. Jan 26, 2011

    great informations for me – many thanks

  13. # Jack Douglas
    Aug 19, 2011

    Thanks for the useful post. If I somehow have a table with a text column containing invalid utf8 byte sequences, how can I find which rows are the problem ones (my curiosity is inspired by this dba.se question: http://dba.stackexchange.com/questions/4777/how-to-solve-utf8-invalid-byte-sequence-copy-errors-on-a-restore-when-the-source/4790#4790)

  14. Aug 19, 2011

    @Jack Douglas:
    never seen *table* contain invaliud byte sequences, because pg validates input.

    You might have copy data that does contain bad bytes.

    In such case you can simply:

    perl -ne ‘print if /\xb3/’ input_file

    to see the bad line.

  15. # Sim
    Sep 5, 2011

    I’m using an 8.2 database with UTF-8 encoding and I did a dump and restore and it gave invalid encoding errors on 2 tables. I don’t know how the data got in, but both the dump and restore were both UTF-8 encoded. Could be a bug in that version. I tried restoring to both another 8.2 and a 9.0 and they both gave the invalid encoding error.

  16. Jan 26, 2012

    Hi,
    I am newbie in PostgreSQL, but from my personal experience from Oracle I know that people very often confuse “database encoding” and “client encoding” :-(. It seems, that it also happen in PostgreSQL world…

    Regards, Piotr

  17. # Piotr
    May 8, 2012

    Dzięki za przydatną pigułkę. Walczyłem z kodowaniem w Delphi+Zeos+Postgresql i bardzo mi pomogłeś w ogarnięciu i rozwiązaniu problemu.

  18. # Claudia
    Sep 11, 2013

    hi ,I have a problem I have a web page with encoding Utf8, this page use a database with the same encoding , but when I see this page , I don’t see the double quote (“) , but when I see my field in the database using command line, I see \u0093 instead of double quotes,I have tried to change with the command replace of postgres , but doesn’t work,
    somebody help me please

  19. Sep 11, 2013

    @Claudia:
    Sorry, but based on your description i have no idea what the problem is. At the very least I would need to know how you’re inserting to database, and how you’re selecting (using what language/tool, exact commands, environment variables).

    Also – please note – that my blog is not really a support channel for PostgreSQL. I do help occasionally, but generally you will get better, and faster, help from some of Pg mailing list – like pgsql-general.

  20. # Claudia
    Sep 11, 2013

    Thank you , finally I resolved my problem, my sintaxis of command replace of postgres wasn’t well, i had to scape with the letter E and the character \
    example:
    update Tabla set campo1 = replace(campo1,E’\u0093′,’”‘);

Leave a comment