One common problem that a lot of people seem to have is when they encounter error message like this:
# \i test.sql psql:test.sql:1: ERROR: invalid byte sequence for encoding "UTF8": 0xb3
Why it happens? What can be done about it? Let's see.
First, we need some theory, so you will understand background.
As you perhaps know, computers generally use bytes, not characters. And each byte is simply integer value in range 0-255.
There are no letters. Just numbers. To be able to understand what letter is, we need some way to convert letters to number, and the other way around.
First (at least first important to us, now) such conversion has been made by ‘American Standards Association' and is generally known as ASCII.
This conversion (code) states that (for example) – letter ‘A' has code 65, letter ‘B' 66, and so on. Lower case letters have values being value of upper case letter + 32.
So, upper case ‘D' is 68, so ‘d' will be 32 + 68 = 100. Which we can easily check:
=$ perl -le 'print ord("d")' 100
There is one big problem with ASCII. It deals only with first 128 values (integer numbers from 0 to 127). Which was fine for all of you living in countries which do not have any accented letters. Or simply – do not have their own alphabet, but use Latin one (a-z).
This made extremely big problem in all of the countries that happen to have accented letters or different alphabets.
Why? Because (at some point in time) there were a lot of different standards of coding these letters to bytes.
In Poland, we had:
And that's not all!
Fortunately, till recently only 2 of them really stayed alive ISO-8859-2 (also known as Latin2), and Windows-1250 (developed, and used basically “only" on Windows).
What does it mean in practice? Well. Let's take the word ‘dźwięcznością' as example.
If we'll encode the word according to ISO8859-2, we will get the data like this:
=$ echo -n 'dźwięcznością' | iconv -f utf-8 -t iso8859-2 | hexdump -C 00000000 64 bc 77 69 ea 63 7a 6e 6f b6 63 69 b1 |d.wi.czno.ci.| 0000000d
If you're not familiar with hexdump program – please notice the line which starts with ‘00000000', and then there are 4 pairs of hex digits. Each pair is single byte.
I use iconv, as my terminal works in UTF, so I need to convert it to other encodings when I want to demonstrate something.
Now, the same word in Windows-1250 look like this:
=$ echo -n 'dźwięcznością' | iconv -f utf-8 -t windows-1250 | hexdump -C 00000000 64 9f 77 69 ea 63 7a 6e 6f 9c 63 69 b9 |d.wi.czno.ci.| 0000000d
Please notice that ‘ź', ‘ś' and ‘ą' have different codes in both encodings, while ‘ę' has the same value.
This leads to very important problem: Given bytes:
“64 9f 77 69 ea 63 7a 6e 6f 9c 63 69 b9"
computer cannot know what I mean. It can be either:
=$ perl -e 'print "\x64\x9f\x77\x69\xea\x63\x7a\x6e\x\x6f\x9c\x63\x69\xb9"' | iconv -f iso8859-2 -t utf8 dwięcznociš
=$ perl -e 'print "\x64\x9f\x77\x69\xea\x63\x7a\x6e\x\x6f\x9c\x63\x69\xb9"' | iconv -f windows-1250 -t utf8 dźwięcznością
Or basically anything else – it's all just a matter of encoding.
And why would computer need to know which byte(s) is which letter?
Reason is pretty simple – if program doesn't know that given byte is (for example) letter ‘ą' – it can't make upper/lower case of it. Which means – it can't do anything like case insensitive comparison.
It also doesn't know whether it is letter, digit, or some other class of characters. Or perhaps it's not even a character, but broken part of multi-byte character?
Some databases, are very relaxed about it. They accept whatever you will pass to them, and then try to do something about it.
For example, let's assume we have database, which got configured that it's working as ISO8859-2. But we did put there Windows-1250 text. What will happen if we'll try to uppercase work ‘dźwięcznością'? We would want it to be ‘DŹWIĘCZNOŚCIĄ'. But we'll get:
Which is pretty bad.
Now, let's get back to PostgreSQL.
PostgreSQL required every database, to have specified encoding. This is to know which sequence of bytes, is what.
And PostgreSQL is actually very strict about it. If you say that your database is UTF8 (which you should, as it is the best way to handle encoding), and you will feed it something that is not UTF8 – it will complain.
Wait a minute. Couple of paragraphs above, I said that computer cannot know whether given stream of bytes is Windows-1250 or iso8859-2, or anything else. So, how can it know that it's not UTF8?
Well, it's pretty simple. UTF-8 has some internal structure (I will not go into details, if you need them – there are much better sources than my blog to get more info about it), which lets us know whether given stream of bytes is correct UTF-8 or not.
Correct doesn't mean that it makes any sense. For example the word ‘DWIĘCZNOCIŠ' it technically correct, but it doesn't make sense (at least in Polish).
So, what should you do, if you're trying to load some data to UTF-8 PostgreSQL database, and it complains? Well, tell PostgreSQL what encoding it is.
I have file, which looks like this (I will show it via hexdump, as it contains non-utf characters):
=$ cat test2.sql | hexdump -C 00000000 49 4e 53 45 52 54 20 49 4e 54 4f 20 74 65 73 74 |INSERT INTO test| 00000010 20 28 73 6f 6d 65 5f 74 65 78 74 29 20 56 41 4c | (some_text) VAL| 00000020 55 45 53 20 28 27 52 6f 7a 6d 61 77 69 61 b3 61 |UES ('Rozmawia.a| 00000030 20 67 ea b6 20 7a 20 70 72 6f 73 69 ea 63 69 65 | g.. z prosi.cie| 00000040 6d 27 29 3b 0a |m');.|
When I'll try to load this file to UTF-8 database, I will get:
# \i test2.sql psql:test2.sql:1: ERROR: invalid byte sequence for encoding "UTF8": 0xb3
Now. I happen to know that this particular file is in ISO-8859-2. So what can/should I do? It's pretty simple:
# set client_encoding = 'iso8859-2'; SET # \i test2.sql INSERT 0 1
Great. It loaded. Now, let's see how it looks:
# SELECT * FROM test; some_text ----------------------------- (1 ROW)
That's weird. Or is it? Well, the problem now is that my psql session has still “client_encoding" being iso8859-2, but my terminal works in utf8. And psql simply doesn't show the data (most likely to prevent problems).
So, let's set client_encoding back to real value:
# SET client_encoding = 'utf-8'; SET # SELECT * FROM test; some_text ----------------------------- Rozmawiała gęś z prosięciem
This means that also conversions should work:
# SELECT UPPER(some_text), LOWER(some_text) FROM test; UPPER | LOWER -----------------------------+----------------------------- ROZMAWIAŁA GĘŚ Z PROSIĘCIEM | rozmawiała gęś z prosięciem (1 ROW)
As you can see – all you need to do, is to let PostgreSQL know what encoding the data really are. And it (PostgreSQL) will convert it to server encoding transparently.
The problem might be when you don't know what encoding it is in.
In such case, you can use iconv and some common sense to find out.
For example – assuming that you're using utf-8 console, like you should :), you can think of several possible encodings for language you're trying to import. For Polish, I would most likely check iso8859-2, windows-1250, utf-8 and utf-16.
For all of them you can do something like this:
=$ FOR a IN windows-1250 iso8859-2 utf8 utf16 -$ do -$ echo $a -$ cat test2 | iconv -f $a -t utf8 -$ done windows-1250 INSERT INTO test (some_text) VALUES ('Rozmawiała gę¶ z prosięciem'); iso8859-2 INSERT INTO test (some_text) VALUES ('Rozmawiała gęś z prosięciem'); utf8 INSERT INTO test (some_text) VALUES ('Rozmawiaiconv: illegal input sequence at position 46 utf16 义䕓呒䤠呎⁏整瑳⠠潳敭瑟硥⥴嘠䱁䕕⁓✨潒浺睡慩憳朠뛪稠瀠潲楳揪敩❭㬩iconv: incomplete character or shift sequence at end of buffer
And from here, I can see that it's most likely iso8859-2.
Of course – all above is valid not only for Polish. Just recently I have been given task of loading the data to database, when the dump contained some Spanish words with accented letters, and I had exactly the same problem. With this is was even worse, as some specific byte was decoded to very similar (optically), but different letters in windows-1252, and iso-8859-1. Luckily we were able to find someone who could read Spanish, and he told us which letter is correct – so we were able to find out what is the correct encoding.
Now, for the last part – several times during this blogpost I mentioned that You should be using UTF8. There is a really good reason for it.
Encodings (like windows-1250 and iso8859-2) differ not only in what numbers are assigned to which characters, but also – which characters are at all possible to write in it.
Character: ± exists in windows-1250, and has code 177. But it doesn't exist at all in iso8859-2!
For virtually any pair of encodings that could be used for given language, there will be practically always some characters that exist only in one of them, and some others that exist only in the second encoding.
This means, for example – that if you create your database in ISO-8859-2 (LATIN2 in PostgreSQL), and somebody will load Windows-1250 text, which will contain the ± (plus/minus) character – data will not load (if he/she set correctly client_encoding), or will load, but the data will get damaged – i.e. PostgreSQL will think that you mean iso's character #177 (which is polish ą letter). So, if somebody meant to say “Value is 100 ± 3", you will get “Value is 100 ą 3".
The good thing about UTF-8, is that it's representation of Unicode (I will skip details, not important at the moment), which is superset of all other charactersets. That is: every character from every other characterset/encoding (there is difference between those terms, but it's not crucial at the moment), it can be represented using UTF-8.
And even if you application “speak" only using encoding “xxx" – make the database UTF8, and just make the application set correct client_encoding.
Uff. I'm not sure it the above information is understandable, or not. I hope that it will help whoever will stumble on the “Invalid byte sequence for encoding" error. In case you wouldn't understand something – please let me know in comments, so I can fix or add some more details.