What I learned recently about Unicode and Ruby. There is a TL;DR.
At nine.ch, we have a web interface to administer our mailboxes. It can also be used to configure (Sieve) filters for incoming mail. These filters are persisted in a database and uploaded to the storage via the ManageSieve protocol.
When we wanted to migrate to a new version of the storage software, uploading these filters started to fail for some mailboxes.
The error was something like SieveError PUTSCRIPT: Too many arguments.
Yeah, what does this mean…? The RFC says that the syntax for PUTSCRIPT is like the following:
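As a rough illustration of that syntax: RFC 5804 describes PUTSCRIPT as taking two string arguments, the script name and the script content, and the content is usually sent as a literal whose announced size is counted in octets. The helper name putscript_command below is made up for this sketch; it is not taken from our code base or from any library.

```ruby
# Hypothetical helper, only meant to show how a PUTSCRIPT command is framed.
# Assumes the script name contains no quotes or backslashes.
def putscript_command(name, script)
  # {N+} announces a literal of N octets, so bytesize (not length) is used.
  "PUTSCRIPT \"#{name}\" {#{script.bytesize}+}\r\n#{script}\r\n"
end

puts putscript_command("vacation", "keep;")
```

The command line ends after the size marker, the script follows as raw bytes, and a final CRLF closes the command.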
A failing upload looked like this:
So this should work, no? Nope, doesn’t. Hmm, there are umlauts in this vacation message… Trial and error shows that removing them helps: the script is then accepted by the server.
Cool, looks like the server does something strange with the string containing the script and thinks you are giving it more arguments than allowed.
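One plausible explanation, and this is an assumption that was not verified against the server: the size announced for the script is counted in bytes, so if it were computed in characters instead, every umlaut would leave one unannounced byte behind, which the server could then read as an extra argument. The sample message below is invented for the sketch.

```ruby
# Invented sample text with a single umlaut (written as \u00fc).
message = "Bin im B\u00fcro nicht erreichbar"

characters = message.length    # counts characters
octets     = message.bytesize  # counts bytes in the UTF-8 encoding

puts "characters: #{characters}, bytes: #{octets}"
puts "bytes not covered by a character count: #{octets - characters}"
```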
Now, how shall I fix this? During my research I stumbled upon the extension list for Pigeonhole Sieve.
There is an extension called encoded-character, maybe this helps? Let’s try.
We “just” have to escape these special chars according to the RFC.
A first try with the following code gave me broken characters in the vacation answer (Ã¼ instead of ü, Ã¤ instead of ä, …), known from UTF-8/ISO problems.
Relevant part from the sieve script:
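A hypothetical reconstruction of such a byte-wise approach, not the original code: it assumes the ${unicode:...} form of the encoded-character extension was used and that only non-ASCII bytes were escaped. The method name escape_bytewise is invented for this sketch.

```ruby
# Hypothetical reconstruction: escapes the string one BYTE at a time.
# Every non-ASCII byte becomes its own ${unicode:..} sequence.
def escape_bytewise(text)
  text.each_byte.map do |byte|
    byte < 0x80 ? byte.chr : format("${unicode:%02x}", byte)
  end.join
end

puts escape_bytewise("f\u00fcr")  # "für" yields two escapes for the one ü
```

The ü ends up as two separate escapes (c3 and bc), and Sieve decodes each of them as a character of its own.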
I quickly googled a UTF-8 character table and checked the content of the script.
Looks like my simple ü and ä are two bytes each in UTF-8? And when we just translate one byte at a time, this gives us these (hated) character combinations, as each byte is interpreted as a single Unicode character?
Heard about this, but never really thought about it before.
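This is easy to check in irb. The sketch below looks at a single ü (written as \u00fc so the example does not depend on the file's encoding) and then deliberately treats each of its bytes as a code point of its own.

```ruby
char = "\u00fc"  # ü

p char.bytes       # the two UTF-8 bytes 0xC3 and 0xBC
p char.codepoints  # the single code point U+00FC

# Interpreting every byte as its own code point reproduces the broken output.
broken = char.bytes.pack("U*")
puts broken
```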
Okay, let's consult the Ruby String documentation and check whether there is a better method to get these characters: String#each_codepoint.
Now it works!
Final version of the method, using each_codepoint:
Relevant part from sieve script:
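A minimal sketch of what such a method and the resulting script fragment could look like. This is not the original code: the method name escape_codepoints, the sample message, and the choice to leave ASCII characters unescaped are all assumptions made for illustration.

```ruby
# Hypothetical sketch: escapes the string one CODE POINT at a time, so each
# non-ASCII character becomes exactly one ${unicode:....} sequence.
def escape_codepoints(text)
  text.each_codepoint.map do |cp|
    cp < 0x80 ? cp.chr : format("${unicode:%04x}", cp)
  end.join
end

# Invented vacation message, embedded in a small Sieve script fragment.
message = escape_codepoints("Bin nicht im B\u00fcro.")
script = <<~SIEVE
  require ["vacation", "encoded-character"];
  vacation "#{message}";
SIEVE

puts script
```

The script itself is now pure ASCII, and the server turns the escape back into ü when it evaluates the string.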
TL;DR
Use String#each_codepoint if you read a string and want to use the hex representation of its characters. Otherwise you create an encoding problem without changing the encoding.