Encodings. They may seem to some as horribly boring and bland. Nobody really wants to have to worry about the details. Encodings are how we communicate, transmit data. Their purpose is to be understood by others. When they work, nobody should even know they are there. When they don’t — and everyone can probably remember a time when they have tried to open a poorly encoded or corrupted file and been disappointed — they cannot be ignored.
Here I talk about the work I have done in the past and work still in progress to improve the support of current encodings, add new encodings, and add new features and new options to allow Perl 6 and MoarVM to support more encodings and in a way which better achieves the goals encodings were designed to solve.
Which Encodings Should We Add?
I started looking at the most common encodings on the Web. We supported the two most common, UTF-8 and ISO-8859-1 but we did not support windows-1251 (Cyrillic) or ShiftJIS. This seemed like the makings of a new project so I got to work.
I decided to start with windows-1251 since it was an 8 bit encoding while Shift-JIS was a variable length one or two byte encoding (the shift in the name comes from the second byte 'shift'). While the encoding itself was simpler than Shift-JIS, I was soon realizing some issues with both windows-1251 and our already supported windows-1252 encoding.
One of the first major issues I found in windows-1252, you could create files which could not be decoded by other programs. An example of this is codepoint 0x81 (129) which does not have a mapping in the windows-1252 specification. This would cause many programs to not detect the file as windows-1252, or to fail saying the file was corrupted and could not be decoded.
While our non-8bit encodings would throw an error if asked to encode text which did not exist in that codemapping, our 8 bit encodings would happily pass through invalid codepoints, as long as they fit in 8 bits.
As I said at the start, encodings are a way for us to communicate, to exchange data and to be understood. This to me indicated a major failing. While the solution could be to never attempt to write codepoints that don’t exist in the target encoding, that was not an elegant solution and would just be a workaround.
To remedy this I had to add new ops, such as decodeconf to replace decode op, and encodeconf to replace encode (plus many others). This would allow us to specify a configuration for our encodings and allow Perl 6 to tell MoarVM if it wanted to encode strictly according to the standard or to be more lenient.
I added a new
:strict option to
encode and a few others to allow you
to specify if it should be encoding strict or not strict. In the needs of
backwards compatibility it still defaults to leniently encoding and decoding.
Strict is planned to become the default for Perl 6 specification 6.e.
Some of our other encodings had support for 'replacements'. If you tried to encode something that would not fit in the target encoding, this allowed you to supply a string that could be one or more characters which would but substituted instead of causing MoarVM to throw an exception.
Once I had the strictness nailed down I was able to add support for replacements so one could have the ability to write data in a strict mode while still ensuring all compatible characters would get written properly, and you did not have to choose between unstrict and writing incompatible files and strict and having the encoding or file write failing when you really needed it not to.
Encodings are not without difficulty. As with the previous encodings I talked about, Shift-JIS would not be without decisions that had to be made. With Shift-JIS, the question became, "which Shift-JIS?".
You see, there are dozens of different extensions to Shift-JIS, in addition to original Shift-JIS encoding. As a variable length encoding, most of the one byte codepoints had been taken while there were many hundreds of open and unallocated codepoints in the two byte range. This saw the wide proliferation of manufacturer specific extensions which other manufacturers may adopt while adding their own extensions to the format.
Which of the dozens of different Shift-JIS encodings should we support? I eventually decided on windows-932 because that is the standard that is used by browsers when they encounter Shift-JIS text and encompasses a great many symbols. It is the most widely used Shift-JIS format. This would allow us to support the majority of Shift-JIS encoded documents out there, and the one which web browsers use on the web. Most but not all the characters in it are compatible with the other extensions, so it seemed like it was the sensible choice.
The only notable exception to this is that windows-932 is not totally compatible with the original Shift-JIS encoding. While the original windows-932 encoding (and some extensions to it) map the ASCII’s backslash to the yen symbol ¥, windows-932 keeps it to the backslash. This was to retain better compatibility with ASCII and other formats.
While MoarVM had a UTF-16 encoder and decoder, it was not fully featured.
You could encode a string into a utf16 buffer:
"hello".encode('utf16') #> utf16:0x<68 65 6c 6c 6f>
You could decode a utf16 buffer into a string:
utf16.new(0x68, 0x65, 0x6C, 0x6C, 0x6F).decode('utf16') #> hello
That was about it. You couldn’t open a filehandle as UTF-16
and write to it. You couldn’t even do
as the write function did not know how to deal with writing a 16 bit buffer
(it expected an 8-bit buffer).
In addition there was no streaming UTF-16 decoder, so there was no way to read a
UTF-16 file. So I set out to work, first allowing us to write 16-bit buffers to
a file and then being able to
At this point I knew that I would have to confront the
BOM in the room, but decided
to first implement the streaming decoder and ensure all of our file handle operations
The BOM in the Room
You may be noticing a theme here. You do not just implement an encoding. It does not exist in a vacuum. While there may be standards, there may be multiple standards. And those standards may or may not be followed in practice or it may not be totally straightforward.
Endianess defines which order the bytes are in a number. Big endian machines will store a 16-bit number with the larger section first, similar to how we write numbers, while little endian machines will put the smallest section first. This only matters for encoding numbers that are more than one byte. UTF-16 can be either big endian or little endian. Meaning big and little endian files will contain the same bytes, but the order is different.
Since swapping the two bytes of a 16-bit integer causes a different integer instead of an invalid one, the creators of UTF-16 knew it was not possible to determine with certainty the endianess of a 16 bit number. And so the byte order mark was created, a codepoint that would be added at the start of a file to signify which endianess the file was. Since they didn’t want to break already existing software, they used a "Zero width no-break space" (ZWNBSP) and designated that a ZWNBSP with bytes reversed would be invalid in UTF-16, a "non-character".
If the BOM was not removed, it would not be visible since it is zero width. If your program opens a file and sees a byte order mark it knew it was in the correct endianess. If it’s the "non-character" then it knew it had to swap the bytes when it read the file.
The standard states that utf16 should always use a BOM and that when reading a file as utf16 the BOM should be used to detect the endianess. It states the BOM is not passed through as it is assumed to not be part of the actual data. If it doesn’t exist then the system is supposed to decode as the endianess that the context suggests. So you get a format where you may lose the first codepoint of your file if that codepoint happens to be a zero width non-break space.
For utf16be (big endian) and utf16le (little endian) the standard states the byte order mark should not exist, and that any zero width no-break space at the start of a file is just that, and should be passed through.
Standards and practice are not identical and in practice many programs will remove a leading zero width non-break space even if set explicitly to utf16be or utf16le. I was unsure which route I should take. I looked at how Perl 5 handled it and after thinking for a long time I decided that we would follow the spec. Even though other software will get rid of any detected byte order mark, I think it’s important that if a user specifies utf16be or utf16le, they will always get back the same codepoints that they write to a file.
I am currently working on adding support so that when you write a file in 'utf16' it will add a byte order mark, as utf16 is supposed to always have one. Until this is added you can write a file with Perl 6 on one computer using the 'utf16' encoding which will not decode correctly using the same 'utf16' setting if the two computers are of different endianess.
Since there was no functionality for writing utf-16 to a file previously, there should be no backwards compatibility issues to worry about there and we can set it to add a byte order mark if the utf16 encoder is used.