This is my second grant progress report for my Perl Foundation grant entitled "Improving the Robustness of Unicode Support in Rakudo on MoarVM".
This month I worked on collation, so I'm going to explain a bit of how the Unicode Collation Algorithm works.
Unicode Collation Algorithm
The collation data for the UCA is made up of arrays of the form [primary.secondary.tertiary].
Each element is an integer. The primary differs between different letters, 'a' vs 'z'. The secondary covers differences such as diacritics, 'a' vs 'á'. And the tertiary is case, 'a' vs 'A'. While it works differently for non-Latin characters, that is the gist of what they represent. In most cases one codepoint maps to one collation array, though in many cases this is not true.
Single codepoints can map to multiple collation array elements. Sequences of codepoints can also map to one or more than one collation array elements.
Some sequences also can exist inside others.
So the string xy may have one set of collation elements but xyz
another, where x, y and z are codepoints in a three-codepoint sequence with its
own set of multiple collation keys.
So, how do these collation elements translate into sorting the codepoints?
You take the two primary values, then append a
0 as a separator. Then you push the secondary values,
another 0 as a separator, and then push on the tertiary values:
0707, 06D9, 0, 020, 020, 0, 02, 02
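As a rough sketch of that key construction (in Python, with the collation values above written as hex; real values come from the Unicode collation data, not these literals):

```python
# Build a linear UCA-style sort key from per-codepoint collation arrays
# [primary, secondary, tertiary], with 0 as the separator between levels.
def sort_key(collation_arrays):
    key = []
    for level in range(3):  # 0 = primary, 1 = secondary, 2 = tertiary
        key.extend(arr[level] for arr in collation_arrays)
        if level < 2:
            key.append(0)   # 0 separates the levels
    return key

# Two codepoints with primaries 0x0707 and 0x06D9, as in the example above:
key = sort_key([[0x0707, 0x020, 0x02], [0x06D9, 0x020, 0x02]])
# key == [0x0707, 0x06D9, 0, 0x020, 0x020, 0, 0x02, 0x02]
```

Strings are then compared by comparing these keys element by element.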
Now this would pose a problem, since we would need to traverse the entire string before making any decisions. Instead, what I have decided to do is to keep the arrays as [primary.secondary.tertiary] and push them onto a stack rather than flattening them into a linear progression, iterate through the primaries, and then grab more collation elements as they are needed to resolve ties.
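A minimal sketch of that tie-breaking idea (simplified: here all collation arrays are available up front, whereas the MoarVM version pulls additional collation elements lazily as ties need resolving):

```python
# Compare two lists of [primary, secondary, tertiary] collation arrays
# level by level, touching lower levels only to break ties.
def collate_cmp(a, b):
    for level in range(3):
        for x, y in zip(a, b):
            if x[level] != y[level]:
                return -1 if x[level] < y[level] else 1
        if len(a) != len(b):        # shorter sequence sorts first at this level
            return -1 if len(a) < len(b) else 1
    return 0                         # equal at all three levels

# Made-up values: 'a' = [10, 2, 2], 'á' = [10, 3, 2], 'A' = [10, 2, 3]
assert collate_cmp([[10, 2, 2]], [[10, 3, 2]]) == -1   # 'a' < 'á' (secondary)
assert collate_cmp([[10, 2, 2]], [[10, 2, 3]]) == -1   # 'a' < 'A' (tertiary)
```

Most comparisons are decided at the primary level, so the lower levels rarely need to be consulted at all.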
Also, when collation data for the next codepoint is added to the stack, if it is the starter of a sequence we pull in the next codepoint as well, walking a linked list stored in C arrays as needed. If the next codepoint ends up not being part of a sequence, we just push the codepoint we "peeked" at onto the stack too, so we don't have to go back over codepoints.
This improved Unicode Collation Algorithm is not yet complete, but I am continuing to integrate the new C data structure I've created into MoarVM. It partially works, though not yet as well as the current implementation.
Improvements to the Current UCA Implementation
In the meantime I have made improvements to the current implementation of the
Unicode Collation Algorithm. Previously it was possible to enable or disable
the primary, secondary or tertiary levels. This allowed you to do things such
as ignore diacritics when sorting or ignore casing. What you are now able to
do is to reverse the sorting of individual levels. This allows you, for example, to
sort uppercase letters before lowercase (default UCA sorts lowercase before
uppercase, since lowercase < uppercase). It can also let you put characters
with diacritics before the ordinary letters. Any of the three levels
can be either enabled, disabled, or reversed. For anybody already using it,
supplying True or False to set
$*COLLATION still works the same as before,
but you are now able to supply
1, 0, or -1 to enable, disable, or reverse
the collation for specific levels.
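Here is a sketch of how the per-level settings affect ordering (a Python illustration with made-up collation values, not Rakudo's implementation; 1 = enabled, 0 = disabled, -1 = reversed, mirroring the $*COLLATION behaviour described above):

```python
# Build a sort key where each level can be enabled (1), disabled (0),
# or reversed (-1); the collation values are made up for illustration.
def leveled_key(arrays, levels=(1, 1, 1)):
    key = []
    for lvl, mode in enumerate(levels):
        if mode == 0:
            continue                      # this level is ignored entirely
        for arr in arrays:
            key.append(arr[lvl] if mode == 1 else -arr[lvl])  # -1 reverses
        key.append(0)                     # level separator
    return key

# 'a' = [5, 2, 2] and 'A' = [5, 2, 3]: only the tertiary (case) differs.
lower, upper = [[5, 2, 2]], [[5, 2, 3]]
assert leveled_key(lower) < leveled_key(upper)                           # default: 'a' first
assert leveled_key(lower, (1, 1, -1)) > leveled_key(upper, (1, 1, -1))   # reversed: 'A' first
assert leveled_key(lower, (1, 1, 0)) == leveled_key(upper, (1, 1, 0))    # case ignored
```

Disabling a level drops it from the key entirely, which is how ignoring case or diacritics works.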
Grapheme Cluster Break
As I said last week, I made improvements to the script that tests our breakup of
text into graphemes. Now we have full support for the
Prepend property that was added in Unicode 9.0,
as well as passing all the tests for regional indicators. The only tests we now
don't pass in GraphemeClusterBreakTest.t
are a few emoji tests, and I believe we only fail 3 or so of these! The
Prepend fixes needed us to save more state while parsing, as Prepend
is different from all other Unicode grapheme break logic in that it comes before,
not after, a base character.
The longstanding bug I mentioned in my previous status report has now been fixed. The bug was in regex when both ignorecase and ignoremark adverbs were used.
say "All hell is breaking loose" ~~ m:i:m/"All is fine, I am sure of it"/;
# OUTPUT«｢All hell is breaking loose｣␤»
This was the output before the fix; it should not have matched.
This bug occurred when the search reached the very end of the haystack and all of the haystack's remaining graphemes matched the needle.
If the needle extended past the end of the haystack, the code would erroneously report a match there, since it only checked that the needle matched up to the end of the haystack.
This would cause 'fgh' to be found in 'abcdefg'. The bug only occurred at the very end of the haystack.
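The bug and its fix can be sketched like this (a Python illustration; the function names are mine, not MoarVM's, and real graphemes need more than `lower()`):

```python
# Buggy version: an ignorecase "equal at" that only checks characters up to
# the end of the haystack falsely reports a match when the needle runs past it.
def equal_at_buggy(haystack, needle, pos):
    for i, ch in enumerate(needle):
        if pos + i >= len(haystack):
            return True                 # ran off the end: wrongly "matched"
        if haystack[pos + i].lower() != ch.lower():
            return False
    return True

# Fixed version: a needle that cannot fit in the remaining haystack never matches.
def equal_at_fixed(haystack, needle, pos):
    if pos + len(needle) > len(haystack):
        return False                    # needle extends past the haystack
    return all(haystack[pos + i].lower() == c.lower()
               for i, c in enumerate(needle))

assert equal_at_buggy("abcdefg", "fgh", 5) is True    # the old false positive
assert equal_at_fixed("abcdefg", "fgh", 5) is False   # correctly rejected
```

The essential change is checking the needle's full length against the haystack bounds before declaring a match.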
string_equal_at_ignore_case_INTERNAL_loop now returns
-1 if there was no match, and 0 or more if there was a match at that index.
This return value carries new information: it is 0 for a plain match, and some positive integer when the haystack expanded under casefolding.
As explained in my previous post, information about when characters expand under casefolding must be retained.
I had planned to expose this information in some way at a future date:
if we are searching for 'st' inside the string 'ﬆabc',
index with ignorecase will indicate that it is located at index 0, but in Perl 6
Rakudo the match will return 'ﬆa' when it should instead return 'ﬆ'.
For now this additional information is only internal, and the return values of the
nqp::equatic_s ops have not changed.
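To see the expansion concretely (a Python illustration of the problem; MoarVM tracks this differently in its own internals):

```python
# 'ﬆ' (U+FB06, LATIN SMALL LIGATURE ST) casefolds to 'st', so an ignorecase
# search for 'st' in 'ﬆabc' matches at index 0, yet the match covers only
# one character of the original string, not two.
ligature = "\uFB06"              # 'ﬆ'
haystack = ligature + "abc"
folded = haystack.casefold()     # 'stabc': the ligature expanded to two chars
assert folded.startswith("st")   # the ignorecase search finds 'st' at index 0
# Mapping the folded match length (2) naively back onto the original string
# grabs 'ﬆa'; the correct result is just the single ligature character.
assert haystack[:2] == ligature + "a"
assert haystack[:1] == ligature
```

Without tracking how much each character expanded, there is no way to map a match in the folded string back to the right span of the original.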
NQP Codepath Problems…
Previously there were far too many codepaths handling the different combinations of regex adverbs (no adverbs, ignorecase, ignoremark, and ignorecase+ignoremark); problematically, each combination had its own codepath. To really solve this bug and improve the code quality, I decided to clean it up and correct this.
In my past work I had already added a
nqp::indexic op, so it was time to add
another! I added a
nqp::indexicim op and a
nqp::eqaticim op and was able to
reuse most of the code and not increase our code burden much on the MoarVM side,
and greatly reduce the possibility for bugs to get in on varying combinations
of ignorecase/ignoremark ops.
This was a very longstanding Unicode bug (I don't think both adverbs together ever worked), so it's great that it is now fixed :).
I will be continuing to fix the issues in the new Unicode Collation Algorithm implementation as I described earlier in this post. I also plan on taking stock of all of the current Grapheme Cluster Break issues, which now exist only for certain emoji (though the vast majority of emoji work properly).
I will also be preparing my talks for the Amsterdam Perl conference as well!
I released a new module,
Font::QueryInfo, which allows you to query font information
using FreeType. It can even return the codepoints a font supports as a list.