This month I accomplished a lot. We now support full Unicode 9.0/Emoji 4.0 text segmentation (putting codepoints into graphemes) and have some really awesome concatenation improvements (both of these are in master), as well as a fully functional Unicode Collation Algorithm (not yet merged into master).

Unicode Collation Algorithm

In my collation-arrays branch I have now implemented a fully working Unicode Collation Algorithm. Only 82 of the 190,377 Unicode collation tests fail (99.956% passing).

What I did:

  • Collation keys that need to be generated on the fly are working

  • Characters that need to be decomposed for collation (mainly Hangul characters) are working

  • The script that generates the linked list (used for codepoints with more than one collation array, or for sequences of more than one codepoint with collation keys) was rewritten, and I have not discovered any issues with the data

I was also able to properly keep the ability to adjust each of the collation levels (reversing or disabling primary, secondary or tertiary levels).

Since primary > secondary > tertiary for all collation elements, we are able to have a single default array:

{-1, 0, 1} /* aka Less, Same, More */

This is the default array that is used in case we are comparing between different levels (primary vs secondary for instance).

If we are comparing between the same level, we use a 3x3 array which holds the return values based on whether the collation values are either Less, More or Same. This is the array that is customized to allow you to reverse or disable different collation levels.

 {-1, 0, 1}, {-1, 0, 1}, {-1, 0, 1}
/*  primary,  secondary,   tertiary */
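
As an illustration (this is just the idea, not MoarVM's exact data layout), reversing a level swaps the Less and More return values in that level's sub-array, and disabling a level zeroes it out so it never decides the comparison:

 {-1, 0, 1}, { 1, 0, -1}, { 0, 0, 0}
/*  primary,  secondary reversed,  tertiary disabled */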

Below, pos_a and pos_b track moving between keys on the collation stack while level_a and level_b track comparing the levels between each other. For more information about how this works, see my [previous grant status update](/perl6/grant-status-update-2).

if (level_a == level_b)
    effective_level_eval = level_eval_settings.a[level_a];
else
    effective_level_eval = level_eval_default;

rtrn =
stack_a.keys[pos_a].a[level_a] < stack_b.keys[pos_b].a[level_b] ?  effective_level_eval.s2.Less :
stack_a.keys[pos_a].a[level_a] > stack_b.keys[pos_b].a[level_b] ?  effective_level_eval.s2.More :
                                                                   effective_level_eval.s2.Same ;

Need to do:

  • Either use the MVM_COLLATION_QC property to determine whether to look up a character in the linked list, or instead store the index of the character's main node in the Unicode database

    • Currently it checks all of the main nodes before falling back to the values stored in the Unicode database or to generated values

  • Investigate country/language selective sorting

Other than needing to improve the speed of finding the main nodes, the implementation is feature complete.

The currently failing tests mostly relate to codepoints whose collation values differ between NFD and NFC form. This happens when a codepoint is followed by combining accent/extending characters that get reordered in NFD form. As the number of failing tests is quite low and the variation in sorting order it causes is very slight, I find it acceptable in its current form. I may still find a workaround, but with 99.956% of the tests passing and the incorrectness being only a slight variation in sorting order, it is not a pressing problem.

Text Segmentation/Normalization

Emoji Fixes

Improvements to segmentation of Emoji w/ GCB=Other

Not all emoji that can take modifiers have Grapheme_Cluster_Break = E_Base or E_Base_GAZ. In these cases we need to check the Emoji_Modifier_Base property as well. This fixed the segmentation of 25 more emoji than before.

Commit: 4ff2f1f9
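
A minimal sketch of the extra check (the helper names and GCB constants here are placeholders, not MoarVM's actual property API):

enum { GCB_OTHER, GCB_E_BASE, GCB_E_BASE_GAZ };   /* illustrative subset of GCB values */

extern int gcb_of(int cp);                    /* Grapheme_Cluster_Break lookup (placeholder) */
extern int has_emoji_modifier_base(int cp);   /* Emoji_Modifier_Base property (placeholder)  */

/* A codepoint can be followed by a skin-tone modifier if its GCB already
 * says so, or, failing that, if Emoji_Modifier_Base is set on it. */
static int can_take_emoji_modifier(int cp) {
    int gcb = gcb_of(cp);
    if (gcb == GCB_E_BASE || gcb == GCB_E_BASE_GAZ)
        return 1;
    return has_emoji_modifier_base(cp);
}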

Don’t break after ZWJ for Emoji=True + GCB=Other

With this change we are now counting 100% of the Emoji v4.0 emoji as a single grapheme. Since ASCII numbers and the pound symbol are also counted as Emoji, we disregard any codepoints in the ASCII range. I updated the MoarVM Readme to show we have full Unicode 9.0/Emoji 4.0 text segmentation support!
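
The rule boils down to something like the following sketch (again with placeholder property helpers rather than MoarVM's real ones):

#define CP_ZERO_WIDTH_JOINER 0x200D

enum { GCB_OTHER };               /* illustrative */
extern int gcb_of(int cp);        /* Grapheme_Cluster_Break lookup (placeholder) */
extern int is_emoji(int cp);      /* Emoji property (placeholder)                */

/* Return 1 if `before` and `after` must stay in the same grapheme: after a
 * ZWJ, keep an Emoji=True codepoint with GCB=Other attached, but skip the
 * ASCII range since digits and '#' also carry Emoji=True. */
static int no_break_after_zwj(int before, int after) {
    return before == CP_ZERO_WIDTH_JOINER
        && after  >  0x7F
        && gcb_of(after) == GCB_OTHER
        && is_emoji(after);
}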

Other Improvements

Avoid property lookups for Hangul and just use the Grapheme_Cluster_Break property.

Look up Canonical_Combining_Class as the raw integer property value instead of getting the string and then converting it to an integer. Even though the numbers we get back are not identical to the actual CCC, the Unicode algorithms only care about relative CCC, so this is sufficient. This results in lower CPU usage during normalization.
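
To see why only relative values matter, here is a sketch of the one question canonical reordering asks of CCC (ccc_raw and is_starter are placeholders for the internal lookups):

extern int ccc_raw(int cp);      /* raw internal property value, assumed to
                                    preserve the ordering of the real CCC  */
extern int is_starter(int cp);   /* placeholder for CCC == 0               */

/* Canonical reordering swaps two adjacent marks when the first has a higher
 * class than the second and the second is not a starter; only the relative
 * order of the two values matters, never their exact numbers. */
static int marks_should_swap(int cp_a, int cp_b) {
    return !is_starter(cp_b) && ccc_raw(cp_a) > ccc_raw(cp_b);
}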

Fix cmp/leg to support deterministic comparing of synthetics

Fixes the cmp/leg operators to have deterministic results when a string contains a synthetic. Previously the comparison was based on the value the synthetic was stored as, which depends on which synthetic happened to be allocated first. Now cmp/leg always give the same result for the same strings, no matter the order in which the synthetics were allocated, because the codepoints the synthetic is made up of are compared instead of the synthetic's internal value (which has no relation to the codepoints in it).

Previously we compared naïvely by grapheme, and ended up comparing synthetic codepoints with non-synthetics. This would cause synthetics to be sorted incorrectly, in addition to it making comparing things non-deterministic; if the synthetics were added in a different order, you would get a different result with MVM_string_compare.

Now we compare down the string the same way we did before, but if the first non-matching grapheme is a synthetic, we iterate over the codepoints in the synthetic and return the result based on those codepoints. If the two non-matching graphemes start with the same codepoints but differ in length, the grapheme with fewer codepoints is considered less than the other.
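
A sketch of that tie-break for a mismatching grapheme pair, assuming a hypothetical codepoints_of() helper that expands a grapheme (synthetic or not) into the codepoints it contains:

typedef struct {
    const int *cps;   /* codepoints making up the grapheme */
    int        n;     /* how many there are                */
} ExpandedGrapheme;

extern ExpandedGrapheme codepoints_of(int grapheme);   /* placeholder */

static int compare_graphemes(int g_a, int g_b) {
    ExpandedGrapheme a = codepoints_of(g_a);
    ExpandedGrapheme b = codepoints_of(g_b);
    int i, shorter = a.n < b.n ? a.n : b.n;

    /* Compare codepoint by codepoint so the result never depends on the
     * order the synthetics happened to be allocated in. */
    for (i = 0; i < shorter; i++) {
        if (a.cps[i] < b.cps[i]) return -1;
        if (a.cps[i] > b.cps[i]) return  1;
    }
    /* Same prefix: the grapheme with fewer codepoints sorts first. */
    return a.n < b.n ? -1 : a.n > b.n ? 1 : 0;
}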

Concatenation Improvements

When we concatenate strings normally, we create a new string which contains references to the string or strands of string a and the string or strands of string b.

Previously if there were any of the following conditions for the last grapheme of string a (last_a) or the first grapheme of string b (first_b), we would renormalize string a and string b in their entirety:

  • last_a or first_b were synthetics (contained multiple codepoints in one grapheme) other than the CRLF synthetic

  • Didn’t pass NFG Quickcheck

  • Had a non-zero Canonical_Combining_Class

The first improvement I made was to use the same code that we use during normalization (should_break) to find out whether two adjacent codepoints should be broken up or whether they are part of the same grapheme (user-visible character unit). This had a very big effect, since previously we were renormalizing even when renormalization would not have caused any change in the final output. This change reduced how often we had to do full renormalization by about 80% in my tests of various languages' scripts.
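
Sketched, the fast path is just this (should_break's signature is simplified here; it is a stand-in for the normalizer's real check):

extern int should_break(int cp_a, int cp_b);   /* simplified placeholder for the
                                                  normalizer's grapheme-break check */

/* Only fall back to renormalizing when the boundary codepoints would
 * actually combine into a single grapheme. */
static int concat_needs_renormalization(int last_a, int first_b) {
    return !should_break(last_a, first_b);
}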

The second change I made was more technically challenging. While the first change reduces how often we would do full renormalization, the second change’s goal was to only renormalize a small section of string a and string b instead of string a and b in their entirety.

With the new code, we either copy the strands or set string a as the first strand of the new string (as we did previously). We then use the should_break function (the one we use for checking whether to break between two codepoints) on last_a and first_b. If we break between last_a and first_b, we continue as I explained at the very start of this section. If should_break returns that the two codepoints combine, we create a renormalized_section strand for those two graphemes. Since last_a is now inside the renormalized_section, we adjust the reference to string a (similar to how .substr creates a substring), or, if string a was made up of multiple strands, we adjust the last strand of string a.

We then insert the renormalized section. String b or its strands are then set after the renormalized section. The same way we adjusted a, we adjust b or its first strand to be a substring.

Since it’s possible for the adjusted strings or strands to be fully consumed (i.e. the substring is 0 in length), we "drop" the copied strand or string (for a or b) if the adjusted substring has no more graphemes in it. This currently can only happen if string a or the last strand of string a is a single grapheme or the first strand of string b or string b is a single grapheme.
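
A very rough sketch of this boundary handling, using a made-up Strand type (start/end indices into a grapheme buffer) and placeholder append helpers instead of MoarVM's real string internals:

typedef struct { int start, end; } Strand;   /* end is exclusive */

extern int  should_break(int last_a, int first_b);          /* simplified placeholder     */
extern void append_strand(Strand s);                        /* add a strand to the result */
extern void append_renormalized(int last_a, int first_b);   /* the renormalized_section   */

static void concat_boundary(Strand a_last, Strand b_first,
                            int last_a, int first_b) {
    if (should_break(last_a, first_b)) {
        /* Boundary graphemes stay separate: just reference both strands. */
        append_strand(a_last);
        append_strand(b_first);
        return;
    }
    /* The graphemes combine: renormalize only those two graphemes. */
    a_last.end    -= 1;   /* last_a now lives in the renormalized section */
    b_first.start += 1;   /* so does first_b                              */
    if (a_last.end > a_last.start)        /* drop the strand if fully consumed */
        append_strand(a_last);
    append_renormalized(last_a, first_b);
    if (b_first.end > b_first.start)      /* likewise for string b's side      */
        append_strand(b_first);
}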

Other minor fixes

Fix overflow in uniprop lookups

Makes excessively large uniprop queries not overflow into negative numbers, which would cause it to throw. Closes MoarVM issue #566.

Commit: a3e98692.

Made string_index 16% faster for incompatible string types

For incompatible string types we now use string_equal_at_ignore_case_INTERNAL_loop, which results in a 16% speed boost. This required only a minor change to the INTERNAL_loop function, and it now works for the plain case as well as with ignorecase/ignoremark functionality.

Commit: 161ec639.

Add nqp::codes op to MoarVM for 3.5x faster .codes on short strings

Added an nqp::codes op which gets the number of codepoints in a string. We already had nqp::chars to get the number of graphemes, but to get the number of codepoints we previously called .NFC.codes, which is 3.5x slower for short strings and 2x slower for longer strings.
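
Conceptually the op just has to total up the codepoints per grapheme, with no NFC round trip; the helpers below are placeholders, not MoarVM's string API:

extern int num_graphemes(const void *str);        /* placeholder                               */
extern int grapheme_at(const void *str, int i);   /* placeholder                               */
extern int codepoints_in_grapheme(int g);         /* 1 for plain codepoints, >1 for synthetics */

static int count_codes(const void *str) {
    int i, total = 0, n = num_graphemes(str);
    for (i = 0; i < n; i++)
        total += codepoints_in_grapheme(grapheme_at(str, i));
    return total;
}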

Tests

Added tests to make sure that concatenation is stable when concatenating strings which change under normalization. This was done to make sure there were no issues with the concatenation improvements.

Next Month

In the next month I need to get the Collation stuff merged and work on fixing some very longstanding bugs in our Unicode Database implementation. I may not be fully changing out our Unicode Database implementation as I had planned, but I will still at least fix the majority of the issues.

This includes ucd2c.pl not being deterministic: it generates different code every time it is run. I plan to make it generate the same code on every run. The script was originally written assuming that property names and property values are unique, and I need to make it so that all of the properties and values are functional whenever the code is regenerated. Currently it is random which property values are functional and which are not, depending on which one ends up last in the hash initializer array (property values that occur last overwrite identical names that appear earlier).