Work Since the Last Report

Merged in Unicode Collation Algorithm

I merged the Unicode Collation Algorithm branch into MoarVM. Now that the sort is stable, the coll, unicmp and .collate operators in Rakudo are no longer experimental so use experimental :collation no longer is needed to use them.

The $*COLLATION dynamic variable is still hidden under experimental, since it is possible design changes could be made to them.

Prepend

In some of my other posts I talked about the difficulties of getting Prepend codepoints working properly. To do this I had changed how we store synthetics so as not to assume that the first codepoint of a synthetic is always the base character. This month I merged in change in synthetic representation and implemented the features which were now possible with the new representation.

The feature was to detect which character is the base character and store its index in the synthetic. Most combiners, such as diacritics come after the base character and are Extend codepoints: a + ◌́. Prepend has the reverse functionality and comes before: ؀◌ + 9 (Arabic number sign + number).

This required many assumptions our code rightfully made before Unicode 9.0 to be abandoned. When a synthetic is created, we now check to see if the first codepoint is a Prepend codepoint. If so, we keep checking until we find a codepoint that is not a Prepend. In most cases, the base character is the codepoint following the last Prepend mark.

In degenerate cases there is no base character, which could be a grapheme composed of all Prepend’s or only Prepend’s and Extend’s. In these degenerate cases we set the first codepoint as the base character.

Once I had that done, I was able to fix some of our ops which did not work correctly if there were Prepend characters. This included fixing our case changing op so it would now work on graphemes with Prepend marks. Since the case change ops apply the case change to the base codepoint, it is necessary for us to have the correct base character. Similarly, ordbaseat which gets the base codepoint also needed to be fixed. This allowed ignoremark to now work for graphemes with Prepend marks.

Documentation

I wrote documentation on our Unicode Collation Algorithm, which explains to the reader why the UCA solves, with examples of different single to multiple or multiple to single mappings of codepoints. It goes in a fair amount of detail on how it was implemented.

UTF8-C8

Bugs with Encoding into UTF8-C8

Since MoarVM normalizes all input text, our way of dealing with not normalizing, is important to people who want their strings to be unchanged unchanged. Previously there was a bug where if something was a valid UTF8 storable value, such as a Surrogate or a value higher than 0x10FFFF, it would create a Str type with that value, even though it was not valid Unicode. It would then throw when this value was attempted to be used (since the Str type shouldn’t hold values higher than 0x10FFFF or Surrogates). As this is the only way we have of dealing with text unaltered, this seemed like a serious issue that would prevent UTF8-C8 from being usable in a production environment attempting to encode arbitrary data into UTF8-C8. [f112fbcf]

Bugs While Working with UTF8-C8 Strings

Another issue I fixed was that under concatenation, text replacement or renormalization, the UTF8-C8 codepoints would be "flattened". They would lose their special properties and instead start acting like any other set of Unicode codepoints (although unusual since it contains a private use character and a hex code of the stored value). I changed our codepoint iterator so that optionally you can choose to pass back UTF8-C8 Synthetics unchanged. We use Synthetics to store both UTF8-C8 values as well as storing graphemes which contain multiple codepoints. When iterating by codepoint on an already existing arbitrary string, we want to retain the UTF8-C8 codepoints and make sure they are not changed during the renormalization process. This has been fixed, and UTF8-C8 strings are now drastically more reliable, and hopefully, much more production ready. [2f71945d]

Grapheme Caching and Move To

The function which moves a grapheme iterator forward a specified number of graphemes now works even if we aren’t starting from the very start of the string. In this function we have a first loop which locates the correct strand, and had a second loop which would find the correct grapheme inside that strand. I refactored the grapheme locating code and was able to remove the loop.

In the grapheme caching implementation we can save a lot of work by not creating a new iterator for every grapheme access. Not only that, I also sped it the move_to function about 30%. While the cached iterator reduces access to this function for the functions I added it to, there are still many which may seek for each grapheme requested, this will speed that up.

Other MoarVM Changes

I setup automated Appveyor builds for MoarVM so we get automated builds on Windows (Travis CI only builds MacOS and Linux builds).

I fixed a segfault that occurred when compiling nqp on Alpine Linux which uses musl as its libc. I ended up reducing the depth of recursion in the optimize_bb() function when compiling nqp from 99 to 29 (3x reduction). In the Rakudo build, we have a 5x reduction in the depth of recursion.

Final Report

As I’ve already said many things in my previous grant reports (1, 2, 3, 4) I will iterate on some of the big things I did do, which is not exhaustive. For full details of all the changes please see my other grant reports as well as a partial list of bonus changes I made in MoarVM during the grant at the bottom of the page.

The only thing not completed was implementing a whole new UCD backend. While I planned on doing this, I ended up choosing not to do so. I realized that it would not have been the best use of my time on the grant, as there were many more user noticeable changes I could do. Despite this, I did achieve the goals that the rewrite was intended to solve; namely making property values distinct for each property and making the UCD database generation more reproducible. While it is not exactly the same on different runs, the only thing that changes is the internal property codes which does not affect anything adversely. It is fully functional every time instead of database regeneration breaking our tests most of the time. Once the database was stable, I was then able to update our database to Unicode 10. Without my improvements regarding reproducibility and property values becoming distinct for each property, updating to Unicode 10 would have not been possible. In addition all Hangul (Korean) characters now have names in the Unicode database.

A big thing I wanted to implement was the Unicode Collation Algorithm, which ended up being a total success. I was able to still retain the ability to choose which collation levels the user wanted to sort with as well as reverse the sort order of individual collation levels.

Yet I did not only implement one algorithm, I also implemented the Knuth-Morris-Prat string search algorithm which can take advantage of repeated characters in the needle (can be multiple times faster if you have sections repeating). The Knuth-Morris-Pratt algorithm was adjusted to use either the new cached grapheme iterator or the simple lookup depending on if it was a flat or strand haystack as well. Indexing a strand based string with a one grapheme long needle was sped up by 9x by making a special case for this.

Practically all string ops were sped up, often by multiple times due to getting MVM_string_get_grapheme_at_nocheck inlined. In addition to this, I changed the way we access strings in many of our most used string ops, intelligently using either grapheme iterators, cached grapheme iterators or direct access depending on circumstances. With the MVM_string_get_grapheme_at_nocheck inlined, the time to accessing graphemes with this function was sped up between 1.5x for strands and up to 2x for flat strings. Ops we use a lot, like the op that backs eq and nqp::eqat was given special casing for Strand ←→ Strand, Flat ←→ Flat and Strand ←→ Flat (which also covers Flat ←→ Strand as well). This special casing spec up eq by 1.7x when one is a strand and one is flat, and 2.5x when both strings are flat. Applying similar optimizations to index made a 2x speedup when haystack is flat and 1.5x speedup when haystack is a strand (on top of the previous improvements due to the Knuth-Morris-Pratt algorithm)

I fixed a longstanding bug in NQP which caused 'ignoremark+ignorecase' operations to be totally broken. I fixed this by adding more MoarVM ops and refactoring the code to have many less branches. In MoarVM we now use a centralized function to do each variation of with/without ignorecase and ignore mark which is also fully compatible with foldcase operations as well as igoremark.

Doing the ignoremark/ignorecase indexing work sped them up by multiple times, but then in addition to that, it became 2x faster when the haystack was made up of 10 strands by implementing a cached grapheme iterator.

I implemented full Unicode 9.0 support not just in our grapheme segmentation, but also in our other ops, refactoring how we store synthetic codepoints to allow us to have the 'base' codepoint be a codepoint other than the 1st in the synthetic to support Prepend codepoints.

Our concatenation was improved so as to make full renormalization of both input strings no longer needed in almost all cases. The x repeat operator was fixed so it always creates normalized strings. Previously it could create unnormalized strings instead, causing issues when it was used.

I believe I have more than accomplished what I set out to do in this grant. I made tons of user facing changes: to speed, Unicode normalization support, full Unicode 9.0 support. I added awesome collation features and fixed all the major issues with decoding and working with UTF8-C8 representations. I have listed an incomplete list of bonus deliverables below the deliverables which were part of this project.

Deliverables

  • I documented MoarVM’s string representation, with lots of good information for future developers as well as interested users.

  • Hangul syllables now have Unicode names in our database, with a test added in roast.

  • I implemented the Unicode Collation Algorithm [866623d9]

    • Tests have been added in roast for the UCA and the unicmp op

    • I wrote documentation on our Unicode Collation Algorithm implementation

    • Regarding language specific sort. This would involve us using data from the Unicode CLDR. Once we have this data available from MoarVM, it simply requires a conditional to override DUCET and check a different set of data before checking the DUCET data. This information is in our documentation for collation.

  • Text normalization

    • Speed of normalization was improved

    • Full Unicode 9.0 support for text segmentation and normalization was added

  • While I did not fully rewrite the database, I did solve the needed issues:

    • Property values are now unique for each property

    • Running the generation script creates a functional database every time it is run, rather than only some of the time.

    • I added Unicode Collation data to our database, generated from a Perl 6 script, which happened to be the only property required to complete my deliverables

Bonus Deliverables

Here is a somewhat complete list of bonus deliverables:

  • Updated our database to Unicode 10. This was only possible once I had fixed the problems with the database generation, and made property values unque.

  • Implemented Knuth-Morris-Pratt string search

  • Set up Appveyor builds. Appveyor builds and tests MoarVM on Windows, similar to Travis CI.

  • Fixed ignoremark+ignorecase regex when used together as well as huge speed increases.

UTF8-C8/UTF-8

  • Fix UTF8-C8 encoding so it can encode values > 0x10FFFF as well as surrogates

  • Fix UTF8-C8 strings so they do not get corrupted and flattened when string operations are performed on them.

  • MVM_string_utf8_decodestream: free the buffer on malformed UTF-8 [a22f98db]

String Ops

  • Have MVM_string_codes iterate the string with codepoint iterator [ed84a632]

  • Make eqat 1.7x-2.5x faster [3e823659]

  • Speed up index 9x when Haystack is strand, needle is 1 long [0b364cb8]

  • Implement the Knuth-Morris-Pratt string search algorithm [6915d80e]

  • Add indexim_s op and improve/fix bugs in eqatim_s [127fa2ce]

  • Fix a bug in index/eqat(im) and in ord_getbasechar causing us to not decompose the base character when the grapheme was a synthetic [712cff33]

  • Fix MVM_string_compare to support deterministic comparing of synthetics [abc38137]

  • Added a codes op which gets the number of codepoints in a string rather than the number of graphemes. Rakudo is now multipe times faster doing the .codes op now. Before it would request an array of all the codepoints and then get number of elements, which was much slower.

Fix string ops with Prepend characters

  • Rework MVMNFGSynthetic to not store base separately [3bd371f1]

  • Fix case change when base cp isn’t the first cp in synthetic [49b90b99]

  • For degenerate Synth’s with Prepend and Extend set base cp to 1st cp [db3102c4]

  • Fix ignoremark with Prepend characters and ordbaseat op [f8a639e2]

Memory/GC/Build Fixes

  • Fix segfault when compiling nqp with musl as libc [5528083d]

  • Avoid recursion in optimize_bb() when only 1 child node [6d844e00]

  • Fix various memory access/garbage collection issues in some string ops that were showing up when running in Valgrind or using Address Sanitizer

Grapheme Iteration

  • Ensure we can move forward in a grapheme iterator even if we aren’t starting at the very beginning of a string.

  • Use grapheme iterator cached for ignorecase/ignoremark index ops [b2770e27]

  • Optimize MVM_string_gi_move_to. Optimize the loop which finds the correct location within a strand so that it isn’t a loop and is just conditionals.[c2fc14dd]

  • Use MVMGraphemeIter_cached for strands in KMP index [ce76c994]

  • Allow MVM_string_get_grapheme_at_nocheck to be inlined

  • Refactor code into iterate_gi_into_string() to reduce code duplication [1e92fc96]

Tests Added to Roast

  • Add tests for testing collation. Tests for the unicmp operator [5568a0d63]

  • Test that Hangul syllables return the correct Unicode name [6a4bc5450]

  • Add tests for case changes when we have Prepend codepoints [00554ccbd]

  • Add tests for x operator to ensure normalization retained [1e4fd2148] [59909ca9a]

  • Add a large number of string comparison tests [51c4ac08b]

    • Add tests to make sure synthetics compare properly with cmp [649d9dc50]

  • Improve ignoremark tests to cover many different cases [810e218c8]

    • Add ignoremark tests to cover JSON::Tiny regression + other issue [c185acc57]

  • Add generated tests (from UCD data) and manually created ones to ensure strings concatenation is stable, when the concatenated string would change the normalization. [2e1f7d92a][9f304b070] [64e385faf][0976d07b9][59909ca9a][2e1f7d92a] [88635564e] [a6bbc73cf] [a363e3ff1]

  • Add test for MoarVM Issue #566 .uniprop overflow [9227bc3d8]

  • Add tests to cover RT #128875 [0a80d0d2e]

  • Make new version of GraphemeBreakTest.t to better test grapheme segmentation [54c96043c]

NQP Work

Below is a listing of some of the commits I made to NQP. This included adding the ops I created over the course of the grant: eqatim, indexim, indexicim, eqaticim, and codes (gets the number of codepoints in a string rather than graphemes).

The testing in NQP was inadequate for our string ops, so I added hundreds of tests for practically all of the string ops, so we could properly test the different variations of index* and eqat*.

NQP Documentation

  • Add docs for a few variations of index/eqat [589a3dd5c] Bring the unicmp_s op docs up to date [091625799]

  • Document hasuniprop moar op [650840d74]

NQP Tests

  • Add more index* tests to test empty string paths [8742805cb]

  • run indexicim through all of the indexic tests [26adef6be]

  • Add tests for RT #128875, ignorecase+ignoremark false positives [b96a0afe7]

  • Add tests for nqp::eqatim op on MoarVM [6fac0036e]

Other Work

  • Added script to find undocumented NQP ops [0ead16794]

  • Added nqp::codes to QASTOperationsMAST.nqp [59421ffe1]

  • Update QASTRegexCompilerMAST to use new indexicim and eqaticim ops [18e40936a]

  • Added eqatim and indexim ops. Fix a bug when using ignoremark [9b94aae27]

  • Added nqp::hasuniprop op to QASTOperationsMAST.nqp [d03c4023a]