This is my fourth grant update. A productive and noteworthy month. I gave two presentations at YAPC-EU in the Netherlands earlier this month, on High End Unicode and MoarVM Internals (links to my slides). It was my first Perl conference and I had a great time meeting people in the Perl 6 and Perl community!

Despite the conference, I made some big changes this month: big speedups for indexing operations, as well as an update to the Unicode 10 database. Before I could regenerate the database I had to fix ucd2c.pl, the script that generates the database, to be more deterministic and have unique property values per property (without this regenerating it with the Unicode 10 data would result in partially broken Unicode properties). I also implemented the Knuth-Morris-Pratt string search algorithm for our index function.

I have added documentation to MoarVM which is an overview of how our strings are represented as well as details about normalization, synthetics and other topics. On Hacker News someone noted that this was not documented anywhere, so I made sure to add documentation for this. If you are interested in some of the internals of MoarVM, I’m hoping the document should be pretty accessible even if you are not a MoarVM developer.

Index

Knuth-Morris-Pratt String Search Algorithm

Previously we did not have any natively implemented efficient string search algorithms, we only had memmem which was optimized, but would only work if both strings were flat and both the same bitwidth per grapheme.

Now all index operations with a 4096 or under length needle are optimized and no longer use brute force methods for searching (this does not include case insensitive or ignoremark indexing operations).

When my KMP implementation in MVM_string_index is used:
  • Strings with non-matching types

    • That have more than one codepoint in the needle

    • That don’t have a needle more than 4096 graphemes long

  • Speedup can be small, or large depending on the pattern of the needle

    • Repeating letters can cause multiple times speed increase depending on the haystack

We still use memmem when both strings are both flat strings and both the same data type (which use Knuth-Morris-Pratt or Booyer-Moore depending on platform). Most of the strings we will work with — especially once the strings have been modified — are strands.

Grapheme Caching

Since the Knuth-Morris-Pratt algorithm often will request the same grapheme again, but will never request an earlier point in the haystack, I was able to optimize the KMP string search function I added to cache the graphemes so we can use a grapheme iterator instead of using MVM_string_get_grapheme_at_nocheck which for strands, will have to find its place in the haystack from the beginning each time. What this grapheme caching does, is it caches the last returned grapheme, so if the same grapheme is requested again it continues to return that without requesting a new grapheme from the grapheme iterator. If it requests the next grapheme, or some number of graphemes after the current position, it will either grab the next grapheme or move the grapheme iterator forward (skip a certain number of graphemes) and then get a grapheme from the grapheme iterator.

Here are some timings I got before and after the grapheme iterator for an English language book text file, searching for English words misspelled by one letter (so it would search the entire file but be representative of searching for actual words).

Description

caching

master

index with needle (strand haystack)

0.832730

2.67385181

index with word needle (flat haystack)

0.8357464

1.01405131

As you can see from the table, we actually even got savings when the haystack was flat. This surprised me, since getting a grapheme with a flat haystack points to a function with one switch and then returning the integer of the blob array at the specified position. I am guessing this is likely caused by the caching function generating more efficient machine code, since the savings can’t be only explained by the savings from caching the grapheme — the speed up was seen even when I manipulated the needle so that there were no cache “hits” and it always requested different graphemes.

ℹ️
The grapheme caching has not yet been merged, but is ready to go after fixing some merge conflicts

Inlining MVM_string_get_grapheme_at_nocheck

After I finished writing the previous section, I was able to discover the reason for the speed difference with flat haystacks. By getting MVM_string_get_grapheme_at_nocheck to inline I was able to speed up index operations for a flat haystack by 2x. This is on top of the speedups of about we got from the KMP algorithm! This should affect any code which uses this function, making the function 2x as fast for flat strings, and likely a slight speedup for strands as well. This has huge implications as this function is used extensively throughout our string ops. This may change what I do with the grapheme caching code. It is likely I will change it so that it uses the grapheme caching for strands, and uses MVM_string_get_grapheme_at_nocheck for flat haystacks.

Single Grapheme Needle Index

I sped up string index by 9x when the haystack is a strand and the needle is 1 grapheme long. For this we use a grapheme iterator when looking for a single grapheme inside of a strand, and use a simpler faster loop since we are only looking for a single grapheme.

Unicode

Property Values

Property values are now unique for each property in the Unicode Database. Since property values are non-unique, we must store them uniquely. Previously this would cause only the property whose value was last in the C data array to work. Now that property values are unique for each property code, that should no longer cause breakage.

Unicode Database Updated to Unicode 10

The Unicode database has been updated for Unicode 10. Now that the previous breaking point — the database not always generating properly each time — was solved, I was able to make changes to the Unicode database, including updating it to version 10. The main significant changes in this release were the addition of new scripts, symbols and Emoji. You can see the full list of changes here.

Unicode Names

Hangul (Korean) characters now have names in our name database. These are generated algorithmically on database creation by decomposing the characters and concatening the 2 or 3 resulting codepoint’s Jamo names. This is needed since the Unicode data file does not include the name for these codepoints and leaves it to the implementer to create them. A roast test has been added to check for support of these Korean characters.

Fixed a bug in ignoremark/ordbaseat

Ignoremark relys on the MoarVM ordbaseat op under the hood. For graphemes made up of a single character, we decompose the character and get the resulting base charactewhich it relays on assumed once we had gotten a synthetic’s base character that we already had the final base character. For non-synthetics we would decompose This wasn’t true for example with: "\c[LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW]" The "j with caron" would not be decomposed because it was the base character of a synthetic.

This also fixes a bug in indexim and eqatim ops which caused it to fail when encountering a synthetic.

We now have ord_getbasechar handle synthetic and non-synthetic’s and have MVM_string_ord_basechar_at just handle the needed string based checks, then grabbing the grapheme from the string and passing its value on to ord_getbasechar.

Synthetic Grapheme Rework

I reworked MVMNFGSynthetic, the C struct which represents our synthetic graphemes. This is done to (eventually) get Prepend support working for things like ignoremark regex and the ordbaseat op which gets the base codepoint of a grapheme.

Unlike all other marks which come after a base character, Prepend characters come before the base character. All of our current code assumed the first codepoint of the synthetic is the base character.

For now, we also assume the first codepoint of the synthetic is the base character, but we now have a base_index which will be able to hold the index of the base character in the codepoint array.

While this doesn’t add Prepend support to everything, this is one step toward getting that working, and decoupling base codepoints from being the first codepoint of a synthetic.

ℹ️
This is not yet merged, but ready to go

Collation

There’s not as much to say about this, since it is almost ready. As I said last month it is fully functional, and I have since done some work cleaning it up. Left to be done is integrating the data generation into the other Unicode database generation script. Although the file and code it uses to generate is different, ideally we will have only one script to run to generate and update all of our files on new Unicode releases.

Previously I created a script which downloads all of the Unicode files needed in the generation, so an update should hopefully only require a run of the download script to fetch the UCD, the UCA (Unicode Collation Algorithm), and Emoji data.

One thing I had not talked about previously was some of the ways I have sped up the UCA implementation to be quite fast and efficient, only having to evaluate the collation arrays efficiently even for very large strings. If you are interested, the details are in my slides.

NQP Documentation

In addition to the MoarVM string documentation I mentioned in the introduction, I also ensured all of the ops I have added over my project are documented in NQP.

Ensured all the NQP index, eqat variations (indexic, indexim, indexicim, eqatic, eqaticim) are documented, and added the couple that had not yet been added to the ops list.

Added a Perl 6 program which gets a list of ops which are not mentioned in the oplist which will hopefully be useful to other developers.

The Goal is in Sight

The grant is finally winding down. I have all significant things implemented although not everything has been merged yet. I also have implemented additional things that were not part of the grant (Knuth-Morris-Pratt string search algorithm).

Left to be Done

To finally complete my work and fullfill my objectives I will make any additional documentation or tests that need to be made. If other Perl 6 devs or users of the community want to make any requests for Unicode or string related documentation, you can send me an email or send me a message on freenode IRC (nick samcv).

Other than this, I only need to clean up and merge the collation arrays branch, merge the synthetic grapheme reword, and update the grapheme caching for KMP branch to use caching for strand haystacks and use the now inlined MVM_string_get_grapheme_at_nocheck for flat haystacks.