This is my fourth grant update. A productive and noteworthy month. I gave two presentations at YAPC-EU in the Netherlands earlier this month, on High End Unicode and MoarVM Internals (links to my slides). It was my first Perl conference and I had a great time meeting people in the Perl 6 and Perl community!
Despite the conference, I made some big changes this month: big speedups for indexing operations, as well as an update to the Unicode 10 database. Before I could regenerate the database I had to fix ucd2c.pl, the script that generates it, to be deterministic and to keep property values unique per property (without this, regenerating it with the Unicode 10 data would result in partially broken Unicode properties). I also implemented the Knuth-Morris-Pratt string search algorithm.
I have added documentation to MoarVM giving an overview of how our strings are represented, as well as details about normalization, synthetics and other topics. Someone on Hacker News noted that none of this was documented anywhere, so I made sure to fix that. If you are interested in the internals of MoarVM, I hope the document is accessible even if you are not a MoarVM developer.
Knuth-Morris-Pratt String Search Algorithm
Previously we did not have any natively implemented efficient string search algorithm; we only had memmem, which was optimized but would only work if both strings were flat and had the same bitwidth per grapheme. Now, index operations with needles of 4096 graphemes or under are optimized and no longer use brute-force searching (this does not include case-insensitive or ignoremark indexing operations).
The new code path applies to:
- Strings with non-matching types
- Needles of more than one codepoint
- Needles no more than 4096 graphemes long
The speedup can be small or large depending on the pattern of the needle; repeating letters in the needle can cause a multiple-times speed increase, depending on the haystack.
We still use memmem when both strings are flat and of the same data type (memmem uses Knuth-Morris-Pratt or Boyer-Moore depending on the platform).
Most of the strings we will work with — especially once the strings have been
modified — are strands.
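As an aside, here is a minimal sketch of the KMP idea in plain C, operating on byte strings rather than MoarVM's grapheme-level strings (the function name and table handling are mine, not MoarVM's):

```c
#include <assert.h>
#include <string.h>

/* Sketch of Knuth-Morris-Pratt search over byte strings (the real MoarVM
 * version works on graphemes via a grapheme iterator). Returns the index
 * of the first occurrence of needle in haystack, or -1 if absent. */
static long kmp_search(const char *haystack, const char *needle) {
    long n = (long)strlen(haystack), m = (long)strlen(needle);
    if (m == 0)    return 0;
    if (m > 4096)  return -1;   /* mirror the 4096-grapheme needle limit */

    /* Failure table: length of the longest proper prefix of the needle
     * that is also a suffix ending at each position. */
    long table[4096];
    table[0] = 0;
    for (long i = 1, len = 0; i < m; ) {
        if (needle[i] == needle[len])
            table[i++] = ++len;
        else if (len)
            len = table[len - 1];
        else
            table[i++] = 0;
    }

    /* Scan: on mismatch, fall back in the needle via the table instead of
     * re-examining earlier haystack positions (no brute-force restart). */
    for (long i = 0, j = 0; i < n; ) {
        if (haystack[i] == needle[j]) {
            i++; j++;
            if (j == m)
                return i - m;      /* full match found */
        } else if (j) {
            j = table[j - 1];      /* reuse the partial match */
        } else {
            i++;
        }
    }
    return -1;
}
```

The key property for what follows is that the haystack position never moves backwards, only the needle position does.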
Since the Knuth-Morris-Pratt algorithm often requests the same grapheme again, but never requests an earlier point in the haystack, I was able to optimize the KMP string search function I added to cache graphemes, so we can use a grapheme iterator instead of MVM_string_get_grapheme_at_nocheck, which, for strands, has to find its place in the haystack from the beginning each time. What this grapheme caching does is cache the last returned grapheme: if the same grapheme is requested again, it returns the cached value without requesting a new grapheme from the grapheme iterator. If the next grapheme is requested, or one some number of graphemes after the current position, it either grabs the next grapheme or moves the grapheme iterator forward (skipping a certain number of graphemes) and then gets a grapheme from the iterator.
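A minimal sketch of this caching idea in C (the structs and names here are hypothetical stand-ins, not MoarVM's real grapheme iterator API):

```c
#include <assert.h>

/* Illustrative sketch of grapheme caching over a forward-only iterator.
 * A strand-backed iterator is expensive to rewind, so we remember the last
 * position and grapheme returned, and only advance the iterator when a
 * later position is requested. All names here are hypothetical. */
typedef struct {
    const int *graphemes;   /* stand-in for a strand-backed string */
    long       pos;         /* next grapheme the iterator will return */
} GraphemeIter;

typedef struct {
    GraphemeIter it;
    long         cached_pos;  /* position of cached grapheme, -1 if none */
    int          cached;      /* the cached grapheme itself */
} CachedGraphemeIter;

static int iter_next(GraphemeIter *it) {
    return it->graphemes[it->pos++];
}

/* Requested index must never be earlier than the previous request,
 * which KMP guarantees for the haystack. */
static int cached_get(CachedGraphemeIter *c, long index) {
    if (index == c->cached_pos)
        return c->cached;       /* cache hit: same grapheme again */
    while (c->it.pos < index)
        c->it.pos++;            /* skip forward over unneeded graphemes */
    c->cached     = iter_next(&c->it);
    c->cached_pos = index;
    return c->cached;
}
```

Repeated requests for the same position are the common case in KMP, since mismatches re-examine the current haystack grapheme against an earlier needle position.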
Here are some timings I got before and after the grapheme iterator for an English language book text file, searching for English words misspelled by one letter (so it would search the entire file but be representative of searching for actual words).
[Table: timings before and after grapheme caching, for index with needle (strand haystack) and index with word needle (flat haystack)]
As you can see from the table, we even got savings when the haystack was flat. This surprised me, since getting a grapheme from a flat haystack is just a function with one switch that returns the integer of the blob array at the specified position. I am guessing this is caused by the caching function generating more efficient machine code, since the savings can't be explained by the grapheme caching alone: the speedup was seen even when I manipulated the needle so that there were no cache “hits” and it always requested different graphemes.
|The grapheme caching has not yet been merged, but is ready to go after fixing some merge conflicts|
After I finished writing the previous section, I was able to discover the reason for the speed difference with flat haystacks. By getting MVM_string_get_grapheme_at_nocheck to inline, I was able to speed up index operations for a flat haystack by 2x.
This is on top of the speedups we got from the KMP algorithm! This should affect any code which uses this function, making it 2x as fast for flat strings, with likely a slight speedup for strands as well. This has huge implications, as the function is used extensively throughout our string ops. It may change what I do with the grapheme caching code: it is likely I will change it to use grapheme caching for strands, and the inlined MVM_string_get_grapheme_at_nocheck for flat haystacks.
Single Grapheme Needle Index
I sped up string index by 9x when the haystack is a strand and the needle is one grapheme long. For this case we use a grapheme iterator when looking for a single grapheme inside a strand, with a simpler, faster loop, since we are only looking for a single grapheme.
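The fast path reduces to a plain forward scan; sketched here in C with an array standing in for iterating a strand's graphemes (names are mine):

```c
#include <assert.h>

/* Sketch of the single-grapheme fast path: walk the haystack once,
 * comparing each grapheme against the one-grapheme needle, with none
 * of the general search machinery (failure tables, needle state). */
static long index_single_grapheme(const int *haystack, long len, int needle) {
    for (long i = 0; i < len; i++)
        if (haystack[i] == needle)
            return i;       /* first match wins */
    return -1;              /* not found */
}
```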
Property values are now unique for each property in the Unicode database. Property value names are not unique across properties, so each must be stored uniquely per property. Previously, a clash meant only the property whose value came last in the C data array would work; now that property values are unique for each property code, that should no longer cause breakage.
Unicode Database Updated to Unicode 10
The Unicode database has been updated for Unicode 10. Now that the previous breaking point — the database not always generating properly each time — was solved, I was able to make changes to the Unicode database, including updating it to version 10. The main significant changes in this release were the addition of new scripts, symbols and Emoji. You can see the full list of changes here.
Hangul (Korean) characters now have names in our name database. These are generated algorithmically on database creation by decomposing the characters and concatenating the Jamo names of the 2 or 3 resulting codepoints. This is needed because the Unicode data file does not include names for these codepoints and leaves it to the implementer to create them. A roast test has been added to check for support of these Korean characters.
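The derivation the Unicode standard specifies can be sketched in C; the tables hold the standard Jamo short names, and the function name is mine:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Unicode's algorithmic Hangul syllable naming: decompose the syllable
 * into leading/vowel/trailing Jamo indices and concatenate the Jamo
 * short names. Tables are the standard short-name tables. */
static const char *JAMO_L[19] = { "G","GG","N","D","DD","R","M","B","BB",
    "S","SS","","J","JJ","C","K","T","P","H" };
static const char *JAMO_V[21] = { "A","AE","YA","YAE","EO","E","YEO","YE",
    "O","WA","WAE","OE","YO","U","WEO","WE","WI","YU","EU","YI","I" };
static const char *JAMO_T[28] = { "","G","GG","GS","N","NJ","NH","D","L",
    "LG","LM","LB","LS","LT","LP","LH","M","B","BS","S","SS","NG","J",
    "C","K","T","P","H" };

/* Writes e.g. "HANGUL SYLLABLE HAN" for U+D55C into buf
 * (cp must be a precomposed syllable, U+AC00..U+D7A3). */
static void hangul_name(int cp, char *buf) {
    int s = cp - 0xAC00;            /* syllable index */
    int l = s / (21 * 28);          /* leading consonant */
    int v = (s % (21 * 28)) / 28;   /* vowel */
    int t = s % 28;                 /* trailing consonant, 0 = none */
    sprintf(buf, "HANGUL SYLLABLE %s%s%s", JAMO_L[l], JAMO_V[v], JAMO_T[t]);
}
```

So U+AC00 decomposes to indices (0, 0, 0) and gets the name HANGUL SYLLABLE GA.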
Fixed a bug in ignoremark/ordbaseat
Ignoremark relies on the MoarVM ordbaseat op under the hood. For graphemes made up of a single character, we decompose the character fully and get the resulting base character. The bug was that the code assumed that once we had gotten a synthetic's base character, we already had the final base character. This wasn't true, for example, for:

"\c[LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW]"

The "j with caron" would not be decomposed further because it was the base character of a synthetic.
This also fixes a bug in the indexim and eqatim ops which caused them to fail when encountering a synthetic.
We now have ord_getbasechar handle both synthetics and non-synthetics, and have MVM_string_ord_basechar_at just handle the needed string-based checks, then grab the grapheme from the string and pass its value on to ord_getbasechar.
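A toy sketch of the fixed behavior, with a hypothetical two-entry decomposition table standing in for the real Unicode database lookup:

```c
#include <assert.h>

/* Returns the first codepoint of cp's canonical decomposition, or -1 if
 * cp does not decompose. This tiny table is purely illustrative; the
 * real code consults the Unicode database. */
static int first_decomp(int cp) {
    switch (cp) {
        case 0x01F0: return 0x006A;  /* j with caron -> j + U+030C */
        case 0x00E9: return 0x0065;  /* e with acute -> e + U+0301 */
        default:     return -1;      /* no canonical decomposition */
    }
}

/* The fix: keep decomposing until the codepoint no longer decomposes,
 * rather than stopping after extracting a synthetic's base character. */
static int base_char(int cp) {
    int d;
    while ((d = first_decomp(cp)) != -1)
        cp = d;
    return cp;
}
```

In the example above, the synthetic's base "j with caron" now decomposes one step further, down to a plain "j".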
Synthetic Grapheme Rework
I reworked MVMNFGSynthetic, the C struct which represents our synthetic graphemes. This was done to (eventually) get Prepend support working for things like ignoremark regex and the ordbaseat op, which gets the base codepoint of a grapheme.
Unlike all other marks which come after a base character, Prepend characters come before the base character. All of our current code assumed the first codepoint of the synthetic is the base character.
For now, we also assume the first codepoint of the synthetic is the base character, but we now have a base_index which will be able to hold the index of the base character in the codepoint array.
While this doesn’t add Prepend support to everything, this is one step toward getting that working, and decoupling base codepoints from being the first codepoint of a synthetic.
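A sketch of the idea (field names other than base_index, and the layout itself, are hypothetical rather than the actual MVMNFGSynthetic definition):

```c
#include <assert.h>

/* Illustrative synthetic grapheme: instead of assuming the base codepoint
 * is codes[0] (which Prepend characters violate, since they come before
 * the base), we record where the base lives in the codepoint array. */
typedef struct {
    int  num_codes;    /* how many codepoints make up this synthetic */
    int *codes;        /* the codepoints, in order */
    int  base_index;   /* index of the base codepoint within codes */
} SyntheticGrapheme;

/* With base_index, base-codepoint lookups become an indexed read. */
static int synthetic_base(const SyntheticGrapheme *s) {
    return s->codes[s->base_index];
}
```

For now base_index is always 0, matching the old assumption; a Prepend-carrying synthetic would simply store a nonzero value there.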
|This is not yet merged, but ready to go|
There’s not as much to say about the collation arrays work, since it is almost ready. As I said last month, it is fully functional, and I have since done some work cleaning it up. Left to be done is integrating its data generation into the other Unicode database generation script. Although the files and code it uses to generate are different, ideally we will have only one script to run to generate and update all of our files on new Unicode releases.
Previously I created a script which downloads all of the Unicode files needed in the generation, so an update should hopefully only require a run of the download script to fetch the UCD, the UCA (Unicode Collation Algorithm), and Emoji data.
One thing I had not talked about previously is some of the ways I have sped up the UCA implementation to be quite fast and efficient, evaluating the collation arrays efficiently even for very large strings. If you are interested, the details are in my slides.
In addition to the MoarVM string documentation I mentioned in the introduction, I also ensured all of the ops I have added over my project are documented in NQP.
I ensured all the NQP index and eqat variations (indexic, indexim, indexicim, eqatic, eqaticim) are documented, and added the couple that had not yet been added to the ops list.
The Goal is in Sight
The grant is finally winding down. All the significant things are implemented, although not everything has been merged yet. I have also implemented additional things that were not part of the grant, such as the Knuth-Morris-Pratt string search algorithm.
Left to be Done
To finally complete my work and fulfill my objectives, I will write any additional documentation or tests that need to be made. If other Perl 6 devs or users in the community want to request any Unicode or string related documentation, you can send me an email or a message on freenode IRC (nick samcv).
Other than this, I only need to clean up and merge the collation arrays branch, merge the synthetic grapheme rework, and update the grapheme caching for the KMP branch so it uses caching for strand haystacks and the now-inlined MVM_string_get_grapheme_at_nocheck for flat haystacks.