This is my first grant progress report for my Perl Foundation grant entitled "Improving the Robustness of Unicode Support in Rakudo on MoarVM".

I was not able to work quite as many hours as I would have liked this month, but I still made quite a lot of progress.

Improvement for Tests

Merged In

In Roast there is a new version of GraphemeBreakTest.t.

The script tests the contents of each grapheme individually, using the GraphemeBreakTest.txt file from the Unicode 9.0 test suite.

Previously we only checked the total number of ‘.chars’ for the string as a whole. We want something more precise than that, since the test file specifies the location of each break between codepoints. The new code checks that the codepoints are put in the correct graphemes in the proper order. We also still check the string length.

This new test uses a grammar to parse the file and generally is much more robust than the previous script.

Running the parse class generates an array of arrays. The index of the outer array indicates the grapheme, while the inner arrays indicate which codepoints should be in that grapheme.

[[10084, 776], [9757]]

The array above indicates that the 1st grapheme is made up of codepoints 10084 and 776, while the 2nd grapheme is made up of codepoint 9757. This allows us to easily test the contents of each grapheme.

The array shown above corresponds to the following line from the Unicode data file:

÷ 2764 × 0308 ÷ 261D ÷

where ÷ means break and × means no-break
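As a rough illustration of the mapping from a test line to the array of arrays, here is a minimal Python sketch. It is not the Perl 6 grammar used in Roast; the function name parse_break_test_line is made up for this example, and it only handles the token layout shown above (÷ for a break, × for no break, hexadecimal codepoints in between, and an optional trailing # comment).

```python
def parse_break_test_line(line):
    """Turn one break-test line into a list of graphemes,
    each grapheme being a list of integer codepoints.
    Hypothetical helper, not the grammar used in Roast."""
    # Drop a trailing "# ..." comment if the line carries one.
    line = line.split("#", 1)[0].strip()
    graphemes = []
    current = []
    for token in line.split():
        if token == "÷":        # break: close the grapheme built so far
            if current:
                graphemes.append(current)
                current = []
        elif token == "×":      # no break: stay in the same grapheme
            continue
        else:                   # anything else is a hexadecimal codepoint
            current.append(int(token, 16))
    if current:
        graphemes.append(current)
    return graphemes


print(parse_break_test_line("÷ 2764 × 0308 ÷ 261D ÷"))
# prints [[10084, 776], [9757]]
```

Each inner list can then be compared against the codepoints of the corresponding grapheme in the string under test.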

Work in Progress

I have some tests that are not yet ready to be merged. Sections of this code are complete, however, and are being reused in the larger Unicode Database Retrofit.

I have written grammars and modules to process and provide data on the PropertyValueAliases and PropertyAliases. They will be used for testing that all of the canonical property names and all the property values themselves properly resolve to separate property codes, as well as that they are usable in regex.

Work on the Unicode Database Retrofit

As part of my grant work I am working on making Unicode property values distinct per property, and also on allowing all canonical Unicode property values to work. For background on this see my previous post about Unicode Property Names. The WIP generated code can be seen in this gist and was generated from UCD-gen.p6. The code takes a property name and a property value as command line arguments and resolves them to a property code and a property value code. Matching is case insensitive and ignores underscores, as the Unicode spec says is permissible. The data is also deduplicated, meaning we only store one hash per unique set of property values.
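The case-insensitive, underscore-ignoring lookup can be sketched in Python as follows. This is only an illustration of the idea, not the generated C code: the names loose_key, PROPERTY_ALIASES and resolve_property are hypothetical, the alias table is a two-entry excerpt, and the numeric codes are arbitrary.

```python
def loose_key(name):
    """Normalize a name so that case and underscores do not matter."""
    return name.replace("_", "").lower()


# Tiny excerpt of alias data for illustration; the codes assigned
# below are arbitrary and internal to this example.
PROPERTY_ALIASES = {
    "Script": ["sc", "Script"],
    "Grapheme_Cluster_Break": ["GCB", "Grapheme_Cluster_Break"],
}

# Map every normalized alias to the code of its property.
PROPERTY_CODES = {}
for code, aliases in enumerate(PROPERTY_ALIASES.values()):
    for alias in aliases:
        PROPERTY_CODES[loose_key(alias)] = code


def resolve_property(name):
    """Return the property code for a name, or None if it is unknown."""
    return PROPERTY_CODES.get(loose_key(name))


print(resolve_property("grapheme_cluster_break") == resolve_property("GCB"))
# prints True
```

Normalizing once when the table is built and once per lookup keeps the table small, since only one spelling of each alias has to be stored.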

For example, Script and Script_Extensions both have the same values, so we don't store these more than once; likewise for the Boolean property values. The C program resolves the property string to a unique property code, and from there is able to look up the property value code. Note: aside from the property values which specify the lack of a property, these codes are internal and have no relation to the Unicode spec. For example, Grapheme_Cluster_Break=Other is designated as property value 0.
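The deduplication described above can be sketched like this in Python. It assumes each property's values are held as a mapping from value name to value code; the function name dedupe_value_tables and the sample codes are invented for the example and do not come from the generated code.

```python
def dedupe_value_tables(tables):
    """tables maps property name -> {value name: value code}.
    Returns the list of distinct tables and, for each property,
    the index of the table it shares."""
    unique = []       # one stored table per distinct set of values
    seen = {}         # frozen table contents -> index into `unique`
    table_for = {}    # property name -> index into `unique`
    for prop, values in tables.items():
        key = frozenset(values.items())
        if key not in seen:
            seen[key] = len(unique)
            unique.append(values)
        table_for[prop] = seen[key]
    return unique, table_for


# Made-up sample data: two properties sharing script values and
# two Boolean properties sharing N/Y.
tables = {
    "Script": {"Latin": 1, "Greek": 2},
    "Script_Extensions": {"Latin": 1, "Greek": 2},
    "Alphabetic": {"N": 0, "Y": 1},
    "White_Space": {"N": 0, "Y": 1},
}
unique, table_for = dedupe_value_tables(tables)
print(len(unique))
# prints 2
```

Four properties end up sharing two stored tables, which is the saving the deduplication is after.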

Docs

I've also started adding some documentation to my Unicode-Grant wiki with information about what is contained in each Unicode data file; there are a few other pages as well. I plan to expand this wiki with many more sections than it has currently. https://github.com/samcv/Unicode-Grant/wiki/All-Unicode-Files

Future Work

Next I must integrate the property name/value alias resolving code with UCD-gen.p6. UCD-gen.p6 already has a mostly functional Unicode database with a fair number of properties. When these two are integrated, the next step will be to start integrating it with the MoarVM codebase, making any changes to MoarVM or the database retrofit codebase as needed along the way.

I will also be exploring ways of compressing the mapping of codepoints to unique combinations of Unicode property data in the bitfield. Due to the vast number of codepoints within Unicode, currently the mapping of codepoints to rows in the bitfield takes up many times more space than the actual property value data itself.
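To make the size problem concrete, here is a Python sketch of one possible way to shrink the codepoint-to-row mapping: collapsing runs of consecutive codepoints that share a row into ranges, then finding the range with a binary search. This is an assumption chosen for illustration only; no particular scheme has been settled on, and build_ranges and lookup are hypothetical names.

```python
from bisect import bisect_right


def build_ranges(row_for_codepoint):
    """Collapse a per-codepoint list of row indexes into runs.
    Returns two parallel lists: run start codepoints and run rows."""
    starts, rows = [], []
    for cp, row in enumerate(row_for_codepoint):
        if not rows or rows[-1] != row:
            starts.append(cp)
            rows.append(row)
    return starts, rows


def lookup(starts, rows, cp):
    """Find the row for a codepoint by locating the run containing it."""
    return rows[bisect_right(starts, cp) - 1]


# Toy data covering codepoints 0..127: row 1 for A-Z, row 2 for a-z,
# row 0 for everything else. Row numbers are arbitrary.
row_for_codepoint = [0] * 65 + [1] * 26 + [0] * 6 + [2] * 26 + [0] * 5
starts, rows = build_ranges(row_for_codepoint)

print(len(row_for_codepoint), len(starts))
# prints 128 5
```

In this toy case 128 per-codepoint entries shrink to 5 runs, at the cost of a binary search per lookup instead of a direct array index.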

To compress the Unicode names, the plan is to use base 40 encoding, with some additional tricks to save space for repeated words. I plan on writing a blog post that goes into the details of the compression scheme.
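The basic idea behind base 40 encoding is that three symbols from a 40-symbol alphabet fit in one 16-bit value, since 40 × 40 × 40 = 64000 is less than 65536. The Python sketch below shows only that packing step. The alphabet and its ordering are assumptions made for this example (a terminator, A-Z, 0-9, space and hyphen, leaving one slot unused), and the tricks for repeated words are not shown.

```python
BASE = 40

# Assumed alphabet: index 0 is a terminator used for padding.
# 39 symbols are defined here, so one base 40 slot stays unused.
ALPHABET = "\0ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -"


def encode(name):
    """Pack a name into a list of 16-bit values, three symbols each."""
    digits = [ALPHABET.index(ch) for ch in name]
    while len(digits) % 3:
        digits.append(0)    # pad the last group with terminators
    return [digits[i] * BASE * BASE + digits[i + 1] * BASE + digits[i + 2]
            for i in range(0, len(digits), 3)]


def decode(words):
    """Unpack 16-bit values back into a name, stopping at a terminator."""
    chars = []
    for word in words:
        for digit in (word // (BASE * BASE), (word // BASE) % BASE, word % BASE):
            if digit == 0:
                return "".join(chars)
            chars.append(ALPHABET[digit])
    return "".join(chars)


packed = encode("LATIN SMALL LETTER A")
print(len(packed), decode(packed))
# prints 7 LATIN SMALL LETTER A
```

Here a 20-character name is stored in seven 16-bit values, 14 bytes instead of 20.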

I am also considering rolling the ignorecase/ignoremark bug into my grant. Even though it was not originally planned to be part of the grant, I think it is important enough to warrant inclusion. Currently, a regex that uses both ignorecase and ignoremark together is completely broken.

Note

The work described above has been committed to the two repositories listed below (in addition to the test work described above, which was merged into Roast).

https://github.com/samcv/UCD

https://github.com/samcv/Unicode-Grant