In the 2017.04 release of Rakudo under the MoarVM backend, there will be some substantial improvements to regex speed.

I have been meaning to make a new blog post for some time about my work on Unicode in Perl 6. This is going to be the first post of several that I have been meaning to write. As a side note, let me mention that I have a Perl Foundation grant proposal which is related to working on Unicode for Perl 6 and MoarVM.

The first of these improvements I'm going to write about is case insensitive regex m:i/ /. MoarVM had formerly lowercased the haystack and the needle whenever nqp::indexic was called. The new code now also uses foldcase instead of lowercasing.

It ended up 1.8-3.3x faster than before, but it began when MasterDuke submitted a Pull Request which changed the underlying MoarVM function behind nqp::indexic (index ignorecase) to use foldcase instead of lowercase.

At first this seemed like a great and easy improvement, but shortly after there were some serious problems. You see, when you foldcase a string, sometimes the number of graphemes can change. One grapheme can become up to 3 new codepoints! For example the ligature ‘st’ will foldcase to ‘st’, even though it may lowercase to 'st'. The issue was, if the string to be searched contained any of these expanding characters, the nqp::indexic command's would be off by however many codepoints were increased!

’.fc.say; # st’.chars.say; # 1
'ffi'.fc.say; # ffi
'ffi'.chars.say; # 1
'ffi'.fc.chars.say; # 3ß’.fc.say; #  ss

So this was a real problem.

On the bright side of things, this allowed me to make many great changes in how case insensitive strings are searched for under Perl 6/MoarVM.

The nqp::index command does a lot of the effort when searching for a string. I discovered we had a nqp::indexic operation that searched for a string but ignored case, but it was not used everywhere. nqp::index was still used extensively and this required us changing case in both in Perl 6 to use nqp::index and also doing it when using the nqp::indexic (index ignore case). In addition MoarVM changed the case of the entire haystack and needle whenever the nqp::indexic operation was used.

On the MoarVM side I first worked on getting it working with foldcase, and quickly discovered that the only sane way to do this, was to begin foldcasing operations on the haystack only from the starting point sent to the indexic function. If you foldcased them before the requested index, it would screw up the offset. My solution was to foldcase the needle, and then foldcase each grapheme down the haystack, only as far as we needed to find our match, preventing useless work foldcasing parts of the string we did not need.

The offset of the needle that is found from the indexic op is relative to the original string, and it will expand characters as needed, but the returned offset will always be related to the original string, not its changed version, making the offsets useful and relevant information on where the needle is in the original string, not in the altered version. As with regex, we are looking for the match in the haystack, and so must be able to return the section of the string we have matched.

The end result is we now have a 1.8x to 3.3x (depending on not finding a match/finding a match at the beginning) faster case insensitive regex!