In the 2017.04 release of Rakudo under the MoarVM backend, there will be some substantial improvements to regex speed.
I have been meaning to make a new blog post for some time about my work on Unicode in Perl 6. This is going to be the first post of several that I have been meaning to write. As a side note, let me mention that I have a Perl Foundation grant proposal which is related to working on Unicode for Perl 6 and MoarVM.
The first of these improvements I'm going to write about is case
m:i/ /. MoarVM had formerly lowercased the haystack and
the needle whenever
nqp::indexic was called. The new code now also uses foldcase
instead of lowercasing.
It ended up 1.8-3.3x faster than before, but it began when MasterDuke submitted a Pull
Request which changed the underlying MoarVM function behind
nqp::indexic (index ignorecase)
to use foldcase instead of lowercase.
At first this seemed like a great and easy improvement, but shortly after there were
some serious problems. You see, when you foldcase a string, sometimes the number
of graphemes can change. One grapheme can become up to 3 new codepoints!
For example the ligature ‘ﬆ’ will foldcase to ‘st’, even though it may lowercase
to 'ﬆ'. The issue was, if the string to be searched contained any of these
expanding characters, the
would be off by however many codepoints were increased!
So this was a real problem.
On the bright side of things, this allowed me to make many great changes in how case insensitive strings are searched for under Perl 6/MoarVM.
nqp::index command does a lot of the effort
when searching for a string. I discovered we had a
that searched for a string but ignored case,
but it was not used everywhere.
nqp::index was still used extensively and this
required us changing case in both in Perl 6 to use
nqp::index and also
doing it when using the
nqp::indexic (index ignore case). In addition MoarVM
changed the case of the entire haystack and needle whenever the
operation was used.
On the MoarVM side I first worked on getting it working with foldcase, and quickly discovered that the only sane way to do this, was to begin foldcasing operations on the haystack only from the starting point sent to the indexic function. If you foldcased them before the requested index, it would screw up the offset. My solution was to foldcase the needle, and then foldcase each grapheme down the haystack, only as far as we needed to find our match, preventing useless work foldcasing parts of the string we did not need.
The offset of the needle that is found from the indexic op is relative to the original string, and it will expand characters as needed, but the returned offset will always be related to the original string, not its changed version, making the offsets useful and relevant information on where the needle is in the original string, not in the altered version. As with regex, we are looking for the match in the haystack, and so must be able to return the section of the string we have matched.
The end result is we now have a 1.8x to 3.3x (depending on not finding a match/finding a match at the beginning) faster case insensitive regex!