Currently when you do:
'word' ~~ /<:Latin>/, MoarVM looks in a hash which contains
all of the property values and looks up what property name it is associated with.
So in this case it looks up Latin, and then finds it is related to the
There is a longstanding issue in MoarVM: its Unicode database was created with the incorrect assumption that Unicode property values were distinct. This is one of the issues I am tackling as part of my work on the Unicode Grant. To be better informed, I generated a list of all of the overlaps. I won't paste it here because it is very long, but if you want to see the full list, see my post here.
There are 68 property values which belong to multiple property names and an
additional 126 that are shared between
In addition, we must check for overlap between property names and the values themselves.
Here are all of the property names that conflict with values
Luckily these are all Bool properties and so we don't need to worry about anything complicated there.
A fun fact: currently the only reason
' ' ~~ /<:space>/ matches is that
space resolves as Line_Break=space. In fact, it should resolve as White_Space=True.
Luckily the space character and a few others have Line_Break=space, though this does
not work properly for
"\n" ~~ /<:space>/. I will note, though, that using
<:White_Space> does work properly, as it resolves to the property name.
I would make Bool properties 0th in priority:
- 0. Property Name (i.e.
Then, similar to other regex engines, we will allow you to write a property value either unqualified or qualified with its property name:
<:Latin> <:L> # unqualified
<:Script<Latin>> <:General_Category<L>> # qualified
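To make the two forms concrete, here is a minimal Python sketch of how the text inside `<:` ... `>` could be split into a property name and a property value. This is not MoarVM or NQP code; the helper name `parse_property_spec` and the regular expression are assumptions made up for illustration.

```python
import re

# Hypothetical helper for illustration only (not part of MoarVM/NQP).
# Matches the qualified form, e.g. "Script<Latin>".
_QUALIFIED = re.compile(r'^(\w+)<(\w+)>$')

def parse_property_spec(spec):
    """Split the text between '<:' and the final '>' into (name, value).

    For the unqualified form the name is None, which means it still has
    to be resolved through some priority order.
    """
    match = _QUALIFIED.match(spec)
    if match:
        return match.group(1), match.group(2)
    return None, spec

print(parse_property_spec("Latin"))          # unqualified
print(parse_property_spec("Script<Latin>"))  # qualified
```

Only the unqualified form needs a hierarchy; the qualified form already names the property it means.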
I propose a hierarchy as follows:
- 0. Property Name (i.e.
- 1. General_Category
- 2. Script
- 3. Numeric_Type
I have not decided whether we want to guarantee the following, but they should be part of the internal hierarchy:
- 4. Grapheme_Cluster_Break
- 5. Canonical_Combining_Class
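The hierarchy above can be sketched as a small lookup in Python. This is only an illustration of the idea, not how MoarVM implements it: the table names, the `resolve` function, and the tiny hand-written data subset are all assumptions, and raising an error for an ambiguous value outside the list is just one possible choice.

```python
# Ranks 1-5 from the proposed list; rank 0 (property names) is checked first.
PRIORITY = [
    "General_Category",
    "Script",
    "Numeric_Type",
    "Grapheme_Cluster_Break",
    "Canonical_Combining_Class",
]

# Hypothetical, hand-written subset: Bool property names (and aliases)
# mapped to the property they refer to.
BOOL_PROPERTY_NAMES = {"White_Space": "White_Space", "space": "White_Space"}

# Hypothetical, hand-written subset: property value -> properties using it.
VALUE_TO_PROPERTIES = {
    "L": ["Bidi_Class", "Grapheme_Cluster_Break", "General_Category"],
    "Latin": ["Script"],
    "space": ["Line_Break"],
}

def resolve(name):
    """Resolve an unqualified name to a (property, value) pair."""
    # Rank 0: a property name wins over any property value.
    if name in BOOL_PROPERTY_NAMES:
        return BOOL_PROPERTY_NAMES[name], "True"
    candidates = VALUE_TO_PROPERTIES.get(name, [])
    if not candidates:
        raise KeyError(f"unknown property or value: {name}")
    # Ranks 1-5: the first property in the priority list that has this value.
    for prop in PRIORITY:
        if prop in candidates:
            return prop, name
    # Outside the list: only unambiguous values are accepted in this sketch.
    if len(candidates) == 1:
        return candidates[0], name
    raise ValueError(f"ambiguous value {name}: {candidates}")

for name in ("space", "L", "Latin"):
    print(name, "->", resolve(name))
```

With this ordering, `space` resolves to White_Space=True instead of Line_Break=space, and `L` resolves to General_Category even though other properties also have a value named `L`.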
We should also resolve
Numeric_Type so that people can use
their regex (I'm sure that there must already exist code where this is used so
we need to make sure this is resolved as well).
In actuality this resolves as Numeric_Type != None. So this is covered under rule 0.
I am open to adding whichever properties people think are most important to the ordered priority list as well. Due to how things are set up in MoarVM/NQP, I will need to come up with some hierarchy to resolve all the properties. In addition to this, we will have a Guaranteed list: the specs will specify that the properties on it are guaranteed to work when used unqualified.
The property value names with overlap remaining after the proposed list above:
Any ideas about adding further to the hierarchy (even for properties that don't have any overlap presently [Unicode 9.0], since overlap could be introduced later) will be appreciated. Either comment on this GitHub issue or send me an email (address at bottom of the page).