Currently, when you write 'word' ~~ /<:Latin>/, MoarVM looks Latin up in a hash that contains every property value and finds which property name it is associated with. In this case it finds that Latin belongs to the Script property.

There is a longstanding issue in MoarVM: its Unicode database was created with the incorrect assumption that Unicode property values are distinct across properties. This is one of the issues I am tackling as part of my work on the Unicode Grant. To be better informed, I generated a list of all of the overlaps. I won't paste it here because it is very long, but if you want the full list, see my post here.
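To make the problem concrete, here is a minimal sketch in Python of what goes wrong when a single hash maps each value to one property name. It is only a model, not MoarVM's actual data structure: the function names and the four sample pairs are chosen for illustration (the NU overlap itself is real and appears in the list further down), and which property "wins" in the real database may differ from the last-write-wins behaviour shown here.

```python
# Illustrative (property name, value) pairs; a tiny hand-picked subset.
PAIRS = [
    ("Script", "Latin"),
    ("Line_Break", "NU"),
    ("Word_Break", "NU"),
    ("Sentence_Break", "NU"),
]


def build_flat_index(pairs):
    """Map each value to ONE property name (assumes values are distinct)."""
    index = {}
    for prop, value in pairs:
        # A repeated value silently overwrites the earlier property here.
        index[value] = prop
    return index


def find_overlaps(pairs):
    """Return every value that is claimed by more than one property."""
    owners = {}
    for prop, value in pairs:
        owners.setdefault(value, []).append(prop)
    return {value: props for value, props in owners.items() if len(props) > 1}


flat = build_flat_index(PAIRS)
overlaps = find_overlaps(PAIRS)
```

In this model flat["Latin"] is unambiguous, but flat["NU"] keeps only one of the three properties that own NU, while find_overlaps reports all three.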

There are 68 property values which belong to multiple property names and an additional 126 that are shared between Script and Block properties.

We must also check for overlap between property names and the values themselves.

Here are all of the property names that conflict with values:

«« IDC Conflict with property name [blk]  is a boolean property
«« VS Conflict with property name [blk]  is a boolean property
«« White_Space Conflict with property name [bc]  is a boolean property
«« Alphabetic Conflict with property name [lb]  is a boolean property
«« Hyphen Conflict with property name [lb]  is a boolean property
«« Ideographic Conflict with property name [lb]  is a boolean property
«« Lower Conflict with property name [SB]  is a boolean property
«« STerm Conflict with property name [SB]  is a boolean property
«« Upper Conflict with property name [SB]  is a boolean property

Luckily these are all Bool properties and so we don't need to worry about anything complicated there.

A fun fact: currently the only reason ' ' ~~ /<:space>/ matches is that space resolves as Line_Break=space, when it should resolve as White_Space=True. Luckily the space character and a few others have Line_Break=space, but the match does not work properly for other whitespace, for example "\n" ~~ /<:space>/. I will note, though, that <:White_Space> does work properly, as it resolves to the property name.

I would give Bool properties the highest priority, rank 0:

  • 0. Property Name (i.e. <:White_Space>, <:Hyphen>)

Then, as in other regex engines, we will allow General_Category and Script values to be used unqualified:

  <:Latin> <:L>                              # unqualified
  <:Script<Latin>> <:General_Category<L>>    # qualified
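As a rough illustration of the two forms, the following Python sketch splits a property assertion into its property name and value. The regular expression and the function name parse_property_assertion are assumptions made for this example; the real parsing happens in the Perl 6 grammar and is not shown in this post.

```python
import re

# Matches <:Name> or <:Property<Value>>; an approximation for illustration.
PATTERN = re.compile(r"<:(\w+)(?:<(\w+)>)?>")


def parse_property_assertion(text):
    """Return (property, value); property is None for the unqualified form."""
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a property assertion: {text!r}")
    first, second = match.groups()
    if second is None:
        # Unqualified: only a name was given, so the property is unknown
        # at this point and must be resolved through the hierarchy.
        return (None, first)
    return (first, second)
```

The unqualified form leaves the property as None, which is exactly the case the priority hierarchy has to settle.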

I propose a hierarchy as follows:

  • 0. Property Name (i.e. <:White_Space>, <:Hyphen>)
  • 1. General_Category
  • 2. Script
  • 3. Numeric_Type

I have not decided whether we want to guarantee the following, but they should be part of the internal hierarchy:

  • 4. Grapheme_Cluster_Break
  • 5. Canonical_Combining_Class
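The hierarchy above could be modelled roughly like this in Python. This is a sketch under stated assumptions, not the MoarVM/NQP implementation: the resolve function, the return shape, and the small sample tables are invented for illustration, and the value sets are deliberately incomplete. Names that match nothing in the ranked list return None, meaning they would have to be written in qualified form.

```python
# Rule 0: names that are themselves (boolean) property names. Sample subset.
PROPERTY_NAMES = {"White_Space", "Hyphen", "Alphabetic"}

# Rules 1 to 5: properties searched in priority order for unqualified values.
PRIORITY = [
    "General_Category",
    "Script",
    "Numeric_Type",
    "Grapheme_Cluster_Break",
    "Canonical_Combining_Class",
]

# Illustrative, incomplete value tables (not the real database).
VALUES = {
    "General_Category": {"L", "Lu", "Nd"},
    "Script": {"Latin", "Greek"},
    "Numeric_Type": {"Decimal", "Digit"},
    "Grapheme_Cluster_Break": {"L", "V", "T"},
    "Line_Break": {"Alphabetic", "Hyphen", "NU"},  # unranked property
    "Word_Break": {"NU"},                          # unranked property
}


def resolve(name):
    """Resolve an unqualified name to (property, value) or None."""
    if name in PROPERTY_NAMES:
        # Rule 0: a property name wins over any value with the same spelling.
        return (name, None)
    for prop in PRIORITY:
        if name in VALUES.get(prop, ()):
            return (prop, name)
    # Not covered by the hierarchy: the caller must qualify it.
    return None
```

With this ordering, Alphabetic resolves to the boolean property rather than to the Line_Break value, L resolves to General_Category even though the sample Grapheme_Cluster_Break table also contains L, and NU stays unresolved.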

We should also resolve Numeric_Type so that people can use <:Numeric> in their regexes (I'm sure there is already code that uses this, so we need to make sure it is resolved as well).

In actuality this resolves as Numeric_Type != None. So this is covered under rule 0.

I am open to adding whichever properties people think are most important to the ordered priority list as well. Because of how things are set up in MoarVM/NQP, I will need to come up with some hierarchy to resolve all of the properties. In addition, we will have a Guaranteed list: properties that the spec guarantees will work when used unqualified.

The property value names that still overlap after applying the proposed list above:

NU => ["Word_Break", "Line_Break", "Sentence_Break"],
NA => ["Age", "Hangul_Syllable_Type", "Indic_Positional_Category"],
E => ["Joining_Group", "Jamo_Short_Name"],
SP => ["Line_Break", "Sentence_Break"],
CL => ["Line_Break", "Sentence_Break"],
D => ["Jamo_Short_Name", "Joining_Type"],
Narrow => ["East_Asian_Width", "Decomposition_Type"],
NL => ["Word_Break", "Line_Break"],
Wide => ["East_Asian_Width", "Decomposition_Type"],
Hebrew_Letter => ["Word_Break", "Line_Break"],
U => ["Jamo_Short_Name", "Joining_Type"],
LE => ["Word_Break", "Sentence_Break"],
Close => ["Bidi_Paired_Bracket_Type", "Sentence_Break"],
BB => ["Jamo_Short_Name", "Line_Break"],
HL => ["Word_Break", "Line_Break"],
Maybe => ["NFKC_Quick_Check", "NFC_Quick_Check"],
FO => ["Word_Break", "Sentence_Break"],
H => ["East_Asian_Width", "Jamo_Short_Name"],
Ambiguous => ["East_Asian_Width", "Line_Break"],

Any ideas about adding further to the hierarchy will be appreciated, even for properties that have no overlap at present [Unicode 9.0], since one could be introduced later. Either comment on this GitHub issue or send me an email (address at bottom of the page).