Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29 #328

GrabYourPitchforks · 2019-11-27T01:50:39Z

Background

This PR involves four general poles.

Update CharUnicodeInfo backing tables from Unicode 11.0 to Unicode 12.1. This also fixes some bidi bugs in the existing logic.
Modify CharUnicodeInfo backing tables to carry more information, including UAX29 text segmentation info and optional case folding information. Rewrite GenUnicodeProp to carry this information.
Optimize some of the methods in Char and CharUnicodeInfo since we were already touching these code paths.
Update StringInfo and TextElementEnumerator to use UAX29 extended grapheme cluster segmentation (fixes https://github.com/dotnet/corefx/issues/41324 and https://github.com/dotnet/corefx/issues/28416).

Updating the backing data from Unicode 11.0 to Unicode 12.1 is fairly straightforward. We do this every so often, with dotnet/coreclr#20529 being one recent example.

Prior to this PR, the CharUnicodeInfo data contained the following data for all code points U+0000..U+10FFFF:

The Unicode general category (exposed via GetUnicodeCategory)
Bidi information (not exposed publicly, but used as an internal implementation detail in IdnMapping)
Numeric information

With this PR, these tables are updated to include for every code point:

The Unicode general category
"Restricted" bidi information (more on this below)
Whitespace information
Numeric information
UAX29 grapheme segmentation boundary properties
[Optional] simple case conversion and case folding transforms

The CategoriesValues table is an 11:5:4 table that contains offsets into an array of bytes where each element has the following layout:

 bit  7   6   5   4   3   2   1   0
    +-------------------------------+
    | w | b | b | c | c | c | c | c |
    +-------------------------------+

This represents a packed structure containing the UnicodeCategory (bits 0 - 4), restricted bidi class (bits 5 - 6), and whitespace property (bit 7) of each code point.

The NumericGrapheme table is an 11:5:4 table that contains offsets into one of several different arrays. The NumericValues array consists of 8-byte elements interpreted as little-endian doubles. The DigitValues array consists of 1-byte elements where the high nibble of each element is the digit value and the low nibble of each element is the decimal digit value. The GraphemeValues array consists of 1-byte elements where the value of each element maps to an enum value from UAX#29, Table 2.

A quick note on bidi information: it turns out that our previous methodology of pulling bidi information straight from UnicodeData.txt was incomplete. Certain unassigned/reserved code points (such as U+061D) do not have entries in UnicodeData.txt but do have entries in DerivedBidiClass.txt, so we should still carry their bidi data in the CharUnicodeInfo backing tables. The update to the GenUnicodeProp generation utility takes this scenario into account.

`tools/GenUnicodeProp`

This is the logic that reads all of the Unicode data files and spits out the CharUnicodeInfoData.cs backing file. Much of this application was rewritten to remove unneeded logic and to add support for smuggling complex data through the 11:5:4 tables.

If we do want to include simple case mapping / case folding data in the future (see, e.g., https://github.com/dotnet/corefx/issues/17233), this is supported by the tool. Pass the -IncludeCasingData switch to the tool to generate this information. This switch is not enabled by default, which means that the CharUnicodeData backing tables do not carry this information.

`CharUnicodeInfo.cs`

The methods in this file were rewritten to be more optimized, and methods were moved around a bit so that logical method groups (such as GetUnicodeCategory) stay together. Of note is that there are many places where a single table or a single array holds multiple pieces of information.

For example, the Categories 11:5:4 table is used to index into the CategoriesValues array, where each byte element of that array is essentially a glorified (isWhiteSpace, bidiCategory, unicodeCategory) tuple. The NumericGrapheme 11:5:4 table is used to index into the DigitValues, NumericValues, and GraphemeSegmentationValues arrays.

`StrongBidiCategory.cs` and `IdnMapping.cs`

The previous incarnation of CharUnicodeInfo held full bidi class data. However, it turns out that we don't have a public API to get at this information, and the only consumer of this information in the entire framework is IdnMapping. Furthermore, it doesn't care about most bidi classes. It only cares about classes that are strongly left-to-right ("L") or strongly right-to-left ("R", "AL"). So I changed the data stored in CharUnicodeInfo to reflect only this very limited set of information rather than the full bidi class. See https://www.unicode.org/reports/tr44/#BC_Values_Table for more info on these values.

When generating the backing table (see the GenUnicodeProp tool), all code points marked with "L" are given a "strong left-to-right" marker in the backing data, and all code points marked with "AL" or "R" are given a "strong right-to-left" marker in the backing data. All other bidi information is thrown away when generating CharUnicodeInfoData.cs. This is referred to in source as "restricted" bidi data.

Of note is that IdnMapping under the invariant globalization mode follows IDNA2003 semantics when performing bidi processing, with the modification that the data is seeded from recent Unicode data instead of being locked to Unicode 3.2 as required by IDNA2003. We could update IdnMapping to follow strict IDNA2003 semantics if we desired, but that's out of scope of this PR. Since most runtimes won't be running under the invariant globalization addressing this would probably be low-priority.

`Rune.cs` and `Char.cs`

Slight modifications to these files to react to refactorings in CharUnicodeInfo, plus some minor optimizations in those same code paths.

`TextSegmentationUtility.cs`

This is the workhorse class that computes grapheme cluster segmentation boundaries ("display characters", essentially). We use the definition of "extended grapheme cluster" per https://www.unicode.org/reports/tr29/. See Sec. 3.1 and 3.1.1 of that document for the specific algorithm we follow.

The code is generic so that it can work with either UTF-16 or UTF-8. Currently only the UTF-16 API is public, implemented by passing the delegate Rune.DecodeFromUtf16 down to the workhorse routine. If we wanted to add UTF-8 support in the future, it would be trivial to do so by instead calling the same workhorse with the Rune.DecodeFromUtf8 delegate.

`StringInfo.cs` and `TextElementEnumerator.cs`

Rewritten to use the new "extended grapheme cluster" logic in TextSegmentationUtility.

Aside from changing the implementation to be UAX29-compliant, there's one additional behavioral change: the TextElementEnumerator.ElementIndex property getter now throws once the enumeration has completed. This brings its behavior in line with the TextElementEnumerator.Current property getter, which also throws once enumeration has completed; and it matches ElementIndex's documentation, which says that the getter throws after completion.

`CoreFx.Private.TestUtilities.Unicode/*`

Contains all the logic to process the UCD files. This isn't itself shipping code, but it's pulled in by the unit tests so that we can compare the data coming from APIs like CharUnicodeInfo.GetUnicodeCategory against the official data coming from the Unicode group.

The Data folder is unchanged by this PR, as we're now referencing static assets from the runtime-assets repo (see dotnet/runtime-assets#44). I'll send a separate PR in the future to remove the existing Unicode text files from this folder.

`CharUnicodeInfoTests.Generated.cs`

This file contains tests that iterate through all code points in the UCD files, hitting our APIs. It's intended to verify the contents of the 11:5:4 map we use in CharUnicodeInfoData.cs and of the logic which reads that data. It's not intended as a substitute for other manually-specified test data, which already exists in CharUnicodeInfoTets.cs.

`GraphemeBreakTest.cs`

Iterates through the test data pulled from https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt, which defines approx. 600 test cases for grapheme boundary calculation.

Other miscellaneous tests

Some tests had to be modified a bit, especially tests which used "\u0300\u0300" as a sequence break. (Per the most recent iteration of UAX29, those two code points should be treated as a single cluster.) I also added some more complex emoji sequences to the test case data.

VB defines a string reversal function StrReverse which had to be rewritten since it made now-incorrect assumptions about how StringInfo worked. Specifically, they had copied some of the existing StringInfo segmentation logic, which was causing problems with their other usage of TextElementEnumerator. So I removed the custom logic and have them use TextElementEnumerator always.

Benchmarks

In the earlier drafts, APIs like char.GetUnicodeCategory(char) see an approx. 20% perf improvement with this PR. Once the build stabilizes I'll rerun the tests and get more representative numbers.

I suspect this will also improve the performance of certain Regex constructs, such as "\p{L}". Will need to measure that separately to confirm.

GrabYourPitchforks · 2019-11-27T01:53:02Z

@tarekgh The coreclr bits are what you already reviewed (thanks!), with some minor fixes to the generation tool. I haven't yet gotten a chance to incorporate your feedback but will get to it soon. I believe I'm blocked waiting for #92 to go through.

ViktorHofer · 2019-11-27T11:22:17Z

As discussed on Teams, please move the testdata files to https://github.com/dotnet/runtime-assets. I renamed the repository today. cc @akoeplinger

tarekgh

Thanks @GrabYourPitchforks for getting this ready!

GrabYourPitchforks · 2019-11-28T00:48:12Z

Latest iteration reverts changes to the Data folder and uses a new runtime-assets package instead.

pentp · 2019-11-29T03:57:13Z

Is the draft PR or something else confusing GitHub because the file changes view shows GenUnicodeProp/Program.cs: No changes.? From the commits view the changes are visible.

I assume the change from 12:4:4 to 11:5:4 for the numeric tables gives better compression now and cutoff isn't useful any-more because of grapheme break data?

ufcpp · 2019-12-07T04:03:19Z

Furthermore, it doesn't care about most bidi classes. It only cares about classes that are strongly left-to-right ("L") or strongly right-to-left ("R", "AL").

I recently implemented bidi-reordering algorithm (mainly for Arabic support) for Unity game engine. I had to do so because Unity UI doesn't support it. It were very helpful if a bidi class API were public.

GrabYourPitchforks · 2020-01-13T21:38:14Z

CI failure is due to a known Helix issue, unrelated to this PR. Marking ready for review. I'm going to spend the day benchmarking.

GrabYourPitchforks · 2020-01-14T00:21:44Z

Modest gains in APIs like string.IsNullOrWhiteSpace for non-English text. APIs like Regex.IsMatch are in the range of noise so I can't get a good feel for them.

Method	Toolchain	input	Mean	Error	StdDev	Ratio	RatioSD
IsNullOrWhiteSpace	master	Κρόνος	5.690 ns	0.1147 ns	0.1072 ns	1.00	0.00
IsNullOrWhiteSpace	proto	Κρόνος	4.390 ns	0.1851 ns	0.1732 ns	0.77	0.03

IsNullOrWhiteSpace	master	你好	5.724 ns	0.0898 ns	0.0796 ns	1.00	0.00
IsNullOrWhiteSpace	proto	你好	5.315 ns	0.1398 ns	0.2336 ns	0.90	0.05

ViktorHofer · 2020-01-14T22:20:29Z

eng/Versions.props

@@ -61,6 +61,7 @@
    <SystemIOCompressionTestDataVersion>5.0.0-beta.19608.5</SystemIOCompressionTestDataVersion>
    <SystemIOPackagingTestDataVersion>5.0.0-beta.19608.5</SystemIOPackagingTestDataVersion>
    <SystemNetTestDataVersion>5.0.0-beta.19608.5</SystemNetTestDataVersion>
+    <SystemPrivateRuntimeUnicodeDataVersion>5.0.0-beta.19610.1</SystemPrivateRuntimeUnicodeDataVersion>


you need to add a dependency to eng/Version.Details.xml so that the version gets auto-updated.

#1731 is out. Thanks for catching this! :)

Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29

f813a2b

GrabYourPitchforks added the area-System.Globalization label Nov 27, 2019

GrabYourPitchforks requested a review from tarekgh November 27, 2019 01:50

ufcpp mentioned this pull request Nov 27, 2019

TextElementEnumerator will be UAX29 compliant ufcpp/GraphemeSplitter#4

Open

GrabYourPitchforks mentioned this pull request Nov 27, 2019

StringInfo.ParseCombiningCharacters returns incorrect values on Linux #302

Closed

tarekgh approved these changes Nov 27, 2019

View reviewed changes

GrabYourPitchforks mentioned this pull request Nov 28, 2019

Add TestData project containing Unicode UCD files dotnet/runtime-assets#44

Merged

Use runtime-assets package instead of checking in new UCD files

0a322f4

GrabYourPitchforks added 6 commits January 10, 2020 13:50

Merge remote-tracking branch 'origin/master' into charunicodeinfo

b0bfc14

React to package renames

f5533a7

Remove eager allocations in UnicodeDebug

f007969

PR feedback

596ea4c

Delete duplicate implementation in GenUnicodeProp

0d29483

Merge remote-tracking branch 'origin/master' into charunicodeinfo

838d318

GrabYourPitchforks marked this pull request as ready for review January 13, 2020 21:37

GrabYourPitchforks merged commit 422fef8 into dotnet:master Jan 14, 2020

GrabYourPitchforks deleted the charunicodeinfo branch January 14, 2020 22:08

ViktorHofer reviewed Jan 14, 2020

View reviewed changes

leppie mentioned this pull request Jan 29, 2020

.NET Core 3 changed how high surrogates are encoded / decoded dotnet/roslyn#41280

Open

ufcpp mentioned this pull request Feb 3, 2020

Type and supporting API for working with graphemes #31642

Closed

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29 #328

Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29 #328

Uh oh!

GrabYourPitchforks commented Nov 27, 2019 •

edited

Loading

Uh oh!

GrabYourPitchforks commented Nov 27, 2019

Uh oh!

ViktorHofer commented Nov 27, 2019 •

edited

Loading

Uh oh!

tarekgh left a comment

Uh oh!

GrabYourPitchforks commented Nov 28, 2019

Uh oh!

pentp commented Nov 29, 2019

Uh oh!

ufcpp commented Dec 7, 2019

Uh oh!

GrabYourPitchforks commented Jan 13, 2020

Uh oh!

GrabYourPitchforks commented Jan 14, 2020

Uh oh!

ViktorHofer Jan 14, 2020

Uh oh!

GrabYourPitchforks Jan 14, 2020

Uh oh!

Uh oh!

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29 #328

Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29 #328

Uh oh!

Conversation

GrabYourPitchforks commented Nov 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

tools/GenUnicodeProp

CharUnicodeInfo.cs

StrongBidiCategory.cs and IdnMapping.cs

Rune.cs and Char.cs

TextSegmentationUtility.cs

StringInfo.cs and TextElementEnumerator.cs

CoreFx.Private.TestUtilities.Unicode/*

CharUnicodeInfoTests.Generated.cs

GraphemeBreakTest.cs

Other miscellaneous tests

Benchmarks

Uh oh!

GrabYourPitchforks commented Nov 27, 2019

Uh oh!

ViktorHofer commented Nov 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tarekgh left a comment

Choose a reason for hiding this comment

Uh oh!

GrabYourPitchforks commented Nov 28, 2019

Uh oh!

pentp commented Nov 29, 2019

Uh oh!

ufcpp commented Dec 7, 2019

Uh oh!

GrabYourPitchforks commented Jan 13, 2020

Uh oh!

GrabYourPitchforks commented Jan 14, 2020

Uh oh!

ViktorHofer Jan 14, 2020

Choose a reason for hiding this comment

Uh oh!

GrabYourPitchforks Jan 14, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

GrabYourPitchforks commented Nov 27, 2019 •

edited

Loading

`tools/GenUnicodeProp`

`CharUnicodeInfo.cs`

`StrongBidiCategory.cs` and `IdnMapping.cs`

`Rune.cs` and `Char.cs`

`TextSegmentationUtility.cs`

`StringInfo.cs` and `TextElementEnumerator.cs`

`CoreFx.Private.TestUtilities.Unicode/*`

`CharUnicodeInfoTests.Generated.cs`

`GraphemeBreakTest.cs`

ViktorHofer commented Nov 27, 2019 •

edited

Loading