Skip to content

Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29 #328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jan 14, 2020

Conversation

GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Nov 27, 2019

Background

This PR involves four general poles.

  1. Update CharUnicodeInfo backing tables from Unicode 11.0 to Unicode 12.1. This also fixes some bidi bugs in the existing logic.
  2. Modify CharUnicodeInfo backing tables to carry more information, including UAX29 text segmentation info and optional case folding information. Rewrite GenUnicodeProp to carry this information.
  3. Optimize some of the methods in Char and CharUnicodeInfo since we were already touching these code paths.
  4. Update StringInfo and TextElementEnumerator to use UAX29 extended grapheme cluster segmentation (fixes https://github.com/dotnet/corefx/issues/41324 and https://github.com/dotnet/corefx/issues/28416).

Updating the backing data from Unicode 11.0 to Unicode 12.1 is fairly straightforward. We do this every so often, with dotnet/coreclr#20529 being one recent example.

Prior to this PR, the CharUnicodeInfo data contained the following data for all code points U+0000..U+10FFFF:

  • The Unicode general category (exposed via GetUnicodeCategory)
  • Bidi information (not exposed publicly, but used as an internal implementation detail in IdnMapping)
  • Numeric information

With this PR, these tables are updated to include for every code point:

  • The Unicode general category
  • "Restricted" bidi information (more on this below)
  • Whitespace information
  • Numeric information
  • UAX29 grapheme segmentation boundary properties
  • [Optional] simple case conversion and case folding transforms

The CategoriesValues table is an 11:5:4 table that contains offsets into an array of bytes where each element has the following layout:

 bit  7   6   5   4   3   2   1   0
    +-------------------------------+
    | w | b | b | c | c | c | c | c |
    +-------------------------------+

This represents a packed structure containing the UnicodeCategory (bits 0 - 4), restricted bidi class (bits 5 - 6), and whitespace property (bit 7) of each code point.

The NumericGrapheme table is an 11:5:4 table that contains offsets into one of several different arrays. The NumericValues array consists of 8-byte elements interpreted as little-endian doubles. The DigitValues array consists of 1-byte elements where the high nibble of each element is the digit value and the low nibble of each element is the decimal digit value. The GraphemeValues array consists of 1-byte elements where the value of each element maps to an enum value from UAX#29, Table 2.

A quick note on bidi information: it turns out that our previous methodology of pulling bidi information straight from UnicodeData.txt was incomplete. Certain unassigned/reserved code points (such as U+061D) do not have entries in UnicodeData.txt but do have entries in DerivedBidiClass.txt, so we should still carry their bidi data in the CharUnicodeInfo backing tables. The update to the GenUnicodeProp generation utility takes this scenario into account.

tools/GenUnicodeProp

This is the logic that reads all of the Unicode data files and spits out the CharUnicodeInfoData.cs backing file. Much of this application was rewritten to remove unneeded logic and to add support for smuggling complex data through the 11:5:4 tables.

If we do want to include simple case mapping / case folding data in the future (see, e.g., https://github.com/dotnet/corefx/issues/17233), this is supported by the tool. Pass the -IncludeCasingData switch to the tool to generate this information. This switch is not enabled by default, which means that the CharUnicodeData backing tables do not carry this information.

CharUnicodeInfo.cs

The methods in this file were rewritten to be more optimized, and methods were moved around a bit so that logical method groups (such as GetUnicodeCategory) stay together. Of note is that there are many places where a single table or a single array holds multiple pieces of information.

For example, the Categories 11:5:4 table is used to index into the CategoriesValues array, where each byte element of that array is essentially a glorified (isWhiteSpace, bidiCategory, unicodeCategory) tuple. The NumericGrapheme 11:5:4 table is used to index into the DigitValues, NumericValues, and GraphemeSegmentationValues arrays.

StrongBidiCategory.cs and IdnMapping.cs

The previous incarnation of CharUnicodeInfo held full bidi class data. However, it turns out that we don't have a public API to get at this information, and the only consumer of this information in the entire framework is IdnMapping. Furthermore, it doesn't care about most bidi classes. It only cares about classes that are strongly left-to-right ("L") or strongly right-to-left ("R", "AL"). So I changed the data stored in CharUnicodeInfo to reflect only this very limited set of information rather than the full bidi class. See https://www.unicode.org/reports/tr44/#BC_Values_Table for more info on these values.

When generating the backing table (see the GenUnicodeProp tool), all code points marked with "L" are given a "strong left-to-right" marker in the backing data, and all code points marked with "AL" or "R" are given a "strong right-to-left" marker in the backing data. All other bidi information is thrown away when generating CharUnicodeInfoData.cs. This is referred to in source as "restricted" bidi data.

Of note is that IdnMapping under the invariant globalization mode follows IDNA2003 semantics when performing bidi processing, with the modification that the data is seeded from recent Unicode data instead of being locked to Unicode 3.2 as required by IDNA2003. We could update IdnMapping to follow strict IDNA2003 semantics if we desired, but that's out of scope of this PR. Since most runtimes won't be running under the invariant globalization addressing this would probably be low-priority.

Rune.cs and Char.cs

Slight modifications to these files to react to refactorings in CharUnicodeInfo, plus some minor optimizations in those same code paths.

TextSegmentationUtility.cs

This is the workhorse class that computes grapheme cluster segmentation boundaries ("display characters", essentially). We use the definition of "extended grapheme cluster" per https://www.unicode.org/reports/tr29/. See Sec. 3.1 and 3.1.1 of that document for the specific algorithm we follow.

The code is generic so that it can work with either UTF-16 or UTF-8. Currently only the UTF-16 API is public, implemented by passing the delegate Rune.DecodeFromUtf16 down to the workhorse routine. If we wanted to add UTF-8 support in the future, it would be trivial to do so by instead calling the same workhorse with the Rune.DecodeFromUtf8 delegate.

StringInfo.cs and TextElementEnumerator.cs

Rewritten to use the new "extended grapheme cluster" logic in TextSegmentationUtility.

Aside from changing the implementation to be UAX29-compliant, there's one additional behavioral change: the TextElementEnumerator.ElementIndex property getter now throws once the enumeration has completed. This brings its behavior in line with the TextElementEnumerator.Current property getter, which also throws once enumeration has completed; and it matches ElementIndex's documentation, which says that the getter throws after completion.

CoreFx.Private.TestUtilities.Unicode/*

Contains all the logic to process the UCD files. This isn't itself shipping code, but it's pulled in by the unit tests so that we can compare the data coming from APIs like CharUnicodeInfo.GetUnicodeCategory against the official data coming from the Unicode group.

The Data folder is unchanged by this PR, as we're now referencing static assets from the runtime-assets repo (see dotnet/runtime-assets#44). I'll send a separate PR in the future to remove the existing Unicode text files from this folder.

CharUnicodeInfoTests.Generated.cs

This file contains tests that iterate through all code points in the UCD files, hitting our APIs. It's intended to verify the contents of the 11:5:4 map we use in CharUnicodeInfoData.cs and of the logic which reads that data. It's not intended as a substitute for other manually-specified test data, which already exists in CharUnicodeInfoTets.cs.

GraphemeBreakTest.cs

Iterates through the test data pulled from https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt, which defines approx. 600 test cases for grapheme boundary calculation.

Other miscellaneous tests

Some tests had to be modified a bit, especially tests which used "\u0300\u0300" as a sequence break. (Per the most recent iteration of UAX29, those two code points should be treated as a single cluster.) I also added some more complex emoji sequences to the test case data.

VB defines a string reversal function StrReverse which had to be rewritten since it made now-incorrect assumptions about how StringInfo worked. Specifically, they had copied some of the existing StringInfo segmentation logic, which was causing problems with their other usage of TextElementEnumerator. So I removed the custom logic and have them use TextElementEnumerator always.

Benchmarks

In the earlier drafts, APIs like char.GetUnicodeCategory(char) see an approx. 20% perf improvement with this PR. Once the build stabilizes I'll rerun the tests and get more representative numbers.

I suspect this will also improve the performance of certain Regex constructs, such as "\p{L}". Will need to measure that separately to confirm.

@GrabYourPitchforks
Copy link
Member Author

@tarekgh The coreclr bits are what you already reviewed (thanks!), with some minor fixes to the generation tool. I haven't yet gotten a chance to incorporate your feedback but will get to it soon. I believe I'm blocked waiting for #92 to go through.

@ViktorHofer
Copy link
Member

ViktorHofer commented Nov 27, 2019

As discussed on Teams, please move the testdata files to https://github.com/dotnet/runtime-assets. I renamed the repository today. cc @akoeplinger

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @GrabYourPitchforks for getting this ready!

@GrabYourPitchforks
Copy link
Member Author

Latest iteration reverts changes to the Data folder and uses a new runtime-assets package instead.

@pentp
Copy link
Contributor

pentp commented Nov 29, 2019

Is the draft PR or something else confusing GitHub because the file changes view shows GenUnicodeProp/Program.cs: No changes.? From the commits view the changes are visible.

I assume the change from 12:4:4 to 11:5:4 for the numeric tables gives better compression now and cutoff isn't useful any-more because of grapheme break data?

@ufcpp
Copy link
Contributor

ufcpp commented Dec 7, 2019

Furthermore, it doesn't care about most bidi classes. It only cares about classes that are strongly left-to-right ("L") or strongly right-to-left ("R", "AL").

I recently implemented bidi-reordering algorithm (mainly for Arabic support) for Unity game engine. I had to do so because Unity UI doesn't support it. It were very helpful if a bidi class API were public.

@GrabYourPitchforks GrabYourPitchforks marked this pull request as ready for review January 13, 2020 21:37
@GrabYourPitchforks
Copy link
Member Author

CI failure is due to a known Helix issue, unrelated to this PR. Marking ready for review. I'm going to spend the day benchmarking.

@GrabYourPitchforks
Copy link
Member Author

Modest gains in APIs like string.IsNullOrWhiteSpace for non-English text. APIs like Regex.IsMatch are in the range of noise so I can't get a good feel for them.

Method Toolchain input Mean Error StdDev Ratio RatioSD
IsNullOrWhiteSpace master Κρόνος 5.690 ns 0.1147 ns 0.1072 ns 1.00 0.00
IsNullOrWhiteSpace proto Κρόνος 4.390 ns 0.1851 ns 0.1732 ns 0.77 0.03
IsNullOrWhiteSpace master 你好 5.724 ns 0.0898 ns 0.0796 ns 1.00 0.00
IsNullOrWhiteSpace proto 你好 5.315 ns 0.1398 ns 0.2336 ns 0.90 0.05

@GrabYourPitchforks GrabYourPitchforks merged commit 422fef8 into dotnet:master Jan 14, 2020
@GrabYourPitchforks GrabYourPitchforks deleted the charunicodeinfo branch January 14, 2020 22:08
@@ -61,6 +61,7 @@
<SystemIOCompressionTestDataVersion>5.0.0-beta.19608.5</SystemIOCompressionTestDataVersion>
<SystemIOPackagingTestDataVersion>5.0.0-beta.19608.5</SystemIOPackagingTestDataVersion>
<SystemNetTestDataVersion>5.0.0-beta.19608.5</SystemNetTestDataVersion>
<SystemPrivateRuntimeUnicodeDataVersion>5.0.0-beta.19610.1</SystemPrivateRuntimeUnicodeDataVersion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to add a dependency to eng/Version.Details.xml so that the version gets auto-updated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1731 is out. Thanks for catching this! :)

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy