Add new normalization algorithms using Standardized Variants #70
Conversation
The standard normalization algorithm decomposes CJK compatibility ideographs into nominally equivalent codepoints which traditionally look different, and this is one of the main reasons normalization is considered destructive in practice.

[Unicode 6.3] introduced a solution for this by providing [standardized variation sequences] for these codepoints. For example, while U+2F8A6 "CJK COMPATIBILITY IDEOGRAPH-2F8A6" canonically decomposes to U+6148 with a different appearance, in Unicode 6.3 and later the standardized variation sequences in the StandardizedVariants.txt file include the following:

> 6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;

which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1". U+6148 and U+FE00 are both normalized codepoints, so we can transform text containing U+2F8A6 into normal form without losing information about the distinct appearance. At this time, many popular implementations ignore these variation selectors; however, this technique at least preserves the information in a standardized way, so implementations could use it if they chose.

This PR adds "ext" versions of the `nfd`, `nfc`, `nfkd`, and `nfkc` iterators, which perform the standard algorithms extended with this technique. They don't match the standard decompositions, and don't guarantee stability, but they do produce appropriately normalized output.

I used the generic term "ext" to reflect that other extensions could theoretically be added in the future. The standard decomposition tables are limited by their stability requirements, but these "ext" versions could be free to adopt new useful rules.

I'm not an expert in any of these topics, so please correct me if I'm mistaken in any of this. Also, I'm open to ideas about how to best present this functionality in the API.

[Unicode 6.3]: https://www.unicode.org/versions/Unicode6.3.0/#Summary
[standardized variation sequences]: http://unicode.org/faq/vs.html
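As a minimal sketch of the lossiness described above, using the crate's existing API (the extension proposed in this PR is what would preserve the distinction):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // U+2F8A6 CJK COMPATIBILITY IDEOGRAPH-2F8A6
    let compat = "\u{2F8A6}";

    // Standard NFC replaces it with the canonically equivalent U+6148,
    // losing the information about its distinct appearance.
    let nfc: String = compat.nfc().collect();
    assert_eq!(nfc, "\u{6148}");
}
```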
Switch to a dedicated `svar()` iterator function, which just does standardized variation sequences, rather than framing this functionality as an open-ended "extended" version of the standard normalization algorithms. This makes for a more factored API, gives users more control over exactly what transformations are done, and has less impact on users that don't need this new functionality.
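A rough sketch of how the proposed API might look in use — the `svar()` name is the proposal here, and exposing it on the same `UnicodeNormalization` trait as the existing iterators is an assumption:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let input = "\u{2F8A6}"; // CJK COMPATIBILITY IDEOGRAPH-2F8A6

    // Proposed: replace CJK compatibility ideographs with their
    // standardized variation sequences, then normalize as usual.
    let output: String = input.svar().nfc().collect();

    // U+6148 followed by U+FE00 VARIATION SELECTOR-1; neither is
    // altered by NFC, so the distinction survives normalization.
    assert_eq!(output, "\u{6148}\u{FE00}");
}
```

Keeping `svar()` as a separate step means callers can compose it with whichever normal form they already use, rather than learning four new "ext" variants.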
A note: I don't have time to review this right now, but if someone else can that would be great. I'm not opposed to adding this. (@sujayakar ?)
The standardized variation sequences are standardized, so don't imply otherwise.
Just a friendly ping, in case this got overlooked 🙂
oh, thanks for the ping, looking at this now. I've never seen the standardized variants before; this is pretty cool!

I see that the release notes mention the application of these variants to CJK normalization, but I don't see a reference to these variants in the documentation for normalization itself (or from googling around for "variation sequence normalization"). Do you know of any other normalization implementations that handle this?

Also, we're only taking the subset of StandardizedVariants.txt that pertains to CJK compatibility, right? So, we may want to name this process something related to CJK compatibility rather than standardized variants as a whole.

I'll do a close code review too, but it overall looks good.
```rust
match self.iter.next() {
    Some(ch) => {
        // At this time, the longest replacement sequence has length 2.
```
can we assert this or codegen a constant from the python file?
It effectively is asserted by the `TinyVec::<[char; 2]>`, which panics if too many elements are appended. I could codegen the constant if you want, but the code is simpler this way.
I could be misreading it, but I thought `TinyVec` allocates if it exceeds the inline array size? https://docs.rs/tinyvec/1.1.0/tinyvec/enum.TinyVec.html
but yeah, just asserting within the python file should be fine, agreed that codegenerating a constant is probably overkill :)
ah yeah, you can use `ArrayVec`, which will have the panic-on-overflow behavior we want here.
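For reference, a small sketch of the difference being discussed (assuming tinyvec 1.x with the `alloc` feature enabled):

```rust
use tinyvec::{ArrayVec, TinyVec};

fn main() {
    // `TinyVec` silently spills to the heap once the inline capacity
    // of the backing array is exceeded...
    let mut t: TinyVec<[char; 2]> = TinyVec::default();
    t.extend(['a', 'b', 'c']);
    assert!(t.is_heap());

    // ...whereas `ArrayVec` panics on overflow, which is the implicit
    // assertion we want on the replacement-sequence length here.
    let mut a: ArrayVec<[char; 2]> = ArrayVec::new();
    a.push('a');
    a.push('b');
    // a.push('c'); // would panic: capacity exceeded
}
```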
Oops, looks like I misread the `TinyVec` docs. Updated to use `ArrayVec`, and I added an assert to the python file.
scripts/unicode.py (outdated)
stats("Canonical fully decomp", self.canon_fully_decomp) | ||
stats("Compatible fully decomp", self.compat_fully_decomp) | ||
stats("CJK Compat Variants", self.cjk_compat_variants_fully_decomp) |
nit: "CJK Compat Variants fully decomp" ?
okay overall looks good modulo those nits! thanks for the quick turnaround time on the changes.
Also, remove the non-`fully` `cjk_compat_variants_decomp` map, since it's no longer used.
The standard normalization algorithm decomposes CJK compatibility ideographs
into nominally equivalent codepoints which traditionally look different,
and this is one of the main reasons normalization is considered destructive in
practice.
Unicode 6.3 introduced a solution for this, by providing
standardized variation sequences for these codepoints. For example, while
U+2F8A6 "CJK COMPATIBILITY-IDEOGRAPH-2F8A6" canonically decomposes to U+6148
with a different appearance, in Unicode 6.3 and later the standardized variation
sequences in the StandardizedVariants.txt file include the following:

> 6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;

which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to
U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1".
U+6148 and U+FE00 are both normalized codepoints, so we can transform text
containing U+2F8A6 into normal form without losing information about the
distinct appearance. At this time, many popular implementations ignore these
variation selectors, however this technique at least preserves the information
in a standardized way, so implementations could use it if they chose.
This PR adds "ext" versions of thenfd
,nfc
,nfkd
, andnkfd
iterators, which perform the standard algorithms extended with this technique.
They don't match the standard decompositions, and don't guarantee stability,
but they do produce appropriately normalized output.
I used the generic term "ext" to reflect that other extensions could
theoretically be added in the future. The standard decomposition tables are
limited by their stability requirements, but these "ext" versions could be
free to adopt new useful rules.
This PR adds a new `svar()` function which returns an iterator that performs
this technique.
I'm not an expert in any of these topics, so please correct me if I'm mistaken
in any of this. Also, I'm open to ideas about how to best present this
functionality in the API.