
Add new normalization algorithms using Standardized Variants #70


Merged
7 commits, merged Jan 6, 2021

Conversation

@sunfishcode (Contributor) commented Dec 3, 2020


The standard normalization algorithm decomposes CJK compatibility ideographs
into codepoints that are nominally equivalent but traditionally look different.
This is one of the main reasons normalization is considered destructive in
practice.

[Unicode 6.3] introduced a solution for this by providing
[standardized variation sequences] for these codepoints. For example,
U+2F8A6 "CJK COMPATIBILITY IDEOGRAPH-2F8A6" canonically decomposes to U+6148,
which traditionally has a different appearance. In Unicode 6.3 and later, the
standardized variation sequences in the StandardizedVariants.txt file include
the following:

> 6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;

which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to
U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1".

U+6148 and U+FE00 are both normalized codepoints, so we can transform text
containing U+2F8A6 into normal form without losing information about the
distinct appearance. At this time, many popular implementations ignore these
variation selectors; however, this technique at least preserves the information
in a standardized way, so implementations could use it if they chose.

This PR adds "ext" versions of the `nfd`, `nfc`, `nfkd`, and `nkfd`
iterators, which perform the standard algorithms extended with this technique.
They don't match the standard decompositions, and don't guarantee stability,
but they do produce appropriately normalized output.
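As a sketch of the shape of this proposed API (hypothetical: the exact `_ext` method name is my guess from the description, and these names were dropped later in this PR in favor of a dedicated iterator; see the commit note below):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Hypothetical, per this stage of the PR: nfc_ext() would run
    // NFC extended with the variation-sequence replacement.
    let s = "\u{2F8A6}";
    let normalized: String = s.chars().nfc_ext().collect();
    assert_eq!(normalized, "\u{6148}\u{FE00}");
}
```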

I used the generic term "ext" to reflect that other extensions could
theoretically be added in the future. The standard decomposition tables are
limited by their stability requirements, but these "ext" versions could be
free to adopt new useful rules.

This PR adds a new `svar()` function which returns an iterator that performs
this technique.

I'm not an expert in any of these topics, so please correct me if I'm mistaken
about any of this. Also, I'm open to ideas about how best to present this
functionality in the API.

[Unicode 6.3]: https://www.unicode.org/versions/Unicode6.3.0/#Summary
[standardized variation sequences]: http://unicode.org/faq/vs.html
Switch to a dedicated `svar()` iterator function, which just does
standardized variation sequences, rather than framing this functionality
as an open-ended "extended" version of the standard normalization
algorithms. This makes for a more factored API, gives users more control
over exactly what transformations are done, and has less impact on users
who don't need this new functionality.
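For contrast with the earlier sketch, a hypothetical example of the factored pipeline this commit describes, using the `svar()` name as it stands at this point in the thread (the method name is per this discussion, not a stable API; it was later renamed along CJK-compatibility lines):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Hypothetical: apply the standardized-variation-sequence
    // replacement as its own pass, then run ordinary NFC on top.
    let s = "\u{2F8A6}";
    let normalized: String = s.chars().svar().nfc().collect();
    assert_eq!(normalized, "\u{6148}\u{FE00}");
}
```

Keeping the replacement as a separate iterator means callers pay for it only when they ask for it, and they can combine it with whichever normal form they need.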
@Manishearth (Member)

A note: I don't have time to review this right now, but if someone else can, that would be great. I'm not opposed to adding this.

(@sujayakar ?)

The standardized variation sequences are standardized, so don't imply
otherwise.
@sunfishcode (Contributor, Author)

Just a friendly ping, in case this got overlooked 🙂

@sujayakar (Contributor)

oh, thanks for the ping, looking at this now.

I've never seen the standardized variants before; this is pretty cool!

I see that the release notes mention the application of these variants to CJK normalization, but I don't see a reference to these variants in the documentation for normalization itself (or from googling around for "variation sequence normalization"). Do you know of any other normalization implementations that handle this?

Also, we're only taking the subset of StandardizedVariants.txt that pertains to CJK compatibility, right? So, we may want to name this process something related to CJK compatibility rather than standardized variants as a whole.

I'll do a close code review too, but it overall looks good.


```rust
match self.iter.next() {
    Some(ch) => {
        // At this time, the longest replacement sequence has length 2.
```
Contributor

can we assert this or codegen a constant from the python file?

Contributor Author

It effectively is asserted by the `TinyVec::<[char; 2]>`, which panics if too many elements are appended. I could codegen the constant if you want, but the code is simpler this way.

Contributor

I could be misreading it, but I thought TinyVec allocates if it exceeds the inline array size?

https://docs.rs/tinyvec/1.1.0/tinyvec/enum.TinyVec.html

but yeah, just asserting within the python file should be fine, agreed that codegenerating a constant is probably overkill :)

Contributor

ah yeah, you can use ArrayVec, which will have the panic-on-overflow behavior we want here.

Contributor Author

Oops, looks like I misread the TinyVec docs. Updated to use ArrayVec, and I added an assert to the python file.
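For reference, a minimal sketch of the behavioral difference discussed above (assuming tinyvec 1.x with its `alloc` feature enabled):

```rust
use tinyvec::{ArrayVec, TinyVec};

fn main() {
    // TinyVec quietly promotes itself to a heap Vec once the inline
    // capacity of 2 is exceeded, so no panic occurs here.
    let mut spilling: TinyVec<[char; 2]> = Default::default();
    spilling.push('a');
    spilling.push('b');
    spilling.push('c'); // fine: now heap-backed

    // ArrayVec has a hard capacity of 2 and panics on overflow,
    // which is the guard the review settled on.
    let mut strict: ArrayVec<[char; 2]> = ArrayVec::new();
    strict.push('a');
    strict.push('b');
    strict.push('c'); // panics: capacity exceeded
}
```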

stats("Canonical fully decomp", self.canon_fully_decomp)
stats("Compatible fully decomp", self.compat_fully_decomp)
stats("CJK Compat Variants", self.cjk_compat_variants_fully_decomp)
Contributor

nit: "CJK Compat Variants fully decomp" ?

@sujayakar (Contributor)

okay overall looks good modulo those nits! thanks for the quick turnaround time on the changes.

Also, remove the non-`fully` `cjk_compat_variants_decomp` map, since
it's no longer used.