Add new normalization algorithms using Standardized Variants #70
Conversation
The standard normalization algorithm decomposes CJK compatibility ideographs into nominally equivalent codepoints which traditionally look different, and this is one of the main reasons normalization is considered destructive in practice.

[Unicode 6.3] introduced a solution for this by providing [standardized variation sequences] for these codepoints. For example, while U+2F8A6 "CJK COMPATIBILITY IDEOGRAPH-2F8A6" canonically decomposes to U+6148 with a different appearance, in Unicode 6.3 and later the standardized variation sequences in the StandardizedVariants.txt file include the following:

> 6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;

which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1". U+6148 and U+FE00 are both normalized codepoints, so we can transform text containing U+2F8A6 into normal form without losing information about the distinct appearance. At this time, many popular implementations ignore these variation selectors; however, this technique at least preserves the information in a standardized way, so implementations could use it if they chose.

This PR adds "ext" versions of the `nfd`, `nfc`, `nfkd`, and `nfkc` iterators, which perform the standard algorithms extended with this technique. They don't match the standard decompositions, and don't guarantee stability, but they do produce appropriately normalized output.

I used the generic term "ext" to reflect that other extensions could theoretically be added in the future. The standard decomposition tables are limited by their stability requirements, but these "ext" versions could be free to adopt new useful rules.

I'm not an expert in any of these topics, so please correct me if I'm mistaken in any of this. Also, I'm open to ideas about how to best present this functionality in the API.

[Unicode 6.3]: https://www.unicode.org/versions/Unicode6.3.0/#Summary
[standardized variation sequences]: http://unicode.org/faq/vs.html
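As a minimal sketch of the lossiness described above, using the crate's existing API (the extension proposed in this PR is what would preserve the distinction):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // U+2F8A6 CJK COMPATIBILITY IDEOGRAPH-2F8A6
    let compat = "\u{2F8A6}";

    // Standard NFC replaces it with the canonically equivalent U+6148,
    // losing the information about its distinct appearance.
    let nfc: String = compat.nfc().collect();
    assert_eq!(nfc, "\u{6148}");
}
```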
Switch to a dedicated `svar()` iterator function, which just does standardized variation sequences, rather than framing this functionality as an open-ended "extended" version of the standard normalization algorithms. This makes for a more factored API, gives users more control over exactly what transformations are done, and has less impact on users that don't need this new functionality.
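A rough sketch of how the proposed API might look in use — the `svar()` name is the proposal here, and exposing it on the same `UnicodeNormalization` trait as the existing iterators is an assumption:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let input = "\u{2F8A6}"; // CJK COMPATIBILITY IDEOGRAPH-2F8A6

    // Proposed: replace CJK compatibility ideographs with their
    // standardized variation sequences, then normalize as usual.
    let output: String = input.svar().nfc().collect();

    // U+6148 followed by U+FE00 VARIATION SELECTOR-1; neither is
    // altered by NFC, so the distinction survives normalization.
    assert_eq!(output, "\u{6148}\u{FE00}");
}
```

Keeping `svar()` as a separate step means callers can compose it with whichever normal form they already use, rather than learning four new "ext" variants.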
A note: I don't have time to review this right now, but if someone else can that would be great. I'm not opposed to adding this. (@sujayakar ?)
The standardized variation sequences are standardized, so don't imply otherwise.
Just a friendly ping, in case this got overlooked 🙂
oh, thanks for the ping, looking at this now. I've never seen the standardized variants before; this is pretty cool!

I see that the release notes mention the application of these variants to CJK normalization, but I don't see a reference to these variants in the documentation for normalization itself (or from googling around for "variation sequence normalization"). Do you know of any other normalization implementations that handle this?

Also, we're only taking the subset of StandardizedVariants.txt that pertains to CJK compatibility, right? So, we may want to name this process something related to CJK compatibility rather than standardized variants as a whole.

I'll do a close code review too, but it overall looks good.
```rust
match self.iter.next() {
    Some(ch) => {
        // At this time, the longest replacement sequence has length 2.
```
can we assert this or codegen a constant from the python file?
It effectively is asserted by the `TinyVec::<[char; 2]>`, which panics if too many elements are appended. I could codegen the constant if you want, but the code is simpler this way.
I could be misreading it, but I thought `TinyVec` allocates if it exceeds the inline array size? https://docs.rs/tinyvec/1.1.0/tinyvec/enum.TinyVec.html
but yeah, just asserting within the python file should be fine, agreed that codegenerating a constant is probably overkill :)
ah yeah, you can use `ArrayVec`, which will have the panic-on-overflow behavior we want here.
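For reference, a small sketch of the difference being discussed (assuming tinyvec 1.x with the `alloc` feature enabled):

```rust
use tinyvec::{ArrayVec, TinyVec};

fn main() {
    // `TinyVec` silently spills to the heap once the inline capacity
    // of the backing array is exceeded...
    let mut t: TinyVec<[char; 2]> = TinyVec::default();
    t.extend(['a', 'b', 'c']);
    assert!(t.is_heap());

    // ...whereas `ArrayVec` panics on overflow, which is the implicit
    // assertion we want on the replacement-sequence length here.
    let mut a: ArrayVec<[char; 2]> = ArrayVec::new();
    a.push('a');
    a.push('b');
    // a.push('c'); // would panic: capacity exceeded
}
```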
Oops, looks like I misread the `TinyVec` docs. Updated to use `ArrayVec`, and I added an assert to the python file.
scripts/unicode.py (outdated)
stats("Canonical fully decomp", self.canon_fully_decomp) | ||
stats("Compatible fully decomp", self.compat_fully_decomp) | ||
stats("CJK Compat Variants", self.cjk_compat_variants_fully_decomp) |
nit: "CJK Compat Variants fully decomp" ?
okay overall looks good modulo those nits! thanks for the quick turnaround time on the changes.
Also, remove the non-`fully` `cjk_compat_variants_decomp` map, since it's no longer used.
The standard normalization algorithm decomposes CJK compatibility ideographs
into nominally equivalent codepoints which traditionally look different,
and this is one of the main reasons normalization is considered destructive in
practice.
Unicode 6.3 introduced a solution for this, by providing
standardized variation sequences for these codepoints. For example, while
U+2F8A6 "CJK COMPATIBILITY-IDEOGRAPH-2F8A6" canonically decomposes to U+6148
with a different appearance, in Unicode 6.3 and later the standardized variation
sequences in the StandardizedVariants.txt file include the following:

> 6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;

which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to
U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1".
U+6148 and U+FE00 are both normalized codepoints, so we can transform text
containing U+2F8A6 into normal form without losing information about the
distinct appearance. At this time, many popular implementations ignore these
variation selectors, however this technique at least preserves the information
in a standardized way, so implementations could use it if they chose.
This PR adds "ext" versions of thenfd
,nfc
,nfkd
, andnkfd
iterators, which perform the standard algorithms extended with this technique.
They don't match the standard decompositions, and don't guarantee stability,
but they do produce appropriately normalized output.
I used the generic term "ext" to reflect that other extensions could
theoretically be added in the future. The standard decomposition tables are
limited by their stability requirements, but these "ext" versions could be
free to adopt new useful rules.
This PR adds a new `svar()` function which returns an iterator that performs
this technique.
I'm not an expert in any of these topics, so please correct me if I'm mistaken
in any of this. Also, I'm open to ideas about how to best present this
functionality in the API.