Skip to content

Add ascii fast path for unicode_word_indices and unicode_words #147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jul 28, 2025

Conversation

PSeitz-dd
Copy link
Contributor

@PSeitz-dd PSeitz-dd commented Jul 11, 2025

This PR adds a fast path for unicode_word_indices and unicode_words.

  • Add ascii version behind ascii check
  • Add benchmark
  • Add enum Iterator to wrap ascii and non-ascii into a single iterator
  • Add proptest to make sure ascii and non-ascii methods behave exactly the same.

Performance for ASCII text is greatly improved, while everything else gets a slight performance hit.

unicode_word_indices/log
                        time:   [209.68 ns 212.35 ns 215.32 ns]
                        thrpt:  [761.82 MiB/s 772.45 MiB/s 782.29 MiB/s]
                 change:
                        time:   [-82.313% -82.117% -81.888%] (p = 0.00 < 0.05)
                        thrpt:  [+452.12% +459.18% +465.38%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
unicode_word_indices/english
                        time:   [426.95 µs 428.03 µs 429.18 µs]
                        thrpt:  [110.42 MiB/s 110.72 MiB/s 110.99 MiB/s]
                 change:
                        time:   [+2.0057% +2.3798% +2.7729%] (p = 0.00 < 0.05)
                        thrpt:  [-2.6981% -2.3245% -1.9662%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
unicode_word_indices/japanese
                        time:   [415.66 µs 416.25 µs 416.86 µs]
                        thrpt:  [116.02 MiB/s 116.18 MiB/s 116.35 MiB/s]
                 change:
                        time:   [+0.5766% +0.8184% +1.0398%] (p = 0.00 < 0.05)
                        thrpt:  [-1.0291% -0.8118% -0.5733%]

/// [`unicode_words`]: trait.UnicodeSegmentation.html#tymethod.unicode_words
/// [`UnicodeSegmentation`]: trait.UnicodeSegmentation.html
#[derive(Debug)]
pub struct UnicodeWords<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: removing these types is a breaking change

We deliberately have concrete types here so that they can be used inside other things conveniently.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR to keep the Iterator structs
(personally I prefer Iterators though as the contract is clearer)

src/word.rs Outdated
pub struct UnicodeWordIndices<'a> {
#[allow(clippy::type_complexity)]
inner: Filter<UWordBoundIndices<'a>, fn(&(usize, &str)) -> bool>,
inner: Box<dyn DoubleEndedIterator<Item = (usize, &'a str)> + 'a>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want this to allocate either.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced it with a custom enum iterator (it's also much faster :))

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)

@PSeitz-dd
Copy link
Contributor Author

as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)

I replaced the Box with a custom enum. It's faster, and the performance hit for non-ascii is quite low now.

src/word.rs Outdated
///
/// Any other single ASCII byte is its own boundary (the default WB999).
#[derive(Debug)]
pub struct AsciiWordBoundIter<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be pub, maybe pub(crate) if you need it to be

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!

@Manishearth Manishearth merged commit af87c8d into unicode-rs:master Jul 28, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy