-
Notifications
You must be signed in to change notification settings - Fork 63
Add ascii fast path for unicode_word_indices and unicode_words #147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// [`unicode_words`]: trait.UnicodeSegmentation.html#tymethod.unicode_words | ||
/// [`UnicodeSegmentation`]: trait.UnicodeSegmentation.html | ||
#[derive(Debug)] | ||
pub struct UnicodeWords<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: removing these types is a breaking change
We deliberately have concrete types here so that they can be used inside other things conveniently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the PR to keep the Iterator structs
(personally I prefer Iterators though as the contract is clearer)
src/word.rs
Outdated
pub struct UnicodeWordIndices<'a> { | ||
#[allow(clippy::type_complexity)] | ||
inner: Filter<UWordBoundIndices<'a>, fn(&(usize, &str)) -> bool>, | ||
inner: Box<dyn DoubleEndedIterator<Item = (usize, &'a str)> + 'a>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want this to allocate either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced it with a custom enum iterator (it's also much faster :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either
also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)
I replaced the Box with a custom enum. It's faster, and the performance hit for non-ascii is quite low now. |
src/word.rs
Outdated
/// | ||
/// Any other single ASCII byte is its own boundary (the default WB999). | ||
#[derive(Debug)] | ||
pub struct AsciiWordBoundIter<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be pub
, maybe pub(crate)
if you need it to be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!!
This PR adds a fast path for
unicode_word_indices
andunicode_words
.Performance for ASCII text is greatly improved, while everything else gets a slight performance hit.