
gh-134873: fix various quadratic worst-time complexities in _header_value_parser.py [WIP] #134947


Draft
wants to merge 3 commits into main

Conversation

picnixz
Member

@picnixz picnixz commented May 30, 2025

This is a work in progress (I need to go now) but I'll continue tomorrow. I want to add tests, and some other places are still not fixed because I didn't find a straightforward fix.

@picnixz picnixz changed the title gh-134873: fix various quadratic worst-time complexity in _header_value_parser.py [WIP] gh-134873: fix various quadratic worst-time complexities in _header_value_parser.py [WIP] May 30, 2025
@picnixz picnixz added type-security A security issue needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels May 30, 2025
@picnixz
Member Author

picnixz commented May 31, 2025

Urgh, so there is still a quadratic complexity that I need to think about: it occurs when we alternate between if-branches. For instance, in get_phrase:

    try:
        token, value = get_word(value)
        phrase.append(token)
    except errors.HeaderParseError:
        phrase.defects.append(errors.InvalidHeaderDefect(
            "phrase does not start with word"))
    while value and value[0] not in PHRASE_ENDS:
        if value[0]=='.':
            phrase.append(DOT)
            phrase.defects.append(errors.ObsoleteHeaderDefect(
                "period in 'phrase'"))
            value = value[1:]
        else:
            try:
                token, value = get_word(value)
            except errors.HeaderParseError:
                if value[0] in CFWS_LEADER:
                    token, value = get_cfws(value)
                    phrase.defects.append(errors.ObsoleteHeaderDefect(
                        "comment found without atom"))
                else:
                    raise
            phrase.append(token)
    return phrase, value

The

if value[0]=='.':
    phrase.append(DOT)
    phrase.defects.append(errors.ObsoleteHeaderDefect(
        "period in 'phrase'"))
    value = value[1:]

is quadratic if we process many consecutive '.' characters. However, if I have something like 'a' + '.a' * 100, then the if branch still requires a copy every two iterations, whatever I put inside. Even though the length shrinks at each iteration, it doesn't shrink fast enough. I'll need to think a bit more.

One idea would be to use a deque to avoid a copy when stripping parts, but this requires reconstructing a deque every time.

Maybe switch to a stateful parser? That way, we shouldn't have high complexity and we should be fine. But this requires a complete rewrite of this module.
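To make the "stateful parser" idea concrete, here is a minimal sketch of one possible reading of it: the parser keeps the original string plus a read position, and only slices when it extracts a token. `Cursor` and `split_phrase` are hypothetical names for illustration and are not part of `_header_value_parser.py`; the toy grammar (words separated by dots) is an assumption as well.

```python
class Cursor:
    """Hypothetical helper: the input string plus the current read position."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def at_end(self):
        return self.pos >= len(self.text)

    def peek(self):
        return self.text[self.pos]

    def take_while(self, predicate):
        # Advance the position; copy only the characters of this token.
        start = self.pos
        while self.pos < len(self.text) and predicate(self.text[self.pos]):
            self.pos += 1
        return self.text[start:self.pos]


def split_phrase(text):
    """Toy stand-in for get_phrase(): return a list of words and '.' tokens."""
    cursor = Cursor(text)
    tokens = []
    while not cursor.at_end():
        if cursor.peek() == '.':
            tokens.append('.')
            cursor.pos += 1  # no copy of the remaining input
        else:
            tokens.append(cursor.take_while(lambda ch: ch != '.'))
    return tokens


tokens = split_phrase("a.bc..d")
```

In this sketch the remainder of the input is never copied, so alternating branches no longer cost O(n) each. The price is the one already noted: every `get_*` function would have to accept and return the cursor instead of a plain string.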

@picnixz
Member Author

picnixz commented May 31, 2025

Ok, this patch fixes some cases but not all. Cases where two branches alternate are still subject to O(n²) complexity: this cannot be avoided with .lstrip() or slices, since both still need O(n) to copy the rest of the string.

The advantage of .lstrip() over slices is essentially when the if branch is executed more than once before going to the else branch (namely, we can batch-process some characters). For instance, "a" + "." * 50000 is parsed using lstrip() in O(n) instead of O(n²). However, "a" + ".a" * 50000 is still subject to O(n²) parsing.
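The difference between the two inputs can be checked without timing anything, by counting how many characters get copied. The sketch below is a simplified model, not the real parser: `_drop_word` is a hypothetical stand-in for `get_word()`, and the count assumes that each slice or `.lstrip()` call copies the whole remaining string.

```python
def _drop_word(value):
    """Hypothetical stand-in for get_word(): drop a leading run of non-dots."""
    end = 0
    while end < len(value) and value[end] != '.':
        end += 1
    return value[end:]


def copied_chars(value, use_lstrip):
    """Count the characters copied while consuming `value` branch by branch."""
    copied = 0
    while value:
        if value[0] == '.':
            # Either strip the whole run of dots at once, or one dot per pass.
            value = value.lstrip('.') if use_lstrip else value[1:]
        else:
            value = _drop_word(value)
        copied += len(value)  # each branch builds a new string of this length
    return copied


n = 1000
dots = "a" + "." * n
alternating = "a" + ".a" * n

dots_with_slices = copied_chars(dots, use_lstrip=False)    # n * (n + 1) / 2
dots_with_lstrip = copied_chars(dots, use_lstrip=True)     # n
alt_with_slices = copied_chars(alternating, use_lstrip=False)
alt_with_lstrip = copied_chars(alternating, use_lstrip=True)
```

Under this model the run of dots drops from about n²/2 copied characters to n, while the alternating input costs n(2n + 1) with either strategy, because `.lstrip('.')` only ever finds a single dot to remove.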

@picnixz
Member Author

picnixz commented May 31, 2025

@serhiy-storchaka I'm a bit stuck here. I don't really have a better idea than to rewrite the module so that it uses a deque to hold the current value. That way, I can call .popleft() to drop prefixed chars. Unfortunately, this also means that I cannot really return a string anymore, and the module is used externally: its signatures are actually in typeshed: https://github.com/python/typeshed/blob/main/stdlib/email/_header_value_parser.pyi.

So, to ensure backward compatibility, I think I'll need a new module... I can't think of a solution other than entirely rewriting the logic so that we don't have unnecessary slices, so your help would be appreciated. TIA!


I thought about holding the current "index" where the parser stopped but again, I don't think it'll be sufficient as I'll still need to make slices at some point to extract some values to hold (OTOH, using a deque allows me to move some characters elsewhere without having to copy the string twice, though I'll still need a ''.join() on the part that is being stored).

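def get_something(value):
    # Sketch only: `value` is a collections.deque of characters, while
    # `Storage` and `cond` are placeholders for a token class and a predicate.
    storage = Storage()
    head = []
    while value and cond(value):
        head.append(value.popleft())
    storage.value = ''.join(head)
    return storage, value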
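A runnable variant of that sketch, for experimentation: `Storage` is a hypothetical dataclass standing in for a token class, and `get_run` with its `allowed` parameter is an invented name, not a function from the email package. The deque is assumed to be built once from the header value, which costs O(n) up front.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Storage:
    """Hypothetical token container; not a class from the email package."""
    value: str = ''


def get_run(chars, allowed):
    """Pop leading characters found in `allowed` from the deque `chars`."""
    storage = Storage()
    head = []
    while chars and chars[0] in allowed:
        head.append(chars.popleft())  # O(1): the remainder is not copied
    storage.value = ''.join(head)     # one copy, of the extracted part only
    return storage, chars


chars = deque("abc.def")  # built once for the whole input
token, chars = get_run(chars, "abcdef")
rest = ''.join(chars)
```

Each character is popped once and joined once, so the extraction stays linear even when branches alternate. The caller gets the same deque object back rather than a new string, which is the signature change discussed above.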

johnzhou721

This comment was marked as resolved.

Labels
needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes type-security A security issue
2 participants