Skip to content

Allow lowercase hexadecimal characters in base64.b16decode() #79738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
djhoulihan mannequin opened this issue Dec 22, 2018 · 4 comments
Open

Allow lowercase hexadecimal characters in base64.b16decode() #79738

djhoulihan mannequin opened this issue Dec 22, 2018 · 4 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@djhoulihan
Copy link
Mannequin

djhoulihan mannequin commented Dec 22, 2018

BPO 35557
Nosy @warsaw, @serhiy-storchaka, @tirkarthi, @DJHoulihan
PRs
  • bpo-35557 - Allow base64.b16decode() to accept lowercase hexadecimal characters by default #11285
  • Files
  • testing_data.txt: Testing functions used to demonstrate a ~9.4% improvement in hexadecimal decoding performance.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-12-22.02:49:42.519>
    labels = ['3.8', 'type-bug', 'library']
    title = 'Allow lowercase hexadecimal characters in base64.b16decode()'
    updated_at = <Date 2018-12-23.00:33:11.261>
    user = 'https://github.com/djhoulihan'

    bugs.python.org fields:

    activity = <Date 2018-12-23.00:33:11.261>
    actor = 'djhoulihan'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2018-12-22.02:49:42.519>
    creator = 'djhoulihan'
    dependencies = []
    files = ['48013']
    hgrepos = []
    issue_num = 35557
    keywords = ['patch']
    message_count = 4.0
    messages = ['332319', '332326', '332329', '332377']
    nosy_count = 4.0
    nosy_names = ['barry', 'serhiy.storchaka', 'xtreak', 'djhoulihan']
    pr_nums = ['11285']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35557'
    versions = ['Python 3.8']

    @djhoulihan
    Copy link
    Mannequin Author

    djhoulihan mannequin commented Dec 22, 2018

    Currently, the base64 method b16decode does not decode a hexadecimal string with lowercase characters by default. To do so requires passing casefold=True as the second argument. I propose a change to the b16decode method to allow it to accept hexadecimal strings containing lowercase characters without requiring the casefold argument.

    The revision itself is straightforward. We simply have to amend the regular expression to match the lowercase characters a-f in addition to A-F. Likewise the corresponding tests in Lib/base64.py also need to be changed to account for the lack of a second argument. Therefore there are two files total which need to be refactored.

    In my view, there are several compelling reasons for this change:

    1. There is a nontrivial performance improvement. I've already made the changes on my own test branch[1] and I see a mean decoding performance improvement of approximately 9.4% (tested by taking the average of 1,000,000 hexadecimal string encodings). The testing details are included in a file attached to this issue.

    2. Hexadecimal strings are case insensitive, i.e. 8DEF is equivalent to 8def. This is the particularly motivating reason why I've written the patch - there have been many times when I've been momentarily confounded by a hexadecimal string that won't decode only to realize I'm yet again passing in a lowercase string.

    3. The behavior of the underlying method in binascii, unhexlify, accepts both uppercase and lowercase characters by default without requiring a second parameter. From the perspective of code hygiene and language consistency, I think it's both more practical and more elegant for the language to behave in the same, predictable manner (particularly because base64.py simply calls binascii.c under the hood). Additionally, the binascii method hexlify actually outputs strings in lowercase encoding, meaning that any use of both binascii and base64 in the same application will have to consistently do a casefold conversion if output from binascii.hexlify is fed back as input to base64.b16decode for some reason.

    There are two arguments against this patch, as far as I can see it:

    1. In the relevant IETF reference documentation (RFC3548[2], referenced directly in the b16decode docstring; and RFC4648[3] with supersedes it), under Security Considerations the author Simon Josefsson claims that there exists a potential side channel security issue intrinsic to accepting case insensitive hexadecimal strings in a decoding function. While I'm not dismissing this out of hand, I personally do not find the claimed vulnerability compelling, and Josefsson does not clarify a real world attack scenario or threat model. I think it's important we challenge this assumption in light of the potential nontrivial improvements to both language consistency and performance. I would be very interested in hearing a real threat model here that would practically exist outside of a very contrived scenario. Moreover if this is such a security issue, why is the behavior already evident in binascii.unhexlify?

    2. The other reason may be that there's simply no reason to make such a change. An argument can be put forward that a developer won't frequently have to deal with this issue because the opposite method, b16encode, produces hexadecimal strings with uppercase characters. However, in my experience hexadecimal strings with lowercase characters are extremely common in situations where developers haven't produced the strings themselves in the language.

    As I mentioned, I have already written the changes on my own patch branch. I'll open a pull request once this issue has been created and reference the issue in the pull request on GitHub.

    References:

    1. https://github.com/djhoulihan/cpython/tree/base64_case_sensitivity

    2. https://tools.ietf.org/html/rfc3548

    3. https://tools.ietf.org/html/rfc4648

    @djhoulihan djhoulihan mannequin added 3.7 (EOL) end of life performance Performance or resource usage 3.8 (EOL) end of life stdlib Python modules in the Lib dir labels Dec 22, 2018
    @tirkarthi
    Copy link
    Member

    Thanks for the report. A couple of points as below :

    • This changes the interface of the function by removing a parameter. Thus it will break compatibility with Python 2 and also earlier versions of Python 3. Removing a parameter in the signature has to go through a deprecation cycle if this is going to be accepted.
    • Please don't use time.time and mean for benchmarks that can be misleading. There are modules like timeit and perf (https://pypi.org/project/perf/) that are more reliable.

    I looked for some more inefficiencies and I can see re.search for every run. Perhaps re.compile can be used to store the compiled regex at module level and then to match against the string. This makes the function 25% faster without changing the interface. In case casefold=False then an extra call to make the string upper case is avoided giving some more benefit.

    With re.search inside the function

    $ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca"' 'base64.b16decode(hex_data, casefold=True)'
    .....................
    Mean +- std dev: 3.08 us +- 0.22 us
    $ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca".upper()' 'base64.b16decode(hex_data)'
    .....................
    Mean +- std dev: 2.93 us +- 0.20 us

    With the regex compiled to a variable at the module level

    $ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca"' 'base64.b16decode(hex_data, casefold=True)'
    .....................
    Mean +- std dev: 2.08 us +- 0.15 us
    $ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca".upper()' 'base64.b16decode(hex_data)'
    .....................
    Mean +- std dev: 1.98 us +- 0.17 us

    Since this is a comparison of fixed set of elements I tried using a set of elements and any to short-circuit but it seems to be slower

    $ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca"' 'base64.b16decode(hex_data, casefold=True)'
    .....................
    Mean +- std dev: 8.21 us +- 0.66 us

    I am opening a PR to use the compiled regex at the module level since I see it as a net win of 25-30% without any interface change or test case changes required.

    @serhiy-storchaka
    Copy link
    Member

    This is compatibility breaking change. Furthermore, it changes the purposed behavior that was from the initial version (4c904d1).

    Since currently there is an option which allows to accept lowercase hexadecimal characters, I do not see a need in this change. You can also use bytes.fromhex().

    @djhoulihan djhoulihan mannequin added type-bug An unexpected behavior, bug, or error and removed 3.7 (EOL) end of life performance Performance or resource usage labels Dec 23, 2018
    @djhoulihan
    Copy link
    Mannequin Author

    djhoulihan mannequin commented Dec 23, 2018

    Karthikeyan,

    Thank you for taking the time to respond so thoroughly. In particular, in the future I'll be more careful to qualify and quantify potential performance improvements using timeit or perf.

    That being said, as I mentioned the primary motivation for this is not a performance improvement - I just felt that was a nice potential side effect. Rather, this enhancement brings base64.b16decode into behavioral consistency with binascii.unhexlify. The binascii method already accepts both uppercase and lowercase hexadecimal characters by default.

    However I can definitely understand what you and Serhiy are saying about this being a breaking change. Therefore I'd like to amend my proposal to the following:

    1. Keep the casefold argument and corresponding logic, but keep the revised regex that will match against both uppercase and lowercase characters; and

    2. Only put this change in for Python 3.8, this way existing code which uses the explicit argument on versions <= 3.7 does not break (but will still function normally).

    I've altered this issue to reflect my amended proposal, targeting only version 3.8 and editing the type to be behavior instead of performance. In this way, the change will still make base64.b16decode consistent with binascii.unhexlify (and the case insensitivity of hexadecimal encoding more generally) without breaking existing code or requiring developers to change workflows using casefold.

    Naturally there would be additional logic that enforces the case sensitivity if casefold=false is passed to the method, and this will likewise not break existing workflows either.

    If this change is considered agreeable, I will amend my open pull request to roll back the breaking change and refactor the way casefold is processed. From my perspective this amended proposal offers an upside in language consistency without any downside.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @picnixz picnixz added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error 3.8 (EOL) end of life labels Jun 1, 2025
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants
    pFad - Phonifier reborn

    Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

    Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


    Alternative Proxies:

    Alternative Proxy

    pFad Proxy

    pFad v3 Proxy

    pFad v4 Proxy