Fixed #26005 -- Convert URIs to IRIs according to RFC #6428

Closed
wants to merge 3 commits into from

Conversation

Chronial
Contributor

@Chronial Chronial commented Apr 7, 2016

This fixes the broken implementation of 10b17a2. Because that change touched the WSGI infrastructure, someone who understands that code should review this change.
The original change introduced its uri_to_iri function in two places: 10b17a2#diff-f6d1c75ec606389da5af6558bf57f171R129 and 10b17a2#diff-97160f50594424a40f2621d5a3c581ccL273.

In the second case the original author clearly didn't want uri_to_iri but a fancier unquote, so I replaced it with that. In the first case I am not so sure.

Also note that this is, in a sense, a backwards-incompatible change. Code that assumed uri_to_iri is a fancier unquote will break. The new code also no longer does what the documentation previously said, which is why I had to update the documentation. But the previous documentation was also self-contradictory, as explained in the Trac ticket.
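To make the "fancier unquote" distinction concrete, here is a minimal sketch using only the standard library. The example URI is made up for illustration. It shows that a plain unquote also decodes reserved characters such as "/" and "?", which changes the structure of the URI, whereas a URI-to-IRI conversion is expected to leave those escapes alone.

```python
from urllib.parse import quote, unquote

# Hypothetical URI: an escaped slash, an escaped question mark, and a
# percent-encoded UTF-8 heart character.
uri = "/search%2Fall%3Fq=%E2%99%A5"

# unquote() decodes every percent escape, including reserved characters.
decoded = unquote(uri)

# Re-quoting the decoded text does not give the original URI back,
# because the information about which delimiters were escaped is gone.
requoted = quote(decoded)
```

Code that relied on the old behaviour was effectively relying on this lossy decoding, which is why the fix cannot be fully backwards compatible.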

_hextobyte = {('%02x' % char).encode(): six.int2byte(char)
              for char in _allowed_ascii}
_hextobyte.update(((a + b).encode(), six.int2byte(int(a + b, 16)))
                  for a in _hexdig[8:] for b in _hexdig)
Member

I don't see much point in using a lazily computed global variable here. Can't this be moved to the module level?

Contributor Author

This code is based on the unquote implementation in the standard library. I couldn't see any disadvantage to this approach, so I followed their example. But now that I'm looking at this again, this approach does not seem completely thread-safe.

The reason given in the standard library is memory. I don't know the exact memory overhead for the strings and the dict, but I would assume this structure takes less than 4 KB of memory.

If you think that's the better approach, I can move the initialization to module level.

Member

Agreed, this is not thread-safe; there's a small race condition window. As the dict is assigned to a global variable on first call anyway (ending up in non-collectable memory), I don't see the point of the lazy initialization.

Contributor Author

Moved to module level. Also discovered and fixed another bug with the previous implementation (uppercase hex encodings of ascii letters were not decoded).

@Chronial
Contributor Author

Chronial commented May 2, 2016

@timgraham IMHO it would be good if this could make the next release. This is an unavoidable backwards-incompatible change, so the sooner this is done, the smaller the chance that a lot of code depends on the old behavior.

     # Under Python 3, non-ASCII values in the WSGI environ are arbitrarily
     # decoded with ISO-8859-1. We replicate this behavior here.
     # Refs comment in `get_bytes_from_wsgi()`.
-    return path.decode(ISO_8859_1) if six.PY3 else path
+    return path.decode('iso-8859-1') if six.PY3 else path
Member

Is this change needed?

Contributor Author

No, I made it because IMHO this is better style. My understanding is that these constants only exist for py2/py3 compatibility. Their use was needed in the UTF_8 line. The new code only does decoding on py3, so it doesn't need the constants and can just pass the string. This seemed strictly more pythonic to me.

Should I change it back? :)

Member

Not sure, I haven't looked into the history of the constants but I figured they were to prevent typos in the strings.

Contributor Author

Won't both versions (constant/string) throw an exception on a typo?

Member

Thanks, this inspired a separate cleanup: #8032.
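The question raised in this thread can be checked with a small sketch. The constant `ISO_8859_1` is redefined locally here as an assumption about what such a constant holds, and the helper names are hypothetical. Both kinds of typo fail when the line runs: a misspelled codec string raises `LookupError`, and a misspelled constant name raises `NameError`.

```python
# Local stand-in for an encoding-name constant.
ISO_8859_1 = "iso-8859-1"
data = b"caf\xe9"


def decode_with_constant():
    return data.decode(ISO_8859_1)


def decode_with_literal():
    return data.decode("iso-8859-1")


def typo_in_literal():
    # Trailing "l" instead of "1": unknown codec name.
    return data.decode("iso-8859-l")


def typo_in_constant_name():
    # Trailing "l" instead of "1": the name is not defined.
    return data.decode(ISO_8859_l)


def raised(func):
    """Return the exception type raised by func(), or None."""
    try:
        func()
    except Exception as exc:
        return type(exc)
    return None
```

One practical difference remains: a linter can typically flag the undefined name before the code runs, while a wrong string literal only shows up at call time.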

@timgraham
Member

@loic, since you merged the original patch, I'm wondering if you have any feedback?


    Takes an URI in ASCII bytes (e.g. '/I%20%E2%99%A5%20Django/') and returns
    unicode containing the encoded result (e.g. '/I \xe2\x99\xa5 Django/').
    """
    if uri is None:
        return uri
    uri = force_bytes(uri)
    iri = unquote_to_bytes(uri) if six.PY3 else unquote(uri)

Member

Is this code based on an existing implementation? If that's the case, we should specify it / link to it.

Contributor Author

Yes, this is a modification of the unquote_to_bytes() function from the standard library, which was called here before.

Should I note this in a comment?

Member

Yep

Member

Also add in the comment what modification was needed.

Contributor Author

That does not seem very practical. This code is far enough removed from the original that it is a lot easier to just look at it and see what it does than to understand how it differs from the original. The only shared concept is “split the string at "%" and then look up the two characters after the "%" in a map”. But even the reason for this lookup is no longer the same.

Member

Fair enough, I assumed it was a small tweak to the stdlib version.
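The "split at "%" and look up the next two characters in a map" idea from this thread can be sketched as below. This is a simplified illustration for Python 3, not the merged implementation: `sketch_uri_to_iri`, `_ALLOWED`, and `_HEX_TO_BYTE` are hypothetical names, the set of decodable ASCII characters is assumed to be the RFC 3986 unreserved set, and invalid UTF-8 in the result simply raises `UnicodeDecodeError` instead of being handled.

```python
import string

# Assumption: only unreserved ASCII (ALPHA / DIGIT / "-" / "." / "_" / "~")
# and non-ASCII bytes (0x80-0xFF) are decoded; every other escape is kept.
_UNRESERVED = (string.ascii_letters + string.digits + "-._~").encode("ascii")
_ALLOWED = set(_UNRESERVED) | set(range(0x80, 0x100))

# Keys are the two hex digits after "%", in any mix of upper and lower case.
_HEX_TO_BYTE = {
    (hi + lo).encode("ascii"): bytes([int(hi + lo, 16)])
    for hi in string.hexdigits
    for lo in string.hexdigits
    if int(hi + lo, 16) in _ALLOWED
}


def sketch_uri_to_iri(uri: bytes) -> str:
    # Everything before the first "%" is copied through unchanged.
    head, *rest = uri.split(b"%")
    parts = [head]
    for item in rest:
        decoded = _HEX_TO_BYTE.get(item[:2])
        if decoded is not None:
            # Decodable escape: emit the byte, then the rest of the chunk.
            parts.append(decoded)
            parts.append(item[2:])
        else:
            # Reserved, "%" itself, or malformed: restore the "%" untouched.
            parts.append(b"%")
            parts.append(item)
    return b"".join(parts).decode("utf-8")
```

With this sketch, `sketch_uri_to_iri(b'/I%20%E2%99%A5%20Django/')` keeps the `%20` escapes and decodes only the heart character, which is the behavioural difference from a plain unquote.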

@timgraham
Member

@Chronial, would you like to continue this? Things might be simpler now that master supports Python 3 only. If there are outstanding questions, I'll try to learn enough to give a response if no one else can.

@Chronial
Contributor Author

Chronial commented Feb 1, 2017

@timgraham yes, I would still be very happy to work towards getting this merged.

Should I rebase this onto the head of master?

@timgraham
Member

Yes, please rebase and let me know if you need any questions answered.

@Chronial Chronial force-pushed the bugfix/26005 branch 3 times, most recently from 2df73d4 to 03d135e Compare February 7, 2017 14:55
@Chronial
Contributor Author

Chronial commented Feb 7, 2017

@timgraham done. This actually got quite a bit clearer, since 48c34f3 implemented the correct fix for what the broken 10b17a2 was attempting to do.

Member

@timgraham timgraham left a comment

A few cosmetic edits: http://dpaste.com/0ZTXH9X


    Takes an URI in ASCII bytes (e.g. '/I%20%E2%99%A5%20Django/') and returns
    a string containing the encoded result (e.g. '/I \xe2\x99\xa5 Django/').
    """
    if uri is None:
        return uri
    uri = force_bytes(uri)
    iri = unquote_to_bytes(uri)

    bits = uri.split(b'%')
Member

Could some inline comments help with understanding these lines?

@Chronial
Contributor Author

Chronial commented Feb 8, 2017

@timgraham Applied the requested changes.

@timgraham
Member

merged in 03281d8, thanks!

@timgraham timgraham closed this Feb 9, 2017
@Chronial
Contributor Author

Chronial commented Feb 9, 2017

Thanks :)
