Fixed #26005 -- Convert URIs to IRIs according to RFC #6428
django/utils/encoding.py
Outdated
_hextobyte = {('%02x' % char).encode(): six.int2byte(char)
              for char in _allowed_ascii}
_hextobyte.update(((a + b).encode(), six.int2byte(int(a + b, 16)))
                  for a in _hexdig[8:] for b in _hexdig)
I don't see much point in using a lazily computed global variable here. Can't this be moved to the module level?
This code is based on the unquote implementation in the standard library. I couldn't see any disadvantage of this approach, so I followed their example. But now that I'm looking at this again, this approach does not seem completely thread-safe.
The reason given in the standard library is memory. I don't know the exact memory overhead for the strings and the dict, but I would assume this structure takes less than 4 KB of memory.
If you think that's the better approach, I can move the initialization to module level.
Agreed, this is not thread-safe; there's a small race condition window. Since the dict is assigned to a global variable on first call anyway (ending up in non-collectable memory), I don't see the point of the lazy initialization.
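The race window can be sketched as follows, assuming the lazy version assigned the dict to the global name before the second update() step had run (which is what the two-statement snippet under review suggests). All names here are hypothetical stand-ins, not the code from the patch.

```python
_table = None  # hypothetical lazily filled global


def lookup_lazy(key):
    """Lazy variant: the global becomes visible before it is complete."""
    global _table
    if _table is None:
        _table = {b"41": b"A"}           # step 1: published, partially filled
        # A second thread entering lookup_lazy() at this point sees a
        # non-None table that still lacks the entries added in step 2.
        _table.update({b"E2": b"\xe2"})  # step 2
    return _table.get(key)


# Module-level variant: the table is complete once the import has finished.
_TABLE = {b"41": b"A", b"E2": b"\xe2"}


def lookup_eager(key):
    return _TABLE.get(key)
```

Both variants return the same results once initialization has finished; the difference is only in what a concurrent caller can observe during the first call.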
Moved to module level. Also discovered and fixed another bug with the previous implementation (uppercase hex encodings of ASCII letters were not decoded).
@timgraham IMHO it would be good if this could make the next release. This is an unavoidable backwards-incompatible change, so the sooner this is done, the smaller the chance that a lot of code depends on the old behavior.
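The uppercase bug can be illustrated with a cut-down table. This is a sketch, not the patch itself: `ALLOWED`, `lower_only`, and `both_cases` are hypothetical names, and only two letters are included to keep the output small.

```python
# Stand-in for the set of ASCII bytes that may be decoded (hypothetical).
ALLOWED = b"Jj"

# Keys built with "%02x" only: the hex digits are always lowercase.
lower_only = {("%02x" % c).encode(): bytes([c]) for c in ALLOWED}

# Keys built with both "%02x" and "%02X", at module level.
both_cases = {
    (fmt % c).encode(): bytes([c])
    for c in ALLOWED
    for fmt in ("%02x", "%02X")
}

print(sorted(lower_only))  # [b'4a', b'6a']
print(sorted(both_cases))  # [b'4A', b'4a', b'6A', b'6a']
```

With the first table, an escape written as %4A is never found, so an uppercase-encoded "J" would stay percent-encoded.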
django/test/client.py
Outdated
  # Under Python 3, non-ASCII values in the WSGI environ are arbitrarily
  # decoded with ISO-8859-1. We replicate this behavior here.
  # Refs comment in `get_bytes_from_wsgi()`.
- return path.decode(ISO_8859_1) if six.PY3 else path
+ return path.decode('iso-8859-1') if six.PY3 else path
Is this change needed?
No, I made it because IMHO this is better style. My understanding is that these constants only exist for py2/py3 compatibility. Their use was needed in the UTF_8 line. The new code only decodes on py3, so it doesn't need the constants and can just pass the string. This seemed strictly more Pythonic to me.
Should I change it back? :)
Not sure, I haven't looked into the history of the constants but I figured they were to prevent typos in the strings.
Won't both versions (constant/string) raise an exception on a typo?
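As a quick check of this point: a misspelled constant name and a misspelled codec string both fail when the line runs, just with different exception types. A minimal sketch, where `ISO_8859_l` (lowercase L instead of the digit 1) is a deliberate, hypothetical typo:

```python
raw = b"caf\xe9"


def decode_with_typo_in_constant():
    # ISO_8859_l is not defined anywhere, so evaluating the name
    # raises NameError.
    return raw.decode(ISO_8859_l)  # noqa: F821


def decode_with_typo_in_string():
    # The codec registry has no "iso-8859-l", so decode() raises LookupError.
    return raw.decode("iso-8859-l")


for func in (decode_with_typo_in_constant, decode_with_typo_in_string):
    try:
        func()
    except (NameError, LookupError) as exc:
        print(func.__name__, "->", type(exc).__name__)
```

One difference the sketch does not show: a linter can flag the undefined name without running the code, while the misspelled string is only caught at run time.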
Thanks, this inspired a separate cleanup: #8032.
@loic, since you merged the original patch, I'm wondering if you have any feedback?
django/utils/encoding.py
Outdated
    Takes an URI in ASCII bytes (e.g. '/I%20%E2%99%A5%20Django/') and returns
    unicode containing the encoded result (e.g. '/I \xe2\x99\xa5 Django/').
    """
    if uri is None:
        return uri
    uri = force_bytes(uri)
    iri = unquote_to_bytes(uri) if six.PY3 else unquote(uri)
Is this code based on an existing implementation? If that's the case, we should specify it / link to it.
Yes, this is a modification of the unquote_to_bytes() function from the standard library, which was called here before.
Should I note this in a comment?
Yep
Also add in the comment what modification was needed.
That does not seem very practical. This code is far enough removed from the original that it is a lot easier to just look at it and see what it does than to understand how it differs from the original. The only concept that is the same is “split the string at "%" and then look up the two characters after the "%" in a map”. But even the reason for this lookup is not the same anymore.
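The shared concept, "split at "%" and look up the next two characters in a map", can be sketched like this. It is an illustration under stated assumptions, not the code from this pull request: `HEX_TO_BYTE` and `selective_unquote` are hypothetical names, the map holds only the RFC 3986 unreserved ASCII characters plus all bytes from 0x80 upward, and no attempt is made to re-encode byte sequences that are not valid UTF-8.

```python
import string

# Unreserved ASCII characters per RFC 3986: letters, digits, "-", ".", "_", "~".
UNRESERVED = (string.ascii_letters + string.digits + "-._~").encode()

# Hypothetical map: two uppercase hex digits (as bytes) -> the decoded byte.
HEX_TO_BYTE = {
    ("%02X" % code).encode(): bytes([code])
    for code in list(UNRESERVED) + list(range(0x80, 0x100))
}


def selective_unquote(uri: bytes) -> bytes:
    """Decode only the percent-escapes that appear in HEX_TO_BYTE."""
    head, *rest = uri.split(b"%")
    parts = [head]
    for chunk in rest:
        # Normalize case so that %e2, %E2 and %4a, %4A are treated alike.
        decoded = HEX_TO_BYTE.get(chunk[:2].upper())
        if decoded is not None:
            parts.append(decoded + chunk[2:])
        else:
            # Not an escape we want to decode: put the "%" back untouched.
            parts.append(b"%" + chunk)
    return b"".join(parts)


print(selective_unquote(b"/I%20%E2%99%A5%20Django/").decode())
```

Under these assumptions, escapes for reserved characters such as %20 or %2F are left alone, which is where this differs from a plain unquote.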
Fair enough, I assumed it was a small tweak to the stdlib version.
@Chronial, would you like to continue this? Things might be simpler now that master supports Python 3 only. If there are outstanding questions, I'll try to learn enough to give a response if no one else can.
@timgraham yes, I would still be very happy to work towards getting this merged. Should I rebase this onto the head of master?
Yes, please rebase and let me know if you need any questions answered.
2df73d4 to 03d135e
03d135e to ee6219e
@timgraham done. This actually got quite a bit clearer, since 48c34f3 implemented the correct fix for what the broken 10b17a2 was attempting to do.
A few cosmetic edits: http://dpaste.com/0ZTXH9X
django/utils/encoding.py
Outdated
    Takes an URI in ASCII bytes (e.g. '/I%20%E2%99%A5%20Django/') and returns
    a string containing the encoded result (e.g. '/I \xe2\x99\xa5 Django/').
    """
    if uri is None:
        return uri
    uri = force_bytes(uri)
    iri = unquote_to_bytes(uri)

    bits = uri.split(b'%')
Could some inline comments help in understanding these lines?
@timgraham Applied the requested changes.
Merged in 03281d8, thanks!
Thanks :)
This fixes the broken implementation of 10b17a2. As that change touched the WSGI infrastructure, someone with an understanding of that should review this change.
The original change introduced its uri_to_iri function in two places: 10b17a2#diff-f6d1c75ec606389da5af6558bf57f171R129 and 10b17a2#diff-97160f50594424a40f2621d5a3c581ccL273.
In the first case the original author clearly didn't want uri_to_iri but a fancier unquote, so I replaced it with that. In the second case I am not so sure.
Also note that this is, in a certain way, a backwards-incompatible change. Code that assumed that uri_to_iri is a fancier unquote will break. The new code will also not do what the documentation said before, which is why I had to fix it. But the previous documentation was also self-contradictory, as explained in the Trac ticket.
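To make the difference from a plain unquote concrete, here is a small sketch using only the standard library. The example URI is made up for illustration. It shows that unquote decodes every escape, including the reserved "/" (written as %2F), so the decoded string no longer has the same path structure.

```python
from urllib.parse import unquote

# Hypothetical example: the last path segment contains an escaped slash.
uri = "/I%20%E2%99%A5%20Django/a%2Fb"
decoded = unquote(uri)

print(decoded)                  # /I ♥ Django/a/b
print(len(uri.split("/")))      # 3
print(len(decoded.split("/")))  # 4
```

Code that relied on this decode-everything behavior is the code the description warns will break.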