Fixed #26005 -- Convert URIs to IRIs according to RFC #6428

Chronial · 2016-04-07T19:03:47Z

This fixes the broken implementation of 10b17a2. As that change touched the WSGI infrastructure, someone who understands that area should review this change.
The original change introduced its uri_to_iri function in two places: 10b17a2#diff-f6d1c75ec606389da5af6558bf57f171R129 and 10b17a2#diff-97160f50594424a40f2621d5a3c581ccL273.

In the second case, the original author clearly didn't want uri_to_iri but a fancier unquote, so I replaced it with that. In the first case I am not so sure.

Also note that this is, in a sense, a backwards-incompatible change. Code that assumed that uri_to_iri is a fancier unquote will break. The new code also does not do what the documentation previously said, which is why I had to fix the documentation as well. But the previous documentation was also self-contradictory, as explained in the Trac ticket.
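To illustrate why a plain unquote is not a safe URI-to-IRI conversion, here is a small sketch using only the standard library. The example URI is made up for illustration. A full unquote also decodes reserved characters such as "/" and "?", which changes the structure of the URI, whereas RFC 3987 (section 3.2) asks that percent-encodings of "%", reserved characters, and characters not allowed in URIs be left untouched.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical URI: %2F and %3F are an escaped "/" and "?" that belong to
# a single path segment, %20 is a space, %E2%99%A5 is UTF-8 for a heart.
uri = b'/search%2Fall%3Fq=I%20%E2%99%A5%20Django'

# A full unquote decodes every escape, including the reserved characters,
# so the result now has an extra path separator and a query delimiter.
fully_unquoted = unquote_to_bytes(uri)
print(fully_unquoted)
```

Code that relied on uri_to_iri behaving like this full unquote is the code that would break.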

charettes · 2016-04-07T20:15:33Z

django/utils/encoding.py

+            _hextobyte = {('%02x' % char).encode(): six.int2byte(char)
+                          for char in _allowed_ascii}
+            _hextobyte.update(((a + b).encode(), six.int2byte(int(a + b, 16)))
+                              for a in _hexdig[8:] for b in _hexdig)


I don't see much point in using a lazily computed global variable here. Can't this be moved to the module level?

This code is based on the unquote implementation in the standard library. I couldn't see any disadvantage of this approach, so I followed their example. But now that I'm looking at this again, this approach does not seem completely thread-safe.

The reason given in the standard library is memory. I don't know the exact memory overhead for the strings and the dict, but I would assume this structure takes less than 4 KB of memory.

If you think that's the better approach, I can move the initialization to module level.

Agreed, this is not thread-safe; there's a small race condition window. Since the dict is assigned to a global variable on first call anyway (ending up in non-collectable memory), I don't see the point of the lazy initialization.

Moved to module level. Also discovered and fixed another bug in the previous implementation (uppercase hex encodings of ASCII letters were not decoded).

Chronial · 2016-05-02T09:21:55Z

@timgraham IMHO it would be good if this could make the next release. This is an unavoidable backwards-incompatible change, so the sooner this is done, the smaller the chance that a lot of code depends on the old behavior.

timgraham · 2016-05-07T20:33:03Z

django/test/client.py

        # Under Python 3, non-ASCII values in the WSGI environ are arbitrarily
        # decoded with ISO-8859-1. We replicate this behavior here.
        # Refs comment in `get_bytes_from_wsgi()`.
-        return path.decode(ISO_8859_1) if six.PY3 else path
+        return path.decode('iso-8859-1') if six.PY3 else path


Is this change needed?

No, I made it because IMHO this is better style. My understanding is that these constants only exist for py2/py3 compatibility. Their use was needed in the UTF_8 line. The new code only does decoding on py3, so it doesn't need the constants and can just pass the string. This seemed strictly more Pythonic to me.

Should I change it back? :)

Not sure, I haven't looked into the history of the constants but I figured they were to prevent typos in the strings.

Won't both versions (constant/string) raise an exception on a typo?
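A quick way to check this is the sketch below. The constant is redefined locally as a stand-in for the shared one, and `error_name` is a helper invented for the example. Both typos do fail at runtime, but with different exception types; a misspelled name can usually also be flagged by a static checker before the code runs, which a misspelled string literal cannot.

```python
ISO_8859_1 = 'iso-8859-1'  # local stand-in for the shared constant
path = b'/caf\xe9/'


def error_name(func):
    """Return the name of the exception raised by func, or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Typo in the constant name (lowercase "l" instead of "1"): undefined name.
constant_typo = error_name(lambda: path.decode(ISO_8859_l))

# Typo in the string literal: the codec lookup fails.
string_typo = error_name(lambda: path.decode('iso-8859-l'))

print(constant_typo, string_typo)
```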

Thanks, this inspired a separate cleanup: #8032.

timgraham · 2016-05-07T20:38:29Z

@loic, since you merged the original patch, I'm wondering if you have any feedback?

loic · 2016-05-08T05:03:46Z

django/utils/encoding.py


    Takes an URI in ASCII bytes (e.g. '/I%20%E2%99%A5%20Django/') and returns
    unicode containing the encoded result (e.g. '/I \xe2\x99\xa5 Django/').
    """
    if uri is None:
        return uri
    uri = force_bytes(uri)
-    iri = unquote_to_bytes(uri) if six.PY3 else unquote(uri)
+


Is this code based on an existing implementation? If that's the case, we should specify it / link to it.

Yes, this is a modification of the unquote_to_bytes() function from the standard library, which was called here before.

Should I note this in a comment?

Also add in the comment what modification was needed.

That does not seem very practical. This code is far enough removed from the original that it is a lot easier to just look at it and see what it does than to understand how it differs from the original. The only concept that is the same is “split the string at "%" and then look up the two characters after the "%" in a map”. But even the reason for this lookup is not the same anymore.

Fair enough, I assumed it was a small tweak to the stdlib version.

timgraham · 2017-01-31T15:46:43Z

@Chronial, would you like to continue this? Things might be simpler now that master supports Python 3 only. If there are outstanding questions, I'll try to learn enough to give a response if no one else can.

Chronial · 2017-02-01T14:38:00Z

@timgraham yes, I would still be very happy to work towards getting this merged.

Should I rebase this onto the head of master?

timgraham · 2017-02-06T17:46:00Z

Yes, please rebase and let me know if you need any questions answered.

Chronial · 2017-02-07T16:37:10Z

@timgraham done. This actually got quite a bit clearer, since 48c34f3 implemented the correct fix for what the broken 10b17a2 was attempting to do.

timgraham

A few cosmetic edits: http://dpaste.com/0ZTXH9X

timgraham · 2017-02-08T00:08:17Z

django/utils/encoding.py


    Takes an URI in ASCII bytes (e.g. '/I%20%E2%99%A5%20Django/') and returns
    a string containing the encoded result (e.g. '/I \xe2\x99\xa5 Django/').
    """
    if uri is None:
        return uri
    uri = force_bytes(uri)
-    iri = unquote_to_bytes(uri)
+
+    bits = uri.split(b'%')


Could some inline comments help in understanding these lines?
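As an illustration of the kind of commenting being asked for, here is a self-contained sketch of the split-and-lookup approach. It is not the merged code: `selective_unquote`, `_unreserved`, and the choice of the RFC 3986 unreserved set are assumptions for the example. It also stops at bytes and omits the re-escaping of bytes that do not form valid UTF-8, which a complete conversion would also need.

```python
import string

# Assumed lookup table: unreserved ASCII (both hex cases) plus bytes >= 0x80.
_unreserved = (string.ascii_letters + string.digits + '-._~').encode('ascii')
_hexdig = '0123456789ABCDEFabcdef'
_hextobyte = {
    (fmt % char).encode('ascii'): bytes([char])
    for char in _unreserved
    for fmt in ('%02x', '%02X')
}
_hextobyte.update({
    (a + b).encode('ascii'): bytes.fromhex(a + b)
    for a in _hexdig[8:]
    for b in _hexdig
})


def selective_unquote(uri: bytes) -> bytes:
    # Splitting on b'%' leaves the text before the first escape in bits[0];
    # every later chunk starts with whatever followed a '%'.
    bits = uri.split(b'%')
    if len(bits) == 1:
        # Common case: no '%' at all, nothing to decode.
        return uri
    parts = [bits[0]]
    for chunk in bits[1:]:
        code = chunk[:2]
        if code in _hextobyte:
            # A decodable escape: emit the byte, then the rest of the chunk.
            parts.append(_hextobyte[code])
            parts.append(chunk[2:])
        else:
            # Reserved, disallowed, or malformed: restore the '%' unchanged.
            parts.append(b'%')
            parts.append(chunk)
    return b''.join(parts)


print(selective_unquote(b'/I%20%E2%99%A5%20Django/'))
```

With this table, the escaped space stays as `%20` while the three bytes of the heart character are decoded.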

Chronial · 2017-02-08T18:41:13Z

@timgraham Applied the requested changes.

timgraham · 2017-02-09T14:29:53Z

merged in 03281d8, thanks!

Chronial · 2017-02-09T16:05:49Z

Thanks :)

charettes reviewed Apr 7, 2016

timgraham mentioned this pull request Apr 9, 2016

Fixes #26005 - uri_to_iri() perfoms percent decoding incorrectly #6085

Closed

timgraham reviewed May 7, 2016

loic reviewed May 8, 2016

Chronial force-pushed the bugfix/26005 branch 3 times, most recently from 2df73d4 to 03d135e Compare February 7, 2017 14:55

Fixed #26005 -- Convert URIs to IRIs according to RFC

ee6219e

Chronial force-pushed the bugfix/26005 branch from 03d135e to ee6219e Compare February 7, 2017 16:12

timgraham reviewed Feb 8, 2017


Chronial added 2 commits February 8, 2017 19:25

Apply suggested changes

0d252d9

Speed up uri_to_iri for the common case

9c3e573

timgraham closed this Feb 9, 2017

felixxm mentioned this pull request May 31, 2017

Fixed test_hyperlinked_related_lookup_url_encoded_exists. encode/django-rest-framework#5179

Merged
