Commit 76cd81d

orsenthil, gpshead, and serhiy-storchaka authored
bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595)
* issue43882 - urllib.parse should sanitize urls containing ASCII newline and tabs.

Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
1 parent 14fc2bd commit 76cd81d

4 files changed: 54 additions and 0 deletions

Doc/library/urllib.parse.rst

Lines changed: 13 additions & 0 deletions
@@ -312,6 +312,9 @@ or on combining URL components into a URL string.
    ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
    decomposed before parsing, no error will be raised.
 
+   Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
+   ``\n``, ``\r`` and tab ``\t`` characters are stripped from the URL.
+
    .. versionchanged:: 3.6
       Out-of-range port numbers now raise :exc:`ValueError`, instead of
       returning :const:`None`.
@@ -320,6 +323,10 @@ or on combining URL components into a URL string.
       Characters that affect netloc parsing under NFKC normalization will
       now raise :exc:`ValueError`.
 
+   .. versionchanged:: 3.10
+      ASCII newline and tab characters are stripped from the URL.
+
+.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
 
 .. function:: urlunsplit(parts)
 
@@ -674,6 +681,10 @@ task isn't already covered by the URL parsing functions above.
 
 .. seealso::
 
+   `WHATWG`_ - URL Living standard
+      Working Group for the URL Standard that defines URLs, domains, IP addresses, the
+      application/x-www-form-urlencoded format, and their API.
+
    :rfc:`3986` - Uniform Resource Identifiers
       This is the current standard (STD66). Any changes to urllib.parse module
       should conform to this. Certain deviations could be observed, which are
@@ -697,3 +708,5 @@ task isn't already covered by the URL parsing functions above.
 
    :rfc:`1738` - Uniform Resource Locators (URL)
       This specifies the formal syntax and semantics of absolute URLs.
+
+.. _WHATWG: https://url.spec.whatwg.org/

Lib/test/test_urlparse.py

Lines changed: 29 additions & 0 deletions
@@ -612,6 +612,35 @@ def test_urlsplit_attributes(self):
         with self.assertRaisesRegex(ValueError, "out of range"):
             p.port
 
+    def test_urlsplit_remove_unsafe_bytes(self):
+        # Remove ASCII tabs and newlines from input
+        url = "http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, "http")
+        self.assertEqual(p.netloc, "www.python.org")
+        self.assertEqual(p.path, "/javascript:alert('msg')/")
+        self.assertEqual(p.query, "")
+        self.assertEqual(p.fragment, "frag")
+        self.assertEqual(p.username, None)
+        self.assertEqual(p.password, None)
+        self.assertEqual(p.hostname, "www.python.org")
+        self.assertEqual(p.port, None)
+        self.assertEqual(p.geturl(), "http://www.python.org/javascript:alert('msg')/#frag")
+
+        # Remove ASCII tabs and newlines from input as bytes.
+        url = b"http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, b"http")
+        self.assertEqual(p.netloc, b"www.python.org")
+        self.assertEqual(p.path, b"/javascript:alert('msg')/")
+        self.assertEqual(p.query, b"")
+        self.assertEqual(p.fragment, b"frag")
+        self.assertEqual(p.username, None)
+        self.assertEqual(p.password, None)
+        self.assertEqual(p.hostname, b"www.python.org")
+        self.assertEqual(p.port, None)
+        self.assertEqual(p.geturl(), b"http://www.python.org/javascript:alert('msg')/#frag")
+
     def test_attributes_bad_port(self):
         """Check handling of invalid ports."""
         for bytes in (False, True):

Lib/urllib/parse.py

Lines changed: 6 additions & 0 deletions
@@ -78,6 +78,9 @@
                 '0123456789'
                 '+-.')
 
+# Unsafe bytes to be removed per WHATWG spec
+_UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n']
+
 # XXX: Consider replacing with functools.lru_cache
 MAX_CACHE_SIZE = 20
 _parse_cache = {}
@@ -469,6 +472,9 @@ def urlsplit(url, scheme='', allow_fragments=True):
         else:
             scheme, url = url[:i].lower(), url[i+1:]
 
+    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
+        url = url.replace(b, "")
+
     if url[:2] == '//':
         netloc, url = _splitnetloc(url, 2)
         if (('[' in netloc and ']' not in netloc) or
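
For readers who want to see the added loop in isolation, here is a minimal stand-alone sketch. The helper name `strip_unsafe_chars` is hypothetical and is not part of `urllib.parse`; only the constant and the `replace` loop are taken from the hunks above. Note that in this commit the loop runs after the scheme has already been split off, so it applies to the remainder of the URL.

```python
# Hypothetical helper for illustration only; urllib.parse does not expose this.
# The constant and the loop body mirror the lines added to Lib/urllib/parse.py.
_UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n']


def strip_unsafe_chars(url: str) -> str:
    """Return url with ASCII tab, carriage return, and newline removed."""
    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
        url = url.replace(b, "")
    return url


# Same input string as the new test case in Lib/test/test_urlparse.py.
cleaned = strip_unsafe_chars(
    "http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
)
print(cleaned)
```

Other whitespace, such as an ordinary space, is left untouched by this step; only the three listed characters are dropped.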
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+The presence of newline or tab characters in parts of a URL could allow
+some forms of attacks.
+
+Following the controlling specification for URLs defined by WHATWG
+:func:`urllib.parse` now removes ASCII newlines and tabs from URLs,
+preventing such attacks.
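
As a quick check of the behavior described in this entry, the following sketch calls `urllib.parse.urlsplit` on the same input used by the new test. It assumes an interpreter that already includes this change (3.10 per the `versionchanged` note above); on an older, unpatched interpreter the control characters would remain in the result.

```python
from urllib.parse import urlsplit

# Input taken from test_urlsplit_remove_unsafe_bytes in this commit.
url = "http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
parts = urlsplit(url)

# On an interpreter with this change, tabs and newlines are gone from every part.
print(repr(parts.path))
print(repr(parts.geturl()))
```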
