Commit 8ac7613

gh-102555: Fix comment parsing in HTMLParser according to the HTML5 standard (GH-135664)

* "--!>" now ends the comment.
* "-- >" no longer ends the comment.
* Support abnormally ended empty comments "<-->" and "<--->".

---------

Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
1 parent b582d75 commit 8ac7613

3 files changed: +50 -3 lines changed


Lib/html/parser.py

Lines changed: 17 additions & 1 deletion
@@ -29,7 +29,8 @@
 starttagopen = re.compile('<[a-zA-Z]')
 endtagopen = re.compile('</[a-zA-Z]')
 piclose = re.compile('>')
-commentclose = re.compile(r'--\s*>')
+commentclose = re.compile(r'--!?>')
+commentabruptclose = re.compile(r'-?>')
 # Note:
 # 1) if you change tagfind/attrfind remember to update locatetagend too;
 # 2) if you change tagfind/attrfind and/or locatetagend the parser will
@@ -336,6 +337,21 @@ def parse_html_declaration(self, i):
         else:
             return self.parse_bogus_comment(i)
 
+    # Internal -- parse comment, return length or -1 if not terminated
+    # see https://html.spec.whatwg.org/multipage/parsing.html#comment-start-state
+    def parse_comment(self, i, report=True):
+        rawdata = self.rawdata
+        assert rawdata.startswith('<!--', i), 'unexpected call to parse_comment()'
+        match = commentclose.search(rawdata, i+4)
+        if not match:
+            match = commentabruptclose.match(rawdata, i+4)
+            if not match:
+                return -1
+        if report:
+            j = match.start()
+            self.handle_comment(rawdata[i+4: j])
+        return match.end()
+
     # Internal -- parse bogus comment, return length or -1 if not terminated
     # see https://html.spec.whatwg.org/multipage/parsing.html#bogus-comment-state
     def parse_bogus_comment(self, i, report=1):
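
The effect of the two new patterns can be tried outside the parser. The following is a minimal sketch, not the library code: it copies the three regular expressions from the diff and wraps the same search-then-fallback logic in a hypothetical helper named `split_comment`, which returns the comment text and end index instead of calling `handle_comment`.

```python
import re

# Patterns copied from the diff: the old terminator and the two new ones.
old_commentclose = re.compile(r'--\s*>')
commentclose = re.compile(r'--!?>')
commentabruptclose = re.compile(r'-?>')


def split_comment(rawdata, i=0):
    """Hypothetical helper mirroring the logic of parse_comment().

    Returns (comment_text, end_index), or None when no terminator
    is found in rawdata.
    """
    assert rawdata.startswith('<!--', i)
    # First look for a full terminator: "-->" or "--!>".
    match = commentclose.search(rawdata, i + 4)
    if not match:
        # Fall back to an abrupt close (">" or "->") directly after "<!--".
        match = commentabruptclose.match(rawdata, i + 4)
        if not match:
            return None
    return rawdata[i + 4:match.start()], match.end()


if __name__ == '__main__':
    samples = ('<!--a--!>', '<!---- >-->', '<!-->', '<!--->', '<!-- open')
    for sample in samples:
        print(repr(sample), '->', split_comment(sample))
    # The old pattern treated "-- >" as a terminator.
    print(old_commentclose.search('<!---- >-->', 4).span())
```

In this sketch, as in the diff, the search for a full terminator runs first, so the abrupt-close fallback only applies when no `-->` or `--!>` occurs later in the buffer.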

Lib/test/test_htmlparser.py

Lines changed: 30 additions & 2 deletions
@@ -367,17 +367,45 @@ def test_comments(self):
         html = ("<!-- I'm a valid comment -->"
                 '<!--me too!-->'
                 '<!------>'
+                '<!----->'
                 '<!---->'
+                # abrupt-closing-of-empty-comment
+                '<!--->'
+                '<!-->'
                 '<!----I have many hyphens---->'
                 '<!-- I have a > in the middle -->'
-                '<!-- and I have -- in the middle! -->')
+                '<!-- and I have -- in the middle! -->'
+                '<!--incorrectly-closed-comment--!>'
+                '<!----!>'
+                '<!----!-->'
+                '<!---- >-->'
+                '<!---!>-->'
+                '<!--!>-->'
+                # nested-comment
+                '<!-- <!-- nested --> -->'
+                '<!--<!-->'
+                '<!--<!--!>'
+        )
         expected = [('comment', " I'm a valid comment "),
                     ('comment', 'me too!'),
                     ('comment', '--'),
+                    ('comment', '-'),
+                    ('comment', ''),
+                    ('comment', ''),
                     ('comment', ''),
                     ('comment', '--I have many hyphens--'),
                     ('comment', ' I have a > in the middle '),
-                    ('comment', ' and I have -- in the middle! ')]
+                    ('comment', ' and I have -- in the middle! '),
+                    ('comment', 'incorrectly-closed-comment'),
+                    ('comment', ''),
+                    ('comment', '--!'),
+                    ('comment', '-- >'),
+                    ('comment', '-!>'),
+                    ('comment', '!>'),
+                    ('comment', ' <!-- nested '), ('data', ' -->'),
+                    ('comment', '<!'),
+                    ('comment', '<!'),
+        ]
         self._run_check(html, expected)
 
     def test_condcoms(self):
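
To try the test inputs against whichever interpreter is installed, a small event collector is enough. This is a sketch under assumptions: `EventCollector` and `collect` are hypothetical names standing in for the test suite's own collector, and only `handle_comment` and `handle_data` are recorded. Inputs such as `<!--a--!>` or `<!---- >-->` give different events depending on whether the interpreter includes this commit, so the usage lines below stick to inputs whose result did not change.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical stand-in for the collector used by the test suite."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        # Record each comment body exactly as the parser reports it.
        self.events.append(('comment', data))

    def handle_data(self, data):
        # Record text that the parser did not treat as markup.
        self.events.append(('data', data))


def collect(html):
    """Feed html in one call and return the recorded events."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == '__main__':
    print(collect("<!-- I'm a valid comment --><!--me too!-->"))
    print(collect('<!-- <!-- nested --> -->'))
```

The second call illustrates the nested-comment case from the test: the comment ends at the first `-->`, and the trailing ` -->` is reported as data.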
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+Fix comment parsing in :class:`html.parser.HTMLParser` according to the
+HTML5 standard. ``--!>`` now ends the comment. ``-- >`` no longer ends the
+comment. Support abnormally ended empty comments ``<-->`` and ``<--->``.
