Skip to content

gh-135661: Fix CDATA section parsing in HTMLParser #135665

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Jun 18, 2025

"] ]>" and "]] >" no longer end the CDATA section.

"] ]>" and "]] >" no longer end the CDATA section.
@serhiy-storchaka serhiy-storchaka added type-security A security issue needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes labels Jul 4, 2025
Comment on lines 328 to 332
j = rawdata.find(']]>')
if j < 0:
return -1
self.unknown_decl(rawdata[i+3: j])
return j + 3
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the HTML5 standard (https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state), it should be either data or bogus comment (which ends with >, not ]]>), but this depends on the context. It may be that I incorrectly understand the HTML5 standard, because this part is difficult to implement.

Copy link
Member

@ezio-melotti ezio-melotti Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried copying the content of the tests in the following file:

<!DOCTYPE html>
<html>
<body>
<![CDATA[just some plain text]]><hr>
<![CDATA[<!-- not a comment -->]]><hr>
<![CDATA[&not-an-entity-ref;]]><hr>
<![CDATA[<not a='start tag'>]]><hr>
<![CDATA[]]><hr>
<![CDATA[[[I have many brackets]]]]><hr>
<![CDATA[I have a > in the middle]]><hr>
<![CDATA[I have a ]] in the middle]]><hr>
<![CDATA[] ]>]]><hr>
<![CDATA[]] >]]><hr>
<![CDATA[
    if (a < b && a > b) {
        printf("[<marquee>How?</marquee>]");
    }
]]><hr>

</body>
</html>

and this was the result on Firefox:

<html><head></head><body>
<!--[CDATA[just some plain text]]--><hr>
<!--[CDATA[<!-- not a comment ---->]]&gt;<hr>
<!--[CDATA[&not-an-entity-ref;]]--><hr>
<!--[CDATA[<not a='start tag'-->]]&gt;<hr>
<!--[CDATA[]]--><hr>
<!--[CDATA[[[I have many brackets]]]]--><hr>
<!--[CDATA[I have a --> in the middle]]&gt;<hr>
<!--[CDATA[I have a ]] in the middle]]--><hr>
<!--[CDATA[] ]-->]]&gt;<hr>
<!--[CDATA[]] -->]]&gt;<hr>
<!--[CDATA[
    if (a < b && a --> b) {
        printf("[<marquee>How?</marquee>]");
    }
]]&gt;<hr>



</body></html>
Image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and if you try <svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>, you will see that content between <![CDATA[ and ]]> is interpreted as a raw data.

This is context dependent.

HTMLParser is actually just a tokenizer. To determine the context automatically, it needs to support the stack of open elements and to know what elements are in the HTML namespace. This is all in the specification, and we will implement this in future. But this is a different level of complexity. So I solved the issue by letting the user to determine the context. New method support_cdata() sets how HTMLParser will parse CDATA. This is not good, but perhaps better than the current state.

j = rawdata.find(']]>')
if j < 0:
return -1
self.unknown_decl(rawdata[i+3: j])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.unknown_decl(rawdata[i+3: j])
self.unknown_decl(rawdata[i+3:j])

Comment on lines 328 to 332
j = rawdata.find(']]>')
if j < 0:
return -1
self.unknown_decl(rawdata[i+3: j])
return j + 3
Copy link
Member

@ezio-melotti ezio-melotti Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried copying the content of the tests in the following file:

<!DOCTYPE html>
<html>
<body>
<![CDATA[just some plain text]]><hr>
<![CDATA[<!-- not a comment -->]]><hr>
<![CDATA[&not-an-entity-ref;]]><hr>
<![CDATA[<not a='start tag'>]]><hr>
<![CDATA[]]><hr>
<![CDATA[[[I have many brackets]]]]><hr>
<![CDATA[I have a > in the middle]]><hr>
<![CDATA[I have a ]] in the middle]]><hr>
<![CDATA[] ]>]]><hr>
<![CDATA[]] >]]><hr>
<![CDATA[
    if (a < b && a > b) {
        printf("[<marquee>How?</marquee>]");
    }
]]><hr>

</body>
</html>

and this was the result on Firefox:

<html><head></head><body>
<!--[CDATA[just some plain text]]--><hr>
<!--[CDATA[<!-- not a comment ---->]]&gt;<hr>
<!--[CDATA[&not-an-entity-ref;]]--><hr>
<!--[CDATA[<not a='start tag'-->]]&gt;<hr>
<!--[CDATA[]]--><hr>
<!--[CDATA[[[I have many brackets]]]]--><hr>
<!--[CDATA[I have a --> in the middle]]&gt;<hr>
<!--[CDATA[I have a ]] in the middle]]--><hr>
<!--[CDATA[] ]-->]]&gt;<hr>
<!--[CDATA[]] -->]]&gt;<hr>
<!--[CDATA[
    if (a < b && a --> b) {
        printf("[<marquee>How?</marquee>]");
    }
]]&gt;<hr>



</body></html>
Image

* Add HTMLParser.support_cdata().
Copy link
Member

@encukou encukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand HTML enough to judge if this change is the right one.
I'm adding docs/changelog suggestions to clarify the behaviour, as I understand it.

Comment on lines +129 to +130
If *flag* is false, then the :meth:`handle_comment` method will be called
for ``<![CDATA[...>``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention the default behaviour.

Suggested change
If *flag* is false, then the :meth:`handle_comment` method will be called
for ``<![CDATA[...>``.
If *flag* is false, or if :meth:`!support_cdata` has not been called yet,
then the :meth:`handle_comment` method will be called for ``<![CDATA[...>``.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a weak point of such approach. It should be true by default to be able to parse valid HTML (when <![CDATA[...]]> is only used in foreign content) by default. But secure parsing needs to set it to false at the beginning and the set to true or false after every open or close tag, depending on the complex algorithm.

So we should set it to true by default if we keep this approach.

If *flag* is false, then the :meth:`handle_comment` method will be called
for ``<![CDATA[...>``.

.. versionadded:: 3.13.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention the previous behaviour.

Suggested change
.. versionadded:: 3.13.6
.. versionadded:: 3.13.6
Previously, :meth:`unknown_decl` was called for ``<![CDATA[...>``.

Comment on lines +1 to +2
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
By default, :meth:`~HTMLParser.handle_comment` is called for CDATA.
The old behavior (calling :meth:`~HTMLParser.unknown_decl`) can be restored
using a new method, :meth:`~HTMLParser.support_cdata`.

Comment on lines +178 to +179
def support_cdata(self, flag=True):
self._support_cdata = flag
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is the best way to handle the issue.
Since this solves a security issue, it also needs to be added to a bug fix release and backported to several other releases. Making the method private should be enough to solve the security issue in the short term without adding to the public API. This will also give us time to think about alternative (and possibly better) solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private means that users shouldn't call it. Given that nothing in html calls it, making it private is the same as not adding it at all.

It looks like the issue can't be solved in CPython alone -- users need to adapt their code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be solved in CPython, and I hope we will, but it looks like the solution will be too complex to risk backporting it to security-only branches. Providing a method to alter behavior we pass the ball to user side. This is not good, but otherwise the problem will left unsolved. And without closing this hole, all other security fixes for HTMLParser are worthless.

I will try other approach, but can't guarantee anything.

@encukou
Copy link
Member

encukou commented Jul 22, 2025

Here's an idea (though it might be naive): what if the parser called a handle_cdata_start method if it exists, and used its return value to steer the tokenization?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting core review DO-NOT-MERGE needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes type-security A security issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy