html5lib.treebuilders.dom.dom2sax crashes on 'xml:lang' attribute #6

@gsnedders

http://code.google.com/p/html5lib/issues/detail?id=200

Reported by vovanec, Mar 6, 2012

A simple test case (my program has a more complex handler implementation, but the problem is reproducible with the default handler):

import xml.sax.handler
import html5lib

def test(html):
    handler = xml.sax.handler.ContentHandler()
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom'))
    dom = parser.parse(html)
    html5lib.treebuilders.dom.dom2sax(dom, handler)

html = '<html xml:lang="en">'
test(html)

With html5lib 0.95 it produces the following traceback:

python test.py 
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    test(html)
  File "test.py", line 10, in test
    html5lib.treebuilders.dom.dom2sax(dom, handler)
  File "/home/vkuznets/packages/html5lib-0.95/html5lib-0.95/html5lib/treebuilders/dom.py", line 271, in dom2sax
    for child in node.childNodes: dom2sax(child, handler, nsmap)
  File "/home/vkuznets/packages/html5lib-0.95/html5lib-0.95/html5lib/treebuilders/dom.py", line 256, in dom2sax
    del attributes[(attr.namespaceURI, attr.nodeName)]
KeyError: (None, u'xml:lang')

With previous versions (at least 0.11) there is no error. I assume this attribute may be invalid in the XML namespace, but even so I don't think it is acceptable for the parser to simply crash. I have seen a lot of real-world HTML documents that have such an attribute.

Tested with Python 2.6.5 on Linux.

Please advise.

Thanks,
--Vladimir
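
One possible reading of the KeyError in the traceback above, sketched with the standard library only: if the `attributes` dictionary in `dom2sax` is built from minidom's `itemsNS()` (an assumption, not confirmed against the html5lib 0.95 source), its keys use the attribute's local name, while the failing `del` uses `attr.nodeName`, which keeps the `xml:` prefix. The names `attributes`, `lookup_key`, and `removed` below are illustrative, and the `pop()` at the end is a hypothetical defensive variant, not the project's actual fix.

```python
from xml.dom import minidom

# Build a tiny DOM by hand so the sketch does not depend on html5lib.
doc = minidom.Document()
root = doc.createElement("html")
doc.appendChild(root)

# setAttribute() is not namespace-aware, so namespaceURI stays None.
root.setAttribute("xml:lang", "en")
attr = root.getAttributeNode("xml:lang")

# itemsNS() keys each attribute by (namespaceURI, localName).
# ASSUMPTION: dom2sax builds its dictionary in a similar way.
attributes = dict(root.attributes.itemsNS())

# The key shown in the traceback is built from nodeName instead.
lookup_key = (attr.namespaceURI, attr.nodeName)

print("dictionary keys:", sorted(attributes))
print("key used by del:", lookup_key)
print("key present:", lookup_key in attributes)

# Hypothetical defensive variant: pop() with a default does not raise
# when the key is missing, unlike `del attributes[lookup_key]`.
removed = attributes.pop(lookup_key, None)
```

Under that assumption, the mismatch between `localName` and `nodeName` would account for the crash on any prefixed attribute that has no namespace URI, not only `xml:lang`.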
