lxml.html adds a find_class method to elements::

    >>> from lxml.etree import Comment
    >>> from lxml.html import document_fromstring, fragment_fromstring, tostring
    >>> from lxml.html import fragments_fromstring, fromstring
    >>> from lxml.html.clean import clean, clean_html
    >>> from lxml.html import usedoctest
    >>> try: unicode = unicode
    ... except NameError: unicode = str

    >>> h = document_fromstring('''
    ...
    ...
    ... P1
    ... P2
    ... ''')
    >>> print(tostring(h, encoding=unicode))
    P1 P2
    >>> print([e.text for e in h.find_class('fn')])
    ['P1']
    >>> print([e.text for e in h.find_class('vcard')])
    ['P1', 'P2']

Also added is ``find_rel_links``, which you can use to search for links by
their ``rel`` attribute::

    >>> h = document_fromstring('''
    ... test 1
    ...
    ... item 3
    ... item 4''')
    >>> print([e.attrib['href'] for e in h.find_rel_links('tag')])
    ['2', '4']
    >>> print([e.attrib['href'] for e in h.find_rel_links('nofollow')])
    []

Another method is ``get_element_by_id`` that does what it says::

    >>> print(tostring(fragment_fromstring('''
    ...
    ... stuff
    ... ''').get_element_by_id('test'), encoding=unicode))
    stuff

Or to get the content of an element without the tags, use text_content()::

    >>> el = fragment_fromstring('''
    ... This is a bold link''')
    >>> el.text_content()
    'This is a bold link'

Or drop an element (leaving its content) or the entire tree, like::

    >>> doc = document_fromstring('''
    ...
    ...
    ...
    ... This is a test of stuff.
    ...
    ...
    ... footer
    ...
    ... ''')
    >>> doc.get_element_by_id('link').drop_tag()
    >>> print(tostring(doc, encoding=unicode))
    This is a test of stuff.
    footer
    >>> doc.get_element_by_id('body').drop_tree()
    >>> print(tostring(doc, encoding=unicode))
    footer
Note, however, that comment text will not be merged into the tree when you
drop the comment.  Here, ``drop_tag()`` behaves exactly like ``drop_tree()``:

    >>> for comment in doc.getiterator(Comment):
    ...     comment.drop_tag()
    >>> print(tostring(doc, encoding=unicode))
    footer
In Python 3, it should be possible to parse strings given as bytes objects, at
least if an encoding is given.

    >>> from lxml.html import HTMLParser
    >>> enc = 'utf-8'
    >>> html_parser = HTMLParser(encoding=enc)
    >>> src = 'Test'.encode(enc)
    >>> doc = fromstring(src, parser=html_parser)
    >>> print(tostring(doc, encoding=unicode))
    Test
    >>> docs = fragments_fromstring(src, parser=html_parser)
    >>> len(docs)
    1
    >>> print(docs[0])
    Test

Bug 599318: Calling fromstring with a frameset fragment should not raise an
error; the whole document is returned.

    >>> import lxml.html
    >>> content='''
    ...
    ...
    ... '''
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))

Bug 599318: Calling fromstring with a div fragment should not raise an error;
only the element is returned.

    >>> import lxml.html
    >>> content=''
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))
Bug 599318: Calling fromstring with a head fragment should not raise an error;
the whole document is returned.

    >>> import lxml.html
    >>> content=''
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))

Bug 690319: Leading whitespace before doctype declaration should not raise an
error.

    >>> import lxml.html
    >>> content='''
    ...
    ...
    ... '''
    >>> etree_document = lxml.html.fromstring(content)
    >>> print(tostring(etree_document, encoding=unicode))
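
The last two cases can be sketched in the same way. The input strings below are hypothetical examples of a head fragment and of a document with whitespace before the doctype declaration. They are not the original test data.

```python
from lxml.html import fromstring

# Hypothetical head fragment: the root of the whole document is returned.
head_doc = fromstring('<head><title>t</title></head>')
head_root_tag = head_doc.tag
title_text = head_doc.findtext('head/title')

# Hypothetical document with whitespace before the doctype declaration.
content = '''
    <!DOCTYPE html>
    <html><body><p>ok</p></body></html>
'''
ws_doc = fromstring(content)
ws_root_tag = ws_doc.tag
ws_text = ws_doc.findtext('body/p')

print(head_root_tag, title_text)
print(ws_root_tag, ws_text)
```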