``lxml.html`` adds a ``find_class`` method to elements::
>>> from lxml.etree import Comment
>>> from lxml.html import document_fromstring, fragment_fromstring, tostring
>>> from lxml.html import fragments_fromstring, fromstring
>>> from lxml.html.clean import clean, clean_html
>>> from lxml.html import usedoctest
>>> try: unicode = unicode
... except NameError: unicode = str
>>> h = document_fromstring('''
...
...
... P1
... P2
... ''')
>>> print(tostring(h, encoding=unicode))
P1
P2
>>> print([e.text for e in h.find_class('fn')])
['P1']
>>> print([e.text for e in h.find_class('vcard')])
['P1', 'P2']
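As a standalone illustration of ``find_class``, here is a minimal sketch. The markup is hypothetical and chosen only so that one link carries both classes; the point is that matching is done per whitespace-separated class token, so ``not-fn`` does not count as ``fn``.

```python
from lxml.html import document_fromstring

# Hypothetical markup: two links share the "vcard" class, one is also "fn".
doc = document_fromstring(
    '<html><body>'
    '<a class="vcard fn url" href="foobar">P1</a>'
    '<a class="not-fn vcard" href="baz">P2</a>'
    '</body></html>'
)

# find_class() matches whole class tokens, not substrings.
fn_texts = [e.text for e in doc.find_class('fn')]
vcard_texts = [e.text for e in doc.find_class('vcard')]

print(fn_texts)     # ['P1']
print(vcard_texts)  # ['P1', 'P2']
```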
Also added is ``find_rel_links``, which you can use to search for links
by the value of their ``rel`` attribute::
>>> h = document_fromstring('''
... test 1
... item 2
... item 3
... item 4''')
>>> print([e.attrib['href'] for e in h.find_rel_links('tag')])
['2', '4']
>>> print([e.attrib['href'] for e in h.find_rel_links('nofollow')])
[]
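The following sketch shows the matching behaviour of ``find_rel_links`` on hypothetical markup: the comparison ignores case but looks at the whole ``rel`` value, so ``tagging`` is not treated as ``tag``.

```python
from lxml.html import document_fromstring

# Hypothetical links; only the rel attribute matters for the lookup.
doc = document_fromstring(
    '<a href="1">test 1</a>'
    '<a href="2" rel="tag">item 2</a>'
    '<a href="3" rel="tagging">item 3</a>'
    '<a href="4" rel="TAG">item 4</a>'
)

# Case-insensitive match on the full rel value.
tag_hrefs = [a.get('href') for a in doc.find_rel_links('tag')]
nofollow_hrefs = [a.get('href') for a in doc.find_rel_links('nofollow')]

print(tag_hrefs)       # ['2', '4']
print(nofollow_hrefs)  # []
```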
Another method is ``get_element_by_id``, which looks up an element by its
``id`` attribute::
>>> print(tostring(fragment_fromstring('''
...
... stuff
...
... ''').get_element_by_id('test'), encoding=unicode))
stuff
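A small sketch of ``get_element_by_id`` on a hypothetical fragment. It also shows the optional default argument: without a default, a missing id raises ``KeyError``.

```python
from lxml.html import fragment_fromstring

# Hypothetical fragment with a single identifiable element.
frag = fragment_fromstring('<div><span id="test">stuff</span></div>')

found = frag.get_element_by_id('test')
print(found.tag, found.text)  # span stuff

# A second positional argument is returned when the id is absent.
missing = frag.get_element_by_id('no-such-id', None)
print(missing)  # None

# Without a default, a missing id raises KeyError.
try:
    frag.get_element_by_id('no-such-id')
    raised = False
except KeyError:
    raised = True
```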
To get the text content of an element without the tags, use ``text_content()``::
>>> el = fragment_fromstring('''
... ''')
>>> el.text_content()
'This is a bold link'
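As a self-contained sketch of ``text_content()``, the hypothetical fragment below nests a bold element inside a link; the method concatenates the text of all descendants and discards the tags.

```python
from lxml.html import fragment_fromstring

# Hypothetical fragment with nested inline markup.
el = fragment_fromstring(
    '<div>This is <a href="foo">a <b>bold</b> link</a></div>'
)

# All descendant text is joined in document order; tags are dropped.
text = el.text_content()
print(text)  # This is a bold link
```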
You can also drop just an element's tag (keeping its content) with
``drop_tag()``, or drop the element together with its whole subtree with
``drop_tree()``::
>>> doc = document_fromstring('''
...
...
...
... This is a test of stuff.
...
...
... footer
...
... ''')
>>> doc.get_element_by_id('link').drop_tag()
>>> print(tostring(doc, encoding=unicode))
This is a test of stuff.
footer
>>> doc.get_element_by_id('body').drop_tree()
>>> print(tostring(doc, encoding=unicode))
footer
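The difference between the two methods can be sketched on a hypothetical fragment. The ids ``intro``, ``link``, and ``footer`` are assumptions made for this illustration.

```python
from lxml.html import fragment_fromstring

# Hypothetical fragment: an intro paragraph containing a link, plus a footer.
root = fragment_fromstring(
    '<div>'
    '<p id="intro">This is a <a id="link" href="link.html">test</a> of stuff.</p>'
    '<p id="footer">footer</p>'
    '</div>'
)

# drop_tag() removes only the <a> wrapper; its text stays in the paragraph.
root.get_element_by_id('link').drop_tag()
intro_text = root.get_element_by_id('intro').text_content()
print(intro_text)  # This is a test of stuff.

# drop_tree() removes the paragraph together with everything inside it.
root.get_element_by_id('intro').drop_tree()
remaining_ids = [child.get('id') for child in root]
print(remaining_ids)  # ['footer']
```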
Note, however, that comment text will not be merged into the tree when you
drop the comment. Here, ``drop_tag()`` behaves exactly like ``drop_tree()``:
>>> for comment in doc.getiterator(Comment):
...     comment.drop_tag()
>>> print(tostring(doc, encoding=unicode))
footer
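A minimal sketch of the comment case, using a hypothetical fragment. The comment's own text disappears when it is dropped, while the text that follows the comment is kept. The comments are collected into a list first so that the tree is not modified while it is being iterated.

```python
from lxml.etree import Comment
from lxml.html import fragment_fromstring

# Hypothetical fragment with a comment between two pieces of text.
root = fragment_fromstring('<div>before<!-- note -->after</div>')

# Collect first, then drop, to avoid mutating the tree during iteration.
for comment in list(root.iter(Comment)):
    comment.drop_tag()

remaining_comments = list(root.iter(Comment))
text = root.text_content()
print(text)  # beforeafter
```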
In Python 3, it should be possible to parse input given as bytes objects, at
least if an encoding is given.
>>> from lxml.html import HTMLParser
>>> enc = 'utf-8'
>>> html_parser = HTMLParser(encoding=enc)
>>> src = 'Test'.encode(enc)
>>> doc = fromstring(src, parser=html_parser)
>>> print(tostring(doc, encoding=unicode))
Test
>>> docs = fragments_fromstring(src, parser=html_parser)
>>> len(docs)
1
>>> print(docs[0])
Test
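To make the role of the encoding visible, the sketch below parses hypothetical UTF-8 bytes containing non-ASCII characters. The parser is told the encoding explicitly, so the element text comes back as correctly decoded text.

```python
from lxml.html import HTMLParser, fromstring

enc = 'utf-8'
parser = HTMLParser(encoding=enc)

# Hypothetical input: bytes, not str, with non-ASCII characters.
src = '<p>Grüße</p>'.encode(enc)

# A single-element fragment is returned as that element.
el = fromstring(src, parser=parser)
print(el.tag)   # p
print(el.text)  # Grüße
```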
Bug 599318: Calling ``fromstring`` with a frameset fragment should not raise
an error; the whole document is returned.
>>> import lxml.html
>>> content='''
... '''
>>> etree_document = lxml.html.fromstring(content)
>>> print(tostring(etree_document, encoding=unicode))
Bug 599318: Calling ``fromstring`` with a div fragment should not raise an
error; only the element is returned.
>>> import lxml.html
>>> content=''
>>> etree_document = lxml.html.fromstring(content)
>>> print(tostring(etree_document, encoding=unicode))
Bug 599318: Calling ``fromstring`` with a head fragment should not raise an
error; the whole document is returned.
>>> import lxml.html
>>> content=''
>>> etree_document = lxml.html.fromstring(content)
>>> print(tostring(etree_document, encoding=unicode))
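The three cases above can be compared side by side by looking at the tag of the returned root. This is a sketch with hypothetical inputs: a lone ``div`` comes back as the element itself, while head and frameset fragments come back wrapped in the full ``html`` document.

```python
import lxml.html

# A single body-level element is returned on its own.
div_root = lxml.html.fromstring('<div></div>')

# A head fragment implies document structure, so the document is returned.
head_root = lxml.html.fromstring('<head><title>t</title></head>')

# A frameset fragment has no body to unwrap, so the document is returned.
frameset_root = lxml.html.fromstring(
    '<frameset><frame src="main.php" name="srcpg"></frameset>'
)

print(div_root.tag)       # div
print(head_root.tag)      # html
print(frameset_root.tag)  # html
```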
Bug 690319: Leading whitespace before the doctype declaration should not raise an error.
>>> import lxml.html
>>> content='''
...
...
... '''
>>> etree_document = lxml.html.fromstring(content)
>>> print(tostring(etree_document, encoding=unicode))
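A short sketch of the doctype case with hypothetical content: the input starts with a newline and spaces before the doctype, is still recognised as a full document, and parses without error.

```python
import lxml.html

# Hypothetical document with whitespace before the doctype declaration.
content = '\n  <!DOCTYPE html>\n<html><body><p>hi</p></body></html>\n'

root = lxml.html.fromstring(content)
paragraph = root.find('body/p')

print(root.tag)        # html
print(paragraph.text)  # hi
```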