Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it’ll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you’ve parsed your document into an lxml tree, you can use the .xpath()
method to search for elements.
try: # Python 2 from urllib2 import urlopen except ImportError: from urllib.request import urlopen from lxml import etree url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html" response = urlopen(url) htmlparser = etree.HTMLParser() tree = etree.parse(response, htmlparser) tree.xpath(xpathselector)
There is also a dedicated lxml.html()
module with additional functionality.
Note that in the above example I passed the response
object directly to lxml
, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests
library, you want to set stream=True
and pass in the response.raw
object after enabling transparent transport decompression:
import lxml.html import requests url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html" response = requests.get(url, stream=True) response.raw.decode_content = True tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector
class translates CSS statements into XPath expressions, making your search for td.empformbody
that much easier:
from lxml.cssselect import CSSSelector td_empformbody = CSSSelector('td.empformbody') for elem in td_empformbody(tree): # Do something with these table cells.
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'): # Do something with these table cells.