【Question title】: python parse html table using lxml
【Posted】: 2013-12-23 12:05:24
【Question】:

I have an HTML table like this:

<TABLE>
<TR>
    <TD><P>Name</P></TD>
    <TD><P>Fees</P></TD>
    <TD><P>Awards</P></TD>
    <TD><P>Total</P></TD>
</TR>
<TR>
    <TD><P>Tony</P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>
<TR>
    <TD><P>Paul</FONT></P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>
<TR>
    <TD><P>Richard</P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>

</TR>
</TABLE>

I want to extract the values from the table. I tried the following:

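import lxml.html
# Assumption: html_table holds the markup above as a string, so use
# fromstring(); lxml.html.parse() expects a filename, URL, or file object.
html = lxml.html.fromstring(html_table)
# Text nodes that are direct children of <td> (misses text inside <p>)
text_value = html.xpath('//tr/td/text()')
# <p> elements nested inside <td>
packages = html.xpath('//tr/td/p')
p_content = [p.text_content() for p in packages]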

Is there a way to extract both the <p> text and the <td> text into a single list?

【Comments】:

Tags: python html html-table lxml


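【Solution 1】:

You can do something like this, using //text() to select every text node below each <td>, whether or not it is wrapped in a <p>: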

>>> doc = """<TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>> 
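
If a flat list is awkward to work with, the cells can also be grouped by row. The following is a minimal sketch, assuming markup shaped like the table in the question (shortened here); the names table_html and rows are illustrative and not part of the original answer. It relies on text_content(), which returns the concatenated text of an element and its descendants, so it handles both the <td><p>...</p></td> cells and the bare <td>...</td> cells.

```python
import lxml.html

# Shortened copy of the table from the question (assumption: same structure).
table_html = """<TABLE>
<TR>
    <TD><P>Name</P></TD>
    <TD><P>Fees</P></TD>
</TR>
<TR>
    <TD><P>Tony</P></TD>
    <TD >7,800</TD>
</TR>
</TABLE>"""

root = lxml.html.fromstring(table_html)

# One inner list per <tr>; text_content() flattens any nested <p> text.
rows = [
    [td.text_content().strip() for td in tr.xpath('./td')]
    for tr in root.xpath('//tr')
]
print(rows)
```

The first inner list then holds the header cells, and each following list holds one data row.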

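If the document contains two tables, you can loop over the tables first and then, for each table, select its descendant text nodes with a relative XPath expression (one with a leading .):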

>>> doc = """<TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>
... <TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400', 'Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>> for tbl in root.xpath('//table'):
...     elements = tbl.xpath('.//tr/td//text()')
...     print(elements)
... 
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>> 
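
The leading . is what keeps each query inside its own table. As a small sketch of the difference (the two_tables markup and the variable names are made up for illustration), compare the relative and the absolute form of the same expression when both are evaluated on each table element:

```python
import lxml.html

# Two tiny tables (illustrative data, not from the original post).
two_tables = """<div>
<table><tr><td>a</td><td>b</td></tr></table>
<table><tr><td><p>c</p></td><td>d</td></tr></table>
</div>"""

root = lxml.html.fromstring(two_tables)
tables = root.xpath('//table')

# Relative path: only text nodes below this particular table.
relative = [tbl.xpath('.//tr/td//text()') for tbl in tables]

# Absolute path: evaluated from the document root, so every table
# yields the cells of all tables.
absolute = [tbl.xpath('//tr/td//text()') for tbl in tables]

print(relative)
print(absolute)
```

With the relative form each table contributes only its own cells; with the absolute form both iterations return the same combined list.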

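【Comments】:

  • Thanks for your answer, @paul t. It works very well. I have another question, which I also raised as a comment: "Is it possible to parse multiple tables in an HTML document into separate lists using lxml?" Could you help me with that?
  • @kishorekdty, I just added an example that loops over the tables.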