【Question title】: python beautifulsoup parsing recursing
【Posted】: 2016-05-23 20:01:41
【Question】:

I'm a python/BeautifulSoup beginner, and I'm trying to extract everything inside <td width="473" valign="top"> -> <strong>.

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl" lang="pl">
<head>
    <title>MIEJSKI OŚRODEK KULTURY W ŻORACH Repertuar Kina Na Starówce</title>
</head>
<body>
<div class="page_content">
<p>&nbsp;</p>
<p>
<table style="width: 450px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="57" valign="top">
<p align="center"><strong>Data</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>Tytuł Filmu</strong></p>
</td>
<td width="95" valign="top">
<p align="center"><strong>Godzina</strong></p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong>&nbsp;</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>1 - 5.05</strong></p>
</td>
<td width="95" valign="top">
<p align="center">&nbsp;</p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong>1</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>KINO POWT&Oacute;REK: ZWIERZOGR&Oacute;D </strong>USA/b.o&nbsp; cena 10 zł</p>
</td>
<td width="95" valign="top">
<p align="center">16:30</p>
</td>
</tr>

</tbody>
</table>
</p>
</body>
</html>

The furthest I got was a list of all the tags, using this code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("zory1.html"), "html.parser")

y = soup.find_all(width="473")

newy = str(y)

newsoup = BeautifulSoup(newy ,"html.parser")
stronglist = newsoup.find_all('strong')

lasty = str(stronglist)

lastsoup = BeautifulSoup(lasty , "html.parser")

lst = soup.find_all('strong')

for item in lst:
    print(item)

How do I get the content inside the tags, at a beginner level?

Thanks

【Comments】:

  • I've heard that lxml.cssselect is great for doing this without anything hairy.

Tags: python beautifulsoup


【Solution 1】:

Use get_text() to get the text of a node.

A complete working example, in which we iterate over all the rows and all the cells in the table:

from bs4 import BeautifulSoup

data = """your HTML here"""
soup = BeautifulSoup(data, "html.parser")

for row in soup.find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

Prints:

['Data', 'Tytuł Filmu', 'Godzina']
['', '1 - 5.05', '']
['1', 'KINO POWTÓREK: ZWIERZOGRÓDUSA/b.o\xa0 cena 10 zł', '16:30']
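
In the last row of that output the title and the following text run together ("ZWIERZOGRÓDUSA"), because get_text(strip=True) joins the stripped text fragments with an empty string. A minimal sketch of one way to keep them apart, assuming a single space is an acceptable separator; the `html` string is a trimmed copy of one cell from the question, not the full page:

```python
from bs4 import BeautifulSoup

# Trimmed copy of one cell from the question's HTML (illustration only).
html = """
<td width="473" valign="top">
<p align="center"><strong>KINO POWT&Oacute;REK: ZWIERZOGR&Oacute;D </strong>USA/b.o&nbsp; cena 10 zł</p>
</td>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# strip=True alone: each text fragment is stripped, then joined with "".
joined = cell.get_text(strip=True)

# A separator as the first argument is placed between the stripped fragments.
spaced = cell.get_text(" ", strip=True)

print(repr(joined))
print(repr(spaced))
```

The separator only goes between fragments that come from different text nodes, so it does not touch the spacing inside a single string.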

【Comments】:

  • You may want [td.text.strip() for td in soup.select('td[width="473"]')] to find only the specific td cells. Why doesn't select support multiple attributes?

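The selector approach from the comment above can be sketched as follows. This assumes a recent Beautiful Soup release, where select() is backed by the soupsieve package; there, attribute selectors can be chained on one element and the attribute value is best quoted. The `html` string is a shortened stand-in for the question's table, not the full page:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one row of the question's table (illustration only).
html = """
<table><tbody>
<tr>
<td width="57" valign="top"><p><strong>1</strong></p></td>
<td width="473" valign="top"><p><strong>KINO POWT&Oacute;REK: ZWIERZOGR&Oacute;D </strong>USA/b.o</p></td>
<td width="95" valign="top"><p>16:30</p></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Two chained attribute selectors on the td, then a descendant <strong>.
selector = 'td[width="473"][valign="top"] strong'
titles = [tag.get_text(strip=True) for tag in soup.select(selector)]

print(titles)
```

The descendant part of the selector reaches the <strong> tags directly, so no second parsing pass over str() output is needed.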
【Solution 2】:

Here you go:

from bs4 import BeautifulSoup

navigator = BeautifulSoup(open("zory1.html"), "html.parser")

tds = navigator.find_all("td", {"width":"473"})

resultList = [item.strong.get_text() for item in tds]

for item in resultList:
    print(item)

Result:

$ python test.py
Tytuł Filmu
1 - 5.05
KINO POWTÓREK: ZWIERZOGRÓD 
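
One caveat with item.strong: it evaluates to None when a cell has no <strong> tag, and calling get_text() on None raises AttributeError. Every width="473" cell in the question's sample does contain one, so the code above works on it. A hedged sketch of a guarded variant; the middle row of the `html` string is a hypothetical cell added to show the case and is not in the original page:

```python
from bs4 import BeautifulSoup

# The second row is hypothetical: a matching cell without a <strong> tag.
html = """
<table>
<tr><td width="473"><p><strong>Tytuł Filmu</strong></p></td></tr>
<tr><td width="473"><p>no strong tag here</p></td></tr>
<tr><td width="473"><p><strong>1 - 5.05</strong></p></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for td in soup.find_all("td", {"width": "473"}):
    strong = td.find("strong")  # None when the cell has no <strong>
    if strong is not None:
        titles.append(strong.get_text(strip=True))

print(titles)
```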

【Comments】:
