Text information not scraped properly - Python: answers

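【Question title】: Text information not scraped properly - Python
【Posted】: 2016-12-29 15:20:02
【Question】:

I need to scrape the text between the HTML tags below. My code below does not work properly when the tag and class names are the same. Here I need the text as a single list element, not as two separate list elements. The code I have was written for the case where the text is not split as shown below. In my case, I need to scrape both kinds of text and append them to a single list.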

Sample HTML (the text arrives as one list element) - works correctly:

<DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">The board of Hillshire Brands has withdrawn its recommendation to acquire frozen foods maker Pinnacle Foods, clearing the way for Tyson Foods' $8.55bn takeover bid.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Last Monday Tyson won the bidding war for Hillshire, maker of Ball Park hot dogs, with a $63-a-share offer, topping rival poultry processor Pilgrim's Pride's $7.7bn bid.</SPAN></P>

Sample HTML (the text arrives as two list elements):

<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>

Python code:

from itertools import groupby
from bs4 import BeautifulSoup
from lxml import html

# `response` is assumed to hold the HTML text of the downloaded page
soup = BeautifulSoup(response, 'html.parser')
tree = html.fromstring(response)
# One joined string per <div class="c5">, so text spread over two divs yields two items
values = [[''.join(text for text in div.xpath('.//p[@class="c9"]//span[@class="c2"]//text()'))] for div in tree.xpath('//div[@class="c5"]') if div.getchildren()]
split_at = ','
textvalues = [list(g) for k, g in groupby(values, lambda x: x != split_at) if k]
list2 = [x for x in textvalues[0] if x]

def purify(list2):
    # Recursively drop empty lists and empty strings
    for (i, sl) in enumerate(list2):
        if type(sl) == list:
            list2[i] = purify(sl)
    return [i for i in list2 if i != [] and i != '']

list3 = purify(list2)
flattened = [val for sublist in list3 for val in sublist]

Current output:

["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi","--Remaining text--"]

Expected sample output:

["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi --Remaining text--"]

Please help me solve the problem above.

【Comments】:

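You can do this to get the expected output - flattened = [' '.join(map(str,flattened))]
I need to append the two list elements on their own. In practice, though, my other final list elements also get joined in, which gives me the wrong result.
I didn't follow you. Can you post the error you are getting?
I need to append the two list elements on their own. In practice, though, my other final list elements also get joined in, which gives me the wrong result. The result is ['1 of 80 DOCUMENTS','']. All list elements are joined into a single element. Suppose the element text of all 80 documents comes out as one single line of text.
Did you remove the existing statement - flattened = [val for sublist in list3 for val in sublist] - and try the code above instead? I meant for it to be added after that statement.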
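
To make the suggestion in the comments concrete, here is a minimal sketch of the join applied after the flattening step. The sample list is an assumption that mirrors the question's "current output"; real values would come from the scraper.

```python
# Sample values copied from the question's "current output" (placeholder data).
flattened = [
    "M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi",
    "--Remaining text--",
]

# The commenter's suggestion: collapse every element into one space-separated string.
merged = [" ".join(map(str, flattened))]
print(merged)
```

Note that this joins everything in `flattened`, so if the list holds text from several documents they all end up in the same element, which appears to be the problem the asker describes.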

Tags: python html beautifulsoup html-parsing

【Solution 1】:

Something like this?

from bs4 import BeautifulSoup
a="""
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
"""
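# Parse the markup and split the page text on newlines.
# The 'lxml' parser is named explicitly here; the original call left the parser unspecified.
l = BeautifulSoup(a, 'lxml').text.split('\n')
# Drop the first line (the HIGHLIGHT paragraph) and join the rest into one list element
b = [' '.join(l[1:])]
print(b)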

Output:

[u"M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi  Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago. But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food. Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0 "]

【Discussion】:

I can't split the text on newlines here, because I need to scrape many columns and each column is separated by a '\n' character.
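
One way to avoid depending on newlines at all is to select the paragraphs by tag and class and join their text. The following is only a sketch under assumptions: the sample markup is a shortened stand-in for the question's HTML, and the names `sample`, `parts` and `merged` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (placeholder text).
sample = """<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers.</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>"""

soup = BeautifulSoup(sample, "html.parser")

parts = []
# Only <p class="c9"> paragraphs are kept, so the HIGHLIGHT paragraph (class c6) is skipped.
for p in soup.find_all("p", class_="c9"):
    spans = p.find_all("span", class_="c2")
    # Join the span texts, turn non-breaking spaces into plain spaces, trim the ends.
    text = "".join(s.get_text() for s in spans).replace("\xa0", " ").strip()
    if text:
        parts.append(text)

# A single list element that spans both <div class="c5"> blocks.
merged = [" ".join(parts)]
print(merged)
```

Because the selection works on tags and classes rather than on line breaks, '\n' characters used as column separators elsewhere in the page do not affect it.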

【Solution 2】:

from lxml import etree

text = '''<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>'''

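# Parse the fragment; the HTML parser lower-cases tag and attribute names
html = etree.HTML(text)

# Text of every <span class="c2"> whose parent element has class "c9"
res = html.xpath('//span[@class="c2" and ../@class="c9"]/text()')

# Concatenate all text nodes into a single list element
print([''.join(res)])

Output: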

 ["M&A simmers as producers swallow up brands to win shelf space, writes Neil MunhsiPickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0"]

【Discussion】:

It is still a single line rather than multiple lines.
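
If "multiple lines" here means one paragraph per line inside the single list element, a variation of the XPath approach could collect the text paragraph by paragraph and join with '\n'. This is a sketch based on that reading of the comment, not a confirmed fix; the sample markup and the names `paragraphs` and `merged` are assumptions.

```python
from lxml import etree

# Shortened stand-in for the question's markup (placeholder text).
text = '''<DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">First paragraph.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Second paragraph.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P></DIV>'''

root = etree.HTML(text)

paragraphs = []
for p in root.xpath('//p[@class="c9"]'):
    # Join the text nodes of this paragraph's c2 spans only.
    joined = "".join(p.xpath('.//span[@class="c2"]//text()'))
    # Replace non-breaking spaces and trim, so spacer spans leave no residue.
    cleaned = joined.replace("\xa0", " ").strip()
    if cleaned:
        paragraphs.append(cleaned)

# Still one list element, but with one paragraph per line.
merged = ["\n".join(paragraphs)]
print(merged)
```

Keeping `paragraphs` as its own list also leaves the choice of separator (space, newline, or none) to the final join.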