Answers: Problem with printing in a single line extracted text from multiple links that are contained in divs with the same class

【Question title】: Problem with printing in a single line extracted text from multiple links that are contained in divs with the same class
【Posted】: 2019-08-30 20:40:47
【Question description】:

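I am trying to extract text from a page that has multiple divs with the same class. Each div contains a different number of text links. The text extracted from each div needs to be printed on a single line.

For example, if one div contains three links and another div contains two links, I want to extract the text from the three links in the first div and print the result on one line, then extract the text from the two links in the second div and print it on a new line. I also want to store the extracted data in an array, with one item per div.

The code below prints the combined data correctly, but in addition to the extracted text it also prints the <a> tags and the URLs. I tried adding the text attribute (content.text) but got the following error:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?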

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    html = urlopen("URL")
    bs = BeautifulSoup(html.read(), "html.parser")    
    int_array = []
    int_data = bs.findAll("div", {"class": "new_titles"})
    for div in int_data:
        content = div.find_all("a")
        int_array.append(content)
        print(content)

【Question discussion】:

Does this answer your question? Beautiful Soup: 'ResultSet' object has no attribute 'find_all'?

Tags: python beautifulsoup

【Solution 1】:

Try the code below. I think this is what you are looking for.

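    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("URL")  # placeholder URL, as in the question
    bs = BeautifulSoup(html.read(), "html.parser")
    int_array = []
    int_data = bs.find_all("div", {"class": "new_titles"})
    for div in int_data:
        # Take the stripped text of every link in this div
        item = [a.text.strip() for a in div.find_all("a")]
        # Join the link texts so each div prints on one line
        content = ' '.join(item)
        int_array.append(content)
        print(content)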
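
To try this approach without fetching a page, here is a minimal sketch that runs on an inline HTML string. The markup, the link texts, and the helper name `titles_per_div` are made up for illustration; only the `new_titles` class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
SAMPLE_HTML = """
<div class="new_titles"><a href="/a">Alpha</a> <a href="/b">Beta</a> <a href="/c">Gamma</a></div>
<div class="new_titles"><a href="/d">Delta</a> <a href="/e">Epsilon</a></div>
<div class="other"><a href="/x">Ignored</a></div>
"""


def titles_per_div(html, css_class="new_titles"):
    """Return one space-joined string of link texts per matching div."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.find_all("div", class_=css_class):
        # Each <a> is a single Tag, so its text can be read directly
        texts = [a.get_text(strip=True) for a in div.find_all("a")]
        rows.append(" ".join(texts))
    return rows


rows = titles_per_div(SAMPLE_HTML)
for row in rows:
    print(row)
```

The returned list holds one string per div, which matches the requirement of storing each div's combined text as a single array item.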

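【Discussion】:

Thanks to Thomas and Kunduk for the help. I tested both solutions on my URL. Kunduk's solution works exactly as expected. Thomas's solution did not work right away with my URL, so I will accept Kunduk's solution. Thank you both for taking the time to help me with this.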

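【Solution 2】:

The error message says it all: if you simply append .text, you are treating a list of hyperlinks (div.find_all("a") gives you many) as a single item.

Just as with the <div> elements, you need to iterate over the links and use the text of each individual link.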

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("https://stackoverflow.com/questions/57732994/problem-with-printing-in-a-single-line-extracted-text-from-multiple-links-that-a/57733094?noredirect=1#comment101906332_57733094")
    bs = BeautifulSoup(html.read(), "html.parser")
    int_data = bs.find_all("div")
    for div in int_data:
        int_array = []  # link texts for this div only
        content = div.find_all("a")
        for link in content:
            # Use each link's own text and drop embedded line breaks
            int_array.append(link.text.replace("\n", "").replace("\r", ""))
        # The *** markers show where each div's line starts and ends
        print("***" + " ".join(int_array) + "***")
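
The difference between the list returned by find_all() and its individual elements can be seen in a small offline sketch. The markup and the variable names here are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
soup = BeautifulSoup(
    '<div><a href="/a">One</a><a href="/b">Two</a></div>', "html.parser"
)
links = soup.div.find_all("a")  # a list-like ResultSet, not a single tag

try:
    links.text  # the list itself has no .text attribute
    failed = False
except AttributeError:
    failed = True

# Each element of the ResultSet is a Tag, which does have .text
texts = [link.text for link in links]
print(failed, texts)
```

Reading .text on the whole result raises the error from the question, while reading it on each link inside a loop or comprehension works.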

【Discussion】:

@Menachem: I added a complete example that works on this SO page.