Beautiful Soup and the find_all method not listing all tags in text file: answers

【Question title】: Beautiful Soup and the find_all method not listing all tags in text file
【Posted】: 2021-12-31 02:13:01
【Question】:

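I am trying to scrape a website that I saved into a local html file. When I use the find_all() method, I can display the text of all the tags in the Python output. The problem is that I cannot get all of the text to show up in the .txt file.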

from bs4 import BeautifulSoup

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()

    soup = BeautifulSoup(content, 'lxml')
    interests = soup.find_all('h2')
    for interest in interests:
        with open ('interest.txt', 'w') as file:
            file.write(f'{interest.text}')
        print(interest.text)

Python prints the text of every tag, but when I write to the .txt file, only the last tag appears.

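Edit: I would also like to do something similar, but with a docx file. I took the code Igor suggested and changed parts of it for the docx file I need. But I still have the same problem with the docx file.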

from bs4 import BeautifulSoup
import docx
def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()

    soup = BeautifulSoup(content, 'lxml')
    interests = soup.find_all('h2')
    with open('interest.txt', 'w') as file:
        for interest in interests:
            mydoc = docx.Document()
            mydoc.add_paragraph(f'{interest.text}')
            mydoc.save("C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")
            print(interest.text)

【Question comments】:

Tags: python web-scraping beautifulsoup

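【Solution 1】:

You are reopening the file in write mode on every iteration, which overwrites its previous contents. Either open it only once and put the loop inside the with block, or open it in 'a' mode ('a' stands for "append": open('interest.txt', 'a')).

(The former is probably preferable here, since there seems to be no reason to open and close the file again and again while you keep writing to it.)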
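
To make the difference between the two options concrete, here is a minimal sketch that uses a plain list of strings in place of the h2 tags and writes to a temporary directory. The sample data and the helper names are made up for illustration; only open() and its 'w' and 'a' modes come from the answer above.

```python
import os
import tempfile

# Stand-in for the text of each <h2> tag (hypothetical sample data).
headings = ["Hiking", "Cooking", "Chess"]


def reopen_in_write_mode(path, items):
    # Pattern from the question: 'w' truncates the file on every iteration.
    for item in items:
        with open(path, "w") as file:
            file.write(item)


def open_once(path, items):
    # First option: open the file once and loop inside the with block.
    with open(path, "w") as file:
        for item in items:
            file.write(item)


def reopen_in_append_mode(path, items):
    # Second option: 'a' keeps whatever the file already contains.
    for item in items:
        with open(path, "a") as file:
            file.write(item)


def read_back(path):
    with open(path) as file:
        return file.read()


with tempfile.TemporaryDirectory() as tmp:
    results = {}
    for writer in (reopen_in_write_mode, open_once, reopen_in_append_mode):
        path = os.path.join(tmp, writer.__name__ + ".txt")
        writer(path, headings)
        results[writer.__name__] = read_back(path)

for name, text in results.items():
    print(f"{name}: {text!r}")
# reopen_in_write_mode: 'Chess'
# open_once: 'HikingCookingChess'
# reopen_in_append_mode: 'HikingCookingChess'
```

Note that append mode also keeps the output of earlier runs, so running the script twice would double the file's contents. That is one more reason to prefer opening the file once in write mode.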

【Comments】:

【Solution 2】:

The interest.txt file is rewritten on every iteration. You just need to move the with open... part out of the for loop. Try replacing this snippet

    for interest in interests:
        with open ('interest.txt', 'w') as file:
            file.write(f'{interest.text}')
        print(interest.text)

with the following code:

    with open('interest.txt', 'w') as file:
        for interest in interests:
            file.write(f'{interest.text}')
            print(interest.text)

The complete code is as follows:

from bs4 import BeautifulSoup


def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()

    soup = BeautifulSoup(content, 'lxml')
    interests = soup.find_all('h2')
    with open('interest.txt', 'w') as file:
        for interest in interests:
            file.write(f'{interest.text}')
            print(interest.text)
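
One detail worth checking in the code above: file.write() does not add a line break, so the heading texts end up run together in interest.txt. A possible variant that writes one heading per line is sketched below. The helper name save_headings and the sample strings are assumptions for illustration; in the real script the list would come from the find_all('h2') results.

```python
import os
import tempfile


def save_headings(texts, path):
    """Write each string in `texts` on its own line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as file:
        for text in texts:
            # strip() drops stray whitespace around the heading text;
            # the explicit "\n" puts each heading on its own line.
            file.write(text.strip() + "\n")


# Stand-in for [h2.get_text() for h2 in soup.find_all('h2')]
sample = ["  Hiking ", "Cooking", "Chess\n"]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "interest.txt")
    save_headings(sample, out_path)
    with open(out_path, encoding="utf-8") as f:
        saved = f.read()

print(saved)
```

Passing encoding="utf-8" explicitly is optional, but it avoids depending on the platform's default encoding when the headings contain non-ASCII characters.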

Edit: here is the .docx version for the updated question:

from bs4 import BeautifulSoup
import docx


def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()

        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')

        mydoc = docx.Document()
        for interest in interests:
            mydoc.add_paragraph(f'{interest.text}')
            print(interest.text)

    # Raw string so the backslashes in the Windows path are taken literally.
    mydoc.save(r"C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")

N.B. The docx module can be installed with pip install python-docx.

【Comments】: