How to get the content of a tag with Beautiful Soup?

【Question title】: How to get the content of a tag with Beautiful Soup?
【Posted】: 2021-02-22 21:21:53
【Question description】:

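I am trying to extract problems from various AMC tests. Take https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 as an example. To get the problem text, I only need the plain string text in the first <p> element and the LaTeX in the <img> elements inside that first <p> element.

My code so far: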

import bs4
import requests

res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')

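This works for getting the LaTeX equation, but the problem also has plain text before the equation. Is there a way to get that other part of the problem, i.e. "What is the value of"? I was thinking of using a regex, but I would like to know whether Beautiful Soup has a feature that can get it for me.

【Question comments】:

Tags: python beautifulsoup request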

【Solution 1】:

Try using zip():

import requests
from bs4 import BeautifulSoup

URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break

Output:

What is the value of  $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$

Edit:

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
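
The two versions above differ in what the first argument to zip() yields. A minimal sketch of that difference, using a made-up HTML snippet (SAMPLE_HTML is an assumption, not the real wiki markup): iterating over the single Tag returned by select_one() walks its child nodes, while select() returns a list of whole <p> tags.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical stand-in for the wiki page; the real markup may differ.
SAMPLE_HTML = """
<div class="mw-parser-output">
<p>What is the value of <img alt="$x+1$" src="a.png"> when <img alt="$x=2$" src="b.png">?</p>
<p>Second paragraph <img alt="$y$" src="c.png"></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one() returns a single Tag; iterating over it yields its child nodes
# (text nodes and <img> tags, in document order).
first_p = soup.select_one(".mw-parser-output p")
children = list(first_p)

# select() returns a list of every matching <p> Tag.
paragraphs = soup.select(".mw-parser-output p")

print([type(child).__name__ for child in children])
print(len(paragraphs))
```

In the first version zip() therefore pairs the first text node with the first image, which is why it needs the break; in the edited version it pairs the n-th paragraph with the n-th image on the page, which only lines up if each paragraph contains exactly one image.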

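【Comments】:

What does .mw-parser-output do?
Also, is there a way to make it loop over all the images and <p> elements? There are more of them further down.
@JamesHuang 1. It selects the class mw-parser-output; it is a CSS selector. 2. See my edit (I hope it helps, since the page is hard to scrape).
The new solution prints the correct answer, but it also prints a lot of extra output separately. I don't see how I can get the text in the right order this way. I tried using soup.children, but it does not quite work because it groups all the image tags together.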
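
One possible way to keep text and formulas in document order, as asked in the last comment, is to walk the descendants of a single <p> and take the string for text nodes and the alt attribute for <img> tags. This is only a sketch: paragraph_text and SAMPLE_P are hypothetical names, and the sample markup is made up rather than copied from the wiki page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def paragraph_text(p):
    """Join the text nodes and <img> alt texts of one tag in document order."""
    parts = []
    for node in p.descendants:
        if isinstance(node, NavigableString):
            # Plain text, including text nested inside tags such as <i>.
            parts.append(str(node))
        elif isinstance(node, Tag) and node.name == "img":
            # Assumption: the LaTeX source is stored in the alt attribute.
            parts.append(node.get("alt", ""))
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parts).split())


# Hypothetical markup, not the real page.
SAMPLE_P = (
    '<p>What is the value of <img alt="$a^2$" src="x.png"> '
    'when <i>a</i> = <img alt="$2$" src="y.png">?</p>'
)

result = paragraph_text(BeautifulSoup(SAMPLE_P, "html.parser").p)
print(result)
```

On the real page the same function could be applied to each tag returned by soup.select(".mw-parser-output p"), although images without a LaTeX alt text (for example geometry diagrams) would need separate handling.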

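【Solution 2】:

Hmm, BS4 seems a bit awkward for this. It took me a while to get this working, and I don't think BS4 alone copes well with the odd spacing on this page, so a regex is probably your best option. Let me know if this works for you. I checked the first two problems and they came out fine. However, some AMC problems include geometry images, so I don't think it will work for those.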

import bs4
import requests
import re

res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
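# Split the pretty-printed <p> into non-empty lines, then drop the first line
# (the opening <p> tag) and the last two lines.
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
# The LaTeX source of each formula is stored in the <img> alt attribute.
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    # Collapse runs of spaces and strip the indentation added by prettify().
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    # startswith() avoids an IndexError if a line ends up empty.
    if elements[n].startswith("$"):
        elements[n] = " "+elements[n]+" "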

print(elements)
print("".join(elements))
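
To make the regex step concrete, here is a small sketch of what the pattern captures. The line variable is a made-up example shaped like one line of prettify() output for an <img> tag; it is not taken from the actual page.

```python
import re

# Same pattern as in the solution above: non-greedy capture of the alt value.
latex_reg = re.compile(r'alt="(.*?)"')

# Hypothetical line, shaped like prettify() output for an <img> tag.
line = ' <img alt="$x^2$" class="latex" src="//example.invalid/x.png"/>'

match = latex_reg.search(line)
latex = match.group(1) if match else None
print(latex)  # $x^2$

# A line of plain text has no alt attribute, so the pattern does not match.
no_match = latex_reg.search(" What is the value of")
```

Because the capture is non-greedy, it stops at the first closing double quote after alt=", so the other attributes on the same line are not included.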

【Comments】: