【Question title】: How to get the text of all the elements in a given HTML in Python 3?
【Posted】: 2020-11-08 08:18:37
【Question description】:

How can I extract the text of every element from the following HTML:

from bs4 import BeautifulSoup


html3 = """
<div class="tab-cell l1">
    <span class="cyan-90">***</span>
    <h2 class="white-80">
        <a class="k-link" href="#" title="Jump">Jump</a>
    </h2>
    <h3 class="black-70">
        <span>Red</span>
        <span class="black-50">lock</span>
    </h3>
    <div class="l-block">
        <a class="lang-menu" href="#">A</a>
        <a class="lang-menu" href="#">B</a>
        <a class="lang-menu" href="#">C</a>
    </div>
    <div class="black-50">
        <div class="p-bold">Period</div>
        <div class="tab--cell">$</div><div class="white-90">Method</div>
        <div class="tab--cell">$</div><div class="tab--cell">Type</div>
    </div>
</div>
"""

soup = BeautifulSoup(html3, "lxml")
if soup.find('div', attrs={'class': 'tab-cell l1'}):
    div_descendants = soup.div.descendants
    for des in div_descendants:
        if des.name is not None:
            print(des.name)
            if des.find(class_='k-link'):
                print(des.a.string)
            if des.find(class_='black-70'):
                print('span')
                print(des.span.text)

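I only get the text of the first link; after that I get nothing. I want to walk through the markup element by element and pick out whatever I need. If anyone has any ideas, please let me know.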

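【Question comments】:

  • So far, the reference to div on this line, div_descendants = div.descendants, is unresolved.
  • My bad, I forgot a line...
  • Well, div.descendants is still unresolved. Moving it won't fix that. You have to declare it first.
  • Fixed div.descendants.
  • Initially I tried to get the text by indexing des, like des[2] and des[4], which failed. So I went the other way around, but after getting the first value this way I assumed I would get the remaining values too, and that did not happen...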

Tags: python-3.x web-scraping beautifulsoup python-requests


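【Solution 1】:

Your own if conditions are what stop you from getting everything. You print text only in the two cases covered by the class_=... checks, not for every tag. Adding a fallback branch reports the rest: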

# html3 is defined as in the question above

from bs4 import BeautifulSoup
# The "lxml" parser must be installed, but it does not need to be imported.

soup = BeautifulSoup(html3, "lxml")
if soup.find('div', attrs={'class': 'tab-cell l1'}):
    div_descendants = soup.div.descendants
    for des in div_descendants:
        if des.name is not None:
            print(des.name)
            found = False
            if des.find(class_='k-link'):
                print(des.a.string)
                found = True
            if des.find(class_='black-70'):
                print('span')
                print(des.span.text)
                found = True
            # find all others that are not already reported:
            if not found:
                print(f"Other {des.name}: {des.string}")

Output:

span
Other span: ***
h2
Jump
a
Other a: Jump
h3
Other h3: None
span
Other span: Red
span
Other span: lock
div
Other div: None
a
Other a: A
a
Other a: B
a
Other a: C
div
Other div: None 
div
Other div: Period
div
Other div: $
div
Other div: Method
div
Other div: $
div
Other div: Type
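
The `None` entries in the output above come from how `.string` behaves: it is only defined when a tag has a single child, so containers such as `h3` or the nested `div` report `None`. A minimal sketch of the alternatives follows, assuming a trimmed-down copy of the question's `h3` block and the built-in "html.parser" instead of "lxml" (both are choices made here for illustration, not part of the original answer):

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup (assumption: enough to show the effect).
snippet = """
<h3 class="black-70">
    <span>Red</span>
    <span class="black-50">lock</span>
</h3>
"""

# "html.parser" ships with Python; the answers in this thread use "lxml".
soup = BeautifulSoup(snippet, "html.parser")
h3 = soup.find("h3")

# .string is None because h3 has several children (two spans plus whitespace).
string_value = h3.string

# get_text() joins every nested string; strip=True drops whitespace-only pieces.
joined_text = h3.get_text(" ", strip=True)

# stripped_strings yields the same pieces one at a time.
pieces = list(h3.stripped_strings)

print(string_value, joined_text, pieces)
```

If the goal is the text of a whole container rather than of each leaf tag, `get_text()` or `stripped_strings` is probably the more direct route than checking `.string` per descendant.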

【Comments】:

    【Solution 2】:

    This solves the problem by printing every tag whose .string is not None:

    from bs4 import BeautifulSoup
    # The "lxml" parser must be installed, but it does not need to be imported.
    
    
    html3 = """
    <div class="tab-cell l1">
        <span class="cyan-90">***</span>
        <h2 class="white-80">
            <a class="k-link" href="#" title="Jump">Jump</a>
        </h2>
        <h3 class="black-70">
            <span>Red</span>
            <span class="black-50">lock</span>
        </h3>
        <div class="l-block">
            <a class="lang-menu" href="#">A</a>
            <a class="lang-menu" href="#">B</a>
            <a class="lang-menu" href="#">C</a>
        </div>
        <div class="black-50">
            <div class="p-bold">Period</div>
            <div class="tab--cell">$</div><div class="white-90">Method</div>
            <div class="tab--cell">$</div><div class="tab--cell">Type</div>
        </div>
    </div>
    """
    
    soup = BeautifulSoup(html3, "lxml")
    if soup.find('div', attrs={'class': 'tab-cell l1'}):
        div_descendants = soup.div.descendants
        for des in div_descendants:
            if des.name is not None and des.string is not None:
                print(f"{des.name}: {des.string}")
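
The same idea can be approached from the text nodes instead of the tags. The sketch below is one possible variant, not the answerer's method: `extract_texts` and `sample` are hypothetical names introduced here, the markup is a shortened copy of the question's HTML, and "html.parser" stands in for "lxml" so that no extra parser is needed.

```python
from bs4 import BeautifulSoup


def extract_texts(html, css_class="tab-cell"):
    """Return (parent tag name, text) pairs for every non-blank string.

    Hypothetical helper: looks up the first <div> carrying css_class and
    walks its text nodes in document order.
    """
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find("div", class_=css_class)
    if root is None:
        return []
    pairs = []
    for node in root.find_all(string=True):
        text = node.strip()
        if text:  # skip the whitespace between tags
            pairs.append((node.parent.name, text))
    return pairs


# Shortened copy of the question's markup (assumption for illustration).
sample = """
<div class="tab-cell l1">
    <span class="cyan-90">***</span>
    <h2 class="white-80"><a class="k-link" href="#" title="Jump">Jump</a></h2>
    <div class="black-50">
        <div class="tab--cell">$</div><div class="white-90">Method</div>
    </div>
</div>
"""

for name, text in extract_texts(sample):
    print(f"{name}: {text}")
```

Because each string is reported once with its direct parent, a tag such as `h2` that merely wraps the link does not produce a duplicate "Jump" line.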
    

    【Comments】:
