【Question title】: Python Scraping: Unable to extract all <li>'s
【Posted】: 2020-12-31 04:02:07
【Question】:
import requests

from bs4 import BeautifulSoup


def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(li)
    
get_data_from_web()

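I am trying to extract the Corona statistics from http://mohfw.gov.in, but I only get the first li, although there are 3 li elements in total.

I also tried selecting those li tags by their specific classes, but I got a None response.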

<div class="col-xs-8 site-stats-count"> 
    <ul style="margin-top:0px;">
        <li class="bg-blue">
        <strong class="mob-hide">Active &nbsp;<span class="active_per"></span></strong>
        <strong class="mob-hide">973175<span class='up'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(14859<i class='fa fa-arrow-up'></i>)</span></strong>
        <!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->      
        <span class="mob-show">Active </span>
        <span class="mob-show"><span class="active_per"></span> </span> 
        <span class="mob-show"><strong>973175<span class='up'><br>(14859<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>  
        </li> 
        <li class="bg-green">
        <strong class="mob-hide">Discharged &nbsp;<span class="discharged_per"></span></strong>
        <strong class="mob-hide">3702595<span class='cup'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(78399<i class='fa fa-arrow-up'></i>)</span></strong>
        <span class="mob-show">Discharged </span>
        <span class="mob-show"><span class="discharged_per"></span> </span> 
        <span class="mob-show"><strong>3702595<span class='cup'><br>(78399<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
        </li>                               
        <li class="bg-red">
        <strong class="mob-hide">Deaths &nbsp;<span class="death_per"></span></strong>
        <strong class="mob-hide">78586&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class='up'>   (1114<i class='fa fa-arrow-up'></i>)</span></strong>
        <span class="mob-show">Deaths </span>
        <span class="mob-show"><span class="death_per"></span> </span>  
        <span class="mob-show"><strong>78586<span class='up'><br>(1114<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
        <!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->      
        </li>
        </ul></div>

【Comments】:

  • The function name in the call is incorrect
  • Sorry for the typo, but fixing it still changes nothing

Tags: python html web-scraping beautifulsoup


【Answer 1】:

The HTML markup on that page is broken. Try parsing it with the lxml or html5lib parser instead:

import requests
from bs4 import BeautifulSoup


def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')      # <-- change to lxml or html5lib
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    lis = div.find_all('li')
    for li in lis:
        print(li)
        print('-' * 80)

get_data_from_web()

Prints:

<li class="bg-blue">
<strong class="mob-hide">Active  <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up">     (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-green">
<strong class="mob-hide">Discharged  <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class="cup">     (78399<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Discharged </span>
<span class="mob-show"><span class="discharged_per"></span> </span>
<span class="mob-show"><strong>3702595<span class="cup"><br/>(78399<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-red">
<strong class="mob-hide">Deaths  <span class="death_per"></span></strong>
<strong class="mob-hide">78586     <span class="up">   (1114<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Deaths </span>
<span class="mob-show"><span class="death_per"></span> </span>
<span class="mob-show"><strong>78586<span class="up"><br/>(1114<i class="fa fa-arrow-up"></i>)</span></strong></span>
<!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->
</li>
--------------------------------------------------------------------------------
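Once all three li elements are available, the counts still have to be pulled out of the markup. The following is a minimal sketch of one way to do that, not part of the original answer: SAMPLE is a cleaned-up excerpt of the HTML shown in the question, and parse_stats is a hypothetical helper name. It assumes each li keeps the label in its first strong.mob-hide element and the count at the start of the second.

```python
import re

from bs4 import BeautifulSoup

# Cleaned-up excerpt of the markup from the question (assumption: the live
# page may differ, so treat this as sample data only).
SAMPLE = """
<ul>
  <li class="bg-blue">
    <strong class="mob-hide">Active &nbsp;<span class="active_per"></span></strong>
    <strong class="mob-hide">973175<span class="up">&nbsp;(14859<i class="fa fa-arrow-up"></i>)</span></strong>
  </li>
  <li class="bg-green">
    <strong class="mob-hide">Discharged &nbsp;<span class="discharged_per"></span></strong>
    <strong class="mob-hide">3702595<span class="cup">&nbsp;(78399<i class="fa fa-arrow-up"></i>)</span></strong>
  </li>
  <li class="bg-red">
    <strong class="mob-hide">Deaths &nbsp;<span class="death_per"></span></strong>
    <strong class="mob-hide">78586&nbsp;<span class="up"> (1114<i class="fa fa-arrow-up"></i>)</span></strong>
  </li>
</ul>
"""


def parse_stats(lis):
    """Map each label (e.g. 'Active') to its leading integer count."""
    stats = {}
    for li in lis:
        strongs = li.find_all("strong", class_="mob-hide")
        if len(strongs) < 2:
            continue  # layout differs from what this sketch assumes
        label = strongs[0].get_text(strip=True)
        # The count comes first; the daily change follows in parentheses.
        match = re.match(r"\d+", strongs[1].get_text(strip=True))
        if match:
            stats[label] = int(match.group())
    return stats


soup = BeautifulSoup(SAMPLE, "html.parser")
stats = parse_stats(soup.find_all("li"))
print(stats)
```

On the real page, the list returned by div.find_all('li') in the code above could be passed to parse_stats in place of the sample.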

【Comments】:

【Answer 2】:

I tried printing the div itself, and it seems the div ends right after the first li tag. The code is below. Run it once and you will see.

import requests
from bs4 import BeautifulSoup

def get_data_from_web():
    print("here")
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(div)
    
get_data_from_web()

Here is the output:

<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active  <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up">     (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span> </li></ul></div>
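The truncated div above can be reproduced offline, which makes it easier to see that the parser is responsible and not the request. The sketch below is an illustration under stated assumptions: BROKEN_HTML is a cut-down version of the markup from the question, with the stray </span> kept, and the outer <span> wrapper is an assumption added here so that the stray tag has an open element to close. The actual structure of the live page around the div is not known from this thread.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Cut-down version of the markup from the question. The extra </span> after
# the first count is the defect; the outer <span> wrapper is an assumption
# made for this demo, not something taken from the real page.
BROKEN_HTML = """
<span class="wrapper">
<div class="col-xs-8 site-stats-count">
<ul>
<li class="bg-blue"><span class="mob-show"><strong>973175</strong></span> </span></li>
<li class="bg-green"><span class="mob-show"><strong>3702595</strong></span></li>
<li class="bg-red"><span class="mob-show"><strong>78586</strong></span></li>
</ul>
</div>
</span>
"""


def count_items(parser):
    """Return how many <li> tags end up inside the stats div for one parser."""
    soup = BeautifulSoup(BROKEN_HTML, parser)
    div = soup.find("div", class_="col-xs-8 site-stats-count")
    return len(div.find_all("li"))


counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        counts[parser] = count_items(parser)
    except FeatureNotFound:
        # lxml and html5lib are optional third-party packages.
        print(f"{parser}: not installed, skipped")

for parser, n in counts.items():
    print(f"{parser}: {n} <li> inside the div")
```

With html.parser, the stray </span> closes every open element up to the wrapper, so the div keeps only the first li and the other two land outside it. lxml and html5lib apply their own recovery rules for misplaced end tags and are expected to keep all three inside the div, consistent with the output reported in the first answer.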

【Comments】:

  • Check the question: the image has been replaced with the HTML code, and it seems the div does not end at the first li tag. Line 255: mohfw.gov.in