【Question title】: Web scraping in Python: the HTML page does not come back in full
【Posted】: 2021-03-04 10:40:45
【Question】:

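I am trying to scrape two tables from a page.

But when I use soup.find('table'), it simply does not find them. Also, when I print the soup object, the table part of the HTML is missing from the output. Is there a way to fix this?

My code so far: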

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/opcoes/posicoes-em-aberto/posicoes-em-aberto-8AE490CA64BA055F0164CCCAE1F1460A.htm?empresaEmissora=AMBEV%20S.A.&data=19/11/2020&dataVencimento=21/12/20&f=0'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

table = soup.find('div').find_all('table')

print(table)

Output:

[]
[Finished in 3.4s]

When I run this:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/opcoes/posicoes-em-aberto/posicoes-em-aberto-8AE490CA64BA055F0164CCCAE1F1460A.htm?empresaEmissora=AMBEV%20S.A.&data=19/11/2020&dataVencimento=21/12/20&f=0'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

table = soup.find('tbody').find_all('tr')

print(table)

I get the error below, even though in the page's HTML the table data sits inside tbody > tr, just like the tables I have scraped before:

Traceback (most recent call last):
  File "C:\Users\jvbf9\Documents\data-science\scraping_thiago\main.py", line 11, in <module>
    table = soup.find('tbody').find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'
[Finished in 7.2s with exit code 1]
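
The traceback above is what happens when soup.find('tbody') matches nothing: find() returns None, and calling .find_all on None raises the AttributeError. Below is a minimal sketch of a guard around that call. The sample HTML string and the helper name find_rows are made up for illustration, and the built-in html.parser is used instead of lxml only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page whose tables are filled in later by
# JavaScript: the raw HTML only contains an empty placeholder <div>.
SAMPLE_HTML = "<html><body><div id='placeholder'></div></body></html>"


def find_rows(html):
    """Return all <tr> tags inside the first <tbody>, or [] if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")  # find() returns None when nothing matches
    if tbody is None:
        return []
    return tbody.find_all("tr")


rows = find_rows(SAMPLE_HTML)
print(len(rows))  # 0: this HTML has no <tbody> at all
```

The guard only removes the crash. An empty result still means the table is not present in the HTML that requests downloaded.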

【Comments】:

  • If you look at the raw page source, these tables are generated by JavaScript, so you will have to use something like Selenium instead.
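
A rough sketch of what the Selenium route suggested in this comment could look like. The helper names fetch_rendered_html and extract_tables, the 20 second timeout, and the choice of Chrome are all assumptions for illustration, and the sketch has not been run against the B3 page. It assumes Selenium plus a Chrome browser and matching driver are installed.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=20):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported here so the rest of the module loads without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver exist
    try:
        driver.get(url)
        # Wait until at least one <table> exists in the rendered DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out


def extract_tables(html):
    """Return every <table> as a list of rows, each row a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables
```

With the url from the question, extract_tables(fetch_rendered_html(url)) would then return the tables as nested lists, provided the tables end up in the top-level document once the scripts have run.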

Tags: python html web-scraping python-requests


【Answer 1】:

When you create the parser, pass the response's content (r.content) rather than its text (r.text):

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/opcoes/posicoes-em-aberto/posicoes-em-aberto-8AE490CA64BA055F0164CCCAE1F1460A.htm?empresaEmissora=AMBEV%20S.A.&data=19/11/2020&dataVencimento=21/12/20&f=0'

r = requests.get(url)

soup = BeautifulSoup(r.content, 'lxml')

table = soup.find('div').find_all('table')

print(table)

That should be the cause of the problem.
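
For context on this suggestion: in requests, r.content is the raw response body as bytes, and r.text is the same body decoded to a str. Switching between them changes how characters are decoded, not which tags are present. The sketch below uses a made-up response body (raw_bytes) rather than the real page, and assumes UTF-8, to show that both inputs parse to the same structure. If the table is missing from one, it is most likely missing from the other too, which fits the comment on the question about the tables being generated by JavaScript.

```python
from bs4 import BeautifulSoup

# Hypothetical response body: the bytes a server might send for a page
# whose tables are only added later by JavaScript.
raw_bytes = b"<html><body><div>placeholder</div></body></html>"

content = raw_bytes               # what r.content would hold (bytes)
text = raw_bytes.decode("utf-8")  # what r.text would hold (str), assuming UTF-8

soup_from_content = BeautifulSoup(content, "html.parser")
soup_from_text = BeautifulSoup(text, "html.parser")

# Both parses see the same tags, and neither contains a <table>.
print(len(soup_from_content.find_all("table")))  # 0
print(len(soup_from_text.find_all("table")))     # 0
print(len(soup_from_content.find_all("div")))    # 1
print(len(soup_from_text.find_all("div")))       # 1
```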
