【Question Title】: How can I get the following Python code to output worldometers.info data? (It seems this question was answered, but the answer does not work for me)
【Posted】: 2020-10-15 08:41:27
【Question】:

I am trying to get values from worldometers.info (similar to the post "Python: No tables found matching pattern '.+'"). The code I am using is below:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.worldometers.info/coronavirus/#countries'
header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9","X-Requested-With": "XMLHttpRequest"}

r = requests.get(url, headers=header)

# fix HTML multiple tbody
soup = BeautifulSoup(r.text, "html.parser")
for body in soup("tbody"):
    body.unwrap()

print(soup)

df = pd.read_html(str(soup), index_col=1, thousands=r',', flavor="bs4")[0]
df = df.replace(regex=[r'\+', r'\,'], value='')

df = df.fillna('0')
df = df.to_json(orient='index')

print(df)

The output is the page's HTML, and then pandas raises an error when it processes it:

Traceback (most recent call last):
  File "./covid19_status.py", line 37, in <module>
    df = pd.read_html(str(soup), index_col=1, thousands=r',', flavor="bs4")[0]
  File "/usr/local/lib64/python3.6/site-packages/pandas/util/_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/html.py", line 1101, in read_html
    displayed_only=displayed_only,
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/html.py", line 917, in _parse
    raise retained
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/html.py", line 898, in _parse
    tables = p.parse_tables()
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "/usr/local/lib64/python3.6/site-packages/pandas/io/html.py", line 563, in _parse_tables
    raise ValueError(f"No tables found matching pattern {repr(match.pattern)}")
ValueError: No tables found matching pattern '.+'

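Can anyone tell me how to fix this? I tried the regular expression from the similar post but could not get it to work, so it is not included in this code (I am very familiar with Python).

Thanks in advance!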

【Comments】:

    Tags: python pandas beautifulsoup


    【Solution 1】:

    You can follow the code provided in the answer to this question. The complete code is below:

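    import pandas as pd
    import requests
    import re

    url = 'https://www.worldometers.info/coronavirus/#countries'
    header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9","X-Requested-With": "XMLHttpRequest"}

    # Download the page and keep only its HTML text (re.sub needs a str).
    r = requests.get(url, headers=header).text

    # Upper-case every tag; this is the workaround taken from the linked answer.
    r = re.sub(r'<.*?>', lambda g: g.group(0).upper(), r)

    # Parse every table found in the HTML into a list of DataFrames.
    dfs = pd.read_html(r)

    # Write the first table to a CSV file without the index column.
    dfs[0].to_csv('D:\\Worldometer.csv', index=False)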
    

    [Screenshot of the CSV file]
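
To make the `re.sub` line in the answer concrete, here is a minimal standard-library sketch of what it does to markup. The sample HTML string and the helper name `uppercase_tags` are made up for illustration; only the regular expression comes from the answer. Because the pattern matches everything between `<` and `>`, attribute names and values inside a tag are upper-cased too, while the text between tags is left alone.

```python
import re

# Non-greedy match: each <...> span is one match, so one tag at a time.
TAG_PATTERN = re.compile(r'<.*?>')


def uppercase_tags(html: str) -> str:
    """Upper-case every <...> span and leave the text between tags untouched."""
    return TAG_PATTERN.sub(lambda m: m.group(0).upper(), html)


# Hypothetical sample markup, not taken from the real page.
sample = '<table id="main"><tr><td>usa</td></tr></table>'
converted = uppercase_tags(sample)
print(converted)  # <TABLE ID="MAIN"><TR><TD>usa</TD></TR></TABLE>
```

Since `.` does not match a newline by default, a tag that is split across several lines in the source would not be matched by this pattern.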

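    【Comments】:

    • Thank you very much! Yes, this helps. I actually got this working earlier, but I am only looking for the output of columns 3 and 4 to add as metrics. Would that be very difficult? ++10
    • As metrics? What do you mean? Could you be clearer? By the way, if my answer helped you, please click the check mark below the vote buttons to accept it as the best answer.
    • Using the same code I still get an error: Traceback (most recent call last): File "./covid19_status.py", line 31, in <module> r = re.sub(r'<.*?>', lambda g: g.group(0).upper(), r) File "/usr/lib64/python3.6/re.py", line 191, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or bytes-like object. Also, I want to use BeautifulSoup and pandas to format the output for another application.
    • I did click the up arrow, but the number did not increase???
    • For the error you are getting: change r = re.sub(r'<.*?>', lambda g: g.group(0).upper(), r) to r = re.sub(r'<.*?>', lambda g: g.group(0).upper(), str(r))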
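
The TypeError quoted in the comments is what `re.sub` raises whenever its third argument is not a `str` or bytes-like object. One plausible cause is that `.text` was dropped and the response object itself was passed; this is an assumption, since the commenter's full script is not shown. The sketch below uses a hypothetical stand-in class, `FakeResponse`, instead of a real network call. It also suggests why wrapping the object in `str()` would silence the error without helping in that case: `str()` of a response object is its short repr, whereas the page HTML lives in `.text`.

```python
import re


class FakeResponse:
    """Hypothetical stand-in for a requests.Response (no network needed)."""

    def __init__(self, text: str, status_code: int = 200):
        self.text = text
        self.status_code = status_code

    def __repr__(self) -> str:
        # Mimics the short repr that requests uses for its responses.
        return f'<Response [{self.status_code}]>'


def uppercase_tags(html):
    """Same substitution as in the answer above."""
    return re.sub(r'<.*?>', lambda g: g.group(0).upper(), html)


resp = FakeResponse('<table><tr><td>1</td></tr></table>')

try:
    uppercase_tags(resp)              # passing the object itself
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__   # re.sub rejects non-string input

from_str = uppercase_tags(str(resp))    # runs, but only sees the repr
from_text = uppercase_tags(resp.text)   # sees the actual HTML

print(error_name)  # TypeError
print(from_str)    # <RESPONSE [200]>
print(from_text)   # <TABLE><TR><TD>1</TD></TR></TABLE>
```

If the script really does hold a response object at that point, passing `r.text` (as the answer's code does) is the likelier fix than `str(r)`.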