【Question title】: Web scraping USGS HTML data using regular expressions
【Posted】: 2019-12-11 07:27:54
【Question】:

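I am using Beautiful Soup to scrape HTML, and regular expressions to extract data for two rivers from the USGS site. For each river I collect the gauge height, the date, and the time. The code works for the first river but not for the second.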

from urllib.request import urlopen as uReq
import re
import os
from bs4 import BeautifulSoup as soup

wilson_url = 'https://waterdata.usgs.gov/or/nwis/uv?site_no=14301500'
wilson_client = uReq(wilson_url)
wilson_html = wilson_client.read()
wilson_client.close()

wilson_soup = soup(wilson_html, "html.parser")
wilson = wilson_soup.findAll("div",{"class":"stationContainer"})

wilson_lvl_text = wilson[2].text

gauge_compile = re.compile('Most recent instantaneous value:\s+(\d+\\.\d+\d)+\s+(\d+\d+\\-\d+\d+\\-\d+\d+\d+\d)+\s+\s+\s+(\d+\d+\\:\d+\d+\s+\w+\w+\w)')
gauge_search = gauge_compile.search(wilson_lvl_text)

wilson = float(gauge_search.group(1))
wil_day = gauge_search.group(2)
wil_time = gauge_search.group(3)
print('As of', wil_day, ', at', wil_time, '...')
print()
print('The Wilson River level is', wilson, 'feet.')
nehalem_url = 'https://waterdata.usgs.gov/nwis/uv?site_no=14301000'
nehalem_client = uReq(nehalem_url)
nehalem_html = nehalem_client.read()
nehalem_client.close()

nehalem_soup = soup(nehalem_html, "html.parser")
nehalem = nehalem_soup.findAll("div",{"class":"stationContainer"})

nehalem_lvl_text = nehalem[2].text

gauge_compile = re.compile('Most recent inehantaneous value:\s+(\d+\\.\d+\d)+\s+(\d+\d+\\-\d+\d+\\-\d+\d+\d+\d)+\s+\s+\s+(\d+\d+\\:\d+\d+\s+\w+\w+\w)')
gauge_search = gauge_compile.search(nehalem_lvl_text)

nehalem = float(gauge_search.group(1))
neh_day = gauge_search.group(2)
neh_time = gauge_search.group(3)
print('As of', neh_day, ', at', neh_time, '...')
print()
print('The Nehalem River level is', nehalem, 'feet.')

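Running the module prints the correct Wilson River reading, but it fails when it tries to find the Nehalem River gauge reading with the regular expression: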

As of 12-10-2019 , at 22:30 PST ...

The Wilson River level is 4.2 feet.
Traceback (most recent call last):
  File "C:\Python\Scripts\streams.py", line 41, in <module>
    nehalem = float(gauge_search.group(1))
AttributeError: 'NoneType' object has no attribute 'group'

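【Comments】:

  • As documented, re.search() returns None if no match is found.
  • What is unclear about that error message? Also, why do you seem to create and compile the same regular expression twice?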
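
The two comments above can be sketched roughly as follows: compile the pattern once, reuse it for both stations, and check for None before calling .group(). This is only an illustration. The helper name parse_gauge and the sample strings are hypothetical, the pattern is a simplified version of the one in the question, and the sample text merely assumes the layout implied by the question's output.

```python
import re

# Compiled once and reused for every station.
# Simplified from the question's pattern; assumes "value date time zone" order.
GAUGE_RE = re.compile(
    r'Most recent instantaneous value:\s+'
    r'(\d+\.\d+)\s+'            # gauge height, e.g. 4.20
    r'(\d{2}-\d{2}-\d{4})\s+'   # date, e.g. 12-10-2019
    r'(\d{2}:\d{2}\s+\w{3})'    # time and zone, e.g. 22:30 PST
)

def parse_gauge(text):
    """Return (level, day, time), or None when the pattern does not match."""
    match = GAUGE_RE.search(text)
    if match is None:  # re.search() returns None when nothing matches
        return None
    return float(match.group(1)), match.group(2), match.group(3)

# Made-up sample text standing in for the scraped page content.
sample = 'Most recent instantaneous value: 4.20 12-10-2019   22:30 PST'
print(parse_gauge(sample))
print(parse_gauge('no reading here'))
```

With a helper like this, a failed match shows up as an explicit None that the caller can report, instead of an AttributeError further down.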

Tags: python regex beautifulsoup


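【Solution 1】:

Dude, there is no "Most recent inehantaneous value" on the page; it's "Most recent instantaneous value:". The second pattern misspells "instantaneous", so it never matches.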
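
A quick way to see the effect of the misspelling is to run both spellings against the same text. The sample string below is made up for illustration, and the patterns are shortened to capture only the gauge height.

```python
import re

# Made-up sample standing in for the scraped station text.
text = 'Most recent instantaneous value: 5.31 12-10-2019   22:30 PST'

# Pattern prefix as misspelled in the question's second block.
typo = re.compile(r'Most recent inehantaneous value:\s+(\d+\.\d+)')
# Same pattern with the literal spelled as it appears on the page.
fixed = re.compile(r'Most recent instantaneous value:\s+(\d+\.\d+)')

print(typo.search(text))             # None: the literal prefix never occurs
print(fixed.search(text).group(1))   # 5.31
```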

【Comments】:

  • Ugh.. time lost to my poor English. Thanks, it was worth it.