How to read the content of a website? Answers

【Question title】: How to read the content of a website?
【Posted】: 2016-07-06 07:09:44
【Question】:

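I am new to web crawling with Python 2.7.

1. Background

Right now I want to collect useful data from AQICN.org, a great website that provides air quality data from around the world.

I want to use Python to fetch the data for all Chinese sites every hour, but I am stuck.

2. My trouble

Take this page (http://aqicn.org/city/shenyang/usconsulate/) as an example.

This page provides the air pollution and meteorological parameters for the U.S. Consulate in China. With code like the following, I cannot get any useful information.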

import urllib
from bs4 import BeautifulSoup
import re
import json

html_aqi = urllib.urlopen("http://aqicn.org/city/shenyang/usconsulate/json").read().decode('utf-8')
soup = BeautifulSoup(html_aqi)
l = soup.p.get_text()
aqi = json.loads(l)

The result looks like this:

> ValueError: No JSON object could be decoded

So I changed the URL used for html_aqi to this format (borrowed from someone else's work):

http://aqicn.org/aqicn/json/android/shenyang/usconsulate/json

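With this URL the code works fine.

3. My goal

Format 1: (http://aqicn.org/city/shenyang/usconsulate/json)
Format 2: (http://aqicn.org/aqicn/json/android/shenyang/usconsulate/json)

In general, I can handle format 2. However, the URLs I have collected for all the Chinese sites are in format 1. Can anyone help me deal with format 1? Thanks a lot.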

Update

Converting format 1 to format 2 is hard (many cases have to be considered).

It is not easy to do with code like this:

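url_format1 = "http://aqicn.org/city/shenyang/usconsulate/json"
parts = url_format1.split("/")  # ['http:', '', 'aqicn.org', 'city', 'shenyang', 'usconsulate', 'json']
city_name = parts[4]
site_name = parts[5]
url_format2 = "http://aqicn.org/aqicn/json/android/" + city_name + "/" + site_name + "/json"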

### --- Why this is hard in practice
There are 1559 sites to handle, and they differ by location.
Some are in cities and some are in counties, so their URLs do not follow the same pattern.
For example:
Type1 --> http://aqicn.org/city/hebi/json
Type2 --> http://aqicn.org/city/jiangsu/huaian/json
Type3 --> http://aqicn.org/city/china/xinzhou/jiyin/json
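
The three URL types above differ only in how many path segments sit between /city/ and the trailing json, so the fixed indices can be replaced by "everything after city". The following is a minimal sketch in Python 3 (where `urlsplit` lives in `urllib.parse`; Python 2.7 has it in the `urlparse` module). The function name `to_format2` is hypothetical, and the key assumption is unverified: the thread only confirms that format 2 works for shenyang/usconsulate, not that the site serves it for every path built this way.

```python
from urllib.parse import urlsplit

# Prefix taken from the format 2 example in the question.
FORMAT2_PREFIX = "http://aqicn.org/aqicn/json/android/"


def to_format2(url_format1):
    """Rebuild a format 1 URL as a format 2 URL.

    Assumption (not confirmed by the site): format 2 mirrors whatever
    path segments follow /city/ in format 1.
    """
    # Split the path and ignore empty pieces from leading/trailing slashes.
    segments = [s for s in urlsplit(url_format1).path.split("/") if s]
    if not segments or segments[0] != "city":
        raise ValueError("not a format 1 URL: %r" % url_format1)
    location = segments[1:]
    # Drop the trailing "json" marker if present; it is re-added below.
    if location and location[-1] == "json":
        location = location[:-1]
    if not location:
        raise ValueError("no location in URL: %r" % url_format1)
    return FORMAT2_PREFIX + "/".join(location) + "/json"


for url in [
    "http://aqicn.org/city/hebi/json",
    "http://aqicn.org/city/jiangsu/huaian/json",
    "http://aqicn.org/city/china/xinzhou/jiyin/json",
]:
    print(to_format2(url))
```

Each generated URL would still need to be requested once to check that the site actually answers it with JSON.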

【Question comments】:

Tags: python json beautifulsoup web-crawler urllib

【Solution 1】:

If you are interested in the air quality index, locate the div with the aqivalue class:

>>> import urllib
>>> from bs4 import BeautifulSoup
>>> 
>>> url = "http://aqicn.org/city/shenyang/usconsulate/json"
>>> soup = BeautifulSoup(urllib.urlopen(url), "html.parser")
>>> soup.find("div", class_="aqivalue").get_text()
u'171'

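【Comments】:

Thanks for such a quick reply! I tried it and it works. By the way, how can I list all the classes the page provides? I want to get more divs (e.g., SO2, temperature, wind). Printing soup produces far too much content.
@HanZhengzu The usual way is to inspect the desired element with the browser developer tools. Note that this does not always mean you will find the element in the HTML that urllib receives, because a lot happens in the browser to build the page (extra API calls, JavaScript execution, and so on), and urllib is not a browser. Just keep that in mind for the future. Hope that helps.
Thanks for the guidance! I am now using the Chrome developer tools and hope to find the content I am interested in.
Sorry to bother you again. For content like this, "Values are converted to the US EPA AQI standard.">53, which I cut from the HTML document, how do I get the value 53? I tried your method, but it failed.
@HanZhengzu What about soup.find("td", id="min_pm10").get_text()?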
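
As a complement to the developer tools, the "list all the classes" question from the comments can be approximated offline. The sketch below walks an HTML string and records the text found under every id and class. It is written for Python 3 and deliberately uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages. `LabelledTextCollector` and `SAMPLE_HTML` are hypothetical; the sample only imitates the two names mentioned in this thread (aqivalue and min_pm10) and is not the real page markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LabelledTextCollector(HTMLParser):
    """Collect the text inside every element that has an id or a class."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of (tag, labels) for currently open elements
        self.texts = {}   # "#id" or ".class" -> accumulated text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        labels = []
        if attrs.get("id"):
            labels.append("#" + attrs["id"])
        for name in (attrs.get("class") or "").split():
            labels.append("." + name)
        for label in labels:
            self.texts.setdefault(label, "")
        self._open.append((tag, labels))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text belongs to every labelled element that is currently open.
        for _tag, labels in self._open:
            for label in labels:
                self.texts[label] += data


# Hypothetical sample, not the real aqicn.org markup.
SAMPLE_HTML = """
<div class="aqivalue">171</div>
<table>
  <tr><td id="min_pm10" title="sample tooltip">53</td></tr>
</table>
"""

collector = LabelledTextCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for label in sorted(collector.texts):
    print(label, repr(collector.texts[label].strip()))
```

Elements that share a class have their text concatenated under one key, so this is only a way to discover which names exist. As the comment above warns, anything the browser adds through JavaScript will not appear in the HTML that urllib downloads.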

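【Solution 2】:

The first URL, http://aqicn.org/city/shenyang/usconsulate/json, does not actually return JSON data. It returns HTML. If you really want this content, you have to parse the HTML.

You can do this with BeautifulSoup's HTML parser, although the lxml.html package is a bit more straightforward.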
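
Since a URL ending in /json may still return HTML, a crawler covering many sites can check what it received before decoding it. Below is a minimal sketch in Python 3 with a hypothetical helper `parse_body`; both sample bodies are made up for illustration and are not real responses from the site.

```python
import json


def parse_body(body):
    """Return ("json", value) if body decodes as JSON, else ("html", body)."""
    try:
        return "json", json.loads(body)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError in Python 3;
        # Python 2.7 raises a plain ValueError, as seen in the question.
        return "html", body


# Made-up sample bodies for illustration only.
json_body = '{"aqi": 171}'
html_body = '<html><body><div class="aqivalue">171</div></body></html>'

print(parse_body(json_body))
print(parse_body(html_body)[0])
```

Bodies tagged "html" can then be handed to an HTML parser, while bodies tagged "json" are already decoded.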

【Comments】: