Scrapy: why can't I extract my targeted data from Weather Underground? Answers

【Question title】: Scrapy: why can't I extract my targeted data from Weather Underground?
【Posted】: 2021-04-13 13:22:55
【Question description】:

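I'm new to Python and web scraping, and this is my first question on Stack Overflow. I followed a few tutorials and then tried to extract the data from the table on this page: https://www.wunderground.com/hourly/ir/tehran/date/2021-04-14.

The table: TABLE

The problem is that I can't seem to reach the right class in the Scrapy shell. Here is my spider: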

import scrapy


class SpSpider(scrapy.Spider):
    name = 'sp'
    start_urls = ['https://www.wunderground.com/hourly/ir/tehran/date/2021-04-14/']

    def parse(self, response):
        time = response.css('span.ng-star-inserted').extract()

And this is what I get in the terminal:

In [4]: response.css('span.ng-star-inserted::text').extract()
Out[4]:
['\xa0',
 'F',
 'Night',
 '\xa0',
 'in',
 '\xa0',
 'miles',
 '\xa0',
 'F',
 '\xa0',
 '%',
 '\xa0',
 'in',
 '\xa0',
 'in']

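My goal with this code was to get a single value (here 12, which is the time shown in the table). But as you can see, the contents of the list are unrelated to it. How should I access the data?

P.S.: I'm working with Python 3.8.

【Question comments】:

Tags: python web-scraping scrapy scrapy-shell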

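【Solution 1】:

This may be a little complicated for a beginner, but that's fine.

The data you are looking for is delivered by an XHR request (F12 -> Network -> XHR). The request you are making only returns the HTML markup that will later hold the data, not the data itself.

In the code below, the URL I use is taken from the XHR tab, so I query that URL directly. It returns a JSON response. I then convert this JSON response (which maps naturally onto Python's dictionary type) into a Pandas dataframe.

Note that the response to this query contains the hourly forecast for "all" available days (the equivalent of clicking the left and right arrows on the web page).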

import requests as rq
import pandas as pd

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
url = "https://api.weather.com/v3/wx/forecast/hourly/15day?apiKey=6532d6454b8aa370768e63d6ba5a832e&geocode=35.696,51.401&units=e&language=en-US&format=json"
resp = rq.get(url, headers=headers).json()  # parse the JSON body into a dict

resp.keys()  # inspect the available fields

df = pd.DataFrame.from_dict(resp)  # JSON to DataFrame
df["validTimeLocal"] = pd.to_datetime(df["validTimeLocal"])  # object type to datetime type
df.sort_values(["validTimeLocal"], ascending=True, inplace=True)  # sort the df by datetime

sub_df = df[["validTimeLocal", "temperature", "precipChance"]]  # select the variables you want
print(sub_df.iloc[20:25])  # print a few rows and compare them to the website
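
The JSON from this endpoint appears to be column-oriented (each key maps to a list with one entry per forecast hour), which is why `pd.DataFrame.from_dict` can consume it directly. If you would rather not depend on pandas, a minimal sketch of the same reshaping in plain Python could look like the following. `rows_from_columns` is a hypothetical helper, and the sample payload is made up for illustration; it is not real forecast data.

```python
from typing import Any, Dict, List


def rows_from_columns(payload: Dict[str, List[Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Convert parallel lists keyed by field name into one dict per row."""
    columns = [payload[name] for name in fields]
    # zip(*columns) walks the lists in lockstep, yielding one tuple per hour.
    return [dict(zip(fields, values)) for values in zip(*columns)]


# Made-up sample shaped like the response described above (assumption).
sample = {
    "validTimeLocal": ["2021-04-14T00:00:00+0430", "2021-04-14T01:00:00+0430"],
    "temperature": [61, 60],
    "precipChance": [5, 3],
}

rows = rows_from_columns(sample, ["validTimeLocal", "temperature", "precipChance"])
for row in rows:
    print(row)
```

Since the question is about Scrapy: assuming Scrapy 2.2 or later, the payload could come from `response.json()` inside a `parse` callback that requests the API URL, and each row could then be yielded as an item.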

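Do some research on the key terms above to make progress. Also take a look at the requests and bs4 packages.

Note: the URL contains parameters specific to your Tehran query: the geocode, etc.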
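
To make those parameters easier to see and change, here is a small sketch that rebuilds the same URL from its parts using only the standard library. `build_hourly_url` and the `YOUR_API_KEY` placeholder are hypothetical names introduced for illustration; the parameter names and default values are copied from the URL in the answer. Other values for `units` or `language` are not covered by the answer, so treat them as untested.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in the answer.
BASE_URL = "https://api.weather.com/v3/wx/forecast/hourly/15day"


def build_hourly_url(api_key: str, lat: float, lon: float,
                     units: str = "e", language: str = "en-US") -> str:
    """Assemble the hourly-forecast URL for a latitude/longitude pair."""
    params = {
        "apiKey": api_key,
        "geocode": f"{lat},{lon}",
        "units": units,
        "language": language,
        "format": "json",
    }
    # safe="," keeps the comma in the geocode unescaped, as in the original URL.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# Tehran coordinates as they appear in the answer's URL.
url = build_hourly_url("YOUR_API_KEY", 35.696, 51.401)
print(url)
```

With `requests`, the same dictionary could also be passed as `params=` to `requests.get`, although the comma may then be percent-encoded.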

【Comments】:

【Solution 2】:

To get the first time value, if that is all you need, use this CSS locator:

.mat-row:nth-of-type(1)>.cdk-column-timeHour>span

For the second:

.mat-row:nth-of-type(2)>.cdk-column-timeHour>span

And so on.
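
The pattern above can be generated for any row. The sketch below builds the locator string for row `n`; `time_hour_selector` is a hypothetical helper, and the optional `::text` suffix is the Scrapy/parsel extension for selecting text nodes. Keep in mind that, as Solution 1 explains, the table is filled in by an XHR request, so these locators would likely only match on a rendered page (for example in the browser's developer tools) and may return nothing against the raw HTML that Scrapy downloads.

```python
def time_hour_selector(row: int, text_only: bool = True) -> str:
    """Build the CSS locator for the time cell of the given 1-based table row."""
    if row < 1:
        raise ValueError("nth-of-type is 1-based, so row must be >= 1")
    selector = f".mat-row:nth-of-type({row})>.cdk-column-timeHour>span"
    # "::text" is a Scrapy/parsel extension, not standard CSS.
    return selector + "::text" if text_only else selector


# Locators for the first three rows, in the plain CSS form used above.
selectors = [time_hour_selector(n, text_only=False) for n in range(1, 4)]
for s in selectors:
    print(s)
```

In a Scrapy shell the text form would be used as `response.css(time_hour_selector(1)).get()`.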

【Comments】: