【Question title】: How to extract data from an interactive graph
【Posted】: 2019-11-29 06:49:12
【Question】:

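I need to get the data points from a website that provides aggregated polling numbers. The data is presented as an interactive graph. How should I go about getting all the data points (date: number pairs) for each candidate? I tried to analyze and inspect the page source, but could not find the data file it points to. I am happy with a solution in either Python or R. Thank you very much for your help.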

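【Comments】:

  • You say you looked for a data file; did you check for API calls? That is the most likely source, IMO.
  • I agree, I suspected that too. But I don't know 1. how to check for API calls, and 2. how to get the file once I have found the call... Please advise. Thanks!
  • The best way to find the API call is to use your browser's developer tools (or whatever your browser calls them) to monitor network requests. That is browser-specific, though, so you will have to work that part out yourself. Part 2 is a bit more involved. If the page is just fetching a file, it should be straightforward. If it is an API call, you have to figure out how it is called and then make the same call yourself. How hard that is depends entirely on the website. I can take a look myself, but not until tomorrow.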

Tags: python r json web-scraping interactive


【Solution 1】:

As mentioned above, find the API call in the developer tools. Then just fetch the response and manipulate it as needed:

import json
import time

import pandas as pd
import requests

# Millisecond timestamp, sent as a bare query parameter name (see payload)
timestamp = str(int(time.time() * 1000.0))

url = 'https://www.realclearpolitics.com/epolls/json/6730_historical.js'

headers = {
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36'}

payload = {
    timestamp: '',
    'callback': 'return_json'}

# The JSON arrives wrapped in a callback call; keep only what sits between the outer parentheses
jsonStr = requests.get(url, headers=headers, params=payload).text
jsonData = json.loads(jsonStr.split('(', 1)[-1].rsplit(')', 1)[0])

# Each rcp_avg entry holds one date and a list of candidates; build one frame per date
df = pd.DataFrame(jsonData['poll']['rcp_avg'])
frames = []
for idx, row in df.iterrows():
    temp_df = pd.DataFrame(row['candidate'])
    temp_df['date'] = row['date']
    frames.append(temp_df)

# DataFrame.append was removed in pandas 2.0, so concatenate all frames once
results = pd.concat(frames, sort=True, ignore_index=True)

Output:

print(results)
     affiliation    color                date  ...        name status value
0                 #009900 2019-11-28 06:00:00  ...       Biden      1  27.0
1                 #457fff 2019-11-28 06:00:00  ...     Sanders      1  18.3
2                 #996600 2019-11-28 06:00:00  ...      Warren      1  15.8
3                 #990099 2019-11-28 06:00:00  ...   Buttigieg      1  11.0
4                 #ff9900 2019-11-28 06:00:00  ...      Harris      1   3.8
5                 #3da882 2019-11-28 06:00:00  ...        Yang      1   3.3
6                 #f2dc0f 2019-11-28 06:00:00  ...   Bloomberg      1   2.5
7                 #000000 2019-11-28 06:00:00  ...   Klobuchar      1   2.2
8                 #66ccff 2019-11-28 06:00:00  ...      Booker      1   1.8
9                 #666666 2019-11-28 06:00:00  ...      Steyer      1   1.7
10                #ff0074 2019-11-28 06:00:00  ...     Gabbard      1   1.3
11                #cc9900 2019-11-28 06:00:00  ...      Castro      1   1.2
12                #9966ff 2019-11-28 06:00:00  ...      Bennet      1   0.6
13                #10671b 2019-11-28 06:00:00  ...     Bullock      3   0.4
14                #990000 2019-11-28 06:00:00  ...     Patrick      3   0.4
15                #6672ff 2019-11-28 06:00:00  ...      Sestak      3   0.3
16                #009900 2019-11-27 06:00:00  ...       Biden      1  28.2
17                #457fff 2019-11-27 06:00:00  ...     Sanders      1  17.8
18                #996600 2019-11-27 06:00:00  ...      Warren      1  16.7
19                #990099 2019-11-27 06:00:00  ...   Buttigieg      1  10.5
20                #ff9900 2019-11-27 06:00:00  ...      Harris      1   3.8
21                #3da882 2019-11-27 06:00:00  ...        Yang      1   3.2
22                #f2dc0f 2019-11-27 06:00:00  ...   Bloomberg      1   2.4
23                #000000 2019-11-27 06:00:00  ...   Klobuchar      1   2.0
24                #66ccff 2019-11-27 06:00:00  ...      Booker      1   1.7
25                #666666 2019-11-27 06:00:00  ...      Steyer      1   1.7
26                #ff0074 2019-11-27 06:00:00  ...     Gabbard      1   1.5
27                #cc9900 2019-11-27 06:00:00  ...      Castro      1   1.0
28                #9966ff 2019-11-27 06:00:00  ...      Bennet      1   0.8
29                #10671b 2019-11-27 06:00:00  ...     Bullock      3   0.4
         ...      ...                 ...  ...         ...    ...   ...
5650              #996600 2018-12-10 06:00:00  ...      Warren      1   6.0
5651              #990099 2018-12-10 06:00:00  ...   Buttigieg      1   NaN
5652              #ff9900 2018-12-10 06:00:00  ...      Harris      1   5.3
5653              #3da882 2018-12-10 06:00:00  ...        Yang      1   NaN
5654              #f2dc0f 2018-12-10 06:00:00  ...   Bloomberg      1   NaN
5655              #000000 2018-12-10 06:00:00  ...   Klobuchar      1   NaN
5656              #66ccff 2018-12-10 06:00:00  ...      Booker      1   4.0
5657              #666666 2018-12-10 06:00:00  ...      Steyer    NaN   NaN
5658              #ff0074 2018-12-10 06:00:00  ...     Gabbard      1   NaN
5659              #cc9900 2018-12-10 06:00:00  ...      Castro      1   NaN
5660              #9966ff 2018-12-10 06:00:00  ...      Bennet      1   NaN
5661              #10671b 2018-12-10 06:00:00  ...     Bullock      3   NaN
5662              #990000 2018-12-10 06:00:00  ...     Patrick    NaN   NaN
5663              #6672ff 2018-12-10 06:00:00  ...      Sestak    NaN   NaN
5664              #009900 2018-12-09 06:00:00  ...       Biden      1  29.0
5665              #457fff 2018-12-09 06:00:00  ...     Sanders      1  17.7
5666              #996600 2018-12-09 06:00:00  ...      Warren      1   6.0
5667              #990099 2018-12-09 06:00:00  ...   Buttigieg      1   NaN
5668              #ff9900 2018-12-09 06:00:00  ...      Harris      1   5.3
5669              #3da882 2018-12-09 06:00:00  ...        Yang      1   NaN
5670              #f2dc0f 2018-12-09 06:00:00  ...   Bloomberg      1   NaN
5671              #000000 2018-12-09 06:00:00  ...   Klobuchar      1   NaN
5672              #66ccff 2018-12-09 06:00:00  ...      Booker      1   4.0
5673              #666666 2018-12-09 06:00:00  ...      Steyer    NaN   NaN
5674              #ff0074 2018-12-09 06:00:00  ...     Gabbard      1   NaN
5675              #cc9900 2018-12-09 06:00:00  ...      Castro      1   NaN
5676              #9966ff 2018-12-09 06:00:00  ...      Bennet      1   NaN
5677              #10671b 2018-12-09 06:00:00  ...     Bullock      3   NaN
5678              #990000 2018-12-09 06:00:00  ...     Patrick    NaN   NaN
5679              #6672ff 2018-12-09 06:00:00  ...      Sestak    NaN   NaN

[5680 rows x 7 columns]
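The `split`/`rsplit` step in the code above is what turns the callback-wrapped response into plain JSON. As a minimal sketch of that step in isolation: `strip_jsonp` is a hypothetical helper name, and the sample string is made up to resemble the response shape used above rather than copied from the site.

```python
import json


def strip_jsonp(text):
    """Return the parsed JSON from a JSONP string such as 'cb({...});'."""
    # Everything between the first '(' and the last ')' is the JSON document.
    start = text.index('(')
    end = text.rindex(')')
    return json.loads(text[start + 1:end])


# Hypothetical sample shaped like the response handled above (not real data).
sample = (
    'return_json({"poll": {"rcp_avg": [{"date": "2019-11-28", '
    '"candidate": [{"name": "Biden", "value": "27.0"}]}]}});'
)

data = strip_jsonp(sample)
print(data['poll']['rcp_avg'][0]['candidate'][0]['name'])
```

Using the first `(` and the last `)` keeps the helper safe when the JSON itself contains parentheses inside string values.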

As you can see, plotting this produces a chart that looks just like the one on the website:

# Convert columns to appropriate types for charting
results['value'] = results['value'].astype(float)
results['date'] = pd.to_datetime(results['date'])

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')
# Map each candidate name to the color the site uses for them
palette = pd.Series(results['color'].values, index=results['name']).to_dict()

sns.lineplot(data=results, x="date", y="value", hue="name", palette=palette)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
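The question asked for date: number pairs per candidate. One possible way to get them from the long-format `results` frame is a pivot. The sketch below is not part of the original answer: it rebuilds a tiny stand-in for `results` from four rows of the output shown earlier (with the time of day dropped for brevity) and assumes each (date, name) pair appears only once.

```python
import pandas as pd

# Miniature stand-in for `results`, using a few values from the output above.
results = pd.DataFrame({
    'date': ['2019-11-28', '2019-11-28', '2019-11-27', '2019-11-27'],
    'name': ['Biden', 'Sanders', 'Biden', 'Sanders'],
    'value': [27.0, 18.3, 28.2, 17.8],
})
results['date'] = pd.to_datetime(results['date'])

# One row per date, one column per candidate.
wide = results.pivot(index='date', columns='name', values='value').sort_index()

# date -> value pairs for a single candidate, skipping dates with no value.
biden = wide['Biden'].dropna().to_dict()

print(wide)
```

If the real data ever contained duplicate (date, name) pairs, `pivot` would raise an error, and `pivot_table` with an explicit aggregation would be the alternative.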

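【Discussion】:

  • Thank you so much (and Alexander too!). Answer accepted. It works great. Could you explain what payload does here? It seems to be the key step. Thanks again!
  • When you look at the developer tools, you will see that a query string is sometimes needed to get the response you want. Honestly, I am not sure it matters in this case, but I included it anyway.
  • Yes, I think I found the query with Chrome's Inspect option. Thanks again!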
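To make the role of `payload` more concrete, here is a small sketch of the query string it turns into, built with the standard library instead of `requests` so that no network access is needed. The timestamp is fixed to an arbitrary value for reproducibility. The reading that the timestamp acts as a cache-buster and that `callback` names the wrapper function is an assumption based on common JSONP usage, not something confirmed in the thread.

```python
from urllib.parse import urlencode

url = 'https://www.realclearpolitics.com/epolls/json/6730_historical.js'

# Arbitrary fixed timestamp for the example; the answer uses the current time.
timestamp = '1575010152000'

payload = {
    timestamp: '',              # the timestamp is the parameter name, with an empty value
    'callback': 'return_json',  # assumed to name the wrapper function in the response
}

# Roughly what the params argument of requests.get adds to the URL.
full_url = url + '?' + urlencode(payload)
print(full_url)
```

Comparing this URL with the one shown in the browser's network tab is a quick way to check that a script is reproducing the page's own request.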