How to scrape charts from a website with Python? (Answers)

【Question title】: How to scrape charts from a website with python?
【Posted】: 2017-02-13 08:34:27
【Question】:

Edit:

So I have saved the script code below to a text file, but extracting the data with re still returns nothing. My code is:

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")
pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL)
scripts = soup.find("script", text=pattern)
profile_text = pattern.search(scripts.text).group(1)
profile = json.loads(profile_text)

print profile["data"], profile["categories"]

I want to extract chart data from a website. Below is the source code of the chart.

  <script type="text/javascript">
    jQuery(function() {

    var chart1 = new Highcharts.Chart({

          chart: {
             renderTo: 'chart1',
              defaultSeriesType: 'column',
            borderWidth: 2
          },
          title: {
             text: 'Productions'
          },
          legend: {
            enabled: false
          },
          xAxis: [{
             categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016],

          }],
          yAxis: {
             min: 0,
             title: {
             text: 'Productions'
          }
          },

          series: [{
               name: 'Productions',
               data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]
               }]
       });
    });

    </script>

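The website has several similar charts, named "chart1", "chart2", and so on. I want to extract the following data, the categories line and the data line, for each chart: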

categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016]

data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]

【Question comments】:

I believe you can use selenium to do something like this, for example: stackoverflow.com/questions/10455130/…
Yes, I am using selenium to parse the HTML content. My code is: [code] req=urllib2.Request(productions_url, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'}); p=urllib2.urlopen(req); soup=BeautifulSoup(p.readlines()[0], 'html.parser') [/code]. My question is how to extract those two specific lines once I have parsed the HTML.
An HTML parser won't help you, because that is JavaScript. So you have to parse it yourself.

Tags: python graph screen-scraping

【Solution 1】:

Another approach is to query the Highcharts JavaScript library, as you would in the browser console, and pull the result out with Selenium.

import time
from selenium import webdriver

website = ""

driver = webdriver.Firefox()
driver.get(website)
time.sleep(5)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
data = [item[1] for item in temp]
print(data)

Depending on which chart and series you are trying to extract, your case may differ slightly.
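One way the case can differ is the shape of the points. Highcharts accepts series data as plain numbers, as [x, y] pairs, or as objects with a y key, and the chart in the question uses plain numbers, so `item[1]` would likely raise a TypeError there. The helper below is only a sketch of how the returned list could be normalized; the name `y_values` is made up for illustration and is not part of Selenium or Highcharts.

```python
def y_values(points):
    """Return the y value of each point, whichever shape it arrived in."""
    values = []
    for point in points:
        if isinstance(point, dict):
            # Object form, e.g. {"x": 1999, "y": 1}
            values.append(point.get("y"))
        elif isinstance(point, (list, tuple)):
            # Pair form, e.g. [1999, 1]; y is the second element
            values.append(point[1])
        else:
            # Plain number (or None for a missing point)
            values.append(point)
    return values


# Shapes that series[0].options.data could come back as
print(y_values([1, 1, 0, 1]))                     # [1, 1, 0, 1]
print(y_values([[1999, 1], [2000, 1]]))           # [1, 1]
print(y_values([{"x": 1999, "y": 1}, {"y": 6}]))  # [1, 6]
```

With this in place, `data = y_values(temp)` would stand in for the list comprehension above, assuming `temp` is the list returned by `execute_script`.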

【Comments】:

This should be the accepted answer! Simpler and more intuitive.

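【Solution 2】:

I would use a combination of a regex and a YAML parser. The following is quick and dirty; you may need to tweak the regex, but it works for the example: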

import re
import sys
import yaml

# Capture the chart variable name and the object literal passed to Highcharts.Chart
chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);$',
        re.MULTILINE | re.DOTALL)

# Read the JavaScript source from standard input
script = sys.stdin.read()

m = chart_matcher.findall(script)

for name, data in m:
    print(name)
    try:
        # Parse the object literal as YAML
        chart = yaml.safe_load(data)
        print("categories:", chart['xAxis'][0]['categories'])
        print("data:", chart['series'][0]['data'])
    except Exception as e:
        print(e)

This requires the yaml library (pip install PyYAML). You should use BeautifulSoup to extract the right <script> tag before passing it to the regex.
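As a quick check of why a YAML parser is used here rather than the standard json module: the object handed to Highcharts.Chart has unquoted keys and single-quoted strings, and json.loads rejects both. The snippet below is a small illustration using a cut-down object written for this example, not the full chart from the question.

```python
import json

# A cut-down version of the object passed to Highcharts.Chart
js_object = "{ title: { text: 'Productions' }, series: [{ data: [1, 1, 0] }] }"

try:
    json.loads(js_object)
    parsed_as_json = True
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError
    parsed_as_json = False

print("valid JSON:", parsed_as_json)  # valid JSON: False

# The same data written as strict JSON parses without trouble
strict = '{"title": {"text": "Productions"}, "series": [{"data": [1, 1, 0]}]}'
print(json.loads(strict)["series"][0]["data"])  # [1, 1, 0]
```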

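Edit - full example

Sorry, I wasn't clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because the declaration isn't valid JSON, but a plain JavaScript object declaration (i.e. one with no functions) is a subset of YAML.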

from bs4 import BeautifulSoup
import yaml
import re

with open('source_test_script.txt', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories:", charts[name]['xAxis'][0]['categories'])
    print("data:", charts[name]['series'][0]['data'])
    print()

Output:

## chart1 ##
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]

Note that I had to tweak the regex so that it handles the unicode output and whitespace coming from BeautifulSoup; in my original example I simply piped your source straight into the regex.
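The effect of that tweak can be seen on a small indented snippet. This is only a sketch with a shortened stand-in for the page source: the pattern anchored with `^` and `$` finds nothing once `var` is preceded by whitespace, while the unanchored one still matches.

```python
import re

# Shortened, indented stand-in for the text of the <script> tag
source = """
    jQuery(function() {
    var chart1 = new Highcharts.Chart({
          series: [{
               name: 'Productions',
               data: [1,1,0]
               }]
       });
    });
"""

flags = re.MULTILINE | re.DOTALL
anchored = re.compile(r"^var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);$", flags)
relaxed = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);", flags)

# "^var" requires "var" at the very start of a line, so the indent defeats it
print(len(anchored.findall(source)))      # 0

matches = relaxed.findall(source)
print([name for name, _ in matches])      # ['chart1']
```

An anchored pattern that returns no matches on indented source would be consistent with the empty result described in the question.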

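Edit 2 - without yaml

Given that the JavaScript looks partially generated, the best you can hope for is to grab the lines. It's not elegant, but it will probably work for you.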

from bs4 import BeautifulSoup
import json
import re

with open('citec.repec.org_p_c_pcl20.html', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

for tag in soup.find_all('script'):

    script = tag.text

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        # start a fresh dict per chart so charts do not share values
        values = {}
        for line in obj_declaration.split('\n'):
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    # everything after "field:" should be a JSON array
                    data = line[len(field)+1:]
                    try:
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))

        charts[name] = values

print(charts)

Note that chart7 fails because it references another variable.
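The line-grabbing step can also be pulled into a small function that reports failures instead of printing them. This is a rough sketch under two assumptions: each field sits on its own line, as in the question's source, and a trailing comma inside an array should be tolerated. The names `grab_fields` and `seriesData` are invented for the example.

```python
import json
import re


def grab_fields(obj_declaration, fields=("data", "categories")):
    """Pull `field: [...]` lines out of one chart declaration."""
    values = {}
    failures = []
    for line in obj_declaration.split("\n"):
        line = line.strip("\t\n ,;")
        for field in fields:
            if not line.startswith(field + ":"):
                continue
            raw = line[len(field) + 1:].strip()
            # Drop a trailing comma inside the array: [1,2,] -> [1,2]
            raw = re.sub(r",\s*\]", "]", raw)
            try:
                values[field] = json.loads(raw)
            except ValueError:
                # e.g. the value refers to another JavaScript variable
                failures.append((field, raw))
    return values, failures


declaration = """{
    xAxis: [{
        categories: [1999,2000,2001],
    }],
    series: [{
        name: 'Productions',
        data: [1,1,0,]
    }]
}"""

print(grab_fields(declaration))
# A value that points at a (hypothetical) variable is reported, not parsed
print(grab_fields("{\n    data: seriesData\n}"))
```

Returning the failures alongside the values makes it easier to see which charts, like chart7, need separate handling.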

【Comments】:

So I have saved the script code below to a text file, but extracting the data with re still returns nothing. My code is: file_object = open('source_test_script.txt', mode="r") soup = BeautifulSoup(file_object, "html.parser") pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL) scripts = soup.find("script", text=pattern) profile_text = pattern.search(scripts.text).group(1) profile = json.loads(profile_text) print profile["data"], profile["categories"]
I tried the code as you suggested, but I keep getting the following: "Failed to parse chart1: while parsing a flow mapping in "", line 29, column 16: tooltip: { ^ expected ',' or '}', but got '{'"
You may still want to use yaml.safe_load instead of json.loads, since it is more forgiving of bad input (for example, chart3 has a trailing comma in an array).
The json.loads code works now, but the yaml code still gives me the same error...