【Question title】: Python lxml not picking up tags
【Posted】: 2016-03-17 17:04:57
【Question】:

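Hello, I'm trying to scrape CNN's primary results for this election season and do some machine learning on them. I'm using Python 3.5, and after some research I found I could do this with lxml, BeautifulSoup, and requests. After failing with BeautifulSoup (I tried to use XPath, without success), I tried lxml. On the Iowa primary page (and every state so far), CNN breaks the results down by county and by each candidate's percentage of the vote. Looking at the page's HTML, I saw that each county name is stored in an h2 tag that comes right after a div tag (with its class attributes), and so on. So I used a CSS selector to try to capture it (since the h2 always follows the county's div). The relevant HTML looks like this: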

<div class="race-results__county-header race-results__county-name section-header__column" data-reactid=".0.4.3.0.0.0.0.$0.0.$0">
    <h2 class="section-heading" data-reactid=".0.4.3.0.0.0.0.$0.0.$0.0">Adair</h2>
</div>

The code looks like this:

from lxml import html
import requests

page = requests.get('http://www.cnn.com/election/primaries/counties/ia/Rep').text
doc = html.fromstring(page)
link = doc.cssselect("div h2")
print(link)

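However, when I print link, I get absolutely nothing (just an empty list, []). Is this a problem with how the HTML is laid out, with the code, or with the parser? I'm using JetBrains' PyCharm, but I don't think that has anything to do with it. I'm pretty new to this, so any alternative approaches would be much appreciated.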

【Comments】:

    Tags: python html web-scraping lxml


    【Solution 1】:

    The problem is that the page does not contain the results you expect, because they are probably rendered by JavaScript.

    When I downloaded the content from the given URL, there were no <h2> elements, but I did find this message in it: Please enable JavaScript to view CNN's 2016 Election Center.

    You are not getting the data because it is not in the page.

    Don't be misled by the fact that your browser shows you <h2> elements: they are there because JavaScript put them in after the page loaded.
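
A quick way to see the difference is to count the tags in the raw response text rather than in the browser's inspector. The following is only a minimal sketch using the standard library: `RAW_PAGE` and `RENDERED_PAGE` are made-up stand-ins (not the real CNN markup), and `TagCounter` and `count_tag` are hypothetical helper names introduced for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given tag is opened in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == self.tag:
            self.count += 1


def count_tag(html_text, tag):
    """Return the number of opening `tag` elements found in html_text."""
    parser = TagCounter(tag)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Assumed stand-in for what the server sends before any JavaScript runs.
RAW_PAGE = """
<html><body>
  <div id="app"></div>
  <noscript>Please enable JavaScript to view CNN's 2016 Election Center.</noscript>
  <script src="bundle.js"></script>
</body></html>
"""

# Assumed stand-in for the DOM after the browser has executed the scripts.
RENDERED_PAGE = """
<html><body>
  <div class="race-results__county-header">
    <h2 class="section-heading">Adair</h2>
  </div>
</body></html>
"""

print(count_tag(RAW_PAGE, "h2"))       # no h2 in the raw response
print(count_tag(RENDERED_PAGE, "h2"))  # one h2 once scripts have run
```

Running the same count on `requests.get(url).text` for the real page should show whether the elements you are selecting exist before JavaScript runs.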

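    Tip: check which JSON files the page loads. Some of them will most likely provide ready-made data for your task. Using F12 in my web browser (and then refreshing the page), I saw many JSON files, some of which provide data about the candidates.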

    For example, the URL http://data.cnn.com/ELECTION/2016primary/candidates/can1187.json returns the following (shortened):

    {
      "candidateInfo": {
        "id": 1187,
        "fname": "Mike",
        "lname": "Huckabee",
        "party": "Rep",
        "rd": "1",
        "pd": "0",
        "td": "1",
        "d_nom": 1237,
        "inrace": true,
        "nominee": false,
        "rd_k": "1460",
        "td_k": 2472,
        "dpct": 0,
        "dpct_nom": 50,
        "states": [
          {
            "state": "Alabama",
            "code": "AL",
            "electiondate": "20160301",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Alaska",
            "code": "AK",
            "electiondate": "20160301",
            "primarytype": "caucus",
            "candidates": []
          },
          {
            "state": "Arizona",
            "code": "AZ",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "Arkansas",
            "code": "AR",
            "electiondate": "20160301",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Iowa",
            "code": "IA",
            "electiondate": "20160201",
            "primarytype": "caucus",
            "candidates": [
              {
                "id": 1187,
                "rd": "1",
                "pd": "0",
                "td": "1",
                "winner": false
              }
            ]
          },
          {
            "state": "Kansas",
            "code": "KS",
            "electiondate": "20160305",
            "primarytype": "caucus",
            "candidates": []
          },
          {
            "state": "Kentucky",
            "code": "KY",
            "electiondate": "20160305",
            "primarytype": "caucus",
            "candidates": []
          },
          {
            "state": "Louisiana",
            "code": "LA",
            "electiondate": "20160305",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Maine",
            "code": "ME",
            "electiondate": "20160305",
            "primarytype": "caucus",
            "candidates": []
          },
          {
            "state": "Maryland",
            "code": "MD",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "Massachusetts",
            "code": "MA",
            "electiondate": "20160301",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Michigan",
            "code": "MI",
            "electiondate": "20160308",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Minnesota",
            "code": "MN",
            "electiondate": "20160301",
            "primarytype": "caucus",
            "candidates": []
          },
          {
            "state": "Mississippi",
            "code": "MS",
            "electiondate": "20160308",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Missouri",
            "code": "MO",
            "electiondate": "20160315",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Montana",
            "code": "MT",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "Nebraska",
            "code": "NE",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "Nevada",
            "code": "NV",
            "electiondate": "20160223",
            "primarytype": "caucus",
            "candidates": []
          },
          {
            "state": "New Hampshire",
            "code": "NH",
            "electiondate": "20160209",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "New Jersey",
            "code": "NJ",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "New Mexico",
            "code": "NM",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "New York",
            "code": "NY",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "North Carolina",
            "code": "NC",
            "electiondate": "20160315",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "North Dakota",
            "code": "ND",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "Ohio",
            "code": "OH",
            "electiondate": "20160315",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Oklahoma",
            "code": "OK",
            "electiondate": "20160301",
            "primarytype": "primary",
            "candidates": []
          },
          {
            "state": "Oregon",
            "code": "OR",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "Virgin Islands",
            "code": "VI",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          },
          {
            "state": "Northern Marianas",
            "code": "MP",
            "electiondate": "",
            "primarytype": "",
            "candidates": []
          }
        ],
        "races": [
          {
            "status": "called",
            "code": "AR",
            "state": "Arkansas",
            "polltype": "exit",
            "primarytype": "primary",
            "cresults": true,
            "cmap": true,
            "xpoll": true,
            "electiondate": "20160301",
            "pctsrep": 100,
            "ts": 1457130949809,
            "racerank": 6,
            "winner": false,
            "vpct": 1,
            "pctDecimal": "1.2",
            "inc": false,
            "votes": 4703,
            "cvotes": "4,703",
            "rd": "0",
            "pd": "0",
            "sd": "0",
            "td": "0",
            "position": 13
          },
          {
            "status": "called",
            "code": "GA",
            "state": "Georgia",
            "polltype": "exit",
            "primarytype": "primary",
            "cresults": true,
            "cmap": true,
            "xpoll": true,
            "electiondate": "20160301",
            "pctsrep": 92,
            "ts": 1457130978961,
            "racerank": 8,
            "winner": false,
            "vpct": 0,
            "pctDecimal": "0.2",
            "inc": false,
            "votes": 2615,
            "cvotes": "2,615",
            "rd": "0",
            "pd": "0",
            "sd": "0",
            "td": "0",
            "position": 13
          },
          {
            "status": "called",
            "code": "TN",
            "state": "Tennessee",
            "polltype": "exit",
            "primarytype": "primary",
            "cresults": true,
            "cmap": true,
            "xpoll": true,
            "electiondate": "20160301",
            "pctsrep": 100,
            "ts": 1457131086792,
            "racerank": 7,
            "winner": false,
            "vpct": 0,
            "pctDecimal": "0.3",
            "inc": false,
            "votes": 2404,
            "cvotes": "2,404",
            "rd": "0",
            "pd": "0",
            "sd": "0",
            "td": "0",
            "position": 15
          },
          {
            "status": "called",
            "code": "IA",
            "state": "Iowa",
            "polltype": "entrance",
            "primarytype": "caucus",
            "cresults": true,
            "cmap": true,
            "xpoll": true,
            "electiondate": "20160201",
            "pctsrep": 99,
            "ts": 1454997428611,
            "racerank": 9,
            "winner": false,
            "vpct": 2,
            "pctDecimal": "1.8",
            "inc": false,
            "votes": 3345,
            "cvotes": "3,345",
            "rd": "1",
            "pd": "0",
            "sd": "1",
            "td": "1",
            "position": 14
          },
          {
            "status": "called",
            "code": "AL",
            "state": "Alabama",
            "polltype": "exit",
            "primarytype": "primary",
            "cresults": true,
            "cmap": true,
            "xpoll": true,
            "electiondate": "20160301",
            "pctsrep": 100,
            "ts": 1456958822650,
            "racerank": 8,
            "winner": false,
            "vpct": 0,
            "pctDecimal": "0.3",
            "inc": false,
            "votes": 2535,
            "cvotes": "2,535",
            "rd": "0",
            "pd": "0",
            "sd": "0",
            "td": "0",
            "position": 13
          }
        ],
        "lts": 1458233488340
      }
    }
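
Once you have a JSON document like the one above, no HTML parsing is needed at all. Here is a rough sketch of reading one state's result out of it. `SAMPLE` is a hand-trimmed copy of the response shown above (only a few fields kept), and `race_for_state` is a hypothetical helper name, not part of any library.

```python
import json

# Hand-trimmed excerpt of the response above; the real payload has more fields.
SAMPLE = """
{
  "candidateInfo": {
    "id": 1187,
    "fname": "Mike",
    "lname": "Huckabee",
    "party": "Rep",
    "races": [
      {"code": "AR", "state": "Arkansas", "votes": 4703,
       "pctDecimal": "1.2", "winner": false},
      {"code": "IA", "state": "Iowa", "votes": 3345,
       "pctDecimal": "1.8", "winner": false}
    ]
  }
}
"""


def race_for_state(payload, state_code):
    """Return the race entry whose "code" matches state_code, or None."""
    for race in payload["candidateInfo"]["races"]:
        if race["code"] == state_code:
            return race
    return None


payload = json.loads(SAMPLE)
iowa = race_for_state(payload, "IA")

# pctDecimal arrives as a string, so convert it before doing arithmetic.
print(iowa["state"], iowa["votes"], float(iowa["pctDecimal"]))
```

Against the live endpoint, `requests.get(url).json()` should give you the same kind of dictionary, assuming the URL still responds and keeps this layout.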
    

    【Comments】:
