Web scraping with lxml.html and XPath: answers

【Question title】: WebScraping with LXML.HTML and Xpath
【Posted】: 2019-05-03 15:52:36
【Question】:

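I am trying to extract information from a website, but unfortunately I only get part of it. I am having trouble finding an XPath that returns more than just the first element of the table. I used Chrome DevTools to get the XPath. How can I make the XPath more general so that I get the result I want? Or does anyone know a smarter way to do this? My goal is to end up with a JSON file.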

import requests
import lxml.html

html = requests.get('http://volcano.oregonstate.edu/volcano_table')
doc = lxml.html.fromstring(html.content)

volcanoes = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[1]/a/text()')
country = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[2]/text()')
latitude = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[4]/text()')
longitude = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[5]/text()')
elevation = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[6]/text()')

output = []
for info in zip(volcanoes, country, latitude, longitude, elevation):
    resp = {}
    resp['volcanoes'] = info[0]
    resp['country'] = info[1]
    resp['latitude'] = info[2]
    resp['longitude'] = info[3]
    resp['elevation'] = info[4]
    output.append(resp)

print(output)

This is what the code currently returns:

[{'volcanoes': 'Abu', 'country': '\n            Japan          ', 'latitude': '\n            34.50          ', 'longitude': '\n            131.60          ', 'elevation': '\n            641          '}]
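
The single-row result above is consistent with the `tr[1]` predicate in each expression, which restricts the match to the first row. The sketch below illustrates this on an invented two-row table (the `SAMPLE` markup is an assumption and is not the real page), comparing the pinned expression with one that drops the predicate:

```python
import lxml.html

# Invented two-row sample standing in for the real page (an assumption).
SAMPLE = """
<table>
  <tbody>
    <tr><td><a href="#">Abu</a></td><td> Japan </td></tr>
    <tr><td><a href="#">Acamarachi</a></td><td> Chile </td></tr>
  </tbody>
</table>
"""

doc = lxml.html.fromstring(SAMPLE)

# tr[1] pins the expression to the first row, so only one name comes back.
first_only = doc.xpath('//table/tbody/tr[1]/td[1]/a/text()')

# Without the [1] predicate on tr, td[1] of every row is matched.
all_names = doc.xpath('//table/tbody/tr/td[1]/a/text()')

print(first_only)
print(all_names)
```

The same change, removing `[1]` after `tr`, should apply to each of the five expressions in the question, although zipping five separate lists can misalign if any cell is empty.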

【Question comments】:

Tags: python-3.x web-scraping lxml

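【Solution 1】:

The XPaths you defined are error-prone: each one is pinned to tr[1], so it can only match the first row. I tried to improve them by looping over every row and querying the cells relative to that row. The following should give you what you need.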

import json
import requests
from lxml.html import fromstring

res = requests.get('http://volcano.oregonstate.edu/volcano_table')
root = fromstring(res.text)
data = []
# Loop over every body row of the table whose class starts with 'views-table'.
for item in root.xpath("//*[starts-with(@class,'views-table')]//tbody//tr"):
    # Text nodes directly inside the row's cells, in document order.
    # The indexes below are text-node positions, not column numbers.
    texts = item.xpath('.//td/text()')
    d = {}
    d['volcan'] = item.xpath('.//td/a/text()')[0].strip()
    d['country'] = texts[2].strip()
    d['lat'] = texts[4].strip()
    d['longitude'] = texts[5].strip()
    d['elevation'] = texts[6].strip()
    data.append(d)

# Serialize the list of row dictionaries as indented JSON.
print(json.dumps(data, indent=4))

The output should look like this:

[
    {
        "volcan": "Abu",
        "country": "Japan",
        "lat": "34.50",
        "longitude": "131.60",
        "elevation": "641"
    },
    {
        "volcan": "Acamarachi",
        "country": "Chile",
        "lat": "-23.30",
        "longitude": "-67.62",
        "elevation": "6046"
    },
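
Indexing into the flat list of text nodes works, but it depends on how many text nodes each cell happens to contain. One possible alternative, sketched here on an invented sample, addresses each cell by column position and lets XPath's `normalize-space()` trim the padding. The column order (name, country, type, latitude, longitude, elevation) is inferred from the `td[...]` indexes in the question. The `SAMPLE` markup, the "Type A"/"Type B" placeholder values, and the names `COLUMNS`, `parse_rows` and `save_json` are hypothetical.

```python
import json
import lxml.html

# Invented sample (an assumption). The third column holds placeholder values.
SAMPLE = """
<table class="views-table">
  <thead><tr><th>Name</th><th>Country</th><th>Type</th>
  <th>Lat</th><th>Long</th><th>Elev</th></tr></thead>
  <tbody>
    <tr><td><a href="#">Abu</a></td><td> Japan </td><td> Type A </td>
        <td> 34.50 </td><td> 131.60 </td><td> 641 </td></tr>
    <tr><td><a href="#">Acamarachi</a></td><td> Chile </td><td> Type B </td>
        <td> -23.30 </td><td> -67.62 </td><td> 6046 </td></tr>
  </tbody>
</table>
"""

# Map each output key to the 1-based column it is read from.
COLUMNS = {"volcano": 1, "country": 2, "latitude": 4, "longitude": 5, "elevation": 6}


def parse_rows(html_text):
    """Return one dict per data row, addressing cells by column position."""
    root = lxml.html.fromstring(html_text)
    rows = []
    # tr[td] keeps rows that contain data cells and skips header-only rows.
    for tr in root.xpath("//table[contains(@class, 'views-table')]//tr[td]"):
        # normalize-space() takes the cell's full string value (including text
        # inside nested tags such as <a>) and trims surrounding whitespace.
        rows.append({
            key: tr.xpath(f"normalize-space(td[{index}])")
            for key, index in COLUMNS.items()
        })
    return rows


def save_json(rows, path):
    """Write the rows to a JSON file (hypothetical helper, not called here)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, indent=4)


data = parse_rows(SAMPLE)
print(json.dumps(data, indent=4))
```

To run this against the live page, the text of the `requests.get(...)` response would be passed to `parse_rows` instead of `SAMPLE`, assuming the real table has the same column order. A call such as `save_json(data, "volcanoes.json")` would then produce the JSON file the question asks for.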

【Comments】: