Python HTML web scraping: answers

【Question title】: Python HTML web scraping
【Posted】: 2019-01-20 18:14:52
【Question description】:

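I'm trying to write a Python program that parses the following page and extracts the card sub-brand and brand for a given card BIN: https://www.cardbinlist.com/search.html?bin=371793. The following code snippet retrieves the card type.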

import requests
from lxml import html

page = requests.get('https://www.cardbinlist.com/search.html?bin=371793')
tree = html.fromstring(page.content)
print("card type: ", tree.xpath("//td//following::td[7]")[0].text)

However, I'm not sure how to use similar logic to get the brand, given

<th>Brand (Financial Service)</th> 
<td><a href="/AMEX-bin-list.html" target="_blank">AMEX</a></td>

and

tree.xpath("//td//following::td[5]")[0].text

returns nothing.

【Comments】:

Learn enough about "lxml" and/or "tree.xpath" to try to solve the problem yourself; if you get stuck, come back to SO for help.
The xpath is /html/body/div/div/div[3]/table/tbody/tr[8]/td

Tags: python html web-scraping

【Solution 1】:

I suggest you go with BeautifulSoup, because CSS selectors are more convenient than XPath expressions.

Using Beautiful Soup, the code for your problem would be:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.cardbinlist.com/search.html?bin=371793')
soup = BeautifulSoup(page.content, 'html.parser')
# Select the <th> element whose text is 'Brand (Financial Service)'
brand_parent = soup.find('th', string='Brand (Financial Service)')
# The brand is in the <td> that follows that <th>
brand = brand_parent.find_next_sibling('td').text  # output: AMEX
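
If you want to try this without hitting the site, here is a minimal offline sketch of the same lookup. It assumes the markup looks like the fragment in the question; the surrounding <table>/<tr> wrapper and the helper name field_value are my own additions for illustration, not something taken from the page.

```python
from bs4 import BeautifulSoup

# The <th>/<td> pair is copied from the question; the <table>/<tr> wrapper
# around it is an assumption about how the live page nests these cells.
SAMPLE_HTML = """
<table>
  <tr>
    <th>Brand (Financial Service)</th>
    <td><a href="/AMEX-bin-list.html" target="_blank">AMEX</a></td>
  </tr>
</table>
"""


def field_value(markup, label):
    """Return the text of the <td> following the <th> whose text is `label`.

    Hypothetical helper: returns None when the row or the cell is missing,
    instead of raising AttributeError on a None result.
    """
    soup = BeautifulSoup(markup, "html.parser")
    header = soup.find("th", string=label)
    if header is None:
        return None
    cell = header.find_next_sibling("td")
    if cell is None:
        return None
    # get_text() collects text from nested tags such as the <a> link.
    return cell.get_text(strip=True)


print(field_value(SAMPLE_HTML, "Brand (Financial Service)"))  # AMEX
print(field_value(SAMPLE_HTML, "No such row"))  # None
```

Looking the row up by its header text rather than by position should keep working if the site reorders or adds rows, as long as the label itself stays the same.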

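If you want to stick with XPath,

change the XPath to //td//following::td[5]/a and try again. In the markup you posted, the brand text sits inside the <a> element, so the <td> itself has no text of its own.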
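
As a rough sketch of an alternative to counting cells, you can anchor the XPath on the header text instead. This runs against the fragment from the question only; the <table>/<tr> wrapper and the helper name brand_text are assumptions for illustration, and the live page may differ.

```python
from lxml import html

# The <th>/<td> pair is copied from the question; the wrapper is assumed.
SAMPLE_HTML = """
<table>
  <tr>
    <th>Brand (Financial Service)</th>
    <td><a href="/AMEX-bin-list.html" target="_blank">AMEX</a></td>
  </tr>
</table>
"""


def brand_text(tree, label="Brand (Financial Service)"):
    """Return the text of the first <td> after the <th> matching `label`.

    Hypothetical helper: returns None if no such row exists.
    """
    # $label is an XPath variable; lxml fills it from the keyword argument.
    cells = tree.xpath(
        "//th[normalize-space()=$label]/following-sibling::td[1]",
        label=label,
    )
    if not cells:
        return None
    # text_content() includes text from child elements such as <a>.
    return cells[0].text_content().strip()


tree = html.fromstring(SAMPLE_HTML)
cell = tree.xpath("//th/following-sibling::td[1]")[0]

print(cell.text)         # None: the <td> starts directly with <a>
print(brand_text(tree))  # AMEX
```

The first print shows why .text on the <td> comes back empty for this fragment: .text only covers text before the first child element, and here the <a> comes first.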

Read the following answers to help choose your scraping approach:

Xpath vs DOM vs BeautifulSoup vs lxml vs other Which is the fastest approach to parse a webpage?

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

【Discussion】: