【Question title】: Python HTML web scraping
【Posted】: 2019-01-20 18:14:52
【Question】:

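I am trying to write a Python program that parses the following page and extracts the card sub-brand and brand for a given card BIN: https://www.cardbinlist.com/search.html?bin=371793. The following code snippet retrieves the card type.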

import requests
from lxml import html

page = requests.get('https://www.cardbinlist.com/search.html?bin=371793')
tree = html.fromstring(page.content)
print("card type: ", tree.xpath("//td//following::td[7]")[0].text)

However, I am not sure how to use similar logic to get the brand. Given this markup:

<th>Brand (Financial Service)</th> 
<td><a href="/AMEX-bin-list.html" target="_blank">AMEX</a></td>

the expression

tree.xpath("//td//following::td[5]")[0].text

returns nothing.

【Comments】:

  • Learn enough about "lxml" and/or "tree.xpath" to try solving the problem yourself; if you get stuck, come back to SO for help.
  • xpath: /html/body/div/div/div[3]/table/tbody/tr[8]/td

Tags: python html web-scraping


【Solution 1】:

I suggest you go with BeautifulSoup, because CSS selectors are more convenient than XPath.

Using Beautiful Soup, the code for your problem would be:

import requests
from bs4 import BeautifulSoup    

page = requests.get('https://www.cardbinlist.com/search.html?bin=371793')
soup = BeautifulSoup(page.content, 'html.parser')
brand_parent = soup.find('th', string='Brand (Financial Service)')  # select the <th> element whose text is 'Brand (Financial Service)'
brand = brand_parent.find_next_sibling('td').text  # output: AMEX
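If you need more than one field, the same th/td pairing can be collected into a dict. This is only a sketch: `SAMPLE` is an inline stand-in for the live page, its second row is made up for illustration, and `table_to_dict` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Stand-in for page.content. The first row is the fragment from the question;
# the "Card Type" row is a made-up example, not real data from the site.
SAMPLE = """
<table>
  <tr>
    <th>Brand (Financial Service)</th>
    <td><a href="/AMEX-bin-list.html" target="_blank">AMEX</a></td>
  </tr>
  <tr><th>Card Type</th><td>CREDIT</td></tr>
</table>
"""

def table_to_dict(markup):
    """Map each <th> label to the text of the <td> that follows it."""
    soup = BeautifulSoup(markup, 'html.parser')
    fields = {}
    for th in soup.find_all('th'):
        td = th.find_next_sibling('td')
        if td is not None:  # skip headers that have no value cell
            fields[th.get_text(strip=True)] = td.get_text(strip=True)
    return fields

fields = table_to_dict(SAMPLE)
print(fields.get('Brand (Financial Service)'))  # AMEX
print(fields.get('No such header'))             # None
```

Using `fields.get(...)` avoids the AttributeError you would get from calling `.find_next_sibling` on None when a header is missing from the page.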

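If you want to use XPath,

change the XPath to //td//following::td[5]/a and try it. In the markup you posted, the brand text sits inside the <a> child, so the <td> element itself has no text and .text gives you nothing.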
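A positional index such as td[5] breaks as soon as the page adds or removes a row. Here is a minimal sketch of an alternative that anchors on the header text instead. It assumes the table markup looks like the fragment in the question; `SAMPLE` is an inline stand-in for the live page, so the real page may need a different expression.

```python
from lxml import html

# Stand-in for page.content, built from the fragment shown in the question.
SAMPLE = """
<table>
  <tr>
    <th>Brand (Financial Service)</th>
    <td><a href="/AMEX-bin-list.html" target="_blank">AMEX</a></td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# Find the <th> by its label, then take the first <td> sibling after it.
brand_td = tree.xpath(
    '//th[normalize-space()="Brand (Financial Service)"]'
    '/following-sibling::td[1]'
)[0]

# The text belongs to the nested <a>, so the <td> has no direct text.
print(brand_td.text)  # None

# text_content() gathers the text of the element and all its descendants.
brand = brand_td.text_content().strip()
print(brand)  # AMEX
```

text_content() also works when the value is not wrapped in a link, so the same expression can be reused for other rows.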

Read the following answers to help choose your scraping approach:

Xpath vs DOM vs BeautifulSoup vs lxml vs other Which is the fastest approach to parse a webpage?

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

【Comments】:
