Parsing HTML with Python requests

【Question title】: Parsing HTML with python request
【Posted】: 2019-01-18 04:17:19
【Question】:

I'm not a programmer, but I need to implement a simple HTML parser.

After some quick research, I was able to put this together from an example I found:

from lxml import html
import requests

page = requests.get('https://URL.COM')
tree = html.fromstring(page.content)

#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

How can I use tree.xpath to parse out all words that end with ".com.br" and start with "://"?

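【Comments】:

Can you add a dummy HTML snippet?
Why do you need to implement this yourself? Just use bs4. You need an external library anyway, so why not use bs4 instead of lxml?
That's not how xpath parsing works - you parse by document structure first, not by content! If the "words that end with .com.br and start with ://" are actually links (<a href="..."> tags), you can use xpath to extract all the links and then filter for the ones you want.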

Tags: python parsing html-parsing

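【Solution 1】:

As @nosklo points out in the comments, you are looking for <a> tags and the links in their href attributes. The parse tree is organized by the HTML elements themselves, so you find text by searching for those elements specifically. For URLs, that looks like this (using the lxml library in Python 3.6):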

from lxml import etree
from io import StringIO
import requests

# Set explicit HTMLParser
parser = etree.HTMLParser()

page = requests.get('https://URL.COM')

# Decode the page content from bytes to string
html = page.content.decode("utf-8")

# Create your etree with a StringIO object which functions similarly
# to a fileHandler
tree = etree.parse(StringIO(html), parser=parser)

# Call this function and pass in your tree
def get_links(tree):
    # This will get the anchor tags <a href...>
    refs = tree.xpath("//a")
    # Get the url from the ref
    links = [link.get('href', '') for link in refs]
    # Return only the links whose href string ends with .com.br
    return [link for link in links if link.endswith('.com.br')]


# Example call
links = get_links(tree)
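
Note that an endswith check on the whole href only matches links whose string literally ends in ".com.br", so something like "https://example.com.br/page" would be dropped. If the intent is to match on the host name instead, a minimal sketch could look like the following. It assumes that "ends with .com.br and starts with ://" means absolute URLs on a .com.br host; the function name filter_com_br and the sample hrefs are hypothetical and only for illustration.

```python
from urllib.parse import urlparse


def filter_com_br(hrefs):
    """Keep absolute URLs whose host name ends with '.com.br'."""
    kept = []
    for href in hrefs:
        parsed = urlparse(href)
        # Relative links, empty hrefs and mailto: links have no
        # scheme or no host name, so they are skipped.
        if not parsed.scheme or not parsed.hostname:
            continue
        if parsed.hostname.endswith('.com.br'):
            kept.append(href)
    return kept


# Hypothetical href values, such as an xpath("//a") pass might collect
sample = [
    'https://loja.example.com.br',
    'https://loja.example.com.br/produtos?id=1',
    'http://example.com/about',
    '/contato',
    '',
]
print(filter_com_br(sample))
```

To use this with the code above, you would return the unfiltered links list from get_links and pass it to filter_com_br.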

【Comments】: