I'm having trouble with web scraping to Python: answers

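【Question title】: I'm having trouble with web scraping to Python
【Posted】: 2018-05-30 04:25:39
【Question】:

I'm very new to coding, and I tried to write code that imports the current price of Litecoin from coinmarketcap. However, I can't get it to work; it prints an empty list.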

import urllib
import re

htmlfile = urllib.urlopen('https://coinmarketcap.com/currencies/litecoin/')

htmltext = htmlfile.read()

regex = 'span class="text-large2" data-currency-value="">$304.08</span>'

pattern = re.compile(regex)

price = re.findall(pattern, htmltext)

print(price)

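The output is "[]". The problem is probably minor, but I'd really appreciate your help.

【Question comments】:

I did use single quotes in my code, but Stack Overflow immediately converted 'span class="text-large2" data-currency-value="">$304.08' to $304.08.
Regular expressions are generally not the best tool for processing HTML. I'd suggest looking at something like BeautifulSoup. Beyond that, your regex pattern probably isn't doing what you think it does. Check out the documentation.
It's also much easier than re.

Tags: python html web screen-scraping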

【Solution 1】:

Regular expressions are generally not the best tool for processing HTML. I'd suggest looking at something like BeautifulSoup.

For example:

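import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
import bs4

# Download the Litecoin page and parse it with the built-in HTML parser
f = urllib.urlopen("https://coinmarketcap.com/currencies/litecoin/")
soup = bs4.BeautifulSoup(f, "html.parser")
# Find the first tag that has a data-currency-value attribute and print its text
print(soup.find(attrs={"data-currency-value": True}).text)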

This currently prints "299.97".

For a case this simple, this is probably not much better than using re. However, see Using regular expressions to parse HTML: why not?
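
To try the parser-based idea without a network connection or third-party packages, here is a minimal sketch that swaps in the standard library's html.parser for BeautifulSoup. The sample markup and the class name CurrencyValueParser are assumptions made for illustration; the live page's HTML may differ. The sketch only handles a tag whose content is plain text, not nested tags.

```python
from html.parser import HTMLParser


class CurrencyValueParser(HTMLParser):
    """Collects the text of tags that carry a data-currency-value attribute."""

    def __init__(self):
        super().__init__()
        self._capturing = False  # True while inside a matching tag
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a valueless attribute has value None
        if any(name == "data-currency-value" for name, _ in attrs):
            self._capturing = True
            self.values.append("")

    def handle_endtag(self, tag):
        # Simplification: stop capturing at the next closing tag (no nesting support)
        self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.values[-1] += data


# Hypothetical sample markup, modeled on the span quoted in this thread
sample_html = (
    '<div><span class="text-large2" data-currency-value>299.97</span>'
    " <span>USD</span></div>"
)

parser = CurrencyValueParser()
parser.feed(sample_html)
print(parser.values)
```

Because the lookup keys on the attribute rather than on the exact text of the tag, it keeps working if the class name or the attribute order changes, which is the main argument for a parser over a regex here.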

【Comments】:

【Solution 2】:

You need to change your regex and add a group in parentheses to capture the value.

To match something like: <span class="text-large2" data-currency-value>300.59</span>, you need this regex:

regex = 'span class="text-large2" data-currency-value>(.*?)</span>'

The (.*?) group is used to capture the number.

You get:

['300.59']
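
The difference between the two patterns can be checked offline with a short sketch. The sample strings below are assumptions modeled on the snippets quoted in this thread, not the live page. It also shows one likely reason the pattern in the question, as displayed there, returns an empty list: an unescaped "$" in a regex means "end of string", so it cannot match a literal dollar sign in the middle of the text.

```python
import re

# Hypothetical sample markup, modeled on the snippet quoted in the answer
html = '<span class="text-large2" data-currency-value>300.59</span>'

# The question's pattern as displayed: the unescaped "$" is an end-of-string anchor
broken = re.compile(r'span class="text-large2" data-currency-value="">$304.08</span>')

# The answer's pattern: the lazy group (.*?) captures the text before </span>
fixed = re.compile(r'span class="text-large2" data-currency-value>(.*?)</span>')

# Even against text that looks identical to the pattern, the broken regex finds nothing
broken_matches = broken.findall(
    '<span class="text-large2" data-currency-value="">$304.08</span>'
)
# With a capture group, findall returns only the captured part of each match
fixed_matches = fixed.findall(html)

print(broken_matches)
print(fixed_matches)
```

If a literal dollar sign does need to be matched, it can be written as \$ in the pattern, or the literal part can be passed through re.escape.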

【Comments】: