Web-Scraping with Python on Canopy: Answers

[Question title]: Web-Scraping with Python on Canopy
[Posted]: 2016-09-16 02:45:59
[Question description]:

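I'm having trouble with the code below, which is meant to print the stock prices of 4 publicly traded companies. It runs without errors, but it prints only empty brackets where the stock prices should be. That is the source of my confusion.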

import urllib2
import re

symbolslist = ["aapl","spy","goog","nflx"]
i = 0

while i<len(symbolslist):
    url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"
    htmlfile = urllib2.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span stream='+symbolslist[i]+' streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "the price of", symbolslist[i], " is ", price
    i+=1

[Question comments]:

Tags: python web-scraping canopy

[Solution 1]:

Because you are not passing the variable:

 url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"
                                                         ^^^^^
                                                      a string not the list element

Use str.format:

url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbolslist[i])
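To make the difference concrete, here is a minimal sketch that builds both versions of the URL without fetching anything. The names broken_url and fixed_url are only illustrative; the parenthesized print calls are written so the snippet should run under Python 2 or Python 3.

```python
symbolslist = ["aapl", "spy", "goog", "nflx"]
i = 0

# The line from the question: the single quotes and plus signs sit inside the
# double-quoted string, so nothing is concatenated and the text is kept literally.
broken_url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"

# str.format replaces the {} placeholder with the list element.
fixed_url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbolslist[i])

print(broken_url)
print(fixed_url)
```

The first line printed still contains the text symbolslist[i], while the second ends in symb=aapl.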

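You can also iterate over the list directly, with no need for a while loop. Never parse HTML with a regex; use an HTML parser such as bs4. Your regex is also wrong: there is no stream="aapl" attribute or anything like it. What you want is the span with streamformat="ToHundredth" and streamfeed="SunGard":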

import urllib2
from bs4 import BeautifulSoup

symbolslist = ["aapl","spy","goog","nflx"]


for symbol in symbolslist:
    url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbol)
    htmlfile = urllib2.urlopen(url)
    soup = BeautifulSoup(htmlfile.read(), "html.parser")
    price = soup.find("span",streamformat="ToHundredth", streamfeed="SunGard").text
    print "the price of {} is {}".format(symbol,price)

You can see this if we run the code:

In [1]: import urllib2

In [2]: from bs4 import BeautifulSoup

In [3]: symbols_list = ["aapl", "spy", "goog", "nflx"]

In [4]: for symbol in symbols_list:
   ...:         url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbol)
   ...:         htmlfile = urllib2.urlopen(url)
   ...:         soup = BeautifulSoup(htmlfile.read(), "html.parser")
   ...:         price = soup.find("span",streamformat="ToHundredth", streamfeed="SunGard").text
   ...:         print "the price of {} is {}".format(symbol,price)
   ...:     
the price of aapl is 115.57
the price of spy is 215.28
the price of goog is 771.76
the price of nflx is 97.34

We get what you want.
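As for the empty brackets in the question, the following sketch may help show where they come from. It runs the question's pattern against a one-line HTML snippet; sample_html is a hypothetical stand-in modeled on the span described above, not the actual page source.

```python
import re

# Hypothetical stand-in for the page source; the real markup is an assumption.
sample_html = '<td><span streamformat="ToHundredth" streamfeed="SunGard">115.57</span></td>'

symbol = "aapl"
# The pattern from the question, which requires a stream=aapl attribute.
question_regex = ('<span stream=' + symbol +
                  ' streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>')

# re.findall returns an empty list when the pattern matches nothing.
matches = re.findall(question_regex, sample_html)
print(matches)  # []
```

Because re.findall returns an empty list rather than raising when nothing matches, the script finishes without an error and simply prints [] in place of a price.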

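[Comments]:

Now I get this error after trying your code. It flags line 11 and says: AttributeError: 'NoneType' object has no attribute 'text'
The code's output is included in the answer. If you get different output using the symbols from the answer, you are either using the code incorrectly or, for some reason, not getting the right page source. Telling me you got an attribute error without any more context makes it hard to debug.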
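One way to narrow down an error like this is to check the result of find before reading .text, since find returns None when no tag matches. The helper below is a hedged sketch, not part of the original answer: extract_price is a hypothetical name, and soup is assumed to be the BeautifulSoup object built in the answer's loop.

```python
def extract_price(soup):
    """Return the quote text, or None when the expected span is missing.

    `soup` is assumed to be a bs4.BeautifulSoup object built from the page source.
    """
    # find() returns None when nothing matches, so check before reading .text.
    tag = soup.find("span", streamformat="ToHundredth", streamfeed="SunGard")
    if tag is None:
        return None
    return tag.text
```

Inside the loop, a None result could be reported as a missing price for that symbol instead of raising, and printing the first few hundred characters of the downloaded source at that point would show whether the expected page actually came back.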