AttributeError: 'NoneType' object has no attribute 'strip' with Python WebCrawler: answers

【Question title】: AttributeError: 'NoneType' object has no attribute 'strip' with Python WebCrawler
【Posted】: 2011-08-02 21:48:33
【Question description】:

I am writing a Python program to crawl Twitter using a combination of urllib2, the Python Twitter wrapper for the API, and BeautifulSoup. However, when I run my program, I get an error of the following type:

Ray Kruger
Rafael Nadal

Traceback (most recent call last):
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 78, in <module>
    crawl(start_follower, output, depth)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
    crawl(y, output, in_depth - 1)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
    crawl(y, output, in_depth - 1)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 64, in crawl
    request = urllib2.Request(new_url)
  File "C:\Python27\lib\urllib2.py", line 192, in __init__
    self.__original = unwrap(url)
  File "C:\Python27\lib\urllib.py", line 1038, in unwrap
    url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

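I am completely unfamiliar with this type of error (I'm new to Python), and searching for it online turned up very little. I have attached my code below as well. Do you have any suggestions?

Thanks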

import twitter
import urllib
import urllib2
import htmllib
from BeautifulSoup import BeautifulSoup
import re

start_follower = "NYTimeskrugman" 
depth = 3
output = open(r'C:\Python27\outputtest.txt', 'a') #better to use SQL database than this

api = twitter.Api()

#want to also begin entire crawl with some sort of authentication service 

def site(follower):
    followersite = "http://mobile.twitter.com/" + follower
    return followersite

def getPage(follower): 
    thisfollowersite = site(follower)
    request = urllib2.Request(thisfollowersite)
    response = urllib2.urlopen(request)
    return response

def getSoup(response): 
    html = response.read()
    soup = BeautifulSoup(html)
    return soup

def get_more_tweets(soup): 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

def recordlinks(soup,output):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags: 
        a = tag.renderContents()
        b = str (a)
        output.write(b)
        output.write('\n\n')

def checkforstamp(soup):
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        if str(stamp) == '3 months ago':
            return True

def crawl(follower, output, in_depth):
    if in_depth > 0:
        output.write(follower)
        a = getPage(follower)
        new_soup = getSoup(a)
        recordlinks(new_soup, output)
        currenttime = False 
        while currenttime == False:
            new_url = get_more_tweets(new_soup)
            request = urllib2.Request(new_url)
            response = urllib2.urlopen(request)
            new_soup = getSoup(response)
            recordlinks(new_soup, output)
            currenttime = checkforstamp(new_soup)
        users = api.GetFriends(follower)
        for u in users[0:5]:
            x = u.screen_name 
            y = str(x)
            print y
            crawl(y, output, in_depth - 1)
            output.write('\n\n')
        output.write('\n\n\n')

crawl(start_follower, output, depth)
print("Program done. Look at output file.")

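【Question discussion】:

The crawler works by first identifying a follower and using Beautiful Soup to parse his/her page until I hit tweets that are 3 months old. It then moves on to the first five followers of each follower, and so on, repeating the same process until it reaches the depth I specified.

Tags: python html twitter web-crawler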

【Solution 1】:

AttributeError: 'NoneType' object has no attribute 'strip'

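It means exactly what it says: url.strip() first has to work out what url.strip is, i.e. look up the strip attribute of url. That lookup fails because url is a 'NoneType' object, i.e. an object whose type is NoneType, i.e. the special object None.

Presumably url was supposed to be a str, i.e. a text string, since those do have a strip attribute.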
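To see the failure in isolation, here is a minimal sketch that calls strip on a str and on None. The helper name try_strip is made up for illustration and is not part of urllib.

```python
def try_strip(url):
    """Call url.strip() and report what happened (hypothetical helper)."""
    try:
        return url.strip()
    except AttributeError as exc:
        # None has no 'strip' attribute, so the attribute lookup itself fails.
        return "AttributeError: %s" % exc

print(try_strip("  http://mobile.twitter.com/  "))
print(try_strip(None))
```

The first call succeeds because str objects define strip; the second reports the same kind of AttributeError as the traceback in the question.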

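This happened in urllib.py (the last frame of the traceback), i.e. inside the urllib module. That is not your code, so we walk back up the exception traceback until we find something we wrote: request = urllib2.Request(new_url). We can only assume that the new_url we pass to the urllib2 module ends up as the url variable somewhere inside urllib.

So where does new_url come from? We look up the offending line of code (note that the exception traceback gives a line number), and we see that the line just before it is new_url = get_more_tweets(new_soup), so we are using the result of get_more_tweets.

Analysing that function shows that it searches through some links, tries to find one labelled "more", and gives us the URL of the first such link it finds. The case we did not consider is the one where there is no such link. In that case the function simply reaches its end and implicitly returns None (this is how Python handles a function that reaches the end without an explicit return, because Python has no specification of return types and a call must always produce a value), and that is where the value comes from.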
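The implicit return can be shown with a small sketch. The function first_more_link below is a hypothetical stand-in for get_more_tweets that works on plain strings instead of parsed links.

```python
def first_more_link(link_texts):
    """Return the first entry equal to 'more' (hypothetical stand-in for get_more_tweets)."""
    for text in link_texts:
        if text == 'more':
            return text
    # No explicit return here: falling off the end returns None.

print(first_more_link(['home', 'more']))
print(first_more_link(['home', 'about']))
print(first_more_link([]))
```

Only the first call reaches the return statement. The other two fall off the end of the function, so the caller gets None.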

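Presumably, if there is no "more" link, then we should not try to follow one at all. So we fix the error by explicitly checking for this None return value and skipping urllib2.Request in that case, since there is no link to follow.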
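A rough sketch of that check, under the assumption that pages can be simulated in memory: FAKE_PAGES, fetch and collect_tweets are all invented names, and fetch stands in for the urllib2 request plus BeautifulSoup parsing so that the example runs without network access.

```python
# Hypothetical in-memory "site": each URL maps to
# (tweets on that page, URL of the next page or None).
FAKE_PAGES = {
    "/user": (["tweet 1", "tweet 2"], "/user?page=2"),
    "/user?page=2": (["tweet 3"], None),  # last page: no "more" link
}

def fetch(url):
    """Stand-in for the request and parsing steps; returns (tweets, next_url)."""
    return FAKE_PAGES[url]

def collect_tweets(start_url):
    """Follow "more" links until there is none, collecting tweets on the way."""
    collected = []
    url = start_url
    while url is not None:  # stop instead of requesting None
        tweets, next_url = fetch(url)
        collected.extend(tweets)
        url = next_url
    return collected

print(collect_tweets("/user"))
```

In the question's crawl() the equivalent guard would be a test such as `if new_url is None: break` placed before urllib2.Request(new_url).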

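By the way, for a currenttime that has not been determined yet, this None value would be a more idiomatic "placeholder" than the False value you are currently using. You might also consider being more consistent about separating words with underscores in your variable and method names, to make things easier to read. :)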
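One possible shape for this, as a sketch only: check_for_stamp is a hypothetical variant of checkforstamp that takes a list of timestamp strings and always returns True or False, and found_old_tweet starts out as None to mean "not determined yet".

```python
def check_for_stamp(stamps, marker='3 months ago'):
    """Explicitly return True or False (hypothetical variant of checkforstamp)."""
    return marker in stamps

# None and False are different values; comparing them with == yields False,
# so a loop written as `while flag == False` stops as soon as flag becomes None.
print(None == False)

found_old_tweet = None  # None means "not determined yet"
for stamps in (['1 day ago'], ['3 months ago']):
    found_old_tweet = check_for_stamp(stamps)
    if found_old_tweet:
        break
print(found_old_tweet)
```

Because checkforstamp in the question returns None when no stamp matches, its `while currenttime == False` loop would likely exit after a single pass. Returning an explicit False avoids that ambiguity.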

【Discussion】:

【Solution 2】:

When you do

request = urllib2.Request(new_url)

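in crawl(), new_url is None. Since you get new_url from get_more_tweets(new_soup), this means get_more_tweets() is returning None.

That means return d is never reached, which in turn means either that str(b) == 'more' is never true, or that soup.findAll() did not return any links, so for link in links does nothing.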
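To tell those two cases apart, a diagnostic along the following lines might help. It is a sketch that uses Python 3's standard html.parser instead of BeautifulSoup so that it runs without extra packages; AnchorCollector and diagnose are invented names.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the anchor currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.anchors.append((self._href, ''.join(self._text)))
            self._href = None

def diagnose(html):
    """Return the href of the 'more' link, or say which case made the lookup fail."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    if not parser.anchors:
        return 'no links at all'
    for href, text in parser.anchors:
        if text == 'more':
            return href
    return "links found, but none with the text 'more'"

print(diagnose('<p>hi</p>'))
print(diagnose('<a href="/x">home</a>'))
print(diagnose('<a href="/u?page=2" id="more_link">more</a>'))
```

The comparison is exact, as in the question, so surrounding whitespace or different capitalization in the link text would also land in the second case.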

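【Discussion】:

Thanks! I just realized that, the way I wrote my code, I assumed every Twitter user would have more than one page of tweets. However, that does not seem to be the case for the fourth person I hit after crawling the first three. So when I reached the fourth user and my crawler tried to find the "more" link that serves up more tweets, there was none. It then returned None, which causes the eventual error. I'll try to account for this in my code and keep you posted.
Scratch that. I just realized it is the second user, Rafael Nadal, who is new to Twitter and therefore has only one page of tweets... ha!

【Solution 3】:

When you do request = urllib2.Request(new_url), new_url should be a string, and this error says it is None.

You get the value of new_url from the get_more_tweets function, so somewhere it returned None.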

def get_more_tweets(soup): 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

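Looking at this code, the function only returns a value when str(b) == 'more' holds for some link, so your question becomes "why is str(b) == 'more' never true?".

【Discussion】:

【Solution 4】:

You are passing None instead of a string to urllib2.Request(). Looking at the code, this means new_url is sometimes None. Looking at your get_more_tweets() function, which is where this variable comes from, we see: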

def get_more_tweets(soup): 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

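This function only returns a value when b is "more", because your return statement is indented under your if. If b equals anything else, nothing is returned explicitly (i.e. the result is None).

You need to either always return a valid URL here, or check for the None return value before passing it to urllib2.Request().

【Discussion】: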