【问题标题】:Should not visit the same url不应访问相同的网址
【发布时间】:2012-04-16 09:39:20
【问题描述】:

我是 python 新手,我正在开发一个网络爬虫,下面是从给定 url 获取链接的程序,但问题是我不希望它访问已经访问过的相同 url。请帮帮我。

import re
import urllib.request
import sqlite3
db = sqlite3.connect('test2.db')
db.row_factory = sqlite3.Row
db.execute('drop table if exists test')
db.execute('create table test(id INTEGER PRIMARY KEY,url text)')
#linksList = []
#module to vsit the given url and get the all links in that page
def get_links(urlparse):
        try:
            if urlparse.find('.msi') ==-1: #check whether the url contains .msi extensions
                htmlSource = urllib.request.urlopen(urlparse).read().decode("iso-8859-1")  
                #parsing htmlSource and finding all anchor tags
                linksList = re.findall('<a href=(.*?)>.*?</a>',htmlSource) #returns href and other attributes of a tag
                for link in linksList: 
                    start_quote = link.find('"') # setting start point in the link 
                    end_quote = link.find('"', start_quote + 1) #setting end point in the link
                    url = link[start_quote + 1:end_quote] # get the string between start_quote and end_quote
                    def concate(url): #since few href may return only /contact or /about so concatenating its baseurl
                        if url.find('http://'):
                            url = (urlparse) + url
                            return url
                        else:
                            return url
                    url_after_concate = concate(url)
#                    linksList.append(url_after_concate)
                    try:
                        if url_after_concate.find('.tar.bz') == -1: # skipping links which containts link to some softwares or downloads page
                            db.execute('insert or ignore into test(url) values (?)', [url_after_concate])
                    except:
                        print("insertion failed")
            else:
                return True
        except:
            print("failed")
get_links('http://www.python.org')     
cursor = db.execute('select * from test')
for row in cursor: # retrieve the links stored in database
    print (row['id'],row['url'])
    urlparse = row['url']
#    print(linksList)
#    if urlparse in linksList == -1:
    try:
        get_links(urlparse) # again parse the link from database
    except:
        print ("url error")

请给我建议解决问题的方法。

【问题讨论】:

  • 几厘米。您的函数有太多的嵌套级别。将concate 函数移除到get_links 之外。此外,它是“连接”。不要使用正则表达式来解析 HTML。使用像 BeautifulSoup 这样的库。不要使用 catch all except: 吞下异常并且不打印任何诊断信息。
  • 我要继续问:您是否考虑过使用递归 Web 下载器 wget,然后处理 wget 为您检索的内容?
  • @Li-aungYip 不,先生,我没有使用它。但我认为 wget 是从给定的 url 中获取一些内容。在这里我只对获取所有href的值感兴趣。

标签: python urllib web-crawler


【解决方案1】:

您应该有一个“访问过”页面的列表。当您请求下一个 url 时,您可以检查列表是否已经包含该 url,如果是,则跳过它。我不是 python 程序员,所以这里有一些 peusdo 代码

Create listOfVisitedUrls
...
Start Loop
Get nextUrl
If nextUrl IsNotIn listOfVisitedUrls Then
    Request nextUrl
    Add nextUrl to listOfVisitedUrls
End If
Loop

【讨论】:

  • 不是一个list,而是一个set。排序无关紧要,您需要快速查找。 Python 有一个原始的 set() 类型,您可以使用它。
【解决方案2】:

您可以使用以下代码:

import re
from urllib import urlopen

# Since few href may return only /contact or /about, concatenate to baseurl.
def concat(url, baseurl): 
    if url.find('http://'):
        url = baseurl + url
        return url
    else:
        return url

def get_links(baseurl):
    resulting_urls = set()
    try:
        # Check whether the url contains .msi extensions.
        if baseurl.find('.msi') == -1:
            # Parse htmlSource and find all anchor tags.
            htmlSource = urlopen(baseurl).read()
            htmlSource = htmlSource.decode("iso-8859-1")

            # Returns href and other attributes of a tag.
            linksList = re.findall('<a href=(.*?)>.*?</a>',htmlSource)

            for link in linksList:
                # Setting start and end points in the link.
                start_quote = link.find('"')
                end_quote = link.find('"', start_quote + 1)

                # Get the string between start_quote and end_quote.
                url = link[start_quote + 1:end_quote]

                url_after_concat = concat(url, baseurl)
                resulting_urls.add(url_after_concat)

        else:
            return True

    except:
        print("failed")

    return resulting_urls

get_links('http://www.python.org')

它将返回一个set(),其中包含您的baseurl 的唯一URL;对于`http://www.python.org',你应该得到:

set([u'http://www.python.org/download/',
 u'http://docs.python.org/',
 u'http://www.python.org#left-hand-navigation',
 u'http://wiki.python.org/moin/PyQt',
 u'http://wiki.python.org/moin/DatabaseProgramming/',
 u'http://roundup.sourceforge.net/',
 u'http://www.python.org/ftp/python/3.2.3/Python-3.2.3.tar.bz2',
 u'http://www.python.org/about/website',
 u'http://www.python.org/about/quotes',
 u'http://www.python.org/community/jobs/',
 u'http://www.python.org/psf/donations/',
 u'http://www.python.org/about/help/',
 u'http://wiki.python.org/moin/CgiScripts',
 u'http://www.zope.org/',
 u'http://www.pygame.org/news.html',
 u'http://pypi.python.org/pypi',
 u'http://wiki.python.org/moin/Python2orPython3',
 u'http://www.python.org/download/releases/2.7.3/',
 u'http://www.python.org/ftp/python/3.2.3/python-3.2.3.msi',
 u'http://www.python.org/community/',
 u'http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2',
 u'http://wiki.python.org/moin/WebProgramming',
 u'http://www.openbookproject.net/pybiblio/',
 u'http://twistedmatrix.com/trac/',
 u'http://wiki.python.org/moin/IntegratedDevelopmentEnvironments',
 u'http://www.pentangle.net/python/handbook/',
 u'http://wiki.python.org/moin/TkInter',
 u'http://www.vrplumber.com/py3d.py',
 u'http://sourceforge.net/projects/mysql-python',
 u'http://wiki.python.org/moin/GuiProgramming',
 u'http://www.python.org/about/',
 u'http://www.edgewall.com/trac/',
 u'http://osl.iu.edu/~lums/swc/',
 u'http://www.python.org/community/merchandise/',
 u"http://www.python.org'/psf/",
 u'http://wiki.python.org/moin/WxPython',
 u'http://docs.python.org/3.2/',
 u'http://www.python.org#content-body',
 u'http://www.python.org/getit/',
 u'http://www.python.org/news/',
 u'http://www.python.org/search',
 u'http://www.python.org/community/sigs/current/edu-sig',
 u'http://www.python.org/about/legal',
 u'http://www.timparkin.co.uk/',
 u'http://www.python.org/about/apps',
 u'http://www.turbogears.org/',
 u'http://www.egenix.com/files/python/mxODBC.html',
 u'http://docs.python.org/devguide/',
 u'http://docs.python.org/howto/sockets.html',
 u'http://www.djangoproject.com/',
 u'http://buildbot.net/trac',
 u'http://www.python.org/psf/',
 u'http://www.python.org/doc/',
 u'http://wiki.python.org/moin/Languages',
 u'http://www.xs4all.com/',
 u'http://www.python.org/',
 u'http://wiki.python.org/moin/NumericAndScientific',
 u'http://www.python.org/channews.rdf',
 u'http://www.alobbs.com/pykyra',
 u'http://wiki.python.org/moin/PythonXml',
 u'http://wiki.python.org/moin/PyGtk',
 u'http://www.python.org/ftp/python/2.7.3/python-2.7.3.msi',
 u'http://www.python.org/download/releases/3.2.3/',
 u'http://www.python.org/3kpoll'])

希望对您有所帮助。

【讨论】:

  • 谢谢先生,但我认为它不会继续访问从 results_urls 返回的 url。请帮助我继续访问所有链接,直到我终止程序。
  • 这只会构造唯一的set URL。您必须自己访问这些...
猜你喜欢
  • 1970-01-01
  • 2015-10-16
  • 1970-01-01
  • 2020-12-03
  • 1970-01-01
  • 2016-12-23
  • 2021-09-14
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多