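【Question title】: Python web scraper, same link with different text, counting
【Posted】: 2013-08-23 11:59:39
【Question】:

I built a web scraper with Python and a few of its libraries. It opens a given site and collects every link on it, together with each link's text. I have filtered the results so that only the external links on that site are printed.

The code looks like this: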

import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
from urlparse import urlsplit
from publicsuffix import PublicSuffixList

link = "http://www.ananda-pur.de/23.html"

newesturlDict = {}
baseAdrInsArray = []



br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(link, timeout=10)


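psl = PublicSuffixList()  # build the suffix list once instead of once per link

for linkins in br.links():
    # Resolve relative hrefs against the page URL.
    newesturl = urlparse.urljoin(linkins.base_url, linkins.url)
    linkTxt = linkins.text
    baseAdrIns = linkins.base_url

    if baseAdrIns not in baseAdrInsArray:
        baseAdrInsArray.append(baseAdrIns)

    # Domain of the page itself, as reported by publicsuffix.
    netLocation = urlsplit(baseAdrIns)
    publicAddress = psl.get_public_suffix(netLocation.netloc)

    # Keep only links that point outside the page's own domain.
    if publicAddress not in newesturl:
        # The keys are (url, text) tuples, so these tests on the bare URL
        # never match and every (url, text) pair ends up with a count of 1.
        if newesturl not in newesturlDict:
            newesturlDict[newesturl,linkTxt] = 1
        if newesturl in newesturlDict:
            newesturlDict[newesturl,linkTxt] += 1

# Sort by count, highest first, then print one line per (url, text) pair.
newesturlCount = sorted(newesturlDict.items(),key=lambda(k,v):(v,k),reverse=True)
for newesturlC in newesturlCount:
    print baseAdrInsArray[0]," - ",newesturlC[0],"- count: ", newesturlC[1]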

It then prints results like this:

http://www.ananda-pur.de/23.html  -  ('http://www.yogibhajan.com/',  'http://www.yogibhajan.com') - count:  1
http://www.ananda-pur.de/23.html  -  ('http://www.kundalini-yoga-zentrum-berlin.de/', 'http://www.kundalini-yoga-zentrum-berlin.de') - count:  1
http://www.ananda-pur.de/23.html  -  ('http://www.kriteachings.org/', 'http://www.sat-nam-rasayan.de') - count:  1
http://www.ananda-pur.de/23.html  -  ('http://www.kriteachings.org/', 'http://www.kriteachings.org') - count:  1
http://www.ananda-pur.de/23.html  -  ('http://www.kriteachings.org/', 'http://www.gurudevsnr.com') - count:  1
http://www.ananda-pur.de/23.html  -  ('http://www.kriteachings.org/', 'http://www.3ho.de') - count:  1

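My problem is with identical links that have different text. In the printed example, the page contains 4 links to http://www.kriteachings.org/, but as you can see, each of those 4 links has a different text: the first is http://www.sat-nam-rasayan.de, the second is http://www.kriteachings.org, the third is http://www.gurudevsnr.com, and the fourth is http://www.3ho.de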
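A minimal Python 3 sketch of why every line above ends up with a count of 1, assuming the `pairs` list below stands in for the (resolved URL, link text) pairs found on the page. The names `pairs`, `by_pair`, and `by_url` are illustrative and not part of the original script. A dictionary keyed by the `(url, text)` tuple treats each distinct text as its own key, whereas one keyed by the URL alone does not:

```python
from collections import Counter

# Hypothetical stand-in for the (resolved URL, link text) pairs on the page.
pairs = [
    ("http://www.kriteachings.org/", "http://www.sat-nam-rasayan.de"),
    ("http://www.kriteachings.org/", "http://www.kriteachings.org"),
    ("http://www.kriteachings.org/", "http://www.gurudevsnr.com"),
    ("http://www.kriteachings.org/", "http://www.3ho.de"),
]

# Keyed by the (url, text) tuple: four distinct keys, each counted once.
by_pair = Counter(pairs)

# Keyed by the URL alone: a single key, counted four times.
by_url = Counter(url for url, _text in pairs)

print(len(by_pair), max(by_pair.values()))
print(by_url["http://www.kriteachings.org/"])
```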

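I would like the output to show how many times each link occurs on the given page, and where the link text differs, to append that text to the other texts for the same link. For this example, I would like output like this: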

http://www.ananda-pur.de/23.html  -  http://www.yogibhajan.com/ - http://www.yogibhajan.com - count:  1
http://www.ananda-pur.de/23.html  -  http://www.kundalini-yoga-zentrum-berlin.de - http://www.kundalini-yoga-zentrum-berlin.de - count:  1
http://www.ananda-pur.de/23.html  -  http://www.kriteachings.org/ - http://www.sat-nam-rasayan.de, http://www.kriteachings.org, http://www.gurudevsnr.com, http://www.3ho.de  - count:  4

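Explanation:

(The first item is the page, the second is the link found on it, the third is the exact text of that link, and the fourth is the number of times the link occurs on the given page.)

My main problem is that I don't know how to compare, sort, or otherwise tell the program that this is the same link and that the different texts should be appended.

Is something like this possible without too much code? I'm a Python newbie, so I'm a bit lost.

Any help or suggestions are welcome.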

【Comments】:

    Tags: python hyperlink screen-scraping


    【Solution 1】:

    Collect the links in a dictionary, gathering the link texts and keeping a count for each URL:

    import cookielib
    
    import mechanize
    
    
    base_url = "http://www.ananda-pur.de/23.html"
    
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.set_handle_redirect(True)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    page = br.open(base_url, timeout=10)
    
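    # Map each URL to its occurrence count and the list of texts seen for it.
    links = {}
    for link in br.links():
        if link.url not in links:
            # First time this URL is seen: start the count and the text list.
            links[link.url] = {'count': 1, 'texts': [link.text]}
        else:
            # Same URL again: bump the count and remember the new text.
            links[link.url]['count'] += 1
            links[link.url]['texts'].append(link.text)
    
    # printing
    for link, data in links.iteritems():
        print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])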
    

    Prints:

    http://www.ananda-pur.de/23.html - index.html - Zadekstr 11,12351 Berlin, - 2
    http://www.ananda-pur.de/23.html - 28.html - Das Team - 1
    http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - 1
    http://www.ananda-pur.de/23.html - 24.html - Kontakt - 1
    http://www.ananda-pur.de/23.html - 25.html - Impressum - 1
    http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.kriteachings.org,http://www.gurudevsnr.com,http://www.sat-nam-rasayan.de,http://www.3ho.de - 4
    http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de/ - http://www.kundalini-yoga-zentrum-berlin.de - 1
    http://www.ananda-pur.de/23.html - 3.html - Ergo Oranien 155 - 1
    http://www.ananda-pur.de/23.html - 2.html - Physio Bänsch 36 - 1
    http://www.ananda-pur.de/23.html - 13.html - Stellenangebote - 1
    http://www.ananda-pur.de/23.html - 23.html - Links - 1
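
The same grouping idea can be tried without mechanize. The following is a minimal Python 3 sketch, assuming the (URL, text) pairs have already been extracted from the page. The `group_links` helper and the `sample` data are hypothetical; the sample mirrors the external links in the output above. It uses `collections.defaultdict` in place of the explicit `if`/`else`, and derives the count from the number of collected texts:

```python
from collections import defaultdict


def group_links(pairs):
    """Map each URL to the list of link texts it appears with, in page order."""
    grouped = defaultdict(list)
    for url, text in pairs:
        grouped[url].append(text)
    return dict(grouped)


# Hypothetical sample mirroring the external links in the output above.
sample = [
    ("http://www.yogibhajan.com/", "http://www.yogibhajan.com"),
    ("http://www.kriteachings.org/", "http://www.kriteachings.org"),
    ("http://www.kriteachings.org/", "http://www.gurudevsnr.com"),
    ("http://www.kriteachings.org/", "http://www.sat-nam-rasayan.de"),
    ("http://www.kriteachings.org/", "http://www.3ho.de"),
]

base_url = "http://www.ananda-pur.de/23.html"
grouped = group_links(sample)

# Most frequent links first; the count is the number of texts collected.
ranked = sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True)
for url, texts in ranked:
    print("%s - %s - %s - count: %d" % (base_url, url, ", ".join(texts), len(texts)))
```

Storing only the list of texts keeps the count and the texts from drifting apart, at the cost of computing `len(texts)` when printing.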
    

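    【Comments】:

    • Yes, that looks like the solution. I just need to ignore the internal links, but that's not a problem; I'm sure I can work your example into my code. Will try it right away.
    • Sure, you can check whether link.url.startswith('http://') and continue the loop if it doesn't.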
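
The `startswith('http://')` check treats every absolute link as external, including absolute links back to the same site. A slightly stricter variant, sketched here in Python 3 with only the standard library, resolves the href against the page URL and compares host names. The `is_external` helper is a hypothetical name, and comparing exact host names is an assumption: it is cruder than the public-suffix comparison used in the question, because a sibling subdomain would count as external.

```python
from urllib.parse import urljoin, urlsplit


def is_external(base_url, href):
    """Return True if href, resolved against base_url, is http(s) on another host."""
    absolute = urljoin(base_url, href)
    target = urlsplit(absolute)
    # Skip mailto:, javascript: and other non-web schemes.
    if target.scheme not in ("http", "https"):
        return False
    # Exact host comparison (an assumption; subdomains count as external).
    return target.hostname != urlsplit(base_url).hostname


page = "http://www.ananda-pur.de/23.html"
for href in ["24.html", "http://www.kriteachings.org/", "http://www.ananda-pur.de/index.html"]:
    print(href, is_external(page, href))
```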