【Question title】: Scrapy rule SgmlLinkExtractor not working
【Posted】: 2013-07-16 22:40:17
【Question description】:

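How can I get my rule to work in my CrawlSpider so that it follows links? I added this rule, but it does not work: nothing is displayed, and I do not get any errors either. In the rule code I added comments showing what the URLs I am looking for should look like.

Rule #1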

    Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item', follow=True)
    # looking for domains like in my rule:
    # http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
    # http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits

I also tried this rule, but it did not work either; nothing happened and there were no errors. Rule #2:

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/company\/[0-9][0-9][0-9][0-9]\?',)), callback='parse_item'),
    )

Code

class LinkedPySpider(CrawlSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results"]

    Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item', follow=True)
    # looking for domains like in my rule:
    # http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
    # http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # def init_request(self):
    #     """This function is called before crawling starts."""
    #     return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Generate a login request.
        return FormRequest.from_response(response,
                formdata={'session_key': 'yescobar2012@gmail.com', 'session_password': 'yescobar01'},
                callback=self.check_login_response)

    def check_login_response(self, response):
        # Check the response returned by a login request to see if we are successfully logged in.
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('Hi, this is an response page! %s' % response.url)

            return Request(url='http://www.linkedin.com/csearch/results')

        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedconvItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items

Output

C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-15 12:05:15-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Spider opened
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG:


    Successfully logged in. Let's start crawling!



2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.linkedin.com/nhome/
2013-07-15 12:05:18-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2171,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 87904,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 15, 17, 5, 18, 941000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 7, 15, 17, 5, 15, 820000)}
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Spider closed (finished)

【Question discussion】:

    Tags: python html scrapy


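    【Answer 1】:

    SgmlLinkExtractor uses re to find matches in link URLs.

    Whatever you pass in as allow= goes through re.compile(), and then every link in the page is checked with _matches, which calls .search() on the compiled regular expressions: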

        _matches = lambda url, regexs: any((r.search(url) for r in regexs))
    

    https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py
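
The matching step described above can be approximated outside of Scrapy with nothing but the standard library. This is only a sketch of the idea, not Scrapy's actual code: the helper names compile_allow and matches are hypothetical, and the second URL is the search page from the question, used here as a non-matching example.

```python
import re


def compile_allow(allow):
    """Accept one pattern or an iterable of patterns and compile each of them."""
    if isinstance(allow, str):
        allow = (allow,)
    return [re.compile(pattern) for pattern in allow]


def matches(url, regexes):
    """True if any compiled pattern is found anywhere in the URL (search, not match)."""
    return any(r.search(url) for r in regexes)


allow = compile_allow(r'\/company\/.*\?goback=.*')
urls = [
    'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits',
    'http://www.linkedin.com/csearch/results',
]

# Keep only the URLs that at least one allow pattern finds.
allowed = [u for u in urls if matches(u, allow)]
print(allowed)
```

Because search() looks for the pattern anywhere in the string, a pattern that starts with \/company\/ can still be found in a full URL that begins with http://.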

    When I check your regular expressions in the Python shell, they both work (they return an SRE_Match for URLs 1 and 2; I added a failing regular expression for comparison):

    >>> import re
    >>> url1 = 'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
    >>> url2 = 'http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
    >>> regex1 = re.compile(r'\/company\/.*\?goback=.*')
    >>> regex2 = re.compile('\/company\/[0-9][0-9][0-9][0-9]\?')
    >>> regex_fail = re.compile(r'\/company\/.*\?gobackbogus=.*')
    >>> regex1.search(url1)
    <_sre.SRE_Match object at 0xe6c308>
    >>> regex2.search(url1)
    <_sre.SRE_Match object at 0xe6c2a0>
    >>> regex_fail.search(url1)
    >>> regex1.search(url2)
    <_sre.SRE_Match object at 0xe6c308>
    >>> regex2.search(url2)
    <_sre.SRE_Match object at 0xe6c2a0>
    >>> regex_fail.search(url2)
    >>> 
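
If the bare REPL output is hard to read, the same check can be turned into explicit True/False values. A minimal sketch, assuming the three patterns from the session above; the dictionary names are only for illustration. search() returns a match object when the pattern is found and None when it is not, which is why the regex_fail lines above print nothing.

```python
import re

url = 'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'

patterns = {
    'regex1': r'\/company\/.*\?goback=.*',
    'regex2': r'\/company\/[0-9][0-9][0-9][0-9]\?',
    'regex_fail': r'\/company\/.*\?gobackbogus=.*',
}

# A match object means "found"; None means "not found".
results = {name: re.search(p, url) is not None for name, p in patterns.items()}

for name, found in results.items():
    print(name, 'matches' if found else 'does not match')
```

On recent Python 3 versions the match object prints as <re.Match object ...> rather than <_sre.SRE_Match object ...>, but the meaning is the same.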
    

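    To check whether there are any links in the page at all (that is, whether the content is not entirely generated by JavaScript), I would add a very generic Rule that matches every link (set allow=() or do not set allow at all). See http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor

    But in the end, you may be better off using the LinkedIn API for company search: http://developer.linkedin.com/documents/company-search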
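
The same sanity check can also be run on a saved copy of the page without Scrapy. The sketch below uses the standard-library HTMLParser as a stand-in for the link extractor; the HrefCollector class and the sample_html snippet are made up for illustration and are not the real LinkedIn markup.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


# Hypothetical stand-in for the downloaded search results page.
sample_html = """
<ol id="result-set">
  <li><h2><a href="/company/1009?goback=.fcs&amp;trk=ncsrch_hits">A</a></h2></li>
  <li><h2><a href="/company/1033?goback=.fcs&amp;trk=ncsrch_hits">B</a></h2></li>
  <li><a href="/about">About</a></li>
</ol>
"""

collector = HrefCollector()
collector.feed(sample_html)

pattern = re.compile(r'\/company\/.*\?goback=.*')
matching = [h for h in collector.hrefs if pattern.search(h)]

print(len(collector.hrefs), 'links found,', len(matching), 'match the rule')
```

If the list of links comes back empty for the real response body, the links are probably inserted by JavaScript, and no allow pattern will help.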

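    【Discussion】:

    • Hey Paul, I do not understand what it means when you type regex1.search(url1) and it returns "<_sre.SRE_Match object at>", and you say the first expression works. So the second expression does not work? I see the same thing returned for regex1 and regex2. The reason I am using Scrapy is that I have to extract information from our intranet site, and I am using the LinkedIn site as practice: nobody outside my company can access the intranet site, so I would not be able to post questions about the problems I face when extracting data from it.
    • In fact they both work. If you get an SRE_Match, it means the pattern was detected. (I will edit the answer.)
    • If this is practice before you tackle intranet pages, LinkedIn may not be the easiest site to scrape. Have you looked at how dmoz.org is used in the Scrapy tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html)? Converting the tutorial to use CrawlSpider could be interesting.
    • Yes, I have used dmoz (I believe the dmoz example does not follow links), but I was looking for something that has login authentication and uses CrawlSpider to follow links, and I did not find one.
    • Indeed. Have you tried using a generic Rule(SgmlLinkExtractor()) to see whether there are any links in your HTML?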