【Question title】: xpath results different from scrapy and browser console
【Posted】: 2017-09-12 20:14:59
【Question】:

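I am using selenium and PhantomJS to collect professors' contact information from a university web page (not for malicious purposes).

For testing purposes, assume kw.txt is a file containing only two last names, such as

max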

import scrapy
from selenium import webdriver

from universities.items import UniversitiesItem

class iupui(scrapy.Spider):
    name = 'iupui'
    allowed_domains = ['iupui.com']
    start_urls = ['http://iupuijags.com/staff.aspx']

    def __init__(self):
        self.last_name = ''

    def parse(self, response):
        with open('kw.txt') as file_object:
            last_names = file_object.readlines()

        for ln in last_names:
            #driver = webdriver.PhantomJS("C:\\Users\yashi\AppData\Roaming\Python\Python36\Scripts\phantomjs.exe")
            driver = webdriver.Chrome('C:\\Users\yashi\AppData\Local\Programs\Python\Python36\chromedriver.exe')
            driver.set_window_size(1120, 550)
            driver.get('http://iupuijags.com/staff.aspx')

            kw_search = driver.find_element_by_id('ctl00_cplhMainContent_txtSearch')
            search = driver.find_element_by_id('ctl00_cplhMainContent_btnSearch')

            self.last_name = ln.strip()
            kw_search.send_keys(self.last_name)
            search.click()

            item = UniversitiesItem()
            results = response.xpath('//table[@class="default_dgrd staff_dgrd"]//tr[contains(@class,"default_dgrd_item '
                                    'staff_dgrd_item") or contains(@class, "default_dgrd_alt staff_dgrd_alt")]')
            for result in results:
                full_name = result.xpath('./td[@class="staff_dgrd_fullname"]/a/text()').extract_first()
                print(full_name)
                if self.last_name in full_name.split():
                    item['full_name'] = full_name
                    email = result.xpath('./td[@class="staff_dgrd_staff_email"]/a/href').extract_first()
                    if email is not None:
                        item['email'] = email[7:]
                    else:
                        item['email'] = ''
                    item['phone'] = result.xpath('./td[@class="staff_dgrd_staff_phone"]/text()').extract_first()
                yield item
            driver.close()

However, the result is always the same bunch of names, like this:

2017-09-12 15:27:13 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Dr. Roderick Perry
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Gail Barksdale
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
John Rasmussen
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Jared Chasey
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Denise O'Grady
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Ed Holdaway
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}

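The length of the results is the same on every iteration.

When I run the same xpath in the browser console, it returns something different (screenshot: console result).

I really can't figure out what the problem is.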

【Discussion】:

    Tags: selenium xpath scrapy web-crawler


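    【Solution 1】:

    There are a few issues here.

    • You are not using the response from the selenium code. You browse
      the page, but then do nothing with its page source.

    • Next, you yield the item even when no match is found, hence the
      blank items.

    • You also create the item outside the loop when it should be inside.

    • The comparison you make is case-sensitive. You check for max, but
      the results contain Max, so the match is missed.

    • The @ is also missing from href in the email xpath.

    Below is the fixed version: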
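The matching and yielding points in this list can be sketched without Scrapy or Selenium. This is only an illustration, not the spider itself: `matching_names` is a hypothetical helper, plain dicts stand in for `UniversitiesItem`, and the names are taken from the log output in the question.

```python
def matching_names(last_name, full_names):
    """Yield an item only for full names that contain last_name as a word."""
    # Lines read from a file keep their trailing newline, so strip first
    wanted = last_name.strip().lower()
    for full_name in full_names:
        if full_name is None:
            # extract_first() returns None when the xpath matches nothing
            continue
        if wanted in full_name.lower().split():
            # Create the item inside the loop, and only after a match
            yield {"full_name": full_name}


names = ["Dr. Roderick Perry", "Gail Barksdale", "John Rasmussen"]
print(list(matching_names("perry\n", names)))
# [{'full_name': 'Dr. Roderick Perry'}]
```

Because the item is built and yielded inside the `if`, a keyword with no match produces no items at all instead of a run of empty `{}` entries.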

    class iupui(scrapy.Spider):
        name = 'iupui'
        allowed_domains = ['iupui.com']
        start_urls = ['http://iupuijags.com/staff.aspx']
    
        # def __init__(self):
        #     self.last_name = ''
    
        def parse(self, response):
            # with open('kw.txt') as file_object:
            #     last_names = file_object.readlines()
            last_names = ["max"]
            for ln in last_names:
                #driver = webdriver.PhantomJS("C:\\Users\yashi\AppData\Roaming\Python\Python36\Scripts\phantomjs.exe")
                driver = webdriver.Chrome()
                driver.set_window_size(1120, 550)
                driver.get('http://iupuijags.com/staff.aspx')
    
                kw_search = driver.find_element_by_id('ctl00_cplhMainContent_txtSearch')
                search = driver.find_element_by_id('ctl00_cplhMainContent_btnSearch')
    
                self.last_name = ln.strip()
                kw_search.send_keys(self.last_name)
                search.click()
    
                res = response.replace(body=driver.page_source)
    
    
                results = res.xpath('//table[@class="default_dgrd staff_dgrd"]//tr[contains(@class,"default_dgrd_item '
                                        'staff_dgrd_item") or contains(@class, "default_dgrd_alt staff_dgrd_alt")]')
                for result in results:
                    full_name = result.xpath('./td[@class="staff_dgrd_fullname"]/a/text()').extract_first()
                    print(full_name)
                    if self.last_name.lower() in full_name.lower().split():
                        item = UniversitiesItem()
    
                        item['full_name'] = full_name
                        email = result.xpath('./td[@class="staff_dgrd_staff_email"]/a/@href').extract_first()
                        if email is not None:
                            item['email'] = email[7:]
                        else:
                            item['email'] = ''
                        item['phone'] = result.xpath('./td[@class="staff_dgrd_staff_phone"]/text()').extract_first()
                        yield item
                driver.close()
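The `@href` fix and the `email[7:]` slice can be looked at in isolation. The sketch below uses the standard library's `xml.etree.ElementTree` instead of Scrapy selectors, so attribute access is written as `.get("href")` rather than `/@href`. The table cell and the address in it are made up for illustration, and `strip_mailto` is a hypothetical helper that removes the prefix only when it is actually present.

```python
import xml.etree.ElementTree as ET

# Hypothetical table cell, shaped like the markup the xpath expressions expect
cell = ET.fromstring(
    '<td class="staff_dgrd_staff_email">'
    '<a href="mailto:someone@example.com">someone@example.com</a>'
    '</td>'
)

# "a/href" asks for a child element named <href>, which does not exist
print(cell.find("./a/href"))  # None

# The address lives in the href attribute (a/@href in full XPath)
href = cell.find("./a").get("href")


def strip_mailto(href):
    """Return the address without a leading 'mailto:', or '' if href is None."""
    if href is None:
        return ""
    prefix = "mailto:"
    if href.lower().startswith(prefix):
        return href[len(prefix):]
    return href


print(strip_mailto(href))  # someone@example.com
```

Checking for the prefix avoids cutting seven characters off a link that is not a `mailto:` URL, which the fixed-width slice would do silently.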
    

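    【Comments】:

    • Thank you so much. It looks good, although I don't have a machine to test it on right now. res = response.replace(body=driver.page_source): I think this was the key issue. And "max" was just my mistake; it should be "Max" in keyword.txt. I was also opening and closing the browser on every iteration, which was pretty silly. Now I think I can move that out of the loop and keep the browser open until the whole crawl is done.
    • It works! Thanks! Would you mind taking a look at another similar question I asked? I think I need to replace the response with something else there too. stackoverflow.com/questions/46125667/…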
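The idea from the first comment, starting the browser once instead of once per keyword, might look roughly like the sketch below. It is an assumption about how the loop could be restructured, not code from the thread: `search_all` and `FakeDriver` are hypothetical names, the form-filling steps are left out, and the fake driver exists only so the example runs without a browser. A real Selenium driver exposes the same `get`, `page_source`, and `quit` members, so `webdriver.Chrome` could be passed as `make_driver`.

```python
def search_all(last_names, make_driver, url):
    """Load the page once per name in a single browser session."""
    driver = make_driver()  # start the browser once, before the loop
    pages = []
    try:
        for last_name in last_names:
            driver.get(url)
            # (typing last_name into the search box and clicking would go here)
            pages.append(driver.page_source)
    finally:
        driver.quit()  # shut the browser down even if a search fails
    return pages


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""
    started = 0

    def __init__(self):
        FakeDriver.started += 1
        self.page_source = ""
        self.closed = False

    def get(self, url):
        self.page_source = "<html>%s</html>" % url

    def quit(self):
        self.closed = True


pages = search_all(["Max", "Perry"], FakeDriver, "http://iupuijags.com/staff.aspx")
print(FakeDriver.started, len(pages))  # 1 2
```

The `try`/`finally` keeps a failed search from leaving a stray browser process behind.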