【Question title】: Scrapy is not able to scrape the email field from a website
【Posted】: 2020-04-06 15:27:51
【Question】:

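I am trying to scrape a website for its data, and JavaScript running in the browser seems to be preventing me from getting the email addresses.

Can anyone tell me how to get the email addresses?

Website: https://directory.easternuc.com/publicDirectory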

from scrapy import cmdline
import scrapy
from tutorial.items import TutorialItem


class DemoSpider(scrapy.Spider):
    name = "DemoSpider"

    def start_requests(self):
        urls = []
        for page in range(1, 3):
            url = "https://directory.easternuc.com/publicDirectory?page=%s" %page
            urls.append(url)

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        item = TutorialItem()
        index = 1
        for _ in response.selector.xpath("//tr/td/h4/text()").getall():
            item['name'] = response.selector.xpath("//tr[%s]/td/h4/text()" % index).get()
            item['phone'] = response.selector.xpath("//tr[%s]/td[2]/text()" % index).get()
            item['mobile'] = response.selector.xpath("//tr[%s]/td[3]/text()" % index).get()
            item['email'] = response.selector.xpath("//tr[%s]/td[4]/text()" % index).get()
            index += 1
            yield item

【Question discussion】:

    Tags: python web-scraping scrapy


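    【Solution 1】:

    This happens because the email text is not a direct child of the td tag. The expression td[4]/text() selects only text nodes that sit directly inside the cell, so it returns nothing when the address is nested inside a descendant element. Using td[4]//text() selects the text of all descendants as well.

    Please try this code: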

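    def parse(self, response):
        # Iterate row by row so each field is read relative to its own <tr>.
        for tr in response.xpath("//table/tr"):
            item = TutorialItem()  # create a fresh item for every row
            item['name'] = tr.xpath("./td[1]/h4/text()").get()
            item['phone'] = tr.xpath("./td[2]/text()").get()
            item['mobile'] = tr.xpath("./td[3]/text()").get()
            # //text() collects text from every descendant of the cell,
            # not only its direct children; join the pieces into one string.
            item['email'] = "".join(tr.xpath("./td[4]//text()").getall()).strip()
            yield item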
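
The difference between direct-child text and descendant text can be reproduced without Scrapy. The sketch below uses only the standard library's xml.etree.ElementTree as a stand-in for the XPath selectors, so it illustrates the idea rather than the Scrapy API. The row markup, the sample values, and the helper names direct_text and descendant_text are assumptions made for illustration; the real page may nest the address differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical row markup (an assumption, not copied from the real site):
# the email address sits inside an <a> element, not directly inside <td>.
ROW = (
    "<tr>"
    "<td><h4>Jane Doe</h4></td>"
    "<td>555-0100</td>"
    "<td>555-0101</td>"
    "<td><a href='mailto:jane@example.com'>jane@example.com</a></td>"
    "</tr>"
)


def direct_text(cell):
    """Text that sits directly inside the cell, before any child element.

    Roughly what the first node of td/text() would give in XPath.
    """
    return cell.text


def descendant_text(cell):
    """Text of the cell and all of its descendants, joined into one string.

    Roughly what joining the results of td//text() gives in XPath.
    """
    return "".join(cell.itertext()).strip()


tr = ET.fromstring(ROW)
email_td = tr.findall("td")[3]  # fourth cell, i.e. td[4] in XPath terms

print(direct_text(email_td))      # None: the cell has no text of its own
print(descendant_text(email_td))  # jane@example.com
```

If the address were split across several nested elements, joining the descendant text would still reassemble it, which is why the solution calls getall() and joins the pieces rather than calling get().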
    
