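【Question title】: Scrape ajax pages
【Posted】: 2022-07-06 03:56:08
【Question description】:

I don't know how to scrape an AJAX page. The site has no pagination; clicking the "Load More Results" button loads more entries. This is the page link: https://aaos22.mapyourshow.com/8_0/explore/exhibitor-gallery.cfm?featured=false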

import scrapy
from scrapy.http import Request
from scrapy_selenium import SeleniumRequest

class TestSpider(scrapy.Spider):
    name = 'test'
    
    
    def start_requests(self):
        yield SeleniumRequest(
            url="https://aaos22.mapyourshow.com/8_0/explore/exhibitor-gallery.cfm?featured=false",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )
        
    def parse(self, response):
        # Match on a class substring: the original exact @class comparison
        # relied on embedded line breaks in the attribute, which is brittle.
        books = response.xpath("//h3[contains(@class, 'card-Title')]//a//@href").getall()

        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        # .get() returns None when nothing matches, so fall back to an
        # empty string before calling .strip().
        title = response.css(".mr3-m::text").get()
        address = (response.css(".showcase-address::text").get() or "").strip()
        website = (response.xpath("//li[@class='dib  ml3  mr3']//a[starts-with(@href, 'http')]/@href").get() or "").strip()
        phone = (response.xpath("//li[@class='dib  ml3  mr3']//span[contains(text(), 'Phone:')]/following-sibling::text()").get() or "").strip().replace("-", "")

        yield {
            'title': title,
            'address': address,
            'website': website,
            'phone': phone,
        }
    
    

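【Discussion】:

  • So where exactly are you stuck? Clicking the Load More Results button?
  • Yes, I'm stuck at the Load More Results button. Clicking it shows more results, but I don't know how to scrape the data from them.
  • Which information do you want to scrape?
  • title, address, website, phone
  • I also don't see title, address, website, or phone being scraped in your code attempt.

Tags: python selenium web-scraping scrapy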


【Solution 1】:

I didn't use your code; I did it my own way instead. I hope this helps anyway:

import requests

headers = {
    'x-requested-with': 'XMLHttpRequest',
}

params = {
    'action': 'search',
    'searchtype': 'exhibitorgallery',
    'searchsize': '200',  # don't increase this too much (increase the start parameter instead and send a new request after some delay)
    'start': '0',
}

response = requests.get('https://aaos22.mapyourshow.com/8_0/ajax/remote-proxy.cfm', params=params, headers=headers)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page

data = response.json()

all_sites = []
for exs in data["DATA"]["results"]["exhibitor"]["hit"]:
    exhid = exs["fields"]["exhid_l"]  # renamed from `id` to avoid shadowing the built-in
    site = f"https://aaos22.mapyourshow.com/8_0/exhibitor/exhibitor-details.cfm?exhid={exhid}"
    all_sites.append(site)

# now scrape all websites **slowly** and get the data you want
for site in all_sites:
    pass

The rest is up to you ;)
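
The comment on `searchsize` suggests raising `start` and sending a new request after a delay, rather than asking for everything at once. A minimal sketch of that loop follows. It assumes that `start` is a record offset (not a page index) and that an empty or short `hit` list marks the last page; neither is confirmed by the site. The helper names `extract_ids`, `iter_detail_urls`, and `fetch_page_http` are hypothetical and introduced only for illustration.

```python
import time
from typing import Callable, Iterator, List

AJAX_URL = "https://aaos22.mapyourshow.com/8_0/ajax/remote-proxy.cfm"
DETAIL_URL = "https://aaos22.mapyourshow.com/8_0/exhibitor/exhibitor-details.cfm?exhid={exhid}"


def extract_ids(payload: dict) -> List[str]:
    """Pull exhibitor ids out of one JSON response; tolerate missing keys."""
    hits = (
        payload.get("DATA", {})
        .get("results", {})
        .get("exhibitor", {})
        .get("hit", [])
    )
    return [hit["fields"]["exhid_l"] for hit in hits]


def iter_detail_urls(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 200,
    delay: float = 1.0,
    max_pages: int = 50,
) -> Iterator[str]:
    """Yield detail-page URLs, advancing `start` by `page_size` each round.

    Assumption: `start` is a record offset, and an empty or short page
    means there is nothing left to fetch.
    """
    start = 0
    for _ in range(max_pages):  # hard cap so a misbehaving endpoint cannot loop forever
        ids = extract_ids(fetch_page(start, page_size))
        if not ids:
            break
        for exhid in ids:
            yield DETAIL_URL.format(exhid=exhid)
        if len(ids) < page_size:
            break
        start += page_size
        time.sleep(delay)  # be polite between requests


def fetch_page_http(start: int, page_size: int) -> dict:
    """Fetch one page of results with the same parameters as the answer above."""
    import requests  # imported lazily so the helpers above work without it

    params = {
        "action": "search",
        "searchtype": "exhibitorgallery",
        "searchsize": str(page_size),
        "start": str(start),
    }
    headers = {"x-requested-with": "XMLHttpRequest"}
    response = requests.get(AJAX_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()
```

Calling `list(iter_detail_urls(fetch_page_http))` would collect every detail-page URL under those assumptions. Passing the fetch function as an argument keeps the paging logic separate from the HTTP call, so it can be checked against canned responses before touching the live site.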

【Discussion】:
