【Question Title】: Stop redundant data scraping in crawler
【Posted】: 2021-03-08 18:19:24
【Question】:

I am using Selenium with Python, but every time I run my script I get redundant data.

for index in range(1, 20):
    try:
        business_el = WebDriverWait(driver, 100).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="pane"]/div/div[1]/div/div/div[5]/div[1]/div[1]'.format(index))))  
        business_el.click()
        time.sleep(5)

        tree = html.fromstring(driver.page_source)
        title = get_data(tree, '//div[@role="main"]//h1[contains(@class, "section-hero-header-title-title ")]/span/text()')
        phone_number = get_data(tree, '//button[@data-tooltip="Copy phone number"]/div/div[@aria-hidden="false"]/div/text()')
        website_url = get_data(tree, '//button[@data-tooltip="Open website"]/div/div[@aria-hidden="false"]/div/text()')
        address = get_data(tree, '//button[@data-item-id="address"]/div/div[@aria-hidden="false"]/div/text()')
        ratings = get_data(tree, '//span[@class="section-star-display"]/text()')
        reviewsCount = get_data(tree, '//span[@class="section-rating-term"]//button[contains(@aria-label, " reviews")]/text()')
        description = get_data(tree, '//div[@class="section-editorial-quote"]/span/text()')
        try:
            email = parse_email(website_url)
        except Exception as e:
            email = ''

        print(title, phone_number, website_url, email, address, ratings, reviewsCount, description)
        writer.writerow([
            title, 
            phone_number, 
            website_url, 
            email,
            address, 
            ratings, 
            reviewsCount, 
            description, 
        ])

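【Comments】:

  • What is the output?
  • It scrapes a single result multiple times: it prints the first row of the list 10 times before it prints the second row.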

Tags: python selenium web-scraping web-crawler


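【Solution 1】:

driver.page_source may not return everything you expect. It is a "best effort" at collecting the page data and is not necessarily the complete DOM. Essentially, it gives you the content of the initial page load, without all of the dynamically loaded content. If the detail pane has not finished updating when page_source is read, the parsed tree can still hold the previous business's data, which would show up as repeated rows.

See also: https://stackoverflow.com/a/65567070/1387701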
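
One way to reduce the risk of parsing stale content is to poll until the value you read actually differs from the one scraped in the previous iteration, instead of relying on a fixed time.sleep(5). The helper below is a minimal sketch of that idea. `wait_until_changed` and its parameters are hypothetical names, not part of Selenium, and `read_value` stands for any callable that returns the current text (for example, the current title).

```python
import time
from typing import Callable


def wait_until_changed(read_value: Callable[[], str], previous: str,
                       timeout: float = 10.0, interval: float = 0.25) -> str:
    """Poll read_value() until it returns a non-empty value that differs
    from `previous`, or raise TimeoutError once `timeout` seconds pass.

    Hypothetical helper for illustration; not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        # Accept only a non-empty value that is different from the last one.
        if value and value != previous:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"value did not change from {previous!r}")
        time.sleep(interval)


# Stand-in for a page whose title updates on the third read.
readings = iter(["Cafe A", "Cafe A", "Cafe B"])
new_title = wait_until_changed(lambda: next(readings), "Cafe A",
                               timeout=2.0, interval=0.01)
print(new_title)
```

In the question's loop, `read_value` could be something like `lambda: driver.find_element(By.TAG_NAME, "h1").text`, assuming the title lives in an `h1`; that locator is an assumption. Note that two consecutive businesses with the same name would make this check time out, so comparing a more distinctive field may be safer.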
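
Separately, it may be worth checking the locator in the question's loop: `.format(index)` is called on an XPath string that contains no `{}` placeholder, so the same XPath is produced for every index and the same element is clicked each time. The snippet below demonstrates this with the XPath from the question, then shows a template with a placeholder. The second XPath is purely hypothetical and is not taken from the real page; where the index belongs depends on the actual markup.

```python
# XPath copied from the question: note that it has no "{}" placeholder.
xpath = '//*[@id="pane"]/div/div[1]/div/div/div[5]/div[1]/div[1]'

# str.format() silently ignores unused positional arguments, so all
# 19 iterations produce one and the same locator.
locators = {xpath.format(index) for index in range(1, 20)}
print(len(locators))

# Hypothetical template for illustration only: the class name and the
# position of "{}" are assumptions, not the page's real structure.
template = '(//div[@class="result"])[{}]'
indexed = [template.format(index) for index in range(1, 4)]
print(indexed)
```

With a placeholder in the template, each iteration targets a different element, which is what the loop over `index` appears to intend.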
