【Question title】: Crawl Quora Q&As using BeautifulSoup
【Posted】: 2020-08-04 16:04:03
【Question description】:

The code I use to scrape a Quora question is as follows:

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.quora.com/What-is-the-best-workout-1"

page = requests.get(URL)

soup = BeautifulSoup(page.text, "html.parser")

print(soup.find_all("span", {"class": "q-box qu-userSelect--text"}))

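The result is an empty list.

The problem is that page.text contains different source code from what I see when I inspect the element on Quora.

Instead, it contains the following text, which has no <span> elements:

This is the code I get when using Inspect Element: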
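A mismatch like this usually means the page builds its content with JavaScript after loading, so the raw response that requests receives is only a shell. The sketch below is one way to confirm that by counting matching tags in an HTML string. It uses only the standard library's html.parser, and the helper `count_tags_with_classes` and the two sample strings are made up for illustration (they are not real Quora markup).

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags with a given name that carry all required classes."""

    def __init__(self, tag, required_classes):
        super().__init__()
        self.tag = tag
        self.required = set(required_classes)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or empty.
        classes = set((dict(attrs).get("class") or "").split())
        if self.required <= classes:
            self.count += 1


def count_tags_with_classes(html, tag, required_classes):
    """Hypothetical helper: how many <tag> elements have every class listed?"""
    counter = ClassCounter(tag, required_classes)
    counter.feed(html)
    return counter.count


# Stand-in for what requests.get(URL).text might look like on a script-driven page.
static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
# Stand-in for the DOM that Inspect Element shows after the scripts have run.
rendered_html = '<div id="root"><span class="q-box qu-userSelect--text">What is the best workout?</span></div>'

wanted = ["q-box", "qu-userSelect--text"]
print(count_tags_with_classes(static_html, "span", wanted))
print(count_tags_with_classes(rendered_html, "span", wanted))
```

Running the same check on page.text from the question should show whether the spans are present in the raw response at all.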

【Comments】:

    Tags: python web-scraping beautifulsoup quora


    【Solution 1】:

    Try:

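    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.service import Service
    import time

    # Selenium 4 style. On Selenium 3, pass executable_path=... to webdriver.Firefox
    # and use driver.find_elements_by_css_selector(...) instead.
    driver = webdriver.Firefox(service=Service(executable_path='c:/program/geckodriver.exe'))

    URL = "https://www.quora.com/What-is-the-best-workout-1"
    driver.get(URL)

    PAUSE_TIME = 2  # seconds to wait for new content after each scroll

    # Keep scrolling to the bottom until the page height stops growing.
    lh = driver.execute_script("return document.body.scrollHeight")

    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(PAUSE_TIME)
        nh = driver.execute_script("return document.body.scrollHeight")
        if nh == lh:
            break
        lh = nh

    # Collect the rendered spans that the question was looking for.
    spans = driver.find_elements(By.CSS_SELECTOR, 'span.q-box.qu-userSelect--text')
    for span in spans:
        print(span.text)
        print('-' * 80)

    driver.quit()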
    

    Prints:

    What is the best workout?
    --------------------------------------------------------------------------------
    The best workout is the one you don't skip.
    Look, you can discuss sets and reps, crossfit and powerlifting, diet and supplements endlessly. And there is some value in it, if only just for entertainment sometimes (especially on the internet). But let's just get one thing straight here - if you are doing any kind of workout then it's going to have a greater impact than if you weren't. Simple as that.
    Of course there are caveats. You don't want to get hurt, so they can pretty much all be summed up into one commandment: Thou shalt not be an idiot. Getting under a bar loaded with 495 lbs and squattin
    --------------------------------------------------------------------------------
    
    --------------------------------------------------------------------------------
    What are some at-home workouts?
    --------------------------------------------------------------------------------
    Gyms are closed here because of the Coronavirus. What are your top 3 bodyweight exercises for building muscle?
    --------------------------------------------------------------------------------
    What is the best body weight workout routine?
    --------------------------------------------------------------------------------
    

    And so on...

    I'm not sure q-box qu-userSelect--text is what you really want, but it is what you asked for...

    Note on selenium: you need both selenium and geckodriver, and in this code geckodriver is loaded from c:/program/geckodriver.exe
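The scroll loop in this answer has no upper bound, so on a page that keeps loading new content it could run for a very long time. Below is a minimal sketch of the same loop wrapped in a function with a safety cap. The names `scroll_until_stable` and `max_rounds` are introduced here for illustration and are not part of the original answer. The function only needs an object with an `execute_script` method, which a Selenium driver provides.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed. max_rounds is a safety cap
    (an assumption of this sketch) for pages that never stop loading.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_rounds  # cap reached while the page was still growing
```

With a live driver this would be called as `scroll_until_stable(driver, pause=2)` right after `driver.get(URL)`, in place of the `while True` loop.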

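    【Comments】:

    • Nice work, I figured I would have to use Selenium. Could you change the code to get the full answers? It currently prints the short version (without clicking the "read more" button).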
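One possible approach to the request above is to click every "read more" control before reading the span text. The sketch below shows only the clicking step. `click_all` is a hypothetical helper, and the selector for Quora's button is not given anywhere in this thread, so it is left as a placeholder to be found with Inspect Element.

```python
def click_all(elements):
    """Click every element in the iterable, skipping any that fail.

    Returns the number of successful clicks.
    """
    clicked = 0
    for element in elements:
        try:
            element.click()
            clicked += 1
        except Exception:
            # Deliberately broad in this sketch: an element may have gone
            # stale, be hidden, or be covered by another element.
            continue
    return clicked


# Usage with a live Selenium 4 driver (MORE_SELECTOR is a placeholder,
# not a real Quora selector):
#
#   from selenium.webdriver.common.by import By
#   MORE_SELECTOR = "..."  # look up the actual selector with Inspect Element
#   click_all(driver.find_elements(By.CSS_SELECTOR, MORE_SELECTOR))
#   for span in driver.find_elements(By.CSS_SELECTOR, "span.q-box.qu-userSelect--text"):
#       print(span.text)
```

Clicking may trigger more loading, so a short pause before reading the text might be needed. This has not been tested against Quora.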