【Question title】:How to extract the title and href attributes from the questions on the reddit.com search page using Selenium Python
【Posted】:2019-03-07 23:23:35
【Question】:

I want to scrape the links and titles of all the questions on the page https://www.reddit.com/search?q=Expiration&type=link&sort=new. Each element has the following structure:

<a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">
    <h2 class="s1okktje-0 cDxKta">
        <span style="font-weight:normal">Calculating Expiration Dates - Previous Solution No Longer Works</span>
    </h2>
</a>

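I get the questions with questions = driver.find_elements_by_xpath('//a[@data-click-id="body"]') and then iterate over them in a for loop. I can use question.get_attribute('href') to get the link.

However, I don't know how to extract the title inside the span, starting from question.

Does anyone know how to do this?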

【Comments】:

    Tags: python selenium selenium-webdriver webdriver webdriverwait


    【Solution 1】:

    With Selenium:

    question.find_element_by_xpath('./h2/span').text
    

    Inside the for loop, this returns the text of the span element nested under each question.
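
The loop described in the question, combined with the relative XPath lookup, can be sketched as one small function. This is a minimal sketch, not the answerer's code: the name collect_questions is hypothetical, and the locator strategy is passed as the plain string "xpath" (the value that selenium's By.XPATH constant holds) so that the function also runs on Selenium 4.3.0 and later, where the find_element_by_* helpers were removed.

```python
def collect_questions(driver):
    """Return a list of (title, href) pairs, one per search result link.

    `driver` is assumed to be a Selenium WebDriver that has already
    loaded the reddit search page.
    """
    results = []
    # "xpath" is the locator-strategy string behind selenium's By.XPATH.
    for question in driver.find_elements("xpath", '//a[@data-click-id="body"]'):
        # The leading "./" keeps the lookup relative to the current <a> element.
        span = question.find_element("xpath", "./h2/span")
        results.append((span.text, question.get_attribute("href")))
    return results
```

After driver.get(...) has loaded the search page, calling collect_questions(driver) should yield the title and link of each result together, so the two values cannot drift out of step.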

    Using lxml:

    import requests
    from lxml import html
    
    UA = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'}
    
    page = requests.get('https://www.reddit.com/search?q=Expiration&type=link&sort=new',
                        headers = UA)
    
    tree = html.fromstring(page.content)
    
    questions = tree.xpath('//a[@data-click-id="body"]')
    
    parsed_q = []
    
    for question in questions:
        url = question.xpath('./@href')[0]
        title = question.xpath('./h2/span/text()')[0]
        print("Title: {} --- URL: {}".format(title,url))
        parsed_q.append(tuple([title,url]))
    
    print(parsed_q)
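
The XPath logic above can be checked offline against the snippet from the question, without hitting reddit. The sketch below is an illustration under assumptions: it uses the standard-library xml.etree.ElementTree as a stand-in for lxml, wraps the snippet in a hypothetical <div> root, and relies on the snippet being well-formed. It also joins every text node under the span, because ./h2/span/text()[0] returns only the first one and would truncate a title that contains nested markup such as <em> highlights.

```python
import xml.etree.ElementTree as ET

# The element from the question, wrapped in a <div> so there is a single root.
SNIPPET = """
<div>
  <a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">
    <h2 class="s1okktje-0 cDxKta">
      <span style="font-weight:normal">Calculating Expiration Dates - Previous Solution No Longer Works</span>
    </h2>
  </a>
</div>
"""

def parse_questions(markup):
    """Return (title, href) pairs for every <a data-click-id="body"> element."""
    root = ET.fromstring(markup)
    pairs = []
    for link in root.findall(".//a[@data-click-id='body']"):
        span = link.find("./h2/span")
        if span is None:
            continue  # skip links that do not carry a title span
        # itertext() walks all nested text nodes, so <em> children are included.
        title = "".join(span.itertext()).strip()
        pairs.append((title, link.get("href")))
    return pairs

print(parse_questions(SNIPPET))
```

With lxml itself, calling text_content() on the span element should give the same joined text.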
    

    【Comments】:

    • In Selenium, it should be question.find_elements_by_xpath('./h2/span')[0].get_attribute('innerHTML')
    【Solution 2】:

    To scrape the title and href attributes of all the questions on the webpage, you need to induce WebDriverWait for visibility_of_all_elements_located(). You can use the following solution:

    • Code block:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_argument("--disable-extensions")
      options.add_argument('disable-infobars')
      driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get("https://www.reddit.com/search?q=Expiration&type=link&sort=new")
      elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))
      question_title = [element.get_attribute("innerHTML") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]/h2/span")))]
      question_link = [element.get_attribute("href") for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@data-click-id='body' and @href]")))]
      for i,j in zip(question_title, question_link):
          print("{} question link is {}".format(i, j))
      
    • Console output:

      MeasureUp vouchers for Microsoft technical practice exams question link is https://www.reddit.com/r/IT_CERT_STUDY/comments/ayn846/measureup_vouchers_for_microsoft_technical/
      should i break up with him or not? question link is https://www.reddit.com/r/relationship_advice/comments/ayn2ux/should_i_break_up_with_him_or_not/
      Update on the blown up intervention. Things got worse. He went to jail. I am leaving to get treatment for trauma, PTSD, and codependency. question link is https://www.reddit.com/r/AlAnon/comments/aymg7u/update_on_the_blown_up_intervention_things_got/
      AITA for taking two days to consider a employment offer? question link is https://www.reddit.com/r/AmItheAsshole/comments/aymbkt/aita_for_taking_two_days_to_consider_a_employment/
      Hi, I am trying to find the most current application with which to renew DACA, but the only one I can find on the USCIS has 1/31/19 as the <em style="font-weight:700">expiration</em> date. Is this still a valid form, or will it be rejected? Current permit is to expire in September. question link is https://www.reddit.com/r/DACA/comments/aylocz/hi_i_am_trying_to_find_the_most_current/
      Just like food, humans have <em style="font-weight:700">expiration</em> dates too.. question link is https://www.reddit.com/r/Showerthoughts/comments/ayldhy/just_like_food_humans_have_expiration_dates_too/
      Drug <em style="font-weight:700">Expiration</em> Dates — Do They Mean Anything? - Harvard Health (This should be sticky'd btw) question link is https://www.reddit.com/r/opiates/comments/aykizn/drug_expiration_dates_do_they_mean_anything/
      Here is a scientific study made on how long prescription drugs last after their expiry date. I thought it would be relevant here. question link is https://www.reddit.com/r/preppers/comments/aykhcm/here_is_a_scientific_study_made_on_how_long/
      If poison is past its <em style="font-weight:700">expiration</em> date, is it more poisonous or less poisonous? question link is https://www.reddit.com/r/shittyaskscience/comments/ayjypt/if_poison_is_past_its_expiration_date_is_it_more/
      Is there any coming back from deep-seated resentment? question link is https://www.reddit.com/r/Marriage/comments/ayjrpd/is_there_any_coming_back_from_deepseated/
      29 Domains for sale! | All Priced between $19-$79 | BIN via Efty/Paypal question link is https://www.reddit.com/r/Domains/comments/ayji73/29_domains_for_sale_all_priced_between_1979_bin/
      This Pringles can with a Leap Day <em style="font-weight:700">expiration</em> date question link is https://www.reddit.com/r/mildlyinteresting/comments/ayjg9b/this_pringles_can_with_a_leap_day_expiration_date/
      Is it wrong for a relationship to have an <em style="font-weight:700">expiration</em> date? question link is https://www.reddit.com/r/relationships/comments/ayizdk/is_it_wrong_for_a_relationship_to_have_an/
      Buy Valtrex From a Usa Pharmacy Without a Prescription, How To Mail Order Valtrex Canada question link is https://www.reddit.com/r/Fermat/comments/ayit8c/buy_valtrex_from_a_usa_pharmacy_without_a/
      Fragment Bullets question link is https://www.reddit.com/r/Diepio/comments/ayify9/fragment_bullets/
      Calculating <em style="font-weight:700">Expiration</em> Dates - Previous Solution No Longer Works question link is https://www.reddit.com/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/
      My current dilemma with excess backstory. TLDR at the bottom question link is https://www.reddit.com/r/atheism/comments/ayi45q/my_current_dilemma_with_excess_backstory_tldr_at/
      Worst (and not-so-bad) things for metric-born US resident question link is https://www.reddit.com/r/Metric/comments/ayhqw0/worst_and_notsobad_things_for_metricborn_us/
      Weird Question about sorting your papers? question link is https://www.reddit.com/r/konmari/comments/ayhgaj/weird_question_about_sorting_your_papers/
      Hot Cash Mega Thread question link is https://www.reddit.com/r/funkopop/comments/ayheji/hot_cash_mega_thread/
      [40k] What would a modern earth's tithe consist of? question link is https://www.reddit.com/r/AskScienceFiction/comments/ayh60x/40k_what_would_a_modern_earths_tithe_consist_of/
      TIL an FDA study requested by the military found 90% of more than 100 drugs, both prescription and over-the-counter, were still safe &amp; effective even 15 years after the <em style="font-weight: 700;">expiration</em> date. <em style="font-weight: 700;">Expiration</em> dates don’t really indicate a point at which the medication is no longer effective or unsafe to use. question link is https://www.reddit.com/r/unremovable/comments/aygnp6/til_an_fda_study_requested_by_the_military_found/
      Do I still have stock options? question link is https://www.reddit.com/r/stocks/comments/aygh3b/do_i_still_have_stock_options/
      CVS Coupon <em style="font-weight: 700;">Expiration</em> Policy question link is https://www.reddit.com/user/nowpromooff/comments/ayggbn/cvs_coupon_expiration_policy/
      CVS Coupon <em style="font-weight: 700;">Expiration</em> Date question link is https://www.reddit.com/user/nowpromooff/comments/aygg3n/cvs_coupon_expiration_date/
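
Because the titles are read through innerHTML, the output above still contains the <em> highlight tags and escaped entities such as &amp;. One way to clean such strings after the fact is sketched below with the standard-library html.parser; the names _TextOnly and inner_html_to_text are hypothetical helpers introduced for illustration, not part of Selenium.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment, discarding the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def inner_html_to_text(fragment):
    """Strip markup from an innerHTML string and unescape its entities."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)

print(inner_html_to_text('Drug <em style="font-weight:700">Expiration</em> Dates'))
```

On the Selenium side, reading element.text or element.get_attribute("textContent") instead of innerHTML should avoid the markup in the first place.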
      

    【Comments】:

      【Solution 3】:

      Try the following.

      question.find_element_by_tag_name('span').text
      

      Or simply

      question.text
      

      【Comments】:
