【Question title】: How to scrape an iframe using Selenium?
【Posted】: 2021-04-07 02:53:56
【Question description】:

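I want to extract all of the comments on a website. The site renders its comment section inside an iframe. I tried scraping it with Selenium, but I can only get one comment. How can I scrape the rest of the comments and save them to a CSV or Excel file?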

  • Code:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    page = driver.get("https://finance.detik.com/berita-ekonomi-bisnis/d-5307853/ri-disebut-punya-risiko-korupsi-yang-tinggi?_ga=2.13736693.357978333.1608782559-293324864.1608782559")
    
    iframe = WebDriverWait(driver,20).until(EC.presence_of_element_located((By.XPATH, "//iframe[@class='xcomponent-component-frame xcomponent-visible']")))
    driver.switch_to.frame(iframe)
    
    xpath = '//*[@id="cmt66363941"]/div[1]/div[1]'
    extract_name = WebDriverWait(driver,20).until(EC.presence_of_element_located((By.XPATH, xpath)))
    username=extract_name.text
    
    xpath = '//*[@id="cmt66363941"]/div[1]/div[2]'
    extract_comment = WebDriverWait(driver,20).until(EC.presence_of_element_located((By.XPATH, xpath)))
    comment=extract_comment.text
    
    print(username, comment)
  • Output:
    King Akbarmachinery
    3 hari yang lalu selama korupsi tidak dihukum mati disanalah korupsi masih liar dan ada kalaupun dibuat hukum mati setidaknya bisa mengurangi angka korupsi itu
    Laporkan
    2BalasBagikan:

By the way, how can I remove these lines from the output?

Laporkan
2BalasBagikan:

【Question discussion】:

    Tags: python selenium web-scraping iframe


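    【Solution 1】:

    You should generalize your XPath so that it matches all users and all comments at once. You can use presence_of_all_elements_located to get all of the comments and all of the users.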

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    page = driver.get(
        "https://finance.detik.com/berita-ekonomi-bisnis/d-5307853/ri-disebut-punya-risiko-korupsi-yang-tinggi?_ga=2.13736693.357978333.1608782559-293324864.1608782559")
    
    iframe = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.XPATH, "//iframe[@class='xcomponent-component-frame xcomponent-visible']")))
    driver.switch_to.frame(iframe)
    
    xpath_users = "//div[contains(@class, 'comment__cmt_dk_name___EGuzI ')]"
    extract_names = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, xpath_users)))
    
    xpath_comments = "//div[contains(@class, 'comment__cmt_box_text')]"
    extract_comments = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, xpath_comments)))
    
    for user, comment in zip(extract_names, extract_comments):
        user = user.text.split("\n")[0]
        comment = comment.text.split("\n")[0]
        print(user, comment)
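
The question also asks how to drop the trailing "Laporkan" lines and how to save the results to CSV. A minimal sketch of both follows, assuming the comment text has the same shape as the output shown in the question. The helper names `clean_comment` and `write_comments_csv` and the `UI_LABELS` tuple are hypothetical and not part of the original answer.

```python
import csv
import io

# Assumption: "Laporkan" is the first UI control line after the comment body,
# as seen in the question's output. It is not taken from the site's markup.
UI_LABELS = ("Laporkan",)


def clean_comment(raw_text):
    """Keep only the lines that come before the first UI label line."""
    kept = []
    for line in raw_text.splitlines():
        if line.strip() in UI_LABELS:
            break
        kept.append(line.strip())
    return " ".join(part for part in kept if part)


def write_comments_csv(rows, out_file):
    """Write (user, comment) pairs to an already opened text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["user", "comment"])
    writer.writerows(rows)


# Usage with text shaped like the question's output (shortened here).
raw = "3 hari yang lalu selama korupsi tidak dihukum mati\nLaporkan\n2BalasBagikan:"
rows = [("King Akbarmachinery", clean_comment(raw))]
buffer = io.StringIO()
write_comments_csv(rows, buffer)
```

To write a real file instead of an in-memory buffer, open it with `open("comments.csv", "w", newline="", encoding="utf-8")` and pass the file object to `write_comments_csv`.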
    

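    【Discussion】:

    • Actually, the comments in the iframe span more than one page, so not all of them are extracted. The comments on the following pages are not picked up.
    【Solution 2】:

    Here is how you can achieve the same thing with the requests module, by sending a POST request with the appropriate parameters, which should fetch the content of all pages for you.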

    import requests
    from urllib.parse import unquote
    
    url = 'https://apicomment.detik.com/graphql'
    payload = {"query":"query search($type: String!, $size: Int!,$anchor: Int!, $sort: String!, $adsLabelKanal: String, $adsEnv: String, $query: [ElasticSearchAggregation]) {\nsearch(type: $type, size: $size,page: $anchor, sort: $sort,adsLabelKanal: $adsLabelKanal, adsEnv: $adsEnv, query: $query){\npaging sorting counter counterparent profile hits {\nposisi hasAds results {\n id author content like prokontra  status news create_date pilihanredaksi refer liker { id } reporter { id status_report } child { id child parent author content like prokontra status create_date pilihanredaksi refer liker { id } reporter { id status_report } authorRefer } } } }}","variables":{"type":"comment","sort":"newest","size":10,"anchor":1,"query":[{"name":"news.artikel","terms":5307853},{"name":"news.site","terms":"dtk"}],"adsLabelKanal":"detik_finance","adsEnv":"desktop"}}
    
    while True:
        r = requests.post(url,json=payload)
        container = r.json()['data']['search']['hits']['results']
        if not container:
            break
        else:
            for item in container:
                if not len(item['author']):continue
                print(item['author']['name'],unquote(item['content']))
    
        payload['variables']['anchor']+=1
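
The paging loop above can also be separated from the network call, which makes it easier to test and to reuse when collecting rows for a file. The following is only a sketch of that idea: `iter_comments` and the `fetch_page` callable are hypothetical names, and the fake pages stand in for the real API response, whose item shape is assumed to match the one used in the answer.

```python
from urllib.parse import unquote


def iter_comments(fetch_page, first_page=1):
    """Yield (author, comment) pairs page by page until a page comes back empty.

    fetch_page is a hypothetical callable that takes a page number and
    returns a list of result dicts shaped like the items in the answer.
    """
    page = first_page
    while True:
        results = fetch_page(page)
        if not results:
            break
        for item in results:
            author = item.get("author")
            if not author:
                # Skip entries without author data, as the answer does.
                continue
            yield author["name"], unquote(item["content"])
        page += 1


# Stand-in for the network call, so the loop can be exercised offline.
fake_pages = {
    1: [
        {"author": {"name": "user_a"}, "content": "first%20comment"},
        {"author": {}, "content": "skipped"},
    ],
    2: [{"author": {"name": "user_b"}, "content": "second%20comment"}],
}
collected = list(iter_comments(lambda page: fake_pages.get(page, [])))
```

With the answer's code, `fetch_page` could set `payload['variables']['anchor']` to the page number, post the payload, and return `r.json()['data']['search']['hits']['results']`.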
    

    【Discussion】:

    • This helped me a lot. I have never used the requests module before; could I get some references on it, especially for web scraping?