【Question title】: Scraping all YouTube comments and their replies with Selenium in Python
【Posted】: 2019-10-16 02:17:08
【Question】:

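I am trying to scrape the comments on a YouTube video together with their replies, comment likes, comment dislikes, comment count, and reply count.

As a first step, I tried to scrape text data such as the comments and their replies by element id, using Selenium with the Chrome driver in Python.

I can only scrape the comments that are available on the page, not their replies.

I have not been able to get the replies.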

import time
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_path = "/Users/Downloads/chromedriver"
page_url = "https://www.youtube.com/watch?v=AJesAlohO6I&t=" 


driver = webdriver.Chrome(executable_path=chrome_path)
driver.get(page_url)
time.sleep(2)  


title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print(title)


SCROLL_PAUSE_TIME = 2
CYCLES = 100

html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.PAGE_DOWN)  
html.send_keys(Keys.PAGE_DOWN)  
time.sleep(SCROLL_PAUSE_TIME * 3)

for i in range(CYCLES):
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)


comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
all_comments = [elem.text for elem in comment_elems]
print(all_comments)

write_file = "output_testing.csv"
with open(write_file, "w") as output:
    for line in all_comments:
        output.write(line + '\n')

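With the code above I can only scrape the comments. How can I scrape the replies, likes, dislikes, and dates of those comments with Selenium in Python?

Can anyone help point out where I am going wrong?

Updated code (returns an empty array):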

import time
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_path = "/Users/Downloads/chromedriver"
page_url = "https://www.youtube.com/watch?v=qBp1rCz_yQU" 


driver = webdriver.Chrome(executable_path=chrome_path)
driver.get(page_url)
time.sleep(2)  


title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print(title)


SCROLL_PAUSE_TIME = 2
CYCLES = 100

html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.PAGE_DOWN)  
html.send_keys(Keys.PAGE_DOWN)  
time.sleep(SCROLL_PAUSE_TIME * 3)

for i in range(CYCLES):
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)

driver.find_elements_by_xpath('//div[@id="replies"]/ytd-comment-replies-renderer/ytd-expander/paper-button[@id="more"]')

comment_elems = driver.find_elements_by_xpath('//div[@id="loaded-replies"]//yt-formatted-string[@id="content-text"]')
all_comments = [elem.text for elem in comment_elems]
print(all_comments)

write_file = "output_31may.csv"
with open(write_file, "w") as output:
    for line in all_comments:
        output.write(line + '\n')

My updated code (1-05-2019):

import time
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_path = "/Users/Downloads/chromedriver"
page_url = "https://www.youtube.com/watch?v=qBp1rCz_yQU" 


driver = webdriver.Chrome(executable_path=chrome_path)
driver.get(page_url)
time.sleep(2)  


title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print(title)


SCROLL_PAUSE_TIME = 2
CYCLES = 100

html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.PAGE_DOWN)  
html.send_keys(Keys.PAGE_DOWN)  
time.sleep(SCROLL_PAUSE_TIME * 3)

for i in range(CYCLES):
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)


comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
all_comments = [elem.text for elem in comment_elems]
#print(all_comments)

replies_elems =driver.find_elements_by_xpath('//*[@id="replies"]')
all_replies = [elem.text for elem in replies_elems]
print(all_replies)

write_file = "output_replies.csv"
with open(write_file, "w") as output:
    for line in all_replies:
        output.write(line + '\n')

My actual output:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'View 39 replies', '', '', 'View 2 replies', '', '', '', 'View reply', '', '', '', '', '', 'View reply', '', '', '', '', '', '', '', '', 'View reply', '', '', 'View reply', '', '', '', '', 'View 43 replies', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'View 2 replies', '', '', '', '', '', 'View 17 replies', '', '', '', '', 'View 13 replies', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'View reply', '', 'View reply', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'View 5 replies', '', '', '', '', '', 'View reply', '', 'View 28 replies', '', '', 'View 27 replies', '', '', 'View reply', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'View reply', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'View 9 replies', 'View reply', '', '', '', 'View reply', '', 'View 13 replies', '', '', '', 'View reply', 'View 9 replies', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'View 11 replies', '', '', '', '', 'View 2 replies', '', '', '', '', '', 'View reply', '', '', '', '', '', '', 'View reply', '', '', '', '', '', '', '', 'View reply', '', '', '', 'View 2 replies', '', '', '', '']

The expected output is the text of each reply, but I only get the reply counts.
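
The list printed above does already carry the reply counts the question asks about: presumably an empty string for a comment without replies, `View reply` for a single reply, and `View N replies` otherwise. A minimal sketch of turning those labels into integers follows. `reply_count` is a hypothetical helper name, and it assumes the English button labels shown in the output; other interface languages or abbreviated counts would need different parsing.

```python
import re


def reply_count(label):
    """Return the reply count encoded in a 'View N replies' button label."""
    label = label.strip()
    if not label:
        return 0  # assumption: no button text means the comment has no replies
    match = re.search(r"\d+", label)
    # 'View reply' carries no number and stands for a single reply
    return int(match.group()) if match else 1


# A few labels taken from the output above
sample = ['', 'View 39 replies', '', 'View 2 replies', 'View reply']
counts = [reply_count(text) for text in sample]
print(counts)  # [0, 39, 0, 2, 1]
```

Applied to `all_replies`, this would give one count per comment thread, in page order.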

【Comments on the question】:

    Tags: python selenium web-scraping youtube selenium-chromedriver


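    【Solution 1】:

    You need to click the "View replies" button before you can scrape the replies to a comment.

    To click it, you can do the following: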

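    # find_elements_* returns a list, so click each button rather than calling .click() on the list
    for button in driver.find_elements_by_xpath("//ytd-button-renderer[@id='more-replies']/a/paper-button[@id='button']"):
        button.click()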
    

    Then, to scrape the replies:

    driver.find_elements_by_xpath("//div[@id='loaded-replies']/ytd-comment-renderer//yt-formatted-string[@id='content-text']") 
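
The two steps in this answer can be combined into one routine: click every "View replies" button, then read the reply text. The sketch below is one way to do that, not a definitive implementation. `expand_replies` and `scrape_replies` are hypothetical helper names, the XPaths are copied from the answer and may no longer match YouTube's markup, and the generic `driver.find_elements("xpath", ...)` form is used because recent Selenium releases dropped the `find_elements_by_*` helpers.

```python
import time

# Selenium 4 locator strategy; this string is the value of By.XPATH.
XPATH = "xpath"

# XPaths copied from the answer; YouTube's markup changes often,
# so treat them as assumptions that may need updating.
MORE_REPLIES_XPATH = (
    "//ytd-button-renderer[@id='more-replies']/a/paper-button[@id='button']"
)
REPLY_TEXT_XPATH = (
    "//div[@id='loaded-replies']/ytd-comment-renderer"
    "//yt-formatted-string[@id='content-text']"
)


def expand_replies(driver, pause=1.0):
    """Click every 'View replies' button on the page; return the click count."""
    clicked = 0
    for button in driver.find_elements(XPATH, MORE_REPLIES_XPATH):
        # Clicking through JavaScript sidesteps "element not interactable"
        # errors when a button is scrolled out of view.
        driver.execute_script("arguments[0].click();", button)
        clicked += 1
        time.sleep(pause)  # give the replies time to load
    return clicked


def scrape_replies(driver, pause=1.0):
    """Expand all reply threads, then return the text of each loaded reply."""
    expand_replies(driver, pause)
    return [elem.text for elem in driver.find_elements(XPATH, REPLY_TEXT_XPATH)]
```

With a live browser this would be called as `all_replies = scrape_replies(driver)` once the scrolling loop from the question has loaded the comments.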
    

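    【Comments】:

    • I added these lines, but I get an empty array.
    • I am not sure whether I added these snippets in the right way. I am updating the code in my question now. Please take a look.
    • I have updated my answer; I forgot to add the click at the end.
    • AttributeError: 'list' object has no attribute 'click'. This is the error I got.
    • @WassimAlAhmad Thanks for the information; the structure of the site has changed. I will update the code as soon as possible.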