【Question title】: YouTube comments extractor enters an infinite loop if there are too many comments
【Posted】: 2021-07-21 12:31:07
【Question】:

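I wrote a script that extracts the comments of a YouTube video, given its video ID, and stores them in a file. If the video has fewer than 10-15 comments there is no problem and the script runs fine, but when there are more it enters an infinite loop, and I don't know why.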

from googleapiclient.discovery import build 
import os
api_key = '...'

def video_comments(video_id): 
    # empty file for storing comments
    outputFile = open("comments_"+video_id+".txt", "w", encoding='utf-8')

    # empty list to store one entry per top-level comment
    commentsDict = []

    # empty list for storing reply 
    replies = [] 

    # creating youtube resource object 
    youtube = build('youtube', 'v3', 
                    developerKey=api_key) 

    # retrieve youtube video results 
    video_response=youtube.commentThreads().list( 
    part='snippet,replies', 
    videoId=video_id 
    ).execute() 

    # iterate video response 
    while video_response: 
        
        # extracting required info 
        # from each result object 
        for item in video_response['items']: 
            # Extracting comments 
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay'] 
            commentEntrie = {"comment": comment, 'replies': []}
            
            # counting number of reply of comment 
            replycount = item['snippet']['totalReplyCount'] 

            # if reply is there 
            if replycount>0: 
                
                # iterate through all reply 
                for reply in item['replies']['comments']: 
                    
                    # Extract reply 
                    reply = reply['snippet']['textDisplay'] 
                    
                    # Store reply in list
                    replies.append(reply) 
                    commentEntrie['replies'].append(reply)
                    
            # print comment with list of reply 
            print(comment, replies, end = '\n\n')
            outputFile.write("%s" % comment)
            outputFile.write("%s\n" % replies)
            commentsDict.append(commentEntrie)
            # empty reply list 
            replies = [] 

        # Again repeat 
        if 'nextPageToken' in video_response: 
            video_response = youtube.commentThreads().list( 
                    part = 'snippet,replies', 
                    videoId = video_id 
                ).execute() 
        else: 
            break
    outputFile.close()
    print(commentsDict)

# Enter video id 
video_id = "aDHYbM9OqUc" 

# Call function 
video_comments(video_id)  

I can provide two video IDs: LVgKlfw4DHc works fine, but aDHYbM9OqUc ends in an infinite loop. Any ideas?

[Edit] I think nextPageToken is always present, so the loop runs forever.

【Discussion】:

    Tags: python loops youtube-api youtube-data-api


    【Solution 1】:

    Your loop while video_response: becomes infinite because of this code:

    if 'nextPageToken' in video_response: 
        video_response = youtube.commentThreads().list( 
            part = 'snippet,replies', 
            videoId = video_id 
        ).execute() 
    else: 
        break
    

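    If the first video_response contains the property nextPageToken, the call to CommentThreads.list inside the loop is exactly the same as the call outside the loop. The second call therefore returns a video_response identical to the one obtained from the previous call, nextPageToken included, so the loop never reaches the break.

    The correct implementation would be: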

    if 'nextPageToken' in video_response: 
        video_response = youtube.commentThreads().list( 
            pageToken = video_response['nextPageToken'],
            part = 'snippet,replies', 
            videoId = video_id 
        ).execute() 
    else: 
        break
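
The difference between the two versions can be reproduced without calling the real API. The following is a minimal sketch, assuming a hypothetical in-memory endpoint: PAGES, fake_list, fetch_all and fetch_without_token are illustration-only names and are not part of the YouTube Data API or the Google client library. The buggy variant is capped at max_calls so that it terminates.

```python
# Hypothetical in-memory stand-in for a token-paginated endpoint.
# PAGES and fake_list are illustration-only names, not part of any Google API.
PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "t1"},
    "t1": {"items": ["c3", "c4"], "nextPageToken": "t2"},
    "t2": {"items": ["c5"]},  # last page: no nextPageToken
}


def fake_list(page_token=None):
    """Return the page identified by page_token (None means the first page)."""
    return PAGES[page_token]


def fetch_all():
    """Collect every item by forwarding nextPageToken on each call."""
    items = []
    response = fake_list()
    while True:
        items.extend(response["items"])
        token = response.get("nextPageToken")
        if token is None:
            break
        response = fake_list(page_token=token)
    return items


def fetch_without_token(max_calls=5):
    """Repeat the buggy pattern, capped at max_calls so that it terminates."""
    items = []
    response = fake_list()
    for _ in range(max_calls):
        items.extend(response["items"])
        if "nextPageToken" not in response:
            break
        response = fake_list()  # bug: token not forwarded, same page again
    return items
```

Here fetch_all() walks the three pages once, while fetch_without_token() keeps receiving the first page until the cap is hit, which is the behaviour described above.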
    

    Since you are using Google's APIs Client Library for Python, the pythonic way to implement result set pagination on the CommentThreads.list API endpoint looks like this:

    request = youtube.commentThreads().list(
        part = 'snippet,replies', 
        videoId = video_id 
    )
    
    while request:
        response = request.execute()
    
        for item in response['items']:
            ...
    
        request = youtube.commentThreads().list_next(
            request, response)
    

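    Thanks to the way the Python client library is implemented, this is simple: there is no need to handle the API response property nextPageToken or the API request parameter pageToken explicitly at all.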
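
Putting the list_next loop together with the extraction logic from the question could look like the sketch below. The function name collect_comments and its return shape are assumptions made for illustration. The maxResults=100 argument is an optional addition that is not in the original code (the endpoint returns 20 threads per page by default), and the .get calls are a defensive choice for threads that carry no replies object.

```python
def collect_comments(youtube, video_id):
    """Return a list of {'comment': str, 'replies': [str, ...]} dicts.

    `youtube` is expected to be the resource object returned by
    googleapiclient.discovery.build('youtube', 'v3', developerKey=...).
    collect_comments is a hypothetical helper name, not a library function.
    """
    comments = []
    request = youtube.commentThreads().list(
        part='snippet,replies',
        videoId=video_id,
        maxResults=100,  # optional addition: fewer round trips per video
    )

    while request is not None:
        response = request.execute()

        for item in response.get('items', []):
            top = item['snippet']['topLevelComment']['snippet']['textDisplay']
            # 'replies' is only present when the thread has replies
            replies = [
                reply['snippet']['textDisplay']
                for reply in item.get('replies', {}).get('comments', [])
            ]
            comments.append({'comment': top, 'replies': replies})

        # list_next returns None once the response has no nextPageToken,
        # which ends the loop
        request = youtube.commentThreads().list_next(request, response)

    return comments
```

To use it, create the resource as in the question with build('youtube', 'v3', developerKey=api_key) and call collect_comments(youtube, "aDHYbM9OqUc"). Writing the result to a file can then happen after the loop, which keeps the pagination logic separate from the output code.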

    【Comments】:

    • Thank you for your answer, and for the details!