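【Question title】: Need to get actor name out of the JSON file
【Posted】: 2020-05-20 09:47:00
【Question description】:

I want to extract the actor names from the page_title field of this JSON file and then match them against my database. I tried nltk and spaCy, but both require me to train on my own data. Do I have to annotate every single sentence for training? I have more than 100K sentences, and annotating them by hand would take a month or more. Is there any way to use a dump of my K_actor database to train spaCy or nltk, or is there any other approach?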

{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}

【Question discussion】:

    Tags: python-3.x scrapy nlp nltk spacy


    【Solution 1】:

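    One thing you can do is write an annotation script: in each sentence, replace the actor name with '@@@' or some other placeholder string, which is later substituted with every actor name (the entity) to generate training data.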
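The templating step described above could be sketched roughly as follows. This is only an illustration under some assumptions: the input is one JSON object per line with a `page_title` key (as in the question), the actor names are already available as a Python list, and `make_templates` is a hypothetical helper name, not part of the original answer.

```python
import json
import re


def make_templates(json_lines, actor_names, placeholder="@@@"):
    """Turn page_title values into templates by masking a known actor name."""
    # Longest names first, so a full name wins over a shorter overlapping one.
    names = sorted({n.strip() for n in actor_names if n and n.strip()},
                   key=len, reverse=True)
    if not names:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    templates = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        title = json.loads(line)["page_title"]
        found = {m.group(0).lower() for m in pattern.finditer(title)}
        # Keep only titles that mention exactly one known name: the annotation
        # script fills every placeholder in a sentence with the same name.
        if len(found) == 1:
            templates.append(pattern.sub(placeholder, title))
    return templates
```

Writing the returned templates to a text file, one per line, gives the kind of sentence file the annotation script expects.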

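    I trained on 68K sentences in 9 hours on an i3 laptop. You can dump the data with the script below, and the output file can be used to train the model.

    This saves time and gives you data that is already in the spaCy training format.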

    from pandas import read_csv
    import re
    import os.path


    def normalize(text):
        # Lowercase, then replace everything except letters, digits,
        # newlines and dots with a single space (length is preserved).
        return re.sub(r'[^a-z0-9\n.]', ' ', str(text).lower())


    def annot(Label, entity, textlist):
        """Build spaCy-style examples: (text, {'entities': [(start, end, label)]})."""
        finaldict = []
        for template in textlist:
            for value in entity:
                name = normalize(value).strip()
                if not name:
                    continue
                # Fill the '@@@' placeholder with the actor name, then clean the sentence.
                text = normalize(str(template).replace('@@@', str(value)))
                # Record every whole-word occurrence of the name. The end offset
                # is exclusive (start + length), which is what spaCy expects.
                pattern = r'(?<![a-z0-9])' + re.escape(name) + r'(?![a-z0-9])'
                traindata = [(m.start(), m.end(), Label)
                             for m in re.finditer(pattern, text)]
                finaldict.append((text, {'entities': traindata}))
        return finaldict
    
    def getEntities(csv_file, column) :
        df = read_csv(csv_file)
        return df[column].to_list()
    
    def getSentences(file_name) :   
        with open(file_name) as file1 :
            sentences = [line1.rstrip('\n') for line1 in file1]
        return sentences
    
    def saveData (data, filename, path) :
        filename = os.path.join(path, filename)
        with open(filename, 'a') as file :
            for sent in data :
                file.write("{}\n".format(sent))
    
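    # Placeholder inputs: replace these with your own files and column name
    csv_file = 'cast_dump.csv'      # CSV dump of the actor table
    column_name = 'name'            # column that holds the actor names
    filepathandname = 'ndtv.txt'    # one '@@@' template sentence per line
    path = '.'                      # output directory (assumption)

    ents = getEntities(csv_file, column_name)  # actor names in your case
    entities = [ent for ent in ents if str(ent) != 'nan']  # drop missing values

    sentences = getSentences(filepathandname)  # assumes the sentences are in a text file
    label = 'ACTOR_NAMES'
    data = annot(label, entities, sentences)
    saveData(data, 'train_data.txt', path)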
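
Because saveData writes each example with str.format, every line of train_data.txt is the Python repr of a (text, {'entities': [...]}) tuple. A minimal sketch of reading the file back and sanity-checking the spans, assuming that layout and assuming each span is (start, end, label) with an exclusive end; `load_train_data` and `check_offsets` are hypothetical helper names, not part of the script above:

```python
import ast


def load_train_data(lines):
    """Parse lines written by saveData back into (text, annotations) tuples."""
    data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # literal_eval only accepts Python literals, unlike eval.
        text, annotations = ast.literal_eval(line)
        data.append((text, annotations))
    return data


def check_offsets(data):
    """Return, per example, the substrings covered by its entity spans."""
    return [
        [text[start:end] for start, end, _label in annotations["entities"]]
        for text, annotations in data
    ]

# Possible usage (the file name is the one used in the script above):
# with open('train_data.txt') as f:
#     data = load_train_data(f)
# print(check_offsets(data)[:5])
```

If the printed substrings are not the actor names, the offsets are wrong and the data should be fixed before training.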
    

    Hope this answers your question.

    【Discussion】:

    • File "pandas/_libs/parsers.pyx", line 2071, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 29 fields in line 1592, saw 37. This is what I get every time.
    • Does this happen while reading the CSV file? At what point do you get the error?
    • csv_file = 'cast_dump.csv' column_name = 'name' filepathname = 'ndtv.txt' This is what I entered; then I ran the code and got this error.
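
The ParserError quoted above means that line 1592 of the CSV has more fields than the parser expected (37 instead of 29). One possible workaround, sketched here with the standard-library csv module, is to read only the needed column and tolerate rows with extra fields. This assumes a comma-separated file with a header row and assumes the extra fields come after the name column; `get_entities_tolerant` is a hypothetical replacement for getEntities. Newer pandas versions (1.3 and later) also accept `read_csv(..., on_bad_lines='skip')`, which drops such rows instead.

```python
import csv


def get_entities_tolerant(file_obj, column):
    """Collect non-empty values of one column, tolerating ragged rows."""
    # DictReader stores surplus fields under the key None instead of failing,
    # and fills missing fields with None.
    reader = csv.DictReader(file_obj)
    names = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            names.append(value)
    return names

# Possible usage (file name and column taken from the comment above;
# the utf-8 encoding is an assumption):
# with open('cast_dump.csv', newline='', encoding='utf-8') as f:
#     entities = get_entities_tolerant(f, 'name')
```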