Need to get actor names out of a JSON file

【Question title】: Need to get actor name out of the JSON file
【Posted】: 2020-05-20 09:47:00
【Question description】:

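I want to extract actor names from the page_title field of this JSON file and then match them against a database. I tried nltk and spaCy, but both require me to train on my own data. Do I have to annotate every single sentence for training? I have more than 100,000 sentences, and annotating them by hand would take a month or more. Is there any way to use a dump of the K_actor database to train spaCy or nltk, or any other approach?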

{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}

【Question discussion】:

Tags: python-3.x scrapy nlp nltk spacy

【Solution 1】:

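One thing you can do is write an annotation script in which the actor name in each sentence is replaced with '@@@' or some other placeholder string. The script later replaces the placeholder with each actor name (the entity) to build the training data.

I trained on 68K sentences in 9 hours on an i3 laptop. You can dump the data as shown in the code, and the output file can be used to train the model.

This saves time and gives you the data in a ready-made spaCy training format.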
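To make the placeholder idea concrete, here is a minimal sketch (not the answerer's exact script). It fills a '@@@' placeholder with each name and records the character offsets as (start, end, label) tuples inside an 'entities' list, which is the shape spaCy's training examples use. The helper fill_template, the template sentence, and the two names are hypothetical and chosen only for illustration.

```python
def fill_template(template, name, label, placeholder="@@@"):
    """Return (text, {'entities': [(start, end, label)]}) for one template and one name."""
    start = template.index(placeholder)      # raises ValueError if the placeholder is missing
    text = template.replace(placeholder, name, 1)
    end = start + len(name)                  # end offset is exclusive
    return text, {"entities": [(start, end, label)]}


# Hypothetical template and names, for illustration only.
template = "@@@ shares a throwback picture"
names = ["Ranveer Singh", "Malaika Arora"]

examples = [fill_template(template, n, "ACTOR_NAMES") for n in names]
for text, ann in examples:
    start, end, label = ann["entities"][0]
    print(text[start:end], label)            # the slice should be the inserted name
```

Slicing the text with the stored offsets is a cheap sanity check: if text[start:end] is not the entity, the training data is misaligned.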

import os.path
import re

from pandas import read_csv


def clean(text):
    # Lowercase, then replace every character except letters, digits,
    # newlines and dots with a single space (length is preserved).
    return re.sub(r'[^a-z0-9\n.]', ' ', str(text).lower())


def annot(label, entities, textlist):
    # Build spaCy-style examples: (text, {'entities': [(start, end, label)]}).
    finaldict = []
    for template in textlist:
        for value in entities:
            # Fill the '@@@' placeholder with the entity, then normalise the text.
            text = clean(str(template).replace('@@@', str(value)))
            # Normalise the entity the same way so it can be found in the cleaned text.
            needle = clean(value).strip()
            traindata = []
            if needle:
                # Match whole words only; offsets are (start, end) with end exclusive.
                pattern = r'(?<![a-z0-9])' + re.escape(needle) + r'(?![a-z0-9])'
                for match in re.finditer(pattern, text):
                    traindata.append((match.start(), match.end(), label))
            finaldict.append((text, {'entities': traindata}))
    return finaldict

def getEntities(csv_file, column):
    # Read one column of entity names (actor names here) from a CSV file.
    df = read_csv(csv_file)
    return df[column].to_list()


def getSentences(file_name):
    # Read one sentence (or '@@@' template) per line.
    with open(file_name) as file1:
        sentences = [line1.rstrip('\n') for line1 in file1]
    return sentences


def saveData(data, filename, path):
    # Append each (text, annotations) tuple to the output file, one per line.
    filename = os.path.join(path, filename)
    with open(filename, 'a') as file:
        for sent in data:
            file.write("{}\n".format(sent))


# csv_file, column_name, filepathandname and path are not defined here;
# set them to your own values before running the lines below.
ents = getEntities(csv_file, column_name)  # Actor names in your case
entities = [ent for ent in ents if str(ent) != 'nan']  # drop missing values

sentences = getSentences(filepathandname)  # Assuming the sentences are in a text file
label = 'ACTOR_NAMES'
data = annot(label, entities, sentences)
saveData(data, 'train_data.txt', path)
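getSentences expects a plain-text file with one sentence per line, whereas the data in the question has one JSON object per line. A minimal sketch for bridging the two, assuming that one-object-per-line layout; the function name titles_from_jsonl and the file names in the commented call are hypothetical:

```python
import json


def titles_from_jsonl(jsonl_path, out_path, field="page_title"):
    """Copy one field from a JSON Lines file into a text file, one value per line."""
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue                      # skip blank lines
            record = json.loads(line)         # also decodes escapes such as \u2019
            title = record.get(field)
            if title:
                # Keep each title on a single line of the output file.
                dst.write(title.replace("\n", " ") + "\n")
                count += 1
    return count


# Hypothetical file names; adjust to your own data.
# titles_from_jsonl("news18.jsonl", "sentences.txt")
```

Titles produced this way contain real names rather than the '@@@' placeholder, so annot will simply search each title for each name from the CSV. That also yields many examples with an empty entity list.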

Hope this is a relevant answer to your question.

【Discussion】:

File "pandas/_libs/parsers.pyx", line 2071, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 29 fields in line 1592, saw 37 This is what I get every time.
Does this happen while reading the CSV file? At what point do you get the error?
csv_file = 'cast_dump.csv' column_name = 'name' filepathname = 'ndtv.txt' This is what I entered; then I ran the code and got this error.
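The ParserError quoted above means pandas found more fields on a row than it expected (37 instead of 29 on line 1592), which often points to unquoted commas or a different delimiter in the dump. A minimal sketch for locating such rows with the standard csv module, assuming a comma-delimited UTF-8 file; the function name find_ragged_rows is hypothetical:

```python
import csv


def find_ragged_rows(csv_path, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header's."""
    ragged = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            return ragged                     # empty file
        expected = len(header)
        for row in reader:
            # Blank lines come back as empty lists and are ignored here.
            if row and len(row) != expected:
                # reader.line_num is the physical line on which the row ended.
                ragged.append((reader.line_num, len(row)))
    return ragged


# Example call, using the file name from the comment above:
# print(find_ragged_rows("cast_dump.csv")[:10])
```

Once the offending rows are known, they can be fixed in the dump itself. In pandas 1.3 or later they can also be skipped by passing on_bad_lines='skip' to read_csv, at the cost of silently losing those rows.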