Python dataframe: delete sentences by number from a list

【Question title】: Python dataframe delete sentences number from list
【Posted】: 2021-09-19 12:51:24
【Question】:

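I have a dataframe column of (fairly long) texts and, for each text, a list of indices of the sentences I want to delete. The sentence indices were generated by Spacy when I split the texts into sentences. Consider the following example: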

import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')

data = {'text': ['I am A. I am 30 years old. I live in NY.','I am B. I am 25 years old. I live in SD.','I am C. I am 30 years old. I live in TX.'], 'todel': [[1, 2], [1], [1, 2]]}

df = pd.DataFrame(data)

def get_sentences(text):
    text_clean = nlp(text)
    sentences = text_clean.sents
    sents_list = []
    for sentence in sentences:
        sents_list.append(str(sentence))
    return sents_list

df['text'] = df['text'].apply(get_sentences)

print(df)

This gives the following:

                                           text   todel
0  [I am A., I am 30 years old., I live in NY.]  [1, 2]
1   [I am B. I am 25 years old., I live in SD.]     [1]
2   [I am C. I am 30 years old., I live in TX.]  [1, 2]

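Given that I have a very large dataset, with more than 50 sentences to delete per row, how would you efficiently delete the sentences whose indices are stored in todel?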

My expected output is:

                                  text   todel
0                      [I live in NY.]  [1, 2]
1  [I am 25 years old., I live in SD.]     [1]
2                      [I live in TX.]  [1, 2]

【Comments】:

What is your expected output?
I have added that to my question.

Tags: python list dataframe apply spacy

【Solution 1】:

Try this:

import pandas as pd

data = {'text': ['I am A. I am 30 years old. I live in NY.','I am B. I am 25 years old. I live in SD.','I am C. I am 30 years old. I live in TX.'], 'todel': [[1, 2], [1], [1, 2]]}

df = pd.DataFrame(data)

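def fun(sen, lst):
    # todel is read as 1-based: the piece at 0-based position idx is kept
    # only when idx + 1 is not listed in lst.
    return '.'.join(s for idx, s in enumerate(sen.split('.')) if idx + 1 not in lst)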

df['text'] = df.apply(lambda row : fun(row['text'],row['todel']), axis=1)

Output:

                                text   todel
0                      I live in NY.  [1, 2]
1   I am 25 years old. I live in SD.     [1]
2                      I live in TX.  [1, 2]

Edit, based on the edited question:

If df['text'] already holds a list of sentences, so that no splitting is needed, you can try this:

data = {'text': [['I am A.', 'I am 30 years old.', 'I live in NY.'], 
                 ['I am B.', 'I am 25 years old.', 'I live in SD.'],
                 ['I am C.','I am 30 years old.',' I live in TX.']], 'todel': [[1, 2], [1], [1, 2]]}
df = pd.DataFrame(data)
#                                           text     todel
# 0   [I am A., I am 30 years old., I live in NY.]  [1, 2]
# 1   [I am B., I am 25 years old., I live in SD.]     [1]
# 2  [I am C., I am 30 years old.,  I live in TX.]  [1, 2]

def fun(sen, lst):
    return [s for idx, s in enumerate(sen) if idx + 1 not in lst]

df['text'] = df.apply(lambda row : fun(row['text'],row['todel']), axis=1)

print(df)

Output:

                                  text   todel
0                      [I live in NY.]  [1, 2]
1  [I am 25 years old., I live in SD.]     [1]
2                     [ I live in TX.]  [1, 2]
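
Since the question mentions a very large dataset, here is a minimal sketch of the same filter that avoids the per-row overhead of df.apply(axis=1). It assumes, like the function above, that todel holds 1-based positions. The helper name drop_sentences and the variables rows and dels are hypothetical stand-ins, and whether this is actually faster on a given dataset would need to be measured.

```python
def drop_sentences(sentences, todel):
    """Return the sentences whose 1-based position is not listed in todel."""
    # A set gives constant-time membership tests, which helps when
    # each row has 50+ indices to delete.
    drop = set(todel)
    return [s for pos, s in enumerate(sentences, start=1) if pos not in drop]


# Plain lists standing in for df['text'] and df['todel'].
rows = [
    ['I am A.', 'I am 30 years old.', 'I live in NY.'],
    ['I am B.', 'I am 25 years old.', 'I live in SD.'],
]
dels = [[1, 2], [1]]

# zip walks both columns together instead of building a Series per row.
result = [drop_sentences(s, d) for s, d in zip(rows, dels)]
print(result)
```

With the dataframe above, the same comprehension could be written as df['text'] = [drop_sentences(s, d) for s, d in zip(df['text'], df['todel'])].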

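【Comments】:

Thanks, much appreciated. But are you sure that sen.split('.') gives the same sentence split as Spacy?
@krasnapolsky What is Spacy?
I don't know Spacy either, but you either split with the space or without it. sen.split('. ') will remove the space, but then you have to include it in '. '.join(...)
Spacy is a string-processing package. I mentioned in my post that I used this package to obtain the indices of the sentences to delete. So I need to make sure that sen.split('. ') gives the same sentence indices as Spacy.
@krasnapolsky OK, I see. Give me a second.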
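
The difference between splitting on '.' and on '. ' raised in these comments can be checked directly. The short sketch below uses only built-in string methods on the first example text. Neither variant is guaranteed to match Spacy's segmentation: in the question's output, Spacy kept "I am B. I am 25 years old." as a single sentence, whereas a plain split would cut it in two.

```python
text = 'I am A. I am 30 years old. I live in NY.'

# Splitting on '.' keeps the leading spaces and leaves a trailing empty string.
by_dot = text.split('.')
print(by_dot)

# Splitting on '. ' drops the separating space; only the last piece keeps its period.
by_dot_space = text.split('. ')
print(by_dot_space)

# Joining with the same separator restores the original text.
rejoined = '. '.join(by_dot_space)
print(rejoined == text)
```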

【Solution 2】:

Based on @user1740577's answer:

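def fun(sen, lst):
    # todel is read as 0-based here: keep the sentence at position idx
    # only when idx itself is not listed in lst.
    return [s for idx, s in enumerate(sen) if idx not in lst]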

df['text'] = df.apply(lambda row : fun(row['text'],row['todel']), axis=1)

With Spacy's indexing (todel read as 0-based positions in the sentence list), this produces the desired result:

                           text   todel
0                     [I am A.]  [1, 2]
1  [I am B. I am 25 years old.]     [1]
2  [I am C. I am 30 years old.]  [1, 2]
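
The two solutions differ only in how they read todel: solution 1 treats the numbers as 1-based, solution 2 as 0-based. The sketch below makes the difference visible on one row. The function drop_by_index and its base parameter are hypothetical names introduced for illustration and are not part of either answer.

```python
def drop_by_index(sentences, todel, base=0):
    """Drop the sentences whose position, counted from `base`, is in todel."""
    drop = set(todel)
    return [s for pos, s in enumerate(sentences, start=base) if pos not in drop]


sentences = ['I am A.', 'I am 30 years old.', 'I live in NY.']

zero_based = drop_by_index(sentences, [1, 2], base=0)  # reading used in solution 2
one_based = drop_by_index(sentences, [1, 2], base=1)   # reading used in solution 1
print(zero_based)
print(one_based)
```

Which reading is correct depends on how the indices in todel were produced, so it is worth checking a few rows by hand before running this on the full dataset.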
