Pandas: How to reconstruct strings from a word per row

[Question title]: Pandas: How to reconstruct strings from a word per row
[Posted]: 2018-04-25 14:08:04
[Question description]:

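I'm having trouble reconstructing sentences from a large Pandas DataFrame (1,500,000 rows). My goal is to rebuild the sentences from the individual words into a new DataFrame, so that each row holds one sentence. My DataFrame has two Series: word and tag. Each sentence is terminated by an exclamation mark. On top of that, I want to use the tags from the original DataFrame to create two separate Series in the new DataFrame: one for adjectives and one for nouns/verbs. So this is what I have: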

>df

word    tag

bike    NOUN
winner  NOUN
!       PUNCTUATION
red     ADJECTIVE
car     NOUN
is      VERB
fast    ADJECTIVE
!       PUNCTUATION
...     ...

And this is what I want:

>df2

sent             nounverb     adj

bike winner      bike winner  None
red car is fast  car is       red fast
...

I haven't been able to find a solution, and since I'm a beginner in Python I can't come up with a for loop that would do this for me.

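Edit:

Thanks to Andy and Jesús for the quick answers. Andy's answer worked well, although I needed to modify it slightly when creating the new DataFrame: the words had to be cast to strings.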

df2 = pd.DataFrame({
          "sent": g.apply(lambda sdf: " ".join(sdf.word.astype(str))),
          "nounverb": g.apply(lambda sdf: " ".join(sdf[sdf.is_nounverb].word.astype(str))),
          "adj": g.apply(lambda sdf: " ".join(sdf[sdf.tag == "ADJECTIVE"].word.astype(str)))
  })
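
The post doesn't say which values made the cast necessary. One plausible cause, shown here as an assumption, is a token that pandas parsed as a number: `str.join` only accepts strings, so a single non-string word makes the join fail. The sample data below is hypothetical.

```python
import pandas as pd

# Hypothetical data: one token was read as an integer rather than a string.
words = pd.Series(["route", 66, "is", "long"])

error = None
try:
    " ".join(words)  # str.join rejects the integer 66
except TypeError as exc:
    error = str(exc)

# Casting every value to str first makes the join safe.
joined = " ".join(words.astype(str))
print(joined)
```

Note that `astype(str)` also turns missing values into the literal text "nan", so it may be worth dropping or filling them before joining.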

[Question comments]:

Tags: python string pandas dataframe nlp

[Solution 1]:

If you add a dummy column for "is nounverb", you can use a plain ol' groupby:

In [11]: df["is_nounverb"] = (df.tag == "NOUN") | (df.tag == "VERB")
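
As a side note, the same flag can be written with `Series.isin`, which may read better once there are more than two tags to match. This is only an equivalent spelling of the line above, using the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["bike", "winner", "!", "red", "car", "is", "fast", "!"],
    "tag": ["NOUN", "NOUN", "PUNCTUATION", "ADJECTIVE",
            "NOUN", "VERB", "ADJECTIVE", "PUNCTUATION"],
})

# True where the tag is one of the listed values, False otherwise.
df["is_nounverb"] = df["tag"].isin(["NOUN", "VERB"])
print(df)
```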

Then you can count the "!" you have seen so far to enumerate the sentences:

In [12]: df["sentence"] = (df.word == "!").cumsum()

In [13]: df = df[df.word != "!"]

In [14]: df
Out[14]:
     word        tag  sentence  is_nounverb
0    bike       NOUN         0         True
1  winner       NOUN         0         True
3     red  ADJECTIVE         1        False
4     car       NOUN         1         True
5      is       VERB         1         True
6    fast  ADJECTIVE         1        False
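
To see why the running count works, it can help to look at the intermediate values on their own. A minimal sketch with the sample words from the question (the variable names are only illustrative):

```python
import pandas as pd

words = pd.Series(["bike", "winner", "!", "red", "car", "is", "fast", "!"])

# True on every separator row, False elsewhere.
is_separator = words == "!"

# The cumulative sum counts the separators seen up to and including each row.
sentence_id = is_separator.cumsum()
print(sentence_id.tolist())  # [0, 0, 1, 1, 1, 1, 1, 2]

# Each "!" already carries the id of the following sentence,
# so the separators are dropped after the ids have been assigned.
kept_ids = sentence_id[~is_separator]
print(kept_ids.tolist())  # [0, 0, 1, 1, 1, 1]
```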

Then group by sentence:

In [15]: g = df.groupby("sentence")

In [16]: g.apply(lambda sdf: " ".join(sdf.word))
Out[16]:
sentence
0        bike winner
1    red car is fast
dtype: object

In [17]: g.apply(lambda sdf: " ".join(sdf[sdf.is_nounverb].word))
Out[17]:
sentence
0    bike winner
1         car is
dtype: object

In [18]: g.apply(lambda sdf: " ".join(sdf[sdf.tag == "ADJECTIVE"].word))
Out[18]:
sentence
0
1    red fast
dtype: object

Putting it all together:

In [21]: df2 = pd.DataFrame({
              "sent": g.apply(lambda sdf: " ".join(sdf.word)),
              "nounverb": g.apply(lambda sdf: " ".join(sdf[sdf.is_nounverb].word)),
              "adj": g.apply(lambda sdf: " ".join(sdf[sdf.tag == "ADJECTIVE"].word))
      })

In [22]: df2
Out[22]:
               adj     nounverb             sent
sentence
0                   bike winner      bike winner
1         red fast       car is  red car is fast
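
For reference, the steps above can be collected into one standalone script. The sketch below is a variant, not the answer's exact code: it filters the rows first and then uses `groupby(...).agg(" ".join)` instead of `apply` with a lambda, and the helper `join_per_sentence` is a hypothetical name introduced here. Sentences with no matching word come out as NaN rather than an empty string.

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["bike", "winner", "!", "red", "car", "is", "fast", "!"],
    "tag": ["NOUN", "NOUN", "PUNCTUATION", "ADJECTIVE",
            "NOUN", "VERB", "ADJECTIVE", "PUNCTUATION"],
})

# Number the sentences by counting the "!" separators seen so far,
# then drop the separator rows themselves.
df["sentence"] = (df["word"] == "!").cumsum()
words = df[df["word"] != "!"].copy()
words["word"] = words["word"].astype(str)

sentence_ids = words["sentence"].unique()

def join_per_sentence(mask):
    """Join the words selected by `mask` into one string per sentence."""
    joined = words[mask].groupby("sentence")["word"].agg(" ".join)
    # Sentences without any selected word are filled in as NaN.
    return joined.reindex(sentence_ids)

every_word = pd.Series(True, index=words.index)
df2 = pd.DataFrame({
    "sent": join_per_sentence(every_word),
    "nounverb": join_per_sentence(words["tag"].isin(["NOUN", "VERB"])),
    "adj": join_per_sentence(words["tag"] == "ADJECTIVE"),
})
print(df2)
```

Whether this is faster than the `apply` version on 1,500,000 rows has not been measured here; it is worth timing both on the real data.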

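[Comments]:

[Solution 2]:

One solution is to walk down the first column of the DataFrame and assemble a list of sentences. For example, you can do this with a loop whose condition skips the punctuation. Then, for each word you add to the sentence being assembled, you also assemble its description (assuming there is a 1:1 correspondence between the two).

I've put together a small, incomplete example, but it should point you in the right direction.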

a = ['bike', 'winner', '!', 'red', 'car', 'is', 'fast', '!']
b = ['noun', 'noun', 'punctuation', 'adjective', 'noun', 'verb', 'adjective', 'punctuation']

temp_word = ''
temp_nounverb = ''
temp_adjective = ''
for index, word in enumerate(a):
    if word != '!':  # compare strings by value (!=), not by identity (is not)
        temp_word += word + ' '
        if b[index] in ('noun', 'verb'):
            temp_nounverb += word + ' '
        elif b[index] == 'adjective':
            temp_adjective += word + ' '
    else:
        # End of a sentence: print the three parts, with 'None' for an empty one
        parts = [temp_word, temp_nounverb, temp_adjective]
        print(' - '.join(part.strip() or 'None' for part in parts))
        temp_word = ''
        temp_nounverb = ''
        temp_adjective = ''
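
The same loop can also collect its results into the DataFrame the question asks for, instead of printing them. The following is a sketch under assumptions: `rebuild_sentences` is a hypothetical helper, the tags are upper-case as in the question's DataFrame, and any words after the last "!" are silently dropped.

```python
import pandas as pd

def rebuild_sentences(words, tags, separator="!"):
    """Build one row per sentence from parallel word/tag sequences."""
    rows = []
    sent, nounverb, adj = [], [], []
    for word, tag in zip(words, tags):
        if word == separator:
            # Close the current sentence; empty parts become None.
            rows.append({
                "sent": " ".join(sent),
                "nounverb": " ".join(nounverb) or None,
                "adj": " ".join(adj) or None,
            })
            sent, nounverb, adj = [], [], []
            continue
        sent.append(word)
        if tag in ("NOUN", "VERB"):
            nounverb.append(word)
        elif tag == "ADJECTIVE":
            adj.append(word)
    return pd.DataFrame(rows, columns=["sent", "nounverb", "adj"])

a = ["bike", "winner", "!", "red", "car", "is", "fast", "!"]
b = ["NOUN", "NOUN", "PUNCTUATION", "ADJECTIVE",
     "NOUN", "VERB", "ADJECTIVE", "PUNCTUATION"]
df2 = rebuild_sentences(a, b)
print(df2)
```

With a real DataFrame, the call would presumably be `rebuild_sentences(df["word"], df["tag"])`. A Python-level loop over 1,500,000 rows will likely be slower than a groupby, but it is easy to follow and to adapt.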

Let me know if you need further pointers; I'm happy to help.

[Comments]: