How to create a dataframe of tokenized words (columns) per sentence (rows)?

[Question title]: How to create a dataframe of tokenized words (columns) per sentence (rows)?
[Posted]: 2019-09-01 04:34:31
[Question description]:

I have the following text:

"Hi there, my name is sam! I love spicy hand pulled noodles. I also like to game alot."

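My goal is to turn this paragraph into a dataframe of tokenized words per sentence, where the number of rows equals the number of sentences and the number of columns equals the number of words in the longest sentence.

I started by creating a dataframe of the tokenized sentences: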

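import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = "Hi there, my name is sam! I love spicy hand pulled noodles. I also like to game alot."
# One row per sentence.
df = pd.DataFrame({"sentences": sent_tokenize(paragraph)})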

The result is:

    sentences
0   Hi there, my name is sam!
1   I love spicy hand pulled noodles.
2   I also like to game alot.

Then I convert each sentence (row) into a list of tokenized words:

df["tokens"] = df.sentences.apply(word_tokenize)

The result is (if I print that column on its own):

0    [Hi, there, ,, my, name, is, sam, !]
1    [I, love, spicy, hand, pulled, noodles, .]
2    [I, also, like, to, game, alot, .]
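For readers who want to reproduce the `tokens` column without downloading NLTK's tokenizer data, here is a minimal sketch. The `simple_tokenize` helper and its regex are assumptions made for illustration and are not part of the original post. They give the same tokens as shown above for these three sentences, but they are not a general replacement for `word_tokenize`.

```python
import re

import pandas as pd

# The three sentences from the question, already split by hand.
sentences = [
    "Hi there, my name is sam!",
    "I love spicy hand pulled noodles.",
    "I also like to game alot.",
]


def simple_tokenize(text):
    """Hypothetical stand-in for word_tokenize: runs of word characters,
    or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


df = pd.DataFrame({"sentences": sentences})
df["tokens"] = df["sentences"].apply(simple_tokenize)
print(df["tokens"].tolist())
```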

What I would like to happen next is this (this is where I need help):

      w1   w2     w3      w4     w5       w6       w7     w8
0     Hi   there  ,       my     name     is       sam    !
1     I    love   spicy   hand   pulled   noodles  .      NaN
2     I    also   like    to     game     alot     .      NaN

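where the number of columns equals the length of the longest word-tokenized sentence. For sentences shorter than the longest one, I would like the empty columns to hold NaN values (or even 0.0). Is there a way to achieve this with pandas commands?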

[Comments]:

Is it okay to start counting from 0? For example w0, w1, and so on.
Yes, that's fine.

Tags: python python-3.x pandas

[Solution 1]:

If the first prefixed column should start at 1 (w1):

In [350]: df.join(pd.DataFrame(df['tokens'].tolist(), columns=[f'w{i}' for i in range(1, df['tokens'].str.len().max() + 1)])).fillna(np.nan)               
Out[350]: 
                           sentences                                      tokens  w1     w2     w3    w4      w5       w6   w7   w8
0          Hi there, my name is sam!        [Hi, there, ,, my, name, is, sam, !]  Hi  there      ,    my    name       is  sam    !
1  I love spicy hand pulled noodles.  [I, love, spicy, hand, pulled, noodles, .]   I   love  spicy  hand  pulled  noodles    .  NaN
2          I also like to game alot.          [I, also, like, to, game, alot, .]   I   also   like    to    game     alot    .  NaN

If you need it as a separate dataframe:

In [352]: pd.DataFrame(df['tokens'].tolist(), columns=[f'w{i}' for i in range(1, df['tokens'].str.len().max() + 1)]).fillna(np.nan)                        
Out[352]: 
   w1     w2     w3    w4      w5       w6   w7   w8
0  Hi  there      ,    my    name       is  sam    !
1   I   love  spicy  hand  pulled  noodles    .  NaN
2   I   also   like    to    game     alot    .  NaN
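The one-liner above can be unpacked into a small helper, sketched here under stated assumptions: the function name `tokens_to_frame` and its `prefix` and `start` parameters are hypothetical, and the token lists are hard-coded so the example runs without NLTK. It relies on the `DataFrame` constructor padding shorter rows with missing values.

```python
import pandas as pd


def tokens_to_frame(token_lists, prefix="w", start=1):
    """Spread ragged token lists into columns prefix{start}, prefix{start+1}, ..."""
    token_lists = list(token_lists)
    # The widest row decides how many columns are needed.
    width = max((len(tokens) for tokens in token_lists), default=0)
    columns = [f"{prefix}{i}" for i in range(start, start + width)]
    # Rows shorter than `width` are padded with missing values by pandas.
    return pd.DataFrame(token_lists, columns=columns)


# Hard-coded copy of the tokens column from the question.
tokens = [
    ["Hi", "there", ",", "my", "name", "is", "sam", "!"],
    ["I", "love", "spicy", "hand", "pulled", "noodles", "."],
    ["I", "also", "like", "to", "game", "alot", "."],
]

wide = tokens_to_frame(tokens)
print(wide)
```

With the question's dataframe this would be called as `tokens_to_frame(df["tokens"])`. Note that `df.join(...)` aligns on the index, so the approach assumes `df` still has its default 0..n-1 index.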

[Discussion]:

[Solution 2]:

You can try:

import numpy as np
pd.DataFrame(data=df.tokens.tolist()).fillna(np.nan).add_prefix('w')

Output:
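The output of this answer was not preserved in the source. The sketch below shows what the approach does to the column labels, assuming hard-coded tokens in place of the NLTK step. It imports NumPy directly because the `pd.np` alias was deprecated in pandas 1.0 and removed in pandas 2.0. The `one_based` variable is a hypothetical addition for readers who prefer the w1..w8 naming from the question.

```python
import numpy as np
import pandas as pd

# Hard-coded copy of the tokens column from the question.
tokens = [
    ["Hi", "there", ",", "my", "name", "is", "sam", "!"],
    ["I", "love", "spicy", "hand", "pulled", "noodles", "."],
    ["I", "also", "like", "to", "game", "alot", "."],
]

# Default integer labels 0..7 become "w0".."w7" after add_prefix.
wide = pd.DataFrame(tokens).fillna(np.nan).add_prefix("w")
print(wide.columns.tolist())

# Optional: relabel so the columns start at w1 instead of w0.
one_based = wide.copy()
one_based.columns = [f"w{i}" for i in range(1, wide.shape[1] + 1)]
print(one_based.columns.tolist())
```

Because `add_prefix` only prepends a string to the existing labels, the numbering starts at 0, which the asker accepted in the comments.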

[Discussion]: