【Question title】: Split Sentence using RegexpTokenizer in a dataframe [duplicate]
【Posted】: 2020-05-15 08:55:39
【Question】:

I am trying to feed a dataframe into my text processor so that the text is split into sentences first and then into words.

Sample text:

When the blow was repeated,together with an admonition in
childish sentences, he turned over upon his back, and held his paws in a peculiar manner.

1) This a numbered sentence
2) This is the second numbered sentence

At the same time with his ears and his eyes he offered a small prayer to the child.

Below are the examples
- This an example of bullet point sentence
- This is also an example of bullet point sentence

Desired output:


[
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
['1', ')', 'This', 'a', 'numbered', 'sentence'],
['2', ')', 'This', 'is', 'the', 'second', 'numbered', 'sentence'],
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
['Below', 'are', 'the', 'examples'],
['-', 'This', 'an', 'example', 'of', 'bullet', 'point', 'sentence'],
['-', 'This', 'is', 'also', 'an', 'example', 'of', 'bullet', 'point', 'sentence']
]

Code I have tried so far:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[^d\)\-\*?!]+')

df["Regexp"] = data[comments].apply(tokenizer.tokenize)

【Comments on the question】:

    Tags: python pandas dataframe nltk tokenize


    【Solution 1】:

    This is one possible solution. You can adapt it to your own data.

    text = """When the blow was repeated,together with an admonition in
    childish sentences, he turned over upon his back, and held his paws in a peculiar manner.
    
    1) This a numbered sentence
    2) This is the second numbered sentence
    
    At the same time with his ears and his eyes he offered a small prayer to the child.
    
    Below are the examples
    - This an example of bullet point sentence
    - This is also an example of bullet point sentence"""
    
    
    
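    import re
    import nltk
    
    # sent_tokenize relies on NLTK's Punkt models. If they are missing,
    # download them once with nltk.download('punkt')
    # (the resource is named 'punkt_tab' in recent NLTK releases).
    sentences = nltk.sent_tokenize(text)
    results = []
    
    for sent in sentences:
        # Punkt does not break at list items, so turn the single newline
        # before a bullet ("-") or a digit into a blank line ...
        sent = re.sub(r'(\n)(-|[0-9])', r"\1\n\2", sent)
        # ... then split on blank lines so each item becomes its own unit.
        for s in sent.split('\n\n'):
            results.append(nltk.word_tokenize(s))
    
    results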
    
    [
    ['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
    ['1', ')', 'This', 'a', 'numbered', 'sentence'],
    ['2', ')', 'This', 'is', 'the', 'second', 'numbered', 'sentence'],
    ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
    ['Below', 'are', 'the', 'examples'],
    ['-', 'This', 'an', 'example', 'of', 'bullet', 'point', 'sentence'],
    ['-', 'This', 'is', 'also', 'an', 'example', 'of', 'bullet', 'point', 'sentence']
    ]
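
For readers who would rather avoid the NLTK model download, the same two-step idea (split into sentence-like units, then into word and punctuation tokens) can be sketched with the standard `re` module alone. This is a swapped-in, regex-only approximation, not the method used in the answer above. The names `SENTENCE_BREAK`, `TOKEN`, `split_and_tokenize`, and `sample` are made up for illustration. The patterns assume that sentences end in `.`, `!` or `?` and that list items start with `-` or a marker like `1)`, so abbreviations such as "e.g." would be split incorrectly.

```python
import re

# Break after sentence-ending punctuation, at a blank line, or at a newline
# that is followed by a bullet ("-") or a numbered marker such as "1)".
# Assumption: these three cases cover the layout of the sample text.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|\n\s*\n|\n(?=\s*(?:-|\d+\)))")

# A token is a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_and_tokenize(text):
    """Return one list of tokens per sentence or list item in `text`."""
    pieces = SENTENCE_BREAK.split(text)
    # Skip pieces that are empty or whitespace-only.
    return [TOKEN.findall(piece) for piece in pieces if piece.strip()]


# Hypothetical miniature input with the same features as the sample text.
sample = "First sentence. Second one!\n\n1) numbered item\n- bullet item"
print(split_and_tokenize(sample))
```

To run this over a dataframe column, the function could be passed to `Series.apply`, for example `df["comments"].apply(split_and_tokenize)`, where the column name `"comments"` is an assumption. Each cell then holds a list of token lists, one per sentence or list item.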
    

    【Discussion】:
