python - How to split a text file into multiple lists based on whitespace in Python? (Answers)

【Question title】: How to split a text file into multiple lists based on whitespace in Python?
【Posted】: 2015-01-19 05:10:38
【Question description】:

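Hi, I'm new to Python programming. Please help me create a function that takes a text file as an argument and builds a list of words, removing all punctuation, with the list "split" on double spaces. By that I mean a new list should be started at every occurrence of a double space in the text file.

Here is my function: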

def tokenize(document):
    file = open("document.txt","r+").read()
    print re.findall(r'\w+', file)

The input text file contains the following string:

What's did the little boy tell the game warden?     His dad was in the kitchen poaching eggs!

Note: there is a double space after "warden?" and before "His".

My function gives me this output:

['What','s','did','the','little','boy','tell','the','game','warden','His','dad','was','in','the','kitchen','poaching','eggs']

Desired output:

[['What','s','did','the','little','boy','tell','the','game','warden'],
['His','dad','was','in','the','kitchen','poaching','eggs']]

【Comments on the question】:

Tags: python list split tokenize

【Solution 1】:

First split the whole text on the double space, then pass each item to the regex:

>>> import re
>>> text = "What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!"
>>> file = text.split('  ')
>>> file
["What's did the little boy tell the game warden?", 'His dad was in the kitchen poaching eggs!']
>>> res = []
>>> for sen in file:
...    res.append(re.findall(r'\w+', sen))
... 
>>> res
[['What', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden'], ['His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
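
Note that str.split('  ') splits on every exact pair of spaces, so a longer run of spaces leaves empty strings behind, and those become empty lists after the regex step. A minimal sketch that drops them, assuming empty groups are not wanted (the name split_on_double_space and the shortened sample string are made up for illustration):

```python
import re

def split_on_double_space(text):
    # split on each exact pair of spaces; runs longer than two spaces
    # leave empty or space-prefixed pieces behind
    pieces = text.split('  ')
    # extract the words of each piece, which also drops punctuation
    groups = [re.findall(r'\w+', piece) for piece in pieces]
    # keep only the groups that actually contain words
    return [words for words in groups if words]

sample = "game warden?     His dad"   # five spaces between the sentences
print(sample.split('  '))             # ['game warden?', '', ' His dad']
print(split_on_double_space(sample))  # [['game', 'warden'], ['His', 'dad']]
```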

【Discussion】:

【Solution 2】:

Here is a reasonable all-regex approach:

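import re

def tokenize(document):
    # document is the path of the text file to read
    with open(document) as f:
        text = f.read()
    # split wherever two or more whitespace characters occur in a row
    blocks = re.split(r'\s\s+', text)
    return [re.findall(r'\w+', b) for b in blocks]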

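【Discussion】:

A generator approach would be better here.
@BurhanKhalid, considering that runs of multiple spaces can occur anywhere, reading the file in suitably overlapping chunks would be quite complicated, yet that is the only potential advantage of a generator: handling files so huge that there would otherwise not be enough memory. So "better" is in the eye of the beholder: is a very complex, error-prone, long function that might be able to handle huge files (never specified in the Q) "better" than a simple, short, obviously correct one? Also, the Q very explicitly specifies lists as the output, so O(N) memory is mandatory anyway.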
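
For illustration only, here is roughly what the simplest generator variant could look like. This is a sketch, not the chunked reader discussed above: it still takes the whole text as one string, so it does not help with huge files, and the name iter_token_lists is hypothetical.

```python
import re

def iter_token_lists(text):
    # walk the text block by block; blocks are separated by two or more
    # whitespace characters in a row
    start = 0
    for sep in re.finditer(r'\s\s+', text):
        words = re.findall(r'\w+', text[start:sep.start()])
        if words:
            yield words
        start = sep.end()
    # the last block has no separator after it
    words = re.findall(r'\w+', text[start:])
    if words:
        yield words

for words in iter_token_lists("tell the game warden?  His dad was poaching eggs!"):
    print(words)
```

Since the question asks for a list of lists, the caller would end up wrapping this in list(...) anyway.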

【Solution 3】:

The built-in split function can split on multiple spaces.

This:

a = "hello world.  How are you"
b = a.split('  ')
c = [ x.split(' ') for x in b ]

yields:

c = [['hello', 'world.'], ['How', 'are', 'you']]

If you also want to remove the punctuation, apply the regular expression to the elements of "b", or to "x" in the third statement.
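
A minimal sketch of that last suggestion, assuming the same \w+ pattern used in the question is acceptable (it keeps letters, digits and underscores, so an apostrophe also splits a word):

```python
import re

a = "hello world.  How are you"
b = a.split('  ')
# apply the regular expression to each element of b instead of x.split(' ')
c = [re.findall(r'\w+', x) for x in b]
print(c)  # [['hello', 'world'], ['How', 'are', 'you']]
```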

【Discussion】:

【Solution 4】:

First split the file on punctuation, then in a second pass split the resulting strings on whitespace.

import re

def splitByPunct(s):
    # yield each maximal run of characters that contains no . , ? or !
    return (x.group(0) for x in re.finditer(r'[^\.\,\?\!]+', s) if x and x.group(0))

[x.split() for x in splitByPunct("some string, another   string! The phrase")]

This produces

[['some', 'string'], ['another', 'string'], ['The', 'phrase']]

【Discussion】:

How do I apply the split() function to the derived list after the first pass? Could you write this out as a module so I can understand it?
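
One way to spell out the two passes of this solution step by step, as a sketch rather than the answerer's own code (the names split_by_punct and tokenize_text are made up for illustration):

```python
import re

def split_by_punct(s):
    # first pass: every maximal run of characters other than . , ? !
    return re.findall(r'[^.,?!]+', s)

def tokenize_text(s):
    result = []
    for chunk in split_by_punct(s):
        # second pass: str.split() with no argument splits on any run
        # of whitespace and ignores leading and trailing whitespace
        words = chunk.split()
        if words:
            result.append(words)
    return result

print(tokenize_text("some string, another   string! The phrase"))
# [['some', 'string'], ['another', 'string'], ['The', 'phrase']]
```

Each chunk from the first pass is an ordinary string, so split() is simply called on it inside the loop.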