Python tokenization: answers

【Question title】: Python Tokenization
【Posted】: 2016-03-26 11:25:56
【Question】:

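I'm new to Python and I have a tokenization assignment. The input is a .txt file containing sentences, and the output is a .txt file containing tokens. By "token" I mean a simple word or one of these punctuation marks: ',' '!' '?' '.' '"'

I have this function. Its inputs are: Elemnt, a word with or without punctuation, for example Hi or said: or said"; StrForCheck, an array of the punctuation marks I want to separate from words; and TokenFile, my output file.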

def CheckIfSEmanExist(Elemnt, StrForCheck, TokenFile):
    FirstOrLastIsSeman = 0

    for seman in StrForCheck:
        WordSplitOnSeman = Elemnt.split(seman)
        if len(WordSplitOnSeman) > 1:
            if Elemnt[len(Elemnt)-1] == seman:
                FirstOrLastIsSeman = len(Elemnt)-1
            elif Elemnt[0] == seman:
                FirstOrLastIsSeman = 1

    if FirstOrLastIsSeman == 1:
        TokenFile.write(Elemnt[0])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[1:-1])
        TokenFile.write('\n')

    elif FirstOrLastIsSeman == len(Elemnt)-1:
        TokenFile.write(Elemnt[0:-1])
        TokenFile.write('\n')
        TokenFile.write(Elemnt[len(Elemnt)-1])
        TokenFile.write('\n')

    elif FirstOrLastIsSeman == 0:
        TokenFile.write(Elemnt)
        TokenFile.write('\n')

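The code loops over the punctuation array. If it finds a mark in the word, I check whether that mark is the first or the last character of the word, and then write the word and the punctuation mark on separate lines of my output file.

My problem is that this works well on the whole text except for these words: work", create", public", police"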

【Comments】:

Tags: python tokenize

【Solution 1】:

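Note that

for l in open('some_file.txt', 'r'):
    ...

iterates over the file line by line, so you only need to think about what to do within a single line.

Consider the following function: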

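def tokenizer(l):
    prev_i = 0  # start index of the word currently being scanned
    for (i, c) in enumerate(l):
        if c in ',.?!- ':
            # A delimiter ends the pending word, if there is one.
            if prev_i != i:
                yield l[prev_i: i]
            yield c  # the delimiter itself is also a token
            prev_i = i + 1
    # Emit the trailing word, including when the line has no delimiters.
    if prev_i < len(l):
        yield l[prev_i:]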

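It "spits out" tokens as it goes. You can use it like this:

l = "hello, hello, what's all this shouting? We'll have no trouble here"
for tok in tokenizer(l):
    print(tok)

This prints the following, where each blank-looking line is a space token: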

hello
,

hello
,

what's

all

this

shouting
?

We'll

have

no

trouble

here
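
A possible variation on the generator above, offered only as a sketch: a single regular expression can separate words from punctuation and drop whitespace in one call. The helper name regex_tokenize and the pattern are assumptions made for illustration, and the punctuation set is the one listed in the question (, ! ? . "), so any other character, such as ':' or an apostrophe, stays attached to its word.

```python
import re

# Assumed punctuation set, taken from the question: , ! ? . "
# Either match one punctuation mark, or a run of characters that are
# neither whitespace nor one of those marks.
TOKEN_RE = re.compile(r'[,!?."]|[^\s,!?."]+')


def regex_tokenize(line):
    """Return words and punctuation marks as separate tokens (hypothetical helper)."""
    return TOKEN_RE.findall(line)


print(regex_tokenize('She said "police", then left.'))
```

Because every punctuation mark is matched on its own, a word wrapped in quotes and followed by a comma is split into four tokens rather than two.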

【Comments】:

But I also need to write the punctuation marks to my file. For your sentence, my output should be: hello , hello , what's all this shouting ?
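
One way to get the output described in this comment, sketched under assumptions: keep the generator from the answer, skip the tokens that are only whitespace, and write the rest one per line. The write_tokens helper and the delimiters parameter are hypothetical additions, and the demo uses an in-memory buffer in place of a real output file.

```python
import io


def tokenizer(l, delimiters=',.?!- '):
    # Same idea as the generator in the answer; the delimiters
    # parameter is an assumption added for this sketch.
    prev_i = 0
    for i, c in enumerate(l):
        if c in delimiters:
            if prev_i != i:
                yield l[prev_i:i]
            yield c
            prev_i = i + 1
    if prev_i < len(l):
        yield l[prev_i:]


def write_tokens(lines, out):
    """Write every non-whitespace token on its own line of out."""
    for line in lines:
        for tok in tokenizer(line.rstrip('\n')):
            if not tok.isspace():
                out.write(tok + '\n')


# In-memory stand-in for a file opened with open(..., 'w').
out = io.StringIO()
write_tokens(["hello, hello, what's all this shouting?\n"], out)
print(out.getvalue())
```

With real files, lines could be the object returned by open() on the input file and out the output file opened for writing.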