Answers to: problems importing text in python

【Question title】: problems importing text in python
【Posted】: 2023-03-20 07:44:01
【Question description】:

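Using python, I'm trying to take a text file and build one long list of words (in the order they appear in the document).

So far I loop over every line and basically just add the words to the long list.

It should lowercase every word and remove any punctuation it finds.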

wordstory=[a.lower().strip(string.punctuation) for b in [line.split() for line in open('alice.txt')] for a in b]

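.strip(string.punctuation) does not seem to recognize some punctuation for removal, and in some cases punctuation is converted into odd codes.

I end up with cases like this, where \xe2\x80\x94 should not be there at all.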

..
 'she',
 'spoke\xe2\x80\x94fancy',
 'curtseying',
..

Also, .strip(string.punctuation) does not remove apostrophes when they occur next to double quotes. I end up with:

..
'she',
 "couldn't",
 'answer',
..

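Can someone provide some useful code, or point me to a resource that would help me understand what is going on?

【Comments on the question】:

Can you provide a sample text file and the code you have tried so far?
It sounds like you are trying to read a unicode file?
I bet your source document contains multi-byte unicode punctuation.
Note that str.strip only removes characters at the start and end of a string, so parents' becomes parents but parent's is not changed at all.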
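
The last two comments can be checked directly. The following is a minimal sketch in Python 3; the sample words are illustrative and not taken from alice.txt.

```python
import string

# str.strip only trims characters from the two ends of a string.
trimmed_plural = "parents'".strip(string.punctuation)
trimmed_possessive = "parent's".strip(string.punctuation)
print(trimmed_plural)        # parents
print(trimmed_possessive)    # parent's (the inner apostrophe is kept)

# string.punctuation holds ASCII punctuation only, so an em dash (U+2014)
# is never treated as a character to strip.
em_dash = "\u2014"
em_dash_is_ascii_punctuation = em_dash in string.punctuation
print(em_dash_is_ascii_punctuation)  # False

word_with_dash = "fancy" + em_dash
unchanged = word_with_dash.strip(string.punctuation) == word_with_dash
print(unchanged)  # True
```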

Tags: python string file text

【Solution 1】:

I think you are running into a unicode problem, as well as using needlessly convoluted list comprehensions.

I'd suggest doing this:

# -*- coding: utf-8 -*-

import string

# read the whole file; the with-block closes it afterwards
with open("text_file.txt", "r") as text_file:
    raw_text = text_file.read()

# stripping punctuation (string.punctuation covers ASCII punctuation only)
punctuation = set(string.punctuation)
trimmed_text = ''.join(char for char in raw_text if char not in punctuation)

# splitting into list; split() with no argument also splits on newlines and tabs
word_list = trimmed_text.split()

# removing duplicates (a set does not keep the original order)
unique_word_list = set(word_list)

# or if you're preserving the order, maybe try:
unique_word_list = []
for word in word_list:
    if word not in unique_word_list:
        unique_word_list.append(word)

print(unique_word_list)
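
The code above only removes ASCII punctuation, so a character such as the em dash from the question would survive it. One possible Python 3 variant, assuming the file is UTF-8 encoded, classifies characters by Unicode category instead. The helper names `words_in_order` and `read_words` are hypothetical, and treating dashes as word separators is a design assumption rather than something the question specifies.

```python
import unicodedata


def words_in_order(text):
    """Lowercase text, drop punctuation, and return words in document order."""
    cleaned = []
    for char in text.lower():
        category = unicodedata.category(char)
        if category == "Pd":
            # Dash punctuation (hyphen, em dash): assumed to separate words.
            cleaned.append(" ")
        elif not category.startswith("P"):
            # Keep everything that is not in a punctuation category.
            # Symbols such as "$" or "+" are category S and are kept.
            cleaned.append(char)
    return "".join(cleaned).split()


def read_words(path, encoding="utf-8"):
    # The encoding is an assumption; pass a different one if the file needs it.
    with open(path, encoding=encoding) as handle:
        return words_in_order(handle.read())


sample = "She spoke\u2014fancy curtseying! \u201cI couldn\u2019t answer,\u201d she said."
print(words_in_order(sample))
# ['she', 'spoke', 'fancy', 'curtseying', 'i', 'couldnt', 'answer', 'she', 'said']
```

Unlike the solution above, this keeps duplicates, because the question asks for every word in order of appearance.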

【Comments】:

【Solution 2】:

If you want to remove all punctuation, use translate and string.maketrans:

In [94]: import string

In [95]: a ="she's all foo!"

In [96]: a.lower().translate(string.maketrans("",""), string.punctuation)
Out[96]: 'shes all foo'

str.strip only removes characters at the end or beginning of a string.
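
The session above uses the Python 2 form of translate. In Python 3, string.maketrans no longer exists and str.translate takes a single table, so a roughly equivalent sketch looks like the following. The extra Unicode characters added to the table are an illustrative choice, not a complete list.

```python
import string

# str.maketrans("", "", chars) builds a table that deletes every character in chars.
delete_ascii_punctuation = str.maketrans("", "", string.punctuation)

a = "she's all foo!"
print(a.lower().translate(delete_ascii_punctuation))  # shes all foo

# Assumption: the file also contains an em dash and curly quotes, which
# string.punctuation does not include, so they are added explicitly.
extra_punctuation = "\u2014\u2018\u2019\u201c\u201d"
delete_all = str.maketrans("", "", string.punctuation + extra_punctuation)
print("couldn\u2019t".translate(delete_all))  # couldnt
```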

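【Comments】:

There seems to be an odd dash in the original text file that turns into \xe2\x80\x94 when it is imported into python - why is that? Thanks for telling me about the translate function, but it did not get rid of the apostrophe in the rare case I described, where it occurs next to double quotes.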
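
A likely explanation, assuming the file is UTF-8 encoded: the three bytes \xe2\x80\x94 are the UTF-8 encoding of a single character, the em dash (U+2014). When the file is read as raw bytes, which is what the escaped output in the question suggests, the dash shows up as those three bytes. The double quotes around "couldn't" are only how Python displays a string that contains an apostrophe; they are not part of the word. A short Python 3 sketch of both points:

```python
import unicodedata

raw = b"spoke\xe2\x80\x94fancy"   # bytes as they sit in a UTF-8 file
text = raw.decode("utf-8")        # decoded text: the three bytes become one character

print(len(raw), len(text))        # 13 11
print(unicodedata.name(text[5]))  # EM DASH
print(text == "spoke\u2014fancy") # True

# repr() switches to double quotes when the string contains an apostrophe.
print(repr("couldn't"))           # "couldn't"
print(repr("answer"))             # 'answer'
```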