How to capitalize some words in a text file? Answers

【Question title】: How to capitalize some words in a text file?
【Posted】: 2012-07-25 08:35:37
【Question】:

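I have a text file containing ordinary sentences. I was in a hurry when I typed it, so I only capitalized the first letter of the first word of each sentence (as English grammar requires).

Now I think it would be better if the first letter of every word were capitalized. Something like:

Each Word of This Sentence is Capitalized

Note that in the sentence above, "of" and "is" are not capitalized; I actually want to skip words of three letters or fewer.

How can I do this?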

【Question discussion】:

"I want to skip words of three letters or fewer." - In titles, there are also words with more than 3 characters that should not be capitalized.

Tags: python formatting python-2.7 text-manipulation

【Solution 1】:

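# text_file is assumed to be an already-open file object.
for line in text_file:
    # Title-case only words longer than 3 characters.
    print(' '.join(word.title() if len(word) > 3 else word for word in line.split()))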

Edit: to leave punctuation out of the count, replace len with the following function:

def letterlen(s):
    return sum(c.isalpha() for c in s)

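【Discussion】:

This doesn't take punctuation into account when computing word length.
@martineau: Edited to address your concern.
word.title() capitalizes "can't" as "Can'T". You could use word.capitalize() instead, which capitalizes only the first letter of the word.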
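
The two suggestions in this thread (count only letters, and prefer capitalize() over title()) can be combined. What follows is a minimal sketch rather than the answerer's own code: the name capitalize_long_words and the min_letters parameter are made up for illustration, and it is written for Python 3 even though the question is tagged python-2.7.

```python
def letterlen(s):
    # Count only alphabetic characters, so "end." counts as 3 and "can't" as 4.
    return sum(c.isalpha() for c in s)


def capitalize_long_words(line, min_letters=3):
    # Hypothetical helper: capitalize words with more than `min_letters` letters.
    # str.capitalize() uppercases the first character and lowercases the rest,
    # so "can't" becomes "Can't" rather than "Can'T".
    return ' '.join(
        word.capitalize() if letterlen(word) > min_letters else word
        for word in line.split()
    )
```

Because capitalize() also lowercases the rest of the word, an acronym such as "NASA" would come out as "Nasa"; whether that matters depends on the text.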

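【Solution 2】:

Take a look at NLTK.

Tokenize each word and capitalize it. Words such as "if" and "of" are called "stop words". If your criterion is simply length, Steven's answer is a good way to do it. If you want to look up stop words, there is a similar question on SO: How to remove stop words using nltk or python.

【Discussion】: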

【Solution 3】:

You should split the text into words and capitalize only those with more than three letters.

words.txt:

each word of this sentence is capitalized
some more words
an other line

-

import string


with open('words.txt') as file:
    # List to store the capitalised lines.
    lines = []
    for line in file:
        # Split words by spaces.
        words = line.split(' ')
        for i, word in enumerate(words):
            if len(word.strip(string.punctuation + string.whitespace)) > 3:
                # Capitalise and replace words longer than 3 (without punctuation).
                words[i] = word.capitalize()
        # Join the capitalised words with spaces.
        lines.append(' '.join(words))
    # Join the capitalised lines.
    capitalised = ''.join(lines)

# Optionally, write the capitalised words back to the file.
with open('words.txt', 'w') as file:
    file.write(capitalised)

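【Discussion】:

Close, but what about punctuation that inflates the letter count of a "word"?
Almost perfect, except for embedded punctuation (i.e. "can't"). +1 anyway.
@ArturGaspar How can I prevent this script from writing/printing a blank line at the end?
@Santosh Remove the blank line from your input file.
@ArturGaspar My input file has no blank lines, just a single line of lowercase words.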
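
One way to sidestep both the embedded-punctuation issue and any worry about whitespace is to avoid splitting altogether and rewrite only the letter runs in place. The sketch below is an alternative technique, not the answer's method: it uses re.sub with a replacement function, the pattern treats a word as ASCII letters with optional embedded apostrophes (an assumption made for illustration), and the names WORD_RE and capitalize_text are hypothetical. It targets Python 3.

```python
import re

# A "word" here is a run of ASCII letters, optionally with embedded
# apostrophes (e.g. "can't"). This pattern is an illustrative assumption.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def capitalize_text(text, min_letters=3):
    def repl(match):
        word = match.group(0)
        letters = sum(c.isalpha() for c in word)
        if letters > min_letters:
            # Uppercase only the first letter; leave the rest untouched.
            return word[0].upper() + word[1:]
        return word

    # Anything the pattern does not match (spaces, newlines, punctuation)
    # is passed through unchanged.
    return WORD_RE.sub(repl, text)
```

Since unmatched text is returned as-is, the function neither adds nor removes line breaks, so the whole file contents can be passed in as one string.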

【Solution 4】:

What you really want is a list of what are called stop words. If you don't have such a list, you can build one yourself and do the following:

skipwords = set("of is".split())
punctuation = '.,<>{}][()\'"/\\?!@#$%^&*'  # and any other punctuation that you want to strip out


# The function name is illustrative; wrapping the loop in a function makes the final return valid.
def capitalize_file(filepath):
    answer = ""
    with open(filepath) as f:
        for line in f:
            for word in line.split():
                for p in punctuation:
                    # You end up losing the punctuation in the output, but this is easy to fix if you really care about it.
                    word = word.replace(p, '')
                if word not in skipwords:
                    answer += word.title() + " "
                else:
                    answer += word + " "
    return answer  # or you can write it to the file continuously

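【Discussion】:

Good approach, but punctuation needs to be taken into account (it is not usually considered part of the letters in a word).
Your update solves the punctuation problem, but I suspect in a less-than-ideal, brute-force way.
@martineau How would you optimize it?
Well, for one thing, you could create a set of punctuation characters and use it to avoid the for loop, which most words don't need. Secondly, unless the characters are unicode, the punctuation can be removed with a regular expression via re.sub() or even with str.translate().
re.sub is a bit overkill and can get overly complicated if used carelessly (fools rush in where angels fear to tread). But I do like the str.translate idea.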
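
The str.translate idea could look roughly like the sketch below. It is an illustration under stated assumptions, not the answerer's code: it uses the Python 3 form of the call (str.maketrans with a third argument), the names SKIP_WORDS, STRIP_PUNCTUATION and capitalize_except are hypothetical, and it swaps title() for capitalize(). Punctuation is stripped only for the stop-word lookup, so it survives in the output.

```python
import string

# Hypothetical stop-word list; extend it as needed.
SKIP_WORDS = {"of", "is"}

# Python 3 form: a table that deletes every ASCII punctuation character.
# (On Python 2 byte strings the rough equivalent is
# word.translate(None, string.punctuation).)
STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)


def capitalize_except(line, skip_words=SKIP_WORDS):
    result = []
    for word in line.split():
        # Strip punctuation only for the lookup, so it is kept in the output.
        bare = word.translate(STRIP_PUNCTUATION).lower()
        result.append(word if bare in skip_words else word.capitalize())
    return ' '.join(result)
```

Building the translation table once, outside the loop, avoids the per-word loop over punctuation characters that the comments above object to.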

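【Solution 5】:

You can add all the elements of the text file to a list:

words = []  # renamed from `list` to avoid shadowing the built-in
with open('textdocument.txt') as f:
    for elm in f:
        # Each elm is a whole line, so split it into words before storing.
        words.extend(elm.split())

Once you have all the elements in a list, run a for loop that checks the length of each element and, if it is greater than three, keeps the element with its first letter capitalized:

new_list = []
for item in words:
    if len(item) > 3:
        # String methods return a new string, so append the result.
        new_list.append(item.title())
    else:
        # Words of three letters or fewer are added to the new list unchanged.
        new_list.append(item)

Then see whether that works.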

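【Discussion】:

stackoverflow.com/questions/1549641/… If my capitalization method doesn't work, try the methods mentioned there...
for elm in f will put each line of the text file into the list, not each word. Also, the indentation of the last line is a bit off.
Yeah, I didn't copy/paste; I wrote the code directly in the form, which usually doesn't turn out well.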
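
The lines-versus-words point raised in the comments can be checked with a small, self-contained sketch. Here io.StringIO stands in for an open text file (an assumption made so the example needs no file on disk), and the sample text is borrowed from the words.txt shown in Solution 3.

```python
import io

# io.StringIO stands in for an open text file so the example is self-contained.
f = io.StringIO("some more words\nan other line\n")

# Iterating over a file object yields whole lines, newline included.
lines = [elm for elm in f]

# Rewind, then split each line on whitespace to get individual words.
f.seek(0)
words = [word for elm in f for word in elm.split()]
```

After this runs, lines holds two strings that each end in a newline, while words holds the six individual words.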