How to perform data cleaning for a text file?

【Question title】: How to perform data cleaning for a text file?
【Posted】: 2022-01-17 07:45:33
【Question】:

I have a text file with many lines containing words and numbers. Here is a sample:

2021-12-06 05:07:09.266 INFO: Additional  ID 1638301749791
2021-12-06 05:07:09.266 INFO: Found 
2021-12-06 05:07:09.267 INFO: ObjectStatus-ok factor 1163 factor five and six computed as it was before best weight ID 1638301749796
2021-12-06 05:07:09.267 INFO: disabled; computing power weight factor factor 19025.
2021-12-06 05:07:10.041 INFO: Wrote big factor 0.3568357342, Classificationfactortype-fail
2021-12-06 05:07:10.042 DEBUG: Duiu.0.0.2588650814
2021-12-06 05:07:10.743 INFO: Wrote .3254806495

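My question is: how can I keep only the lines that contain either of the specific words "Classificationfactortype-fail" or "ObjectStatus-ok", and delete all other lines? I want to save the result as a new text file in a directory.

Here is the code I wrote: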

ans = []

with open('test.txt') as rf:
    for line in rf:
        line = line.strip()
        if "Classificationfactortype-fail" in line or "ObjectStatus-ok" in line:
            ans.append(line)

with open('extracted_data.txt', 'w') as wf:
    for line in ans:
        wf.write(line)

【Comments】:

Does this answer your question? Python - Check If Word Is In A String
What exactly doesn't work with your code?
Does this answer your question? Does Python have a string 'contains' substring method?

Tags: python data-cleaning txt

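【Solution 1】:

If every line starts with a timestamp, str.startswith() won't work, because the keywords never appear at the beginning of a line.

You can do this instead: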

if "Classificationfactortype-fail" in line or "ObjectStatus-ok" in line:
    ans.append(line)

inside your first loop.
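For reference, the substring check above could be wrapped into a small end-to-end script along these lines. This is only a sketch: the function name filter_lines and the KEYWORDS tuple are made up for illustration, and the file names are the ones from the question.

```python
from pathlib import Path

# Keywords taken from the question; the names below are illustrative only.
KEYWORDS = ("Classificationfactortype-fail", "ObjectStatus-ok")


def filter_lines(src, dst, keywords=KEYWORDS):
    """Copy the lines of src that contain any keyword to dst; return how many were kept."""
    kept = 0
    with open(src) as rf, open(dst, "w") as wf:
        for line in rf:
            # Plain substring test; the line still carries its own trailing newline.
            if any(keyword in line for keyword in keywords):
                # Add a newline only when the last line of the file lacks one.
                wf.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept


if __name__ == "__main__":
    sample = Path("test.txt")  # input file name assumed from the question
    if sample.exists():
        print(filter_lines(sample, "extracted_data.txt"), "lines kept")
```

Because the lines are not stripped here, each kept line lands on its own line in the output file, and nothing has to be collected in a list first.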

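【Comments】:

Exactly. Also, your first statement should be with open('test.txt', 'r') as rf, and your for loop needs to be for line in rf.readlines().
@MarcoCouto Those are indeed good habits, but they aren't strictly necessary: the default mode of open() is already "r", and looping over a file object yields each line. But yes, it makes the code easier to understand.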
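A quick way to check both points is a sketch like the following. It writes a throwaway temporary file (an assumption made only so the snippet runs anywhere) and compares the two ways of reading it.

```python
import os
import tempfile

# Build a small throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path) as rf:        # no mode given, so the default applies
    default_mode = rf.mode
    by_iteration = [line for line in rf]

with open(path, "r") as rf:   # explicit mode and explicit readlines()
    by_readlines = rf.readlines()

os.remove(path)

print(default_mode)                  # r
print(by_iteration == by_readlines)  # True
```

The practical difference is that readlines() builds the whole list in memory at once, while iterating over the file object reads lines lazily, which matters for large log files.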
It writes everything on a single line in my new txt file. How can I make it write each entry on a separate line?
@MarcoCouto stupidpythonideas.blogspot.com/2013/06/…
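@nikki Marco Couto is right: unlike print(), file.write() does not append a newline character at the end, so you have to add it yourself, either in the first loop with ans.append(line + "\n") or when writing the new file with wf.write(line + "\n").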
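The difference can be seen in a small sketch like this one. It uses io.StringIO as an in-memory stand-in for the output file (an assumption made so nothing is written to disk), and the sample strings are made up.

```python
import io

lines = ["first match", "second match"]  # stripped lines, like the question's ans list

# write() emits exactly what it is given, so the lines run together.
joined = io.StringIO()
for line in lines:
    joined.write(line)

# Option 1: append the newline yourself.
separated = io.StringIO()
for line in lines:
    separated.write(line + "\n")

# Option 2: let print() add the newline by pointing it at the file object.
printed = io.StringIO()
for line in lines:
    print(line, file=printed)

print(repr(joined.getvalue()))     # 'first matchsecond match'
print(repr(separated.getvalue()))  # 'first match\nsecond match\n'
print(separated.getvalue() == printed.getvalue())  # True
```

With a real file, the same two options apply unchanged inside the with open('extracted_data.txt', 'w') as wf: block.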