Adjusting the arff file format for WEKA

【Question title】: adjust arff file format for WEKA
【Posted】: 2018-01-08 20:45:52
【Question description】:

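I want to preprocess a Weka arff file containing 2000 lines for an NLP project (sentiment analysis).

I would like code that simply adds a single quote at the beginning and end of each sentence. For example, here is a sample of my dataset: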

The Da Vinci Code is one of the most beautiful movies ive ever seen.,1
The Da Vinci Code is an * amazing * book, do not get me wrong.,1
then I turn on the light and the radio and enjoy my Da Vinci Code.,1
The Da Vinci Code was REALLY good.,1
i love da vinci code....,1

I want the output to be:

'The Da Vinci Code is one of the most beautiful movies ive ever seen.',1
'The Da Vinci Code is an * amazing * book, do not get me wrong.',1
'then I turn on the light and the radio and enjoy my Da Vinci Code.',1
'The Da Vinci Code was REALLY good.',1
'i love da vinci code....',1

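I only want to add a single quote at the beginning and end of each sentence (before the 1).

I would be grateful if you could help me do this.

Is there a tool that can do this instead of writing code?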

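【Question comments】:

Could you edit your question and provide information about what you are trying to accomplish and where it fails? Please also explain why C++ is specifically mentioned.

Tags: weka arff

【Solution 1】: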

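You can achieve this with regular expressions. Regular expressions are a powerful formalism for pattern matching in strings. Plenty of existing tools support them, so you can match and replace the text you want without writing any code yourself.

To match and replace with a regular expression (regexp), you need two parts:

Match: an expression that matches a string, or something within a string.
Substitution/Replace: specifies what to replace the match with.

Match: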

/([^\.]+)(\.+)(,1\s+)/g

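Group 1: matches any character except a literal dot, at least 1 character.
Group 2: matches only literal dots, at least 1 character.
Group 3: matches a literal comma, followed by a literal 1, then at least 1 whitespace character.
Regex flag g (global): replace every match, not just the first.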
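
To make the three groups concrete, here is a minimal sketch in Python (not part of the original answer) that runs the match pattern against one of the sample rows and prints what each group captures. The trailing newline in `sample` is an assumption about how the row appears in the file; group 3 needs it because it requires at least one whitespace character.

```python
import re

# The match pattern from the answer, written as a Python raw string.
PATTERN = re.compile(r"([^\.]+)(\.+)(,1\s+)")

# One row from the question, including its line ending.
sample = "i love da vinci code....,1\n"
match = PATTERN.search(sample)

# Each group corresponds to one parenthesized part of the pattern.
print(repr(match.group(1)))  # text before the run of dots
print(repr(match.group(2)))  # the run of dots itself
print(repr(match.group(3)))  # ",1" plus the trailing whitespace
```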

Substitution:

'$1$2'$3

Groups 1 and 2 are wrapped in single quotes, followed by group 3.
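
The match and substitution can be tried together with any regex engine. The sketch below uses Python's `re.sub`, where the replacement refers to groups as `\1`, `\2`, `\3` rather than `$1`, `$2`, `$3`. The helper name `quote_sentences` is made up for illustration, and the whole text is processed as one string so that the newline after each `,1` satisfies `\s+`.

```python
import re

# Same match pattern as in the answer.
PATTERN = re.compile(r"([^\.]+)(\.+)(,1\s+)")


def quote_sentences(text: str) -> str:
    """Wrap groups 1 and 2 in single quotes and keep group 3 unchanged."""
    # re.sub replaces every match, which corresponds to the g flag.
    return PATTERN.sub(r"'\1\2'\3", text)


sample = (
    "The Da Vinci Code was REALLY good.,1\n"
    "i love da vinci code....,1\n"
)
print(quote_sentences(sample), end="")
```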

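You can view an interactive version of the above match and substitution here.

You can now use that match and substitution with your favorite regex tool.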

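For example, with sed. Note that sed processes one line at a time with the trailing newline removed, so the `\s+` in group 3 is relaxed to `\s*` here; otherwise a line ending directly in `,1` would not match:

sed -i -E "s/([^\.]+)(\.+)(,1\s*)/'\1\2'\3/g" yourfile.txt

Or with Windows PowerShell, where Get-Content likewise yields each line without its newline:

(Get-Content yourfile.txt) -replace '([^\.]+)(\.+)(,1\s*)', '''$1$2''$3' | Out-File output.txt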

Other tools may use a different syntax. The match/substitution pattern provided can probably be optimized further.
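
As one possible direction for such an optimization, the sketch below avoids depending on dots or on the label being `1`: it splits each row at its last comma and quotes everything before it. This is not the answerer's method. The function name `quote_arff_line` is hypothetical, and the code assumes that each data row has the form `text,label`, that the label contains no comma, and that embedded single quotes should be escaped with a backslash.

```python
def quote_arff_line(line: str) -> str:
    """Quote the text field of one 'text,label' data row."""
    stripped = line.rstrip("\r\n")
    # Split at the LAST comma so commas inside the sentence are preserved.
    text, sep, label = stripped.rpartition(",")
    if not sep:
        return stripped  # no comma found: leave the row unchanged
    # Assumption: backslash escaping for backslashes and single quotes.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}',{label}"


rows = [
    "The Da Vinci Code is an * amazing * book, do not get me wrong.,1",
    "i love da vinci code....,1",
]
for row in rows:
    print(quote_arff_line(row))
```

Only the data rows should be passed through such a function; header lines such as `@relation` or `@attribute` would need to be skipped.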

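【Comments】:

Actually, I did not fully understand what you just said, because I am not familiar with regular expressions. Could you do it for me?
@A.Atiyah I expanded my answer with some basic information about RegExp and included a second example of how to use it (with PowerShell).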