Fast shell command to remove stop words in a text file: answers

【Question title】: Fast shell command to remove stop words in a text file
【Posted】: 2015-10-13 10:38:29
【Question】:

I have a 2GB text file. I am trying to remove frequently occurring English stop words from this file.

I have a stopwords.txt containing entries like this:

a
an
the
for
and
I

What is a fast way to do this using shell commands such as tr, sed, or awk?

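【Question comments】:

Do you mean fastest to type, or fastest to execute?
Is it to be executed on big data?
This sounds like a bad idea: why would you want a 2GB copy containing unreadable text? If you are planning to do information retrieval, you will need to preprocess (tokenize, stem) and index the text anyway, so why not skip the stop words at that later stage?
Where did they say anything about information retrieval?
@Dan They didn't; I just think it's a likely scenario. But what I said also applies to any other NLP task I can think of that requires removing stop words.

Tags: shell nlp text-processing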

【Solution 1】:

Here is an approach using the command line and perl:

Save the text below as replacesw.sh:

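#!/bin/bash
# Usage: ./replacesw.sh stopwords.txt input.txt
# Join the stop words (one per line) into a single regex: \b(a|an|the|...)\b
MYREGEX="\\b($(perl -pe 's/\n/|/g' "$1"))\\b"
# Delete every whole-word match from the input text and print the result
perl -pe "s/$MYREGEX//g" "$2"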

Then, if you save the stop word list from the question as stopwords.txt and have another file called testtext.txt (for example) containing:

This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.

then running the following at the command line will remove the stop words:

KBs-MBP13:temp kbenoit$ ./replacesw.sh stopwords.txt testtext.txt 
This is  file with  stopwords from  stopwords.txt  testing.
More than one line in  file,   better test.

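You may need to run chmod u+x replacesw.sh first.

【Comments】:

You need to have perl installed, and the syntax will differ on Windows platforms.
FYI, I'm using Mac OSX. I checked the perl installation :)
Could you post the error message?
Perl is installed, of course. There are no errors; it prints the contents of the input file without replacing any words. :) I ran ./replaceSW.sh stopwords.txt input.txt and the output was: how about i decide to look at it afterwards what across do you think is it a good idea to go out and about i think id rather go up and above. You can clearly see that stop words such as to, is, i... were not removed.
Strange. I just re-ran it, and it definitely works for me.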
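
Since the question also mentions awk, here is a minimal sketch of a different technique: a hash lookup per token instead of one large regex alternation. This is not the answerer's method. It assumes one stop word per line, whitespace-separated tokens, and exact case-sensitive matches, so a token such as "file," (with the comma attached) is compared as a whole. The function name remove_stopwords is a hypothetical helper, and no timings were measured for either approach.

```shell
#!/bin/bash
# Sketch: remove stop words with an awk hash lookup.
# Assumptions: one stop word per line, whitespace-separated tokens,
# case-sensitive exact match. remove_stopwords is a hypothetical name.
remove_stopwords() {
    # $1 = stop word list, $2 = text file
    awk '
        NR == FNR { stop[$1] = 1; next }   # first file: load the stop words
        {
            out = ""
            for (i = 1; i <= NF; i++)      # second file: keep non-stop tokens
                if (!($i in stop))
                    out = out (out == "" ? "" : " ") $i
            print out
        }
    ' "$1" "$2"
}
```

Usage would be remove_stopwords stopwords.txt testtext.txt. Unlike the perl script, this variant rejoins the remaining tokens with single spaces, so the original spacing is not preserved.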