【Question Title】: Perl: remove stop words from a string
【Posted】: 2015-01-05 08:55:42
【Question】:

I am using the script below to remove stop words in Perl. I am running it on Windows, where I could not find compatible versions of:

Lingua::EN::StopWordList
Lingua::StopWords qw(getStopWords)

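I have an array of stop words, but as soon as I apply the regex below I lose essential whitespace, so neighbouring words run together. Note that every entry in the stop-word array is padded with two spaces, one on the left and one on the right.

How can I remove the stop words efficiently without losing that essential whitespace?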

use strict;
use warnings;
use utf8;
use IO::File;
use String::Util 'trim';

my $inFile = "C:\\Users\\David\\Downloads\\InfoRet\\Explore the ways to get better grades.txt";
my $inFh = new IO::File $inFile, "r";
my $lineNum = 0;
my $line = undef;
my $loc = undef;
my $str = undef;

my @stopList = (" the ", " a ", " an ", " of ", " and ", " on ", " in ", " by ", " with ", " at ", " after ", " into ", " their ", " is ",  " that ", " they ", " for ", " to ", " it ", " them ", " which ");

for(my $i = 1; $i <= 4; $i++) {
    <$inFh>
}

while($line = <$inFh>) {
    $lineNum++;
    chomp $line;
    $line =~ s/[\$#@~!&*()\[\];.,:?^`\\\/]+//g;

    for my $planet (@stopList) {
        $loc = index($line, $planet);
        if($loc!=(-1)) {
            #$line =~ s/$str//g;
            $line =~ s/$planet//g;
        }
    }
    print "$line\n";
}

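【Comments】:

  • One idea is to not remove the spaces at all. Instead of looping over the stop list, build a hash with the stop words as keys and "" as their values, then run s#(\w+)# $hash{ lc($1) } // $1 #ge. Note that the /e modifier is needed so the replacement is evaluated as code, and that you must use defined-or (//) because "" is a false value. Also note that you have to strip the padding spaces from the stop-word list.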
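The hash-based idea in this comment can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name `strip_stop_words` and the shortened stop list are made up for the example, and the defined-or operator `//` needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # for the defined-or operator //

# Stop words WITHOUT padding spaces; each maps to the empty string.
my %stop = map { $_ => "" } qw(the a an of and on in by with at to);

# Hypothetical helper: every run of word characters is looked up in %stop.
# A stop word is replaced by "", anything else is put back unchanged.
# Whitespace is never part of a match, so it is never deleted.
sub strip_stop_words {
    my ($text) = @_;
    $text =~ s{(\w+)}{ $stop{ lc $1 } // $1 }ge;
    return $text;
}

print strip_stop_words("Explore the ways to get better grades"), "\n";
```

Because only whole `\w+` runs are looked up, a word such as "another" is left alone even though it starts with "an". The gaps left by removed words remain as extra spaces, which can be squeezed afterwards if that matters.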

Tags: regex perl stop-words


【Solution 1】:
my @stopList = ("the", "a", "an", "of", ..);
my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b/, @stopList;

Later,

$line =~ s/$rx//g;
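For reference, here is a self-contained sketch of the same word-boundary approach, with an extra whitespace clean-up step that the answer itself does not include. The helper name `remove_stop_words`, the shortened stop list, and the case-insensitive `/i` flag are assumptions made for this example. Matching on `\b` instead of padded spaces is what avoids the original problem: the spaces around a stop word are not part of the match, so neighbouring words cannot be glued together.

```perl
use strict;
use warnings;

# Stop words without padding spaces (shortened list for illustration).
my @stopList = qw(the a an of and on in by with at to);

# One alternation for the whole list; quotemeta escapes any regex
# metacharacters, and \b restricts matches to whole words.
my $alternation = join '|', map { quotemeta($_) } @stopList;
my $rx = qr/\b(?:$alternation)\b/i;

# Hypothetical helper: remove stop words, then tidy the leftover gaps.
sub remove_stop_words {
    my ($line) = @_;
    $line =~ s/$rx//g;          # drop the stop words themselves
    $line =~ s/\s{2,}/ /g;      # collapse the runs of whitespace they leave
    $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $line;
}

print remove_stop_words("Explore the ways to get better grades"), "\n";
```

Compiling the alternation once with `qr//` and reusing it for every line avoids rebuilding the pattern inside the read loop.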

【Discussion】:
