【Question title】: Perl word stemming of English text
【Posted】: 2019-10-22 15:21:14
【Question】:

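I am trying to stem English text. I have read a lot of forum posts, but I cannot find a clear example. I want to use the Porter stemmer, for example through Text::English. This is how far I have got: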

    use Lingua::StopWords qw(getStopWords);
    my $stopwords = getStopWords('en');
    use Text::English;

    @stopwords = grep { $stopwords->{$_} } (keys %$stopwords);

    chdir("c:/Test Facility/input");
    @files = <*>;

    foreach $file (@files)
      {
        open (input, $file);

        while (<input>)
          {
            open (output,">>c:/Test Facility/normalized/".$file);
            chomp;
            for my $w (@stopwords)
              {
                s/\b\Q$w\E\b//ig;
              }
            $_ =~s/<[^>]*>//g;
            $_ =~ s/[[:punct:]]//g;
            ##What should I write here to apply porter stemming using Text::English##
            print output "$_\n";

          }

      }
    close (input);
    close (output);

Tags: perl file porter-stemmer


【Solution 1】:

Run the code below like this:

    perl stemmer.pl /usr/lib/jvm/java-6-sun-1.6.0.26/jre/LICENSE

It produces output similar to the following:

    operat system distributor licens java version sun microsystems inc sun willing to license java platform standard edition developer kit jdk

Note that, in addition to stopwords, strings of length 1 and numeric values are removed.
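
To make that filtering rule concrete, here is a minimal sketch of just that step, using core modules only. The tiny `%stopwords` hash and the `keep_token` helper are stand-ins I made up for illustration; the script itself takes the full English list from Lingua::StopWords.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical, tiny stopword set for illustration only;
# the real script loads the full English list from Lingua::StopWords.
my %stopwords = map { $_ => 1 } qw(the to of and is);

# True if a token survives all three filters:
# at least 2 characters, not numeric, not a stopword.
sub keep_token {
    my ($token) = @_;
    return length($token) >= 2
        && !looks_like_number($token)
        && !exists $stopwords{ lc $token };
}

# Split on runs of non-word characters and underscores, as the script does.
my @tokens = split /[\W_]+/, 'Java 6 is the 2nd version of a JDK';
my @kept   = grep { keep_token($_) } @tokens;

print "@kept\n";    # prints: Java 2nd version JDK
```

Note that `2nd` is kept: `looks_like_number` only rejects strings that Perl would treat as a number in full, so mixed tokens pass through to the stemmer.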

    #!/usr/bin/env perl
    use common::sense;

    use Encode;
    use Lingua::Stem::Snowball;
    use Lingua::StopWords qw(getStopWords);
    use Scalar::Util qw(looks_like_number);

    # Snowball stemmer for English
    my $stemmer = Lingua::Stem::Snowball->new(
        encoding    => 'UTF-8',
        lang        => 'en',
    );

    # Stopword lookup table: lowercased word => 1
    my %stopwords = map {
        lc($_) => 1
    } keys %{getStopWords(en => 'UTF-8')};

    # Print all stems on one line, separated by spaces
    local $, = ' ';
    say map {
        sub {
            # Split the input line into words, then drop single characters,
            # numbers and stopwords
            my @w =
                map {
                    encode_utf8 $_
                } grep {
                    length >= 2
                    and not looks_like_number($_)
                    and not exists $stopwords{lc($_)}
                } split
                    /[\W_]+/x,
                    shift;

            # Replace each remaining word with its stem
            $stemmer->stem_in_place(\@w);

            map {
                lc decode_utf8 $_
            } @w
        }->($_);
    } <>;
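
If you want to keep the per-file loop from the question, the same steps could be arranged roughly as follows. This is a sketch under assumptions, not a tested drop-in: `normalize_line` and `normalize_dir` are names I made up, the input and output directories come from the command line instead of being hard-coded, and stemming again goes through Lingua::Stem::Snowball rather than Text::English.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Clean one line: strip tags, split into words, drop stopwords, stem the rest.
# $stopwords is a hash ref keyed by lowercase stopword; $stem is a code ref
# that takes a list of lowercase words and returns their stems.
sub normalize_line {
    my ($line, $stopwords, $stem) = @_;
    $line =~ s/<[^>]*>/ /g;    # tags become spaces
    # Splitting on non-word characters also discards punctuation.
    my @words = grep { length($_) && !$stopwords->{$_} }
                map  { lc } split /[\W_]+/, $line;
    return join ' ', $stem->(@words);
}

# Write a normalized copy of every plain file in $in_dir to $out_dir.
sub normalize_dir {
    my ($in_dir, $out_dir) = @_;

    # Loaded here so normalize_line stays usable without the CPAN modules.
    require Lingua::StopWords;
    require Lingua::Stem::Snowball;
    my $stopwords = Lingua::StopWords::getStopWords('en');
    my $stemmer   = Lingua::Stem::Snowball->new(lang => 'en');
    my $stem      = sub { my @w = @_; $stemmer->stem_in_place(\@w); return @w };

    opendir my $dh, $in_dir or die "Cannot read $in_dir: $!";
    for my $name (grep { -f "$in_dir/$_" } readdir $dh) {
        # Open each file once, outside the line loop.
        open my $in,  '<', "$in_dir/$name"  or die "Cannot open $name: $!";
        open my $out, '>', "$out_dir/$name" or die "Cannot write $name: $!";
        while (my $line = <$in>) {
            chomp $line;
            print {$out} normalize_line($line, $stopwords, $stem), "\n";
        }
        close $in;
        close $out or die "Cannot close $name: $!";
    }
    closedir $dh;
}

# Only runs when both directories are given on the command line.
normalize_dir(@ARGV) if @ARGV == 2;
```

You would call it with the two directories from the question, for example `perl normalize.pl "c:/Test Facility/input" "c:/Test Facility/normalized"` (the script name is hypothetical). Each output file is opened once with `>` rather than with `>>` inside the line loop, so re-running the script does not append duplicate lines. As far as I recall, Text::English exposes `Text::English::stem(@words)`, which should slot into the `$stem` code ref, but check its POD before relying on that.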
    
