Rabin Karp Implementation too slow in Ruby

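【Question title】: Rabin Karp Implementation too slow in Ruby
【Posted】: 2011-12-30 17:47:16
【Question description】:

I have been working on a small plagiarism-detection engine that uses ideas from MOSS. I need a rolling hash function, and I took my inspiration from the Rabin-Karp algorithm.

The code I wrote: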

#!/usr/bin/env ruby
#Designing a rolling hash function.
#Inspired from the Rabin-Karp Algorithm

module Myth
  module Hasher

    #Defining a Hash Structure
    #A hash is an integer value + the line number where the word for this hash appeared in the source file
    Struct.new('Hash',:value,:line_number)

    #For hashing a piece of text we need two parameters
    #k-->For building hashes in units of k-grams
    #q-->Prime which lets calculations stay within range
    def calc_hash(text_to_process,k,q)

      text_length=text_to_process.length
      radix=26

      highorder=(radix**(text_length-1))%q

      #Individual hashes for k-grams
      text_hash=0

      #The entire k-grams hashes list for the document
      text_hash_string=""

      #Preprocessing
      for c in 0...k do
        text_hash=(radix*text_hash+text_to_process[c].ord)%q
      end

      text_hash_string << text_hash.to_s << " "

      loop=text_length-k

      for c in 0..loop do        
        puts text_hash
        text_hash=(radix*(text_hash-text_to_process[c].ord*highorder)+(text_hash[c+k].ord))%q
        text_hash_string << text_hash_string << " "
      end
    end

  end
end

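I am running it with the values calc_hash(text,5,101), where text is the input string.

The code is very slow. Where am I going wrong?

【Question discussion】:

Where is the bottleneck (or bottlenecks)? Which piece of code takes the most CPU time? Please run at least some simple tests; "too slow" is too vague.
I have been trying to implement the algorithm based on the text "Introduction to Algorithms". The main loop, which computes each hash from the previously computed hash, is slow. How can I profile it further?
As a coding-style suggestion, use spaces between operators and values. They have no effect on running speed, and they make your code easier to maintain.

Tags: ruby algorithm plagiarism-detection rabin-karp

【Solution 1】: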

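Take a look at Ruby-Prof, a profiler for Ruby. Install it with gem install ruby-prof.

Once you have some idea of where the code is lagging, you can use Ruby's Benchmark to try different algorithms and find the fastest one.

Look around on StackOverflow and you'll see lots of places where we use Benchmark to test various approaches and see which is fastest. You'll also pick up different ways of setting up the tests.

For example, looking at your code, I wasn't sure whether appending with << is better than concatenating with + or using string interpolation. Here is the code to test that, and the results: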

require 'benchmark'
include Benchmark

n = 1_000_000
bm(13) do |x|
  x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
  x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar      } }
  x.report("append")      { n.times { foo = "foo"; bar = "bar"; foo << bar     } }
end

ruby test.rb; ruby test.rb
                   user     system      total        real
interpolate    1.090000   0.000000   1.090000 (  1.093071)
concatenate    0.860000   0.010000   0.870000 (  0.865982)
append         0.750000   0.000000   0.750000 (  0.753016)
                   user     system      total        real
interpolate    1.080000   0.000000   1.080000 (  1.085537)
concatenate    0.870000   0.000000   0.870000 (  0.864697)
append         0.750000   0.000000   0.750000 (  0.750866)

Based on @Myth17's comment below, I wondered what effect using string literals versus variables has when appending strings:

require 'benchmark'
include Benchmark

n = 1_000_000
bm(13) do |x|
  x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
  x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar      } }
  x.report("append")      { n.times { foo = "foo"; bar = "bar"; foo << bar     } }
  x.report("append2")     { n.times { foo = "foo"; bar = "bar"; "foo" << bar   } }
  x.report("append3")     { n.times { foo = "foo"; bar = "bar"; "foo" << "bar" } }
end

Which results in:

ruby test.rb;ruby test.rb

                   user     system      total        real
interpolate    1.330000   0.000000   1.330000 (  1.326833)
concatenate    1.080000   0.000000   1.080000 (  1.084989)
append         0.940000   0.010000   0.950000 (  0.937635)
append2        1.160000   0.000000   1.160000 (  1.165974)
append3        1.400000   0.000000   1.400000 (  1.397616)

                   user     system      total        real
interpolate    1.320000   0.000000   1.320000 (  1.325286)
concatenate    1.100000   0.000000   1.100000 (  1.090585)
append         0.930000   0.000000   0.930000 (  0.936956)
append2        1.160000   0.000000   1.160000 (  1.157424)
append3        1.390000   0.000000   1.390000 (  1.392742)

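These values differ from my earlier test because this code was run on my laptop.

Appending two variables is faster than when a string literal is involved, because of overhead: Ruby has to create an intermediate string for the literal and then append to it.

The big lesson here is that we can make more informed decisions when writing code, because we know what runs faster. At the same time, the differences are not very big, since most code doesn't run 1,000,000 loops. Your mileage may vary.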
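
The same Benchmark approach can be pointed at the string handling in the question's loop. What follows is only a sketch under stated assumptions, not the asker's code: self_append and value_append are hypothetical helpers. The first mirrors the line text_hash_string << text_hash_string << " " from the question, which appends the buffer to itself and so roughly doubles its length on every pass. The second appends only the new value.

```ruby
require 'benchmark'

# Hypothetical helper: mirrors `text_hash_string << text_hash_string << " "`
# from the question. The buffer is appended to itself, so its length
# roughly doubles on every iteration.
def self_append(iterations)
  buf = String.new("42 ")
  iterations.times { buf << buf << " " }
  buf
end

# Hypothetical helper: appends only the new value, so the buffer grows
# by a few characters per iteration.
def value_append(iterations)
  buf = String.new("42 ")
  iterations.times { |i| buf << i.to_s << " " }
  buf
end

n = 20 # kept small on purpose: self_append already builds a ~4 MB string here
Benchmark.bm(13) do |x|
  x.report("self append")  { self_append(n) }
  x.report("value append") { value_append(n) }
end

puts "self append length:  #{self_append(n).length}"
puts "value append length: #{value_append(n).length}"
```

Timings will vary by machine, but the printed lengths show the difference in growth. With one iteration per k-gram of a real document, the self-appending buffer would likely exhaust memory long before the loop finished.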

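【Discussion】:

Thanks. I didn't know about any of this. Will take a look. :)
Append also doesn't create a new copy of the string in memory when concatenating.
Correct. It modifies the receiver, which can have unexpected side effects or, put another way, cause bugs if you aren't expecting it.
It was the string operations that made the algorithm slow. :|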
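
For reference, here is a minimal sketch of how the rolling hash from the question might look with the string handling removed. It is not the asker's final code, and it makes several assumptions: the hashes are returned as an array of integers instead of being accumulated into a string, the high-order factor is computed from k (as in the textbook Rabin-Karp recurrence) rather than from the text length, the incoming character is read from the text rather than from the hash value, and radix: is a hypothetical keyword parameter added for illustration. The line-number Struct from the question is left out.

```ruby
module Myth
  module Hasher
    module_function

    # Returns the hash of every k-gram in text, in order, as an array of integers.
    # q is the prime modulus; radix: is a hypothetical parameter (the question hard-codes 26).
    def calc_hash(text, k, q, radix: 26)
      return [] if k <= 0 || text.length < k

      codes = text.codepoints
      # Weight of the leading character in a k-character window: radix^(k-1) mod q.
      highorder = radix.pow(k - 1, q)

      # Hash of the first window, built with Horner's rule.
      hash = 0
      k.times { |i| hash = (radix * hash + codes[i]) % q }
      hashes = [hash]

      # Slide the window: drop codes[i - k], bring in codes[i].
      # Ruby's % returns a non-negative result for a positive q.
      (k...codes.length).each do |i|
        hash = (radix * (hash - codes[i - k] * highorder) + codes[i]) % q
        hashes << hash
      end

      hashes
    end
  end
end

# Example call with the same k and q as in the question.
puts Myth::Hasher.calc_hash("hello world", 5, 101).join(" ")
```

Collecting the values in an array and joining once at the end keeps the per-iteration work constant, which is the property the rolling hash is meant to provide.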