Here are two more ways of finding a duplicate.
Use a set
require 'set'

def find_a_dup_using_set(arr)
  s = Set.new
  arr.find { |e| !s.add?(e) }
end

find_a_dup_using_set arr
  #=> "hello"
Use select in place of find to return an array of all duplicates.
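As a small sketch of the select variant (the method name find_all_dups_using_set and the sample arrays are made up for illustration): an element that occurs three times is reported twice, so uniq can be applied if each duplicate should appear only once.

```ruby
require 'set'

# Hypothetical helper: like find_a_dup_using_set, but collects every repeat.
def find_all_dups_using_set(arr)
  seen = Set.new
  # Set#add? returns nil when the element is already in the set,
  # so the block is true for the second and later occurrences.
  arr.select { |e| !seen.add?(e) }
end

sample = %w[a b a c b a]
find_all_dups_using_set(sample)       #=> ["a", "b", "a"]
find_all_dups_using_set(sample).uniq  #=> ["a", "b"]
find_all_dups_using_set(%w[x y z])    #=> []
```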
Use Array#difference
class Array
  def difference(other)
    h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
    reject { |e| h[e] > 0 && h[e] -= 1 }
  end
end

def find_a_dup_using_difference(arr)
  arr.difference(arr.uniq).first
end

find_a_dup_using_difference arr
  #=> "hello"
Drop .first to return an array of all duplicates.
Both methods return nil if there are no duplicates.
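To illustrate the difference-based approach without reopening Array, here is a minimal sketch. Ruby 2.6 and later ship a built-in Array#difference with set semantics (it removes every occurrence), so the sketch uses a hypothetical standalone method, multiset_difference, instead. The method names and sample data are assumptions for illustration only.

```ruby
# Hypothetical standalone version of the difference logic above:
# removes one occurrence from arr for each occurrence in other.
def multiset_difference(arr, other)
  counts = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  # While an element still has a positive count, reject it and decrement.
  arr.reject { |e| counts[e] > 0 && (counts[e] -= 1) }
end

# Hypothetical helper: every occurrence beyond the first of each element.
def find_all_dups_using_difference(arr)
  multiset_difference(arr, arr.uniq)
end

sample = %w[a b a c b a]
find_all_dups_using_difference(sample)           #=> ["a", "b", "a"]
find_all_dups_using_difference(sample).first     #=> "a"
find_all_dups_using_difference(%w[x y z])        #=> []
find_all_dups_using_difference(%w[x y z]).first  #=> nil
```

With no duplicates the array form is empty, which is why taking .first yields nil.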
I have proposed that Array#difference be added to the Ruby core. More information is in my answer here.
Benchmark
Let's compare the suggested methods. First, we need an array for testing:
CAPS = ('AAA'..'ZZZ').to_a.first(10_000)

def test_array(nelements, ndups)
  arr = CAPS[0, nelements-ndups]
  arr = arr.concat(arr[0,ndups]).shuffle
end
and a method to run the benchmarks for different test arrays:
require 'fruity'

def benchmark(nelements, ndups)
  arr = test_array nelements, ndups
  puts "\n#{ndups} duplicates\n"
  compare(
    Naveed:    -> {arr.detect{|e| arr.count(e) > 1}},
    Sergio:    -> {(arr.inject(Hash.new(0)) {|h,e| h[e] += 1; h}.find {|k,v| v > 1} ||
                   [nil]).first },
    Ryan:      -> {(arr.group_by{|e| e}.find {|k,v| v.size > 1} ||
                   [nil]).first},
    Chris:     -> {arr.detect {|e| arr.rindex(e) != arr.index(e)} },
    Cary_set:  -> {find_a_dup_using_set(arr)},
    Cary_diff: -> {find_a_dup_using_difference(arr)}
  )
end
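I did not include @JjP's answer because only one duplicate is to be returned, and when that answer is modified to do so it is the same as @Naveed's earlier answer. Nor did I include @Marin's answer, which was posted before @Naveed's but returns all duplicates rather than just one (a minor point, but there is no need to evaluate both, since they are identical when only one duplicate is returned).
I also modified the other answers that returned all duplicates so that they return only the first one found, but that should have essentially no effect on performance, because they compute all duplicates before selecting one.
The results for each benchmark are listed from fastest to slowest.
First suppose the array contains 100 elements: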
benchmark(100, 0)
0 duplicates
Running each test 64 times. Test will take about 2 seconds.
Cary_set is similar to Cary_diff
Cary_diff is similar to Ryan
Ryan is similar to Sergio
Sergio is faster than Chris by 4x ± 1.0
Chris is faster than Naveed by 2x ± 1.0
benchmark(100, 1)
1 duplicates
Running each test 128 times. Test will take about 2 seconds.
Cary_set is similar to Cary_diff
Cary_diff is faster than Ryan by 2x ± 1.0
Ryan is similar to Sergio
Sergio is faster than Chris by 2x ± 1.0
Chris is faster than Naveed by 2x ± 1.0
benchmark(100, 10)
10 duplicates
Running each test 1024 times. Test will take about 3 seconds.
Chris is faster than Naveed by 2x ± 1.0
Naveed is faster than Cary_diff by 2x ± 1.0 (results differ: AAC vs AAF)
Cary_diff is similar to Cary_set
Cary_set is faster than Sergio by 3x ± 1.0 (results differ: AAF vs AAC)
Sergio is similar to Ryan
Now consider an array with 10,000 elements:
benchmark(10000, 0)
0 duplicates
Running each test once. Test will take about 4 minutes.
Ryan is similar to Sergio
Sergio is similar to Cary_set
Cary_set is similar to Cary_diff
Cary_diff is faster than Chris by 400x ± 100.0
Chris is faster than Naveed by 3x ± 0.1
benchmark(10000, 1)
1 duplicates
Running each test once. Test will take about 1 second.
Cary_set is similar to Cary_diff
Cary_diff is similar to Sergio
Sergio is similar to Ryan
Ryan is faster than Chris by 2x ± 1.0
Chris is faster than Naveed by 2x ± 1.0
benchmark(10000, 10)
10 duplicates
Running each test once. Test will take about 11 seconds.
Cary_set is similar to Cary_diff
Cary_diff is faster than Sergio by 3x ± 1.0 (results differ: AAE vs AAA)
Sergio is similar to Ryan
Ryan is faster than Chris by 20x ± 10.0
Chris is faster than Naveed by 3x ± 1.0
benchmark(10000, 100)
100 duplicates
Cary_set is similar to Cary_diff
Cary_diff is faster than Sergio by 11x ± 10.0 (results differ: ADG vs ACL)
Sergio is similar to Ryan
Ryan is similar to Chris
Chris is faster than Naveed by 3x ± 1.0
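Note that find_a_dup_using_difference(arr) would be much more efficient if Array#difference were implemented in C, which would be the case if it were added to the Ruby core.
Conclusion
Many of the answers are reasonable, but using a Set is the clear best choice. It is fastest in the medium-hard cases, joint fastest in the hardest, and beaten only in computationally trivial cases, where your choice won't matter anyway.
The one very special case in which you might pick Chris's solution is if you want to use the method to separately de-duplicate thousands of small arrays and expect to find a duplicate typically within fewer than 10 items. That would be slightly faster, because it avoids the small additional overhead of creating the Set.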
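For that special case, a minimal sketch of applying Chris's index/rindex approach to many small arrays might look like the following. The wrapper name find_a_dup_using_index and the sample arrays are made up for illustration.

```ruby
# Chris's approach from the benchmark, wrapped in a hypothetical helper:
# an element is a duplicate if its first and last positions differ.
def find_a_dup_using_index(arr)
  arr.detect { |e| arr.rindex(e) != arr.index(e) }
end

small_arrays = [%w[a b a], %w[x y z], [1, 2, 3, 2]]
small_arrays.map { |a| find_a_dup_using_index(a) }  #=> ["a", nil, 2]
```

No Set is allocated per array, but each element triggers two scans of the array, which is why this approach falls behind as arrays grow.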