Find a common pattern among a set of URLs

【Question title】: Find a common pattern among a set of URLs
【Posted】: 2017-01-13 17:53:07
【Question description】:

I have a file of URLs. The file looks like this:

http://www.example.com/images/1
http://www.example.com/images/2
.
.
.
http://www.example.com/images/2000
http://www.example.org/p/q/r/1/s/t
http://www.example.org/p/q/r/2/s/t
http://www.example.org/p/q/r/3/s/t
.
.
.
http://www.example.org/p/q/r/5000/s/t

and so on. The URLs are not sorted; I only ordered them here to make the explanation clear.

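I have to process these URLs so that, if URLs differ in exactly one word (a word being the text between two slashes) and there are more than 1000 such occurrences, I replace that word with *.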

For example, for the file above I would get:

http://www.example.com/images/*
http://www.example.org/p/q/r/*/s/t

The file is hundreds of GB in size. Can anyone help me solve this?

【Comments】:

Are these files stored in S3?
Yes. I am even open to a MapReduce solution.

Tags: amazon-web-services url mapreduce distributed-computing

【Solution 1】:

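We proposed a MapReduce algorithm for this problem in this paper (end of page 3). It is a parallel adaptation of the sequential "infix extraction" algorithm introduced in another paper (page 4, Algorithm 1).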

Here I quote the sequential algorithm:

1. sort URIs
2. tokenize URIs at all special characters
3. cluster URIs according to the first n tokens 
4. for all clusters do
5.   for all URIs in the cluster do
6.     for all possible prefixes do
7.       find the set of (distinct) next tokens T
8.     end for
9.   end for
10.   for all URIs in the cluster do
11.     set as a prefix the one with the largest |T|
12.     set as infix the substring following the prefix
13.   end for
14. end for
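
The quoted steps could be sketched in Python roughly as follows. This is only an illustrative reading of the pseudocode, not the code from either paper: the names `tokenize` and `extract_infixes`, the default `n=4`, the choice to split at every non-alphanumeric character, and the tie-breaking rule (the shortest prefix wins) are all assumptions made for this example.

```python
import re
from collections import defaultdict

def tokenize(uri):
    # Step 2: split at every non-alphanumeric character, dropping empty tokens.
    return [t for t in re.split(r"[^A-Za-z0-9]+", uri) if t]

def extract_infixes(uris, n=4):
    # Steps 1-3: sort, tokenize, and cluster by the first n tokens.
    clusters = defaultdict(list)
    for uri in sorted(uris):
        tokens = tokenize(uri)
        clusters[tuple(tokens[:n])].append((uri, tokens))

    result = {}
    for members in clusters.values():
        # Steps 5-9: for every prefix, collect the distinct next tokens T.
        next_tokens = defaultdict(set)
        for _, tokens in members:
            for i in range(1, len(tokens)):
                next_tokens[tuple(tokens[:i])].add(tokens[i])
        # Steps 10-13: pick the prefix with the largest |T|;
        # everything after that prefix is treated as the infix.
        for uri, tokens in members:
            cut = max(
                range(1, len(tokens)),
                key=lambda i: len(next_tokens[tuple(tokens[:i])]),
                default=len(tokens),  # single-token URI: no infix
            )
            result[uri] = (tokens[:cut], tokens[cut:])
    return result
```

Called on a handful of URLs shaped like the ones in the question, `extract_infixes` maps each URL to a `(prefix_tokens, infix_tokens)` pair, with the varying number at the start of the infix. Everything is held in memory, so this only illustrates the logic on small inputs.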

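The main idea of the parallel version is to create the clusters (step 3) by using the second token of each URI (the one after http://) as the map output key, and then to perform something similar to steps 4-14 in each reducer, i.e. within each cluster. The source code of our parallel version can be found here.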
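
To make the map/reduce split concrete, here is a small single-process simulation. It is a sketch under assumptions, not the linked source code: `map_uri`, `run_job`, and the pluggable `reduce_cluster` callback are hypothetical names, and the tokenizer (split at every non-alphanumeric character) is assumed.

```python
import re
from collections import defaultdict

def map_uri(uri):
    # Map phase: emit (second token, URI). Under this tokenization the
    # first token is the scheme ("http"), so the key is the token after it.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", uri) if t]
    key = tokens[1] if len(tokens) > 1 else ""
    yield key, uri

def run_job(uris, reduce_cluster):
    # Shuffle phase (simulated): group the mapped values by key.
    groups = defaultdict(list)
    for uri in uris:
        for key, value in map_uri(uri):
            groups[key].append(value)
    # Reduce phase: each key's cluster is processed independently, which is
    # where the per-cluster work (steps 4-14) would run in a real job.
    return {key: reduce_cluster(key, values) for key, values in groups.items()}
```

For instance, `run_job(uris, lambda key, values: len(values))` returns the cluster sizes. Note that with this literal tokenization every URL in the question has the key `www`, so a real job would probably need a more selective key (for example the whole host) to spread the load across reducers.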

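Once the "infix" of each URI has been extracted, you can easily replace it with any character you like, e.g. '*'. Keep in mind that this is an expensive procedure; running it in MapReduce only makes sense if you have millions of URIs or more, which you apparently do.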
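
For small inputs, the rule from the question (replace a word with * when more than 1000 distinct values occur at that position) can also be checked directly, without the infix machinery. The sketch below is a simplified in-memory illustration, not the algorithm from the papers; the function name `wildcard_patterns` and its `threshold` parameter are made up for this example.

```python
from collections import defaultdict

def wildcard_patterns(urls, threshold=1000):
    # For each URL and each path position, build the pattern obtained by
    # replacing that one word with "*", and remember which words it hid.
    variants = defaultdict(set)
    for url in urls:
        scheme, sep, rest = url.partition("://")
        parts = rest.split("/")  # parts[0] is the host and is never replaced
        for i in range(1, len(parts)):
            pattern = scheme + sep + "/".join(parts[:i] + ["*"] + parts[i + 1:])
            variants[pattern].add(parts[i])
    # Keep only patterns that cover more than `threshold` distinct words.
    return sorted(p for p, words in variants.items() if len(words) > threshold)
```

This keeps one set of words per candidate pattern, so memory grows with the number of URLs times the path depth. That is fine for a sample of the data but not for hundreds of GB, which is the case the clustered MapReduce approach is meant for.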
