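[Posted]: 2019-11-20 20:38:38
[Question]:
I have a large dataset of URLs from different domains. I have to process them with MapReduce so that URLs with a similar pattern are grouped together. For example: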
http://www.agricorner.com/price/onion-prices/
http://www.agricorner.com/price/potato-prices/
http://www.agricorner.com/price/ladyfinder-prices/
http://www.agricorner.com/tag/story/story-1.html
http://www.agricorner.com/tag/story/story-11.html
http://www.agricorner.com/tag/story/story-41.html
https://agrihunt.com/author/ramzan/page/3/
https://agrihunt.com/author/shahban/page/5/
https://agrihunt.com/author/Sufer/page/3/
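I want to group these URLs by pattern in the reducer phase of MapReduce, so that URLs with a similar pattern end up in the same group. The expected output could look like this: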
group1, http://www.agricorner.com/price/onion-prices/, http://www.agricorner.com/price/potato-prices/, http://www.agricorner.com/price/ladyfinder-prices/
group2, http://www.agricorner.com/tag/story/story-1.html, http://www.agricorner.com/tag/story/story-11.html, http://www.agricorner.com/tag/story/story-41.html
group3, https://agrihunt.com/author/ramzan/page/3/, https://agrihunt.com/author/shahban/page/5/, https://agrihunt.com/author/Sufer/page/3/
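Is this possible? Is there a better approach than the one I am assuming?
Update on what "similar pattern" means:
In the example above, "/price/ladyfinder-prices", "price/potato-prices/" and "/ladyfinder-prices/" are grouped together because they share the same domain and their paths match up to a certain level. The same applies to the other examples. My scenario is very close to the one discussed on GitHub, but how can it be applied to MapReduce?
[Discussion]: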
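One possible reading of "same domain, path matching up to a certain level" is a grouping key built from the host plus the first few path segments. The sketch below is only an illustration under that assumption: the class `UrlPrefixGrouping`, the helpers `groupKey` and `group`, and the `depth` parameter are hypothetical names, not part of the question. It simulates the map and reduce phases with plain Java collections; in a real Hadoop job the key from `groupKey` would presumably be emitted by the mapper, and the reducer would receive all URLs that share it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UrlPrefixGrouping {

    // Hypothetical helper: the key is the host plus the first `depth`
    // path segments, e.g. "www.agricorner.com/price" for depth = 1.
    // Assumption: this is what "similar pattern" means in the question.
    static String groupKey(String url, int depth) {
        URI uri = URI.create(url);
        String host = uri.getHost() == null ? "" : uri.getHost();
        String path = uri.getPath() == null ? "" : uri.getPath();
        StringBuilder key = new StringBuilder(host);
        int taken = 0;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            if (taken == depth) {
                break;
            }
            key.append('/').append(segment);
            taken++;
        }
        return key.toString();
    }

    // Stand-in for the shuffle and reduce phases: collects every URL
    // under its key, preserving the order in which keys first appear.
    static Map<String, List<String>> group(List<String> urls, int depth) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(groupKey(url, depth), k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.agricorner.com/price/onion-prices/",
                "http://www.agricorner.com/tag/story/story-1.html",
                "https://agrihunt.com/author/ramzan/page/3/");
        // Prints one line per group: key, then the URLs collected under it.
        group(urls, 1).forEach((key, members) ->
                System.out.println(key + ", " + String.join(", ", members)));
    }
}
```

With `depth = 1` the nine sample URLs fall into the three groups listed in the expected output. A fixed depth is a simplification: if the shared prefix length differs between sites, the key function would need a per-domain rule or a different similarity measure.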
-
Define in detail what you mean by "similar pattern". To me it looks like a simple common prefix (a shared path prefix).
Tags: java algorithm mapreduce grouping