【Posted】: 2012-12-14 16:50:25
【Question】:
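Suppose one text file contains a list of URLs (millions of them), and another text file contains a list of blacklisted words.
I want to process the URL list as follows.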
- Parse the URLs and store them in some data structure.
- Process the URLs and blacklist those URLs which contain at least one of the
blacklisted words.
- If a URL consists of 50% or more blacklisted words, add the other
words of that URL to the list of blacklisted words.
- Since the list of blacklisted words has now been modified, URLs which were
not blacklisted earlier may qualify for blacklisting now. So,
the algorithm should handle this case as well and mark the earlier whitelisted
URLs as blacklisted if they contain these newly added blacklisted words.
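At the end I should have a list of whitelisted URLs.
Any suggestions for the best algorithm and data structures to achieve the most efficient solution in terms of time and space complexity?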
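One possible way to sketch the steps above is an inverted index (word -> URLs containing it) combined with a worklist of blacklisted words still to be processed. This is only an illustration under stated assumptions, not a definitive answer: the names `tokenize` and `whitelist_urls` are hypothetical, a "word" is assumed to be a lowercase run of letters or digits, and the 50% rule is assumed to count distinct words. Each word is processed once and each (word, URL) pair is visited once, so the work is roughly linear in the total number of tokens.

```python
import re
from collections import defaultdict, deque


def tokenize(url):
    """Assumed word definition: distinct lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", url.lower()))


def whitelist_urls(urls, blacklist):
    """Return the URLs that contain no blacklisted word once the
    blacklist has stopped growing."""
    words_of = [tokenize(u) for u in urls]

    # Inverted index: word -> indices of the URLs that contain it.
    index = defaultdict(list)
    for i, words in enumerate(words_of):
        for w in words:
            index[w].append(i)

    hits = [0] * len(urls)          # blacklisted words seen per URL
    expanded = [False] * len(urls)  # URL already contributed its words?
    seen = {w.lower() for w in blacklist}
    queue = deque(seen)             # blacklisted words not yet processed

    while queue:
        word = queue.popleft()
        for i in index.get(word, ()):
            hits[i] += 1
            # 50% rule: at least half of the URL's distinct words are
            # blacklisted, so its remaining words join the blacklist.
            if not expanded[i] and 2 * hits[i] >= len(words_of[i]):
                expanded[i] = True
                for other in words_of[i]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)

    # A URL stays whitelisted only if no blacklisted word ever hit it.
    return [u for i, u in enumerate(urls) if hits[i] == 0]


if __name__ == "__main__":
    urls = [
        "cheap-pills.example",
        "pills-recall.invalid/news/today",
        "garden.test/tips",
    ]
    print(whitelist_urls(urls, {"cheap", "pills"}))
```

In this sketch the blacklist only ever grows, and the loop runs until no new word is added, so the final whitelist does not depend on the order in which the URLs or words are visited. For millions of URLs the index and the per-URL word sets dominate memory; interning words as integer IDs would be one way to reduce that.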
【Discussion】:
-
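I strongly object to "If there exists a URL containing 50% or more blacklisted words, add the other words of that URL in the list of blacklisted words." You will most likely end up banning words such as a, that, and the, and finish with an empty set of "whitelisted" URLs.
-
Be careful with this approach. Suppose you have a site "theblacklistedwordblog.com". After running this, the words blog and the will be blacklisted as well. I hope that is not the restriction you intend.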
-
How do you define the words of a URL?
-
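Also, note that this method is not deterministic and depends on the order in which you scan the documents (because after several iterations of the algorithm it becomes harder to reach 50% blacklisted words), so documents processed early are more likely to "contribute" their words to the blacklist than documents processed later.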
-
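You should probably read up on existing approaches to similar problems: Bayesian filtering, weighting, text processing, Markov models, machine learning, and so on.
Tags: string algorithm machine-learning spam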