Scanning a large number of documents for tens of words: answers

【Question title】: Scanning a large number of documents for tens of words
【Posted】: 2015-01-18 19:01:08
【Question】:

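I have a large number of documents (more than a million) that I need to scan periodically and match against about 100 "multi-word keywords" (i.e. not only keywords such as "movies" but also keywords such as "North America"). I have the following code, which works for single-word keywords (e.g. "book"):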

/** 
 * Scan a text for certain keywords
 * @param keywords the list of keywords we are searching for
 * @param text the text we will be scanning
 * @return a list of any keywords from the list which we could find in the text
 */
public static List<String> scanWords(List<String> keywords, String text) {

    // prepare the BreakIterator
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);

    List<String> results = new ArrayList<String>();

    // iterate word by word
    int start = wb.first();
    for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {

        String word = text.substring(start, end);

        if (!StringUtils.isEmpty(word) && keywords.contains(word)){

            // we have this word in our keywords so return it
            results.add(word);
        }
    }

    return results;
}

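Note: I need this code to be as efficient as possible, because the number of documents is very large.

My current code cannot find any of the two-word keywords. Any ideas on how to fix this? I am also open to a completely different approach.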

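【Question comments】:

Why not try Lucene for a task like this?
Yes, an indexing system that maintains the files would probably be better. By the way, what kind of text requires a million files? If each one contains 10 words, that is 10 million words. Imagine the I/O just to open and close them.
I only need to store the keywords that were found in my database. Indexing is not the solution.
Does the set of documents or the set of keywords change over time, or both?
I am indexing the keywords in the documents. Existing documents will largely keep the same keywords. New documents will be added regularly.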

Tags: java regex algorithm matching string-matching

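【Answer 1】:

Scanning every document simply will not scale. It is better to index your documents in an inverted index, or to use Lucene as suggested in the comments.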
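
To make the inverted index idea concrete, here is a minimal in-memory sketch. It is not Lucene's API; the class name `InvertedIndexSketch` and its methods `add` and `candidates` are hypothetical names chosen for illustration. It assumes case-insensitive matching and reuses `BreakIterator` for tokenizing, as in the question. For a multi-word keyword it only returns candidate documents that contain every word; confirming that the words are adjacent would still need a positional check on those candidates.

```java
import java.text.BreakIterator;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not part of any library.
class InvertedIndexSketch {
    // Maps each lower-cased word to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenizes the text and records docId under every word it contains.
    void add(int docId, String text) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(text);
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String token = text.substring(start, end).trim().toLowerCase(Locale.ROOT);
            // Skip whitespace and punctuation segments produced by the iterator.
            if (!token.isEmpty() && Character.isLetterOrDigit(token.codePointAt(0))) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns ids of documents that contain every word of the keyword.
    // For multi-word keywords these are candidates only: adjacency is not checked.
    Set<Integer> candidates(String keyword) {
        String[] words = keyword.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Set<Integer> result = new TreeSet<>(
                postings.getOrDefault(words[0], Collections.<Integer>emptySet()));
        for (int i = 1; i < words.length; i++) {
            result.retainAll(postings.getOrDefault(words[i], Collections.<Integer>emptySet()));
        }
        return result;
    }
}
```

With an index like this, each document is tokenized once when it is added, and a lookup for a keyword such as "North America" intersects two posting sets instead of rescanning all the text.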

【Comments】:

Indexing is not suitable for my task.

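【Answer 2】:

I believe creating an instance of Scanner can solve this. The Scanner class has a method that lets you search the text for a pattern, which in your case would be the word.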

// pattern is the keyword to search for, given as a regular expression string
Scanner scanner = new Scanner(text);
while (scanner.hasNext()) {
    scanner.findInLine(pattern);
    scanner.next();
}

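The Scanner class is well suited to this kind of thing, and I believe it would serve your needs well.

【Comments】:

The core problem is that he has to search each document for many different keywords/phrases. Are you suggesting that he reset the scanner and search the whole document once per pattern? You can't really know until you try, but I would be very surprised if that performed well enough. And what are you doing with the findInLine() call? It makes no sense to me.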
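
One way to avoid rescanning the document once per keyword is to compile all keywords into a single regular expression and make one pass over the text. The sketch below is an illustration under stated assumptions, not a measured solution: the class name `KeywordScanner` is hypothetical, the keyword list is assumed to be non-empty, every keyword is assumed to start and end with a letter or digit (so that `\b` behaves as a word boundary), and matching is case-sensitive like the `contains` check in the question.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of any library.
class KeywordScanner {
    private final Pattern pattern;

    // Builds one alternation pattern from all keywords; compile once, reuse per document.
    KeywordScanner(Collection<String> keywords) {
        List<String> sorted = new ArrayList<>(keywords);
        // Longest first, so a longer keyword is tried before a shorter one
        // that starts at the same position.
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringJoiner alternation = new StringJoiner("|");
        for (String keyword : sorted) {
            // Pattern.quote treats the keyword as literal text, spaces included.
            alternation.add(Pattern.quote(keyword));
        }
        pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Returns the distinct keywords found in the text, in order of first appearance.
    Set<String> scan(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }
}
```

Matches found this way do not overlap, so if both "North America" and "America" are keywords, an occurrence of "North America" is reported only as the longer one. Whether this is fast enough for a million documents would have to be measured; a dedicated multi-pattern algorithm such as Aho-Corasick is a possible alternative if it is not.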