Java Dictionary Searcher: Answers

【Question title】: Java Dictionary Searcher
【Posted】: 2011-08-20 20:18:03
【Question description】:

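I am trying to implement a program that takes user input, splits that string into tokens, and then searches a dictionary for the words in that string. My goal for the parsed string is for every token to be an English word.

For example: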

Input:
       aman

Split Method:
      a man
      a m an
      a m a n
      am an
      am a n
      ama n

Desired Output:
      a man

I currently have this code, which does everything up to the desired-output part:

import java.util.Scanner;
import java.io.*;

public class Words {

    public static String[] dic = new String[80368];

    public static void split(String head, String in) {

        // head + " " + in is a segmentation 
        String segment = head + " " + in;

        // count number of dictionary words
        int count = 0;
        Scanner phraseScan = new Scanner(segment);
        while (phraseScan.hasNext()) {
            String word = phraseScan.next();
            for (int i=0; i<dic.length; i++) {
                if (word.equalsIgnoreCase(dic[i])) count++;
            }
        }

        System.out.println(segment + "\t" + count + " English words");

        // recursive calls
        for (int i=1; i<in.length(); i++) {
            split(head+" "+in.substring(0,i), in.substring(i,in.length()));
        }   
    }

    public static void main (String[] args) throws IOException {
        Scanner scan = new Scanner(System.in);
        System.out.print("Enter a string: ");
        String input = scan.next();
        System.out.println();

        Scanner filescan = new Scanner(new File("src:\\dictionary.txt"));
        int wc = 0;
        while (filescan.hasNext()) {
            dic[wc] = filescan.nextLine();
            wc++;
        }

        System.out.println(wc + " words stored");

        split("", input);

    }
}

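I know there are better ways to store the dictionary (such as a binary search tree or a hash table), but I don't know how to implement those.

I am stuck on how to implement a method that checks a split string to see whether every segment is a word in the dictionary.

Any help would be great, thanks.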

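【Question discussion】:

Possible duplicate of Word Is In Dictionary or Not
What is the maximum input string you expect?
It could be any length, but I expect it probably won't exceed 20 characters... I'd say 50 MAX

Tags: java string hashtable binary-search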

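【Solution 1】:

If you want to support 20 or more characters, splitting the input string in every possible way won't finish in a reasonable time. Here is a more efficient approach, comments inline: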

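// Imports needed by the two methods below, which belong inside a class
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.Stack;

public static void main(String[] args) throws IOException {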
    // load the dictionary into a set for fast lookups
    Set<String> dictionary = new HashSet<String>();
    Scanner filescan = new Scanner(new File("dictionary.txt"));
    while (filescan.hasNext()) {
        dictionary.add(filescan.nextLine().toLowerCase());
    }

    // scan for input
    Scanner scan = new Scanner(System.in);
    System.out.print("Enter a string: ");
    String input = scan.next().toLowerCase();
    System.out.println();

    // place to store list of results, each result is a list of strings
    List<List<String>> results = new ArrayList<>();

    long time = System.currentTimeMillis();

    // start the search, pass empty stack to represent words found so far
    search(input, dictionary, new Stack<String>(), results);

    time = System.currentTimeMillis() - time;

    // list the results found
    for (List<String> result : results) {
        for (String word : result) {
            System.out.print(word + " ");
        }
        System.out.println("(" + result.size() + " words)");
    }
    System.out.println();
    System.out.println("Took " + time + "ms");
}

public static void search(String input, Set<String> dictionary,
        Stack<String> words, List<List<String>> results) {

    for (int i = 0; i < input.length(); i++) {
        // take the first i + 1 characters of the input and see if it is a word
        String substring = input.substring(0, i + 1);

        if (dictionary.contains(substring)) {
            // the beginning of the input matches a word, store on stack
            words.push(substring);

            if (i == input.length() - 1) {
                // there's no input left, copy the words stack to results
                results.add(new ArrayList<String>(words));
            } else {
                // there's more input left, search the remaining part
                search(input.substring(i + 1), dictionary, words, results);
            }

            // pop the matched word back off so we can move onto the next i
            words.pop();
        }
    }
}

Sample output:

Enter a string: aman

a man (2 words)
am an (2 words)

Took 0ms

Here is a longer input:

Enter a string: thequickbrownfoxjumpedoverthelazydog

the quick brown fox jump ed over the lazy dog (10 words)
the quick brown fox jump ed overt he lazy dog (10 words)
the quick brown fox jumped over the lazy dog (9 words)
the quick brown fox jumped overt he lazy dog (9 words)

Took 1ms

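【Discussion】:

Another approach is to store the words in a database. This will improve performance when handling a large number of words (> 4 million).
@jmendeth: Sure, a database would help if the dictionary is large enough and there isn't enough memory available. Most dictionaries aren't that large, though. The larger ones I've tested have over 400k words and need 38MB. The OP doesn't need a database, since his dictionary has 80k words and only consumes about 7MB. For a large number of words, I would probably try a different data structure, such as a trie, before going to a database. A database would work fine, though: for the 36-character sample input I provided, there are only 335 lookups.
You are right, but sometimes (not in this case) dictionaries for other languages/character sets can have around 10 million words.
Is there an implementation with a binary search tree instead of a HashSet? Thanks for your answer.
A binary search tree gives you O(lg(n)) lookup time instead of O(1), so that's not so hot. Tries, though, can implement startsWith or similar. With the current implementation, given a 3-petabyte string that happens not to start with a word ("xzaszssxaa..."), you will scan the whole string, repeatedly looking up longer and longer substrings in the dictionary, rather than quickly discovering that no match exists. With a trie implementation, you stop early.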
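
To illustrate the trie idea from the last comment, here is a minimal sketch, not the answerer's code. The class name `TrieSegmenter` and its methods `add` and `segment` are made up for this example, and the four-word dictionary is a stand-in for the real word list. The search walks the trie one character at a time and gives up on a branch as soon as no dictionary word starts with the characters read so far.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: a trie-backed variant of the recursive search.
class TrieSegmenter {

    // One trie node: children keyed by character, plus an end-of-word flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Insert a word (lower-cased) into the trie.
    void add(String word) {
        Node node = root;
        for (char c : word.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Return every way to split the input into dictionary words.
    List<String> segment(String input) {
        List<String> results = new ArrayList<>();
        if (!input.isEmpty()) {
            search(input.toLowerCase(), 0, new ArrayDeque<>(), results);
        }
        return results;
    }

    private void search(String input, int start, Deque<String> words,
            List<String> results) {
        if (start == input.length()) {
            // Whole input consumed: record the current split.
            results.add(String.join(" ", words));
            return;
        }
        Node node = root;
        for (int end = start; end < input.length(); end++) {
            node = node.children.get(input.charAt(end));
            if (node == null) {
                return; // no dictionary word has this prefix, so stop early
            }
            if (node.isWord) {
                words.addLast(input.substring(start, end + 1));
                search(input, end + 1, words, results);
                words.removeLast(); // backtrack and try a longer word
            }
        }
    }

    public static void main(String[] args) {
        TrieSegmenter trie = new TrieSegmenter();
        for (String w : new String[] {"a", "am", "an", "man"}) {
            trie.add(w);
        }
        System.out.println(trie.segment("aman")); // prints [a man, am an]
    }
}
```

Compared with probing a HashSet with ever-longer substrings, each character is examined once per branch and no substring is built until a word actually matches.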

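【Solution 2】:

If my answer seems silly, it's because you are really close and I'm not sure where you're stuck.

Given your code above, the simplest way is to add a counter for the total number of words and compare it with the number of matched words: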

    int count = 0; int total = 0;
    Scanner phraseScan = new Scanner(segment);
    while (phraseScan.hasNext()) {
        total++;
        String word = phraseScan.next();
        for (int i=0; i<dic.length; i++) {
            if (word.equalsIgnoreCase(dic[i])) count++;
        }
    }
    if(total==count) System.out.println(segment);

Implementing it as a hash table would probably be better (and certainly faster), and it's really easy.

HashSet<String> dict = new HashSet<String>();
dict.add("foo"); // add your data


int count = 0; int total = 0;
Scanner phraseScan = new Scanner(segment);
while (phraseScan.hasNext()) {
    total++;
    String word = phraseScan.next();
    if (dict.contains(word)) count++;
}
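
Putting the two fragments together, a self-contained sketch of the check might look like the following. The class name `SegmentCheck`, the method name `allWords`, and the tiny dictionary are assumptions made for this example. It also lower-cases each token before the lookup, because `HashSet.contains` is case-sensitive, unlike the `equalsIgnoreCase` comparison in the array version.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical wrapper around the counting check described above.
class SegmentCheck {

    // True when every whitespace-separated token of segment is in dict.
    // Assumes dict holds lower-case words.
    static boolean allWords(String segment, Set<String> dict) {
        int count = 0;
        int total = 0;
        Scanner phraseScan = new Scanner(segment);
        while (phraseScan.hasNext()) {
            total++;
            String word = phraseScan.next().toLowerCase();
            if (dict.contains(word)) count++;
        }
        // An empty segment has no tokens and is not reported as a match.
        return total > 0 && total == count;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("a", "am", "an", "man"));
        for (String segment : new String[] {"a man", "a m an", "am an", "ama n"}) {
            if (allWords(segment, dict)) {
                System.out.println(segment);
            }
        }
    }
}
```

With this sample dictionary, only "a man" and "am an" are printed.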

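There are other, better ways to do this. One is a trie (http://en.wikipedia.org/wiki/Trie), which is a bit slower for lookups but stores the data more efficiently. If you have a large dictionary, you may not be able to fit it in memory, so you could use a database or a key-value store such as BDB (http://en.wikipedia.org/wiki/Berkeley_DB).

【Discussion】:

【Solution 3】: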

package linkedlist;

import java.util.LinkedHashSet;

public class dictionaryCheck {

private static LinkedHashSet<String> set;
private static int start = 0;
private static boolean flag;

public boolean checkDictionary(String str, int length) {

    if (start >= length) {
        return flag;
    } else {
        flag = false;
        for (String word : set) {

            int wordLen = word.length();

            if (start + wordLen <= length) {

                if (word.equals(str.substring(start, wordLen + start))) {
                    start = wordLen + start;
                    flag = true;
                    checkDictionary(str, length);

                }
            }
        }

    }

    return flag;
}

public static void main(String[] args) {
    // TODO Auto-generated method stub
    set = new LinkedHashSet<String>();
    set.add("Jose");
    set.add("Nithin");
    set.add("Joy");
    set.add("Justine");
    set.add("Jomin");
    set.add("Thomas");
    String str = "JoyJustine";
    int length = str.length();
    boolean c;

    dictionaryCheck obj = new dictionaryCheck();
    c = obj.checkDictionary(str, length);
    if (c) {
        System.out
                .println("String can be found out from those words in the Dictionary");
    } else {
        System.out.println("Not Possible");
    }

}

}

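【Discussion】:

A simple and working solution. Let me know if I've missed anything. I guess its time complexity is exponential. Polynomial time complexity can be achieved with a dynamic programming solution.
While this code may solve the OP's problem, you really should add some explanation of what the code does or how it works. Code-only answers are discouraged.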
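
The dynamic programming idea mentioned in the first comment can be sketched as follows. This is an illustration under assumptions, not the answerer's code: the class name `WordBreakDp` and the method name `canSegment` are invented here. `reachable[end]` records whether the first `end` characters can be split into dictionary words, so each prefix is decided once instead of being re-explored by recursion.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical dynamic programming variant of the yes/no check.
class WordBreakDp {

    // True when str can be written as a concatenation of words from dict.
    // The empty string is treated as trivially segmentable.
    static boolean canSegment(String str, Set<String> dict) {
        int n = str.length();
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true; // the empty prefix needs no words
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                // A prefix of length end works if some shorter reachable
                // prefix is followed by a dictionary word ending at end.
                if (reachable[start] && dict.contains(str.substring(start, end))) {
                    reachable[end] = true;
                    break;
                }
            }
        }
        return reachable[n];
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList(
                "Jose", "Nithin", "Joy", "Justine", "Jomin", "Thomas"));
        System.out.println(canSegment("JoyJustine", names)); // prints true
        System.out.println(canSegment("JoyJust", names));    // prints false
    }
}
```

The two nested loops perform O(n^2) set lookups for an input of length n, and because no position is committed to a single matching word, an input such as "abcd" with the words "ab", "abc" and "d" is still recognized.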