Find most frequent words on a webpage (using Jsoup)?

【Question title】: Find most frequent words on a webpage (using Jsoup)?
【Posted】: 2015-04-04 14:27:27
【Question】:

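For my project, I have to find the most frequent words in a Wikipedia article. I found Jsoup for parsing the HTML, but that still leaves the word-frequency problem. Is there a function in Jsoup that counts word frequency, or any other way to find the most frequent words on a webpage using Jsoup?

Thanks.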

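【Question comments】:

No, Jsoup is not a statistics/histogram tool. It is simply an XML/HTML parser.
Is there an API I can use to solve my problem?
Maybe there is, but I don't know of one. Questions asking for tool recommendations are off-topic on Stack Overflow, so you shouldn't ask them here. But you can write your own code using, for example, a Map<String, Integer> that stores each word and its count. Once the map is filled, find the entry with the highest count.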
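
The Map<String, Integer> idea from the last comment could look roughly like the sketch below. It is only an illustration under a few assumptions: the class and method names (WordCounts, countWords, mostFrequent) are made up for this example, a "word" is taken to be any run of Unicode letters, and the sample string stands in for the text you would get from Jsoup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative, not from the original post.
final class WordCounts {

    // Counts how often each word occurs; a word is a run of Unicode letters.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token when the text starts with a separator
            }
            counts.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }

    // Returns the word with the highest count, or null if the map is empty.
    static String mostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in text; in the real program this would come from doc.body().text()
        String text = "the cat and the hat and the bat";
        Map<String, Integer> counts = countWords(text);
        String top = mostFrequent(counts);
        System.out.println(top + " = " + counts.get(top)); // prints "the = 3"
    }
}
```

If several words share the highest count, which one is returned depends on the HashMap iteration order, so a real program would probably want an explicit tie-break.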

Tags: java html jsoup webpage word-frequency

【Solution 1】:

Yes, you can use Jsoup to get the text from the webpage, like this:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();

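Then you need to count the words and find out which ones are the most frequent. This code looks promising. We need to modify it to work with the String that Jsoup returns, like this: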

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

   public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();

        Map<String, Word> countMap = new HashMap<String, Word>();

        //connect to wikipedia and get the HTML
        System.out.println("Downloading page...");
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

        //Get the actual text from the page, excluding the HTML
        String text = doc.body().text();

        System.out.println("Analyzing text...");
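        //Create BufferedReader so the words can be counted; decode with the same charset used to encode (UTF-8), not the platform default
        BufferedReader reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));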
        String line;
        while ((line = reader.readLine()) != null) {
            String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
            for (String word : words) {
                if ("".equals(word)) {
                    continue;
                }

                Word wordObj = countMap.get(word);
                if (wordObj == null) {
                    wordObj = new Word();
                    wordObj.word = word;
                    wordObj.count = 0;
                    countMap.put(word, wordObj);
                }

                wordObj.count++;
            }
        }

        reader.close();

        SortedSet<Word> sortedWords = new TreeSet<Word>(countMap.values());
        int i = 0;
        int maxWordsToDisplay = 10;

        //Words to skip when printing; the match is case-sensitive, so "The" is still shown
        Set<String> wordsToIgnore = new HashSet<String>(Arrays.asList("the", "and", "a"));

        for (Word word : sortedWords) {
            if (i >= maxWordsToDisplay) { //stop once maxWordsToDisplay words have been printed
                break;
            }

            if (wordsToIgnore.contains(word.word)) {
                continue; //ignored words do not count toward the limit
            }

            System.out.println(word.count + "\t" + word.word);
            i++;
        }

        time = System.currentTimeMillis() - time;

        System.out.println("Finished in " + time + " ms");
    }

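    public static class Word implements Comparable<Word> {
        String word;
        int count;

        @Override
        public int hashCode() { return word.hashCode(); }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof Word && word.equals(((Word) obj).word);
        }

        //Sort by descending count, then alphabetically. Without the tie-break,
        //a TreeSet treats words with equal counts as duplicates and drops them.
        @Override
        public int compareTo(Word b) {
            int byCount = Integer.compare(b.count, count);
            return byCount != 0 ? byCount : word.compareTo(b.word);
        }
    }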
}

Output:

Downloading page...
Analyzing text...
42  of
24  in
20  Wikipedia
19  to
16  is
11  that
10  The
9   was
8   articles
7   featured
Finished in 3300 ms

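A few notes:

This code can ignore certain words such as "the", "and", and "a". You will have to customize the list yourself.
Unicode characters sometimes seem to cause problems. I haven't run into this myself, but someone in the comments has.
This could be done better and with less code.
It has not been well tested.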
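
As a rough illustration of the notes above about stop words, Unicode, and shorter code, here is one possible variant using streams (Java 8+). Treat it as a sketch under stated assumptions rather than a drop-in replacement: the class name TopWords and the method topWords are invented for this example, words are defined as runs of Unicode letters (\p{L}), and, unlike the code above, everything is lowercased first so that "The" and "the" count as the same word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical class; the names and the stop-word list are illustrative.
final class TopWords {

    // Returns up to 'limit' (word, count) entries, most frequent first, skipping stop words.
    static List<Map.Entry<String, Long>> topWords(String text, Set<String> stopWords, int limit) {
        // Lowercase, split on anything that is not a Unicode letter, then count each word
        Map<String, Long> counts = Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(w -> !w.isEmpty() && !stopWords.contains(w))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Sort by descending count; break ties alphabetically so the order is deterministic
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in text; in the real program this would come from doc.body().text().
        // \u00C5 is the letter A with a ring above, written as an escape to avoid source-encoding issues.
        String text = "The cat and the hat. \u00C5sa saw the cat.";
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and", "a"));
        for (Map.Entry<String, Long> entry : topWords(text, stopWords, 3)) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

Because the stop words are filtered out before counting, they never take up a slot in the top list, and the stop-word set only needs lowercase entries.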

Enjoy!

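【Comments】:

Why not just map words to their counts? A whole class that is basically a decorator around a String, without abstracting any further functionality, seems like overkill. Just curious.
I ran the same code but got a different output. I got a count of 27.
@LittlePanda, strange, which URL did you use? And yes, this code isn't perfect, since I threw it together in a few minutes.
@JonasCz - I copied and ran this code. I didn't change anything.
Thank you, this is exactly what I needed. Now the only thing it lacks is stop-word removal.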