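First, you could create a class that holds the occurrence count and the line numbers (along with the word itself). This class can implement the Comparable interface, which gives you a simple comparison based on word frequency: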
import java.util.Set;
import java.util.TreeSet;

public class WordOccurrence implements Comparable<WordOccurrence> {

    private final String word;
    private int totalCount = 0;
    // TreeSet keeps the line numbers unique and in ascending order
    private final Set<Integer> lineNumbers = new TreeSet<>();

    public WordOccurrence(String word, int firstLineNumber) {
        this.word = word;
        addOccurrence(firstLineNumber);
    }

    // final, because it is called from the constructor
    public final void addOccurrence(int lineNumber) {
        totalCount++;
        lineNumbers.add(lineNumber);
    }

    // orders by total count, ascending
    @Override
    public int compareTo(WordOccurrence o) {
        return Integer.compare(totalCount, o.totalCount);
    }

    @Override
    public String toString() {
        StringBuilder lineNumberInfo = new StringBuilder("[");
        for (int line : lineNumbers) {
            if (lineNumberInfo.length() > 1) {
                lineNumberInfo.append(", ");
            }
            lineNumberInfo.append(line);
        }
        lineNumberInfo.append("]");
        return word + ", occurrences: " + totalCount + ", on rows "
                + lineNumberInfo;
    }
}
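When reading the words from the file, it is useful to return the data in a Map<String, WordOccurrence> that maps each word to its WordOccurrence. With a TreeMap you get alphabetical ordering "for free". You will probably also want to strip punctuation from the lines (for example with a regex such as \\p{P}) and ignore the case of the words: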
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.TreeMap;

public TreeMap<String, WordOccurrence> countOccurrences(String filePath)
        throws IOException {
    TreeMap<String, WordOccurrence> words = new TreeMap<>();
    File file = new File(filePath);
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(file)))) {
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            // remove punctuation and normalize to lower-case
            line = line.replaceAll("\\p{P}", "").toLowerCase();
            lineNumber++;
            String[] tokens = line.split("\\s+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // blank line or leading whitespace
                }
                if (words.containsKey(token)) {
                    words.get(token).addOccurrence(lineNumber);
                } else {
                    words.put(token, new WordOccurrence(token, lineNumber));
                }
            }
        }
    }
    return words;
}
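To see what the normalization step does to a single line, here is a minimal sketch that isolates it. The class name TokenizeDemo and the helper tokenize are made up for this illustration and are not part of the code above; the filter for empty tokens is likewise an addition in this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeDemo {

    // Hypothetical helper: strips punctuation, lower-cases the line,
    // then splits it on runs of whitespace.
    static List<String> tokenize(String line) {
        String cleaned = line.replaceAll("\\p{P}", "").toLowerCase();
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.split("\\s+")) {
            // split() can yield an empty string for blank lines or
            // lines that start with whitespace, so skip those
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [banana, apple, apple, potato]
        System.out.println(tokenize("Banana; apple, APPLE; potato."));
    }
}
```

Note that \p{P} also removes apostrophes and hyphens, so "don't" becomes "dont"; whether that is acceptable depends on your input.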
With the code above, displaying the occurrences in alphabetical order is as simple as:
for (Map.Entry<String, WordOccurrence> entry :
        countOccurrences("path/to/file").entrySet()) {
    System.out.println(entry.getValue());
}
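If you are not allowed to use Collections.sort() (with a Comparator<WordOccurrence>) to sort by number of occurrences, you will have to write the sort yourself. Something like this should do: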
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public static void displayInOrderOfOccurrence(
        Map<String, WordOccurrence> words) {
    List<WordOccurrence> orderedByOccurrence = new ArrayList<>();
    // sort: insert each entry in front of the first element it outranks,
    // which keeps the list in descending order of occurrence
    for (Map.Entry<String, WordOccurrence> entry : words.entrySet()) {
        WordOccurrence wo = entry.getValue();
        // initialize the list on the first round
        if (orderedByOccurrence.isEmpty()) {
            orderedByOccurrence.add(wo);
        } else {
            for (int i = 0; i < orderedByOccurrence.size(); i++) {
                if (wo.compareTo(orderedByOccurrence.get(i)) > 0) {
                    orderedByOccurrence.add(i, wo);
                    break;
                } else if (i == orderedByOccurrence.size() - 1) {
                    // outranks nothing: append at the end
                    orderedByOccurrence.add(wo);
                    break;
                }
            }
        }
    }
    // display
    for (WordOccurrence wo : orderedByOccurrence) {
        System.out.println(wo);
    }
}
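For comparison, if Collections.sort() is allowed, the hand-written insertion loop can be replaced by a Comparator. The following is only a sketch: Counted is a hypothetical, stripped-down stand-in for WordOccurrence so that the example compiles on its own, and sortedByCountDescending is a name invented here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortByCountDemo {

    // Hypothetical stand-in for WordOccurrence, reduced to word and count.
    static final class Counted {
        final String word;
        final int count;

        Counted(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + "=" + count;
        }
    }

    // Returns a sorted copy, highest count first; the input list is not modified.
    static List<Counted> sortedByCountDescending(List<Counted> input) {
        List<Counted> copy = new ArrayList<>(input);
        Collections.sort(copy, new Comparator<Counted>() {
            @Override
            public int compare(Counted a, Counted b) {
                // arguments swapped to get descending order
                return Integer.compare(b.count, a.count);
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        List<Counted> words = new ArrayList<>();
        words.add(new Counted("apple", 2));
        words.add(new Counted("banana", 1));
        words.add(new Counted("orange", 1));
        words.add(new Counted("potato", 3));
        // prints [potato=3, apple=2, banana=1, orange=1]
        System.out.println(sortedByCountDescending(words));
    }
}
```

Collections.sort() is stable, so words with equal counts keep their incoming (here alphabetical) order, which matches what the manual insertion loop produces.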
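Running the code above with the following test data:
potato; orange.
banana; apple, apple; potato.
potato.
will produce this output (first in alphabetical order, then in order of occurrence):
apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]
potato, occurrences: 3, on rows [1, 2, 3]
potato, occurrences: 3, on rows [1, 2, 3]
apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]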