【Posted】: 2016-07-30 23:37:24
【Question】:
Dávid Horváth's solution, adapted to return the segmentation whose smallest word is as large as possible:
import java.util.*;

public class SubWordsFinder
{
    private Set<String> words;

    public SubWordsFinder(Set<String> words)
    {
        this.words = words;
    }

    public List<String> findSubWords(String word) throws NoSolutionFoundException
    {
        List<String> bestSolution = new ArrayList<>();
        if (word.isEmpty())
        {
            return bestSolution;
        }
        long length = word.length();
        int[] pointer = new int[]{0, 0};
        LinkedList<int[]> pointerStack = new LinkedList<>();
        LinkedList<String> currentSolution = new LinkedList<>();
        while (true)
        {
            boolean backtrack = false;
            for (int end = pointer[1] + 1; end <= length; end++)
            {
                if (end == length)
                {
                    backtrack = true;
                }
                pointer[1] = end;
                String wordToFind = word.substring(pointer[0], end);
                if (words.contains(wordToFind))
                {
                    currentSolution.add(wordToFind);
                    if (backtrack)
                    {
                        if (bestSolution.isEmpty() || (currentSolution.size() <= bestSolution.size() && getSmallestSubWordLength(currentSolution) > getSmallestSubWordLength(bestSolution)))
                        {
                            bestSolution = new ArrayList<>(currentSolution);
                        }
                        currentSolution.removeLast();
                    } else if (!bestSolution.isEmpty() && currentSolution.size() == bestSolution.size())
                    {
                        currentSolution.removeLast();
                        backtrack = true;
                    } else
                    {
                        int[] nextPointer = new int[]{end, end};
                        pointerStack.add(pointer);
                        pointer = nextPointer;
                    }
                    break;
                }
            }
            if (backtrack)
            {
                if (pointerStack.isEmpty())
                {
                    break;
                } else
                {
                    currentSolution.removeLast();
                    pointer = pointerStack.removeLast();
                }
            }
        }
        if (bestSolution.isEmpty())
        {
            throw new NoSolutionFoundException();
        } else
        {
            return bestSolution;
        }
    }

    private int getSmallestSubWordLength(List<String> words)
    {
        int length = Integer.MAX_VALUE;
        for (String word : words)
        {
            if (word.length() < length)
            {
                length = word.length();
            }
        }
        return length;
    }

    public class NoSolutionFoundException extends Exception
    {
        private static final long serialVersionUID = 1L;
    }
}
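I have a String made up of regular lowercase English words run together. Assume this String has already been broken down into a List of all possible sub-words: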
public List<String> getSubWords(String text)
{
    List<String> words = new ArrayList<>();
    for (int startingIndex = 0; startingIndex < text.length() + 1; startingIndex++)
    {
        for (int endIndex = startingIndex + 1; endIndex < text.length() + 1; endIndex++)
        {
            String subString = text.substring(startingIndex, endIndex);
            if (contains(subString))
            {
                words.add(subString);
            }
        }
    }
    Comparator<String> lengthComparator = (firstItem, secondItem) ->
    {
        if (firstItem.length() > secondItem.length())
        {
            return -1;
        }
        if (secondItem.length() > firstItem.length())
        {
            return 1;
        }
        return 0;
    };
    // Sort the list in descending String length order
    Collections.sort(words, lengthComparator);
    return words;
}
How do I find the smallest number of sub-words that make up the original string?
For example:
String text = "updatescrollbar";
List<String> leastWords = getLeastSubWords(text);
System.out.println(leastWords);
Output:
[update, scroll, bar]
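I'm not sure how to iterate over all the possibilities, since they change depending on which words have already been chosen. A start would be something like this: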
public List<String> getLeastSubWords(String text)
{
    String textBackup = text;
    List<String> subWords = getSubWords(text);
    System.out.println(subWords);
    List<List<String>> containing = new ArrayList<>();
    List<String> validWords = new ArrayList<>();
    for (String subWord : subWords)
    {
        if (text.startsWith(subWord))
        {
            validWords.add(subWord);
            text = text.substring(subWord.length());
        }
    }
    // Did we find a valid words distribution?
    if (text.length() == 0)
    {
        System.out.println(validWords.size());
    }
    return null;
}
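For reference, one way to cover every possibility without enumerating word combinations explicitly is dynamic programming over prefix lengths: for each position, remember the fewest words that reach it and where the last word started. The following is only a minimal sketch of that idea, not the method used above. The class name LeastSubWordsSketch, the DICTIONARY constant, and the extra dictionary parameter are hypothetical stand-ins for the real word list behind contains(...).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class LeastSubWordsSketch
{
    // Hypothetical stand-in for the real dictionary (assumption, not from the post).
    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "up", "date", "update", "updates", "scroll", "bar", "a"));

    // Returns a segmentation of text using the fewest dictionary words,
    // or an empty list if text is empty or cannot be segmented at all.
    public static List<String> getLeastSubWords(String text, Set<String> dictionary)
    {
        int n = text.length();
        // best[i] = fewest words that build text.substring(0, i); -1 means unreachable
        int[] best = new int[n + 1];
        // cut[i] = start index of the last word in that best segmentation
        int[] cut = new int[n + 1];
        Arrays.fill(best, -1);
        best[0] = 0;
        for (int end = 1; end <= n; end++)
        {
            for (int start = 0; start < end; start++)
            {
                if (best[start] < 0 || !dictionary.contains(text.substring(start, end)))
                {
                    continue;
                }
                if (best[end] < 0 || best[start] + 1 < best[end])
                {
                    best[end] = best[start] + 1;
                    cut[end] = start;
                }
            }
        }
        LinkedList<String> result = new LinkedList<>();
        if (best[n] < 0)
        {
            return result; // no segmentation exists
        }
        // Walk the recorded cuts backwards to rebuild the words in order.
        for (int end = n; end > 0; end = cut[end])
        {
            result.addFirst(text.substring(cut[end], end));
        }
        return result;
    }

    public static void main(String[] args)
    {
        // prints [update, scroll, bar]
        System.out.println(getLeastSubWords("updatescrollbar", DICTIONARY));
    }
}
```

The sketch checks O(n^2) substrings and never needs the pre-sorted sub-word list. It only minimizes the number of words; ties between equally short segmentations are broken by whichever cut is found first.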
Note:
This is related to this question.
【Comments】:
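- First you extract a list of all the words contained in text? Second, you try to find the fewest words that make up that string (exactly?)? What if there is no solution? I think it would be much easier to do both tasks at the same time.
- It would be better to use an indexed collection like TreeSet instead of ArrayList.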
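One possible reading of the TreeSet suggestion is sketched below: the sub-words are collected straight into a TreeSet ordered longest-first, so no separate sort is needed and duplicates disappear. This is an illustration under assumptions, not code from the post. The class name SubWordSetSketch, the DICTIONARY constant, and the dictionary parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class SubWordSetSketch
{
    // Hypothetical dictionary (assumption); a TreeSet gives O(log n) lookups.
    static final Set<String> DICTIONARY = new TreeSet<>(Arrays.asList(
            "up", "date", "update", "scroll", "bar", "a"));

    // Longest first; ties are broken alphabetically so that two different
    // words of the same length are both kept by the TreeSet.
    static final Comparator<String> LONGEST_FIRST =
            Comparator.comparingInt(String::length).reversed()
                    .thenComparing(Comparator.<String>naturalOrder());

    public static SortedSet<String> getSubWords(String text, Set<String> dictionary)
    {
        SortedSet<String> words = new TreeSet<>(LONGEST_FIRST);
        for (int start = 0; start < text.length(); start++)
        {
            for (int end = start + 1; end <= text.length(); end++)
            {
                String candidate = text.substring(start, end);
                if (dictionary.contains(candidate))
                {
                    words.add(candidate); // repeated occurrences are ignored
                }
            }
        }
        return words;
    }

    public static void main(String[] args)
    {
        // prints [scroll, update, date, bar, up, a]
        System.out.println(getSubWords("updatescrollbar", DICTIONARY));
    }
}
```

Unlike the ArrayList version, a word that occurs twice in the text (here "a") appears only once in the result, which matters if the later step relies on occurrence counts.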
Tags: java nlp text-segmentation