【Question title】: How to programmatically find variations of a specific word in a sentence?
【Posted】: 2016-06-14 20:05:37
【Question】:

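Sometimes the data you get is not clean: words appear in variant forms, misspelled, or deliberately distorted. Can we find the instances in a sentence that most closely resemble a given word?

For example, suppose I am looking for the word "Awesome", which appears in sentences in varied forms such as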

"We had an awwweesssommmeeee dinner at sea resort"
"We had an awesomeeee dinner at sea resort"
"We had an awwesooomee dinner at sea resort"
etc.

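【Comments】:

  • You have to account for accidentally picking up words that should not match, such as "awful". There is no easy answer. Start with agrep("awesome", x, max.distance=0.5, ignore.case=TRUE) and learn how Levenshtein distance works.
  • You might be looking for datascience.stackexchange.com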
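
To make the "awful" caveat from the first comment concrete, here is a minimal Java sketch of Levenshtein distance applied to the words from the question. The class name `EditDistanceDemo` and its method are illustrative only; this is not how R's `agrep` is implemented, just the textbook dynamic-programming recurrence.

```java
// Hypothetical demo class, not part of the original thread.
class EditDistanceDemo {

    // Textbook Levenshtein distance (insert, delete, substitute each cost 1),
    // computed with two rolling rows instead of a full matrix.
    static int levenshtein(final String a, final String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                final int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            final int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(final String[] args) {
        final String target = "awesome";
        final String[] words = {"awesomeeee", "awwesooomee", "awwweesssommmeeee", "awful"};
        for (final String word : words) {
            System.out.printf("%s -> %d%n", word, levenshtein(target, word));
        }
    }
}
```

With these inputs the distances are 3, 4, 10 and 5 respectively, so "awful" is closer to "awesome" than the most stretched variant is. Any raw distance threshold loose enough to catch every variant would therefore also accept "awful".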

Tags: r string fuzzy-search stringdist


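【Solution 1】:

Do you want to do this purely in SQL?

If not, you will need a fuzzy-matching string comparison function that you can call from SQL. Such a function would use some combination of algorithms such as Jaro-Winkler, Levenshtein, n-grams, etc., or phonetic matching such as Metaphone, Double Metaphone, Metaphone 3, or Soundex.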
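
As one illustration of the n-gram approach named above, the following Java sketch scores two words by the Jaccard overlap of their character bigrams. The class and method names are hypothetical, and Jaccard over bigram sets is only one of several possible n-gram measures; it is not taken from any particular SQL or CLR library.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not part of the original answer.
class BigramSimilarityDemo {

    // Collects the distinct two-character substrings of s.
    static Set<String> bigrams(final String s) {
        final Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard similarity: |intersection| / |union| of the two bigram sets.
    static double jaccard(final String a, final String b) {
        final Set<String> gramsA = bigrams(a);
        final Set<String> gramsB = bigrams(b);
        final Set<String> union = new HashSet<>(gramsA);
        union.addAll(gramsB);
        if (union.isEmpty()) {
            // Both inputs are shorter than two characters: fall back to equality.
            return a.equals(b) ? 1.0 : 0.0;
        }
        final Set<String> intersection = new HashSet<>(gramsA);
        intersection.retainAll(gramsB);
        return (double) intersection.size() / union.size();
    }

    public static void main(final String[] args) {
        final String target = "awesome";
        final String[] words = {"awesomeeee", "awwesooomee", "awwweesssommmeeee", "awful"};
        for (final String word : words) {
            System.out.printf("%s -> %.3f%n", word, jaccard(target, word));
        }
    }
}
```

Because repeated letters add only a few new bigrams, the stretched variants score between 0.6 and about 0.86 here, while "awful" shares a single bigram with "awesome" and scores about 0.11.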

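Depending on which SQL Server you are using, you could install and use the Data Quality component, which has custom CLR implementations of some of these algorithms. Or the SSIS fuzzy matching component. Or.....

Personally, I have written C# .NET CLR functions to do this for me, but I only deal with names. Sentences get more complicated: you would probably want to split them into words/tokens, compare them as parts, and then compare them as a whole. ...

【Comments】: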

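    【Solution 2】:

    As a quick solution, you can lowercase your documents, tokenize them on whitespace, and then collapse the consecutive repeated characters of each term: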

    import java.util.Map;
    import java.util.Scanner;
    import java.util.Set;
    import java.util.TreeMap;
    import java.util.TreeSet;
    import java.util.stream.Collectors;
    
    public class CollapseConsecutiveCharsDemo {
    
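        // Collapses each run of identical consecutive characters to a single
        // character, e.g. "awwesooomee" -> "awesome" (and also "dinner" -> "diner").
        public static String collapse(final String term) {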
            final StringBuilder buffer = new StringBuilder();
            if (!term.isEmpty()) {
                char prev = term.charAt(0);
                buffer.append(prev);
                for (int i = 1; i < term.length(); i += 1) {
                    final char curr = term.charAt(i);
                    if (curr != prev) {
                        buffer.append(curr);
                        prev = curr;
                    }
                }
            }
            return buffer.toString();
        }
    
        public static void main(final String... documents) {
            final Map<String, Set<String>> termVariations = new TreeMap<>();
    
            for (final String document : documents) {
                final Scanner scanner = new Scanner(document.toLowerCase());
                while (scanner.hasNext()) {
                    final String expandedTerm = scanner.next();
                    final String collapsedTerm = collapse(expandedTerm);
                    // Group each original token under its collapsed form.
                    termVariations
                        .computeIfAbsent(collapsedTerm, (key) -> new TreeSet<>())
                        .add(expandedTerm);
                }
            }
    
            for (final Map.Entry<String, Set<String>> entry : termVariations.entrySet()) {
                final String term = entry.getKey();
                final Set<String> variations = entry.getValue();
                System.out.printf("variations(\"%s\") = {%s}%n",
                    term,
                    variations.stream()
                        .map((variation) -> String.format("\"%s\"", variation))
                        .collect(Collectors.joining(", ")));
            }
        }
    }
    

    Sample run:

    % java CollapseConsecutiveCharsDemo "We had an awwweesssommmeeee dinner at sea resort" "We had an awesomeeee dinner at sea resort" "We had an awwesooomee dinner at sea resort"
    variations("an") = {"an"}
    variations("at") = {"at"}
    variations("awesome") = {"awesomeeee", "awwesooomee", "awwweesssommmeeee"}
    variations("diner") = {"dinner"}
    variations("had") = {"had"}
    variations("resort") = {"resort"}
    variations("sea") = {"sea"}
    variations("we") = {"we"}
    

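    For a more elaborate solution, you could tokenize your documents with the Stanford CoreNLP tokenizer, which handles punctuation correctly, and combine that with spelling correction, for example using liblevenshtein.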
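
A rough idea of that pipeline can be sketched without either library. In the Java sketch below, a plain regex that strips ASCII punctuation stands in for the CoreNLP tokenizer, and an exact match on collapsed forms stands in for spelling correction. Both substitutions are assumptions made for illustration, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; it uses neither Stanford CoreNLP nor liblevenshtein.
class CollapsedMatchDemo {

    // Collapses runs of identical consecutive characters to one character.
    static String collapse(final String term) {
        final StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            final char c = term.charAt(i);
            if (i == 0 || c != term.charAt(i - 1)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Returns the lowercased, punctuation-free tokens of the sentence whose
    // collapsed form equals the collapsed form of the target word.
    static List<String> findVariations(final String target, final String sentence) {
        final String key = collapse(target.toLowerCase(Locale.ROOT));
        final List<String> hits = new ArrayList<>();
        for (final String raw : sentence.split("\\s+")) {
            // Crude stand-in for a real tokenizer: drop ASCII punctuation.
            final String token = raw.toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}", "");
            if (!token.isEmpty() && collapse(token).equals(key)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(final String[] args) {
        System.out.println(findVariations(
            "Awesome", "We had an awwweesssommmeeee!!! dinner, at sea resort."));
    }
}
```

Collapsing the target as well as the tokens keeps words with legitimate double letters findable, but it also conflates words such as "dinner" and "diner". A real spelling-correction step would be needed to tell those apart.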

    【Comments】:
