Applying information extraction with Java: answers

【Question title】: apply extraction information with java
【Posted】: 2015-12-04 16:16:13
【Question】:

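I am trying to apply a dictionary (a file of words) to a text (a text file):

For each line of the text, we test every word in the dictionary. If a word occurs in the line, we print that line.

I used regular expressions (Pattern + Matcher), but the problem is the running time: the operation takes 5 hours.

The two files are 3330 KB and 55 KB. My question is whether there is another way to do this in Java, the way UNITEX does it.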

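import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tratemant_Dic extends Thread {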

    Tratemant_Dic() {

    }

    public void run() {
        try {

            BufferedReader file_corpus = new BufferedReader(
                    new InputStreamReader(new FileInputStream(
                            "corpus-medical.TXT"), "UTF-16LE"));

            PrintWriter ecrire = new PrintWriter("sort.html");
            String line;
            String nom = null;

            ecrire.write("<mot><span style=\"color:red\">startsss</span></mot></br><ligne>start\n");
            while ((line = file_corpus.readLine()) != null) {

                BufferedReader file_nom = new BufferedReader(
                        new InputStreamReader(new FileInputStream(
                                "Fichie_sorte.DIC"), "UTF-16LE"));
                while ((nom = file_nom.readLine()) != null) {
                    nom = nom.substring(0, nom.length() - 3);
                    Pattern p = Pattern.compile("(.*)\\W+" + nom + "\\b.*",
                            Pattern.CASE_INSENSITIVE);
                    Matcher m = p.matcher(line);

                    if (m.find()) {

                        System.out.println(nom + "==>" + line);
                        ecrire.write("<mot><span style=\"color:red\">" + nom
                                + "</span></mot></br><ligne>" + line + "\n");

                    }

                }

                file_nom.close();

            }
            ecrire.close();
            System.out.println("FIN");
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}

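【Comments】:

What are the sizes of the input and the dictionary?
Just looking at the code: 1. Why do you re-read Fichie_sorte.DIC for every line you read from corpus-medical.TXT? Read it once; that alone should save you time. 2. Since Fichie_sorte.DIC does not change, the regular expressions you compile do not need to be recompiled for every line of corpus-medical.TXT. That should also reduce time and memory.
Check Apache Commons IO baeldung.com/java-read-lines-large-file and optimize your regex strategy.
See this SO post: stackoverflow.com/questions/33645806/…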
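The two points in the comment above (read the dictionary once, compile each pattern once) can be sketched as follows. This is a minimal sketch, not the asker's code: the class name `PrecompiledDictionary` and the method `wordsFoundIn` are hypothetical, and the pattern is simplified to `\b` + word + `\b`, which is an assumption about the intended matching rather than the original `(.*)\W+…\b.*` expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: holds one precompiled pattern per dictionary word.
final class PrecompiledDictionary {
    private final List<String> words = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile every pattern once, up front, instead of once per corpus line.
    PrecompiledDictionary(List<String> dictionaryWords) {
        for (String word : dictionaryWords) {
            words.add(word);
            // Pattern.quote keeps regex metacharacters inside a word literal.
            patterns.add(Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
        }
    }

    // Returns the dictionary words that occur in the given line.
    List<String> wordsFoundIn(String line) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(line).find()) {
                found.add(words.get(i));
            }
        }
        return found;
    }
}
```

The dictionary lines would be read from Fichie_sorte.DIC a single time before the corpus loop starts (for example with `Files.readAllLines` and `StandardCharsets.UTF_16LE`), trimmed the same way the question's code trims them, and passed to the constructor.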

Tags: java

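【Solution 1】:

If I understand correctly what you are trying to do, I would not use regular expressions for this. They are slow and you do not need them.

This is really a string-matching problem. Your dictionary should probably be stored in a hash table, using the hashCode() method to obtain the key for each string. You then look up each word of the text in the dictionary, computing the appropriate hash code as you read. Done properly, this should be about as fast as it can get.

Keep in mind that hash codes are not guaranteed to be unique, so even when a hash code is found in the table, always make sure the actual strings match.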
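A minimal sketch of the hash-table idea, assuming every dictionary entry is a single word and that matching should be case-insensitive. The names `HashedDictionary` and `wordsFoundIn` are hypothetical. `java.util.HashSet` already calls `hashCode()` and then `equals()` internally, so the collision check described above comes for free.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: dictionary stored in a hash set, one lookup per word of text.
final class HashedDictionary {
    private final Set<String> words = new HashSet<>();

    HashedDictionary(List<String> dictionaryWords) {
        for (String word : dictionaryWords) {
            words.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the (lower-cased) dictionary words that occur as whole words in the line.
    List<String> wordsFoundIn(String line) {
        List<String> found = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : line.split("[^\\p{L}\\p{N}]+")) {
            String key = token.toLowerCase(Locale.ROOT);
            if (!key.isEmpty() && words.contains(key)) {
                found.add(key);
            }
        }
        return found;
    }
}
```

The work per line now depends on the number of words in the line rather than on the number of dictionary entries. Multi-word dictionary entries would need a different lookup and are outside this sketch.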

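【Discussion】:

OK, I have no problem with that. My question is how to test whether a word exists in a line (a String) without String.indexOf or String.equals. I want to use an approach like Unitex or MITIE github.com/mit-nlp/MITIE
I want to use an information extraction tool in my application.
What you described is word search, but if I understand you, what you are asking for now also involves grammar matching. That is a very complex task. Have I misunderstood you?
No. I am trying to extract information when a word from my dictionary occurs in a line of the text.
I think the problem is that you believe you need to compare substrings of the input against the dictionary strings. You should instead scan the text in a way that detects where each word begins and ends, and then, every time you have read a complete word, look up just that word in the table.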
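The scanning approach in the last comment could look roughly like this. It is a sketch under stated assumptions: a "word" is taken to be a maximal run of letters or digits, the dictionary is passed in as a set of lower-cased single words, and the names `WordScanner` and `scan` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: walks the line once and looks up each complete word.
final class WordScanner {
    static List<String> scan(String line, Set<String> lowerCaseDictionary) {
        List<String> found = new ArrayList<>();
        int start = -1; // index where the current word began, or -1 between words
        // Iterate one position past the end so a word ending the line is flushed too.
        for (int i = 0; i <= line.length(); i++) {
            boolean inWord = i < line.length() && Character.isLetterOrDigit(line.charAt(i));
            if (inWord && start < 0) {
                start = i; // a word begins here
            } else if (!inWord && start >= 0) {
                // A complete word ends just before i: look up only that word.
                String word = line.substring(start, i).toLowerCase(Locale.ROOT);
                if (lowerCaseDictionary.contains(word)) {
                    found.add(word);
                }
                start = -1;
            }
        }
        return found;
    }
}
```

No substring of the line is ever compared against the dictionary unless it is a complete word, which is why a sub-word such as "dcd" inside "sdcd" is not reported.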

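【Solution 2】:

I would start by timing each "thing" your application does and then target the slowest items, rather than improving things based on a guess that may be wrong ("regular expressions are slow"). As Jay's comment mentions, one problem you have is that you load the dictionary every time.

You can do this with System.nanoTime() or one of the many stopwatch utilities. I usually use the one in Guava.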
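A minimal timing sketch using only `System.nanoTime()` from the JDK. The helper name `StepTimer` and the labels are hypothetical; the idea is to wrap each step (loading the dictionary, compiling patterns, matching lines) and compare the printed numbers.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: measures how long one step of the program takes.
final class StepTimer {
    // Runs the step, prints its elapsed wall-clock time, and returns the elapsed nanoseconds.
    static long time(String label, Runnable step) {
        long start = System.nanoTime();
        step.run();
        long elapsedNanos = System.nanoTime() - start;
        System.out.println(label + ": " + TimeUnit.NANOSECONDS.toMillis(elapsedNanos) + " ms");
        return elapsedNanos;
    }
}
```

A call would look like `StepTimer.time("load dictionary", () -> loadDictionary());`, where `loadDictionary` stands for whatever method performs that step. Single measurements on the JVM are noisy, so the numbers are only good for spotting a step that dominates the total.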

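【Discussion】:

The dictionary file has 1632 lines; I cannot keep it in memory.
One word per line? That is not much at all.
Reading the other comments, it sounds like you want NLP. Maybe it is worth indexing everything into something like Lucene and using that?
Why can't you store the dictionary? Assuming 10 characters per word, I get less than a megabyte of storage.
The dictionary file is 55 KB.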

【Solution 3】:

Why don't you use, instead of

 Pattern p = Pattern.compile("(.*)\\W+" + nom + "\\b.*",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(line);

        if (m.find()) {
         ...

just

if(line.indexOf(nom) > -1) {
     ...

Update: if you need the word-boundary behavior, use:

String lineToLowerCase = line.toLowerCase();  // before second while
...
    int index = lineToLowerCase.indexOf(nom.toLowerCase());
    if(index > -1) {
        if(index ==0 || Character.isWhitespace(lineToLowerCase.charAt(index-1))) {
            int indexEnd = index + nom.length();
            if (indexEnd >= lineToLowerCase.length() || !Character.isAlphabetic(lineToLowerCase.charAt(indexEnd))) {
       ...

For testing:

public static void main(String[] s) {
    check("skdc s dcd dsf", "dcd"); // print true
    check("skdc sdcd dsf", "dcd"); // print false
    check("dcd dsf", "dcd"); // print true
    check("afasa dcd", "dcd"); // print true
    check("afasa dCD11", "dcD"); // print true
    check("skdc s dcda dsf", "dcd"); // print false
}

public static void check(String line, String nom) {
    String lineToLowerCase = line.toLowerCase();
    int index = lineToLowerCase.indexOf(nom.toLowerCase());
    if(index > -1) {
        if(index ==0 || Character.isWhitespace(lineToLowerCase.charAt(index-1))) {
            int indexEnd = index + nom.length();
            if (indexEnd >= lineToLowerCase.length() || !Character.isAlphabetic(lineToLowerCase.charAt(indexEnd))) {
                System.out.println("true");
                return;
            }
        }
    }
    System.out.println("false");
}
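One limitation of the check above: it inspects only the first occurrence returned by indexOf, so a line such as "skdc sdcd dcd" prints false even though "dcd" appears later as a separate word. A possible variant that keeps the same boundary rules (whitespace before, non-alphabetic after) but retries on later occurrences is sketched below. The names `WordBoundaryCheck` and `containsWord` are hypothetical, and the method returns a boolean instead of printing.

```java
import java.util.Locale;

// Hypothetical variant of check(): same boundary rules, but every occurrence is tested.
final class WordBoundaryCheck {
    static boolean containsWord(String line, String nom) {
        String haystack = line.toLowerCase(Locale.ROOT);
        String needle = nom.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return false; // an empty word never counts as a match
        }
        int index = haystack.indexOf(needle);
        while (index >= 0) {
            int end = index + needle.length();
            boolean startOk = index == 0
                    || Character.isWhitespace(haystack.charAt(index - 1));
            boolean endOk = end >= haystack.length()
                    || !Character.isAlphabetic(haystack.charAt(end));
            if (startOk && endOk) {
                return true;
            }
            // Boundary test failed here: try the next occurrence.
            index = haystack.indexOf(needle, index + 1);
        }
        return false;
    }
}
```

As in the original check, a digit directly after the word is accepted ("dCD11" matches "dcD"), which differs from the regex `\b`, where digits count as word characters.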

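【Discussion】:

The regex is doing word-boundary matching in the place you are changing.
What counts as a word boundary in your task?
\W+ is one or more non-word characters, and \b is a word boundary.
I am not sure it works exactly the same way as the regex; I would have to run some checks to be certain. But it is probably good enough.
I did not use indexOf() because it also returns the index of sub-words, that is, matches inside longer words.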