【Question title】: How to read data from a text file in Java to extract data using StanfordNLP, rather than reading text from a simple String
【Posted】: 2022-03-04 00:28:48
【Question description】:

I tried Annotation document = new Annotation("This is a simple string"); and I also tried CoreDocument coreDocument = new CoreDocument(text); stanfordCoreNLP.annotate(coreDocument); but I could not work out how to make it read the text from a text file.

【Question comments】:

    Tags: java nlp extract


    【Solution 1】:

    Use it as follows (see the example given here):

    // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution 
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
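    // read some text from the file..
    // note: asCharSource belongs to Guava's com.google.common.io.Files, not java.nio.file.Files
    File inputFile = new File("src/test/resources/sample-content.txt");
    String text = com.google.common.io.Files.asCharSource(inputFile, Charset.forName("UTF-8")).read();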
    
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);
    
    // run all Annotators on this text
    pipeline.annotate(document);
    
    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    
    for(CoreMap sentence: sentences) {
      // traversing the words in the current sentence
      // a CoreLabel is a CoreMap with additional token-specific methods
      for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
        
        System.out.println("word: " + word + " pos: " + pos + " ne:" + ne);
      }
    }

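    Update

    Alternatively, to read the file contents you can use the code below, which relies only on Java's built-in packages, so no external library is needed. Choose the Charset according to the characters in your text file. As described here, "ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException". The example below passes the Charset explicitly.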

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    
    ...
            Path path = Paths.get("sample-content.txt");
            String text = "";
            try {
                // Files.readString requires Java 11+; use StandardCharsets.UTF_8 instead if the file is UTF-8
                text = Files.readString(path, StandardCharsets.ISO_8859_1);
            } catch (IOException e) {
                e.printStackTrace();
            }
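
To make the Charset choice above a little more concrete, here is a minimal sketch that tries strict UTF-8 first and only falls back to ISO-8859-1 when the bytes are not valid UTF-8. The class name `CharsetFallbackReader` and the method `readText` are hypothetical helpers made up for this illustration; they are not part of CoreNLP or the JDK. It uses only `Files.readAllBytes` and `CharsetDecoder`, so it should also work on Java versions older than 11.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration: strict UTF-8 first, ISO-8859-1 as a fallback.
class CharsetFallbackReader {

    static String readText(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to a character, so this never fails,
            // but text in other encodings will decode to the wrong characters.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo input: a lone 0xE9 byte is valid ISO-8859-1 but invalid UTF-8.
        Path path = Files.createTempFile("sample-content", ".txt");
        Files.write(path, new byte[] {'c', 'a', 'f', (byte) 0xE9});
        System.out.println(readText(path));
        Files.delete(path);
    }
}
```

The returned String can then be passed to `new Annotation(text)` exactly as in the first snippet. Whether a silent fallback is acceptable depends on your data; for a corpus with a known encoding, passing that Charset directly is simpler.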
    

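    【Comments】:

    • Thank you very much for your answer, but I get this error: java: cannot find symbol, symbol: method asCharSource(java.io.File,java.nio.charset.Charset), location: class java.nio.file.Files. No candidates were found for the method call Files.asCharSource(dir, StandardCharsets.UTF_8).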
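
The "location: class java.nio.file.Files" part of that error shows that `Files` was resolved to the JDK class, which has no `asCharSource` method; that method lives in Guava's `com.google.common.io.Files`. If adding Guava is not an option, a standard-library replacement for that one call could look roughly like the sketch below. The class name `ReadFileWithoutGuava` and the method `readUtf8` are hypothetical names chosen for this example, and UTF-8 is assumed as in the original snippet.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical stand-in for the Guava call, using only the JDK (Java 7+).
class ReadFileWithoutGuava {

    // Reads the whole file into memory and decodes it as UTF-8.
    static String readUtf8(File inputFile) throws IOException {
        byte[] bytes = Files.readAllBytes(inputFile.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary file so the example runs on its own.
        File inputFile = File.createTempFile("sample-content", ".txt");
        Files.write(inputFile.toPath(),
                "Stanford is in California.".getBytes(StandardCharsets.UTF_8));
        String text = readUtf8(inputFile);
        System.out.println(text);
        inputFile.delete();
    }
}
```

The resulting `text` can be handed to `new Annotation(text)` or `new CoreDocument(text)` in place of the hard-coded string.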