【Question title】: Extract contents of a file using ContentHandler
【Posted】: 2015-09-15 05:57:27
【Question】:

I am trying to extract the contents of a txt file using ContentHandler. My code is below, and the file contains:

Sample content Sample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample

The code below does not print the extracted content. What am I missing here?

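import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class Test {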
private OutputStream outputstream;
private ParseContext context;
private Detector detector;
private Parser parser;
private Metadata metadata;
private String extractedText;

public Test() {
    context = new ParseContext();
    detector = new DefaultDetector();
    parser = new AutoDetectParser(detector);
    context.set(Parser.class, parser);
    outputstream = new ByteArrayOutputStream();
    metadata = new Metadata();
}

public void process(String filename) throws Exception {
    URL url;
    File file = new File(filename);
    if (file.isFile()) {
        url = file.toURI().toURL();
    } else {
        url = new URL(filename);
    }
    // try-with-resources closes the stream even if parsing throws
    try (InputStream input = TikaInputStream.get(url, metadata)) {
        ContentHandler handler = new BodyContentHandler(outputstream);
        parser.parse(input, handler, metadata, context);
    }
}

public void getString() {
    //Get the text into a String object
    extractedText = outputstream.toString();
    //Do whatever you want with this String object.
    System.out.println("extracted text "+extractedText);
}

public static void main(String args[]) throws Exception {
    if (args.length == 1) {
        Test textExtractor = new Test();
        // Use the path or URL passed on the command line, e.g. D:\docs\sample.txt
        textExtractor.process(args[0]);
        textExtractor.getString();
    } else {
        throw new IllegalArgumentException("Usage: Test <file path or URL>");
    }
}
}

【Comments】:

Tags: java string file apache-tika


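【Solution 1】:

Add the Apache tika-parsers dependency in addition to Apache tika-core. tika-core contains only the framework classes such as AutoDetectParser and BodyContentHandler; the format-specific parsers, including the one for plain text, ship in tika-parsers. Without that artifact on the classpath, AutoDetectParser has no parser to delegate to, so nothing is written to the handler.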
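
One quick way to confirm whether the missing dependency is the cause is to check which Tika classes are actually visible at runtime. The sketch below is only a diagnostic, not part of Tika: `ClasspathCheck` and `isOnClasspath` are hypothetical names, and the two class names it probes are assumptions based on the Tika 1.x layout (AutoDetectParser in tika-core, TXTParser in tika-parsers).

```java
// Diagnostic sketch: reports whether selected classes can be loaded from the classpath.
class ClasspathCheck {

    // Returns true if the named class is visible to this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names (Tika 1.x): the first ships in tika-core,
        // the second in tika-parsers.
        String[] candidates = {
            "org.apache.tika.parser.AutoDetectParser",
            "org.apache.tika.parser.txt.TXTParser"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the first class is reported as found and the second as MISSING, the project is most likely built against tika-core alone, which matches the symptom in the question.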

【Discussion】:
