[Posted]: 2015-09-15 05:57:27
[Question]:
I am trying to extract the contents of a txt file using a ContentHandler. My code is below, and the file contains:
Sample content Sample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample
The code below does not print the extracted content. What am I missing here?
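import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class Test {
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public Test() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        // Register the parser so embedded documents are parsed recursively
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        // Accept either a local file path or a URL string
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        // try-with-resources closes the stream even if parsing throws
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            // The handler writes the body text into outputstream
            ContentHandler handler = new BodyContentHandler(outputstream);
            parser.parse(input, handler, metadata, context);
        }
    }

    public void getString() {
        // Get the text into a String object
        extractedText = outputstream.toString();
        // Do whatever you want with this String object.
        System.out.println("extracted text " + extractedText);
    }

    public static void main(String[] args) throws Exception {
        // Requires exactly one argument, although the path below is hard-coded
        if (args.length == 1) {
            Test textExtractor = new Test();
            textExtractor.process("D:\\docs\\sample.txt");
            textExtractor.getString();
        } else {
            throw new Exception();
        }
    }
}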
[Discussion]:
- Have you added the tika-parsers dependency in addition to tika-core? If not, add the required dependency and try again.
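One way to test the dependency suggestion is to check at runtime whether a parser class from tika-parsers can be loaded. The following is only a minimal sketch using plain reflection. The class `TikaClasspathCheck` and the helper `isOnClasspath` are hypothetical names made up for illustration, and `org.apache.tika.parser.txt.TXTParser` is assumed to be the plain-text parser shipped with tika-parsers, so verify the class name against the Tika version you use.

```java
final class TikaClasspathCheck {

    // Assumed class name of the plain-text parser from tika-parsers;
    // verify it against the Tika version in use.
    static final String TXT_PARSER = "org.apache.tika.parser.txt.TXTParser";

    /** Returns true if the named class can be located on the classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isOnClasspath(TXT_PARSER)) {
            System.out.println("TXTParser found: tika-parsers appears to be on the classpath");
        } else {
            System.out.println("TXTParser not found: AutoDetectParser may have no parser for text files");
        }
    }
}
```

If the check reports that the class is missing, with only tika-core present, an empty extraction result would be consistent with the symptom described in the question.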
Tags: java string file apache-tika