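[Posted]: 2018-04-16 11:27:50
[Question]:
I am building a project in Eclipse using sbt-assembly. My build.sbt file is very large and complicated because I have had to resolve a lot of dependency conflicts.
Everything works fine for pdf, pptx, odt and docx files using the PDF, OOXML and OpenDocument parsers from Tika 1.16. However, when I try to parse a txt file (UTF-8 encoded) with TXTParser, I get the following error: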
org.apache.tika.exception.TikaException: Failed to detect the character encoding of a document
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:77)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:108)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:114)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:79)
It is raised from this line in my Scala code:
val content = theParser.parse(stream.open(), chandler, meta, pContext)
where stream is a PortableDataStream, chandler is a new BodyContentHandler, meta is a new Metadata and pContext is a new ParseContext.
If I use AutoDetectParser instead, I get the following error:
org.apache.jena.shared.SyntaxError: unknown
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:73)
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:58)
    at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:305)
It is raised from this line in my Scala code:
val response = model.read(stream, null, "N-TRIPLES")
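where stream is an InputStream.
I believe this is caused by an empty response from Tika (so it is the same underlying problem).
I am fairly sure this is a dependency problem in my overly complicated build.sbt file, but after hours of trying I definitely need help.
On the positive side, everything runs perfectly when there are no txt files in the input, so this may be my last problem!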
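One way to narrow down a suspected packaging problem like this is to check whether the service registration files that Tika relies on are still visible on the runtime classpath. The sketch below is only a diagnostic under stated assumptions: it assumes that Tika 1.16 discovers its encoding detectors through a file named META-INF/services/org.apache.tika.detect.EncodingDetector, and the object name ServiceFileCheck is hypothetical. It uses only the standard library.

```scala
import scala.collection.JavaConverters._

// Hypothetical helper: lists every copy of a classpath resource.
object ServiceFileCheck {
  // Returns the URLs of all resources with the given path that the
  // class loader of this object can see (empty list if there are none).
  def findResources(path: String): List[java.net.URL] =
    getClass.getClassLoader.getResources(path).asScala.toList

  def main(args: Array[String]): Unit = {
    // Assumed name of the service file used for encoding detectors.
    val path = "META-INF/services/org.apache.tika.detect.EncodingDetector"
    val found = findResources(path)
    if (found.isEmpty) println(s"No $path on the classpath")
    else found.foreach(url => println(s"Found: $url"))
  }
}
```

If this reports a file when run from the Eclipse build path but none when run from the assembled jar, the file is being dropped during assembly rather than by dependency resolution.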
Finally, here is my build.sbt file, which I build with sbt clean assembly:
scalaVersion := "2.11.8"
version := "1.0.0"
name := "crawldocs"
conflictManager := ConflictManager.strict
mainClass in assembly := Some("com.addlesee.crawling.CrawlHiccup")
libraryDependencies ++= Seq(
  "org.apache.tika" % "tika-core" % "1.16",
  "org.apache.tika" % "tika-parsers" % "1.16" excludeAll(
    ExclusionRule(organization = "*", name = "guava")
  ),
  "com.blazegraph" % "bigdata-core" % "2.0.0" excludeAll(
    ExclusionRule(organization = "*", name = "collection-0.7"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "commons-logging"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "httpmime"),
    ExclusionRule(organization = "*", name = "jackson-annotations"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-cmds"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "jena-tdb"),
    ExclusionRule(organization = "*", name = "jsonld-java"),
    ExclusionRule(organization = "*", name = "libthrift"),
    ExclusionRule(organization = "*", name = "log4j"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "xercesImpl"),
    ExclusionRule(organization = "*", name = "xml-apis")
  ),
  "org.scalaj" %% "scalaj-http" % "2.3.0",
  "org.apache.jena" % "apache-jena" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.jena" % "apache-jena-libs" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.noggit" % "noggit" % "0.6",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2" excludeAll(
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.spark" % "spark-core_2.11" % "2.2.0" excludeAll(
    ExclusionRule(organization = "*", name = "breeze_2.11"),
    ExclusionRule(organization = "*", name = "hadoop-hdfs"),
    ExclusionRule(organization = "*", name = "hadoop-annotations"),
    ExclusionRule(organization = "*", name = "hadoop-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-app"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-core"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-jobclient"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-shuffle"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-api"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-client"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-web-proxy"),
    ExclusionRule(organization = "*", name = "activation"),
    ExclusionRule(organization = "*", name = "hive-exec"),
    ExclusionRule(organization = "*", name = "scala-compiler"),
    ExclusionRule(organization = "*", name = "spire_2.11"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "guava"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "bcprov-jdk15on"),
    ExclusionRule(organization = "*", name = "jul-to-slf4j"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "curator-framework")
  ),
  "org.scala-lang" % "scala-xml" % "2.11.0-M4",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "netty")
  ),
  "org.apache.hadoop" % "hadoop-common" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-math3"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jets3t"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "commons-net"),
    ExclusionRule(organization = "*", name = "curator-recipes"),
    ExclusionRule(organization = "*", name = "jsr305")
  )
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
[Comments]:
-
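If you grab the Tika App standalone jar and try that, is it able to process your file? That would tell you whether this is a Tika bug or a problem with how you are including Tika in your project.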
-
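In my Eclipse project I have the tika-app and tika-parsers standalone jars on my build path, and that works fine... I have tried including only the tika-app dependency in my build.sbt, but I can't remember what the problem was. I will test again and post the result here on Tuesday (to test the built jar I need services that I don't have at home). Thanks for the reply!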
-
MergeStrategy.concat looks like a good fit for "META-INF/services/*" (I don't use sbt).
-
OK, I have changed my build.sbt to use tika-app as a dependency instead of core and parsers. It behaves in exactly the same way, annoyingly...
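The MergeStrategy.concat suggestion from the comments could be sketched roughly as follows. This is an untested fragment for build.sbt, not a confirmed fix: it assumes the failure comes from the blanket META-INF discard rule removing the service registration files under META-INF/services, and it relies on the first matching case winning, so the services rule has to come before the discard rule.

```scala
// Sketch only: keep service registration files, discard the rest of META-INF.
assemblyMergeStrategy in assembly := {
  // Concatenate same-named service files from different jars so that
  // every registered provider survives in the assembled jar.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  // Everything else under META-INF (manifests, signatures) is still dropped.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
```

MergeStrategy.filterDistinctLines is an alternative to concat here if duplicate provider lines turn out to be a concern.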
Tags: scala sbt jena apache-tika sbt-assembly