【Title】: Apache Tika 1.16 TXTParser Failed to detect character encoding in sbt build
【Posted】: 2018-04-16 11:27:50
【Question】:

I am building a project in Eclipse with sbt assembly. My build.sbt is very large and complicated because I have had to resolve many dependency conflicts.

Everything works for pdf, pptx, odt and docx files using the PDF, OOXML and OpenDocument parsers from Tika 1.16. However, when I try to parse a txt file (UTF-8 encoded) with TXTParser, I get the following error:

org.apache.tika.exception.TikaException: Failed to detect the character encoding of a document
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:77)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:108)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:114)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:79)

It is thrown from this line in my Scala code:

val content = theParser.parse(stream.open(), chandler, meta, pContext)

where stream is a PortableDataStream, chandler is a new BodyContentHandler, meta is a new Metadata, and pContext is a new ParseContext.

If I use AutoDetectParser instead, I get the following error:

org.apache.jena.shared.SyntaxError: unknown
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:73)
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:58)
    at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:305)

It is thrown from this line in my Scala code:

val response = model.read(stream, null, "N-TRIPLES")

where stream is an InputStream.

I assume this is caused by an empty response from Tika (so it is the same underlying problem).

I'm fairly sure this is a dependency problem in my overly complicated build.sbt, but after hours of trying I really need help.

On the positive side, everything runs perfectly when the input contains no txt files, so this may be my last problem!

Finally, here is my build.sbt, which I build with sbt clean assembly:

scalaVersion := "2.11.8"
version      := "1.0.0"
name := "crawldocs"
conflictManager := ConflictManager.strict
mainClass in assembly := Some("com.addlesee.crawling.CrawlHiccup")
libraryDependencies ++= Seq(
  "org.apache.tika" % "tika-core" % "1.16",
  "org.apache.tika" % "tika-parsers" % "1.16" excludeAll(
    ExclusionRule(organization = "*", name = "guava")
  ),
  "com.blazegraph" % "bigdata-core" % "2.0.0" excludeAll(
    ExclusionRule(organization = "*", name = "collection-0.7"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "commons-logging"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "httpmime"),
    ExclusionRule(organization = "*", name = "jackson-annotations"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-cmds"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "jena-tdb"),
    ExclusionRule(organization = "*", name = "jsonld-java"),
    ExclusionRule(organization = "*", name = "libthrift"),
    ExclusionRule(organization = "*", name = "log4j"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "xercesImpl"),
    ExclusionRule(organization = "*", name = "xml-apis")
  ),
  "org.scalaj" %% "scalaj-http" % "2.3.0",
  "org.apache.jena" % "apache-jena" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.jena" % "apache-jena-libs" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.noggit" % "noggit" % "0.6",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2" excludeAll(
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.spark" % "spark-core_2.11" % "2.2.0" excludeAll(
    ExclusionRule(organization = "*", name = "breeze_2.11"),
    ExclusionRule(organization = "*", name = "hadoop-hdfs"),
    ExclusionRule(organization = "*", name = "hadoop-annotations"),
    ExclusionRule(organization = "*", name = "hadoop-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-app"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-core"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-jobclient"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-shuffle"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-api"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-client"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-web-proxy"),
    ExclusionRule(organization = "*", name = "activation"),
    ExclusionRule(organization = "*", name = "hive-exec"),
    ExclusionRule(organization = "*", name = "scala-compiler"),
    ExclusionRule(organization = "*", name = "spire_2.11"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "guava"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "bcprov-jdk15on"),
    ExclusionRule(organization = "*", name = "jul-to-slf4j"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "curator-framework")
  ),
  "org.scala-lang" % "scala-xml" % "2.11.0-M4",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "netty")
  ),
  "org.apache.hadoop" % "hadoop-common" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-math3"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jets3t"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "commons-net"),
    ExclusionRule(organization = "*", name = "curator-recipes"),
    ExclusionRule(organization = "*", name = "jsr305")
  )
)
assemblyMergeStrategy in assembly := {
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}

【Comments】:

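  • If you grab the Tika App standalone jar and try it, can it process your file? That would tell you whether this is a Tika bug or a problem with how you include Tika in your project.
  • In my Eclipse project I have the tika-app and tika-parsers standalone jars on the build path, and that works fine... I did try including only the tika-app dependency in my build.sbt, but I can't remember what the problem was. I'll test again and post the result here on Tuesday (testing the built jar requires services I don't have at home). Thanks for the reply!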
  • MergeStrategy.concat looks like a good fit for "META-INF/services/*" (I don't use sbt).
  • OK, I've changed my build.sbt to depend on tika-app instead of core and parsers. It behaves exactly the same way, annoyingly...

Tags: scala sbt jena apache-tika sbt-assembly


【Answer 1】:

The code above invokes the old N-Triples parser, which exists only for legacy reasons. The legacy reader is ASCII-only; UTF-8 will break it.

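Either apache-jena-libs (which is type=pom) is not being processed, or you are repackaging the jars without handling META-INF/services, where Java's ServiceLoader files are placed. Jena uses it for initialization. You must combine the META-INF/services/* files by concatenating files of the same name.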

Details: https://jena.apache.org/documentation/notes/jena-repack.html
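
As a rough sketch of what concatenating the services files could look like with sbt-assembly: this fragment is not taken from the answer and is untested against the question's build. It uses the same 0.13-style syntax as the question's build.sbt and assumes the sbt-assembly plugin is already enabled. The only change from the question's strategy is the first case, which matches files under META-INF/services before the broader META-INF rule discards them.

```scala
// build.sbt fragment (assumes sbt-assembly is already configured for the project)
assemblyMergeStrategy in assembly := {
  // ServiceLoader registration files: append same-named files from all jars
  // so that every provider line survives the merge.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  // Everything else under META-INF is still dropped, as in the question's build.
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  // Any other conflict: keep the first copy found.
  case _                                         => MergeStrategy.first
}
```

Order matters here, because the cases are tried top to bottom. Limiting concat to META-INF/services also keeps it away from .class files, which are binary and cannot be meaningfully concatenated. sbt-assembly also offers MergeStrategy.filterDistinctLines, which concatenates while skipping repeated lines.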

【Comments】:

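  • I'm not ignoring this; I just can't test it until Tuesday, because the built jar needs services I can't access from home. I do have a few questions: if txt is the problem, how do the other file types work with the N-Triples parser? Should I try adding maven-shade-plugin as a dependency? I handle the META-INF merge at the bottom of build.sbt using the first strategy. Should I change that? Thanks for your help and the link!
  • MergeStrategy.concat looks like a good fit for "META-INF/services/*" (though I don't use sbt).
  • That produces the following error: Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.ClassFormatError: Extra bytes at the end of class file. I'll try other merge strategies.
  • That is a serious error from the platform. The merge must produce a multi-line META-INF/services/org.apache.jena.system.JenaSubsystemLifecycle.
  • I've been experimenting with the merge strategy by adding lines like case x if x.contains("txt") => MergeStrategy.concat, but I haven't found the right class to concatenate. Do you know which class I need to put in the quotes?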

【Answer 2】:

Finally fixed it...

I added case x if x.contains("EncodingDetector") => MergeStrategy.deduplicate above the discard line in the merge strategy. The following assemblyMergeStrategy at the bottom of build.sbt solved my problem:

assemblyMergeStrategy in assembly := {
 case x if x.contains("EncodingDetector") => MergeStrategy.deduplicate
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}
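
One way to check whether an assembled jar still carries the ServiceLoader registration files is to list them from the classpath at runtime. The following is a minimal sketch that uses only the JDK and the Scala standard library. The object name ServiceFileCheck and its providers helper are hypothetical (they are not part of Tika or Jena), and the two service names are simply the ones mentioned in this thread.

```scala
import scala.collection.JavaConverters._
import scala.io.Source

// Hypothetical diagnostic helper; not part of Tika or Jena.
object ServiceFileCheck {

  /** Provider class names registered under META-INF/services/<serviceInterface>,
    * collected from every copy of that file visible to this class loader. */
  def providers(serviceInterface: String): List[String] = {
    val path = s"META-INF/services/$serviceInterface"
    val urls = getClass.getClassLoader.getResources(path).asScala.toList
    urls.flatMap { url =>
      val src = Source.fromURL(url, "UTF-8")
      try {
        src.getLines()
          .map(_.takeWhile(_ != '#').trim) // strip comments and whitespace
          .filter(_.nonEmpty)
          .toList                          // force the read before closing
      } finally src.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Service names taken from this thread.
    val services = Seq(
      "org.apache.tika.detect.EncodingDetector",
      "org.apache.jena.system.JenaSubsystemLifecycle"
    )
    for (service <- services) {
      val found = providers(service)
      println(s"$service: ${found.size} provider(s)")
      found.foreach(p => println(s"  $p"))
    }
  }
}
```

If this reports zero providers for a service when run with the assembled jar on the classpath, that would suggest the merge strategy dropped the corresponding file.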

【Comments】:

  • Could you explain how to solve this in Java?