【问题标题】:Spark select query fails on a large dataset in hive table在 Hive 表中的大型数据集上 Spark 选择查询失败
【发布时间】:2018-04-19 12:57:22
【问题描述】:

我下面的代码是使用 spark 从蜂巢表中读取数据。 该表中有 1 亿条记录。当我在我的 Rdd 中选择这么多记录并尝试执行 result.show() 时,它会给出严重的问题异常。

我基本上想通过从该表中选择几列来插入其他表中的记录,以获得 1 亿条记录集。

这是我的代码:

import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

val result=sqlContext.sql("Select * from ******reception.recp_customer")

result: org.apache.spark.sql.DataFrame = [data_source_id: smallint, customer_bkey: string ... 129 more fields]

result.show()

java.lang.RuntimeException: serious problem
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1064)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1091)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2773)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2803)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
  ... 52 elided
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0000312_0000"
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1041)
  ... 94 more
Caused by: java.lang.NumberFormatException: For input string: "0000312_0000"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:441)
  at java.lang.Long.parseLong(Long.java:483)
  at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:323)
  at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:394)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:658)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:648)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:645)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:421)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:645)
  at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:626)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
  at java.lang.Thread.run(Thread.java:748)

不知道是什么原因造成的。我知道如何处理它的数据集很大。

【问题讨论】:

  • 明确说明NumberFormatException
  • 嗨,如果我在 hive shell 中触发相同的查询,它工作正常。所以,我还使用 hive 使用此表中的记录在其他一些表上触发了插入查询,尽管我必须为歪斜的数据。在我尝试的火花中,它给出了错误。假设它是 1 亿数据集的问题。

标签: apache scala hadoop apache-spark-sql spark-dataframe


【解决方案1】:

看起来您的hive 表是ACID 表。对于acid 表,您只能使用hive 来查询,您不能使用spark 来查询它们,因为spark 尚不支持此功能。

您可以关注以下JIRA票务以供参考 https://issues.apache.org/jira/browse/SPARK-15348

【讨论】:

    【解决方案2】:

    在您的 hive 表上运行主要压缩,然后尝试从 spark shell 或您的代码中读取您的表,它必须工作。

    连接hive后运行

    ALTER TABLE .recp_customer COMPACT 'major';

    【讨论】:

      【解决方案3】:

      行: 引起:java.lang.NumberFormatException:对于输入字符串:“0000312_0000” 表明您正在尝试使用字符串值,因为它是数字。看看吧。

      【讨论】:

      • 嗨,如果我在 hive shell 中触发相同的查询,它工作正常。所以,我还使用 hive 使用此表中的记录在其他一些表上触发了插入查询,尽管我必须为歪斜的数据。在我尝试的火花中,它给出了错误。假设它是 1 亿数据集的问题。
      猜你喜欢
      • 1970-01-01
      • 2017-02-22
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2018-11-19
      • 2020-04-28
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多