【问题标题】:Concat in SparkSQLSparkSQL 中的连接
【发布时间】:2016-08-26 14:32:32
【问题描述】:

我有一个 Phoenix 架构表(电子邮件、产品名称、购买日期、数量)。 我将这张表加载到 Spark Dataframe 中进行处理。

Email        |   ProductName   | PurchaseDate             | Quantity
a@test.com       Dell             2016-03-31 14:30:00.0         5
b@test.com       Lenovo           2016-03-31 14:30:00.0         2
a@test.com       Intel            2016-04-21 14:30:00.0         14
c@test.com       Lenovo           2016-06-31 14:30:00.0         3
a@test.com       Nokia            2016-03-21 14:30:00.0         5

输入将是一个 emailid 和一个截止日期,输出应该是以下格式: 例如:输入 'a@test.com' 和 Date 小于 '2016-04-01 00:00:00.0

电子邮件 |产品列表 |总数(量 a@test.com 诺基亚 --> 戴尔 --> 英特尔 19

这基本上是列和数量之和的连接。

我无法获得上述格式的输出。

有什么帮助吗? 提前致谢!

更新: 正如@Daniel de Paula 所建议的那样,由于collect_list 属于HiveContext,我已更改为val sqlContext = new HiveContext(sc)。我收到了这个错误,
2611 [main] INFO DataNucleus.Persistence - 属性 datanucleus.cache.level2 未知 - 将被忽略

2611 [main] INFO  DataNucleus.Persistence  - Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
4805 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
4806 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
5859 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
5859 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
6671 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
7785 [main] INFO  DataNucleus.Persistence  - Property datanucleus.cache.level2 unknown - will be ignored
7785 [main] INFO  DataNucleus.Persistence  - Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
10488 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
10489 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
10778 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
10778 [main] INFO  DataNucleus.Datastore  - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
11008 [main] INFO  DataNucleus.Query  - Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
11128 [main] ERROR org.apache.hadoop.hive.metastore.ObjectStore  - Version information found in metastore differs 2.1.0 from expected schema version 1.2.0. Schema verififcation is disabled hive.metastore.schema.verification so setting version.
11271 [main] WARN  hive.ql.metadata.Hive  - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
    at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
    at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
    at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
    at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:229)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    at SimpleApp$.<init>(SimpleApp.scala:50)
    at SimpleApp$.<clinit>(SimpleApp.scala)
    at SimpleApp.main(SimpleApp.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
    ... 27 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
    ... 33 more
Caused by: java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B
    at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.setCreateTimeIsSet(PrivilegeGrantInfo.java:245)
    at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.<init>(PrivilegeGrantInfo.java:163)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles_core(HiveMetaStore.java:675)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles(HiveMetaStore.java:645)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:462)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
    ... 38 more
11288 [main] INFO  DataNucleus.Query  - Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
Exception in thread "main" java.lang.ExceptionInInitializerError
    at SimpleApp.main(SimpleApp.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
    at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
    at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:229)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    at SimpleApp$.<init>(SimpleApp.scala:50)
    at SimpleApp$.<clinit>(SimpleApp.scala)
    ... 10 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
    ... 23 more
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
    ... 24 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
    ... 30 more
Caused by: java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B
    at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.setCreateTimeIsSet(PrivilegeGrantInfo.java:245)
    at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.<init>(PrivilegeGrantInfo.java:163)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles_core(HiveMetaStore.java:675)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles(HiveMetaStore.java:645)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:462)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
    ... 35 more

【问题讨论】:

    标签: apache-spark apache-spark-sql phoenix


    【解决方案1】:

    我的建议是进行分组并将产品名称收集为列表;然后,您使用 UserDefinedFunction 连接列表的值:

    import org.apache.spark.sql.functions._
    
    val newDF = df.groupBy("Email").agg(
      collect_list("ProductName").as("ProductList"),
      sum("Quantity").as("TotalQuantity")
    )
    
    val concatList = (xs: Seq[String]) => xs.foldLeft("")({
      case (s1, s2) => if (s1 == "") s2 else s1 + " --> " + s2
    })
    
    val myUDF = udf(concatList)
    
    val result = newDF.withColumn("ProductList", myUDF(col("ProductList")))
    

    【讨论】:

    • 你能检查更新的问题吗?
    猜你喜欢
    • 2015-05-14
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2017-01-06
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多