【Question title】: RDD join not returning keys that have an ID starting with a letter
【Posted】: 2016-01-07 05:27:38
【Question】:

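I am seeing a very strange problem when joining RDDs with Spark's join. I have two RDDs with the same key. One comes from server access logs and shows what the client tried to buy, including a specific order ID added by the client. Its structure is: OrderKey(ClientID, UserID, ClientsOrderID, ItemOrdered, Price), OrderValues(nanosecond timestamp, delay timestamp, grouping number)

Sample records: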

(OrderKey(CLI1,USR1,BDC11111222,APPLE,0.8031),OrderValues(1431698956999379357,12176,143169895699))
(OrderKey(CLI1,USR1,PRO22222223333,PEAR,0.8031),OrderValues(1431698956999367181,0,143169895699))
(OrderKey(CLI3,USR1,10000956556,ORANGE,4.0555),OrderValues(1431676103249289077,132193,143167610324))
(OrderKey(CLI2,USR2,PRO33335555,ORANGE,0.8031),OrderValues(1431698956999369005,1824,143169895699))
(OrderKey(CLI4,USR1,418,ORANGE,0.8038),OrderValues(1431676103249156884,0,143167610324))
(OrderKey(CLI5,USR1,15D11111999,TOMATO,0.8052),OrderValues(1431651108750149274,0,143165110875))
(OrderKey(CLI6,USR2,21698,TOMATO,0.8052),OrderValues(1431651108749265019,10976,143165110874))

I then try to join this with my database data, recorded when the order was actually placed. It has the same OrderKey, but its value holds the DB details:

DbDetails(dbOrderDateTime,dbOrderNo,quantity,hasBeenDelivered, typeOfDelivery)

(OrderKey(CLI1,USR1,BDC11111222,APPLE,0.8031),DbDetails(15-may-15 14:09:17.002,877490,1,false,AUTOMATIC))
(OrderKey(CLI1,USR1,PRO22222223333,PEAR,0.8031),DbDetails(15-may-15 14:09:17.002,877487,1,false,AUTOMATIC))
(OrderKey(CLI3,USR1,10000956556,ORANGE,4.0555),DbDetails(15-may-15 07:48:23.251,255857,2,false,AUTOMATIC))
(OrderKey(CLI2,USR2,PRO33335555,ORANGE,0.8031),DbDetails(15-may-15 14:09:17.002,877488,1,false,AUTOMATIC))
(OrderKey(CLI4,USR1,418,ORANGE,0.8038),DbDetails(15-may-15 07:48:23.251,822188,1,false,AUTOMATIC))
(OrderKey(CLI5,USR1,15D11111999,TOMATO,0.8052),DbDetails(15-may-15 00:51:48.752,769075,1,false,AUTOMATIC))
(OrderKey(CLI6,USR2,21698,TOMATO,0.8052),DbDetails(15-may-15 00:51:48.752,769070,1,false,AUTOMATIC))

I am trying to join the RDDs as follows:

val fullOrderDetails = accessRDD.join(dbRDD).map{
  case (orderKey,dbDetails) =>
    FullOrderDetails(
      dbDetails._1.orderDate,
      orderKey.clientName, orderKey.userName,orderKey.market,dbDetails._1.orderID,
      rK.clientOrderID,orderKey.price,  dbDetails._1.orderQty,
      dbDetails._1.entryType,      dbDetails._1.versionReason,dbDetails._1.userType,
      dbDetails._2.accessTs,dbDetails._2.krakenTsDelta, dbDetails._2.groupingNumber 
    )
}

Any idea why, when I output the resulting RDD, the only results returned are the ones whose ID starts with a number?

Thanks!

【Comments on the question】:

  • What do you mean by "starting with a number"? What are OrderKey, DbDetails, and FullOrderDetails? Please provide an MCVE.

Tags: scala join apache-spark rdd


【Answer 1】:

Assuming your first set of samples is accessRDD and your second set is dbRDD, the case pattern in the subsequent map looks wrong. It should be:

...
case (orderKey: OrderKey, (orderValues: OrderValues, dbDetails: DbDetails)) =>
...

This is because join produces a pair RDD whose value is a Tuple2 of the values from the two joined RDDs.
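As an illustration of that shape, here is a minimal sketch rather than the asker's actual code. It assumes simplified stand-ins for the case classes (the field names are guesses based on the sample records in the question) and a local SparkContext; `JoinShapeSketch` and `summarize` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Simplified stand-ins; field names are assumptions, not the asker's definitions.
case class OrderKey(clientId: String, userId: String, clientOrderId: String,
                    item: String, price: Double)
case class OrderValues(accessTs: Long, delta: Long, groupingNumber: Long)
case class DbDetails(dbOrderDateTime: String, dbOrderNo: Long, quantity: Int)

object JoinShapeSketch {
  // Destructure one joined record: the value is a Tuple2 holding both sides' values.
  def summarize(record: (OrderKey, (OrderValues, DbDetails))): (String, Long, Long) =
    record match {
      case (orderKey, (orderValues, dbDetails)) =>
        (orderKey.clientOrderId, orderValues.accessTs, dbDetails.dbOrderNo)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-shape").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val key = OrderKey("CLI1", "USR1", "BDC11111222", "APPLE", 0.8031)
      val accessRDD = sc.parallelize(
        Seq(key -> OrderValues(1431698956999379357L, 12176L, 143169895699L)))
      val dbRDD = sc.parallelize(
        Seq(key -> DbDetails("15-may-15 14:09:17.002", 877490L, 1)))

      // accessRDD.join(dbRDD) has type RDD[(OrderKey, (OrderValues, DbDetails))].
      accessRDD.join(dbRDD).map(summarize).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because accessRDD is on the left of the join, its OrderValues comes first in the tuple and the DbDetails from dbRDD comes second.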

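【Comments】:

  • Rohan, thanks for the Scala style suggestion. It turned out that this was not the problem; the cause was the format of the user names. That has been fixed and it works now.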
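The thread does not say what exactly was wrong with the user name format. As a hedged sketch of how such a key mismatch could be spotted, the example below assumes, purely for illustration, a trailing space in one user name and simplified (client, user) keys. It lists the keys that have no exact match using subtractByKey, then normalizes the keys before joining. `UnmatchedKeysSketch` and `normalize` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnmatchedKeysSketch {
  // Hypothetical normalization: trim whitespace from both key fields.
  def normalize(key: (String, String)): (String, String) =
    (key._1.trim, key._2.trim)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("unmatched-keys").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up data: the second user name on the access-log side has a trailing space.
      val accessRDD = sc.parallelize(Seq(("CLI1", "USR1") -> 1, ("CLI2", "USR2 ") -> 2))
      val dbRDD = sc.parallelize(Seq(("CLI1", "USR1") -> "a", ("CLI2", "USR2") -> "b"))

      // Keys present in accessRDD that have no exactly equal key in dbRDD.
      val unmatched = accessRDD.subtractByKey(dbRDD).keys.collect()
      unmatched.foreach(k => println(s"no match for key: '$k'"))

      // Join again after normalizing the keys on the access-log side.
      val normalized = accessRDD.map { case (key, value) => (normalize(key), value) }
      normalized.join(dbRDD).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Quoting the key when printing makes invisible differences such as stray whitespace easier to see.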