【Question title】: PySpark: splitting a list nested inside a list of tuples
【Posted】: 2017-08-15 22:14:49
【Question description】:

I have the following:

[('HOMICIDE', [('2017', 1)]), 
 ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)]), 
 ('ROBBERY', [('2017', 1)])]

How can I convert it to:

[('HOMICIDE', ('2017', 1)), 
 ('DECEPTIVE PRACTICE', ('2015', 10)), 
 ('DECEPTIVE PRACTICE', ('2014', 3)), 
 ('DECEPTIVE PRACTICE', ('2017', 14)), 
 ('DECEPTIVE PRACTICE', ('2016', 14))]

When I try to use map, it throws "AttributeError: 'list' object has no attribute 'map'":

rdd = sc.parallelize([('HOMICIDE', [('2017', 1)]), ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)])])
y = rdd.map(lambda x : (x[0],tuple(x[1])))

【Comments】:

    Tags: python apache-spark pyspark


    【Solution 1】:

    map is a method on an RDD, not on a Python list, so you need to parallelize the list first. You can then use flatMap to flatten the inner lists:

    rdd = sc.parallelize([('HOMICIDE', [('2017', 1)]), 
                          ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)]), 
                          ('ROBBERY', [('2017', 1)])])
    
    rdd.flatMap(lambda x: [(x[0], y) for y in x[1]]).collect()
    
    # [('HOMICIDE', ('2017', 1)), 
    #  ('DECEPTIVE PRACTICE', ('2017', 14)), 
    #  ('DECEPTIVE PRACTICE', ('2016', 14)), 
    #  ('DECEPTIVE PRACTICE', ('2015', 10)), 
    #  ('DECEPTIVE PRACTICE', ('2013', 4)), 
    #  ('DECEPTIVE PRACTICE', ('2014', 3)), 
    #  ('ROBBERY', ('2017', 1))]
    
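To see what flatMap does to each record without a Spark context, here is a minimal pure-Python sketch of the same flattening. The helper name `flat_map` is hypothetical and not part of PySpark; it only mimics the semantics of `RDD.flatMap` on a local list, and the sample data is a shortened copy of the data in the question.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and concatenate the resulting iterables."""
    return list(chain.from_iterable(func(r) for r in records))


# Shortened copy of the question's data, held in a plain Python list.
data = [
    ('HOMICIDE', [('2017', 1)]),
    ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14)]),
    ('ROBBERY', [('2017', 1)]),
]

# Same lambda as in the answer above: one (key, item) pair per inner item.
pairs = flat_map(lambda x: [(x[0], y) for y in x[1]], data)
print(pairs)
```

On an actual RDD of key/value pairs, `rdd.flatMapValues(lambda v: v)` should give the same pairs, since it flattens each value while keeping its key.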

    【Comments】:

      【Solution 2】:

      How about a list comprehension instead?

      # Note: this works when rdd is a plain Python list, not a Spark RDD.
      y = [(x[0], i) for x in rdd for i in x[1]]
      

      This returns:

      [('HOMICIDE', ('2017', 1)), ('DECEPTIVE PRACTICE', ('2017', 14)), ('DECEPTIVE PRACTICE', ('2016', 14)), ('DECEPTIVE PRACTICE', ('2015', 10)), ('DECEPTIVE PRACTICE', ('2013', 4)), ('DECEPTIVE PRACTICE', ('2014', 3))]
      
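The comprehension above only works on an ordinary Python list. The following is a minimal sketch, using a hypothetical local variable `data` with a shortened copy of the question's data, of why calling `.map` on a list fails while the comprehension succeeds.

```python
# Shortened copy of the question's data, held in a plain Python list.
data = [
    ('HOMICIDE', [('2017', 1)]),
    ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14)]),
]

# A plain list has no .map method; this is the AttributeError from the question.
try:
    data.map(lambda x: x)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# The list comprehension iterates the list directly and flattens the inner lists.
flattened = [(key, item) for key, items in data for item in items]

print(error_message)
print(flattened)
```

If the data lives in an RDD, `rdd.collect()` returns it as a local list first, at the cost of pulling every record to the driver.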

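      【Comments】:

      • It works fine in plain Python, but when I use PySpark I have to move the data to disk... I think it was my mistake not to mention sc.parallelize in my question. Thanks @asongtoruin
      • @SachinSukumaran My mistake! In any case, the other answer seems to have it covered.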