Create a dictionary with word length as key and sorted words as value in Spark? (Answers)

【Question title】: Create a dictionary with word length as key and sorted words as value in Spark?
【Posted】: 2018-07-01 02:49:47
【Question】:

I am new to Spark and I am trying to create a dictionary like this:

{4: {'aenr': ['earn', 'rane'], 'aerr': ['rare', 'rear'], 'aenw': ['anew', 'wane', 'wean'], 'derw': ['drew']}}

Basically, this is the structure I want to build with Spark:

{len(word): {sorted(word): [word1, word2, etc]}}

I have a large file of English words, structured like this:

{
  "biennials": 0, 
  "tripolitan": 0, 
  "oblocutor": 0, 
  "leucosyenite": 0, 
  "chilitis": 0, 
  "fabianist": 0, 
  "diazeutic": 0, 
  "alible": 0, 
  "deciet":0
}

So I want to read the file line by line and create an RDD that holds this:

{len(word): {sorted(word): [word1, word2, etc]}}

I tried this:

    r = rdd.map(lambda x: {len(x):sorted(x)})


    items = r.flatMap(lambda line: (line.items()))
    items.take(items.count())
    groupedItems = items.groupByKey().mapValues(list)
    groupedItems.take(groupedItems.count())#j = filter2_rdd


    d = groupedItems.collectAsMap()

But this prints the following:

[
{1: {u'{': [u'{']}},
{9: {u'abeiilnns': [u'  "biennials": 0, ']}}, 
{10: {u'aiilnoprtt': [u'  "tripolitan": 0, ']}}, 
{9: {u'bclooortu': [u'  "oblocutor": 0, ']}}, 
{12: {u'ceeeilnostuy': [u'  "leucosyenite": 0, ']}}, 
{8: {u'chiiilst': [u'  "chilitis": 0, ']}}, 
{9: {u'aabfiinst': [u'  "fabianist": 0, ']}}, 
{9: {u'acdeiituz': [u'  "diazeutic": 0, ']}}, 
{6: {u'abeill': [u'  "alible": 0, ']}}, 
{6: {u'cdeeit': [u'  "deciet":0,']}}, 
{5: {u'doosw': [u'  "woods": 4601, ']}}, 
{14: {u'adeejmnnoprrtu': [u'  "preadjournment": 0, ']}}, 
{7: {u'deiprss': [u'  "spiders": 0, ']}}, 
{9: {u'aabfiimns': [u'  "fabianism": 0, ']}}, 
{11: {u'cdgilnoostu': [u'  "outscolding": 0, ']}}, 
{10: {u'eeilprrsty': [u'  "sperrylite": 0, ']}}, 
{8: {u'agilnrtw': [u'  "trawling": 0, ']}}, 
{13: {u'acdeimmoprrsu': [u'  "cardiospermum": 0, ']}}, 
{10: {u'gghhiilttt': [u'  "lighttight": 0, ']}}, 
{7: {u'deiprsy': [u'  "spidery": 0, ']}}
]

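I need the words grouped by length, with all the words that share the same sorted letters collected in a list.

【Comments】:

Can you show what you have tried?
Added more code.

Tags: python apache-spark dictionary pyspark

【Solution 1】:

You can't map() each word directly to len() and sorted(), because the original word is lost. Here is one way to do it: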

map to create the key sorted(x)
groupByKey - sorted(x)
map to create the key len(x)
groupByKey - len(x)
collectAsMap()
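
The steps above can be traced locally without Spark. The following is only a sketch of the same two-level grouping in plain Python, not part of the original answer; the function name `group_anagrams_by_length` and the sample word list are assumptions made for illustration.

```python
from collections import defaultdict


def group_anagrams_by_length(words):
    """Plain-Python analogue of the Spark steps (hypothetical helper)."""
    # Steps 1-2: key each word by its sorted letters, then group by that key.
    by_signature = defaultdict(list)
    for word in words:
        by_signature[''.join(sorted(word))].append(word)

    # Steps 3-4: key each group by the length of its signature, then group again.
    by_length = defaultdict(dict)
    for signature, group in by_signature.items():
        by_length[len(signature)][signature] = group

    # Step 5: return ordinary dicts, similar in shape to collectAsMap().
    return {length: dict(groups) for length, groups in by_length.items()}


result = group_anagrams_by_length(['earn', 'rane', 'rare', 'rear', 'drew'])
print(result)
```

Keeping the original word as the value in the first step is what preserves it through both groupings.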

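If you want to print the result, you will probably need to convert the ResultIterable objects into concrete Python types:

For example (assuming you have already parallelized all the words into rdd):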

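In []:
(rdd
 # Key each word by its sorted letters, keeping the word itself as the value.
 .map(lambda x: (''.join(sorted(x)), x))
 .groupByKey()
 .mapValues(list)
 # Re-key each (sorted_letters, words) pair by the word length.
 .map(lambda x: (len(x[0]), x))
 .groupByKey()
 # Turn each group of (sorted_letters, words) pairs into a dict.
 .mapValues(dict)
 .collectAsMap())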

Out[]:
{6: {'abeill': ['alible'], 'cdeeit': ['deciet']},
 8: {'chiiilst': ['chilitis']},
 9: {'aabfiinst': ['fabianist'],
  'abeiilnns': ['biennials'],
  'acdeiituz': ['diazeutic'],
  'bclooortu': ['oblocutor']},
 10: {'aiilnoprtt': ['tripolitan']},
 12: {'ceeeilnostuy': ['leucosyenite']}}
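
The pipeline above assumes the RDD already holds bare words, whereas the output in the question shows raw file lines such as `  "biennials": 0, ` being used as words. A minimal sketch of one way to extract the words first, assuming the file is a single valid JSON object whose keys are the words (as the sample in the question suggests); `load_words` is a hypothetical helper and `sc` is assumed to be an existing SparkContext:

```python
import json


def load_words(json_text):
    """Return the keys of a JSON object whose keys are the words (hypothetical helper)."""
    return list(json.loads(json_text).keys())


# Small inline sample standing in for the contents of the real file.
sample = '{"biennials": 0, "alible": 0, "deciet": 0}'
words = load_words(sample)
print(words)

# With a live SparkContext named `sc` (an assumption), the words could then be
# distributed with:
#     rdd = sc.parallelize(words)
```

Parsing the file as JSON avoids having to strip quotes, colons, and counts from each line by hand, at the cost of loading the whole file on the driver.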

【Comments】: