pyspark: shingling between different lines in a txt file

【Question title】: pyspark: shingling between different lines in a txt file
【Posted】: 2020-12-15 06:08:29
【Question】:

I need to find all 3-gram shingles in a txt file (sports articles with a title and body text) in a MapReduce fashion. However, the txt file is formatted like this:

This is the title
Content is here on the next line.
This is another line.

If I use sc.textFile() without any further processing, text = sc.textFile().collect() looks like

['This is the title',
 '',
 'Content is here on the next line.',
 '',
 'This is another line.']

because the text file has multiple lines. As a result, the 3-gram shingling is computed per line and looks like

[['This is the',
  'is the title'],
 [],
 ['Content is here', 
  'is here on',
  'here on the',
  'on the next',
  'the next line.'],
 [],
 ['This is another',
  'is another line.']]

if I use the map function text.map(shingling), with shingling defined as

k = 3  # shingle size, in words

def shingling(text):
    # Split on whitespace, then join every run of k consecutive tokens.
    tokens = text.split()
    shingle = [' '.join(tokens[i:i+k])
               for i in range(len(tokens) - k + 1)]
    return shingle

What I want instead is

['This is the',
 'is the title',
 'the title Content',
 'title Content is',
 ......]

Is there a function I can use for this, or how should I modify my code to get this result?

【Comments】:

Tags: python apache-spark pyspark mapreduce

【Solution 1】:

You may need to merge the lines into a single string before shingling, for example:

rdd = sc.textFile('text')

rdd2 = sc.parallelize([rdd.fold('', lambda x, y: x + ' ' + y)]).map(shingling)

>>> rdd2.collect()
[['This is the', 'is the title', 'the title Content', 'title Content is',
  'Content is here', 'is here on', 'here on the', 'on the next', 'the next line.',
  'next line. This', 'line. This is', 'This is another', 'is another line.']]
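
To see why merging the lines first gives the cross-line shingles, here is a minimal local sketch in plain Python, with no Spark involved. It assumes the whole text fits in memory as one string, which is also what the fold above ends up producing. The helper name shingle_across_lines and the k parameter are illustrative additions, not part of the original code.

```python
from typing import Iterable, List


def shingling(text: str, k: int = 3) -> List[str]:
    """Return all k-word shingles of `text`, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def shingle_across_lines(lines: Iterable[str], k: int = 3) -> List[str]:
    """Join the lines into one string, then shingle it.

    Hypothetical local stand-in for the fold-then-map step. split()
    discards the extra whitespace left behind by blank lines.
    """
    merged = " ".join(lines)
    return shingling(merged, k)


# The lines as returned by sc.textFile().collect() in the question.
lines = [
    "This is the title",
    "",
    "Content is here on the next line.",
    "",
    "This is another line.",
]

shingles = shingle_across_lines(lines)
print(shingles[:4])
# ['This is the', 'is the title', 'the title Content', 'title Content is']
```

Because the text is reduced to a single record, the shingling step itself runs on one worker. If each article lives in its own file, sc.wholeTextFiles may be worth a look: it yields one (path, content) pair per file, so shingling could be applied to each file's content without an explicit merge.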

【Comments】: