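【Posted】: 2021-01-02 19:28:25
【Question】:
I am trying to create a DataFrame from a list of tuples, but I get the error 'list' object has no attribute 'toDF'. How can I avoid this error? (https://www.gutenberg.org/files/63207/63207-0.txt)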
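import re

with open('/files/63207-0.txt', 'r') as content_file:
    material = content_file.read()
# normalize_text and remove_white_spaces are the asker's own helpers (not shown in the post)
material = remove_white_spaces(normalize_text(material))
beginning_string = 'Introduction To Book'
end_string = 'End of Book'
# Keep only the text between the opening and closing markers
real_material = material[material.find(beginning_string)+len(beginning_string):material.rfind(end_string)]
# Split on chapter headings; the slice drops whatever precedes the first heading
chapters = re.split(" Chapter [0-9]+ ", real_material, flags=re.IGNORECASE)[1:]
save_data = []
for i in range(1, 1+len(chapters)):
    save_data.append((i, chapters[i-1]))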
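The splitting and numbering step can be tried in isolation. The following is a minimal sketch on a made-up sample string (the `sample` text is an assumption, not taken from the book); it uses `enumerate` in place of the index loop, which produces the same list of `(number, text)` tuples.

```python
import re

# Hypothetical sample text standing in for the cleaned book contents.
sample = "Preface Chapter 1 It begins. Chapter 2 It continues. chapter 3 It ends."

# Everything before the first chapter heading is dropped by the [1:] slice.
chapters = re.split(r" Chapter [0-9]+ ", sample, flags=re.IGNORECASE)[1:]

# enumerate(..., start=1) pairs each chapter with its 1-based number.
save_data = list(enumerate(chapters, start=1))
print(save_data)
```

Note that `save_data` is an ordinary Python list at this point, which matters for the error described in the question.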
# Get the DataFrame from a list of tuples with columns ["chapter", "text"]
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
data = sc.parallelize(save_data)
# Each record must have exactly as many fields as the schema
data_converted = data.map(lambda x: (x[0], x[1]))
schema = StructType([StructField("chapter", IntegerType(), True), StructField("text", StringType(), True)])
df = spark.createDataFrame(data_converted, schema)
df.show(5)
# Binning using Bucketizer
from pyspark.ml.feature import Bucketizer

splits = [0, 11, 21, 31, 41, 51, float("inf")]
# inputCol must name an existing numeric column of df
bucketizer = Bucketizer(splits=splits, inputCol="chapter", outputCol="buckets")
df_buck = bucketizer.transform(df)
df_buck.show(20)
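To see which bucket each chapter number should land in, the binning rule can be reproduced in plain Python. This is only a sketch of the interval logic, not Spark's implementation: `bucket_index` is a hypothetical helper, and it assumes each bucket is the half-open interval `[splits[i], splits[i+1])`, with the last bucket also including its upper bound and out-of-range values rejected.

```python
from bisect import bisect_right

splits = [0, 11, 21, 31, 41, 51, float("inf")]

def bucket_index(value, splits):
    """Return i such that splits[i] <= value < splits[i+1] (last bucket closed on the right)."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    # bisect_right counts the splits <= value; clamp so the top bound stays in the last bucket.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)

for chapter in (1, 10, 11, 25, 60):
    print(chapter, bucket_index(chapter, splits))
```

With these splits, chapters 1 to 10 fall in bucket 0, chapters 11 to 20 in bucket 1, and anything from 51 upward in bucket 5. Bucketizer itself writes the bucket number as a double (0.0, 1.0, and so on).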
【Discussion】:
-
What terrible code. Use your RDD object, not the list.