【Title】:Converting RDD to DataFrame: AttributeError: 'RDD' object has no attribute 'toDF' using PySpark
【Posted】:2020-08-16 20:01:04
【Question】:

I am trying to convert an RDD to a DataFrame using PySpark. My code is below.

from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

conf = SparkConf().setMaster("local").setAppName("Dataframe_examples")
sc = SparkContext(conf=conf)

def parsedLine(line):
    fields = line.split(',')
    movieId = fields[0]
    movieName = fields[1]
    genres = fields[2]
    return movieId, movieName, genres

movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
parsedLines = movies.map(parsedLine)
print(parsedLines.count())

dataFrame = parsedLines.toDF(["movieId"])
dataFrame.printSchema()

I am running this code from the PyCharm IDE.

I get the following error:

File "/home/ajit/PycharmProjects/pythonProject/Dataframe_examples.py", line 19, in <module>
    dataFrame = parsedLines.toDF(["movieId"])
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'

I am new to this, so please tell me what I am missing.

【Comments】:

    Tags: python apache-spark pyspark


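    【Answer 1】:

    Initialize a SparkSession by passing it the SparkContext. The toDF method becomes available on RDD objects only after a SparkSession has been created, which is why the call fails without one.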

    Example:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql.functions import *
    from pyspark.sql import SparkSession
    
    conf = SparkConf().setMaster("local").setAppName("Dataframe_examples")
    sc = SparkContext(conf=conf)
    
    spark = SparkSession(sc)
    
    def parsedLine(line):
        fields = line.split(',')
        movieId = fields[0]
        movieName = fields[1]
        genres = fields[2]
        return movieId, movieName, genres
    
    movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
    
    # or, equivalently, via spark.sparkContext
    movies = spark.sparkContext.textFile("file:///home/ajit/ml-25m/movies.csv")
    
    parsedLines = movies.map(parsedLine)
    print(parsedLines.count())
    
    dataFrame = parsedLines.toDF(["movieId"])
    dataFrame.printSchema()
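
As far as I understand it, this works because PySpark attaches toDF to RDD objects at the moment a SparkSession is constructed. The sketch below is only a toy model of that pattern: ToyRDD, ToySession and create_frame are hypothetical names invented for illustration and are not PySpark classes.

```python
# Toy stand-ins for illustration only; these are NOT PySpark classes.
class ToyRDD:
    """Minimal stand-in for an RDD: it only holds rows."""

    def __init__(self, rows):
        self.rows = list(rows)


class ToySession:
    """Stand-in for a session that installs toDF on ToyRDD when constructed."""

    def __init__(self):
        session = self

        def toDF(rdd, names=None):
            # Delegate to the session that installed this method.
            return session.create_frame(rdd.rows, names)

        # The patch: attach the method to the class at runtime.
        ToyRDD.toDF = toDF

    def create_frame(self, rows, names):
        # A "frame" here is just a list of dicts keyed by column name.
        return [dict(zip(names, row)) for row in rows]


rdd = ToyRDD([("1", "Toy Story (1995)", "Animation")])
before = hasattr(rdd, "toDF")  # no session yet, so the attribute is missing

ToySession()                   # constructing a session installs the method
after = hasattr(rdd, "toDF")

frame = rdd.toDF(["movieId", "movieName", "genres"])
print(before, after)
print(frame)
```

In the toy model, calling rdd.toDF before ToySession() raises the same kind of AttributeError as in the question, and the call succeeds once the session exists.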
    


      【Answer 2】:

      Create the DataFrame from the RDD through a SparkSession, as follows:

      movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
      parsedLines = movies.map(parsedLine)
      print(parsedLines.count())
      
      spark = SparkSession.builder.getOrCreate()
      # DataFrame.toDF takes the column names as separate arguments, one per column
      dataFrame = spark.createDataFrame(parsedLines).toDF("movieId", "movieName", "genres")
      dataFrame.printSchema()
      

      Or create the session first and take the Spark context from it.

      spark = SparkSession.builder.master("local").appName("Dataframe_examples").getOrCreate()
      sc = spark.sparkContext
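
A separate point to watch once the conversion works: parsedLine splits on every comma, so a quoted title that itself contains a comma would be cut into extra fields. I believe movies.csv has such titles, but check your copy. Below is a minimal sketch using the standard csv module; parse_movie_line is a hypothetical replacement for parsedLine, and the sample line is written here for illustration.

```python
import csv


def parse_movie_line(line):
    """Parse one CSV line into (movieId, movieName, genres), honoring quotes."""
    fields = next(csv.reader([line]))
    return fields[0], fields[1], fields[2]


# Illustrative line with a quoted title that contains a comma.
sample = '11,"American President, The (1995)",Comedy|Drama|Romance'

naive = sample.split(',')          # plain split also cuts inside the quotes
parsed = parse_movie_line(sample)  # csv.reader keeps the quoted title whole

print(len(naive))
print(parsed)
```

If this matches your data, movies.map(parse_movie_line) could be used in place of movies.map(parsedLine). A header row, if the file has one, would still be parsed as data and may need filtering out.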
      

