【Question title】: Getting error: name 'spark' is not defined
【Posted】: 2020-05-07 22:22:11
【Question】:

This is the code I am using:

df = None

from pyspark.sql.functions import lit

for category in file_list_filtered:
    data_files = os.listdir('HMP_Dataset/'+category)

    for data_file in data_files:
        print(data_file)
        temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
        temp_df = temp_df.withColumn('class', lit(category))
        temp_df = temp_df.withColumn('source', lit(data_file))

        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)

I get this error:

NameError                                 Traceback (most recent call last)
<ipython-input-4-4296b4e97942> in <module>
      9     for data_file in data_files:
     10         print(data_file)
---> 11         temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
     12         temp_df = temp_df.withColumn('class', lit(category))
     13         temp_df = temp_df.withColumn('source', lit(data_file))

NameError: name 'spark' is not defined

How can I fix this?

【Comments】:

    Tags: python apache-spark pyspark


    【Answer 1】:

    Initialize a Spark session first, then use spark inside the loop:

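    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    # Create the session (or reuse one that is already running) before the loop.
    spark = SparkSession.builder.appName('app_name').getOrCreate()

    df = None

    for category in file_list_filtered:
        ...  # rest of the loop unchanged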
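
For readers who want the whole flow in one place, here is a minimal sketch of the answer's idea combined with the loop from the question. The helper names list_data_files and load_all are hypothetical (they do not appear in the thread), and the sketch assumes that schema and file_list_filtered are defined as in the question. Unlike the original loop, files are read in sorted order.

```python
import os
from functools import reduce


def list_data_files(root, categories):
    """Return (category, file_name, path) for every entry under root/category."""
    entries = []
    for category in categories:
        category_dir = os.path.join(root, category)
        for file_name in sorted(os.listdir(category_dir)):
            entries.append((category, file_name, os.path.join(category_dir, file_name)))
    return entries


def load_all(spark, schema, entries):
    """Read each file with spark.read and union the results into one DataFrame."""
    # Imported here so list_data_files can be used without pyspark installed.
    from pyspark.sql.functions import lit

    frames = []
    for category, file_name, path in entries:
        frame = (
            spark.read.option("header", "false")
            .option("delimiter", " ")
            .csv(path, schema=schema)
            .withColumn("class", lit(category))
            .withColumn("source", lit(file_name))
        )
        frames.append(frame)
    # Returns None when there are no files, like the `df = None` start in the question.
    return reduce(lambda left, right: left.union(right), frames) if frames else None


# Usage (needs pyspark and the HMP_Dataset folder from the question):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("app_name").getOrCreate()
#   df = load_all(spark, schema, list_data_files("HMP_Dataset", file_list_filtered))
```

The session is passed in as an argument, so a missing spark shows up at the call site instead of deep inside the loop.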
    

    【Comments】:

    • NameError Traceback (most recent call last)
           10 for data_file in data_files:
           11     print(data_file)
      ---> 12     temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
           13     temp_df = temp_df.withColumn('class', lit(category))
           14     temp_df = temp_df.withColumn('source', lit(data_file))
      NameError: name 'schema' is not defined
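    • @ParamitaBhattacharjee, you are reading the CSV file with a schema, so the schema needs to be defined (see stackoverflow.com/a/56504339). Alternatively, you can remove schema=schema from the spark.read.csv call.
    • Thanks. I am actually using Jupyter Notebook, so I ran into a lot of errors, but when I do the same thing in Google Colab it works fine. Thanks.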
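
The comments above point out that schema must exist before it is passed as schema=schema. A minimal sketch of two ways to define one follows. The column names x, y, z and the integer type are assumptions about the HMP files and are not stated in the thread; the helper names make_schema and make_schema_ddl are hypothetical. Both forms can be passed to the schema argument of spark.read.csv, which accepts either a StructType or a DDL-formatted string.

```python
# Assumed column layout; adjust to match the actual files.
COLUMNS = ["x", "y", "z"]


def make_schema(columns):
    """Build a StructType with one nullable integer field per column name."""
    # Imported here so make_schema_ddl can be used without pyspark installed.
    from pyspark.sql.types import IntegerType, StructField, StructType

    return StructType([StructField(name, IntegerType(), True) for name in columns])


def make_schema_ddl(columns):
    """Build the equivalent DDL-formatted schema string."""
    return ", ".join(f"{name} INT" for name in columns)


# Usage (needs pyspark):
#   schema = make_schema(COLUMNS)        # StructType form
#   schema = make_schema_ddl(COLUMNS)    # string form
```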
    【Answer 2】:

    Try defining the spark variable yourself:

    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession
    sc = SparkContext('local')
    spark = SparkSession(sc)
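
One caveat with the snippet above in a notebook: running SparkContext('local') a second time in the same kernel raises an error, because only one SparkContext can be active at a time. The sketch below is one possible way to make the cell safe to re-run, using SparkContext.getOrCreate. The helper name get_or_create_spark and the idea of passing in a namespace dictionary are assumptions made for illustration, not part of the answer.

```python
def get_or_create_spark(namespace, master="local"):
    """Return namespace['spark'] if it is set, otherwise create and store a session."""
    existing = namespace.get("spark")
    if existing is not None:
        return existing

    # Imported lazily so the reuse path above works even without pyspark.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    # getOrCreate returns the running context instead of failing on a second call.
    sc = SparkContext.getOrCreate(SparkConf().setMaster(master))
    namespace["spark"] = SparkSession(sc)
    return namespace["spark"]


# Usage in a notebook cell (needs pyspark):
#   spark = get_or_create_spark(globals())
```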
    

    【Comments】:
