【Question title】:How to convert SQL output to a DataFrame?
【Posted】:2023-01-14 18:05:47
【Question】:

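I have a DataFrame from which I create a temporary view so that I can run SQL queries against it. After a few SQL queries, I want to turn the query output into a new DataFrame. I need the data back as a DataFrame so that I can save it to blob storage.

So the question is: what is the proper way to convert SQL query output to a DataFrame?

Here is the code I have so far: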

%scala
//read data from Azure blob
...
var df = spark.read.parquet(some_path)

// create temp view
df.createOrReplaceTempView("data_sample")

%sql
-- a few SQL queries go here; the one below is just an example
SELECT
   date,
   count(*) as cnt
FROM
   data_sample
GROUP BY
   date

Now I want a DataFrame that holds the output of the SQL query above. How do I do that?
Preferably the code would be in Python or Scala.


【Comments】:

    Tags: pyspark databricks azure-databricks


    【Solution 1】:

    Scala:

    var df = spark.sql(s"""
    SELECT
       date,
       count(*) as cnt
    FROM
       data_sample
    GROUP BY
       date
    """)
    

    PySpark:

    df = spark.sql(f'''
    SELECT
       date,
       count(*) as cnt
    FROM
       data_sample
    GROUP BY
       date
    ''')
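
Since the stated goal is to save the query result to blob storage, a minimal PySpark sketch of that last step might look like the following. It assumes a Databricks notebook where `spark` is already defined. The helper name `save_query_result` and the output path are hypothetical, and Parquet with overwrite mode is just one reasonable choice.

```python
def save_query_result(spark, query, out_path):
    """Run a SQL query and write the resulting DataFrame as Parquet.

    `spark` is the active SparkSession (predefined in Databricks notebooks).
    `out_path` is a hypothetical destination, e.g. a mounted blob container.
    """
    # spark.sql already returns a DataFrame, so no conversion step is needed.
    result_df = spark.sql(query)
    # "overwrite" is an assumption; use "append" to keep earlier output.
    result_df.write.mode("overwrite").parquet(out_path)
    return result_df


QUERY = """
SELECT date, count(*) AS cnt
FROM data_sample
GROUP BY date
"""

# Usage in a notebook cell (the path is a placeholder, not a real location):
# df = save_query_result(spark, QUERY, "/mnt/<container>/daily_counts")
```

The function returns the DataFrame as well, so it can still be inspected or transformed further after the write.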
    

    【Comments】:

    • Can the SQL string be parameterized? Something like f''' SELECT date + {timezone} as date, … ''' where timezone is a parameter?
    • Yes, for example f'''SELECT date, {timezone} from ... '''
    • In PySpark: table = 'schema.my_table'; df = spark.sql(f'''select * from {table}''')
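
An f-string pastes its values straight into the query text, so a value that comes from user input can change the statement itself. Here is a minimal sketch of one way to guard against that, assuming only table and column names are being parameterized. `build_count_query` and its regular expression are hypothetical helpers, not part of Spark.

```python
import re

# Conservative pattern for identifiers such as "schema.my_table".
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def build_count_query(table, date_col="date"):
    """Return the GROUP BY query for a validated table and column name."""
    for name in (table, date_col):
        # Reject anything that is not a plain (optionally dotted) identifier.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    return (
        f"SELECT {date_col}, count(*) AS cnt "
        f"FROM {table} "
        f"GROUP BY {date_col}"
    )


# Usage in a notebook cell, where `spark` is the predefined session:
# df = spark.sql(build_count_query("data_sample"))
```

For literal values rather than identifiers, newer Spark releases (3.4 and later, as far as I know) also accept named parameter markers through the `args` argument of `spark.sql`; check the documentation for your runtime before relying on it.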
    【Solution 2】:

    You can create a temporary view in a %sql cell and then reference it from PySpark or Scala code, like this:

    %sql
    create or replace temporary view sql_result as
    SELECT ...
    
    %scala
    var df = spark.sql("SELECT * FROM sql_result")
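
The same view can be read from a Python cell. This is a minimal sketch, assuming the `sql_result` view from the `%sql` cell above exists in the current session; `load_view` is a hypothetical helper.

```python
def load_view(spark, view_name="sql_result"):
    """Return the temporary view as a DataFrame.

    Temporary views are scoped to the Spark session, so this only works in
    the session that created the view.
    """
    # spark.table(name) reads the view, like SELECT * FROM <name>.
    return spark.table(view_name)


# Usage in a notebook cell, where `spark` is the predefined session:
# df = load_view(spark)
```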
    
