【Question title】: How to use spark_read_avro from the sparklyr R package?
【Posted】: 2021-12-01 05:08:55
【Question】:

I am using R version 4.1.1 and sparklyr version 1.7.2.

I am connected to my Databricks cluster through databricks-connect and am trying to read an Avro file with the following code:

library(sparklyr)
library(dplyr)

sc <- spark_connect(
  method = "databricks", 
  spark_home = "my_spark_home_path",
  version = "3.1.1",
  packages = c("avro")
  )

df_path = "s3a://my_s3_path"
df = spark_read_avro(sc, path = df_path, memory = FALSE)

I also tried adding the package explicitly:

library(sparklyr)
library(dplyr)

sc <- spark_connect(
  method = "databricks", 
  spark_home = "my_spark_home_path",
  version = "3.1.1",
  packages = "org.apache.spark:spark-avro_2.12:3.1.1"
  ) 

df_path = "s3a://my_s3_path"
df = spark_read_avro(sc, path = df_path, memory = FALSE)

The Spark connection works and I can read Parquet files without problems, but reading an Avro file always fails with:

Error in validate_spark_avro_pkg_version(sc) : 
  Avro support must be enabled with `spark_connect(..., version = <version>, packages = c("avro", <other package(s)>), ...)`  or by explicitly including 'org.apache.spark:spark-avro_2.12:3.1.1-SNAPSHOT' for Spark version 3.1.1-SNAPSHOT in list of packages

Does anyone know how to fix this?

【Comments】:

    Tags: r databricks avro sparklyr spark-avro


    【Answer 1】:

    I found a workaround using the sparkavro package:

    library(sparklyr)
    library(dplyr)
    library(sparkavro)  # loaded after sparklyr, so its spark_read_avro() masks sparklyr's
    
    sc <- spark_connect(
      method = "databricks", 
      spark_home = "my_spark_home_path") 
    
    df_path = "s3a://my_s3_path"
    df = spark_read_avro(
       sc, 
       path = df_path, 
       name = "my_table_name", 
       memory = FALSE)
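
As a side note on the error itself: judging only by the wording of the message in the question, the check appears to expect a package coordinate that carries the exact version string the cluster reports (3.1.1-SNAPSHOT), whereas the question passed 3.1.1. This is an inference from the message, not confirmed sparklyr behavior, and it is not verified here that the -SNAPSHOT coordinate can actually be resolved. The small sketch below only makes the mismatch visible; avro_coordinate is a hypothetical helper, not part of sparklyr.

```r
# Hypothetical helper (not part of sparklyr): builds a Maven coordinate of the
# form shown in the error message from a Spark version and a Scala version.
avro_coordinate <- function(spark_version, scala_version = "2.12") {
  sprintf("org.apache.spark:spark-avro_%s:%s", scala_version, spark_version)
}

# Coordinate passed explicitly in the question (plain release version).
requested <- avro_coordinate("3.1.1")

# Coordinate named in the error message (version string reported there).
expected <- avro_coordinate("3.1.1-SNAPSHOT")

print(requested)
print(expected)

# The two strings differ, which is consistent with the check failing even
# though an Avro package was listed.
print(identical(requested, expected))
```

The sparkavro workaround above avoids this check entirely because it calls a different spark_read_avro implementation.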
    

    【Comments】:
