【Question title】: Null values from a csv on Scala and Apache Spark
【Posted】: 2019-03-16 19:54:00
【Question description】:

I am using Apache Spark 2.3.0. When I load a CSV file and then call df.show, it displays a table in which every value is null. I would like to know why, because everything in the CSV looks fine.

import org.apache.spark.sql.types._

val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))

val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")

Rank,Grade,Channelname,VideoUploads,Subscribers,Videoviews
1st,A++ ,Zee TV,82757,18752951,20869786591
2nd,A++ ,T-Series,12661,61196302,47548839843
3rd,A++ ,Cocomelon - Nursery Rhymes,373,19238251,9793305082
4th,A++ ,SET India,27323,31180559,22675948293
5th,A++ ,WWE,36756,32852346,26273668433
6th,A++ ,Movieclips,30243,17149705,16618094724
7th,A++ ,netd müzik,8500,11373567,23898730764
8th,A++ ,ABS-CBN Entertainment,100147,12149206,17202609850
9th,A++ ,Ryan ToysReview,1140,16082927,24518098041
10th,A++ ,Zee Marathi,74607,2841811,2591830307
11th,A+ ,5-Minute Crafts,2085,33492951,8587520379
12th,A+ ,Canal KondZilla,822,39409726,19291034467
13th,A+ ,Like Nastya Vlog,150,7662886,2540099931
14th,A+ ,Ozuna,50,18824912,8727783225
15th,A+ ,Wave Music,16119,15899764,10989179147
16th,A+ ,Ch3Thailand,49239,11569723,9388600275
17th,A+ ,WORLDSTARHIPHOP,4778,15830098,11102158475
18th,A+ ,Vlad and Nikita,53,-- ,1428274554

【Comments】:

  • Is the delimiter in the file a comma (,) or something else?
  • Could you post the first few lines of the CSV file?
  • @TerryDactyl I have added them above.

Tags: scala csv apache-spark apache-spark-mllib


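【Solution 1】:

The null values appear because the default "mode" of the csv API is PERMISSIVE:

mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
- PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If the schema does not have the field, it drops corrupt records during parsing. When the length of the parsed CSV tokens is shorter than the expected length of the schema, it sets null for the extra fields.
- DROPMALFORMED: ignores the whole corrupted record.
- FAILFAST: throws an exception when it meets a corrupted record.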
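
To make the modes concrete, here is a minimal sketch that is not part of the original answer. It parses two rows copied from the question's file using the asker's IntegerType columns plus an extra string field for corrupt rows. The object name `CsvModes`, the in-memory `sample`, and the column name `_corrupt_record` are assumptions chosen for illustration, and `csv(Dataset[String])` requires Spark 2.2 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CsvModes {
  // Header plus two rows copied from the question's data.csv.
  val sample: Seq[String] = Seq(
    "Rank,Grade,Channelname,VideoUploads,Subscribers,Videoviews",
    "1st,A++ ,Zee TV,82757,18752951,20869786591",
    "18th,A+ ,Vlad and Nikita,53,-- ,1428274554"
  )

  // The asker's column types, plus a string field (hypothetical name)
  // that receives the raw text of any row that fails to parse.
  val schema: StructType = StructType(Seq(
    StructField("Rank", StringType, nullable = true),
    StructField("Grade", StringType, nullable = true),
    StructField("Channelname", StringType, nullable = true),
    StructField("Video Uploads", IntegerType, nullable = true),
    StructField("Suscribers", IntegerType, nullable = true),
    StructField("Videoviews", IntegerType, nullable = true),
    StructField("_corrupt_record", StringType, nullable = true)
  ))

  // Reads the sample with the given parse mode:
  // PERMISSIVE, DROPMALFORMED or FAILFAST.
  def read(spark: SparkSession, mode: String): DataFrame = {
    import spark.implicits._ // needed for sample.toDS()
    spark.read
      .option("header", "true")
      .option("mode", mode)
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv(sample.toDS())
  }
}
```

With a local SparkSession named `spark`, `CsvModes.read(spark, "PERMISSIVE").show(false)` should fill `_corrupt_record` for both rows, while `CsvModes.read(spark, "FAILFAST").show()` should throw an exception instead of silently returning nulls. How the remaining columns of a corrupt row are filled may differ between Spark versions.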

csv API

【Discussion】:

    【Solution 2】:

    So, if we load the file without a schema, we see the following:

    scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").load("data.csv")
    
    df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
    
    scala> df.show
    +----+-----+--------------------+------------+-----------+-----------+
    |Rank|Grade|         Channelname|VideoUploads|Subscribers| Videoviews|
    +----+-----+--------------------+------------+-----------+-----------+
    | 1st| A++ |              Zee TV|       82757|   18752951|20869786591|
    | 2nd| A++ |            T-Series|       12661|   61196302|47548839843|
    | 3rd| A++ |Cocomelon - Nurse...|         373|   19238251| 9793305082|
    | 4th| A++ |           SET India|       27323|   31180559|22675948293|
    | 5th| A++ |                 WWE|       36756|   32852346|26273668433|
    | 6th| A++ |          Movieclips|       30243|   17149705|16618094724|
    | 7th| A++ |          netd müzik|        8500|   11373567|23898730764|
    | 8th| A++ |ABS-CBN Entertain...|      100147|   12149206|17202609850|
    | 9th| A++ |     Ryan ToysReview|        1140|   16082927|24518098041|
    |10th| A++ |         Zee Marathi|       74607|    2841811| 2591830307|
    |11th|  A+ |     5-Minute Crafts|        2085|   33492951| 8587520379|
    |12th|  A+ |     Canal KondZilla|         822|   39409726|19291034467|
    |13th|  A+ |    Like Nastya Vlog|         150|    7662886| 2540099931|
    |14th|  A+ |               Ozuna|          50|   18824912| 8727783225|
    |15th|  A+ |          Wave Music|       16119|   15899764|10989179147|
    |16th|  A+ |         Ch3Thailand|       49239|   11569723| 9388600275|
    |17th|  A+ |     WORLDSTARHIPHOP|        4778|   15830098|11102158475|
    |18th|  A+ |     Vlad and Nikita|          53|        -- | 1428274554|
    +----+-----+--------------------+------------+-----------+-----------+
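
Without a schema every column is read as a string. As a hedged aside that is not part of the original answer, the `inferSchema` option asks Spark to guess the column types from the data, which is a quick way to see which types the values actually fit. The object name `InferCsvSchema` and the in-memory `sample` are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InferCsvSchema {
  // Header plus two rows copied from the question's data.csv.
  val sample: Seq[String] = Seq(
    "Rank,Grade,Channelname,VideoUploads,Subscribers,Videoviews",
    "1st,A++ ,Zee TV,82757,18752951,20869786591",
    "18th,A+ ,Vlad and Nikita,53,-- ,1428274554"
  )

  // Lets Spark infer the column types instead of applying a user-defined schema.
  def read(spark: SparkSession): DataFrame = {
    import spark.implicits._ // needed for sample.toDS()
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(sample.toDS())
  }
}
```

Calling `InferCsvSchema.read(spark).printSchema()` on this sample should report Videoviews as a long and Subscribers as a string, because of the "-- " entry.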
    

    If we apply your schema, we see:

    scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))
    
    scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
    df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
    
    scala> df.show
    +----+-----+-----------+-------------+----------+----------+
    |Rank|Grade|Channelname|Video Uploads|Suscribers|Videoviews|
    +----+-----+-----------+-------------+----------+----------+
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    |null| null|       null|         null|      null|      null|
    +----+-----+-----------+-------------+----------+----------+
    

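    Now, if we look at your data, we see that Subscribers contains a non-integer value ("--") and that Videoviews contains values above the maximum Integer value (2,147,483,647). Every row has at least one such value, so with IntegerType every row is treated as malformed and, in PERMISSIVE mode, comes back with all of its fields set to null.

    So, if we change the schema to match the data: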
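
The range problem can be checked in plain Scala, without Spark. The sketch below only illustrates the idea and is not Spark's actual parser; the object and method names are hypothetical.

```scala
import scala.util.Try

object NumericFit {
  // True when the trimmed text parses as a 32-bit Int.
  def fitsInt(s: String): Boolean = Try(s.trim.toInt).isSuccess

  // True when the trimmed text parses as a 64-bit Long.
  def fitsLong(s: String): Boolean = Try(s.trim.toLong).isSuccess

  def main(args: Array[String]): Unit = {
    // Values copied from the Subscribers and Videoviews columns of the question's data.
    Seq("18752951", "20869786591", "1428274554", "-- ").foreach { v =>
      println(s"'$v' fitsInt=${fitsInt(v)} fitsLong=${fitsLong(v)}")
    }
  }
}
```

Here "20869786591" fits a Long but not an Int, and "-- " fits neither, which is why Videoviews needs LongType and Subscribers cannot be read as a numeric type directly.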

    scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",StringType,true),StructField("Videoviews",LongType,true)))
    schema: org.apache.spark.sql.types.StructType = StructType(StructField(Rank,StringType,true), StructField(Grade,StringType,true), StructField(Channelname,StringType,true), StructField(Video Uploads,IntegerType,true), StructField(Suscribers,StringType,true), StructField(Videoviews,LongType,true))
    
    scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
    df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
    
    scala> df.show
    +----+-----+--------------------+-------------+----------+-----------+
    |Rank|Grade|         Channelname|Video Uploads|Suscribers| Videoviews|
    +----+-----+--------------------+-------------+----------+-----------+
    | 1st| A++ |              Zee TV|        82757|  18752951|20869786591|
    | 2nd| A++ |            T-Series|        12661|  61196302|47548839843|
    | 3rd| A++ |Cocomelon - Nurse...|          373|  19238251| 9793305082|
    | 4th| A++ |           SET India|        27323|  31180559|22675948293|
    | 5th| A++ |                 WWE|        36756|  32852346|26273668433|
    | 6th| A++ |          Movieclips|        30243|  17149705|16618094724|
    | 7th| A++ |          netd müzik|         8500|  11373567|23898730764|
    | 8th| A++ |ABS-CBN Entertain...|       100147|  12149206|17202609850|
    | 9th| A++ |     Ryan ToysReview|         1140|  16082927|24518098041|
    |10th| A++ |         Zee Marathi|        74607|   2841811| 2591830307|
    |11th|  A+ |     5-Minute Crafts|         2085|  33492951| 8587520379|
    |12th|  A+ |     Canal KondZilla|          822|  39409726|19291034467|
    |13th|  A+ |    Like Nastya Vlog|          150|   7662886| 2540099931|
    |14th|  A+ |               Ozuna|           50|  18824912| 8727783225|
    |15th|  A+ |          Wave Music|        16119|  15899764|10989179147|
    |16th|  A+ |         Ch3Thailand|        49239|  11569723| 9388600275|
    |17th|  A+ |     WORLDSTARHIPHOP|         4778|  15830098|11102158475|
    |18th|  A+ |     Vlad and Nikita|           53|       -- | 1428274554|
    +----+-----+--------------------+-------------+----------+-----------+ 
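
The corrected schema keeps the subscriber counts as strings. If numeric values are needed afterwards, one possible follow-up, offered as a sketch rather than as part of the original answer, is to convert the column after loading and turn non-numeric entries such as "-- " into null. The object and method names are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim, when}
import org.apache.spark.sql.types.LongType

object CleanNumericColumn {
  // Replaces a string column with a Long column. Values that are not
  // made up solely of digits (after trimming) become null, because
  // `when` without `otherwise` yields null for unmatched rows.
  def toLongOrNull(df: DataFrame, column: String): DataFrame = {
    val cleaned = trim(col(column))
    df.withColumn(column, when(cleaned.rlike("^[0-9]+$"), cleaned.cast(LongType)))
  }
}
```

For the DataFrame loaded with the corrected schema, this would be called as `CleanNumericColumn.toLongOrNull(df, "Suscribers")`, using the column name from that schema.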
    

    【Discussion】:
