【问题标题】:Why did Spark interchange values of two columns?为什么 Spark 交换两列的值?
【发布时间】:2022-07-04 01:49:33
【问题描述】:

请有人解释一下为什么 spark 在查询 DataFrame 时会交换两列的值?

ProposedAction 的值返回给SimpleMatchRate,反之亦然。

这是代码示例:

import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType as ST, StructField as SF, StringType as STR

spark = (SparkSession.builder
    .master("local")
    .appName("Fuzzy")
    .config("spark.jars", "../jars/mysql-connector-java-8.0.29.jar")
    .config("spark.driver.extraClassPath", "../jars/mysql-connector-java-8.0.29.jar")
    .getOrCreate())

customschema = ST([
  SF("Matched", STR()),
  SF("MatchRate", STR()),
  SF("ProposedAction", STR()), # e.g. is_new
  SF("SimpleMatchRate", STR()), # e.g. 76.99800
  SF("Status", STR())])

files = [file for file in glob.glob('../source_files/*fuzzy*')]
df = spark.read.csv(files, sep="\t", header="true", encoding="UTF-8", schema=customschema)
df.printSchema()
root
 |-- Matched: string (nullable = true)
 |-- MatchRate: string (nullable = true)
 |-- ProposedAction: string (nullable = true)
 |-- SimpleMatchRate: string (nullable = true)
 |-- Status: string (nullable = true)

现在,如果我尝试将 df 作为表格查询:

df.createOrReplaceTempView("tmp_table")

spark.sql("""SELECT MatchRate, ProposedAction, SimpleMatchRate
          FROM tmp_table  LIMIT 5""").show()

我明白了:

+-----------+----------------+-----------------+
| MatchRate | ProposedAction | SimpleMatchRate |
+-----------+----------------+-----------------+
|  0.043169 |       0.000000 |          is_new |
|  88.67153 |       98.96907 |       is_linked |
|  89.50349 |       98.94736 |       is_linked |
|  99.44025 |      100.00000 |         is_dupe |
|  90.78082 |       98.92473 |       is_linked |
+-----------+----------------+-----------------+

【问题讨论】:

    标签: apache-spark apache-spark-sql


    【解决方案1】:

    我发现我做错了什么。我的架构定义没有正确遵循输入文件中的列顺序。 ProposedAction 出现在 SimpleMatchRate 之后,如下所示:

    . . .Matched MatchRate   SimpleMatchRate ProposedAction  status
    

    我将定义修改为以下,问题已解决:

    customschema = ST([
      SF("Matched", STR()),
      SF("MatchRate", STR()),
      SF("SimpleMatchRate", STR()),
      SF("ProposedAction", STR()), # Now in the correct position as in the input file
      SF("Status", STR())])
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2011-08-20
      • 2017-07-16
      • 1970-01-01
      • 1970-01-01
      • 2018-08-17
      • 1970-01-01
      相关资源
      最近更新 更多