【Question title】: How to get the same percent_rank in SQL and pandas?
【Posted】: 2020-11-14 13:52:33
【Question】:

I am learning PySpark with HiveQL and noticed that the percent rank comes out differently in pyspark-sql and in pandas.

Source of the problem, including the SQL code: https://www.windowfunctions.com/questions/ranking/3

How can I get the same result in pandas as in SQL?

Two questions

What Python code gives the same result as the SQL?
What SQL code gives the same result as pandas?

pyspark-sql

q = """
select name, weight,
       percent_rank() over (order by weight) as percent_rank_wt
from cats
order by weight
"""
spark.sql(q).show()

SQL gives this table. I would like to produce the same table using pandas.

+-------+------+-------------------+
|   name|weight|    percent_rank_wt|
+-------+------+-------------------+
| Tigger|   3.8|                0.0|
|  Molly|   4.2|0.09090909090909091|
|  Ashes|   4.5|0.18181818181818182|
|Charlie|   4.8| 0.2727272727272727|
| Smudge|   4.9|0.36363636363636365|
|  Felix|   5.0|0.45454545454545453|
|   Puss|   5.1| 0.5454545454545454|
| Millie|   5.4| 0.6363636363636364|
|  Alfie|   5.5| 0.7272727272727273|
|  Misty|   5.7| 0.8181818181818182|
|  Oscar|   6.1| 0.9090909090909091|
| Smokey|   6.1| 0.9090909090909091|
+-------+------+-------------------+

pandas

methods = {'average', 'min', 'max', 'first', 'dense'}

df[['name','weight']].sort_values('weight').assign(
     pct_avg=df['weight'].rank(pct=True,method='average'),
     pct_min=df['weight'].rank(pct=True,method='min'),
     pct_max=df['weight'].rank(pct=True,method='max'),
     pct_first=df['weight'].rank(pct=True,method='first'),
     pct_dense=df['weight'].rank(pct=True,method='dense')
).sort_values('weight')
       name  weight   pct_avg   pct_min   pct_max  pct_first  pct_dense
4    Tigger     3.8  0.083333  0.083333  0.083333   0.083333   0.090909
0     Molly     4.2  0.166667  0.166667  0.166667   0.166667   0.181818
1     Ashes     4.5  0.250000  0.250000  0.250000   0.250000   0.272727
11  Charlie     4.8  0.333333  0.333333  0.333333   0.333333   0.363636
3    Smudge     4.9  0.416667  0.416667  0.416667   0.416667   0.454545
2     Felix     5.0  0.500000  0.500000  0.500000   0.500000   0.545455
9      Puss     5.1  0.583333  0.583333  0.583333   0.583333   0.636364
7    Millie     5.4  0.666667  0.666667  0.666667   0.666667   0.727273
5     Alfie     5.5  0.750000  0.750000  0.750000   0.750000   0.818182
8     Misty     5.7  0.833333  0.833333  0.833333   0.833333   0.909091
6     Oscar     6.1  0.958333  0.916667  1.000000   0.916667   1.000000
10   Smokey     6.1  0.958333  0.916667  1.000000   1.000000   1.000000

Setup

import numpy as np
import pandas as pd

import pyspark
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark import SparkConf, SparkContext, SQLContext
spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

df = pd.DataFrame({
    'name': [
        'Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
        'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'
    ],
    'breed': [
        'Persian', 'Persian', 'Persian', 'British Shorthair',
        'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon', 'Maine Coon',
        'Maine Coon', 'Maine Coon', 'British Shorthair'
    ],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
    'color': [
        'Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell', 'Brown',
        'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell', 'Brown', 'Black'
    ],
    'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]
})

schema = StructType([
    StructField('name', StringType(), True),
    StructField('breed', StringType(), True),
    StructField('weight', DoubleType(), True),
    StructField('color', StringType(), True),
    StructField('age', IntegerType(), True),
])

sdf = sqlContext.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")

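【Comments】:

In the question I gave the output of the SQL code. My Python code gives a different result, which means my Python code is wrong. I want "correct" Python code that gives the same result as the SQL code (percent_rank).
Can you add the dense method, .rank(pct=True,method='dense')?
It still gives a different answer.
It looks like pct_dense is almost there, just off by one shift :)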

Tags: python sql pandas pyspark hiveql

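【Solution 1】:

SQL's percent_rank is not quite the same as pandas' rank. There are two main differences:

SQL's percent_rank excludes the current row from the calculation. So if the table has 11 rows, each row's result is computed against the other 10 rows only. pandas rank includes all rows.
SQL's percent_rank gives the number of rows strictly smaller than the current row. pandas rank has no method that does this.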
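
To make the two differences concrete, here is a minimal pure-Python sketch that computes both quantities by counting. The function names `sql_percent_rank` and `pandas_pct_rank_min` are made up for illustration, and the second one assumes that `rank(pct=True, method='min')` divides the minimum rank by the total row count, which is consistent with the pct_min column shown in the question.

```python
# Weights from the question's cats table (12 rows, one tie at 6.1).
weights = [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8]
n = len(weights)


def sql_percent_rank(x):
    """Rows strictly below x, divided by the number of *other* rows."""
    below = sum(w < x for w in weights)
    return below / (n - 1)


def pandas_pct_rank_min(x):
    """1-based minimum rank of x, divided by the total number of rows."""
    below = sum(w < x for w in weights)
    return (below + 1) / n


# Compare the two definitions for the lightest cat, the next one, and the tie.
for x in (3.8, 4.2, 6.1):
    print(x, sql_percent_rank(x), pandas_pct_rank_min(x))
```

For the weight 4.2, for instance, one row is strictly lighter, so the SQL-style value is 1/11 while the pandas-style value is 2/12.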

What Python code gives the same result as the SQL?

To get the equivalent of SQL's percent_rank in pandas, you can apply a small calculation to the rank result:

(df['weight'].rank(method='min')-1) / (len(df['weight'])-1)

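The -1 in the numerator turns the rank into the number of rows strictly smaller than the current row, and the -1 in the denominator excludes the current row from the row count.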
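
As a check, the sketch below applies that expression to the cats data from the question and compares it with the percent_rank_wt column that Spark SQL printed. The column name `percent_rank_wt` is taken from the question's query; the `expected` list is copied from the question's output table.

```python
import numpy as np
import pandas as pd

# Name and weight columns from the question's setup code.
df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
             'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})

# (min rank - 1) rows are strictly lighter; (n - 1) rows are "other" rows.
n = len(df)
df['percent_rank_wt'] = (df['weight'].rank(method='min') - 1) / (n - 1)

# Order by weight, as the SQL query does.
result = df.sort_values('weight', kind='stable').reset_index(drop=True)

# percent_rank_wt values copied from the Spark SQL output in the question.
expected = [
    0.0, 0.09090909090909091, 0.18181818181818182, 0.2727272727272727,
    0.36363636363636365, 0.45454545454545453, 0.5454545454545454,
    0.6363636363636364, 0.7272727272727273, 0.8181818181818182,
    0.9090909090909091, 0.9090909090909091,
]
assert np.allclose(result['percent_rank_wt'], expected)
print(result)
```

Note that a one-row frame would make the denominator zero, so that case would need separate handling.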

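What SQL code gives the same result as pandas?

That depends on the method you use in pandas rank, but you most likely want SQL's cume_dist. It divides the number of rows less than or equal to the current row by the total row count, which corresponds to rank(pct=True, method='max').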
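
One way to see how cume_dist lines up with pandas is to run both on the same data. The sketch below is an assumption-laden stand-in, not the Spark setup from the question: it uses Python's built-in `sqlite3` module (which needs SQLite 3.25 or newer for window functions) in place of Spark SQL, on the assumption that both follow the standard cume_dist semantics, and compares the result with `rank(pct=True, method='max')`.

```python
import math
import sqlite3

import pandas as pd

names = ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
         'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie']
weights = [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8]
df = pd.DataFrame({'name': names, 'weight': weights})

# In-memory SQLite table standing in for the Spark "cats" view (assumption).
con = sqlite3.connect(':memory:')
con.execute('create table cats (name text, weight real)')
con.executemany('insert into cats values (?, ?)', list(zip(names, weights)))

q = """
select name,
       cume_dist() over (order by weight) as cume_dist_wt
from cats
"""
# Map each cat name to its cume_dist value.
sql_cume_dist = dict(con.execute(q).fetchall())
con.close()

# pandas: highest rank within each tie group, divided by the row count.
pandas_max = df.set_index('name')['weight'].rank(pct=True, method='max')

for name in names:
    assert math.isclose(sql_cume_dist[name], pandas_max[name])
```

The other pandas methods (average, min, first, dense) have no single built-in SQL window function with the same output, so they would need an expression built from rank(), dense_rank(), or row_number() divided by a count.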
