Correct query
Python
# Pass a df and apply the lambda function to column stars
reviews.groupby('business_id').apply(lambda df: sum(df.stars > 3))
Code explanation
lambda df: sum(df.stars > 3)
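This lambda function expects a pandas DataFrame instance and evaluates df.stars > 3 for every row, which yields True where the condition holds and False otherwise. Finally, sum counts the True records.
Because I applied groupby before executing this lambda function, it returns the sum of df.stars > 3 for each group.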
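The same count can also be computed without calling a Python function once per group. The following is a minimal sketch, assuming a small made-up reviews table (the data here is hypothetical, not from the question): it builds the boolean mask once and sums it per business_id.

```python
import pandas as pd

# Hypothetical toy data; business 3 has no review above 3 stars.
reviews = pd.DataFrame({
    "business_id": [1, 1, 2, 3],
    "stars": [4, 5, 2, 3],
})

# Build the boolean mask once (True where stars > 3),
# then group that mask by business_id and sum the True values.
vectorized = (reviews["stars"] > 3).groupby(reviews["business_id"]).sum()

print(vectorized)
```

Because the mask is grouped rather than filtered, every business_id stays in the result, including those whose count is 0.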
Equivalent SQL statement
SELECT
    business_id,
    SUM(IF(stars > 3, 1, 0)) AS `stars_>3`
FROM reviews
GROUP BY business_id;
or
SELECT
    business_id,
    COUNT(IF(stars > 3, 1, NULL)) AS `stars_>3`
FROM reviews
GROUP BY business_id;
Incorrect query
Python
reviews[reviews.stars > 3].groupby('business_id').size()
or
reviews[reviews.stars > 3].groupby('business_id')['stars'].count()
Equivalent SQL statement
SELECT
    business_id,
    SUM(IF(stars > 3, 1, 0)) AS `stars_>3`
FROM reviews
WHERE stars > 3
GROUP BY business_id;
or
SELECT
    business_id,
    COUNT(IF(stars > 3, 1, NULL)) AS `stars_>3`
FROM reviews
WHERE stars > 3
GROUP BY business_id;
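Why is it wrong?
As you can see, the incorrect Python query uses reviews[reviews.stars > 3] to keep only the rows with more than 3 stars before groupby('business_id'), which is equivalent to applying WHERE stars > 3 before GROUP BY business_id in SQL.
So suppose there is a business_id whose records all have stars <= 3. The incorrect query drops this business_id before grouping, so it never appears in the result, not even with a count of 0.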
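If the filter-first form is still preferred, one possible repair is to reindex its result against the full set of business_ids. This is only a sketch on hypothetical toy data; the names filtered, all_ids, and repaired are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; business 2 only has a 2-star review.
reviews = pd.DataFrame({
    "business_id": [1, 1, 2],
    "stars": [4, 5, 2],
})

# Filter first, then group: business 2 disappears from the result.
filtered = reviews[reviews.stars > 3].groupby("business_id").size()

# Restore the dropped groups by reindexing over every business_id,
# filling the missing counts with 0.
all_ids = pd.Index(reviews["business_id"].unique(), name="business_id")
repaired = filtered.reindex(all_ids, fill_value=0)

print(repaired)
```

This costs an extra pass to collect the ids, so the groupby-then-sum form shown in the correct query is usually the simpler choice.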
Any improvement?
Yes. You can improve the Python query by naming the result column. Pandas is not as convenient as PySpark here, but we can still name the column.
# Pass a df and apply the lambda function to column stars;
# sum() counts the True values so each group yields a single number
lambda_func = lambda df: pd.Series({'stars_>3': sum(df.stars > 3)})
reviews.groupby('business_id').apply(lambda_func)
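Another way to name the output column, sketched here under the assumption that a temporary helper column is acceptable, is pandas named aggregation. The helper column name above_3 and the toy data are hypothetical; the dictionary unpacking is needed because stars_>3 is not a valid Python keyword argument.

```python
import pandas as pd

# Hypothetical toy data for illustration.
reviews = pd.DataFrame({
    "business_id": [1, 1, 2],
    "stars": [4, 5, 2],
})

result = (
    reviews
    # Temporary boolean column (hypothetical name): True where stars > 3.
    .assign(above_3=reviews["stars"] > 3)
    .groupby("business_id")
    # Named aggregation: output column name -> (input column, function).
    .agg(**{"stars_>3": ("above_3", "sum")})
)

print(result)
```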
Evaluation
Generate a sample dataset
You can use the following code to evaluate:
import pandas as pd
import random
# define business_ids
business_ids = range(1, 4)
# define stars
stars = range(1, 6)
# Generate a sample table reviews
reviews = pd.DataFrame(columns = ['review_id', 'business_id', 'stars'])
for business_id in business_ids:
    for i in range(random.randrange(1, 5)): # Assume each business_id has 1~4 reviews
        review = [len(reviews)+1, business_id, random.choice(stars)]
        reviews.loc[len(reviews)] = review
reviews
My sample dataset:
|    | review_id | business_id | stars |
|----|-----------|-------------|-------|
| 0  | 1         | 1           | 4     |
| 1  | 2         | 1           | 5     |
| 2  | 3         | 1           | 4     |
| 3  | 4         | 1           | 1     |
| 4  | 5         | 2           | 3     |
| 5  | 6         | 2           | 5     |
| 6  | 7         | 2           | 2     |
| 7  | 8         | 2           | 3     |
| 8  | 9         | 3           | 3     |
| 9  | 10        | 3           | 1     |
| 10 | 11        | 3           | 3     |
Correct Python query
"""
business_id, stars_>3
1, 3
2, 1
3, 0
"""
# Pass a df and apply the lambda function to column stars
lambda_func = lambda df: pd.Series({'stars_>3': sum(df.stars > 3)})
reviews.groupby('business_id').apply(lambda_func)
| business_id | stars_>3 |
|-------------|----------|
| 1           | 3        |
| 2           | 1        |
| 3           | 0        |
Incorrect Python query
reviews[reviews.stars > 3].groupby('business_id')['stars'].count()
Output:
business_id
1 3
2 1
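Since the generator above is random, the comparison is easier to reproduce with the sample dataset hard-coded. The sketch below rebuilds that 11-row table by hand and lists the business_ids that the incorrect query loses; the variable names correct, incorrect, and missing are introduced here for illustration only.

```python
import pandas as pd

# The 11-row sample dataset shown above, typed in by hand.
reviews = pd.DataFrame({
    "review_id": list(range(1, 12)),
    "business_id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "stars": [4, 5, 4, 1, 3, 5, 2, 3, 3, 1, 3],
})

# Group first, then count the reviews above 3 stars in each group.
correct = reviews.groupby("business_id")["stars"].apply(lambda s: (s > 3).sum())

# Filter first, then group: groups with no matching rows vanish.
incorrect = reviews[reviews.stars > 3].groupby("business_id")["stars"].count()

# business_ids present in the correct result but absent from the incorrect one.
missing = correct.index.difference(incorrect.index)
print(missing.tolist())  # [3]
```

Business 3 has only reviews with 3 stars or fewer, so it is the one that goes missing.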