【Question title】: Adding or deriving two columns from a single column
【Posted】: 2021-02-03 00:19:55
【Question description】:

Given the following dataframe:


+-----------+----------+------------+------+
|Customer_ID|Project_ID|QUESTION_TYP|ANSWER|
+-----------+----------+------------+------+
|1          |1         |2nd QUES    |YES   |
|1          |1         |2nd QUES    |YES   |
|1          |2         |2nd QUES    |NO    |
|1          |2         |2nd Ques.   |Yes   |
+-----------+----------+------------+------+

How can I add two more columns to the dataframe above, for example col_1 holding the count of Yes answers and col_2 holding the count of No answers?

+-----------+----------+------------+------+-----+-----+
|Customer_ID|Project_ID|QUESTION_TYP|ANSWER|col_1|col_2|
+-----------+----------+------------+------+-----+-----+
|1          |1         |2nd QUES    |YES   |2    |0    |
|1          |2         |2nd QUES    |NO    |0    |1    |
|1          |2         |2nd Ques.   |Yes   |1    |0    |
+-----------+----------+------------+------+-----+-----+

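Please help. I have tried most of the solutions I could find, but they only give me the overall Yes or No totals, whereas I need the counts on each row.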

【Comments on the question】:

  • What have you tried?

Tags: python sql dataframe pyspark


【Solution 1】:

This looks like a job for window functions. Assuming you want the counts per customer_id:

select t.*,
       sum(case when answer = 'YES' then 1 else 0 end) over (partition by customer_id) as col_1,
       sum(case when answer = 'NO' then 1 else 0 end) over (partition by customer_id) as col_2
from t;

Or, if you want the counts over all of the data, simply replace (partition by customer_id) with ()
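As a minimal sketch, both forms of the query can be tried on the question's sample rows with Python's built-in sqlite3 module. Running it in SQLite rather than Spark SQL is an assumption made only to keep the example self-contained (it needs SQLite 3.25 or later for window functions); the table name t comes from the answer, and the Python variable names are illustrative.

```python
import sqlite3

# In-memory database holding the question's sample rows (table name t as in the answer).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (customer_id int, project_id int, question_typ text, answer text)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        (1, 1, "2nd QUES", "YES"),
        (1, 1, "2nd QUES", "YES"),
        (1, 2, "2nd QUES", "NO"),
        (1, 2, "2nd Ques.", "Yes"),
    ],
)

# Counts per customer: every row keeps its own columns and gains both totals.
per_customer = conn.execute("""
    select t.*,
           sum(case when answer = 'YES' then 1 else 0 end) over (partition by customer_id) as col_1,
           sum(case when answer = 'NO' then 1 else 0 end) over (partition by customer_id) as col_2
    from t
""").fetchall()

# Counts over the whole table: an empty window specification, over ().
overall = conn.execute("""
    select t.*,
           sum(case when answer = 'YES' then 1 else 0 end) over () as col_1,
           sum(case when answer = 'NO' then 1 else 0 end) over () as col_2
    from t
""").fetchall()

for row in per_customer:
    print(row)
```

With this data every row gets col_1 = 2 and col_2 = 1 in both queries, because there is only one customer. The comparison answer = 'YES' is case-sensitive here, so the row whose answer is 'Yes' is not counted; comparing upper(answer) instead would include it.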

Note: the first query puts both counts on every row. If you only want a count in the column that matches that row's answer:

select t.*,
       (case when answer = 'YES'
             then sum(case when answer = 'YES' then 1 else 0 end) over (partition by customer_id)
             else 0
        end) as col_1,
       (case when answer = 'NO'
             then sum(case when answer = 'NO' then 1 else 0 end) over (partition by customer_id)
             else 0
        end) as col_2
from t;

However, I think it makes sense to have both counts.
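The sketch below adapts the second query so that it reproduces the expected output shown in the question. Two changes are assumptions that go beyond the answer: the window is partitioned by customer_id, project_id rather than customer_id alone, and answers are compared through upper(answer) so that 'Yes' and 'YES' count together. It runs in SQLite (3.25 or later) through Python's sqlite3 module purely for self-containment, not because the question uses SQLite.

```python
import sqlite3

# In-memory database holding the question's sample rows (table name t as in the answer).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (customer_id int, project_id int, question_typ text, answer text)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        (1, 1, "2nd QUES", "YES"),
        (1, 1, "2nd QUES", "YES"),
        (1, 2, "2nd QUES", "NO"),
        (1, 2, "2nd Ques.", "Yes"),
    ],
)

# Assumptions not in the original answer: partition by (customer_id, project_id)
# and a case-insensitive match via upper(answer).
result = conn.execute("""
    select t.*,
           (case when upper(answer) = 'YES'
                 then sum(case when upper(answer) = 'YES' then 1 else 0 end)
                      over (partition by customer_id, project_id)
                 else 0
            end) as col_1,
           (case when upper(answer) = 'NO'
                 then sum(case when upper(answer) = 'NO' then 1 else 0 end)
                      over (partition by customer_id, project_id)
                 else 0
            end) as col_2
    from t
""").fetchall()

for row in result:
    print(row)
```

Each row now carries a non-zero count only in the column that matches its own answer: the two YES rows of project 1 get (2, 0), and in project 2 the NO row gets (0, 1) while the Yes row gets (1, 0). Unlike the question's expected table, the duplicate YES row is kept; adding distinct to the select would collapse it.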

【Discussion】:
