【Question title】: How to reshape using two variables
【Posted】: 2015-07-11 19:26:32
【Question】:

Suppose I have this data:

group    obs    data    data_A    data_B
1        1      7_a     7_a       
1        2      4_b               4_b  
1        3      1_a     1_a     
2        1      5_b               5_b
3        1                  
4        1      3_b               3_b
4        2      4_b               4_b
4        3      9_a     9_a     
4        4      8_b               8_b   

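data_A and data_B are constructed from data. They follow this rule: data_A takes the value of data if data ends with a, and data_B takes the value of data if data ends with b; if data is blank, both data_A and data_B stay blank.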
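The construction rule above can be sketched in Stata as follows. This is only an illustration, not code from the original post: it assumes data is a string variable whose last character is either a or b, and it uses a shortened copy of the example data.

```stata
clear
input group obs str3 data
1 1 "7_a"
1 2 "4_b"
1 3 "1_a"
2 1 "5_b"
3 1 ""
end

* the last character of data decides which variable receives the value;
* a blank data leaves both new variables blank
gen data_A = cond(substr(data, -1, 1) == "a", data, "")
gen data_B = cond(substr(data, -1, 1) == "b", data, "")

list, sepby(group)
```

A negative start position in substr() counts from the end of the string, so substr(data, -1, 1) picks out the final character.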

I would like to reshape the data as follows:

group    data_A1    data_A2    data_B1    data_B2    data_B3
1        7_a        1_a        4_b                     
2                              5_b              
3                                            
4        9_a                   3_b        4_b         8_b    

The number of columns is determined automatically by the number of values.

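7_a and 9_a are in data_A1 because each is the first instance of an a value within its group. 1_a is in data_A2 because it is the second instance of an a value within its group, and so on.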

How can this be done?

(I am aware of reshape, which can be used in similar situations.)

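【Comments】:

  • I took the liberty of rephrasing the question based on your comment on the initial answer I provided, which I will delete. I have posted a new answer that fits your question better.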

Tags: stata reshape


【Solution 1】:

One way is to use a loop. It is not very elegant, but it works.

clear
set more off

*----- example data -----

input ///
group    obs    str3(data    data_A    data_B)
1        1      7_a     7_a           ""
1        2      4_b       ""        4_b  
1        3      1_a     1_a          ""
2        1      5_b      ""         5_b
3        1       ""        ""       ""
4        1      3_b       ""        3_b
4        2      4_b       ""        4_b
4        3      9_a     9_a          ""
4        4      8_b       ""        8_b   
end

drop data
list, sepby(group)

*----- what you want -----

quietly foreach i in A B {

    bysort group (obs) : gen count_`i' = sum(!missing(data_`i'))
    summarize count_`i', meanonly

    forvalues j = 1/`r(max)' {
        gen data_`i'`j' = ""
        replace data_`i'`j' = data_`i' if count_`i' == `j'
    }

    drop count_`i'
}

drop data_?

collapse (firstnm) data_*, by(group)

list

Another way uses two reshapes and fillin:

clear
set more off

*----- example data -----

input ///
group    obs    str3(data    data_A    data_B)
1        1      7_a     7_a           ""
1        2      4_b       ""        4_b  
1        3      1_a     1_a          ""
2        1      5_b      ""         5_b
3        1       ""        ""       ""
4        1      3_b       ""        3_b
4        2      4_b       ""        4_b
4        3      9_a     9_a          ""
4        4      8_b       ""        8_b   
end

drop data

list, sepby(group)

*----- what you want -----

// first reshape
reshape long data_ , i(group obs) j(j) string

// counts per group j
bysort group j (obs) : gen count = sum(!missing(data_))

// concatenate and rectangularize
gen j2 = j + string(count)
fillin group j2

// drop some observations
bysort group j2 (data_) : drop if _n < _N | inlist(j2, "A0", "B0")

// keep necessary variables
keep group j2 data_

// second reshape
reshape wide data_, i(group) j(j2) string

list

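I find the loop solution more intuitive.

Your target data structure is rather odd. It is always a good idea to include some context about your end goal.

【Comments】: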

  • I ended up using the loop code. I added: foreach i of varlist data_A1-data_B3 { bysort group (obs): replace `i' = `i'[_n-1] if !missing(`i'[_n-1]) } and then kept only the last observation of each group.
【Solution 2】:

I agree with Roberto that this is an odd thing to do. Here is another fun way to get there:

clear
input float(group obs) str3(data data_A data_B)
1 1 "7_a" "7_a" "" 
1 2 "4_b" "" "4_b" 
1 3 "1_a" "1_a" "" 
2 1 "5_b" "" "5_b" 
3 1 "" "" "" 
4 1 "3_b" "" "3_b" 
4 2 "4_b" "" "4_b" 
4 3 "9_a" "9_a" "" 
4 4 "8_b" "" "8_b" 
end

* verify assumptions about the data
isid group obs, sort

* concatenate values across obs
by group (obs): replace data_A = data_A[_n-1] + " " + data_A
by group (obs): replace data_B = data_B[_n-1] + " " + data_B

* the last obs of the group contains all values
by group: keep if _n == _N

* split each concatenated string
split data_A
split data_B

drop obs data data_A data_B
list
