【Question title】: Subsetting data loses all my observations
【Posted】: 2015-04-03 02:26:34
【Question】:

I have a data frame, Test, that I want to subset, but when I try, I lose all of my observations. Why does this happen?

> str(Test)
'data.frame':   157025 obs. of  13 variables:
$ Cancellations    : int  1 1 1 1 1 1 1 1 1 1 ...
$ Benefit          : chr  "Single Parent Support                          "               "Single Parent Support                          " "Job Seeker                                         " "Job Seeker                                     " ...
$ Region           : chr  "        Northland    " "        Northland    " "            Northland    " "        Northland    " ...
$ Month            : chr  "Jun 14" "Jun 14" "Jun 14" "Jun 14" ...
$ CanReason        : chr  "Change in Marital Status           " "Change in     Marital Status           " "Change in Marital Status           " "Change in     Marital Status           " ...
$ Age              : chr  " 20-24 " " 20-24 " " 20-24 " " 20-24 " ...
$ Ethnicity        : chr  "NZ European/Pakeha" "Maori             " "Other                      " "NZ European/Pakeha" ...
$ SMS              : chr  "General Case Management               " "Work     Focused Case Management          " "Work Focused Case Management          " "Work     Search Support                   " ...
$ Duration         : chr  "2-4 yrs " "2-4 yrs " "6-9 mth " "0-3 mth " ...
$ SMSDuration      : int  361 348 59 69 150 37 63 294 107 107 ...
$ AgeYoungest      : chr  "0-4 yrs    " "0-4 yrs    " "No Children" "No    Children" ...
$ AgeYoungestNonSub: chr  "0-4 yrs" "0-4 yrs" "No Children" "No Children" ...
$ Liability        : chr  " 166,000 " " 166,000 " " 102,000 " " 102,000 " ...


> subDie <- Test[CanReason == "Died",]

> str(subDie)
'data.frame':   0 obs. of  13 variables:
$ Cancellations    : int 
$ Benefit          : chr 
$ Region           : chr 
$ Month            : chr 
$ CanReason        : chr 
$ Age              : chr 
$ Ethnicity        : chr 
$ SMS              : chr 
$ Duration         : chr 
$ SMSDuration      : int 
$ AgeYoungest      : chr 
$ AgeYoungestNonSub: chr 
$ Liability        : chr 

I tried converting the factor variables to character. When I put the comma before the "CanReason" index (subDie

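【Comments】:

  • Is the value "Died", or "Died" plus extra spaces?
  • It is probably the extra spaces that @Pascal mentioned, but dput(head(Test)) would be more useful than str.
  • I just tried "Died " (1 extra space), "Died  " (2 extra spaces), and "Died   " (3 extra spaces). No luck.
  • For example, "Change in Marital Status" has 10 extra spaces, as you can see in the str output. You cannot guess the number of extra spaces by chance.
  • Can you show the output of sort(unique(Test$CanReason))?

Tags: r subset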
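
The inspection the comments ask for can be sketched as follows. This is only an illustration on a small made-up data frame: `toy` and its values are hypothetical stand-ins for `Test`, with padding that mimics the str output in the question.

```r
# Hypothetical stand-in for Test; the padding imitates the str() output above.
toy <- data.frame(
  CanReason = c("Died          ", "Change in Marital Status   ", " Died "),
  Cancellations = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Exact comparison finds nothing, because no stored value is exactly "Died".
n_exact <- sum(toy$CanReason == "Died")

# dput() prints each distinct value inside quotes, so the padding is visible.
dput(sort(unique(toy$CanReason)))

# nchar() shows lengths that are longer than the visible text.
lens <- nchar(toy$CanReason)
print(lens)
```

Seeing the quoted values makes it clear why guessing one, two, or three trailing spaces is unlikely to hit the stored string.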


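【Solution 1】:

Use a regular expression to search the character vector CanReason for the string "Died". grepl() returns a logical vector that is TRUE for each element that matches and FALSE otherwise. Use that vector to subset Test.

For example: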

set.seed(12)
CanReason <- sample(c("Change in      Marital status",
                      "Change in   Marital status ",
                      " Died    ",
                      "Died                ",
                      "Died"), 10000, replace = TRUE)
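# TRUE wherever an element contains "Died", regardless of surrounding spaces
ind <- grepl("Died", CanReason)

sum(ind)                 # number of matching elements
length(CanReason[ind])   # same count, obtained by subsetting
head(CanReason[ind])     # first few matched values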

This gives:

> sum(ind)
[1] 6037
> length(CanReason[ind])
[1] 6037
> head(CanReason[ind])
[1] "Died"                 "Died"                 "Died                "
[4] "Died"                 " Died    "            " Died    "
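
The answer stops at the logical index, so here is a hedged sketch of the remaining step: subsetting a data frame with it. It also shows one alternative for cases where an exact match is preferred over a pattern match. The data frame `toy` is hypothetical, not the asker's data. `trimws()` is available in base R from version 3.2.0; on older versions, `gsub("^\\s+|\\s+$", "", x)` should do the same job.

```r
# Hypothetical stand-in for Test with inconsistently padded values.
toy <- data.frame(
  CanReason = c(" Died    ", "Died", "Change in   Marital status ", "Died   "),
  SMSDuration = c(361L, 348L, 59L, 69L),
  stringsAsFactors = FALSE
)

# Pattern match: keeps every row whose CanReason contains "Died".
sub_grepl <- toy[grepl("Died", toy$CanReason), ]

# Exact match after stripping leading and trailing whitespace.
toy$CanReason <- trimws(toy$CanReason)
sub_exact <- toy[toy$CanReason == "Died", ]

print(nrow(sub_grepl))
print(nrow(sub_exact))
```

The pattern match would also keep any value that merely contains "Died" as part of a longer reason, which trimming first avoids. Note the `toy$` prefix: inside `[`, a bare `CanReason` is looked up in the workspace rather than in the data frame's columns.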

【Discussion】:
