【Question title】: How to use tapply to match a specific condition
【Posted】: 2018-02-18 23:23:03
【Question description】:

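I have an accident dataset that records reported accidents. I am trying to use tapply to show the total number of accidents reported on "THURSDAY". However, instead of returning the number of accidents reported on that day, it returns the total number of rows in my dataset. This is the tapply call I am using: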

tapply(myfinal$VEHICLE_COUNT,myfinal$DAY_OF_WEEK=='THURSDAY',length)

My sample dataset is as follows:

> dput(tail(myfinal,5))
structure(list(CASE_NUMBER = c("1251045636", "1251045630", "1251045591", 
"1251045574", "1250010434"), BARRACK = c("Frederick", "Frederick", 
"Frederick", "Frederick", "Jessup"), ACC_DATE = c("2012-12-31T00:00:00", 
"2012-12-31T00:00:00", "2012-12-31T00:00:00", "2012-12-31T00:00:00", 
"2012-12-31T00:00:00"), ACC_TIME = c("18:12", "18:12", "12:12", 
"9:12", "11:12"), ACC_TIME_CODE = c("5", "5", "4", "3", "3"), 
    DAY_OF_WEEK = c("MONDAY   ", "MONDAY   ", "MONDAY   ", "MONDAY   ", 
    "MONDAY   "), ROAD = c("IS 00070 EISENHOWER MEMOR HWY", "MD 00077 ROCKY RIDGE RD", 
    "MD 00085 BUCKEYSTOWN PIKE", "MD 00017 MYERSVILLE RD", "IS 00070 No Name"
    ), INTERSECT_ROAD = c("CO 00248 MONUMENT RD", "MD 00076 MOTTERS STATION RD", 
    "CO 00308 MANOR WOODS RD", "CO 00941 DAWN CT", "US 00029 Columbia Pike"
    ), DIST_FROM_INTERSECT = c("300", "0", "400", "500", "0.25"
    ), DIST_DIRECTION = c("E", "U", "S", "S", "E"), CITY_NAME = c("Not Applicable", 
    "Not Applicable", "Not Applicable", "Not Applicable", NA), 
    COUNTY_CODE = c("10", "10", "10", "10", "13"), COUNTY_NAME = c("Frederick", 
    "Frederick", "Frederick", "Frederick", "Howard"), VEHICLE_COUNT = c(1, 
    2, 2, 1, 2), PROP_DEST = c("NO", "YES", "YES", "NO", "NO"
    ), INJURY = c("YES", "NO", "NO", "YES", "YES"), COLLISION_WITH_1 = c("FIXED OBJ", 
    "VEH", "VEH", "NON-COLLISION", "VEH"), COLLISION_WITH_2 = c("OTHER-COLLISION", 
    "OTHER-COLLISION", "OTHER-COLLISION", "OTHER-COLLISION", 
    "OTHER-COLLISION")), .Names = c("CASE_NUMBER", "BARRACK", 
"ACC_DATE", "ACC_TIME", "ACC_TIME_CODE", "DAY_OF_WEEK", "ROAD", 
"INTERSECT_ROAD", "DIST_FROM_INTERSECT", "DIST_DIRECTION", "CITY_NAME", 
"COUNTY_CODE", "COUNTY_NAME", "VEHICLE_COUNT", "PROP_DEST", "INJURY", 
"COLLISION_WITH_1", "COLLISION_WITH_2"), row.names = 18634:18638, class = "data.frame")

Any suggestions on how to fix this? Thanks in advance!

【Comments】:

    Tags: r tapply


    【Solution 1】:

    If for some reason you can only use tapply, then, building on Maurits's answer, you should be able to do this:

    tapply(myfinal$VEHICLE_COUNT,trimws(myfinal$DAY_OF_WEEK)=='THURSDAY',length)
    

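    Or something similar. It looks like the strings in the DAY_OF_WEEK variable have trailing spaces. You need to either remove them (via trimws) or change the comparison string to include those spaces (e.g., myfinal$DAY_OF_WEEK=="THURSDAY "). The == operator treats two strings as equal only if they match character for character, so any extra whitespace in either string makes the comparison FALSE.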
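The effect of the trailing spaces can be reproduced on a small made-up vector. This is only a sketch: the `day` and `vehicles` vectors below are hypothetical stand-ins for `myfinal$DAY_OF_WEEK` and `myfinal$VEHICLE_COUNT`, padded the same way as the sample data.

```r
# Hypothetical miniature of the two columns; the day names carry trailing spaces.
day <- c("MONDAY   ", "THURSDAY ", "THURSDAY ", "FRIDAY   ")
vehicles <- c(1, 2, 2, 1)

# The exact comparison never matches because of the trailing spaces,
# so every row lands in the single FALSE group.
raw_match <- day == "THURSDAY"
all_rows <- tapply(vehicles, raw_match, length)

# After trimming, the THURSDAY rows form their own TRUE group.
trimmed_match <- trimws(day) == "THURSDAY"
counts <- tapply(vehicles, trimmed_match, length)

print(all_rows)
print(counts)
```

With the untrimmed comparison there is only a FALSE group whose length equals the number of rows, which matches the symptom described in the question.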

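    【Discussion】:

    • Is there a way to show only the TRUE condition? At the moment it shows both the TRUE and the FALSE condition.
    • The most direct way is tapply(myfinal$VEHICLE_COUNT,trimws(myfinal$DAY_OF_WEEK)=='THURSDAY',length)["TRUE"]. For future reference, there are better ways to write this kind of code that look cleaner and are probably easier to maintain. For example, sum(trimws(myfinal$DAY_OF_WEEK)=='THURSDAY')
    • Could you answer another question related to the same topic?
    • Sure, although if it is sufficiently separate from this question, you should technically make a new post.
    • @Daniel..ok. Thanks! It is about tapply only. I am trying to find the number of accidents that caused an injury. How would that be solved using only tapply? There are some NAs in the INJURY and VEHICLE_COUNT columns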
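The follow-up about injuries and NA values was not answered in the thread. One possible approach with tapply alone is sketched below. The `accidents` data frame, its values, and the NA positions are all made up for illustration, and it assumes INJURY is coded "YES"/"NO" as in the sample data.

```r
# Hypothetical data: values and NA positions are invented for illustration.
accidents <- data.frame(
  INJURY = c("YES", "NO", NA, "YES", "YES"),
  VEHICLE_COUNT = c(1, 2, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Group by INJURY and count rows per group. Rows whose INJURY is NA
# belong to no group and are left out of the result.
by_injury <- tapply(accidents$VEHICLE_COUNT, accidents$INJURY, length)

# Keep only the "YES" group, mirroring the ["TRUE"] subsetting above.
injury_accidents <- by_injury["YES"]

# length() still counts rows where VEHICLE_COUNT is NA; to count only
# rows with a known VEHICLE_COUNT, count the non-NA values instead.
known_only <- tapply(accidents$VEHICLE_COUNT, accidents$INJURY,
                     function(x) sum(!is.na(x)))

print(injury_accidents)
print(known_only)
```

Whether rows with a missing VEHICLE_COUNT should count as accidents depends on what the question is really asking, so both variants are shown.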
    【Solution 2】:

    A base R solution is to subset the rows where DAY_OF_WEEK is "THURSDAY" and then return the number of rows:

    nrow(df[df$DAY_OF_WEEK == "THURSDAY",])
    

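    【Discussion】:

    • It returns a value of 0. I was thinking along the same lines.
    • That is because his dataset contains extra whitespace. In Daniel's answer, he uses the trimws command to remove the spaces. I simply ignored them, since this is the (general) solution to the problem.
    【Solution 3】:

    Using tapply here really makes no sense!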

    Method 1

    Using dplyr:

    library(dplyr)
    df %>% filter(trimws(DAY_OF_WEEK) == "MONDAY") %>% summarise(count = n())
    #  count
    #1     5
    

    Method 2

    In base R, using subset and table:

    table(subset(df, trimws(DAY_OF_WEEK) == "MONDAY")$DAY_OF_WEEK)
    #MONDAY
    #    5
    

    I used "MONDAY" here because your sample data has no entries with DAY_OF_WEEK = "THURSDAY".


    Sample data

    df <- structure(list(CASE_NUMBER = c("1251045636", "1251045630", "1251045591",
    "1251045574", "1250010434"), BARRACK = c("Frederick", "Frederick",
    "Frederick", "Frederick", "Jessup"), ACC_DATE = c("2012-12-31T00:00:00",
    "2012-12-31T00:00:00", "2012-12-31T00:00:00", "2012-12-31T00:00:00",
    "2012-12-31T00:00:00"), ACC_TIME = c("18:12", "18:12", "12:12",
    "9:12", "11:12"), ACC_TIME_CODE = c("5", "5", "4", "3", "3"),
        DAY_OF_WEEK = c("MONDAY   ", "MONDAY   ", "MONDAY   ", "MONDAY   ",
        "MONDAY   "), ROAD = c("IS 00070 EISENHOWER MEMOR HWY", "MD 00077 ROCKY RIDGE RD",
        "MD 00085 BUCKEYSTOWN PIKE", "MD 00017 MYERSVILLE RD", "IS 00070 No Name"
        ), INTERSECT_ROAD = c("CO 00248 MONUMENT RD", "MD 00076 MOTTERS STATION RD",
        "CO 00308 MANOR WOODS RD", "CO 00941 DAWN CT", "US 00029 Columbia Pike"
        ), DIST_FROM_INTERSECT = c("300", "0", "400", "500", "0.25"
        ), DIST_DIRECTION = c("E", "U", "S", "S", "E"), CITY_NAME = c("Not Applicable",
        "Not Applicable", "Not Applicable", "Not Applicable", NA),
        COUNTY_CODE = c("10", "10", "10", "10", "13"), COUNTY_NAME = c("Frederick",
        "Frederick", "Frederick", "Frederick", "Howard"), VEHICLE_COUNT = c(1,
        2, 2, 1, 2), PROP_DEST = c("NO", "YES", "YES", "NO", "NO"
        ), INJURY = c("YES", "NO", "NO", "YES", "YES"), COLLISION_WITH_1 = c("FIXED OBJ",
        "VEH", "VEH", "NON-COLLISION", "VEH"), COLLISION_WITH_2 = c("OTHER-COLLISION",
        "OTHER-COLLISION", "OTHER-COLLISION", "OTHER-COLLISION",
        "OTHER-COLLISION")), .Names = c("CASE_NUMBER", "BARRACK",
    "ACC_DATE", "ACC_TIME", "ACC_TIME_CODE", "DAY_OF_WEEK", "ROAD",
    "INTERSECT_ROAD", "DIST_FROM_INTERSECT", "DIST_DIRECTION", "CITY_NAME",
    "COUNTY_CODE", "COUNTY_NAME", "VEHICLE_COUNT", "PROP_DEST", "INJURY",
    "COLLISION_WITH_1", "COLLISION_WITH_2"), row.names = 18634:18638, class = "data.frame")
    

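    【Discussion】:

    • I only want to use the tapply function
    • @YesBoss Using tapply here really makes no sense. I have updated my solution to show you two approaches.
    • Thanks for suggesting the alternatives. Thanks again for your help!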