【Question title】: How to create a new column with a substring based on a column with long strings using ifelse and grepl?
【Posted】: 2017-06-06 15:59:55
【Question description】:

The first few rows of the ac$summary column:

    1
    during a demonstration flight, a u.s. army flyer flown by orville wright nose-dived into the ground from a height of approximately 75 feet, killing lt. thomas e. selfridge who was a passenger. this was the first recorded airplane fatality in history. one of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft. orville wright suffered broken ribs, pelvis and a leg. selfridge suffered a crushed skull and died a short time later.
    2
    first u.s. dirigible akron exploded just offshore at an altitude of 1,000 ft. during a test flight.
    3
    the first fatal airplane accident in canada occurred when american barnstormer, john m. bryant, california aviator was killed.
    4
    the airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of helgoland island into the sea. the ship broke in two and the control car immediately sank drowning its occupants.
    5
    hydrogen gas which was being vented was sucked into the forward engine and ignited causing the airship to explode and burn at 3,000 ft..
    6
    crashed into trees while attempting to land after being shot down by british and french aircraft.
    7
    exploded and burned near neuwerk island, when hydrogen gas, being vented, was ignited by lightning.
    8
    crashed near the black sea, cause unknown.
    9
    shot down by british aircraft crashing in flames.
    10
    shot down in flames by the british 39th home defence squadron.
    11
    crashed in a storm.
    12
    shot down by british anti-aircraft fire and aircraft and crashed into the north sea.
    13
    caught fire and crashed. 

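I want to create a column ac$sumnew based on ac$summary.

I wrote the following code, but it does not return the desired output when & and | are used. When | is used, the results are inconsistent: sometimes right, sometimes wrong.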

    ac$sumnew = ifelse(grepl("missing & crashed",ac$Summary),"missing and crashed",
        ifelse(grepl("shot | crashed",ac$Summary),"shot down and crashed",
        ifelse(grepl("struck | lightening",ac$Summary),"struck by lightening and crashed",
         ifelse(grepl("struck | bird & crashed",ac$Summary),"struck by bird and crashed",
         ifelse(grepl("exploded | crashed",ac$Summary),"exploded and crashed",
         ifelse(grepl("engine | failure",ac$Summary),"engine failure",
         ifelse(grepl("fog | crashed",ac$Summary),"crashed due to heavy fog",
         ifelse(grepl("fire | crashed",ac$Summary),"caught fire and crashed",
         ifelse(grepl("shot",ac$Summary),"shot down",             
         ifelse(grepl("crashed",ac$Summary),"Crashed",
         ifelse(grepl("shot",ac$Summary),"Shot down",
         ifelse(grepl("disappeared",ac$Summary),"Disappeared",
         ifelse(grepl("struck | obstacle | crashed ",ac$Summary),"struck by obstacle and Crashed",
         ifelse(grepl("crashed",ac$Summary),"crashed",
         ifelse(grepl("exploded",ac$Summary),"exploded",
         ifelse(grepl("fire",ac$Summary),"caught fire","others"))))))))))))))))

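For example, if the aircraft was shot, it should return "shot down".

If it just crashed, the output should be "crashed".

If it both went missing and crashed, it should return "missing and crashed".

I can't get this part right using & and |.

The output I get looks like this: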

1
others
2
exploded and crashed
3
others
4
others
5
engine failure
6
shot down and crashed
7
exploded and crashed
8
Crashed
9
shot down and crashed
10
shot down and crashed
11
Crashed
12
missing and crashed
13
missing and crashed
14
missing and crashed
15
Crashed
16
shot down and crashed
17
shot down and crashed

【Comments】:

    Tags: r if-statement grepl


    【Solution 1】:

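    I think you have a hierarchy (ordering) problem. R tests these conditions in order, so you have to arrange them appropriately. Here is a link that helps with this: https://www.programiz.com/r-programming/if-else-statement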
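To see why the order matters, here is a minimal sketch. The vector `x` is made up for illustration and is not the asker's data. When the general test comes first, the more specific branch can never be reached.

```r
# Made-up sentences for illustration only.
x <- c("shot down by aircraft and crashed", "crashed in a storm")

# General test first: every element containing "crash" stops here,
# so the "shot down and crashed" branch is unreachable.
general_first <- ifelse(grepl("crash", x), "crashed",
                 ifelse(grepl("shot", x) & grepl("crash", x), "shot down and crashed",
                        "others"))

# Specific test first: each element takes the first branch that fits it.
specific_first <- ifelse(grepl("shot", x) & grepl("crash", x), "shot down and crashed",
                  ifelse(grepl("crash", x), "crashed",
                         "others"))

print(general_first)
print(specific_first)
```

In the first version both elements come out as "crashed"; in the second, the first element is labelled "shot down and crashed".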

    ac$new  <-ifelse(apply(sapply(c("struck","bird","crash"), grepl, as.character(s$s)), 1, all) ,"struck by bird and crashed",
              ifelse(apply(sapply(c("struck","obstacle","crash"), grepl, as.character(s$s)), 1, all) ,"struck by obstacle and Crashed",
              ifelse(apply(sapply(c("miss" , "crash"), grepl, as.character(s$s)), 1, all) ,"missing and crashed",
              ifelse(apply(sapply(c("shot" , "crash"), grepl, as.character(s$s)), 1, all) ,"shot down and crashed",
              ifelse(apply(sapply(c("struck","lightening"), grepl, as.character(s$s)), 1, all) ,"struck by lightening and crashed",
              ifelse(apply(sapply(c("explode","crash"), grepl, as.character(s$s)), 1 , all) ,"exploded and crashed",
              ifelse(apply(sapply(c("engine|failure"), grepl, as.character(s$s)), 1 , all) ,"engine failure",
              ifelse(apply(sapply(c("fog","crash"), grepl, as.character(s$s)) , 1, all) ,"crashed due to heavy fog",
              ifelse(apply(sapply(c("fire","crash"), grepl, as.character(s$s)), 1, all) ,"caught fire and crashed",
              ifelse(apply(sapply("shot", grepl, as.character(s$s)), 1, all) ,"shot down",
              ifelse(apply(sapply("crash", grepl, as.character(s$s)), 1, all), "crashed",
              ifelse(apply(sapply("explode", grepl, as.character(s$s)), 1, all), "exploded",
              ifelse(apply(sapply("fire", grepl, as.character(s$s)), 1, all),"caught fire",
              ifelse(apply(sapply("disappear", grepl, as.character(s$s)), 1, all), "Disappeared","others"))))))))))))))
    

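    This works by checking that all of the words in c() are present and then assigning the corresponding value to ac$new, except for engine|failure, which is a single pattern that matches either word. Also, because we are dealing with words, you want to use the simplest stem so that all variants are caught: for example, use "miss" rather than "missing".

    I get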
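The patterns in the question behave differently from what the asker expected because of how `grepl` reads them. A short sketch, using three of the summaries from the question (the variable names are illustrative):

```r
x <- c("shot down by british aircraft crashing in flames.",
       "crashed in a storm.",
       "caught fire and crashed.")

# "|" is regex alternation, and the spaces belong to each alternative:
# this pattern means "shot " OR " crashed".
either <- grepl("shot | crashed", x)

# "&" has no special meaning in a regular expression;
# it only matches a literal "&" character in the text.
literal_amp <- grepl("fire & crashed", x)

# To require both words, combine two grepl() calls with the logical operator &.
both <- grepl("fire", x) & grepl("crash", x)

print(either)       # TRUE FALSE TRUE
print(literal_amp)  # FALSE FALSE FALSE
print(both)         # FALSE FALSE TRUE
```

The second summary starts with "crashed", so " crashed" with a leading space does not match it, which is one reason the results with `|` looked irregular.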

    1                   others
    2                 exploded
    3                   others
    4                  crashed
    5           engine failure
    6    shot down and crashed
    7                 exploded
    8                  crashed
    9    shot down and crashed
    10               shot down
    11                 crashed
    12   shot down and crashed
    13 caught fire and crashed
    

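    Some of the results above differ from yours because I check that all of the words are present. The reason I check all of the words is that you already identify the single words in the later half of the ifelse chain. I did an eyeball test, and I think the results are correct when all words are checked.

    By the way, this is tedious, especially if you want to extend the list. You may want to use something like: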

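    library(dplyr)  # for %>%, mutate, filter, group_by, summarise, full_join, arrange

    # t is assumed to be the character vector of summaries
    ac <- data.frame(s = as.character(t), word.que = seq(1, length(t), by = 1))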
    
    ac$word.count <- sapply(gregexpr(" ", ac$s), length) + 1
    
    new.mat <- data.frame(word.que = rep.int(ac$word.que, ac$word.count), word = unlist(strsplit(as.character(ac$s), split = " ")))
    words.of.interest <- c("struck|bird|crash|obstacle|miss|shot|struck|lightening|explode|engine|failure|fog|fire|disappear")
    new.mats<- new.mat %>%
               mutate(word = gsub("\\,", "", gsub("\\.", "", word))) %>%
               mutate(word.interest = ifelse(grepl(words.of.interest, as.character(word)), 1, 0)) %>%
               filter(word.interest == 1) %>%
               group_by(word.que) %>% 
               summarise(word.list = paste0(unique(word), collapse = "; ")) %>%
               full_join(ac, by = "word.que" ) %>%
               arrange(word.que) %>%
               mutate(word.list = ifelse(is.na(word.list), 'other', word.list))
    

    This creates a more efficient search list for you to build on. The result is

       word.que           word.list
    1         1               other
    2         2            exploded
    3         3               other
    4         4            crashing
    5         5     engine; explode
    6         6       crashed; shot
    7         7            exploded
    8         8             crashed
    9         9      shot; crashing
    10       10                shot
    11       11             crashed
    12       12 shot; fire; crashed
    13       13       fire; crashed
    

    along with your text variable and word.count. In the long run this may also be more efficient.

    【Discussion】:
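Another way to keep the chain easy to extend is to store the rules as data and apply them in order. The following is only a sketch in base R, not the answerer's method: `has_all`, `classify`, and `rules` are hypothetical names introduced here, and the rule list is a shortened version of the one above.

```r
# TRUE for each element of `text` that contains every stem in `words`.
has_all <- function(words, text) {
  hits <- lapply(words, grepl, x = text, fixed = TRUE)
  Reduce(`&`, hits)
}

# Rules are ordered from most specific to most general; the first match wins.
rules <- list(
  list(label = "shot down and crashed",   words = c("shot", "crash")),
  list(label = "caught fire and crashed", words = c("fire", "crash")),
  list(label = "shot down",               words = "shot"),
  list(label = "crashed",                 words = "crash"),
  list(label = "exploded",                words = "explode")
)

classify <- function(text, rules, default = "others") {
  out <- rep(default, length(text))
  unlabelled <- rep(TRUE, length(text))
  for (rule in rules) {
    # Only label elements that no earlier rule has claimed.
    hit <- unlabelled & has_all(rule$words, text)
    out[hit] <- rule$label
    unlabelled <- unlabelled & !hit
  }
  out
}

# A few summaries taken from the question.
summaries <- c(
  "shot down by british aircraft crashing in flames.",
  "shot down in flames by the british 39th home defence squadron.",
  "crashed in a storm.",
  "caught fire and crashed.",
  "the first fatal airplane accident in canada occurred when american barnstormer, john m. bryant, california aviator was killed."
)
labels <- classify(summaries, rules)
print(labels)
```

Adding a category then means adding one entry to `rules` at the right position rather than re-nesting another `ifelse`.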

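    • You are right, it is a hierarchy problem, but I tried the hierarchy you used in my own code and it still does not work properly. I guess I have to use apply and sapply as in your code. I get "Error in apply(sapply(c("struck", "bird", "crash"), grepl, as.character(ac$summary)), : dim(X) must have a positive length"... Could you help? I am not familiar with the apply family.
    • One thing to note: you may get a warning message, which basically means that the number of words you are testing may be longer than some of the strings; it is a warning, not an error.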
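One possible cause of the `dim(X) must have a positive length` error, offered as an assumption because the asker's data is not shown: `apply()` needs a matrix, and `sapply()` does not return one when its input has length zero (or one). Column names in R are case-sensitive, and the question's code uses `ac$Summary`, so `ac$summary` may simply be `NULL`. The data frame below is made up to reproduce that situation.

```r
# Made-up data frame whose column is named `Summary`, not `summary`.
ac <- data.frame(Summary = c("crashed in a storm.", "caught fire and crashed."),
                 stringsAsFactors = FALSE)

# There is no column called `summary`, so this is NULL.
missing_col <- is.null(ac$summary)
print(missing_col)

# With zero-length input, sapply() cannot build a matrix and apply() fails.
msg <- tryCatch(
  {
    apply(sapply(c("fire", "crash"), grepl, as.character(ac$summary)), 1, all)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With the real column name, sapply() returns a 2 x 2 logical matrix
# and apply() can test each row.
both <- apply(sapply(c("fire", "crash"), grepl, as.character(ac$Summary)), 1, all)
print(both)
```

If the column name is correct and the error persists, checking `length()` of the input is a reasonable next step, since a single-row input also makes `sapply()` return a plain vector.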