Function in R (with dplyr): answers

【Question title】: function in R (with dplyr)
【Posted】: 2015-01-05 07:20:31
【Question】:

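I have written an R script that works for me, but I know I could make it better (and more elegant) by using functions. Unfortunately, my various attempts have not worked. Could anyone point me in the right direction? Below is my original script.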

library(dplyr)

apples <- read.csv("JoburgApples.csv")

grs <- apples %>% filter(grepl("GRANNY", ProductName), tvaluesold >10000) %>% mutate(Variety = "Granny Smith")
cpp <- apples %>% filter(grepl("PINK", ProductName), tvaluesold >10000) %>% mutate(Variety = "Cripps Pink")
top <- apples %>% filter(grepl("TOP", ProductName), tvaluesold >10000) %>% mutate(Variety = "Top Red")
gld <- apples %>% filter(grepl("GOLDEN", ProductName), tvaluesold >10000) %>% mutate(Variety = "Golden Delicious")
ski <- apples %>% filter(grepl("STARKING", ProductName), tvaluesold >10000) %>% mutate(Variety = "Starking")
bra <- apples %>% filter(grepl("BRAEBURN", ProductName), tvaluesold >10000) %>% mutate(Variety = "Braeburn")

apples <- rbind(grs, cpp, top, gld, ski, bra)

s70 <- apples %>% filter(grepl("70$", ProductName)) %>% mutate(Count = 70)
s80 <- apples %>% filter(grepl("80$", ProductName)) %>% mutate(Count = 80)
s90 <- apples %>% filter(grepl("90$", ProductName)) %>% mutate(Count = 90)
s100 <- apples %>% filter(grepl("100$", ProductName)) %>% mutate(Count = 100)
s110 <- apples %>% filter(grepl("110$", ProductName)) %>% mutate(Count = 110)
s120 <- apples %>% filter(grepl("120$", ProductName)) %>% mutate(Count = 120)
s135 <- apples %>% filter(grepl("135$", ProductName)) %>% mutate(Count = 135)
s150 <- apples %>% filter(grepl("150$", ProductName)) %>% mutate(Count = 150)
s165 <- apples %>% filter(grepl("165$", ProductName)) %>% mutate(Count = 165)

apples <- rbind(s70, s80, s90, s100, s110, s120, s135, s150, s165)

Edit: link to the .csv file (https://github.com/fderyckel/showcases/blob/master/JoburgMarket/JoburgApples.csv)

> UnitMass        ProductName                tvaluesold  tquantitysold  tkgsold  avgprice  highestprice  date
> 18.50KG CARTON  CRIPPS PINK,CL 1,100              200              1     18.5       200           200  06/11/14
> 18.50KG CARTON  CRIPPS RED,CL 1,70                200              1     18.5       200           200  06/11/14
> 18.50KG CARTON  TOPRED,CL 1,180                  1300             10      185       130           130  06/11/14
> 18.50KG CARTON  GOLDEN DELICIOUS,CL 1,90        22700            108     1998    210.19           240  06/11/14
> 18.50KG CARTON  STARKING,CL 1,80                17920            115   2127.5    155.83           230  06/11/14
> 18.50KG CARTON  GRANNY SMITH,CL 1,135            1800             12      222       150           150  06/11/14
> 18.50KG CARTON  TOPRED,CL 1,90                   1730             12      222    144.17           190  06/11/14
> 18.50KG CARTON  CRIPPS PINK,CL 1,90              2600             13    240.5       200           200  06/11/14
> 18.50KG CARTON  GOLDEN DELICIOUS,CL 1,120       22800            136     2516    167.65           180  06/11/14
> 18.50KG CARTON  GOLDEN DELICIOUS,CL 1,135       21810            136     2516    160.37           180  06/11/14
> 18.50KG CARTON  GRANNY SMITH,CL 1,70             2380             14      259       170           220  06/11/14
> 18.50KG CARTON  GRANNY SMITH,CL 1,165            1200             15    277.5        80            80  06/11/14

Thanks in advance for your help.

François

【Comments on the question】:

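You can put all of those grepl calls into a single call. Read the ?regex help page and look at the | operator. These are all vectorized operations, but it is hard to help without any data.
Thanks @RichardScriven. I have updated the question with a link to the .csv and the first few lines of the file.
Thanks @docendodiscimus. I have updated the question with the link to the file and its first few lines.
@akrun Thank you very much.
@Franky You should have specified that you need the first and last parts of ProductName. I wasted time following your code. Look at "Top Red" in your code: in the original dataset it is TOPRED. So my solution gives the result you showed.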
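
As a minimal sketch of the first comment's suggestion, several grepl calls can be collapsed into one by joining the keywords with the | operator. The vector product and the three-keyword pattern below are illustrative assumptions modelled on the sample rows, not the asker's full data.

```r
# Toy product names, shaped like the sample rows in the question
product <- c("GRANNY SMITH,CL 1,70", "CRIPPS PINK,CL 1,90",
             "CRIPPS RED,CL 1,70", "TOPRED,CL 1,180")

# One pattern with alternation stands in for three separate grepl() calls
keep <- grepl("GRANNY|PINK|TOP", product)

product[keep]  # the rows whose name contains any of the three keywords
```

Note that a single combined pattern only says whether a row matches; it does not say which keyword matched, so the Variety label still has to be derived separately.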

Tags: r function dplyr

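【Solution 1】:

Assuming you want to replace the "prefix" part of ProductName with custom names, you can use mgsub from qdap. Build a logical index indx from tvaluesold, create the Variety column from the part of ProductName before the first comma, and then, for the rows where indx is TRUE, replace the raw names in Variety with the custom ones. If you want a new dataset, subset the rows with indx.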

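library(qdap)

# Rows with a sales value above 10000
indx <- apples$tvaluesold > 10000

# Raw prefixes as they appear in ProductName, and the names to show instead
v1 <- c('GRANNY SMITH', 'CRIPPS PINK', 'TOPRED',
        'GOLDEN DELICIOUS', 'STARKING', 'BRAEBURN')
v2 <- c('Granny Smith', 'Cripps Pink', 'Top Red',
        'Golden Delicious', 'Starking', 'Braeburn')

# Variety starts out as everything before the first comma
apples$Variety <- sub(',.*', '', apples$ProductName)

# Swap in the custom names on the selected rows only, then subset
apples[indx, 'Variety'] <- mgsub(v1, v2, apples[indx, 'Variety'])
apples1 <- apples[indx, ]

head(apples1, 3)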
 #       UnitMass               ProductName tvaluesold tquantitysold tkgsold
 #4 18.50KG CARTON  GOLDEN DELICIOUS,CL 1,90      22700           108  1998.0
 #5 18.50KG CARTON          STARKING,CL 1,80      17920           115  2127.5
 #9 18.50KG CARTON GOLDEN DELICIOUS,CL 1,120      22800           136  2516.0
 #  avgprice highestprice       date          Variety
 #4   210.19          240 2014-11-06 Golden Delicious
 #5   155.83          230 2014-11-06         Starking
 #9   167.65          180 2014-11-06 Golden Delicious

Or, using only base R:

 apples$Variety <-  unname(setNames(v2,v1)[sub(',.*', '', apples$ProductName)])
 apples1 <- apples[indx,]
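
The base R line above works as a named-vector lookup: setNames builds a vector whose names are the raw prefixes and whose values are the display names, and indexing it by the extracted prefix returns the matching display name. A small sketch on made-up values (the two-entry lookup and the product vector are assumptions for illustration):

```r
# A two-entry lookup: names are raw prefixes, values are display names
lookup <- setNames(c("Granny Smith", "Top Red"),
                   c("GRANNY SMITH", "TOPRED"))

product <- c("TOPRED,CL 1,90", "GRANNY SMITH,CL 1,70", "CRIPPS RED,CL 1,70")

# Everything before the first comma is the prefix to look up
prefix <- sub(",.*", "", product)

# Indexing by name returns the display name, or NA when the prefix is unknown
variety <- unname(lookup[prefix])
variety
```

One difference from the mgsub version is worth keeping in mind: a prefix that is not in the lookup (here CRIPPS RED) comes back as NA rather than being left unchanged.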

For the second case, you can use sub to extract the number after the last comma, then use %in% to build a logical index indx2.

 val1 <-  as.numeric(sub(".*,", "", apples$ProductName))
 indx2 <-  val1 %in% c(70,80,90,100,110,120,135,150,165)

 apples$Count <- NA
 apples[indx2,'Count'] <-  val1[indx2]
 apples2 <- apples[!is.na(apples$Count),]
 head(apples2,3)
 #       UnitMass              ProductName tvaluesold tquantitysold tkgsold
 #1 18.50KG CARTON     CRIPPS PINK,CL 1,100        200             1    18.5
 #2 18.50KG CARTON       CRIPPS RED,CL 1,70        200             1    18.5
 #4 18.50KG CARTON GOLDEN DELICIOUS,CL 1,90      22700           108  1998.0
 #  avgprice highestprice       date          Variety Count
 #1   200.00          200 2014-11-06      CRIPPS PINK   100
 #2   200.00          200 2014-11-06       CRIPPS RED    70
 #4   210.19          240 2014-11-06 Golden Delicious    90

Update

You can also create the columns with dplyr:

library(dplyr)
apples %>%
       filter(tvaluesold >10000) %>% 
       mutate(Variety= setNames(v2,v1)[sub(',.*', '', ProductName)])

To create the Count column:

 apples %>%
        filter(indx2) %>%
        mutate(Count=val1[indx2])

Update 2

If you want to extract the "first" and "last" parts of ProductName, another option is:

 library(tidyr)
 # Capture the text before the first comma and after the last comma
 res1 <- extract(apples, ProductName, c("Variety", "Count"),
                 '([^,]+),[^,]+,([^,]+)') %>%
   filter(tvaluesold > 10000L & !is.na(as.numeric(Count)))
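
To see what the two capture groups in that pattern pick out, here is a small base R sketch on strings shaped like the sample rows. It uses sub with backreferences rather than tidyr, so it only illustrates the regular expression and is not a replacement for extract; the vector product is an assumption for illustration.

```r
product <- c("GOLDEN DELICIOUS,CL 1,90", "STARKING,CL 1,80")
pattern <- "([^,]+),[^,]+,([^,]+)"

# Group 1: everything before the first comma
first_part <- sub(pattern, "\\1", product)

# Group 2: everything after the last comma
last_part <- sub(pattern, "\\2", product)

data.frame(Variety = first_part, Count = last_part)
```

Both captured pieces are character strings, which is why the answer converts Count with as.numeric before testing for NA.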

Data

 url <- 'https://raw.githubusercontent.com/fderyckel/showcases/master/JoburgMarket/JoburgApples.csv'
 library(RCurl)

 x <- getURL(url)
 apples <- read.csv(textConnection(x), stringsAsFactors=FALSE)

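【Comments】:

Thanks a lot @akrun. I knew I had to try using paste0 but couldn't get the syntax right. Let me try this.
@Franky I have a question about the data you showed. Do you want to extract the last number, after the final comma, for the Count column?
I really like the dplyr version. Thanks a lot @akrun. I have learned a great deal from your various versions.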

【Solution 2】:

Maybe all you need is this:

apples %>%
  filter(tvaluesold > 10000L & grepl(".*\\d+$", ProductName)) %>%
  mutate(Variety = sub(",.*", "", ProductName),
         Count = as.numeric(sub(".*,", "", ProductName)))
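
Since the question asked for a function, one possible way to wrap this pipeline is sketched below. The name tidy_apples, the min_value argument, and the toy data frame are assumptions for illustration and do not come from the original answers.

```r
library(dplyr)

# Hypothetical helper: keep rows above a sales threshold whose ProductName
# ends in digits, then split out Variety (prefix) and Count (numeric suffix).
tidy_apples <- function(df, min_value = 10000) {
  df %>%
    filter(tvaluesold > min_value, grepl("\\d+$", ProductName)) %>%
    mutate(Variety = sub(",.*", "", ProductName),
           Count   = as.numeric(sub(".*,", "", ProductName)))
}

# Toy data built from three of the sample rows in the question
toy <- data.frame(
  ProductName = c("GOLDEN DELICIOUS,CL 1,90", "TOPRED,CL 1,180", "STARKING,CL 1,80"),
  tvaluesold  = c(22700, 1300, 17920),
  stringsAsFactors = FALSE
)

res <- tidy_apples(toy)
res
```

As in the answer above, Variety keeps the raw upper-case prefix; mapping it to display names such as "Top Red" would need a lookup like the one in Solution 1.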

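【Comments】:

Wow, that's it! It is indeed simpler and more elegant. That said, I don't understand the whole "& grepl(".*\\d+$", ProductName)" part. And what does the "L" after 10000 do?
Because it is inside filter, we use it to subset rows. The first part, tvaluesold > 10000L, means the tvaluesold value must be greater than 10000 (I used 10000L, where the L makes the number an integer, but you can just use 10000 instead). The second part, grepl(".*\\d+$", ProductName), is a regular expression that matches the rows whose ProductName ends in digits.
Got it. Thank you very much.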
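
The two points from the explanation above can be checked on a couple of values. This is only a sketch: the second product name, with its "XL" suffix, is made up to show a non-matching case.

```r
# The pattern matches only names that end in one or more digits
product <- c("GRANNY SMITH,CL 1,70", "GRANNY SMITH,CL 1,XL")
ends_in_digits <- grepl(".*\\d+$", product)
ends_in_digits

# The L suffix makes an integer literal; without it the literal is a double
class(10000L)  # "integer"
class(10000)   # "numeric"

# For this comparison the two literals behave the same
same_result <- identical(17920 > 10000L, 17920 > 10000)
same_result
```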