【Question title】: Data prep for association rules in R - data frame to transaction
【Posted】: 2020-02-26 07:02:06
【Question】:

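My data comes from a SQL database and is in tabular form, with multiple rows for a single transaction. I want to use all the other columns in the data frame as items, not just the "product" field.

My data looks like this: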

transID <- c('1','1','2','3')
state <- c('TX','TX','CA','MA')
product <- c('Oranges','Banana','Fish','Cheese')
Month <- c('January','January','Febuary','March')
Place <- c('A','A','B','C')

transactions <- data.frame(transID,state,product,Month,Place)

transactions
  transID state product   Month Place
1       1    TX Oranges January     A
2       1    TX  Banana January     A
3       2    CA    Fish Febuary     B
4       3    MA  Cheese   March     C

Ideally, my data would look like this:

1 (TX,Oranges,Banana,January,A)
2 (CA,Fish,Febuary,B)
3 (MA, Cheese, March,C)

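What is the best way to convert this kind of data into transaction format?

I tried the following, but all I get is rows 1 and 2 combined into a single transaction: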

library(plyr)  # provides ddply()

transactionData <- ddply(transactions,c("transID"),
                         function(df1) paste(df1$state,
                                             df1$product,
                                             df1$Month,
                                             df1$Place,
                                             collapse = ","))

【Comments】:

  • Your question is unclear. You should provide the expected output using R.

Tags: r data-mining apriori


【Solution 1】:

This is a bit awkward because data.frames store factors.

library("arules")

# make all columns into items
df <- data.frame(
  id = transactions$transID, 
  items = factor(c(as.character(transactions$state),
    as.character(transactions$product), 
    as.character(transactions$Month), 
    as.character(transactions$Place))))

# remove duplicated state, month and place entries
df <- df[!duplicated(df),]

# this is from the manual page '? transactions'
trans <- as(split(df[,"items"], df[,"id"]), "transactions")    
inspect(trans)


    items                         transactionID
[1] {A,Banana,January,Oranges,TX} 1            
[2] {B,CA,Febuary,Fish}           2            
[3] {C,Cheese,MA,March}           3    
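
One thing the item labels above do not show is which column a value came from, so "A" could be a place or anything else. A minimal sketch of one possible variation, assuming you want items of the form column=value: the names `item_cols`, `long` and `item_list` are made up for illustration, and the arules step is the same `as(..., "transactions")` conversion used above, run only if the package is installed.

```r
# Toy data from the question (plain strings instead of factors)
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Every column except the transaction id becomes a source of items
item_cols <- setdiff(names(transactions), "transID")

# Long format: one (id, "column=value") pair per cell
long <- do.call(rbind, lapply(item_cols, function(col) {
  data.frame(id = transactions$transID,
             item = paste(col, transactions[[col]], sep = "="),
             stringsAsFactors = FALSE)
}))

# Drop repeated state/Month/Place entries within a transaction
long <- unique(long)

# One character vector of items per transaction id
item_list <- split(long$item, long$id)

# Optional: convert to arules transactions when the package is available
if (requireNamespace("arules", quietly = TRUE)) {
  library("arules")
  trans <- as(item_list, "transactions")
  inspect(trans)
}
```

Labelled items also make any rules mined later easier to read, since a rule side such as {state=TX} says which field it refers to.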

I hope this helps.

【Comments】:

    【Solution 2】:

    Here is a base R solution:

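    # note: passing a data frame as X to tapply() is only supported in
    # newer R versions (4.3.0 and later, as far as I know)
    stack(tapply(transactions[, -1], 
           transactions[, 1, drop = FALSE],
           FUN = function(DF) {
             # use.names belongs to unlist(), not unique()
             paste(unique(unlist(DF, use.names = FALSE)), collapse = ',')
           }))[, 2:1]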
    
    #  ind                      values
    #1   1 TX,Oranges,Banana,January,A
    #2   2           CA,Fish,Febuary,B
    #3   3           MA,Cheese,March,C
    

    The main part is the tapply() call, which splits by transID, then unlists the rest of the data.frame and keeps only the unique values. This is the output of the tapply() call.

                                1                             2                             3 
    "TX,Oranges,Banana,January,A"           "CA,Fish,Febuary,B"           "MA,Cheese,March,C" 
    

    stack()[, 2:1] is there purely to turn the result into a tidy data.frame with the columns in a sensible order.
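
The same split, unlist and unique steps can also be sketched with split() and vapply(), which may be handy if tapply() does not accept a data frame in your R version. This is only an illustration under that assumption; the names `pieces`, `items` and `result` are made up here and are not part of the answer above.

```r
# Toy data from the question (plain strings instead of factors)
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Split the non-id columns into one small data frame per transID
pieces <- split(transactions[, -1], transactions$transID)

# Flatten each piece column by column, keep unique values, join with commas
items <- vapply(pieces, function(DF) {
  paste(unique(unlist(DF, use.names = FALSE)), collapse = ",")
}, character(1))

# Tidy two-column result: transaction id, then its comma-separated items
result <- data.frame(transID = names(items),
                     items = unname(items),
                     stringsAsFactors = FALSE)
print(result)
```

If the goal is an arules transactions object rather than a printed string, the list from split() is usually the more useful intermediate, since pasted strings would have to be split apart again.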

    【Comments】:

      【Solution 3】:

      How about reshaping it like this?

      reshape(transactions,v.names = "product",timevar = "product",idvar = "state", direction = "wide")
      
        transID state   Month Place product.Oranges product.Banana product.Fish product.Cheese
      1       1    TX January     A         Oranges         Banana         <NA>           <NA>
      3       2    CA Febuary     B            <NA>           <NA>         Fish           <NA>
      4       3    MA   March     C            <NA>           <NA>         <NA>         Cheese
      

      【Comments】:
