【Question title】: rxDataStep using lagged values
【Posted】: 2015-04-16 13:21:40
【Question】:

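In SAS, you can step through a dataset and use lagged values.

The way I would do this is with a function that performs a "lag", but that would probably produce a wrong value at the start of each chunk. For example, if a chunk starts at row 200,000, the function will assume NA for a lagged value that should instead come from row 199,999.

Is there a way around this?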
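The chunk-boundary problem described above can be reproduced in base R without RevoScaleR. This is only a sketch: the data, the chunk size of 4, and the helper name `naiveLag` are made up for illustration, and `split()` stands in for the way rxDataStep feeds a transform one chunk at a time.

```r
# Naive lag: prepend NA and drop the last element of the vector.
naiveLag <- function(x) c(NA, x[-length(x)])

# Hypothetical data, split into chunks of 4 rows to mimic chunked processing.
x <- 1:10
chunks <- split(x, ceiling(seq_along(x) / 4))

# Lagging each chunk independently loses the value that should carry over.
chunkedLag <- unlist(lapply(chunks, naiveLag), use.names = FALSE)

# Lagging the whole vector at once gives the intended result.
wholeLag <- naiveLag(x)

# Rows where the chunked result is wrongly NA (the first row of later chunks).
wrongRows <- which(is.na(chunkedLag) & !is.na(wholeLag))
print(wrongRows)
```

Only the first row of the whole dataset should be NA; any other NA marks a chunk boundary where the previous value was lost.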

【Comments】:

    Tags: revolution-r


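    【Solution 1】:

    Here is another way to lag: self-merge on a shifted date. This makes the code much simpler, and it can lag several variables at once. The downsides are that it takes 2-3 times longer to run than my answer using transformFunc, and that it needs a second copy of the dataset.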

    # Get a sample dataset
    sourcePath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")
    
    # Set up paths for two copies of it
    xdfPath <- tempfile(fileext = ".xdf")
    xdfPathShifted <- tempfile(fileext = ".xdf")
    
    
    # Convert "Date" to be Date-classed
    rxDataStep(inData = sourcePath,
               outFile = xdfPath,
               transforms = list(Date = as.Date(Date)),
               overwrite = TRUE
    )
    
    
    # Then make the second copy, but shift all the dates up 
    # one (or however much you want to lag)
    # Use varsToKeep to subset to just the date and 
    # the variables you want to lag
    rxDataStep(inData = xdfPath,
               outFile = xdfPathShifted,
               varsToKeep = c("Date", "Open", "Close"),
               transforms = list(Date = as.Date(Date) + 1),
               overwrite = TRUE
    )
    
    # Create an output XDF (or just overwrite xdfPath)
    xdfLagged2 <- tempfile(fileext = ".xdf")
    
    # Use that incremented date to merge variables back on.
    # duplicateVarExt will automatically tag variables from the 
    # second dataset as "Lagged".
    # Note that there's no need to sort manually in this one - 
    # rxMerge does it automatically.
    rxMerge(inData1 = xdfPath,
            inData2 = xdfPathShifted,
            outFile = xdfLagged2,
            matchVars = "Date",
            type = "left",
            duplicateVarExt = c("", "Lagged")
    )
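For readers without RevoScaleR at hand, the same shift-then-merge idea can be tried on an in-memory data frame. This is a rough base R analogue, not the rxMerge call itself: the `prices` data are invented, and `merge()` with `all.x = TRUE` and `suffixes` plays the role that `type = "left"` and `duplicateVarExt` play above.

```r
# Hypothetical daily data with consecutive dates (no gaps).
prices <- data.frame(
  Date  = as.Date("2015-01-05") + 0:4,
  Open  = c(100, 101, 102, 103, 104),
  Close = c(101, 102, 103, 104, 105)
)

# Second copy with every date shifted forward by one day.
shifted <- prices
shifted$Date <- shifted$Date + 1

# Left join on Date; columns from the shifted copy get the "Lagged" suffix.
lagged <- merge(prices, shifted, by = "Date", all.x = TRUE,
                suffixes = c("", "Lagged"))
print(lagged)
```

The first date has no earlier row to match, so its lagged columns come back as NA, which is the behaviour you want from a lag.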
    

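    【Comments】:

    • This breaks if your dataset has gaps - for example, Date + 1 on a Friday gives you a Saturday, not a Monday. You could work around that with something like if(format(Date, "%A") %in% "Friday") { Date + 3 } else { Date + 1}, but then you would have to add logic like that to account for every holiday and so on. :(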
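Since `if` evaluates a single condition, a column-wise version of that Friday workaround would need something vectorized. Below is one possible sketch in base R; the helper name `nextWeekday` is hypothetical, it assumes the data contain weekdays only, and it does nothing about holidays. It uses `as.POSIXlt()$wday` rather than `format(Date, "%A")` so that the check does not depend on the locale's day names.

```r
# Shift each date to the next weekday: Friday -> Monday (+3), otherwise +1.
# Assumes the data contain weekdays only; holidays are NOT handled.
nextWeekday <- function(d) {
  wday <- as.POSIXlt(d)$wday          # 0 = Sunday, ..., 5 = Friday
  d + ifelse(wday == 5, 3, 1)
}

dates <- as.Date(c("2015-04-16", "2015-04-17"))  # a Thursday and a Friday
print(nextWeekday(dates))
```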
    【Solution 2】:

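    You are exactly right about the chunking problem. The workaround is to use .rxGet and .rxSet to pass values between chunks. Here is the function: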

    lagVar <- function(dataList) { 
    
        # .rxStartRow returns the overall row number of the first row in this
        # chunk. So - the first row of the first chunk is equal to one.
        # If this is the very first row, there's no previous value to use - so
        # it's just an NA.
        if(.rxStartRow == 1) {
    
            # Put the NA out front, then shift all the other values down one row.
            # newName is the desired name of the lagged variable, set using
            # transformObjects - see below
            dataList[[newName]] <- c(NA, dataList[[varToLag]][-.rxNumRows])
    
        } else {
    
            # If this isn't the very first chunk, we have to fetch the previous
            # value from the previous chunk using .rxGet, then shift all other
            # values down one row, just as before.
            dataList[[newName]] <- c(.rxGet("lastValue"),
                                     dataList[[varToLag]][-.rxNumRows])
    
        }
    
        # Finally, once this chunk is done processing, set its lastValue so that
        # the next chunk can use it.
        .rxSet("lastValue", dataList[[varToLag]][.rxNumRows])
    
        # Return dataList with the new variable
        dataList
    
    }
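The carry-over logic in lagVar can be checked in plain R by simulating the chunks by hand. This is a sketch under stated assumptions, not how RevoScaleR works internally: an ordinary environment named `state` stands in for .rxGet/.rxSet, the `startRow` argument stands in for .rxStartRow, and the data and chunk size are invented.

```r
# Stand-in for the state that .rxGet/.rxSet keep between chunks (an assumption
# made for this sketch; RevoScaleR manages this itself).
state <- new.env()

lagChunk <- function(x, startRow) {
  # First value: NA for the very first row, otherwise the carried-over value.
  first <- if (startRow == 1) NA else get("lastValue", envir = state)
  n <- length(x)
  lagged <- c(first, x[-n])
  # Save this chunk's last value for the next chunk.
  assign("lastValue", x[n], envir = state)
  lagged
}

# Hypothetical data processed in chunks of 4 rows, in order.
x <- c(5, 7, 9, 11, 13, 15, 17, 19, 21, 23)
chunkSize <- 4
starts <- seq(1, length(x), by = chunkSize)

result <- unlist(lapply(starts, function(s) {
  rows <- s:min(s + chunkSize - 1, length(x))
  lagChunk(x[rows], startRow = s)
}))

# Compare with the lag computed over the whole vector at once.
print(identical(result, c(NA, x[-length(x)])))
```

Like the real function, this only gives the right answer when the chunks are processed sequentially and in sorted order.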
    

    And here is how to use it in rxDataStep:

    # Get a sample dataset
    xdfPath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")
    
    # Set a path to a temporary file
    xdfLagged <- tempfile(fileext = ".xdf")
    
    # Sort the dataset chronologically - otherwise, the lagging will be random.
    rxSort(inData = xdfPath,
           outFile = xdfLagged,
           sortByVars = "Date")
    
    # Finally, put the lagging function to use:
    rxDataStep(inData = xdfLagged, 
               outFile = xdfLagged,
               transformObjects = list(
                   varToLag = "Open", 
                   newName = "previousOpen"), 
               transformFunc = lagVar,
               append = "cols",
               overwrite = TRUE)
    
    # Check the results
    rxDataStep(xdfLagged, 
               varsToKeep = c("Date", "Open", "previousOpen"),
               numRows = 10)
    

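    【Comments】:

    • That's awesome. One finer point... it seems unnecessary to write the extra new variable when one can pass .rxSet "just in time"? Also, what does the notation -.rxNumRows do? Does it mean .rxNumRows-1?
    • @APK Sorry, which "extra new" variable do you mean? Also - good question about .rxNumRows. It is a special variable (see ?rxTransforms for the special variables) that returns the number of rows in the current chunk, and I then negate it (-.rxNumRows) to drop the last element from the variable. A simple example in base R would be something like x <- 1:10; x[-length(x)], which drops the last element of x without you needing to know the length of x in advance. Hope that makes sense! I feel like I'm explaining it badly...
    • Basically, dataList[[varToLag]][.rxNumRows] returns the last value in the variable, and dataList[[varToLag]][-.rxNumRows] returns all values except the last.
    • I think I see what's going on, thanks! This is really cool. Unfortunately, if I may say so, the documentation is thin.
    • @APK You may! It is. :(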
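To make the indexing question from the comments concrete, here is a small base R sketch. The vector is made up, and `n` simply plays the role that .rxNumRows plays inside a chunk. It shows that a negative index removes an element, which is different from indexing with `n - 1`.

```r
x <- c(10, 20, 30, 40)
n <- length(x)   # plays the role of .rxNumRows for a single chunk

x[n]             # last element only
x[-n]            # everything except the last element
x[n - 1]         # a single element: the second-to-last one

# Building a one-step lag from those pieces.
lagged <- c(NA, x[-n])
print(lagged)
```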