XTS data taking up too much space in memory? Answers

[Question title]: XTS data taking up too much space in memory?
[Posted]: 2020-09-09 18:57:50
[Question]:

I use getSymbols from quantmod as follows:

temp0 <- getSymbols("AAPL",src = 'yahoo',from=Sys.Date()-100000,to = Sys.Date(),auto.assign=FALSE);

This gives me a large xts object that takes up very little space.

temp0   Large xts (59664 elements, 546.2 Kb)

Sometimes, however, an xts object takes up far too much space. Here are some of my objects.

temp1   Large xts (34848 elements, 25.5 Mb)
t12     Large xts (36 elements, 25.2 Mb)

t12 is the result of head(temp1). Here is t12:

structure(t12)
                    Bid.Price Bid.Size Ask.Price Ask.Size Trade.Price Volume
2019-05-29 17:00:01  116.4922       51  116.5000      143    116.4922    208
2019-05-29 17:00:02  116.4922       71  116.5000      142    116.5000      2
2019-05-29 17:00:04  116.4844      427  116.4922       92    116.4844     72
2019-05-29 17:00:08  116.4922       83  116.5000      156    116.4922     21
2019-05-29 17:01:01  116.4922       71  116.5000      128    116.4922     34
2019-05-29 17:01:08  116.5000       13  116.5078      228    116.4922    192

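When I run attributes(t12), I find that $na.action and its attr(,"index") each contain a great many values, over 2 million.

temp1 came from a very large data set from which I filtered out most of the data, but the object still seems to keep the removed rows in $na.action and attr(,"index"), if not elsewhere as well.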
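One way to confirm which attribute holds the memory is to measure each attribute separately. The following is only a sketch on synthetic data, since the real temp1 is not available; the names x and small and the sizes used are assumptions made for illustration.

```r
library(xts)

# Synthetic stand-in for the poster's data: 100,000 observations,
# 99,000 of which are set to NA and then dropped with na.omit().
set.seed(1)
n <- 100000
x <- .xts(rnorm(n), 1:n)
is.na(x) <- sample(n, 99000)

# Six rows only, but the subset may still carry the na.action
# attribute created by na.omit(), as described in the question.
small <- head(na.omit(x))

# Size in bytes of each attribute, largest first
attr_sizes <- sapply(attributes(small), object.size)
print(sort(attr_sizes, decreasing = TRUE))

# Number of removed observations recorded in na.action (0 if absent)
length(attr(small, "na.action"))
```

If na.action dominates the listing, the extra memory comes from the record of removed rows rather than from the six rows of data.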

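I don't know why this happens, but how do I clean it up? How do I get my 6-row t12 down to a sensible minimal size?

In case it helps, here is the full attributes output with max.print=10: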

> attributes(t12)
$class
[1] "xts" "zoo"

$.indexCLASS
[1] "POSIXct" "POSIXt" 

$tclass
[1] "POSIXct" "POSIXt" 

$na.action
 [1]  1  3  6  7  8 12 13 15 16 17
 [ reached getOption("max.print") -- omitted 2201657 entries ]
attr(,"class")
[1] "omit"
attr(,"index")
 [1] 1519772400 1519772402 1519772407 1519772409 1519772410 1519772420 1519772424 1519772428 1519772429 1519772430
 [ reached getOption("max.print") -- omitted 2201657 entries ]

$index
[1] 1559167201 1559167202 1559167204 1559167208 1559167261 1559167268
attr(,"tzone")
[1] "America/Chicago"
attr(,"tclass")
[1] "POSIXct" "POSIXt" 

$dim
[1] 6 6

$dimnames
$dimnames[[1]]
NULL

$dimnames[[2]]
[1] "Bid.Price"   "Bid.Size"    "Ask.Price"   "Ask.Size"    "Trade.Price" "Volume"

How do I remove the extra information from the xts object?

Update

Working from the code above, I seem to have found a way around the problem.

t12 <- rbind(t12[1,],t12)
t12[1,1] <- NA
t12 <- na.omit(t12)

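I add a copy of the first row to the top of the data and set its first entry to NA. When I then run na.omit on the data set, the remaining data is identical to the original t12, but without the extra unwanted data.

The catch is that I had already used na.omit() to create temp1, so I don't understand why na.omit() sometimes fails to clean up the data properly. Could it be related to very large data sets?

[Comments]:

Tags: r xts zoo quantmod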

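[Answer 1]:

The na.omit() function adds the na.action attribute. This attribute contains the positions and index values of every observation that was removed. It is added for consistency with the other na.omit() methods.

Your workaround makes the object smaller because it overwrites the na.action attribute with values that represent only the single missing value you added to the beginning of the series.

Setting the na.action attribute to NULL is a cleaner solution.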

R> x <- .xts(1:1000, 1:1000); set.seed(21); is.na(x) <- sample(1000, 100)
R> y <- na.omit(x)
R> object.size(x)
9176 bytes
R> object.size(y)
9720 bytes
R> attr(y, "na.action") <- NULL
R> object.size(y)
8376 bytes
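The same idea can be wrapped in a small helper so the attribute is dropped every time. This is a sketch only: na_omit_lean is a hypothetical name and is not part of xts or zoo.

```r
library(xts)

# Hypothetical helper (not provided by xts or zoo):
# run na.omit(), then discard the record of removed rows.
na_omit_lean <- function(x, ...) {
  out <- na.omit(x, ...)
  attr(out, "na.action") <- NULL
  out
}

x <- .xts(1:1000, 1:1000)
set.seed(21)
is.na(x) <- sample(1000, 100)

y_full <- na.omit(x)       # keeps na.action
y_lean <- na_omit_lean(x)  # na.action removed

# Compare the memory footprint of the two results
object.size(y_full)
object.size(y_lean)
```

The trade-off is that the result no longer records which observations were dropped, so any code that later inspects na.action will find nothing there.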

[Comments]:

Thanks. Is there an alternative to na.omit() that skips na.action entirely?
Not at the moment. I will suggest something like na.drop() to the zoo+xts team.
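Until such a function exists, one possible substitute is to subset the rows directly with complete.cases() from base R's stats package, which never creates an na.action attribute. This is a sketch with made-up data, not something proposed in the thread.

```r
library(xts)

# Small made-up series with NAs in two different rows
x <- .xts(cbind(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40)), 1:4)

# TRUE for rows that have no NA in any column
keep <- complete.cases(coredata(x))

# Plain row subsetting: no record of the removed rows is stored
y <- x[keep, ]

nrow(y)               # 2 rows remain
attr(y, "na.action")  # NULL
```

Note that this only avoids creating a new na.action. If the input already carries one from an earlier na.omit() call, it still has to be set to NULL explicitly.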

[Answer 2]:

Working from the code above, I seem to have found a way around the problem.

t12 <- rbind(t12[1,],t12)
t12[1,1] <- NA
t12 <- na.omit(t12)

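I add a copy of the first row to the top of the data and set its first entry to NA. When I then run na.omit on the data set, the remaining data is identical to the original t12, but without the extra unwanted data.

[Comments]: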