R: Reading first n rows from parquet file? Answers

【Question title】: R: Reading first n rows from parquet file?
【Posted】: 2023-02-19 01:31:13
【Question】:

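I realize parquet is a columnar format, but with large files you sometimes don't want to read the whole thing into memory in R before filtering, and the first 1000 or so rows may be enough for testing. I don't see an option for this in the documentation for reading parquet files.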

I have seen a solution for pandas and an option for C#, but it is not obvious to me how either would translate to R. Suggestions?

【Comments】:

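Looking at the docs, arrow seems to offer lazy evaluation. So maybe you can use dplyr::slice_head(n=1000) %>% compute()?
Unfortunately, arrow::read_parquet() does not appear to use lazy evaluation. I compared the time and peak memory use of a) reading the whole file against b) a piped implementation with slice() as you suggested, and both give similar results.
I think if you use arrow::open_dataset(), it will index the parquet dataset and set it up for lazy evaluation. More here: arrow.apache.org/docs/r/articles/dataset.html
@Jon is correct, arrow::open_dataset() appears to allow lazy evaluation. The lazy object is not compatible with slice(), but head() or filter() works. A good result - thank you!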

Tags: r parquet

【Solution 1】:

Thanks to Jon and Dan for pointing me in the right direction.

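arrow::open_dataset() allows lazy evaluation (docs [here][1]). You can then call head() (but not slice()) or filter() on the lazy object. This approach is faster and uses much less peak RAM. Example below.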

# https://stackoverflow.com/questions/73131505/r-reading-first-n-rows-from-parquet-file

library(dplyr)
library(arrow)
library(tictoc) #optional, used to time results

tic("read all of large parquet file")
my_animals <- read_parquet("data/my_animals.parquet")
toc() # slow and uses heaps of ram

tic("read parquet and write mini version")
my_animals <- open_dataset("data/my_animals.parquet") 
my_animals # this is a lazy object

my_animals %>% 
  #slice(1000L) %>% #doesn't work
  head(n=1000L) %>% 
  # filter(YEAROFBIRTH >= 2010) %>% #also works
  compute() %>% 
  write_parquet("data/my_animals_mini.parquet") # optional
toc() # much faster, much less peak ram used


  [1]: https://arrow.apache.org/docs/r/articles/dataset.html
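
If you want the first rows as a data frame in your R session rather than written to a new file, the same lazy approach can end in collect(). The following is a minimal sketch, not part of the original answer: the temporary file, the column names and the variable names are made up for illustration so that the example runs on its own.

```r
library(arrow)
library(dplyr)

# Hypothetical demo file so the sketch is self-contained; use your own path instead.
demo_path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(id = 1:5000, value = rnorm(5000)), demo_path)

# open_dataset() returns a lazy Dataset object; no rows are loaded into R yet.
lazy_ds <- open_dataset(demo_path)

# head() limits the number of rows, collect() pulls them into R as a data frame.
first_rows <- lazy_ds %>%
  head(n = 1000L) %>%
  collect()

nrow(first_rows)
```

Unlike compute(), which keeps the result as an Arrow object, collect() hands back an ordinary data frame (a tibble), which is usually what you want for quick testing.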

【Comments】:

【Solution 2】:

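I published this simple package for practical use: https://github.com/mkparkin/Rinvent. Feel free to check whether it helps. It has a parameter called "sample" that returns sample rows. It can also read "delta" files.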

【Comments】: