Reason behind the speed of fread in the data.table package in R

【Question title】: Reason behind speed of fread in data.table package in R
【Posted】: 2020-08-23 02:51:23
【Question】:

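I am amazed by the speed of the fread function in data.table on large data files, but how does it manage to read so fast? What are the basic implementation differences between fread and read.csv?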

【Comments】:

Tags: r performance data.table fread

【Solution 1】:

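I assume we are comparing to read.csv with all the known advice applied, such as setting colClasses, nrows, etc. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character and then attempts to coerce that to integer or numeric as a second step.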
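As a rough illustration of "all the known advice applied", the sketch below reads the same file with and without colClasses and nrows. The file, its columns, and the variable names are made up for the example; only read.csv and its arguments come from base R.

```r
# Write a small hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
n <- 1000L
demo <- data.frame(
  id    = seq_len(n),
  value = seq_len(n) / 8,
  label = rep(c("a", "b"), length.out = n),
  stringsAsFactors = FALSE
)
write.csv(demo, csv_path, row.names = FALSE)

# Default call: read.csv has to work out each column's type itself
df_default <- read.csv(csv_path, stringsAsFactors = FALSE)

# Tuned call: column types and the row count are supplied up front
df_tuned <- read.csv(
  csv_path,
  colClasses = c("integer", "numeric", "character"),
  nrows = n,
  stringsAsFactors = FALSE
)

# Both calls should end up with the same column classes
print(sapply(df_default, class))
print(sapply(df_tuned, class))
```

On a file this small the timing difference is not measurable; the point is only the shape of the tuned call.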

So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...

They are both written in C, so it's not that.

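There isn't one reason in particular, but essentially, fread memory maps the file into memory and then iterates through the file using pointers, whereas read.csv reads the file into a buffer via a connection.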
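To make the "buffer via a connection" side of that contrast concrete, here is a minimal base R sketch that pulls a file through a connection a few lines at a time. It is not how read.csv is implemented internally, and the file contents and chunk size are assumptions for the example. Base R has no built-in memory mapping, so the fread side is not shown.

```r
# Hypothetical 10-row CSV in a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", sprintf("%d,%d", 1:10, 11:20)), csv_path)

# Open a read connection; each readLines() call continues where the last stopped
con <- file(csv_path, open = "r")
header <- readLines(con, n = 1)

rows_seen <- 0L
repeat {
  chunk <- readLines(con, n = 4)    # copy up to 4 lines into an R character buffer
  if (length(chunk) == 0L) break    # nothing left to read
  rows_seen <- rows_seen + length(chunk)
}
close(con)

print(header)
print(rows_seen)
```

Every chunk is copied out of the file into a buffer before it can be parsed. With a memory map, the parser instead walks the file's bytes in place.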

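If you run fread with verbose=TRUE, it will tell you how it is working and report the time spent in each step. For example, notice that it skips straight to the middle and the end of the file to make much better guesses of the column types (although in this case the first 5 rows were enough).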

> fread("test.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.486 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 6 columns
First row with 6 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 10000001
Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
Type codes (   first 5 rows): 113431
Type codes (+ middle 5 rows): 113431
Type codes (+   last 5 rows): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
  13.420s ( 31%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   3.210s (  7%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   1.310s (  3%) Allocation of 10000000x6 result (xMB) in RAM
  25.580s ( 59%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.040s (  0%) Changing na.strings to NA
  43.560s        Total

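Note: these timings are from my very slow netbook with no SSD. Both the absolute and the relative times of each step will vary widely from machine to machine. For example, if you rerun fread a second time you may notice that the time to mmap is much less, because your OS has cached the file from the previous run.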
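To try this yourself, a minimal sketch along these lines should work, assuming data.table is installed. The test file and variable names are made up. The verbose messages differ between data.table versions, so do not expect output identical to the log above.

```r
library(data.table)

# Hypothetical test file: 10,000 rows, three columns
csv_path <- tempfile(fileext = ".csv")
n <- 10000L
fwrite(
  data.table(id = seq_len(n), x = seq_len(n) / 4, g = rep(c("a", "b"), length.out = n)),
  csv_path
)

# First read: verbose = TRUE prints the detection steps and per-step timings
t_first <- system.time(dt <- fread(csv_path, verbose = TRUE))

# Second read of the same file, for comparison
t_second <- system.time(dt2 <- fread(csv_path))

# Elapsed seconds for each read
print(c(first = t_first[["elapsed"]], second = t_second[["elapsed"]]))
```

Because the file was written just before it is read, it is probably already in the OS cache on the first read, so the caching effect described in the note may not show up here. It is easier to see with a large file read in a fresh session.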

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000         # i.e. my slow netbook
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1
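The "Column type detection (first, middle and last 5 rows)" step in the log can be illustrated roughly in base R. This is only a sketch of the idea, not fread's actual C code. The sample data and the use of type.convert to guess types are my assumptions.

```r
# Hypothetical file contents: a header plus 100 data rows
lines <- c(
  "id,score,name",
  sprintf("%d,%.1f,n%d", 1:100, (1:100) / 2, 1:100)
)
data_lines <- lines[-1]
n <- length(data_lines)

# Pick 5 rows from the start, 5 around the middle and 5 from the end
mid <- n %/% 2
idx <- unique(c(1:5, (mid - 2):(mid + 2), (n - 4):n))

# Parse only the sampled rows, keeping every field as character for now
sample_df <- read.csv(
  text = c(lines[1], data_lines[idx]),
  colClasses = "character"
)

# Guess a type per column from the sample alone
guessed <- vapply(
  sample_df,
  function(col) class(type.convert(col, as.is = TRUE)),
  character(1)
)
print(guessed)
```

A guess made this way can be wrong if an unsampled row holds a different kind of value, which is why the log also has steps for "type bumps" after the data has been read.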

【Discussion】:

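Also, read.csv() reads everything into a big character matrix and then modifies that. Does fread() do the same thing? In fastread we guess the column types and then coerce as we go, to avoid a complete copy of the df.
@hadley No. Why are you suggesting it might? What is fastread?
@hadley OK, I see your repo. Why are you doing that?
It seems like an obvious way (to us!) to reduce memory usage, i.e. rather than creating a full character vector and then coercing it to a numeric vector, coerce as you go.
Obvious to us != obvious to everyone != correct. I was not suggesting anything about fread().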