Reading binary data in R instead of unpacking it in Python

【Question title】: Read binary data in R instead of unpack in python
【Posted】: 2012-09-04 18:47:02
【Question】:

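I have been learning Python for 8 months and am new to R. I have a binary file that I can read in Python,
converting the binary data into a list (in Python, an array is a list).
The data file (named test) is at:
https://www.box.com/s/0g3qg2lqgmr7y7fk5aut
Its structure is:
every 4 bytes is one integer, so in Python I read it with unpack: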

import struct
datafile=open('test','rb')
data=datafile.read(32)
result=[]
while  data:
    result.append(list(struct.unpack('iiiiiiii',data)))
    data=datafile.read(32)

How can I read this binary data in R?

With Paul Hiemstra's help, I arrived at the following R code.

datafile="test"
totalsize=file.info(datafile)$size
lines=totalsize/32
data=readBin("test",integer(),n=totalsize,size=4,endian="little")
result=data.frame(matrix(data,nrow=lines,ncol=8,byrow=TRUE))
colnames(result)=c("date","x1","x2","x3","x4","x5","x6","x7")

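There is still one problem I would like to solve. Here I read all the data at once with n=totalsize. If the data is too large to fit in memory, how do I read only from the 1001st to the 2000th byte? With n=1000 the data is read from the 1st to the 1000th, and with n=2000 from the 1st to the 2000th, so how do I read from the 1001st to the 2000th? Is there a file pointer in R? After I have read the 1000th binary value, the file pointer is at position 1000; would the command readBin("test", integer(), n = 1000, size = 4, endian = "little") then read the 1001st to 2000th values?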

【Comments】:

Try to polish your language, mainly the missing capital letters, the incorrect use of commas, and so on.

Tags: r

【Solution 1】:

Searching for R read binary file yields the following link as its first result. The bottom line is to use the readBin function, which in your case would look like:

file2read = file("test", "rb")
number_of_integers_in_file = 128
spam = readBin(file2read, integer(), number_of_integers_in_file, size = 4)
close(file2read)

If you do not know the number of integers in the file, there are a few things you can do. First, create an example file:

# Create a binary file that we can read
l = as.integer(1:10)
file2write = file("/tmp/test", "wb")
writeBin(l, file2write)
close(file2write)

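One strategy is to overestimate the number of integers to read; readBin will return only the numbers that are actually there. A vector of size n is preallocated, so take care not to make it too large.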

file2read = file("/tmp/test", "rb")
l_read = readBin(file2read, integer(), n = 100)
close(file2read)
all.equal(l, l_read)
[1] TRUE

Alternatively, if you know the size of the numbers, e.g. 4 bytes, you can work out how many there are with the following function I wrote:

number_of_numbers = function(path, size = 4) {
  # If path is a file connection, extract file name
  if(inherits(path, "file")) path = summary(path)[["description"]]
  return(file.info(path)[["size"]] / size)
}
number_of_numbers("/tmp/test")
[1] 10

In action:

file2read = file("/tmp/test", "rb")
l_read2 = readBin(file2read, integer(), n = number_of_numbers(file2read))
close(file2read)
all.equal(l, l_read2)   
[1] TRUE

If the data is too large to fit in memory, I recommend reading it in chunks. This can be done with successive calls to readBin, e.g.:

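# Open the connection once; each readBin call continues where the previous one stopped
con = file("test", "rb")
first_1000 = readBin(con, integer(), n = 1000)
next_1000 = readBin(con, integer(), n = 1000)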
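
The successive calls above can be wrapped in a loop that stops when readBin returns nothing. What follows is only a minimal sketch under stated assumptions: the demo file is a throwaway created with tempfile(), and the names path, chunk_size, total and n_chunks are illustrative, not part of the original answer. Note that the position is only remembered for an open connection; passing the file name "test" to readBin each time reopens the file and reads from the beginning again.

```r
# Hypothetical demo file: 2500 four-byte integers in a temporary location
path <- tempfile()
out <- file(path, "wb")
writeBin(as.integer(1:2500), out, size = 4)
close(out)

con <- file(path, "rb")
chunk_size <- 1000   # integers per chunk (assumed value, tune to available memory)
total <- 0           # running result, so no chunk has to be kept around
n_chunks <- 0
repeat {
  chunk <- readBin(con, integer(), n = chunk_size, size = 4)
  if (length(chunk) == 0) break           # nothing left: end of file reached
  total <- total + sum(as.numeric(chunk)) # process the chunk, then let it go
  n_chunks <- n_chunks + 1
}
close(con)
```

Here the last chunk is shorter than chunk_size (500 values), which readBin handles by returning only what is actually left in the file.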

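If you want to skip part of the data file, say the first 1000 numbers, use the seek function. This is much faster than reading 1000 numbers, discarding them, and then reading the second 1000 numbers. For example: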

# Skip the first thousand 4 byte integers
seek(con, where = 4*1000)
next_1000 = readBin(con, integer(), n = 1000)
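
Applied to the byte range asked about in the question (the 1001st to the 2000th byte), the same idea might look like the sketch below. It assumes a throwaway demo file created with tempfile(); path, block and pos_after are illustrative names. seek takes its offset in bytes counted from zero, so the 1001st byte sits at offset 1000, and 1000 bytes hold 250 four-byte integers. R's own help page for seek cautions against relying on it on Windows, so treat this as a sketch rather than a portable recipe.

```r
# Hypothetical demo file: 1000 four-byte integers (4000 bytes)
path <- tempfile()
out <- file(path, "wb")
writeBin(as.integer(1:1000), out, size = 4)
close(out)

con <- file(path, "rb")
seek(con, where = 1000, origin = "start")           # jump to byte offset 1000
block <- readBin(con, integer(), n = 250, size = 4) # bytes 1001..2000
pos_after <- seek(con)                              # current position, in bytes
close(con)
```

Calling seek(con) without a where argument only reports the current position, which is one way to check where the "file pointer" is after a read.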

【Comments】:

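readBin("test",integer(),n=1,size=4,endian="little") gives the same result as Python: the first value (4 bytes) is 20070702. But I still have a question: how do I know the length of the binary file? Can I use the argument n to advance the file pointer, letting n increase until the end of the file?
More details on your question about reading in chunks.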