Error reading Stata data in R: answers

【Question title】: Error reading Stata data in R
【Posted】: 2013-08-27 17:12:54
【Question】:

I am trying to read a Stata dataset into R with the foreign package, but when I read the file like this:

library(foreign)
data <- read.dta("data.dta")

I get the following error:

Error in read.dta("data.dta") : a binary read error occurred

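The file works fine in Stata. One site suggests saving the file in Stata without labels and then reading it into R. With that workaround I can load the file into R, but I lose the labels. Why does this error occur, and how can I read the file into R with its labels? Another person found that the error occurs when the data contain variables with no values. My data do have at least one or two such variables, but I have no easy way to identify them in Stata. This is a very large file with thousands of variables.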

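【Comments】:

Even with a large number of variables, there are several ways to test for missing values in Stata. See here.
The Stata version used to create the file may be the problem. Read the help page for read.dta carefully, then do whatever is needed to produce a file in a supported version.

Tags: r stata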

【Solution 1】:

You should call library(foreign) before reading the Stata data.

library(foreign)
data <- read.dta("data.dta")

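Update: as mentioned here,

"The error message implies that the file was found, and that it started with the right sequence of bytes to be a Stata .dta file, but that something (probably the end of the file) prevented R from reading what it was expecting to read."

Without further information, however, we can only guess.

Update in response to the OP's edited question and answer:

I tested whether this is the case using Stata's auto dataset, and it is not. So there must be some other cause:

*Claims 1 and 2: R's read.dta raises an error if a variable is entirely missing or if the dataset has labels*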

sysuse auto        // this dataset has labels
replace mpg = .    // sets mpg to missing for every observation
br in 1/10
make            price   mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio  foreign
AMC Concord      4099     .      3       2.5     11    2930     186    40           121        3.58  Domestic
AMC Pacer        4749     .      3       3.0     11    3350     173    40           258        2.53  Domestic
AMC Spirit       3799     .      .       3.0     12    2640     168    35           121        3.08  Domestic
Buick Century    4816     .      3       4.5     16    3250     196    40           196        2.93  Domestic
Buick Electra    7827     .      4       4.0     20    4080     222    43           350        2.41  Domestic
Buick LeSabre    5788     .      3       4.0     21    3670     218    43           231        2.73  Domestic
Buick Opel       4453     .      .       3.0     10    2230     170    34           304        2.87  Domestic
Buick Regal      5189     .      3       2.0     16    3280     200    42           196        2.93  Domestic
Buick Riviera   10372     .      3       3.5     17    3880     207    43           231        2.93  Domestic
Buick Skylark    4082     .      3       3.5     13    3400     200    42           231        3.08  Domestic

save "~\myauto"
describe

Contains data from ~\myauto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          25 Aug 2013 11:32
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  foreign


library(foreign)
myauto <- read.dta("myauto.dta")  # works perfectly
str(myauto)
'data.frame':   74 obs. of  12 variables:
 $ make        : chr  "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
 $ price       : int  4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
 $ mpg         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ rep78       : int  3 3 NA 3 4 3 NA 3 3 3 ...
 $ headroom    : num  2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
 $ trunk       : int  11 11 12 16 20 21 10 16 17 13 ...
 $ weight      : int  2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
 $ length      : int  186 173 168 196 222 218 170 200 207 200 ...
 $ turn        : int  40 40 35 40 43 43 34 42 43 42 ...
 $ displacement: int  121 258 121 196 350 231 304 196 231 231 ...
 $ gear_ratio  : num  3.58 2.53 3.08 2.93 2.41 ...
 $ foreign     : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr "1978 Automobile Data"
 - attr(*, "time.stamp")= chr "25 Aug 2013 11:23"
 - attr(*, "formats")= chr  "%-18s" "%8.0gc" "%8.0g" "%8.0g" ...
 - attr(*, "types")= int  18 252 252 252 254 252 252 252 252 252 ...
 - attr(*, "val.labels")= chr  "" "" "" "" ...
 - attr(*, "var.labels")= chr  "Make and Model" "Price" "Mileage (mpg)" "Repair Record 1978" ...
 - attr(*, "expansion.fields")=List of 2
  ..$ : chr  "_dta" "note1" "from Consumer Reports with permission"
  ..$ : chr  "_dta" "note0" "1"
 - attr(*, "version")= int 12
 - attr(*, "label.table")=List of 1
  ..$ origin: Named int  0 1
  .. ..- attr(*, "names")= chr  "Domestic" "Foreign"

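【Comments】:

I do call it; I just left that out of the question.
I added more information above. The file is not corrupted, since I can read it into Stata. Unfortunately, the file is too large for me to post here.
You could at least use the describe command in Stata to give summary information, or post sample data [a few rows and a few columns].
If you are interested in investigating further, create the file in R, load it in Stata and re-save it, then try loading it back into R.
The question is why I would need to create the file in R, when the problem you describe is reading a Stata file in R.

【Solution 2】:

I don't know why this happens, and I would be interested if someone could explain it, but read.dta does indeed fail on variables that are entirely NA. One solution is to drop such variables in Stata with the following code: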

foreach varname of varlist * {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}
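
If the label-free copy of the file does load in R (the workaround described in the question), the all-missing variables can also be listed from the R side before going back to Stata. The following is a minimal sketch: `all_missing_vars` is a hypothetical helper name, and `demo` is a made-up stand-in for the real data. Whether all-missing variables actually cause the error is disputed elsewhere in this thread, so treat this as a diagnostic step only.

```r
# Hypothetical helper: names of the columns in which every value is NA.
all_missing_vars <- function(df) {
  is_all_na <- vapply(df, function(x) all(is.na(x)), logical(1))
  names(df)[is_all_na]
}

# Made-up stand-in for the real data: mpg is entirely missing,
# rep78 is only partly missing.
demo <- data.frame(
  price = c(4099, 4749, 3799),
  mpg   = NA_real_,
  rep78 = c(3, 3, NA)
)

candidates <- all_missing_vars(demo)
print(candidates)
```

The resulting names can then be passed to drop in Stata, which avoids looping over thousands of variables by hand.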

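【Comments】:

The comments on the blog entry you cite (1) give code shorter than this and (2) mention dropmiss as a user-written solution. Run findit dropmiss, install it from the most recent site listed, and then dropmiss is all you need.
See my update. At least for a small dataset in Stata, the claim in this answer does not hold.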

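【Solution 3】:

Here is a list of things to try. My guess is that the first item has a 75% chance of solving your problem.

1. In Stata, re-save a fresh copy of the dta file with saveold, then try again.
2. If that fails, provide an example showing which values cause the read.dta function to fail.
3. If missing values are to blame, run the loop from the other answer.

A fuller description of the dataset is needed to go beyond this point. The problem seems solvable; I have never had trouble using foreign with a large number of Stata files.

You could also try the Stata.file function from the memisc package and see whether that fails too.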
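
The saveold suggestion is about the file format release, and that can be checked from R before re-saving anything. The sketch below assumes two things about the .dta format: older files store the release number in their first byte (115 for Stata 12, for example), while files from Stata 13 onward begin with a text header containing a `<release>` tag. `dta_release` is a hypothetical helper, not a function in foreign, and the two temporary files are synthetic stand-ins rather than real datasets. A version mismatch normally produces a different message from the one in the question, so this only helps rule that explanation in or out.

```r
# Hypothetical helper (not part of foreign): report the .dta format release.
dta_release <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  bytes <- readBin(con, what = "raw", n = 64L)
  if (length(bytes) == 0L) return(NA_integer_)
  if (bytes[1] == as.raw(0x3c)) {
    # Newer files open with "<stata_dta><header><release>NNN</release>".
    txt <- rawToChar(bytes[bytes != as.raw(0)])
    hit <- regmatches(txt, regexpr("<release>[0-9]+</release>", txt, useBytes = TRUE))
    if (length(hit) == 0L) return(NA_integer_)
    return(as.integer(gsub("[^0-9]", "", hit)))
  }
  # Older files store the release number in the first byte.
  as.integer(bytes[1])
}

# Synthetic stand-ins for real files, used only to exercise the helper.
old_style <- tempfile(fileext = ".dta")
writeBin(as.raw(c(115, 2, 1, 0)), old_style)
new_style <- tempfile(fileext = ".dta")
writeBin(charToRaw("<stata_dta><header><release>117</release>"), new_style)

releases <- c(old = dta_release(old_style), new = dta_release(new_style))
print(releases)
print(releases > 115)  # TRUE points to a file written by Stata 13 or later
```

If the real file reports a release above 115, re-saving it with saveold in Stata is the first thing to try, since the read.dta help page lists only Stata versions 5 to 12 as supported.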

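【Comments】:

【Solution 4】:

It has been a long time, but I solved the same problem by exporting the .dta data to .csv. The problem was related to the labels of the factor variables, specifically because the labels were in Spanish and the ASCII encoding made a mess of them. I hope this is useful to anyone who has the same problem and uses Stata.

In Stata: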

export delimited using "/Users/data.csv", nolabel replace

In R:

df <- read.csv("/Users/data.csv")
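
Because the export uses nolabel, the CSV carries only numeric codes, so the value labels have to be reattached in R. A minimal sketch of one way to do that follows, using the origin label table shown in Solution 1 (0 = Domestic, 1 = Foreign). The inline CSV text stands in for the exported file, and its last row is a hypothetical example rather than real data.

```r
# Inline stand-in for the exported CSV; the last row is hypothetical.
csv_text <- "make,price,foreign
AMC Concord,4099,0
AMC Pacer,4749,0
Example Import,5000,1
"
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Value labels copied by hand from the Stata label table "origin".
origin <- c(Domestic = 0, Foreign = 1)

# Turn the numeric codes back into a labelled factor.
df$foreign <- factor(df$foreign, levels = origin, labels = names(origin))
str(df)
```

For a file with many labelled variables, the label definitions would need to be exported from Stata separately and applied in a loop; this sketch covers a single variable only.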

【Comments】: