How to get more than one company's information using edgarWebR

[Question title]: How to get more than one company's information using edgarWebR
[Posted]: 2021-02-25 17:12:48
[Question description]:

I am trying to use the edgarWebR package to get company and filing information from EDGAR. In particular, I want to use two functions from the package: filing_information and company_filings.

I actually have thousands of CIKs in a separate dataset, but neither of the two functions above accepts a vector of CIKs. Here is an example:

library(edgarWebR)
comp_file <- company_filings(c("1000045"), before = "20201231",
                            type = "10-K",  count = 100,
                            page = 1)

head(comp_file)
  accession_number act file_number filing_date accepted_date
1             <NA>  34   000-26680  2020-06-22    2020-06-22
2             <NA>  34   000-26680  2019-06-28    2019-06-28
3             <NA>  34   000-26680  2018-06-27    2018-06-27
4             <NA>  34   000-26680  2017-06-14    2017-06-14
5             <NA>  34   000-26680  2016-06-14    2016-06-14
6             <NA>  34   000-26680  2015-06-15    2015-06-15
                                                                                               href
1 https://www.sec.gov/Archives/edgar/data/1000045/000156459020030033/0001564590-20-030033-index.htm
2 https://www.sec.gov/Archives/edgar/data/1000045/000156459019023956/0001564590-19-023956-index.htm
3 https://www.sec.gov/Archives/edgar/data/1000045/000119312518205637/0001193125-18-205637-index.htm
4 https://www.sec.gov/Archives/edgar/data/1000045/000119312517203193/0001193125-17-203193-index.htm
5 https://www.sec.gov/Archives/edgar/data/1000045/000119312516620952/0001193125-16-620952-index.htm
6 https://www.sec.gov/Archives/edgar/data/1000045/000119312515223218/0001193125-15-223218-index.htm
  type film_number
1 10-K    20977409
2 10-K    19927449
3 10-K    18921743
4 10-K    17910577
5 10-K   161712394
6 10-K    15931101
                                               form_name
1 Annual report [Section 13 and 15(d), not S-K Item 405]
2 Annual report [Section 13 and 15(d), not S-K Item 405]
3 Annual report [Section 13 and 15(d), not S-K Item 405]
4 Annual report [Section 13 and 15(d), not S-K Item 405]
5 Annual report [Section 13 and 15(d), not S-K Item 405]
6 Annual report [Section 13 and 15(d), not S-K Item 405]
  description  size
1        <NA> 14 MB
2        <NA> 10 MB
3        <NA>  5 MB
4        <NA>  5 MB
5        <NA>  5 MB
6        <NA>  7 MB

I need to pass the href variable to the filing_information function.

I also tried calling it like this:

file_info <- filing_information(comp_file$href)

But it does not work. I get this message:


Error in parse_url(url) : length(url) == 1 is not TRUE

I can make it work by passing each href value individually, like this:

x <- "https://www.sec.gov/Archives/edgar/data/1000045/000156459020030033/0001564590-20-030033-index.htm"

file_info <- filing_information(x)

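The same goes for company_filings: here I use only one CIK, "1000045", but in another file I have thousands of CIKs, and I want to run company_filings on all of them. Doing this by hand is not feasible.

Does anyone know how to run these two functions automatically over a LARGE vector?

Thanks.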

[Comments]:

res <- lapply(setNames(nm=comp_file$href), filing_information) will give you a list of the return values. If each return value is a data.frame, you can combine the results with one of: do.call(rbind.data.frame, res), dplyr::bind_rows(res, .id="href"), or data.table::rbindlist(res, idcol="href").
@r2evans That works well. How do I do the same for company_filings? I tried res2 <- lapply(setNames(nm=df2$cik), company_filings) and it works, but how do I add the other arguments of company_filings, such as before = "20201231", type = "10-K", count = 100, page = 1?

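Tags: r edgar

[Solution 1]:

In general, when a function (whether it calls an API or runs locally) accepts only a single element as its argument, the simplest way to "vectorize" it is usually some form of lapply: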

companies <- c("1000045", "1000046", "1000047")
comp_file_list <- lapply(
  setNames(nm=companies),
  function(comp) company_filings(comp, before = "20201231",
                                 type = "10-K",  count = 100,
                                 page = 1)
)

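Technically, the setNames(nm=.) part is a safeguard that records which company ID was used for each element. If the ID is already included in the returned data, you can drop it.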
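To see what the setNames(nm=.) safeguard does without calling EDGAR, here is a minimal sketch. toy_lookup is a hypothetical stand-in for company_filings, not part of edgarWebR; it only exists so the example runs offline.

```r
# Hypothetical stand-in for company_filings(); it makes no network request.
toy_lookup <- function(cik) {
  data.frame(type = "10-K", n_char = nchar(cik))
}

companies <- c("1000045", "1000046", "1000047")

# A plain character vector has no names, so lapply() returns an unnamed list.
unnamed <- lapply(companies, toy_lookup)

# setNames(nm = x) names each element of x after itself; lapply() keeps those names.
named <- lapply(setNames(nm = companies), toy_lookup)

print(names(unnamed))
print(names(named))
```

With the names in place, each list element can be traced back to the ID that produced it, even after failed elements are filtered out.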

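Assuming the return value is always a data.frame, you can either keep them in the list (and work with them as a list of frames, cf. https://stackoverflow.com/a/24376207/3358227), or combine them into a single taller frame using one of the following: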

# base R: tag each frame with its list name, then stack the frames
tagged <- Map(function(x, nm) transform(x, id = nm), comp_file_list, names(comp_file_list))
comp_files <- do.call(rbind, tagged)

# dplyr/tidyverse
comp_files <- dplyr::bind_rows(comp_file_list, .id = "id")

# data.table
comp_files <- data.table::rbindlist(comp_file_list, idcol = "id")
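The base R variant can be tried offline with a toy list of frames. This is only a sketch: toy_filings is a hypothetical stand-in for company_filings, and the two rows it returns are placeholder data.

```r
# Hypothetical stand-in for company_filings(); returns a small data.frame per id.
toy_filings <- function(cik) {
  data.frame(type = c("10-K", "10-K"),
             filing_date = c("2020-06-22", "2019-06-28"),
             stringsAsFactors = FALSE)
}

companies <- c("1000045", "1000046")
comp_file_list <- lapply(setNames(nm = companies), toy_filings)

# Tag each frame with its list name, then stack them (base R only).
tagged <- Map(function(x, nm) transform(x, id = nm),
              comp_file_list, names(comp_file_list))
comp_files <- do.call(rbind, tagged)

# rbind() on a named list builds row names such as "1000045.1"; reset them.
rownames(comp_files) <- NULL

print(comp_files)
```

The id column is what lets you join the stacked filings back to the dataset that holds the original CIKs.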

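FYI, the second argument to lapply is a function whose first argument is filled with each element of X (the first argument to lapply). Sometimes this can be just the function itself, as in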

res <- lapply(companies, company_filings)

which is equivalent to

res <- lapply(companies, function(z) company_filings(z))

If you have a set of arguments that must apply to every call, you can use either of the following equivalent expressions:

res <- lapply(companies, company_filings, before = "20201231", type = "10-K",  count = 100, page = 1)
res <- lapply(companies, function(z) company_filings(z, before = "20201231", type = "10-K",  count = 100, page = 1))
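A small offline check of that equivalence, as a sketch: toy_filings is a hypothetical function that just echoes its arguments, and its defaults are made up for illustration rather than copied from edgarWebR.

```r
# Hypothetical stand-in for company_filings(); it only echoes its arguments.
toy_filings <- function(x, before = NA, type = NA, count = 40, page = 1) {
  data.frame(cik = x, before = before, type = type, count = count, page = page,
             stringsAsFactors = FALSE)
}

companies <- c("1000045", "1000046", "1000047")

# Extra arguments after the function are forwarded unchanged to every call.
res_dots <- lapply(companies, toy_filings,
                   before = "20201231", type = "10-K", count = 100, page = 1)

# The same calls written out with an anonymous function.
res_anon <- lapply(companies, function(z) {
  toy_filings(z, before = "20201231", type = "10-K", count = 100, page = 1)
})

print(identical(res_dots, res_anon))
```

The anonymous-function form is longer, but it is the one that extends naturally when you later need error handling or per-company logic inside the call.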

However, if one (or more) of those arguments varies by company, you need a different form. Suppose we have a different before= argument for each company:

befores <- c("20201231", "20201130", "20201031")
res <- Map(function(comp, bef) company_filings(comp, before=bef, type="10-K"),
           companies, befores)
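The pairing that Map performs can be checked without network access. In this sketch, toy_filings is a hypothetical stand-in that returns the arguments it received.

```r
# Hypothetical stand-in for company_filings(); returns the arguments it was given.
toy_filings <- function(x, before, type) {
  data.frame(cik = x, before = before, type = type, stringsAsFactors = FALSE)
}

companies <- c("1000045", "1000046", "1000047")
befores <- c("20201231", "20201130", "20201031")

# Map() walks both vectors in parallel: first id with first date, and so on.
res <- Map(function(comp, bef) toy_filings(comp, before = bef, type = "10-K"),
           companies, befores)

# Map() names the result after its first vector when that vector is character.
print(names(res))
print(res[["1000046"]]$before)
```

Because Map recycles shorter inputs, it is worth checking that both vectors have the same length before running thousands of requests.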

Basic error handling for when the query fails for some ids/refs:

# name the input so that failures can be traced back to a company id
res <- lapply(setNames(nm = companies), function(cmp) {
  tryCatch(
    company_filings(cmp, before=".."),
    error = function(e) e
  )
})
errors <- sapply(res, inherits, "error")
failures <- res[errors]
successes <- res[!errors]
good_returns <- do.call(rbind, successes)

names(failures)
# indicates which company ids failed, and the text of the error may
# indicate why they failed
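The same pattern can be exercised offline with a function that fails on purpose. This is a sketch under stated assumptions: toy_filings and the "bad-id" value are hypothetical, and the error text is invented for the example.

```r
# Hypothetical stand-in for company_filings(); it rejects non-numeric ids.
toy_filings <- function(x) {
  if (!grepl("^[0-9]+$", x)) stop("company not found: ", x)
  data.frame(cik = x, type = "10-K", stringsAsFactors = FALSE)
}

companies <- c("1000045", "bad-id", "1000047")

# Capture the error object instead of letting one bad id abort the whole loop.
res <- lapply(setNames(nm = companies), function(cmp) {
  tryCatch(toy_filings(cmp), error = function(e) e)
})

# Split the results into failures and successes.
errors <- vapply(res, inherits, logical(1), what = "error")
failures <- res[errors]
successes <- res[!errors]
good_returns <- do.call(rbind, successes)

print(names(failures))
print(vapply(failures, conditionMessage, character(1)))
```

Keeping the failed ids separate makes it easy to retry only those later instead of repeating every request.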

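Some options for the error= argument of tryCatch(...):

error=identity returns the original error object, which sometimes carries enough information
error=function(e) e does the same thing
error=function(e) conditionMessage(e) returns a character string, the message part of the error
error=function(e) NULL ignores the error and returns NULL (or some constant) instead

You can also treat e conditionally, with patterns such as if (grepl("not found", conditionMessage(e))) {...} else NULL.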
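A conditional handler might look like the following sketch. toy_filings, the "timeout" id, and both error messages are hypothetical. Returning the message for unexpected errors is one design choice among several, not something the package prescribes.

```r
# Hypothetical stand-in for company_filings() with two kinds of failure.
toy_filings <- function(x) {
  if (x == "timeout") stop("server timeout")
  if (!grepl("^[0-9]+$", x)) stop("company not found")
  data.frame(cik = x, stringsAsFactors = FALSE)
}

safe_filings <- function(cmp) {
  tryCatch(
    toy_filings(cmp),
    error = function(e) {
      msg <- conditionMessage(e)
      if (grepl("not found", msg, fixed = TRUE)) {
        NULL   # expected failure: treat as "no data"
      } else {
        msg    # unexpected failure: keep the message for later inspection
      }
    }
  )
}

ids <- c("1000045", "bad-id", "timeout")
res <- lapply(setNames(nm = ids), safe_filings)

str(res)
```

Matching on conditionMessage(e) rather than on e itself keeps the test to a single string, which is what if() expects.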

[Discussion]:

Thank you very much.
When I run the code on my actual data, I get the following error: Error in strsplit(URL, "") : non-character argument. The code I actually run is comp_file_list <- lapply( setNames(nm=companies), function(comp) company_filings(comp, before = "20201231", type = "10-K", count = 1000) ), where companies contains all my CIKs.
That happens if your companies is not character. If companies is a vector, try lapply(setNames(nm=as.character(companies)),...). If companies is not a vector, then ... that is not right :-)
I changed the code after your last comment, but now it shows this: Error in curl::curl_fetch_memory(url, handle = handle) : necessary data rewind wasn't possible. The code I actually run is comp_file_list <- lapply( setNames(nm=as.character(companies)), function(comp) company_filings(comp, before = "20201231", type = "10-K", count = 1000) ). My vector looks like this: head (companies) # A tibble: 6 x 1 cik2 <chr> 1 730052 2 1750 3 313368 4 910627 5 702511 6 61478
The curl error is probably unrelated to this question. A quick search suggests it is (a) a server hiccup, (b) a space in a filename, or (c) some other malformed URL. Or something else; sorry, I am not a curl expert. I suggest wrapping company_filings in try(..., silent=TRUE) to help identify which URL causes the error. If it is always the same URL, you know where to dig deeper. If the same companies give different results on different runs, that points to a network or library problem (unlikely to be R itself).
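One possible contributing factor, judging only from the printed head above: companies looks like a one-column data frame, not a vector, and as.character() on a data frame collapses each column into a single string. The sketch below assumes that shape and then shows the try(..., silent=TRUE) wrapper suggested above. toy_filings and its simulated failure are hypothetical.

```r
# Assumed shape of the data: a one-column data frame, as the printed head suggests.
companies <- data.frame(cik2 = c("730052", "1750", "313368"),
                        stringsAsFactors = FALSE)

# as.character() on a data frame deparses each column into one string,
# so this yields a single value rather than three ids.
collapsed <- as.character(companies)
print(length(collapsed))

# Pull the column out first to get one id per element.
ciks <- as.character(companies[["cik2"]])

# Hypothetical stand-in for company_filings(); fails for one id on purpose.
toy_filings <- function(x) {
  if (x == "1750") stop("simulated fetch failure")
  data.frame(cik = x, stringsAsFactors = FALSE)
}

# try(silent = TRUE) returns a "try-error" object instead of stopping the loop.
res <- lapply(setNames(nm = ciks), function(cik) try(toy_filings(cik), silent = TRUE))
failed <- names(res)[vapply(res, inherits, logical(1), what = "try-error")]
print(failed)
```

If the same ids show up in failed on every run, the problem is likely tied to those inputs; if the set changes between runs, a network or server cause is more plausible.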