【问题标题】:Edit and Filter JSON List of Lists in R在 R 中编辑和过滤列表的 JSON 列表
【发布时间】:2018-04-02 12:17:37
【问题描述】:

我正在尝试显示这个数据集 -> https://mtgjson.com/json/AllSets.json.zip

但是,我想展平数据,使其不会嵌套为列表中、列表中、列表中的一堆 JSON 数据。

更具体地说,我试图将数据显示为数据框,按$releaseDate(变量之一)的顺序排列。

这是我目前的尝试:

library(jsonlite)
library(tidyjson)
mtgdata <- fromJSON("~/path/to/file.json")

mtgdata 的结果显示了这个列表列表:

summary(mtgdata)
        Length Class  Mode
UST       9     -none- list
UNH      10     -none- list
UGL      11     -none- list
pWOS      8     -none- list
pWOR      8     -none- list
pWCQ      8     -none- list
pSUS      8     -none- list
pSUM     10     -none- list
pREL      8     -none- list
pPRO      8     -none- list
pPRE      8     -none- list
pPOD      7     -none- list
pMPR      8     -none- list
pMGD      8     -none- list
pMEI      8     -none- list
pLPA      8     -none- list
pLGM      8     -none- list
pJGP     10     -none- list
pHHO     11     -none- list
pWPN      8     -none- list
pGTW      8     -none- list
pGRU     10     -none- list
pGPX      8     -none- list
pFNM     10     -none- list
pELP      8     -none- list
pDRC      7     -none- list
pCMP      8     -none- list
pCEL      8     -none- list
pARL      8     -none- list
pALP     10     -none- list
p2HG      8     -none- list
p15A      8     -none- list
PD3       9     -none- list
PD2       9     -none- list
H09       9     -none- list
PTK      12     -none- list
POR      12     -none- list
PO2      13     -none- list
PCA       7     -none- list
PC2      10     -none- list
HOP      10     -none- list
VMA       9     -none- list
MMA      10     -none- list
MM3       8     -none- list
MM2      11     -none- list
MED       9     -none- list
ME4       9     -none- list
ME3       9     -none- list
ME2       9     -none- list
IMA       8     -none- list
EMA       9     -none- list
A25       8     -none- list
MPS_AKH   8     -none- list
MPS       9     -none- list
EXP       9     -none- list
E02       7     -none- list
V17       8     -none- list
V16       7     -none- list
V15       9     -none- list
V14       9     -none- list
V13       9     -none- list
V12      10     -none- list
V11      10     -none- list
V10       9     -none- list
V09      10     -none- list
DRB       9     -none- list
EVG       9     -none- list
DDT       7     -none- list
DDS       7     -none- list
DDR       7     -none- list
DDQ       8     -none- list
DDP      10     -none- list
DDO      10     -none- list
DDN      10     -none- list
DDM      10     -none- list
DDL      10     -none- list
DDK      10     -none- list
DDJ      10     -none- list
DDI      10     -none- list
DDH      10     -none- list
DDG      10     -none- list
DDF      10     -none- list
DDE      10     -none- list
DDD       9     -none- list
DDC       9     -none- list
DD3_JVC   9     -none- list
DD3_GVL   9     -none- list
DD3_EVG   9     -none- list
DD3_DVD   9     -none- list
DD2      11     -none- list
CNS      11     -none- list
CN2       9     -none- list
CMD      11     -none- list
CMA       7     -none- list
CM1      10     -none- list
C17       6     -none- list
C16       8     -none- list
C15      10     -none- list
C14      10     -none- list
C13      10     -none- list
CEI       9     -none- list
CED       9     -none- list
E01       7     -none- list
ARC       9     -none- list
ZEN      12     -none- list
XLN      12     -none- list
WWK      12     -none- list
WTH      13     -none- list
W17       8     -none- list
W16       8     -none- list
VIS      13     -none- list
VAN       8     -none- list
USG      13     -none- list
ULG      13     -none- list
UDS      13     -none- list
TSP      12     -none- list
TSB      12     -none- list
TPR      11     -none- list
TOR      12     -none- list
TMP      13     -none- list
THS      12     -none- list
STH      13     -none- list
SOM      12     -none- list
SOK      12     -none- list
SOI      10     -none- list
SHM      12     -none- list
SCG      12     -none- list
S99      11     -none- list
S00      11     -none- list
RTR      12     -none- list
RQS       6     -none- list
ROE      12     -none- list
RIX      12     -none- list
RAV      12     -none- list
PLS      13     -none- list
PLC      12     -none- list
PCY      13     -none- list
ORI      11     -none- list
ONS      12     -none- list
OGW      10     -none- list
ODY      13     -none- list
NPH      12     -none- list
NMS      14     -none- list
MRD      12     -none- list
MOR      12     -none- list
MMQ      13     -none- list
MIR      13     -none- list
MGB      10     -none- list
MD1       9     -none- list
MBS      12     -none- list
M15      11     -none- list
M14      11     -none- list
M13      11     -none- list
M12      11     -none- list
M11      11     -none- list
M10      11     -none- list
LRW      12     -none- list
LGN      12     -none- list
LEG      12     -none- list
LEB      11     -none- list
LEA      11     -none- list
KTK      12     -none- list
KLD       9     -none- list
JUD      12     -none- list
JOU      12     -none- list
ITP      11     -none- list
ISD      12     -none- list
INV      13     -none- list
ICE      13     -none- list
HOU       9     -none- list
HML      12     -none- list
GTC      12     -none- list
GPT      12     -none- list
FUT      12     -none- list
FRF_UGIN 10     -none- list
FRF      12     -none- list
FEM      11     -none- list
EXO      13     -none- list
EVE      12     -none- list
EMN       9     -none- list
DTK      12     -none- list
DST      12     -none- list
DRK      12     -none- list
DPA       9     -none- list
DKM       9     -none- list
DKA      12     -none- list
DIS      12     -none- list
DGM      12     -none- list
CST      11     -none- list
CSP      12     -none- list
CP3       7     -none- list
CP2       7     -none- list
CP1       7     -none- list
CON      13     -none- list
CHR      11     -none- list
CHK      12     -none- list
BTD      10     -none- list
BRB      10     -none- list
BOK      12     -none- list
BNG      12     -none- list
BFZ      12     -none- list
AVR      12     -none- list
ATQ      11     -none- list
ATH       9     -none- list
ARN      11     -none- list
ARB      12     -none- list
APC      13     -none- list
ALL      13     -none- list
ALA      12     -none- list
AKH       9     -none- list
AER       9     -none- list
9ED      12     -none- list
8ED      12     -none- list
7ED      12     -none- list
6ED      12     -none- list
5ED      12     -none- list
5DN      12     -none- list
4ED      12     -none- list
3ED      12     -none- list
2ED      11     -none- list
10E      11     -none- list

在每个列表中都有我感兴趣的变量,我有兴趣分析这些变量来过滤和排序这些数据,就好像它是一个扁平的数据框一样。

当我们检查其中一个列表中的变量列表时(以“mtgdata$UST”为例),我们会得到这组变量:

names(mtgdata$UST)
[1] "name"        "code"        "releaseDate" "border"      "type"        
"booster"     "mkm_name"   
[8] "mkm_id"      "cards"

在 mtgdata ("mtgdata$SOI") 中的另一个列表上运行相同的查询,我们会得到另一组变量,尽管它们大部分是相同的。

正如我上面提到的,我主要感兴趣的是扁平化这个数据集并按 mtgdata$releaseDate 排名 - 但就目前而言,“$releaseDate”当前嵌套在第一组列表中(“$UST”等。 )

您对此问题的帮助或我如何更好地改写这个问题将不胜感激。

【问题讨论】:

标签: r json dplyr tidyr jsonlite


【解决方案1】:

您可以尝试在命令行上使用 this 之类的方法将 JSON 对象数组转换为文件 ndjson 记录,然后使用 ndjson::stream_in("filename_of the_thing_you_just_converted") 之类的方法,但最终会得到 14,000+ 列,非常没用, “平面”数据框。

相反,做一些探索:

library(tidyverse)

as1 <- jsonlite::read_json("~/Downloads/AllSets.json")

str(as1, 1) 
## List of 221
##  $ UST     :List of 9
##  $ UNH     :List of 10
##  $ UGL     :List of 11
##  $ pWOS    :List of 8
##  $ pWOR    :List of 8
##  $ pWCQ    :List of 8
##  $ pSUS    :List of 8
##  $ pSUM    :List of 10
##  $ pREL    :List of 8
##  $ pPRO    :List of 8
##  $ pPRE    :List of 8
##  $ pPOD    :List of 7
##  $ pMPR    :List of 8
##  $ pMGD    :List of 8
##  $ pMEI    :List of 8
##  $ pLPA    :List of 8
##  $ pLGM    :List of 8
##  $ pJGP    :List of 10
##  $ pHHO    :List of 11
## ...

呃……“那些”JSON 文件之一,即使整个文件(理论上)应该是一致的,也不适合填充每条记录的所有元素。

让我们看看哪些 JSON 数组元素填充了最多的字段,因为这意味着它们可能都填充了:

map_dbl(as1, length) %>% 
  broom::tidy() %>% 
  arrange(desc(x))
## # A tibble: 221 x 2
##    names     x
##    <chr> <dbl>
##  1 NMS    14.0
##  2 PO2    13.0
##  3 WTH    13.0
##  4 VIS    13.0
##  5 USG    13.0
##  6 ULG    13.0
##  7 UDS    13.0
##  8 TMP    13.0
##  9 STH    13.0
## 10 PLS    13.0
## # ... with 211 more rows

我们来看看NMS

str(as1[["NMS"]], 1)
## List of 14
##  $ name              : chr "Nemesis"
##  $ code              : chr "NMS"
##  $ gathererCode      : chr "NE"
##  $ magicCardsInfoCode: chr "ne"
##  $ oldCode           : chr "NEM"
##  $ releaseDate       : chr "2000-02-14"
##  $ border            : chr "black"
##  $ type              : chr "expansion"
##  $ block             : chr "Masques"
##  $ booster           :List of 15
##  $ translations      :List of 5
##  $ mkm_name          : chr "Nemesis"
##  $ mkm_id            : int 32
##  $ cards             :List of 143

真的不想扁平化 boostertranslationscards,应该将它们保留为 list 列和 unnest 必要时。

但是,由于每条记录都有不同的字段,我们不能简单地“data.table::rbindlist()ordplyr::bind_rows()` 因为它会抱怨其中的一些列。

我们必须逐个记录并将每个记录转换为一个数据框,处理丢失的字段并将list 包装在list() 中。我们将使用辅助函数来简化函数式惯用语来测试缺失值:

`%l0%` <- function(x, y) if (length(x) > 0) x else y

^^ 比 %||% 更强大一点,与 purrr 一起出现。

最后:

map_df(as1, ~{
  data_frame(
    name = .x$name %l0% NA_character_,
    code = .x$code,
    gathererCode = .x$gathererCode %l0% NA_character_,
    magicCardsInfoCode = .x$magicCardsInfoCode %l0% NA_character_,
    oldCode = .x$oldCode %l0% NA_character_,
    releaseDate = .x$releaseDate %l0% NA_character_,
    border = .x$border,
    type = .x$type,
    block = .x$block %l0% NA_character_,
    booster = list(.x$booster),
    translations = list(.x$translations),
    mkm_name = .x$mkm_name %l0% NA_character_,
    mkm_id = .x$mkm_id %l0% NA_character_,
    cards = list(.x$cards)
  )
}) -> all_sets

而且,你可以看到结果:

all_sets
## # A tibble: 221 x 14
##    name           code  gathererCode magicCardsInfoC… oldCode releaseDate border type  block booster 
##    <chr>          <chr> <chr>        <chr>            <chr>   <chr>       <chr>  <chr> <chr> <list>  
##  1 Unstable       UST   NA           NA               NA      2017-12-08  silver un    NA    <list […
##  2 Unhinged       UNH   NA           uh               NA      2004-11-20  silver un    NA    <list […
##  3 Unglued        UGL   UG           ug               NA      1998-08-11  silver un    NA    <list […
##  4 Wizards of th… pWOS  NA           wotc             NA      1999-09-04  black  promo NA    <NULL>  
##  5 Worlds         pWOR  NA           wrl              NA      1999-08-04  black  promo NA    <NULL>  
##  6 World Magic C… pWCQ  NA           wmcq             NA      2013-04-06  black  promo NA    <NULL>  
##  7 Super Series   pSUS  NA           sus              NA      1999-12-01  black  promo NA    <NULL>  
##  8 Summer of Mag… pSUM  NA           sum              NA      2007-07-21  black  promo NA    <NULL>  
##  9 Release Events pREL  NA           rep              NA      2003-07-26  black  promo NA    <NULL>  
## 10 Pro Tour       pPRO  NA           pro              NA      2007-02-09  black  promo NA    <NULL>  
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>

glimpse(all_sets)
## Observations: 221
## Variables: 14
## $ name               <chr> "Unstable", "Unhinged", "Unglued", "Wizards of the Coast Online Store"...
## $ code               <chr> "UST", "UNH", "UGL", "pWOS", "pWOR", "pWCQ", "pSUS", "pSUM", "pREL", "...
## $ gathererCode       <chr> NA, NA, "UG", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ magicCardsInfoCode <chr> NA, "uh", "ug", "wotc", "wrl", "wmcq", "sus", "sum", "rep", "pro", "pt...
## $ oldCode            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ releaseDate        <chr> "2017-12-08", "2004-11-20", "1998-08-11", "1999-09-04", "1999-08-04", ...
## $ border             <chr> "silver", "silver", "silver", "black", "black", "black", "black", "bla...
## $ type               <chr> "un", "un", "un", "promo", "promo", "promo", "promo", "promo", "promo"...
## $ block              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ booster            <list> [["rare", "uncommon", "uncommon", "uncommon", "common", "common", "co...
## $ translations       <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...
## $ mkm_name           <chr> "Unstable", "Unhinged", "Unglued", NA, NA, NA, NA, "Summer Magic", NA,...
## $ mkm_id             <int> 1821, 59, 22, NA, NA, NA, NA, 76, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cards              <list> [[["Andrea Radeck", 1, ["W"], ["White"], "95ebdf85f4ea74d584dfdfb72e3...

而且,在将列转换为适当的Date 对象后,我们可以通过releaseDate 排列它们:

mutate(all_sets, releaseDate = lubridate::ymd(releaseDate)) %>% 
  arrange(desc(releaseDate))
## # A tibble: 221 x 14
##    name        code  gathererCode magicCardsInfoCo… oldCode releaseDate border type     block booster
##    <chr>       <chr> <chr>        <chr>             <chr>   <date>      <chr>  <chr>    <chr> <list> 
##  1 Masters 25  A25   NA           a25               NA      2018-03-16  black  reprint  NA    <NULL> 
##  2 Rivals of … RIX   NA           rix               NA      2018-01-19  black  expansi… Ixal… <list …
##  3 Unstable    UST   NA           NA                NA      2017-12-08  silver un       NA    <list …
##  4 Explorers … E02   NA           e02               NA      2017-11-24  black  board g… NA    <NULL> 
##  5 From the V… V17   NA           v17               NA      2017-11-24  black  from th… NA    <NULL> 
##  6 Iconic Mas… IMA   NA           ima               NA      2017-11-17  black  reprint  NA    <list …
##  7 Duel Decks… DDT   NA           ddt               NA      2017-11-10  black  duel de… NA    <NULL> 
##  8 Ixalan      XLN   NA           xln               NA      2017-09-29  black  expansi… Ixal… <list …
##  9 Commander … C17   NA           NA                NA      2017-08-25  black  command… NA    <NULL> 
## 10 Hour of De… HOU   NA           hou               NA      2017-07-14  black  expansi… Amon… <list …
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>

【讨论】:

  • 非常感谢您的回复 - 这真是让您大开眼界!不过,我想检查的一件事是,我的主要考虑是我希望能够按 $releaseDate 对这些列表中的每一个进行排名。在您的代码中,我可以看到一个部分; releaseDate = x.$name, 结果是发布日期被省略并替换为集合名称 - 我想这是一个简单的解决方法,但我能否按发布日期从上到下对所有集合进行排名?再次感谢您对此的帮助!
  • 我的错。我复制粘贴的代码位太快了。让我修复它。
  • 如果这有效,勾选已回答的复选框可以帮助其他人知道有一个经过验证的有效答案。
  • 再次感谢您,非常感谢您的帮助。
猜你喜欢
  • 1970-01-01
  • 2015-10-19
  • 2019-01-09
  • 1970-01-01
  • 2016-06-14
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2011-08-05
相关资源
最近更新 更多