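【Question title】: Get child node attribute value
【Posted】: 2017-06-23 16:00:04
【Question description】:

I'm trying to convert the XML files generated by Retrosheet boxscore into a data frame that can be inserted into a SQL table. I'm most of the way there, but I can't figure out how to get the attributes of an intermediate XML node. A sample is below; hopefully I pasted it correctly. What I want to get is the game_id, the id (from player), and the full batting section.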

<boxscores>
<boxscore game_id="CHA191204110" date="1912/04/11" site="CHI10" 
visitor="SLA" visitor_city="St.Louis" visitor_name="Browns" home="CHA" 
home_city="Chicago" home_name="White Sox" start_time="0:00PM" 
day_night="day" temperature="0" wind_direction="unknown" wind_speed="-1" 
field_condition="unknown" precip="unknown" sky="unknown" time_of_game="110" 
attendance="30000" umpire_hp="evanb901" umpire_1b="eganr101" umpire_2b="" 
umpire_3b="" >
<linescore away_runs="2" away_hits="7" away_errors="1" home_runs="6" 
home_hits="10" home_errors="1">
<inning_line_score away="0" home="0" inning="1"/>
<inning_line_score away="0" home="0" inning="2"/>
<inning_line_score away="0" home="1" inning="3"/>
<inning_line_score away="0" home="0" inning="4"/>
<inning_line_score away="2" home="0" inning="5"/>
<inning_line_score away="0" home="1" inning="6"/>
<inning_line_score away="0" home="1" inning="7"/>
<inning_line_score away="0" home="3" inning="8"/>
<inning_line_score away="0" home="x" inning="9"/>
</linescore>
<players team="SLA" lob="5" dp="0" tp="0" risp_ab="0" risp_h="0">

<player id="shotb101" lname="Shotton" fname="Burt" slot="1" seq="1" pos="8">
  <batting ab="4" r="0" h="0" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="3" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
  <fielding pos="8" outs="24" po="1" a="0" e="0" dp="0" tp="0" bip="-1" bf="-1" />
</player>
<player id="austj101" lname="Austin" fname="Jimmy" slot="2" seq="1" pos="5">
  <batting ab="4" r="0" h="1" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="1" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
  <fielding pos="5" outs="24" po="0" a="3" e="0" dp="0" tp="0" bip="-1" bf="-1" />
  </player>
<player id="stovg101" lname="Stovall" fname="George" slot="3" seq="1" pos="3" >
  <batting ab="4" r="0" h="1" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="0" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
  <fielding pos="3" outs="24" po="11" a="0" e="0" dp="0" tp="0" bip="-1" bf="-1" />
</player>

</players>
</boxscore>
</boxscores>

This is the code I'm using:

library(xml2)   # read_xml, xml_find_all, xml_attr, xml_attrs
library(dplyr)  # bind_rows

box <- read_xml("Q:\\Sabermetrics\\Retrosheet\\download.folder\\unzipped\\1912.xml")

atbat <- xml_find_all(box, "//boxscore")

bind_rows(lapply(atbat, function(x) {

  # all <batting> nodes for this game
  player <- try(xml_find_all(x, "./players/player/batting"), silent=FALSE)

  if (inherits(player, "try-error") ||
      length(player) == 0) return(NULL)

  # one row per <batting> node, one column per attribute
  bind_rows(lapply(player, function(y) {
    data.frame(t(xml_attrs(y)), stringsAsFactors=FALSE)
  })) -> player_dat

  game_id <- try(xml_attr(x, "game_id"))

  if (inherits(game_id, "try-error") ||
      length(game_id) == 0) return(NULL)

  player_dat$game_id <- game_id

  player_dat

})) -> player

I'd like to end up with something like this:

game_id        player_id     ab    r   h    d  ....
CHA191204110   shotb101      4     0   0    0  ....
CHA191204110   austj101      4     0   1    0  ....
CHA191204110   stovg101      4     0   1    0  ....

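I tried copying the game_id code to get the "id" from player, but it didn't work. I tried the paths ./players/player[@id] and ./players/player/@id, and those didn't work either. I also tried just @id, still with no luck.

I'm not sure what I'm doing wrong; I'm just throwing things at the wall to see what sticks...

【Question discussion】:

    Tags: r xml-parsing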


    【Solution 1】:

    Does this help?

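    library(XML)  # xmlParse, xmlToList

    xml <- xmlParse('Q:\\Sabermetrics\\Retrosheet\\download.folder\\unzipped\\1912.xml')
    lxml <- xmlToList(xml)
    # boxscore attributes, followed by everything under <players> flattened into one row
    df <- cbind(t(lxml$boxscore$.attrs),t(data.frame(unlist(lxml$boxscore$players))))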
    

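    You can extract more information from your XML by passing more arguments to cbind().

    I assume you are looping over multiple XML files, so in principle you could wrap something like this in sapply() and then collect everything into a single df with: library(plyr); do.call(rbind.fill, your_df_list)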
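As a rough sketch of that last step, assuming the per-game results are already held in a list of data frames: the two data frames and their game ids ("GAME_A", "GAME_B") are made up for illustration, and only `rbind.fill()` from plyr is taken from the answer. Using `lapply()` rather than `sapply()` to build such a list avoids any automatic simplification of the result.

```r
library(plyr)

# Hypothetical per-game results; in practice each element would come from
# running the parsing code on one boxscore or one file
your_df_list <- list(
  data.frame(game_id = "GAME_A", ab = "4", h = "0", stringsAsFactors = FALSE),
  data.frame(game_id = "GAME_B", ab = "3", stringsAsFactors = FALSE)
)

# rbind.fill() stacks the pieces and fills columns missing from a piece with NA
df <- do.call(rbind.fill, your_df_list)
```

Here `df` has one row per input data frame, and the `h` column is NA for the piece that did not have it.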

    【Discussion】:

    • Almost there. Is there a way to get each player section onto a new row?
    • Ah, yes, of course. Can you paste a sample XML with multiple players?
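One possible way to get one row per player, sketched here with xml2 (the package used in the question) rather than the XML package from the answer: start from each batting node and walk up to its parent player and enclosing boxscore to read their attributes. The inline XML is a trimmed copy of the question's sample, and the names `batting`, `rows` and `batting_df` are illustrative only.

```r
library(xml2)

# Trimmed copy of the sample from the question (two players, fewer attributes)
box <- read_xml('<boxscores>
  <boxscore game_id="CHA191204110">
    <players team="SLA">
      <player id="shotb101" lname="Shotton">
        <batting ab="4" r="0" h="0" d="0"/>
        <fielding pos="8" outs="24"/>
      </player>
      <player id="austj101" lname="Austin">
        <batting ab="4" r="0" h="1" d="0"/>
        <fielding pos="5" outs="24"/>
      </player>
    </players>
  </boxscore>
</boxscores>')

# Start from the <batting> nodes, then walk up the tree for the ids
batting <- xml_find_all(box, "//boxscore/players/player/batting")

rows <- lapply(batting, function(b) {
  player <- xml_parent(b)                            # enclosing <player>
  game   <- xml_find_first(b, "ancestor::boxscore")  # enclosing <boxscore>
  data.frame(
    game_id   = xml_attr(game, "game_id"),
    player_id = xml_attr(player, "id"),
    t(xml_attrs(b)),                                 # one column per batting attribute
    stringsAsFactors = FALSE
  )
})

# Assumes every <batting> node carries the same set of attributes
batting_df <- do.call(rbind, rows)
```

All columns come back as character strings, so numeric columns such as `ab` or `h` would still need converting (for example with `as.integer()`) before loading into a SQL table.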