Scraping an ID attribute using rvest: answers

[Question title]: Scraping ID attribute using rvest
[Posted]: 2021-02-26 06:56:47
[Question description]:

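I am trying to check whether the Polish elections were fair, specifically that the opposition candidate did not receive abnormally low vote counts in districts with a high number of invalid ballots. To do that, I need to scrape the results for each district.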

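Link to official results of elections for my city - in the table at the bottom, each row is a different district, and clicking a row redirects you to that district. The links are not in the usual <a ... href = ...> format; instead, the variable part of the link to the district is encoded in data-id=... .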

My question is: how can I use R to extract the data-id= attributes from the table on the web page?

Example data - in this example, I would like to extract 697773 from the table row

<div class="proto" style="">
    <div id="DataTables_Table_16_wrapper" class="dataTables_wrapper dt-bootstrap no-footer">
        <div class="table-responsive">
            <table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
                <thead><tr role="row"><th class="sorting_asc" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-sort="ascending" aria-label="Numer: aktywuj, by posortować kolumnę malejąco">Numer</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Siedziba: aktywuj, by posortować kolumnę rosnąco">Siedziba</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Granice: aktywuj, by posortować kolumnę rosnąco">Granice</th></tr></thead>
                <tbody>
                    <tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td> <td>Bożego Ciała...</td></tr>
                </tbody>
            </table>
        </div>
    </div>
</div>

I have tried using:

library(dplyr)
library(rvest)

read_html("https://wybory.gov.pl/prezydent20200628/pl/wyniki/1/pow/26400") %>%
  html_nodes('[class="table-responsive"]') %>%
  html_nodes('[class="table table-bordered table-striped table-hover"]') %>%
  html_nodes('tr') %>%
  html_attrs()

But as a result I get named character(0)
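
For illustration only, here is a minimal sketch run against the sample markup above rather than the live site, which may be structured differently (for example, if the rows are filled in by JavaScript). It shows two things: a `[class="..."]` selector compares the whole attribute value, so the shorter class string from the attempt matches nothing in this sample, and `html_attr("data-id")` on rows selected with `tr[data-id]` returns the ID. The abridged `sample_html` string and the variable names are assumptions made for this example.

```r
library(rvest)

# Sample markup abridged from the question; the live page may differ.
sample_html <- '
<div class="table-responsive">
  <table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
    <tbody>
      <tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td></tr>
    </tbody>
  </table>
</div>'

page <- read_html(sample_html)

# [class="..."] must equal the full class attribute, so this shorter
# class string selects no nodes in the sample markup.
exact_match <- html_nodes(page, '[class="table table-bordered table-striped table-hover"]')
length(exact_match)

# Select only rows that carry a data-id attribute, then read that attribute.
district_ids <- page %>%
  html_nodes("div.table-responsive table.table tr[data-id]") %>%
  html_attr("data-id")
district_ids
```

Matching on a single class (`table.table`) plus the presence of the attribute (`tr[data-id]`) keeps the selector working even when extra classes are added to the table.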

[Question comments]:

Tags: html r web-scraping rvest

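[Solution 1]:

I found a solution, though it is far from ideal. I bet there is a better way!

I downloaded the web page, saved it as a txt file and read it from there: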

# Read the saved copy of the page into a single string
path <- file.path(getwd(), "Wyniki pierwszego głosowania _ Wrocław.txt")
txt_webpage <- readChar(path, file.info(path)$size)

# Start positions of every '<tr data' occurrence in the page text
positions <- gregexpr(pattern = '<tr data', txt_webpage)

districts_numbers <- c()
for (i in positions[[1]]) {
  # Take the characters around the data-id value and keep only the digits
  tmp <- substr(txt_webpage, i + 10, i + 22)
  tmp <- gsub('\\D+', '', tmp)
  districts_numbers <- c(districts_numbers, tmp)
}
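
The same text-based idea can probably be written without the loop or the fixed character offsets. The sketch below uses base R's `regmatches()` with `gregexpr()` to pull every `<tr data-id="...">` fragment in one pass. Here `txt_webpage` is a short hypothetical stand-in for the saved page, and the second ID (697774) is made up for illustration.

```r
# Hypothetical stand-in for the text of the saved page; 697774 is an invented ID.
txt_webpage <- paste0(
  '<tr data-id="697773" role="row" class="odd"><td>1</td></tr>',
  '<tr data-id="697774" role="row" class="even"><td>2</td></tr>'
)

# Extract every '<tr data-id="digits"' fragment in a single pass.
matches <- regmatches(
  txt_webpage,
  gregexpr('<tr data-id="[0-9]+"', txt_webpage)
)[[1]]

# Drop everything that is not a digit, leaving only the IDs.
districts_numbers <- gsub("[^0-9]+", "", matches)
districts_numbers
```

Because the pattern matches the digits themselves, IDs of any length are captured whole, which a fixed-width `substr()` window cannot guarantee.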

[Comments]: