Web scraping of tables with R rvest, bottom right cell starting with "<=" is returned as logical NA

【Question title】: Web scraping of tables with R rvest, bottom right cell starting with "<=" is returned as logical NA
【Posted】: 2022-01-17 19:59:41
【Question description】:

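I am trying to scrape a table that contains cells starting with "<=". Those cells come back as NA, whereas a cell starting with ">=" is scraped without any problem. I see the problem with rvest 1.0.2 on RStudio Workbench, but not on the laptop version of RStudio, which runs rvest 1.0.0.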

# Minimal example:
library(rvest)  # minimal_html(), html_elements(), html_table(), and the %>% pipe
sample <- 
  minimal_html("<table>
               <tbody>
               <tr>
               <th>Col A</th><th>Col B</th>
               </tr>
               <tr>
               <td>>=62.000</td><td><=72.000</td>
               </tr>
               </tbody>
               </table>")
sample %>% 
  rvest::html_elements("table") %>% 
  rvest::html_table()

Output:

[[1]]
# A tibble: 1 × 2
  `Col A`  `Col B`
  <chr>    <lgl>  
1 >=62.000 NA

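【Question discussion】:

I wonder whether an attempt to repair the HTML fails because of the "<". Can you run sample %>% toString() and show the output?
Similar to this: stackoverflow.com/questions/14171035/… where
@QHarr I ran sample %>% toString and got the following error message: as.character.xml_document(list(node = , doc = )) : external pointer is not valid
Can you do sample %>% html_node('body') %>% toString()?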
[1] "\n\n\n\n \n\n\n\n\n
Col A Col B
>=62.000 \n
"


Tags: r rvest

【Solution 1】:

I have RStudio Desktop (R 4.1.1) and rvest 1.0.2. I get the following result, without any problem:

[[1]]
# A tibble: 1 × 2
  `Col A`  `Col B` 
  <chr>    <chr>   
1 >=62.000 <=72.000

【Discussion】:

Thanks! Same with RStudio on my laptop. Edit: no problem on RStudio Cloud either. I'm lost :-(

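【Solution 2】:

I think you have a setup in which "<td><" is interpreted as malformed HTML and cleaned up, rather than the "<" being preserved as an HTML entity (&lt;).

This would be an issue with the underlying parser, and may be fixed later.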
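If the parser is the suspect, one way to check is to compare versions on the machine that works and the one that does not. The following is only a diagnostic sketch: it assumes rvest and xml2 are installed, and the one-cell table is an illustrative stand-in for the table in the question. The result of the final check depends on the local libxml2 build, so no expected value is shown.

```r
library(rvest)

# Versions to compare across machines; libxml2 is the C parser behind xml2/rvest.
versions <- c(
  rvest   = as.character(packageVersion("rvest")),
  xml2    = as.character(packageVersion("xml2")),
  libxml2 = as.character(xml2::libxml2_version())
)
print(versions)

# Parse a fresh one-cell table and look at how the parser serialized it.
doc  <- minimal_html("<table><tr><td><=72.000</td></tr></table>")
body <- as.character(html_element(doc, "body"))
cat(body, "\n")

# TRUE if the "<" survived as an entity, FALSE if the cell text was dropped.
kept_less_than <- grepl("&lt;=72.000", body, fixed = TRUE)
print(kept_less_than)
```

If the rvest and xml2 versions match on both machines but `libxml2` differs, that would point at the system library rather than at the R packages.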

The fact that, on your setup, sample %>% html_node('body') %>% toString() prints

<tr>
  \n
  <td>&gt;=62.000</td>
  \n
  <td>\n</td>
  \n
</tr>

seems at least consistent with this reasoning.

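I went looking for evidence and found the following for the "lxml" HTML parser: "lxml truncates text that contains 'less than' character", which seems consistent with my hypothesis. lxml is built on libxml2, the same C library that xml2 (and therefore rvest) uses.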
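If upgrading the parser is not an option, a possible workaround is to escape the stray "<" yourself before the HTML reaches the parser. The sketch below is an assumption of mine, not something rvest provides: `escape_bare_lt()` is a hypothetical helper that rewrites every "<" that cannot start a tag, comment, or declaration into `&lt;`.

```r
library(rvest)

# Hypothetical helper (not part of rvest): escape every "<" that is not
# followed by a letter, "/", "!" or "?", so it reaches the parser as "&lt;".
escape_bare_lt <- function(html) {
  gsub("<(?![A-Za-z/!?])", "&lt;", html, perl = TRUE)
}

# Same table as in the question, kept as a plain string.
raw_html <- "<table>
  <tbody>
  <tr><th>Col A</th><th>Col B</th></tr>
  <tr><td>>=62.000</td><td><=72.000</td></tr>
  </tbody>
  </table>"

tbl <- raw_html %>%
  escape_bare_lt() %>%
  minimal_html() %>%
  html_element("table") %>%
  html_table()

print(tbl)
```

For a real page the HTML would first have to be read as text (for example with `readLines()`) and passed to `read_html()` after escaping. Note that this regular expression also rewrites a "<" inside `<script>` or `<style>` blocks, which is harmless when only tables are extracted but makes it unsuitable as a general-purpose fix.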

【Discussion】: