Converting encoding of deparsed strings: answers

【Question title】: Converting encoding of deparsed strings
【Posted】: 2018-07-14 02:41:47
【Question】:

I have the following list:

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

I would like to convert it to a list in which the Unicode escape is replaced by the corresponding UTF-8 character, like so:

goal <- list("Chamberlain", "Romañach", "<node>")

The deparsed string is causing the problem. If the second string were instead:

wouldbenice <- "Roma\u00F1ach"

then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).

I can get the second string to display correctly in UTF-8 with:

# displays "Romañach"
eval(parse(text = x[[2]]))

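However, this goes poorly (it throws errors) for the other elements, x[[1]] and x[[3]]. How can I reliably parse the entire list into the appropriate encoding?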

【Comments】:

Related: Unescape unicode in character string

Tags: r string text utf-8

【Solution 1】:

Use the stringi package.

Use stri_replace_all_regex from stringi to remove the quotes, and stri_unescape_unicode to unescape the Unicode symbols.

library(stringi)

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

removed_quotes <- stri_replace_all_regex(x, "\"", "")

unescaped <- stri_unescape_unicode(removed_quotes)
unescaped
# [1] "Chamberlain" "Romañach"    "<node>"
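
The pattern "\"" used here removes every double quote, including any that sit inside a string. A minimal sketch of a narrower variant follows, assuming only the wrapping quotes left by deparsing should be dropped and that the result should be a list like goal. The helper name unescape_deparsed is hypothetical, not part of stringi.

```r
library(stringi)

# Hypothetical helper: strip only a leading and a trailing double quote,
# then unescape \uXXXX sequences. Returns a list to match `goal`.
unescape_deparsed <- function(x) {
  chr <- unlist(x, use.names = FALSE)                   # list -> character vector
  stripped <- stri_replace_all_regex(chr, "^\"|\"$", "")
  as.list(stri_unescape_unicode(stripped))
}

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
result <- unescape_deparsed(x)
```

Anchoring the pattern with ^ and $ leaves interior quotes untouched, which matters if the input may contain quoted text of its own.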

【Discussion】:

If needed, you can spell out the arguments as stri_replace_all_regex(str = x, pattern = "\"", replacement = ""). stri_replace_all_regex is a function from the stringi package.

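【Solution 2】:

This meets the goal in base R, but seems less than ideal in other respects. I am putting it here so readers can compare, although I think the stringi-based solution is probably the way to go.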

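utf8me <- function(x) {
  # Detect a literal backslash-u escape followed by four hex digits
  i <- grepl("\\\\u[0-9A-Fa-f]{4}", x)
  # Evaluate each matching string on its own: eval(parse()) over several
  # strings at once would return only the last value.
  # Caution: this runs the input as R code, so use it on trusted data only.
  x[i] <- vapply(x[i], function(s) eval(parse(text = s)), character(1))
  x
}

lapply(x, utf8me)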
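
For comparison, here is a base R sketch that avoids eval(parse()) altogether by substituting each escape directly. This is a different technique from the answer above, not the author's method. The helper name unescape_u is hypothetical, and the sketch assumes four-digit \uXXXX escapes only, with no handling of surrogate pairs or other escape forms.

```r
# Hypothetical helper: replace each \uXXXX escape with its character,
# without evaluating the input as R code.
unescape_u <- function(s) {
  pattern <- "\\\\u[0-9A-Fa-f]{4}"
  m <- gregexpr(pattern, s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(esc) {
    # Drop the leading "\u" and convert the hex code point to a UTF-8 string
    vapply(esc, function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  # Remove the wrapping quotes left over from deparsing
  gsub("^\"|\"$", "", s)
}

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
result <- lapply(x, unescape_u)
```

Because nothing is parsed as code, elements such as "<node>" pass through unchanged instead of raising an error.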

【Discussion】: