How to decode strings in a data frame using R: answers

【Question title】: How to decode strings in a data frame using R
【Posted】: 2022-01-25 21:55:56
【Question description】:

I have a data frame with a column of string values, some of which appear to be encoded. See the examples below.

Joe<U+034F>Â¨l Potin; 
Sai<U+034F>Â¨d Slimani;   
JesÃºs Guerrero

How can I parse/decode them when reading the file?

I am using RStudio and the function read.csv().

【Comments】:

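Please do not upload code or data as images, for these reasons.
Thanks. Here are some examples: Joe¨l Potin, Sai¨d Slimani, JesÃºs Guerrero
This looks like UTF-8 being interpreted as Latin-?. Does read.csv() take an encoding argument? Or can you specify the encoding when opening the file?
Part of this is a mojibake case. Example: xx <- 'JesÃºs'; xx; Encoding(xx) <- 'UTF-8'; xx returns [1] "JesÃºs" and [1] "Jesús". However, the Joe<U+034F>Â¨l (or Sai<U+034F>Â¨d) data with ͏ (U+034F, Combining Grapheme Joiner) looks garbled: I could not find a standard conversion to Joël (or Saïd). Please edit your question to share a minimal reproducible example (where the data frame comes from and how it is created).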
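The mojibake case mentioned in the last comment can be sketched in a locale-independent way. This is only an illustration under the assumption that the text was UTF-8 misread as Latin-1; the variable names `garbled` and `fixed` are made up for this example, and the string is written with `\u` escapes so it does not depend on how this page itself is encoded.

```r
# Mojibake sample: the UTF-8 bytes of "ú" (C3 BA) were misread as Latin-1,
# which produced the two characters "Ã" (U+00C3) and "º" (U+00BA).
garbled <- "Jes\u00c3\u00bas"

# Step 1: convert back to Latin-1 to recover the original byte sequence.
fixed <- iconv(garbled, from = "UTF-8", to = "latin1")

# Step 2: declare that those bytes are actually UTF-8.
Encoding(fixed) <- "UTF-8"

print(fixed)
```

On a console that can display UTF-8 this should show the name with a proper "ú". A string containing a character that has no Latin-1 equivalent, such as U+034F, makes iconv() return NA with its default settings, which fits the observation that the Joe<U+034F>Â¨l values have no straightforward conversion.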

Tags: r string parsing utf-8 decode

【Solution 1】:

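If you already have these strings in R, you have to convert the <U+XXXX> notation for Unicode characters into a form that can be interpreted, for example with a package such as stringi. Toy data is used here.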

library(stringi)

source <- "word word <U+0218> other <U+0218> word"
source
[1] "word word <U+0218> other <U+0218> word"

# converting the string into "\uXXXX"
source_conv <- gsub(">","",gsub("<U\\+","\\\\u", source))
source_conv
[1] "word word \\u0218 other \\u0218 word"

# now interpreting the converted unicode character
stringi::stri_unescape_unicode(source_conv)
[1] "word word Ș other Ș word"
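The same two steps could be wrapped in a small helper and applied to a whole data frame column. This is a sketch only: `decode_u_tags` and the toy data frame `df` are hypothetical names, and the pattern assumes four-hex-digit `<U+XXXX>` tags and no other backslashes in the text (stri_unescape_unicode() would try to interpret those as escapes too).

```r
library(stringi)

# Hypothetical helper: replace "<U+XXXX>" tags with the characters they name.
decode_u_tags <- function(x) {
  # "<U+0218>" becomes a literal backslash-u escape, i.e. \u0218
  escaped <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)
  # Interpret the \uXXXX escapes as real Unicode characters
  stringi::stri_unescape_unicode(escaped)
}

# Toy data frame standing in for the real data
df <- data.frame(
  text = c("word <U+0218> word", "plain"),
  stringsAsFactors = FALSE
)

df$text <- decode_u_tags(df$text)
print(df$text)
```

Strings without any tag pass through unchanged, so the helper can be run over the whole column rather than only over the affected rows.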

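Ideally, fix these problems at the point where the source strings are generated, by using the correct encoding there, so that no conversion is needed afterwards.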
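As a rough illustration of declaring the encoding at read time, the sketch below writes a small UTF-8 file and reads it back with read.csv(). It assumes the real file is UTF-8, which the question does not confirm; the temporary file and the name `people` are placeholders. The `encoding` argument only marks the strings as UTF-8, while `fileEncoding = "UTF-8"` is the alternative that re-encodes the input into the session's native encoding.

```r
# Create a small UTF-8 encoded CSV as a stand-in for the real file.
tmp <- tempfile(fileext = ".csv")
csv_text <- "name\nJes\u00fas Guerrero\nJo\u00ebl Potin\n"
writeBin(charToRaw(enc2utf8(csv_text)), tmp)

# Tell read.csv() that the strings in the file are UTF-8.
people <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

print(people$name)
print(Encoding(people$name))
```

If the names still look wrong after this, the file is probably not UTF-8 (or was already damaged before it was saved), and the encoding has to be checked at the source.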

【Comments】: