【Question title】: R - Changing encoding of column in dataframe?
【Posted】: 2015-05-14 04:52:36
【Question】:

I am trying to change the encoding of a column in a data frame.

stri_enc_mark(data_updated$text)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

When I try to convert it, no error is thrown, but the vector is unchanged:

d <- enc2utf8(data_updated$text)
stri_enc_mark(d)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

Any suggestions?

I am on 32-bit Windows 7. Adding a data snippet.

> Encoding(data_updated$text[1:35])
 [1] "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"  
 [8] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"  
[15] "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown"
[22] "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown"
[29] "unknown" "UTF-8"   "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"

The data looks like this.

> data_updated$text[1:35]
 [1] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [2] "Deal Talks for Here Mapping Service Expose Reliance on Location Data, via @nytimes #mapping #dilemma  http://t.co/wGdiS5OlRq"                      
 [3] "http://t.co/UZIyX1Rk7W The popping linksexploaded!! http://t.co/KpNntm1dH7 :) http://t.co/oku91uVxZ8"                                              
 [4] "RT @davidsunaria90: Wtch LIVE Mjlis Now\n http://t.co/GXNhe3eY7Y\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/YewOVcz8bb\n…" 
 [5] "Reliance Jio Infocomm: Indian carrier raises $750 million loan for 4G rollout  http://t.co/B2aWlkmwXz"                                             
 [6] "RT @SurjeetInsan: Majlis started in Sirsa Ashram.\nLive @ http://t.co/PR6W5tzZes\nIVR Airtel 55252\nReliance 56300403\n\n#MSGPlsSaveTheEarth"      
 [7] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Techno… http://t.co/kyxTYIxks5"      
 [8] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [9] "RT @jaameinsan: Watch LIVE Majlis Now\n http://t.co/nPQegnLXPa\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/txXMtw3zFP\n#M…" 
[10] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Technology"

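These are tweets, and I think the "http://" links are what determine the encoding here, since they contain expressions like "wGdiS5OlRq". For the analysis I removed these tags with a regular expression. But to store the raw data in a database I need the tweets intact. MongoDB has no problem with them, but the RDBMS raises errors.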

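【Comments on the question】:

  • Having a reproducible example would help. It would also help to know which operating system you are using and what Encoding() returns for these vectors. If a string contains no non-ASCII characters, it will probably just be reported as ASCII.
  • This is a classic case where the question can be simplified. You have 36 data points. You need 2 to show the problem: data_updated$text[1:2] is enough to show any change from ASCII to UTF-8.
  • Thanks for pointing that out. I have updated the post.
  • If the real problem is that the RDBMS throws an error, it would be better to describe that problem. The encoding of strings that contain only ASCII characters should not cause trouble.
  • The table I am porting the data into is UTF-8 encoded. So I assumed it does not accept ASCII; the error says "expected UTF-8".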
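The behaviour discussed in these comments can be reproduced with a short sketch. It assumes a hypothetical two-element vector `text` standing in for `data_updated$text`. It illustrates that R never marks pure-ASCII strings with a declared encoding, so `enc2utf8()` leaves them as "unknown" even though their bytes are already valid UTF-8.

```r
# Hypothetical stand-in for data_updated$text:
# one tweet with a non-ASCII ellipsis, one pure-ASCII tweet
text <- c("RT @user: Majlis started\u2026", "Reliance Jio Infocomm: 4G rollout")

# Declared encoding marks before conversion
Encoding(text)

# enc2utf8() converts to UTF-8, but ASCII strings are never marked
converted <- enc2utf8(text)
Encoding(converted)

# The pure-ASCII element is byte-for-byte unchanged by the conversion
identical(charToRaw(text[2]), charToRaw(converted[2]))

# validUTF8() (base R >= 3.3.0) checks the bytes rather than the marks
all(validUTF8(converted))
```

If `validUTF8()` returns TRUE for every element, the mix of marks reported by `Encoding()` or `stri_enc_mark()` is probably not what the database is rejecting.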

Tags: r encoding dataframe


【Solution 1】:

In case anyone is still stuck: I used Encoding().

  # Assumes every column is a character vector; Encoding<- errors on other types
  for (col in colnames(mydataframe)) {
    Encoding(mydataframe[[col]]) <- "UTF-8"
  }

【Discussion】:

  • With this solution I got: Error in `Encoding<-`(`*tmp*`, value = "UTF-8") : a character vector argument expected
  • You can try this solution to resolve the error: stackoverflow.com/questions/33731891/…
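One way to avoid the error reported above is to restrict the loop to character columns. The following is a minimal sketch, not necessarily what the linked answer does. The data frame and its column names (`id`, `text`) are hypothetical. Note that `Encoding<-` only declares how the existing bytes should be interpreted; it does not convert them.

```r
# Hypothetical data frame mixing character and non-character columns
mydataframe <- data.frame(
  id   = 1:2,
  text = c("caf\u00e9", "plain ascii"),
  stringsAsFactors = FALSE
)

# Names of the character columns only
char_cols <- names(mydataframe)[vapply(mydataframe, is.character, logical(1))]

for (col in char_cols) {
  # Declares the mark; the underlying bytes are left untouched
  Encoding(mydataframe[[col]]) <- "UTF-8"
}

Encoding(mydataframe$text)
```

Factor columns would need `as.character()` first, since `Encoding<-` also rejects factors.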
【Solution 2】:

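It seems we can convert the encoding with the conv() function (presumably iconv()) after converting the vector to a factor and then back to a character vector. A bit strange, to be honest.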
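A rough sketch of that round trip follows. It assumes the function meant is base R's `iconv()` and that the source text is Latin-1. Both are assumptions, since the answer names neither. The vector names are hypothetical.

```r
# Hypothetical factor holding Latin-1 text (the source encoding is an assumption)
latin1_text <- iconv(c("caf\u00e9", "plain ascii"), from = "UTF-8", to = "latin1")
f <- factor(latin1_text)

# Factor -> character, then convert the bytes from Latin-1 to UTF-8
chr <- as.character(f)
utf8_text <- iconv(chr, from = "latin1", to = "UTF-8")

# Non-ASCII elements are now marked UTF-8; ASCII ones stay unmarked
Encoding(utf8_text)
```

Unlike `Encoding<-`, `iconv()` rewrites the bytes, so the `from` argument has to match the encoding the data is really in.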

【Discussion】:

【Solution 3】:

I found stringi::stri_enc_toascii() very useful; it solved my problem.

I posted a case at How to handle example data in R Package that has UTF-8 marked strings.
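For illustration, a minimal sketch of that call on a hypothetical vector, assuming the stringi package is installed. `stri_enc_toascii()` replaces every non-ASCII character with a substitute character, so the result is lossy and may not suit the goal of storing the original tweets.

```r
# Runs only if stringi is available (an assumption about the environment)
if (requireNamespace("stringi", quietly = TRUE)) {
  # Hypothetical input: one string with a non-ASCII character, one pure ASCII
  text <- c("caf\u00e9", "plain ascii")

  # Non-ASCII characters are replaced, not transliterated
  ascii_text <- stringi::stri_enc_toascii(text)

  # Every element should now be reported as ASCII
  print(stringi::stri_enc_mark(ascii_text))
}
```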

【Discussion】: