Rpy2 conversion of categorical data containing nulls to R factors

【Question title】: Rpy2 conversion of categorical data containing nulls to R factors
【Posted】: 2018-11-15 04:02:05
【Question】:

I have a pandas data frame with a categorical column that contains NaN values, for example:

import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")
g

0      A
1      B
2      C
3    NaN
dtype: category
Categories (3, object): [A, B, C]

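In pandas, NaN is not a category, but a categorical column can still hold NaN values. I want to pass this data frame to R using %%R in a Jupyter notebook. R recognizes the categorical column as a factor, but the factor is malformed, presumably because of the NaN value: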

%%R -i g
str(g)
Factor w/ 3 levels "A","B","C": 1 2 3 0
 - attr(*, "names")= chr [1:4] "0" "1" "2" "3" 

print(g)
Error in as.character.factor(x) : malformed factor
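
The trailing 0 in the str() output is consistent with how pandas stores a missing categorical value. The sketch below shows the pandas side only; the "+1" shift is an assumption about what the converter did, not something confirmed from the rpy2 source:

```python
import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# pandas stores each value as an integer code indexing into the categories;
# a missing value gets the sentinel code -1 instead of a category.
codes = g.cat.codes.tolist()
print(codes)

# R factor codes are 1-based. A plain "+1" shift (assumed here, for
# illustration) maps -1 to 0, which does not point at any factor level.
shifted = [c + 1 for c in codes]
print(shifted)
```

The shifted codes match the `1 2 3 0` that R reports, which is why printing the factor fails.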

Is there a way to ensure the factor is not malformed, for example by having an NA factor level created automatically?

R: 3.5.1, rpy2: 2.9.4, Python 3

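【Comments】:

The other direction (NA in an R factor converted to Python) also has problems (stackoverflow.com/questions/53236532/…). Could you open an issue on the rpy2 issue tracker?
Thanks @lgautier. I have opened an issue.
@lgautier Until the issue is fixed, could you suggest a workaround?

Tags: r pandas rpy2 categorical-data factors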

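【Solution 1】:

At the time of writing, this is a bug in rpy2's conversion of pandas categoricals. It has been fixed, and the fix will be included in rpy2 starting with version 2.9.5: https://bitbucket.org/rpy2/rpy2/issues/493/rpy2-conversion-of-categorical-data

The workaround is fairly simple: do not use NaN in pandas categoricals.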

g = pd.Series(["A", "B", "C", np.nan], dtype="category")
# Prepare alternative representation to pass it to R
g_r = g.replace(np.nan, 'Missing')
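
How replace() treats categorical data may differ between pandas versions. As an alternative sketch (not part of the original answer), the placeholder can be registered as a category explicitly before filling; the label "Missing" is just the same illustrative placeholder used above:

```python
import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# Add the placeholder to the set of categories first; fillna() on a
# categorical only accepts values that are already categories.
g_r = g.cat.add_categories(["Missing"]).fillna("Missing")

print(g_r.cat.categories.tolist())
print(g_r.tolist())
```

The result is still a categorical Series, now with four categories and no NaN values.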

After conversion it now looks like this:

%%R -i g_r
str(g_r)

Factor w/ 4 levels "A","B","C","Missing": 1 2 3 4
- attr(*, "names")= chr [1:4] "0" "1" "2" "3"

Converting back to an R NA is then just a matter of dropping the additional level:

%%R -i g_r
str(droplevels(g_r, exclude = "Missing")) 

Factor w/ 3 levels "A","B","C": 1 2 3 NA
- attr(*, "names")= chr [1:4] "0" "1" "2" "3"
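
If the placeholder also needs to be undone on the Python side, a rough pandas counterpart of the droplevels() call is to remove the category again. This is a minimal sketch that assumes the same "Missing" placeholder; it is not part of the original answer:

```python
import pandas as pd

# Categorical with the placeholder level, as produced by the workaround.
g_r = pd.Series(["A", "B", "C", "Missing"], dtype="category")

# Removing a category sets every value that used it to NaN.
g_back = g_r.cat.remove_categories(["Missing"])

print(g_back.cat.categories.tolist())
print(g_back.isna().tolist())
```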
