Binary representation of the Breast Cancer Wisconsin database: answers

【Question title】: Binary representation of breast cancer wisconsin database
【Posted】: 2018-04-09 18:37:45
【Question】:

I want to generate a binary representation of the well-known Breast Cancer Wisconsin database.

The initial dataset has 31 numeric variables and 1 categorical variable.

 id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1    842302         M       17.99        10.38         122.80    1001.0         0.11840          0.27760         0.3001             0.14710        0.2419
2    842517         M       20.57        17.77         132.90    1326.0         0.08474          0.07864         0.0869             0.07017        0.1812
3  84300903         M       19.69        21.25         130.00    1203.0         0.10960          0.15990         0.1974             0.12790        0.2069
4  84348301         M       11.42        20.38          77.58     386.1         0.14250          0.28390         0.2414             0.10520        0.2597
5  84358402         M       20.29        14.34         135.10    1297.0         0.10030          0.13280         0.1980             0.10430        0.1809

I want to generate a binary representation of this data frame as follows:

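Convert the diagnosis column (levels = M, B) into two columns, diagnostic_M and diagnostic_B, and put 1 or 0 in the corresponding row depending on the value in the original column (M or B).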
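A minimal sketch of this step in base R, assuming a small toy data frame `toy` (hypothetical values, not the real dataset) that has a `diagnosis` column. The column names `diagnostic_M` and `diagnostic_B` follow the description above.

```r
# Toy stand-in for the real data (hypothetical values)
toy <- data.frame(id_number = c(1, 2, 3),
                  diagnosis = c("M", "B", "M"))

# One indicator column per level: 1 where the row has that level, 0 otherwise
toy$diagnostic_M <- as.integer(toy$diagnosis == "M")
toy$diagnostic_B <- as.integer(toy$diagnosis == "B")

print(toy)
```

Comparing against the level and converting the logical result to an integer works whether `diagnosis` is stored as a character vector or as a factor.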

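Find the mean of each numeric column and split that column into two, depending on whether each value is greater or less than the mean. For example, split radius_mean into radius_mean_great, which gets 1 if the value is greater than the mean and 0 otherwise, and radius_mean_low, which is the opposite.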
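For a single column, the split could look like the sketch below. It uses a toy data frame `toy` with hypothetical values, and it assumes that a value exactly equal to the mean counts as "low", since the description does not say how ties should be handled.

```r
# Toy stand-in for one numeric column (hypothetical values)
toy <- data.frame(radius_mean = c(10, 20, 30, 12))

# Threshold for the split: the column mean (18 for these values)
m <- mean(toy$radius_mean)

# 1 if the value is strictly above the mean, 0 otherwise
toy$radius_mean_great <- as.integer(toy$radius_mean > m)
# The complement; ties (value == mean) land here by assumption
toy$radius_mean_low <- as.integer(toy$radius_mean <= m)

print(toy)
```

Because the two conditions are complementary, every row has exactly one 1 across the pair of columns.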

library(mlbench) 
library("RCurl") 
library("curl")
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data') 

names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst') 

breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)

【Comments】:

Tags: r dataframe dplyr categorical-data

【Solution 1】:

Well, there are several ways to binarize the dataset. Here is the one I came up with; I hope it helps.

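# Keep only the 30 numeric feature columns (drop id_number and diagnosis)
df <- breast.cancer.fr[, 3:32]
# Two indicator columns ("_great" and "_low") per feature
df2 <- matrix(NA, ncol = 2 * ncol(df), nrow = nrow(df))
for (i in seq_len(ncol(df))) {
  m <- mean(df[, i])
  df2[, 2 * i - 1] <- as.numeric(df[, i] >  m)  # 1 if above the column mean
  df2[, 2 * i]     <- as.numeric(df[, i] <= m)  # 1 otherwise
}
# Interleave the names: x_great, x_low, y_great, y_low, ...
colnames(df2) <- c(rbind(paste0(names(df), "_great"), paste0(names(df), "_low")))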

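library(dplyr)
# Indicator columns for the two diagnosis levels
df3 <- select(breast.cancer.fr, id_number, diagnosis) %>%
  mutate(diagnosis_M = as.numeric(diagnosis == "M"),
         diagnosis_B = as.numeric(diagnosis == "B"))

# Drop the original diagnosis column and append the binarized features
df <- cbind(df3[, -2], df2)
df[1:10, 1:7]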
   id_number diagnosis_M diagnosis_B radius_mean_great radius_mean_low texture_mean_great texture_mean_low
1     842302           1           0                 1               0                  0                1
2     842517           1           0                 1               0                  0                1
3   84300903           1           0                 1               0                  1                0
4   84348301           1           0                 0               1                  1                0
5   84358402           1           0                 1               0                  0                1
6     843786           1           0                 0               1                  0                1
7     844359           1           0                 1               0                  1                0
8   84458202           1           0                 0               1                  1                0
9     844981           1           0                 0               1                  1                0
10  84501001           1           0                 0               1                  1                0
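
The per-column loop can also be written without explicit indexing. The sketch below is one possible alternative, not part of the original answer: `binarize_by_mean` is a hypothetical helper name, and the toy data frame `num` (hypothetical values) stands in for `breast.cancer.fr[, 3:32]`. It assumes at least two rows and no missing values.

```r
# Hypothetical helper: returns a 0/1 matrix with interleaved
# <name>_great / <name>_low columns for every numeric column of `num`
binarize_by_mean <- function(num) {
  # One column per feature: 1 where the value is above that column's mean
  great <- sapply(num, function(x) as.integer(x > mean(x)))
  # Complement: 1 where the value is at or below the mean
  low <- 1L - great
  colnames(great) <- paste0(names(num), "_great")
  colnames(low) <- paste0(names(num), "_low")
  # cbind gives all "_great" columns, then all "_low" columns;
  # reorder so each feature's pair sits side by side
  cbind(great, low)[, order(rep(seq_along(num), 2)), drop = FALSE]
}

# Toy input (hypothetical values)
num <- data.frame(a = c(1, 2, 3, 6), b = c(8, 4, 4, 4))
res <- binarize_by_mean(num)
print(res)
```

With the real data, the result could be attached with `cbind(df3[, -2], binarize_by_mean(breast.cancer.fr[, 3:32]))`, in the same way as `df2` above.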

【Comments】:

Thanks, that's great.