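【Posted】: 2018-04-09 18:37:45
【Question】:
I want to generate a binary representation of the well-known Wisconsin breast cancer dataset.
The initial dataset has 31 numeric variables and 1 categorical variable.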
  id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1    842302         M       17.99        10.38         122.80    1001.0         0.11840          0.27760         0.3001             0.14710        0.2419
2    842517         M       20.57        17.77         132.90    1326.0         0.08474          0.07864         0.0869             0.07017        0.1812
3  84300903         M       19.69        21.25         130.00    1203.0         0.10960          0.15990         0.1974             0.12790        0.2069
4  84348301         M       11.42        20.38          77.58     386.1         0.14250          0.28390         0.2414             0.10520        0.2597
5  84358402         M       20.29        14.34         135.10    1297.0         0.10030          0.13280         0.1980             0.10430        0.1809
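I want to generate a binary representation of this data frame as follows:
Convert the diagnosis column (levels = M, B) into two columns, diagnostic_M and diagnostic_B, and put 1 or 0 in each row depending on the value (M or B) in the original column.
Find the median of each numeric column, then split the column into two according to whether each value is greater or smaller than the mean. For example, the radius_mean column is split into radius_mean_great, which holds 1 if the value > the mean and 0 otherwise, and radius_mean_low, which holds the opposite.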
# Packages loaded in the original post; only RCurl (getURL) is actually used below
library(mlbench)
library("RCurl")
library("curl")
# Download the raw comma-separated file as a single character string
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
# The file has no header row, so the 32 column names are supplied by hand
names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')
# Parse the downloaded text into a data frame
breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)
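The two steps described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions, not a definitive solution: the helper `binarize()` and the small data frame `toy` (with invented values) are hypothetical and not part of the original post; the split threshold is left as a parameter because the description mentions both the median and the mean; and values equal to the threshold are assumed to go into the `_low` column so that the two columns stay exact opposites.

```r
# Hypothetical toy data (invented values) with the same layout as the real data frame
toy <- data.frame(
  id_number    = c(1, 2, 3, 4),
  diagnosis    = c("M", "B", "M", "B"),
  radius_mean  = c(18.0, 12.5, 20.6, 11.4),
  texture_mean = c(10.4, 17.8, 21.3, 14.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the 0/1 representation of a data frame.
# `threshold` is the function used to find the cut point (median by default).
binarize <- function(df, id_col = "id_number", class_col = "diagnosis",
                     threshold = median) {
  out <- df[id_col]  # keep the identifier column unchanged

  # Step 1: one indicator column per level of the categorical column
  for (lvl in sort(unique(as.character(df[[class_col]])))) {
    out[[paste0("diagnostic_", lvl)]] <- as.integer(df[[class_col]] == lvl)
  }

  # Step 2: split every numeric column (except the id) at its cut point
  num_cols <- setdiff(names(df)[sapply(df, is.numeric)], id_col)
  for (col in num_cols) {
    cut_point <- threshold(df[[col]], na.rm = TRUE)
    out[[paste0(col, "_great")]] <- as.integer(df[[col]] > cut_point)
    # Assumption: ties (value == cut point) are counted as "low"
    out[[paste0(col, "_low")]]   <- as.integer(df[[col]] <= cut_point)
  }
  out
}

toy_bin <- binarize(toy)
print(toy_bin)
```

Under the same assumptions, `binarize(breast.cancer.fr)` would apply both steps to the full data frame, and `binarize(breast.cancer.fr, threshold = mean)` would split on the mean instead of the median.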
【Discussion】:
Tags: r dataframe dplyr categorical-data