【Question title】:Binary representation of breast cancer wisconsin database
【Posted】:2018-04-09 18:37:45
【Question】:

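I want to generate a binary representation of the well-known Breast Cancer Wisconsin database.

The initial dataset has 31 numeric variables and 1 categorical variable.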

 id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1    842302         M       17.99        10.38         122.80    1001.0         0.11840          0.27760         0.3001             0.14710        0.2419
2    842517         M       20.57        17.77         132.90    1326.0         0.08474          0.07864         0.0869             0.07017        0.1812
3  84300903         M       19.69        21.25         130.00    1203.0         0.10960          0.15990         0.1974             0.12790        0.2069
4  84348301         M       11.42        20.38          77.58     386.1         0.14250          0.28390         0.2414             0.10520        0.2597
5  84358402         M       20.29        14.34         135.10    1297.0         0.10030          0.13280         0.1980             0.10430        0.1809

I want to generate a binary representation of this data frame as follows:

Convert the diagnosis column (levels = M, B) into two columns, diagnostic_M and diagnostic_B, and put 1 or 0 in each row according to the value (M or B) in the original column.
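
As a rough illustration of this step, the sketch below builds the two indicator columns in base R on a small made-up data frame. The `toy` data frame and its values are hypothetical; only the column names diagnostic_M and diagnostic_B come from the description above.

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(id_number = 1:4,
                  diagnosis = c("M", "B", "B", "M"),
                  stringsAsFactors = TRUE)

# A logical comparison converted to integer gives the 0/1 indicator
toy$diagnostic_M <- as.integer(toy$diagnosis == "M")
toy$diagnostic_B <- as.integer(toy$diagnosis == "B")

print(toy)
```

Because the column has exactly two levels, the two indicators always sum to 1 in every row, so one of them is redundant for modelling purposes; both are kept here only because the question asks for both.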

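Compute the mean of each numeric column, then split the column in two according to whether each value is greater or lower than that mean. For example, radius_mean is split into radius_mean_great, which holds 1 if the value > mean and 0 otherwise, and radius_mean_low, which holds the opposite.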
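
The split can be sketched on a single column as follows. The helper `split_by_mean` is a hypothetical name introduced only for illustration, and sending values equal to the mean to the _low column is an assumption, since the description does not say how ties should be handled. The input is the five radius_mean values shown in the table above, so the mean here is the mean of those five rows, not of the full dataset.

```r
# Hypothetical helper: split a numeric vector into two 0/1 indicator columns
split_by_mean <- function(x, name) {
  m <- mean(x)
  out <- data.frame(as.integer(x > m),    # 1 if strictly above the mean
                    as.integer(x <= m))   # 1 otherwise (ties go here)
  names(out) <- paste0(name, c("_great", "_low"))
  out
}

# First five radius_mean values from the table above
radius_mean <- c(17.99, 20.57, 19.69, 11.42, 20.29)
res <- split_by_mean(radius_mean, "radius_mean")
print(res)
```

With only these five rows the mean is 17.992, so the first value (17.99) falls just below it; on the full dataset the threshold would differ and that row could land in the other column.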

# RCurl provides getURL(); mlbench and curl are loaded but not used below
library(mlbench)
library("RCurl")
library("curl")
# Download the raw comma-separated file as a single string
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')

names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')

# The file has no header row, so the column names are supplied explicitly
breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)

【Comments】:

    Tags: r dataframe dplyr categorical-data


    【Solution 1】:

    Well, there are several ways to binarize the dataset. Here is one that I hope is useful:

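    # Numeric feature columns only (drop id_number and diagnosis)
    df <- breast.cancer.fr[, 3:32]
    # Two indicator columns per feature: *_great and *_low
    df2 <- matrix(NA, ncol = 2 * ncol(df), nrow = nrow(df))
    for (i in seq_len(ncol(df))) {
      m <- mean(df[, i])
      df2[, 2 * i - 1] <- as.numeric(df[, i] >  m)  # 1 if above the column mean
      df2[, 2 * i]     <- as.numeric(df[, i] <= m)  # 1 otherwise
    }
    # rbind() interleaves the names: x_great, x_low, y_great, y_low, ...
    colnames(df2) <- c(rbind(paste0(names(df), "_great"), paste0(names(df), "_low")))

    library(dplyr)
    # One indicator column per diagnosis level
    df3 <- select(breast.cancer.fr, id_number, diagnosis) %>%
      mutate(diagnosis_M = as.numeric(diagnosis == "M"),
             diagnosis_B = as.numeric(diagnosis == "B"))

    # Drop the original diagnosis column and combine everything
    df <- cbind(df3[, -2], df2)
    df[1:10, 1:7]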
       id_number diagnosis_M diagnosis_B radius_mean_great radius_mean_low texture_mean_great texture_mean_low
    1     842302           1           0                 1               0                  0                1
    2     842517           1           0                 1               0                  0                1
    3   84300903           1           0                 1               0                  1                0
    4   84348301           1           0                 0               1                  1                0
    5   84358402           1           0                 1               0                  0                1
    6     843786           1           0                 0               1                  0                1
    7     844359           1           0                 1               0                  1                0
    8   84458202           1           0                 0               1                  1                0
    9     844981           1           0                 0               1                  1                0
    10  84501001           1           0                 0               1                  1                0
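
For comparison, the same mean-threshold split can probably be written without an explicit loop. The sketch below is one possible variant using `colMeans()` and `sweep()` from base R; it is not part of the answer above, and the `feat` data frame with its values is a hypothetical stand-in for the 30 numeric feature columns.

```r
# Hypothetical stand-in for the numeric feature columns
feat <- data.frame(a = c(1, 2, 3, 10), b = c(5, 5, 5, 1))

# Compare every column against its own mean in one call
means <- colMeans(feat)
great <- sweep(as.matrix(feat), 2, means, FUN = ">") * 1  # logical -> 0/1
low   <- 1 - great                                        # complement: value <= mean

colnames(great) <- paste0(colnames(feat), "_great")
colnames(low)   <- paste0(colnames(feat), "_low")

# Interleave so each feature's two columns sit side by side
idx <- c(rbind(seq_len(ncol(feat)), ncol(feat) + seq_len(ncol(feat))))
binary <- cbind(great, low)[, idx]
print(binary)
```

Defining `low` as the complement of `great` guarantees that each pair of columns sums to 1 in every row, which is a cheap sanity check to run on the real result as well.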
    

    【Comments】:

    • Thanks, that's great