【问题标题】:how to make a data frame for lm in r如何在r中为lm制作数据框
【发布时间】:2020-08-20 09:16:44
【问题描述】:

我有一个数据框,其中包含个人所有基因的基因表达和所有个人的体重值。我想对所有基因 (A1--A13) 执行 lm,并对所有个体 (S1-----S9) 进行权重。

gene    S1  S2  S3  S4  S5  S6  S7  S8  S9
weight  1,34175933  NA  0,506664615 2,404181093 0,853749494 0,931450603 2,666384344 1,483623026 1,908323207
A1  0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   0   0   0
A3  0,047059    0   0   0   0,055744    0   0   0   0
A4  0   0   0   0   0   0   0   0   0
A5  0   0   0   0   0   0   0   0   0
A6  0   0   0   0   0   0   0   0   0
A7  0   0   0   0   0   0   0   0   0
A8  0   0   0   0   0   0   0   0   0
A9  0   0   0   0   0   0   0   0   0
A10 0   0   0   0   0   0   0   0   0
A11 0   0   0   0   0   0   0   0   0
A12 0   0   0   0   0   0   0   0   0
A13 0   0   0   0   0   0   0   0   0

我想要的输出是所有个体体重的所有基因的 p 值。我遇到的问题是这个数据框没有将权重作为单独的列读取。谢谢!

【问题讨论】:

  • 这些基因表达值是多少?算吗?信号强度?
  • 这是错误的结构……R 中的大多数事情。我怀疑你需要:(1)transpose 数据以便个人是 rows我>; (2) 重新转换为data.frame; (3) 用as.integer 转换大部分内容;然后 (4) 弄清楚 1,341758933 对加权的 numerically 意味着什么(这对我来说没有任何意义),并适当地解析它。 (这也适用于A3……这不是数字。)
  • @AnuragN.Sharma,这些是 TPM 值。谢谢!
  • 鉴于每个样本中大多只有低表达基因,除了少数例外,您确定您尝试建模的内容有什么用吗?
  • @AnuragN.Sharma,我有更多的基因。这只是一个例子。我想让模型先运行。

标签: r


【解决方案1】:

正如@r2evans 所说,您最大的问题是您的数据方向错误...

需要解决这个问题。我将假设数据集太大,您无法手动执行此操作。所以这里有一些非常丑陋的 R 代码,但不管有多少主题或基因,你都应该能够修改。

# what you gave us
your_data <- read.table(text = "gene    S1  S2  S3  S4  S5  S6  S7  S8  S9
weight  1,34175933  NA  0,506664615 2,404181093 0,853749494 0,931450603 2,666384344 1,483623026 1,908323207
A1  0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   0   0   0
A3  0,047059    0   0   0   0,055744    0   0   0   0
A4  0   0   0   0   0   0   0   0   0
A5  0   0   0   0   0   0   0   0   0
A6  0   0   0   0   0   0   0   0   0
A7  0   0   0   0   0   0   0   0   0
A8  0   0   0   0   0   0   0   0   0
A9  0   0   0   0   0   0   0   0   0
A10 0   0   0   0   0   0   0   0   0
A11 0   0   0   0   0   0   0   0   0
A12 0   0   0   0   0   0   0   0   0
A13 0   0   0   0   0   0   0   0   0", header = TRUE)

# save your data from excel to a csv file
# your_data <- read.table("untitled.csv", header = TRUE )
# should show about like this
your_data
#>      gene         S1 S2          S3          S4          S5          S6
#> 1  weight 1,34175933 NA 0,506664615 2,404181093 0,853749494 0,931450603
#> 2      A1          0  0           0           0           0           0
#> 3      A2          0  0           0           0           0           0
#> 4      A3   0,047059  0           0           0    0,055744           0
#> 5      A4          0  0           0           0           0           0
#> 6      A5          0  0           0           0           0           0
#> 7      A6          0  0           0           0           0           0
#> 8      A7          0  0           0           0           0           0
#> 9      A8          0  0           0           0           0           0
#> 10     A9          0  0           0           0           0           0
#> 11    A10          0  0           0           0           0           0
#> 12    A11          0  0           0           0           0           0
#> 13    A12          0  0           0           0           0           0
#> 14    A13          0  0           0           0           0           0
#>             S7          S8          S9
#> 1  2,666384344 1,483623026 1,908323207
#> 2            0           0           0
#> 3            0           0           0
#> 4            0           0           0
#> 5            0           0           0
#> 6            0           0           0
#> 7            0           0           0
#> 8            0           0           0
#> 9            0           0           0
#> 10           0           0           0
#> 11           0           0           0
#> 12           0           0           0
#> 13           0           0           0
#> 14           0           0           0
# let's flip it in R
your_data <- as.data.frame(t(your_data))
your_data
#>               V1 V2 V3       V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
#> gene      weight A1 A2       A3 A4 A5 A6 A7 A8  A9 A10 A11 A12 A13
#> S1    1,34175933  0  0 0,047059  0  0  0  0  0   0   0   0   0   0
#> S2          <NA>  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S3   0,506664615  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S4   2,404181093  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S5   0,853749494  0  0 0,055744  0  0  0  0  0   0   0   0   0   0
#> S6   0,931450603  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S7   2,666384344  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S8   1,483623026  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S9   1,908323207  0  0        0  0  0  0  0  0   0   0   0   0   0
# let's write it back out since you say you have a lot of genes
write.csv(your_data, file = "transposed.csv", na = "", row.names = TRUE)
# read it back in and get the header correct
fixed_data <- read.csv("transposed.csv", skip = 1, header = TRUE, na.strings = "")
fixed_data
#>   gene      weight A1 A2       A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13
#> 1   S1  1,34175933  0  0 0,047059  0  0  0  0  0  0   0   0   0   0
#> 2   S2        <NA>  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 3   S3 0,506664615  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 4   S4 2,404181093  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 5   S5 0,853749494  0  0 0,055744  0  0  0  0  0  0   0   0   0   0
#> 6   S6 0,931450603  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 7   S7 2,666384344  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 8   S8 1,483623026  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 9   S9 1,908323207  0  0        0  0  0  0  0  0  0   0   0   0   0
# better?  it's now by subject not gene
fixed_data$subject <- fixed_data$gene
# I'm fixing the ones in your sample data you need to check all columns
fixed_data$weight <- as.numeric(stringr::str_replace(fixed_data$weight, ",", "."))
fixed_data$A3 <- as.numeric(stringr::str_replace(fixed_data$A3, ",", "."))
# much better
fixed_data
#>   gene    weight A1 A2       A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 subject
#> 1   S1 1.3417593  0  0 0.047059  0  0  0  0  0  0   0   0   0   0      S1
#> 2   S2        NA  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S2
#> 3   S3 0.5066646  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S3
#> 4   S4 2.4041811  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S4
#> 5   S5 0.8537495  0  0 0.055744  0  0  0  0  0  0   0   0   0   0      S5
#> 6   S6 0.9314506  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S6
#> 7   S7 2.6663843  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S7
#> 8   S8 1.4836230  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S8
#> 9   S9 1.9083232  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S9
# make a copy
model_me <- fixed_data
# remove non regressor relevants
model_me$gene <- NULL
model_me$subject <- NULL

# per your question in the comments remove all columns 
# where TPM is zero for all subjects
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
all_zero <- function(x) !sum(x, na.rm = TRUE) == 0
model_me <- model_me %>% select_if(all_zero)

lm(weight ~ ., data = model_me)
#> 
#> Call:
#> lm(formula = weight ~ ., data = model_me)
#> 
#> Coefficients:
#> (Intercept)           A3  
#>       1.656      -11.174
summary(lm(weight ~ ., data = model_me))
#> 
#> Call:
#> lm(formula = weight ~ ., data = model_me)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.1489 -0.3153  0.0200  0.3767  1.0108 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)   1.6556     0.3157   5.244  0.00193 **
#> A3          -11.1742    12.2409  -0.913  0.39651   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7743 on 6 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.1219, Adjusted R-squared:  -0.02439 
#> F-statistic: 0.8333 on 1 and 6 DF,  p-value: 0.3965

reprex package (v0.3.0) 于 2020 年 5 月 5 日创建

【讨论】:

  • 谢谢@Chuck P!在 lm 步骤中,我收到此错误:错误:保护():保护堆栈溢出
  • 您的内存不足。看看这里的初学者。 stackoverflow.com/questions/32826906,如果你得到它,我肯定会用all_zero &lt;- function(x) !sum(x, na.rm = TRUE) == 0 model_me &lt;- model_me %&gt;% select_if(all_zero)删除无用的列
【解决方案2】:

你可以试试这个,

library(broom)
library(dplyr)

weight = t(mat[1,-1])
y = mat[-1,-1]
rownames(y) = as.character(mat[,1])[-1]
colnames(y) = colnames(mat)[-1]
y = t(y)

现在我们有了正确的格式,y(基因)与 x(体重)的每一列。您可以排除全为零(或 colMeans > 某个值)的列,在上面的示例中,您只有 1 个不全为零的列:

table(colMeans(y)>0)
FALSE  TRUE 
   12     1

希望你能解决这个问题。例如,我将创建一个与 y 维度相同的矩阵 Y 并回归所有内容:

Y = matrix(runif(length(y)),ncol=ncol(y),dimnames=list(rownames(y),colnames(y)))

tidy(lm(Y ~ weight)) %>% filter(term!="(Intercept)")
# A tibble: 13 x 6
   response term   estimate std.error statistic p.value
   <chr>    <chr>     <dbl>     <dbl>     <dbl>   <dbl>
 1 A1       weight -0.00297     0.159   -0.0186  0.986 
 2 A2       weight  0.184       0.144    1.28    0.248 
 3 A3       weight -0.0986      0.111   -0.888   0.409 
 4 A4       weight -0.0459      0.202   -0.227   0.828 
 5 A5       weight -0.111       0.115   -0.966   0.371 
 6 A6       weight -0.0160      0.142   -0.113   0.914 
 7 A7       weight  0.250       0.107    2.34    0.0577
 8 A8       weight  0.159       0.185    0.859   0.423 
 9 A9       weight  0.0612      0.148    0.413   0.694 
10 A10      weight  0.347       0.100    3.45    0.0136
11 A11      weight -0.186       0.182   -1.02    0.346 
12 A12      weight -0.0748      0.177   -0.422   0.688 
13 A13      weight  0.0784      0.137    0.572   0.588 

以上对于几千列应该工作得很好。再次过滤掉有太多零的列是有意义的。

使用的数据:

mat = read.table(text="gene    S1  S2  S3  S4  S5  S6  S7  S8  S9
weight  1,34175933  NA  0,506664615 2,404181093 0,853749494 0,931450603 2,666384344 1,483623026 1,908323207
A1  0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   0   0   0
A3  0,047059    0   0   0   0,055744    0   0   0   0
A4  0   0   0   0   0   0   0   0   0
A5  0   0   0   0   0   0   0   0   0
A6  0   0   0   0   0   0   0   0   0
A7  0   0   0   0   0   0   0   0   0
A8  0   0   0   0   0   0   0   0   0
A9  0   0   0   0   0   0   0   0   0
A10 0   0   0   0   0   0   0   0   0
A11 0   0   0   0   0   0   0   0   0
A12 0   0   0   0   0   0   0   0   0
A13 0   0   0   0   0   0   0   0   0",header=TRUE,dec=",")

【讨论】:

  • 谢谢!它起作用并给出了 p 值。但是,当我使用 df % select_if(df) Then rownames(y) 过滤掉全为零的列时= as.character(mat[,1])[-1] 给出错误:.rowNamesDF&lt;-(x, value = value) 中的错误:无效的 'row.names' 长度另外:警告消息:在tibble 已弃用。
  • 哦..你有一个小标题..你的数据有多大?您可以将其转换为 data.frame,或者像我一样使用 read.csv 读取数据。我不认为通过在小标题中阅读它会获得太多收益。记得使用 dec=","
  • 谢谢!它在我使用 read.csv 导入时工作。如果我尝试用全零删除基因,它会改变数据结构。有没有办法在不改变结构的情况下下降。
  • 你还剩多少?如果你只剩下一个,从我上面的代码中,在 t() 之后执行 y[,colSums(y)&gt;0,drop=FALSE] ...
  • 你需要做的是```y = mat[-1,-1]; rownames(y) = as.character(mat[,1])[-1];列名(y)=列名(垫)[-1]; Y = t(y); Y= Y[,colSums(Y)>0] 删除全为 0 的列并运行 lm
猜你喜欢
  • 1970-01-01
  • 2020-03-04
  • 2016-02-11
  • 2021-02-12
  • 2019-08-23
  • 1970-01-01
  • 2020-01-24
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多