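[Posted]: 2021-05-22 23:59:09
[Question]:
I'm trying to understand multiple linear regression in R.
I have a data frame that looks like this. You can see there is a Source_Group category containing the different channels, and a Spend column showing the amount spent.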
        Date Source_Group    Spend Total_Orders year month
1 2021-01-01          OTT 12359.16           28 2021     1
2 2021-01-01  Paid Search 17266.55          190 2021     1
3 2021-01-01  Paid Social  6799.28           40 2021     1
4 2021-01-01      YouTube     0.00            7 2021     1
5 2021-01-02          OTT  9104.31           42 2021     1
Here is the dput code to recreate the first data frame:
structure(list(Date = structure(c(18628, 18628, 18628, 18628,
18629), class = "Date"), Source_Group = structure(c(11L, 12L,
13L, 17L, 11L), .Label = c("Article Or Blog", "Audio", "Direct",
"Email", "From A Friend", "From Contacts", "Influencer", "Organic Search",
"Organic Social", "Other", "OTT", "Paid Search", "Paid Social",
"Pepperjam", "Podcast", "Reddit", "YouTube", "Organic", "Peoplehype"
), class = "factor"), Spend = c(12359.16, 17266.55, 6799.28,
0, 9104.31), Total_Orders = c(28, 190, 40, 7, 42), year = c(2021,
2021, 2021, 2021, 2021), month = structure(c(1L, 1L, 1L, 1L,
1L), .Label = c("1", "2", "3", "4"), class = "factor")), row.names = c(NA,
-5L), groups = structure(list(Date = structure(c(18628, 18629
), class = "Date"), .rows = structure(list(1:4, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
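I want to look at the number of orders coming from the different marketing channels, along with the money spent on each, and make some decisions about how to allocate resources.
With that data frame, can I create a linear model like this: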
linear_model_long_format <- lm(Total_Orders ~ Spend + Source_Group, df)
Or should I restructure the data frame into wide format with this code:
df_wide <- pivot_wider(df, names_from = Source_Group, values_from = Spend)
so that each channel's spend gets its own column? Here is some dput code to recreate the second data frame:
structure(list(Date = structure(c(18628, 18628, 18628, 18628,
18629), class = "Date"), Total_Orders = c(28, 190, 40, 7, 42),
year = c(2021, 2021, 2021, 2021, 2021), month = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"),
OTT = c(12359.16, 0, 0, 0, 9104.31), `Paid Search` = c(0,
17266.55, 0, 0, 0), `Paid Social` = c(0, 0, 6799.28, 0, 0
), YouTube = c(0, 0, 0, 0, 0)), row.names = c(NA, -5L), groups = structure(list(
Date = structure(c(18628, 18629), class = "Date"), .rows = structure(list(
1:4, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
df_wide$OTT[is.na(df_wide$OTT)] <- 0
df_wide$`Paid Search`[is.na(df_wide$`Paid Search`)] <- 0
df_wide$`Paid Social`[is.na(df_wide$`Paid Social`)] <- 0
df_wide$YouTube[is.na(df_wide$YouTube)] <- 0
I noticed that I had to set the NA values to 0 to avoid errors.
I think the linear model for the wide format would look like this:
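As a side note on the NA step, here is a minimal sketch, assuming the tidyr package is installed. The `df` below is a hand-rebuilt, ungrouped copy of the five rows shown earlier, without the year and month columns, so it is an assumption rather than the original object. `pivot_wider()` accepts a `values_fill` argument, so the zeros can be filled in during the reshape instead of one column at a time.

```r
library(tidyr)

# Hand-rebuilt copy of the five example rows (assumption: ungrouped,
# Source_Group stored as character, year/month omitted for brevity)
df <- data.frame(
  Date = as.Date(c("2021-01-01", "2021-01-01", "2021-01-01",
                   "2021-01-01", "2021-01-02")),
  Source_Group = c("OTT", "Paid Search", "Paid Social", "YouTube", "OTT"),
  Spend = c(12359.16, 17266.55, 6799.28, 0, 9104.31),
  Total_Orders = c(28, 190, 40, 7, 42)
)

# One Spend column per channel; cells with no matching row get 0, not NA
df_wide <- pivot_wider(
  df,
  names_from  = Source_Group,
  values_from = Spend,
  values_fill = 0
)

print(df_wide)
```

Because Total_Orders is kept as an identifying column, each original row stays on its own line in the wide result, which matches the second dput above.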
linear_model_wide_format <- lm(Total_Orders ~ OTT + `Paid Search` + `Paid Social` + YouTube, df_wide)
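The posts I've seen online seem to use the wide format for linear models, where each column is a variable. At the same time, I know long format is usually preferred in R, and all those 0s make me really doubt that wide format is the way to go. I'm really not sure.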
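To make the comparison concrete, here is a rough sketch in base R of what each specification asks `lm()` to estimate. It only inspects design matrices with `model.matrix()` and does not fit anything, since five rows are far too few for a meaningful fit. The `df` here is again a hand-rebuilt copy of the example rows (an assumption, with Source_Group as a plain four-level factor).

```r
# Hand-rebuilt copy of the five example rows (assumption, not the original object)
df <- data.frame(
  Source_Group = factor(c("OTT", "Paid Search", "Paid Social", "YouTube", "OTT")),
  Spend = c(12359.16, 17266.55, 6799.28, 0, 9104.31),
  Total_Orders = c(28, 190, 40, 7, 42)
)

# Long format, additive formula from the question:
# one shared Spend slope plus an intercept shift per channel
X_additive <- model.matrix(~ Spend + Source_Group, data = df)
print(colnames(X_additive))

# Long format with only the interaction term:
# an intercept plus a separate Spend slope for each channel
X_by_channel <- model.matrix(~ Spend:Source_Group, data = df)
print(colnames(X_by_channel))

# The per-channel slope columns hold Spend where the row belongs to that
# channel and 0 elsewhere, which is the same layout as the wide columns
print(X_by_channel[, "Spend:Source_GroupOTT"])
```

Under these assumptions the wide-format formula and the long-format `Spend:Source_Group` formula describe the same design, so the zeros in the wide table are not an artifact of the reshape. The additive long-format formula is a different model, because it forces every channel to share one Spend slope.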
[Comments]:
-
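It would be best if you could edit your question to include the data as text (i.e. cut and pasted into a code block) rather than as a screenshot; it's more accessible in many ways.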
-
Hey, I just added a dput of the head of each data frame, in case that helps with getting good suggestions.
-
See my edit. dput() format is the most computer-friendly, but a text table is best for humans (including visually impaired people using screen readers...)
-
That's a great edit. Thanks for the info. How did you get the data into text-table format so quickly?
-
I pasted your dput() code into an R session, printed the object, then cut and pasted the resulting output into a code block in the question.
Tags: r linear-regression