Generating data for the creation of a plot in R (answers)

【Question title】: Generating data for the creation of a plot in R
【Posted】: 2016-07-30 11:46:20
【Question】:

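My task is to replicate a plot I saw in a study. However, while attempting this, I got confused about how it was created.

The plot looks like this:

The "x" marks in the plot represent the percentage of countries with a particular score (say, all countries with score == 1). The two lines represent the percentages for two other independent variables.

I know the dataset looks something like this (this is just an example, but it closely resembles the structure of my own dataset).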

country year    x1  x2  score
A       1990    0   0   0
A       1991    1   0   1
A       1992    1   0   1
A       1993    0   0   0
A       1995    1   0   0
A       1996    1   0   2
A       1997    1   0   0
B       1990    0   0   0
B       1991    0   0   0
B       1992    0   0   1
B       1993    0   0   2
B       1995    0   1   2
B       1996    0   0   2
B       1997    0   1   2
C       1990    0   1   2
C       1991    1   1   0
C       1992    1   0   0
C       1993    1   0   0
C       1995    1   0   0
C       1996    0   0   1
C       1997    0   0   1
C       1998    1   1   0
D       1990    0   0   2
D       1991    0   0   2
D       1992    1   1   2
D       1993    1   1   0
D       1995    0   0   1
D       1996    0   0   1
D       1997    0   0   1

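As you can see above, the score variable is an ordinal variable with the values 0, 1 and 2. I would like to create a data frame that lets me draw a plot similar to the one shown above. This is where I am confused about how to proceed. My question below is based on the assumption that I need to do the following in order to draw a similar plot.

How do I calculate the percentage of countries with score == 0, and the corresponding percentages of x1 and x2 for the countries with score == 0?

Eventually, I need to do the same calculation for the countries with score == 1 and score == 2.

I need some input, so I appreciate all suggestions!

【Comments】:

Tags: r function plot machine-learning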

【Solution 1】:

I put your example data in dat. There may well be a more vectorized way to do this, but it works. This only handles score, but it extends directly to x1 and x2.

# unique score values and years, sorted so rows and columns come out in order
uniqScore = sort(unique(dat$score))
uniqYear = sort(unique(dat$year))
# assuming the total number of countries remains constant across years
totalCountries = length(unique(dat$country))
# empty matrix to store results: one row per year, one column per score
store = matrix(NA, length(uniqYear), length(uniqScore))

# loop over unique scores
for (i in seq_along(uniqScore)) {
  # loop over unique years
  for (j in seq_along(uniqYear)) {
    # count the observations with a given year and score, then divide
    # by the total number of countries to obtain a proportion
    # (multiply by 100 for a percentage) and save it in store
    store[j, i] = sum(dat$year == uniqYear[j] &
                        dat$score == uniqScore[i]) /
      totalCountries
  }
}

# plot results
matplot(uniqYear, store, type = 'b', pch = 1:3, lty = 2, bty = 'n', las = 1,
        ylab = 'Proportion of countries', xlab = 'Year')
legend('topright', legend = uniqScore, pch = 1:3, lty = 2, col = 1:3, bty = 'n')

# or to make it into a data frame; c(store) stacks the columns,
# so the years repeat once per score
df = data.frame(year = rep(uniqYear, times = length(uniqScore)),
                proportion = c(store),
                score = rep(uniqScore, each = nrow(store)))

【Comments】:

A more concise way to get store might be tab1 <- xtabs(~score+year, dat), followed by store <- t(apply(tab1, MARGIN=2, function(x) x/totalCountries)).
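Here is a minimal sketch of the xtabs() suggestion from the comment above, assuming dat holds the question's example data (rebuilt with read.table so the snippet runs on its own). Because every cell is divided by the same constant, the apply() call can probably be reduced to dividing the transposed table by totalCountries. That shortcut, and the name store2, are my own additions rather than the commenter's.

```r
# Rebuild the example data from the question
dat <- read.table(header = TRUE, text = "country year x1 x2 score
A 1990 0 0 0
A 1991 1 0 1
A 1992 1 0 1
A 1993 0 0 0
A 1995 1 0 0
A 1996 1 0 2
A 1997 1 0 0
B 1990 0 0 0
B 1991 0 0 0
B 1992 0 0 1
B 1993 0 0 2
B 1995 0 1 2
B 1996 0 0 2
B 1997 0 1 2
C 1990 0 1 2
C 1991 1 1 0
C 1992 1 0 0
C 1993 1 0 0
C 1995 1 0 0
C 1996 0 0 1
C 1997 0 0 1
C 1998 1 1 0
D 1990 0 0 2
D 1991 0 0 2
D 1992 1 1 2
D 1993 1 1 0
D 1995 0 0 1
D 1996 0 0 1
D 1997 0 0 1")

# Same assumption as in the answer: the number of countries is constant
totalCountries <- length(unique(dat$country))

# Count observations per score (rows) and year (columns)
tab1 <- xtabs(~ score + year, data = dat)

# Version from the comment: one row per year, one column per score
store <- t(apply(tab1, MARGIN = 2, function(x) x / totalCountries))

# Possible shortcut (my simplification): transpose, then divide by the constant
store2 <- t(tab1) / totalCountries

# Proportions of countries with score 0, 1 and 2 in 1991
store["1991", ]
```

In this example only country C has a row for 1998, so dividing by a constant totalCountries gives 0.25 for that year even though every country observed that year has score 0. Whether that is the desired denominator depends on how missing country-years should be treated.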

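【Solution 2】:

A simple way to get the percentage of cases that meet some condition (e.g., the percentage with score == 0) is mean(condition) * 100. Here is a detailed blog post about it: https://drsimonj.svbtle.com/proportionsfrequencies-with-mean-and-booleans. If you have missing values, be sure to use mean(condition, na.rm = TRUE) * 100.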
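To see why this works, here is a small sketch on a made-up vector (the values are hypothetical, not taken from the question). A comparison yields TRUE/FALSE values, and mean() treats TRUE as 1 and FALSE as 0, so the mean is the share of TRUE values.

```r
# Hypothetical score vector with one missing value
score <- c(0, 1, 0, 2, NA)

# Element-wise comparison: TRUE, FALSE, TRUE, FALSE, NA
is_zero <- score == 0

# Without na.rm the single NA makes the whole result NA
mean(is_zero) * 100

# With na.rm = TRUE the NA is dropped: 2 of the 4 remaining values are 0
pct_zero <- mean(is_zero, na.rm = TRUE) * 100
pct_zero  # 50
```

Note that na.rm = TRUE changes the denominator to the number of non-missing values, which may or may not be what you want for a "percentage of countries".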

I'll start with simulated data that roughly matches what you provided:

set.seed(987)
d <- data.frame(
  year  = rep(c(1991:2000), each = 10),
  x1    = sample(c(0, 1, 2), 100, replace = TRUE),
  x2    = sample(c(0, 1, 2), 100, replace = TRUE),
  score = sample(c(0, 1, 2), 100, replace = TRUE)
)
head(d)
#>   year x1 x2 score
#> 1 1991  1  2     2
#> 2 1991  2  1     2
#> 3 1991  1  1     2
#> 4 1991  1  0     2
#> 5 1991  2  0     0
#> 6 1991  0  0     1

You can then use group_by(year) and summarise(...) from the dplyr package to calculate, for each year, the percentage of observations with a particular value:

library(dplyr)
to_match <- 0
d <- d %>%
  group_by(year) %>% 
  summarise(
    x1    = mean(x1 == to_match) * 100,
    x2    = mean(x2 == to_match) * 100,
    score = mean(score == to_match) * 100
  )
d
#> # A tibble: 10 x 4
#>     year    x1    x2 score
#>    <int> <dbl> <dbl> <dbl>
#> 1   1991    10    60    30
#> 2   1992    60    20    30
#> 3   1993    40    40    30
#> 4   1994    40    50    50
#> 5   1995    50    50    20
#> 6   1996    30    40    20
#> 7   1997    20    30     0
#> 8   1998    20    60    40
#> 9   1999    40    30    20
#> 10  2000    20    40    40

Note that I simply set the variable to_match to 0. You can change it to the other values, 1 and 2.
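To cover all three values without editing to_match by hand, one option is to wrap the calculation in a small helper and loop over the values. The sketch below uses base R only and a tiny made-up data frame. pct_by_year() is a hypothetical helper name, not part of dplyr or of the original answer.

```r
# Tiny made-up data set: two years, four observations each
toy <- data.frame(
  year  = rep(c(1991, 1992), each = 4),
  score = c(0, 0, 1, 2,   1, 1, 2, 0)
)

# Hypothetical helper: percentage of rows per year where values == to_match
pct_by_year <- function(values, year, to_match) {
  tapply(values == to_match, year, mean, na.rm = TRUE) * 100
}

# One column per value of to_match, one row per year
pct <- sapply(0:2, function(m) pct_by_year(toy$score, toy$year, m))
colnames(pct) <- paste0("score_", 0:2)
pct
```

Each row of pct sums to 100 here because every observation has exactly one of the three score values and none is missing.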

You can then plot the result with ggplot2, for example:

library(ggplot2) 
d %>% 
  ggplot(aes(x = year)) +
    scale_x_continuous(breaks = 1991:2000) +
    geom_line(aes(y = x1)) +
    geom_line(aes(y = x2), color = "grey") +
    geom_point(aes(y = score)) +
    scale_y_continuous(limits = c(0, 100)) +
    ylab("Percent Countries") +
    theme_bw()

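If you want a legend and are happy to make all geoms the same (i.e., all lines and/or all points), you can use gather() from the tidyr package to get the data into long format, and then change the group and color/linetype aesthetics in the plot to match. Here is an example: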

library(tidyr)
d %>% 
  gather(-year, key = "var", value = "Percent") %>% 
  ggplot(aes(x = year, y = Percent, group = var)) +
    scale_x_continuous(breaks = 1991:2000) +
    geom_line(aes(linetype = var, color = var)) +
    geom_point(size = 2) +
    scale_y_continuous(limits = c(0, 100)) +
    ylab("Percent Countries") +
    theme_bw()
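For readers who want to see what the long-format step does, the reshaping that gather() performs above can be sketched in base R. This is only an illustration under assumptions: the wide data frame below copies the first three rows of the summary table shown earlier, and the names wide, vars and long are mine.

```r
# First three rows of the yearly summary shown earlier (wide format)
wide <- data.frame(
  year  = 1991:1993,
  x1    = c(10, 60, 40),
  x2    = c(60, 20, 40),
  score = c(30, 30, 30)
)

# Every column except year becomes a (var, Percent) pair per year
vars <- setdiff(names(wide), "year")
long <- data.frame(
  year    = rep(wide$year, times = length(vars)),
  var     = rep(vars, each = nrow(wide)),
  Percent = unlist(wide[vars], use.names = FALSE)
)
long
```

The result has one row per year and variable (9 rows here), which is the shape ggplot2 needs to map var to color or linetype and build a legend from it.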

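【Comments】:

Hi Simon, thank you. Great blog! One question: there are NAs in my real data; how do I handle them in the code? Also, just to be sure: I don't need to include the "country" variable in the code, right? "to_match" does all the work?
You can add na.rm = TRUE to mean() to handle missing values (I'll add a note to the answer). Regarding "country", unless you have a country appearing more than once per year, I don't think it's necessary? And thanks for the feedback on blogR!
Great, thanks for all your work. Yes, the blog looks really good; I'll explore it further.
One very useful thing would be to have a small box in the plot explaining the names of the plotted variables.
Added an example with a legend.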
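The caveat in the comments about a country appearing more than once per year can be checked before computing percentages. The following is a rough sketch on made-up data (the panel data frame and the choice to keep the first row per country-year are assumptions for illustration only).

```r
# Made-up panel data; country B appears twice in 1991 on purpose
panel <- data.frame(
  country = c("A", "B", "B", "A", "B"),
  year    = c(1991, 1991, 1991, 1992, 1992),
  score   = c(0, 1, 1, 2, 0)
)

# TRUE for every row that repeats an earlier country-year combination
is_dup <- duplicated(panel[, c("country", "year")])
any(is_dup)

# One possible fix: keep the first row per country-year, then summarise
panel_unique <- panel[!is_dup, ]
tapply(panel_unique$score == 1, panel_unique$year, mean) * 100
```

If any(is_dup) is FALSE for the real data, each row is one country in one year and the yearly mean can be read directly as a percentage of countries.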