Determine structure of the entire data: answers

【Question title】: Determine structure of the entire data
【Posted】: 2018-10-17 05:41:42
【Question description】:

Say you have the following data:

data <- tibble::tribble(~Countries, ~States,   ~Continents,
                        "Country 1",      1L, "continent 1",
                        "Country 1",      2L, "continent 1",
                        "Country 1",      3L, "continent 1",
                        "Country 1",      4L, "continent 1",
                        "Country 2",      1L, "continent 1",
                        "Country 2",      2L, "continent 1",
                        "Country 2",      3L, "continent 1",
                        "Country 2",      4L, "continent 1",
                        "Country 3",      1L, "continent 1",
                        "Country 3",      2L, "continent 1",
                        "Country 3",      3L, "continent 1",
                        "Country 3",      4L, "continent 1",
                        "Country 1",      1L, "continent 2",
                        "Country 1",      2L, "continent 2",
                        "Country 1",      3L, "continent 2",
                        "Country 1",      4L, "continent 2",
                        "Country 2",      1L, "continent 2",
                        "Country 2",      2L, "continent 2",
                        "Country 2",      3L, "continent 2",
                        "Country 2",      4L, "continent 2",
                        "Country 3",      1L, "continent 2",
                        "Country 3",      2L, "continent 2",
                        "Country 3",      3L, "continent 2",
                        "Country 3",      4L, "continent 2")

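The data could have many variables in different formats and at different levels of granularity. I want to understand the structure of the data, so that I can say that the highest level of the data is Continents with 2 distinct values, the next level of granularity is Countries with 3 values, and the lowest level is States with 4 values.

1. A crude way to understand this could be to keep the variable with the fewest distinct values on the left (Continents) and the variable with the most distinct values on the right of the dataset (States).
2. A simpler way to understand such messy data would be to create some sort of tree diagram, with the least granular variable at the top (Continents here) and the most granular variable at the bottom as the leaves/nodes (States here).

As a first cut, we can use a simple rule for ties: when two or more variables have the same number of unique values, any one of them may be shown first/at the top.

If doing the second is hard, how can we at least do the first? Possibly by counting the distinct values of each variable in any generic messy dataset and then ordering the variables by that count. Any other approach with R code would be helpful.

The solution to the first point would look like this: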

data <- tibble::tribble( ~Continents,  ~Countries,   ~States,
                         "continent 1", "Country 1",   1L,
                         "continent 1", "Country 1",   2L,
                         "continent 1", "Country 1",   3L,
                         "continent 1", "Country 1",   4L,
                         "continent 1", "Country 2",   1L,
                         "continent 1", "Country 2",   2L,
                         "continent 1", "Country 2",   3L,
                         "continent 1", "Country 2",   4L,
                         "continent 1", "Country 3",   1L,
                         "continent 1", "Country 3",   2L,
                         "continent 1", "Country 3",   3L,
                         "continent 1", "Country 3",   4L,
                         "continent 2", "Country 1",   1L,
                         "continent 2", "Country 1",   2L,
                         "continent 2", "Country 1",   3L,
                         "continent 2", "Country 1",   4L,
                         "continent 2", "Country 2",   1L,
                         "continent 2", "Country 2",   2L,
                         "continent 2", "Country 2",   3L,
                         "continent 2", "Country 2",   4L,
                         "continent 2", "Country 3",   1L,
                         "continent 2", "Country 3",   2L,
                         "continent 2", "Country 3",   3L,
                         "continent 2", "Country 3",   4L)

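【Comments】:

Can you show the expected output?
Sure, I have just updated the question with the expected solution for the first point above.
If I understand correctly, the first part can be solved with data[order(sapply(data, function(x) length(unique(x))))]: it gets the number of unique values per column, takes the order of those counts, and reorders the columns of data accordingly. I don't quite understand the expected result for the second part: do you want 3 nodes and 2 edges connecting the 3 nodes, where the order is shown only by vertical alignment?
Something like this: plot(NULL, xlim = c(-10, 10), ylim = c(-10, 10)); rect(-2, 7, 2, 9, col = "red"); text(0, 8, "Continents"); rect(-3, 4, 3, 6, col = "blue"); text(0, 5, "Countries"); rect(-4, 1, 4, 3, col = "green"); text(0, 2, "States")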

Tags: r data.table tidyverse janitor

【Solution 1】:

If I understood you correctly, the following code answers your question:

# Reorder the columns from fewest to most distinct values (the desired order)
data[order(sapply(data, function(x) length(unique(x))))]

# Simple function for plotting the 'tree': one box per variable, stacked
# top to bottom, with box width proportional to the distinct-value count.
plotTree <- function(lengths, names, space = 0.3) {
  O    <- order(lengths)
  L    <- lengths[O]
  N    <- names[O]
  XMax <- max(L)
  YMax <- length(L)
  plot(NULL, xlim = c(-XMax, XMax), ylim = c(-YMax, YMax),
       axes = FALSE, xlab = "", ylab = "")
  for (i in seq_along(L)) {
    rect(-L[i], YMax - 1 - i * (space + 1), L[i], YMax - i * (space + 1), col = i)
    # col = 1 is black in the default palette, so the first label is drawn in white
    text(0, YMax - 1/2 - i * (space + 1), N[i], col = if (i == 1) "white" else "black")
  }
}

# Usage
plotTree(sapply(data, function(x) length(unique(x))), names(data), space = 0.3)
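
For readers who want to see what the one-liner does step by step, here is a minimal base-R sketch. It rebuilds the question's 24 rows with expand.grid instead of tibble::tribble so that it runs without extra packages. The names n_distinct, col_order, ordered and n_combos are made up for illustration, and the last step (counting distinct combinations of the first k columns) is an extra sanity check that was not part of the original answer.

```r
# Rebuild the question's data (same 24 rows, same original column order).
data <- expand.grid(States     = 1:4,
                    Countries  = paste("Country", 1:3),
                    Continents = paste("continent", 1:2),
                    stringsAsFactors = FALSE)[, c("Countries", "States", "Continents")]

# Step 1: number of distinct values in each column.
n_distinct <- vapply(data, function(x) length(unique(x)), integer(1))

# Step 2: column positions sorted from fewest to most distinct values.
# Tied columns keep their original relative order.
col_order <- order(n_distinct)

# Step 3: reorder the columns.
ordered <- data[col_order]
names(ordered)  # "Continents" "Countries" "States"

# Extra check (an assumption of this sketch, not from the answer):
# count the distinct combinations of the first k columns.
n_combos <- vapply(seq_along(ordered),
                   function(k) nrow(unique(ordered[seq_len(k)])),
                   integer(1))
names(n_combos) <- names(ordered)
n_combos  # 2, 6, 24
```

The per-column counts (2, 3, 4) only give the ordering. The combination counts (2, 6, 24) show how many nodes a tree would actually have at each level, since the same state numbers repeat within every country.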

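【Discussion】:

This works, except that the first line should be what you suggested in your earlier comment, i.e. data[order(sapply(data, function(x) length(unique(x))))].
Let me see whether there are any other suggestions. If not, I will go ahead and accept this answer.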