[Question title]: How to fill an empty column from a dataframe, based on conditions from more than 2 other dataframes, all with different lengths?
[Posted]: 2019-06-05 17:43:32
[Question description]:

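I have a first data frame (named "fish_12") with 74610 rows, each holding the data for one marine fish specimen. The first column is the name of the species each specimen belongs to (many specimens across the data frame belong to the same species), the second column, BIN, is an ID number for each species, and then I have the name of each specimen's collector, the country where it was collected, and an empty column, grade, that I want to fill in.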

     species        |    BIN      |    collectors  |  country      | grade
--------------------------------------------------------------------------
Tilapia guineensis  |BOLD:AAL5979 |    C.D. Nwani  |     Nigeria   | NA
Tilapia zillii      |BOLD:AAB9042 |    C.D. Nwani  |     Nigeria   | NA
Fundulus rubrifrons |BOLD:AAI7245 |  John Donavan  |  United States| NA
Eutrigla gurnardus  |BOLD:AAC0262 |Hermann Neumann |    North Sea  | NA
Sprattus sprattus   |BOLD:AAE9187 |Hermann Neumann |    North Sea  | NA
Gadus morhua        |BOLD:ACF1143 |Hermann Neumann |    North Sea  | NA
Tilapia zillii      |BOLD:AAB9042 |     C.D. Nwani |      Nigeria  | NA
Gadus morhua        |BOLD:ACF1169 |   Angela Cicia |  United States| NA

As you can see further down, a species can have one or more BINs, and the same BIN can sometimes be assigned to different species.

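What I want to do is fill in the grade column as follows: grade "E" for every species assigned to a BIN that is itself assigned to more than 1 different species; grade "D" for every species that occurs fewer than 3 times in the first data frame; "C" for every species that is assigned more than 1 different BIN, where each of those BINs is assigned to that one species only; "B" for species that are assigned a single BIN and whose specimens all come from the same collector and the same country; and finally "A" for species that are assigned a single BIN but whose specimens were collected by several different collectors or in several countries.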

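So what I did was create a new data frame with a column holding how many BINs are assigned to each species (bin_per_species); another with a column showing how many species exist for each BIN (species_per_bin); another with a column showing how many collectors exist for each species (collector_per_species); and finally one with a column of how many countries are assigned to each species (countries_per_species).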

# Create the other data frames from the first one
library(dplyr)    # group_by(), summarise(), n_distinct(), %>%
library(stringr)  # str_c()

# Number of distinct BINs per species, plus the BINs themselves
fish_13 <- fish_12 %>%
  group_by(species) %>%
  summarise(bin_per_species = n_distinct(BIN),
            BIN = str_c(unique(BIN), collapse = ","))
View(fish_13)

# Number of distinct species per BIN, plus the species themselves
fish_14 <- fish_12 %>%
  group_by(BIN) %>%
  summarise(species_per_bin = n_distinct(species),
            species = str_c(unique(species), collapse = ","))
View(fish_14)
length(unique(fish_14$BIN))

# Number of distinct collectors per species
fish_15 <- fish_12 %>%
  group_by(species) %>%
  summarise(collector_per_species = n_distinct(collectors),
            collectors = str_c(unique(collectors), collapse = ","))
View(fish_15)

# Number of distinct countries per species
fish_16 <- fish_12 %>%
  group_by(species) %>%
  summarise(countries_per_species = n_distinct(country),
            country = str_c(unique(country), collapse = ","))
View(fish_16)

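From here I tried forming the conditions with various if/else functions, but the problem is that the data frames have different lengths and I cannot assign all the grades from A to E at once, because even when I manage to get no errors, some of them revert to NA. My desired output is basically the first data frame, with a grade for every specimen.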
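For illustration, here is a minimal sketch of one common way around the length mismatch described above: join the shorter summary tables back onto the specimen table by their shared key, so each count is repeated once per specimen. The rows below are made up, and `bin_counts`, `species_counts` and `fish_joined` are hypothetical names; only the column names come from the question.

```r
library(dplyr)

# Hypothetical miniature of fish_12 (made-up rows, same column names)
fish_12 <- data.frame(
  species = c("sp_1", "sp_1", "sp_2", "sp_3"),
  BIN     = c("B1", "B2", "B2", "B3"),
  stringsAsFactors = FALSE
)

# Per-species and per-BIN summaries: each one is shorter than fish_12
bin_counts <- fish_12 %>%
  group_by(species) %>%
  summarise(bin_per_species = n_distinct(BIN))

species_counts <- fish_12 %>%
  group_by(BIN) %>%
  summarise(species_per_bin = n_distinct(species))

# Joining on the shared key repeats each count once per specimen,
# so the result has exactly as many rows as fish_12
fish_joined <- fish_12 %>%
  left_join(bin_counts, by = "species") %>%
  left_join(species_counts, by = "BIN")
```

Once every count sits on the specimen rows, conditions can be written as ordinary vectorised comparisons of equal length.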

Sorry if I was confusing or presented the data the wrong way; I am new to this community and trying to get better. Thanks in advance for any reply.

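[Comments]:

Tags: r

[Solution 1]:

First of all, welcome to SO.

Now, about your question: I got a bit confused trying to understand all the rules, but I think the solution can be simple.

The rules are mostly based on the BIN values, so iterate over those values, take the matching subset of the data, then apply a function that checks the rules and updates the grade.

Something like this: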

bins = unique(fish_12$BIN)
for (b in bins) {
    # Row indices for this BIN, so only the grades of this subset are updated
    sub_fish_index = which(fish_12$BIN == b)
    sub_fish_data = fish_12[sub_fish_index, ]

    # Use a function to identify the patterns and apply the rules
    # (apply_rules is a placeholder to write yourself; it returns a vector of grades)
    new_grade = apply_rules(sub_fish_data)

    # Update the grade in the main data.frame
    fish_12$grade[sub_fish_index] = new_grade
}
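As a rough sketch of what such a function might look like, the hypothetical `apply_rules()` below handles only the "E" rule, since that is the one rule that can be decided from the rows of a single BIN. The other rules need counts per species across the whole table, so they are left out here. The rows are made up for illustration.

```r
# Hypothetical miniature of fish_12 (made-up rows)
fish_12 <- data.frame(
  species = c("sp_1", "sp_2", "sp_3"),
  BIN     = c("B1", "B1", "B2"),
  grade   = NA_character_,
  stringsAsFactors = FALSE
)

# Hypothetical apply_rules(): returns one grade per row of the subset
apply_rules <- function(sub_fish_data) {
  shared <- length(unique(sub_fish_data$species)) > 1
  # A BIN shared by several species gets "E"; otherwise keep the current grade
  if (shared) rep("E", nrow(sub_fish_data)) else sub_fish_data$grade
}

for (b in unique(fish_12$BIN)) {
  idx <- which(fish_12$BIN == b)
  fish_12$grade[idx] <- apply_rules(fish_12[idx, ])
}
```

Note that this only marks the specimens inside the shared BIN; specimens of the same species filed under another BIN would still need a pass per species.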

I return a vector of grades because some of the rules may be able to use this information to set the correct grade.

Hope it helps.
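A vectorised alternative, which is not the loop above but a different technique, might look like the sketch below: a grouped `mutate()` keeps one row per specimen, and `dplyr::case_when()` then assigns the grade. The toy rows are made up, and the order in which overlapping rules are checked (E, then D, C, B, A) is an assumption, because the question does not say which rule wins when several apply.

```r
library(dplyr)

# Hypothetical toy data with the question's column names (made-up rows)
fish_12 <- data.frame(
  species    = c("sp_E1", "sp_E2", "sp_D", "sp_D", "sp_C", "sp_C", "sp_C",
                 "sp_B", "sp_B", "sp_B", "sp_A", "sp_A", "sp_A"),
  BIN        = c("B1", "B1", "B2", "B2", "B3", "B3", "B4",
                 "B5", "B5", "B5", "B6", "B6", "B6"),
  collectors = c(rep("x", 11), "y", "x"),
  country    = rep("N", 13),
  stringsAsFactors = FALSE
)

graded <- fish_12 %>%
  group_by(BIN) %>%
  mutate(species_per_bin = n_distinct(species)) %>%
  group_by(species) %>%
  mutate(
    n_specimens            = n(),
    bin_per_species        = n_distinct(BIN),
    collectors_per_species = n_distinct(collectors),
    countries_per_species  = n_distinct(country),
    shares_bin             = any(species_per_bin > 1)  # any BIN shared with another species
  ) %>%
  ungroup() %>%
  mutate(grade = case_when(
    shares_bin                                               ~ "E",
    n_specimens < 3                                          ~ "D",
    bin_per_species > 1                                      ~ "C",
    collectors_per_species == 1 & countries_per_species == 1 ~ "B",
    TRUE                                                     ~ "A"  # one BIN, several collectors or countries
  ))
```

Because every helper column has the same length as the specimen table, no grade is overwritten with NA by a shorter vector.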

[Comments]: