【Question title】: Order factor levels according to the order in which the levels appear in the data
【Posted】: 2016-07-23 19:50:07
【Question】:

This is part of a data frame that I loaded from the internet using readHTMLTable:

head(tt,59)
    year         sport                      event      athlete_id  medal
1   1896 Track & Field                   100m Men      BURKETOM01   GOLD
2   1896 Track & Field                   100m Men      HOFMAFRI01 SILVER
3   1896 Track & Field                   100m Men       LANEFRA01 BRONZE
4   1896 Track & Field                   100m Men      SZOKOALA01 BRONZE
5   1896 Track & Field                   400m Men      BURKETOM01   GOLD
6   1896 Track & Field                   400m Men      JAMISHER01 SILVER
7   1896 Track & Field                   400m Men      GMELICHA01 BRONZE
8   1896 Track & Field                   800m Men      FLACKTED01   GOLD
9   1896 Track & Field                   800m Men       DÁNINÁN01 SILVER
10  1896 Track & Field                   800m Men      GOLEMDEM01 BRONZE
11  1896 Track & Field                  1500m Men      FLACKTED01   GOLD
12  1896 Track & Field                  1500m Men      BLAKEART01 SILVER
13  1896 Track & Field                  1500m Men      LERMUALB01 BRONZE
14  1896 Track & Field               Marathon Men      LOUISSPI01   GOLD
15  1896 Track & Field               Marathon Men      VASILCHA01 SILVER
16  1896 Track & Field               Marathon Men      KELLNGYU01 BRONZE
17  1896 Track & Field           110m Hurdles Men      CURTITOM01   GOLD
18  1896 Track & Field           110m Hurdles Men      GOULDGRA01 SILVER
19  1896 Track & Field              High Jump Men      CLARKELL01   GOLD
20  1896 Track & Field              High Jump Men      CONNOJAM01 SILVER
21  1896 Track & Field              High Jump Men      GARREBOB01 SILVER
22  1896 Track & Field             Pole Vault Men       HOYTBIL01   GOLD
23  1896 Track & Field             Pole Vault Men      TYLERALB01 SILVER
24  1896 Track & Field             Pole Vault Men      THEODIOA01 BRONZE
25  1896 Track & Field             Pole Vault Men      DAMASEVA01 BRONZE
26  1896 Track & Field              Long Jump Men      CLARKELL01   GOLD
27  1896 Track & Field              Long Jump Men      GARREBOB01 SILVER
28  1896 Track & Field              Long Jump Men      CONNOJAM01 BRONZE
29  1896 Track & Field            Triple Jump Men      CONNOJAM01   GOLD
30  1896 Track & Field            Triple Jump Men      TUFFÈALE01 SILVER
31  1896 Track & Field            Triple Jump Men      PERSAIOA01 BRONZE
32  1896 Track & Field               Shot Put Men      GARREBOB01   GOLD
33  1896 Track & Field               Shot Put Men      GOUSKMIL01 SILVER
34  1896 Track & Field               Shot Put Men      PAPASGEO01 BRONZE
35  1896 Track & Field           Discus Throw Men      GARREBOB01   GOLD
36  1896 Track & Field           Discus Throw Men      PARASPAN01 SILVER
37  1896 Track & Field           Discus Throw Men      VERSISOT01 BRONZE
38  1896       Cycling 2000m Sprint (Scratch) Men      MASSOPAU01   GOLD
39  1896       Cycling 2000m Sprint (Scratch) Men      NIKOLSTA01 SILVER
40  1896       Cycling 2000m Sprint (Scratch) Men      FLAMELÉO01 BRONZE
41  1896       Cycling   Individual Road Race Men      KONSTARI01   GOLD
42  1896       Cycling   Individual Road Race Men      GOEDRAUG01 SILVER
43  1896       Cycling   Individual Road Race Men      BATTEEDW01 BRONZE
44  1896       Cycling               One-Lap Race      MASSOPAU01   GOLD
45  1896       Cycling               One-Lap Race      NIKOLSTA01 SILVER
46  1896       Cycling               One-Lap Race      SCHMAADO01 BRONZE
47  1896       Cycling            10km Track Race      MASSOPAU01   GOLD
48  1896       Cycling            10km Track Race      FLAMELÉO01 SILVER
49  1896       Cycling            10km Track Race      SCHMAADO01 BRONZE
50  1896       Cycling           100km Track Race      FLAMELÉO01   GOLD
51  1896       Cycling           100km Track Race      KOLETGEO01 SILVER
52  1896       Cycling               12-Hour Race      SCHMAADO01   GOLD
53  1896       Cycling               12-Hour Race      KEEPIFRA01 SILVER
54  1896       Fencing           Foil, Individual      GRAVEEUG01   GOLD
55  1896       Fencing           Foil, Individual      CALLOHEN01 SILVER
56  1896       Fencing           Foil, Individual      PIERRPER01 BRONZE
57  1896       Fencing          Sabre, Individual      GEORGIOA01   GOLD
58  1896       Fencing          Sabre, Individual      KARAKTEL01 SILVER
59  1896       Fencing          Sabre, Individual      NIELSHOL01 BRONZE

As you can see, the variable sport is a factor. When I check its levels, this is what I get:

levels(tt$sport)
[1] "Cycling"       "Fencing"       "Gymnastics"    "Shooting"      "Swimming"      "Tennis"
[7] "Track & Field" "Weightlifting" "Wrestling"

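For some reason, the order of the levels does not match the order in which they appear in the data frame. I am looking for a way, using the levels function, to get the levels ordered by first appearance, like this: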

levels(tt$sport)
[1] "Track & Field" "Cycling"       "Fencing"       "Gymnastics"    "Shooting"      "Swimming"
[7] "Tennis"        "Weightlifting" "Wrestling"

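One more thing to keep in mind: the sport column is not laid out in contiguous blocks. In the first 59 rows identical values sit next to each other, but that does not hold across the whole data frame.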
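
For context, the alphabetical order is the default behaviour of factor(), which sorts the unique values when no levels argument is given. The following is a minimal sketch on a made-up vector `x` (not the asker's data) whose values are deliberately not grouped into blocks; it contrasts the default with levels taken in order of first appearance.

```r
# Hypothetical toy data: values are not grouped into contiguous blocks.
x <- c("Track & Field", "Cycling", "Track & Field", "Fencing", "Cycling")

f_default <- factor(x)                      # levels default to the sorted unique values
f_inorder <- factor(x, levels = unique(x))  # levels in order of first appearance

levels(f_default)
levels(f_inorder)
```

Because unique() keeps only the first occurrence of each value, repeated or scattered values later in the column do not affect the resulting order.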

【Comments】:

    Tags: r r-factor


    【Solution 1】:

    I used the data frame that @gung set up in his answer:

    d <- read.table(text="rn    year    sport          event      athlete_id medal
    1   1896 'Track & Field'                   '100m Men'      'BURKETOM01'   'GOLD'
    53  1896       'Cycling'               '12-Hour Race'      'KEEPIFRA01' 'SILVER'
    54  1896       'Fencing'           'Foil, Individual'      'GRAVEEUG01'   'GOLD'
    55  1896       'Gymnastics'           'Foil, Individual'      'CALLOHEN01' 'SILVER'
    56  1896       'Shooting'           'Foil, Individual'      'PIERRPER01' 'BRONZE'
    57  1896       'Swimming'          'Sabre, Individual'      'GEORGIOA01'   'GOLD'
    58  1896       'Tennis'          'Sabre, Individual'      'KARAKTEL01' 'SILVER'
    58  1896       'Weightlifting'          'Sabre, Individual'      'KARAKTEL01' 'SILVER'
    59  1896       'Wrestling'          'Sabre, Individual'      'NIELSHOL01' 'BRONZE'", 
                header=TRUE, stringsAsFactors=TRUE)  # since R 4.0.0 strings are not converted to factors by default
    
    levels(d$sport)
    

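    Then you can use unique(d$sport) inside the factor function, like this. unique() returns the values in the order in which they first appear, so the levels end up in that order: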

    d$sport <- factor(d$sport, levels=unique(d$sport))
    # Check the results:
    levels(d$sport)
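
    Two side effects of this idiom may be worth checking on real data: the values themselves are not changed, but levels that never occur in the column are dropped, and NA does not become a level. A small sketch on a made-up factor (the `sport` vector and its unused "Tennis" level are assumptions for illustration only):

```r
# Hypothetical factor with alphabetical levels and one unused level ("Tennis").
sport <- factor(c("Track & Field", "Cycling", NA, "Track & Field", "Fencing"),
                levels = c("Cycling", "Fencing", "Tennis", "Track & Field"))

reordered <- factor(sport, levels = unique(sport))

levels(reordered)                    # order of first appearance, "Tennis" is gone
identical(as.character(sport),
          as.character(reordered))   # the underlying values are untouched
```

    If unused levels need to be kept, they would have to be appended to the levels argument explicitly.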
    

    【Comments】:

      【Solution 2】:

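      Note that I had to adjust your dataset so that all the levels you listed show up, and in the order you specified. From there, I wrote a simple function that outputs the levels in the order in which they first appear in the dataset. The key is to use which (lists the row numbers of the observations that meet a criterion), min (picks the smallest of them), and order (gives the ordering of the values from lowest to highest).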

      d <- read.table(text="rn    year    sport          event      athlete_id  medal
      1   1896 'Track & Field'                   '100m Men'      'BURKETOM01'   'GOLD'
      53  1896       'Cycling'               '12-Hour Race'      'KEEPIFRA01' 'SILVER'
      54  1896       'Fencing'           'Foil, Individual'      'GRAVEEUG01'   'GOLD'
      55  1896       'Gymnastics'           'Foil, Individual'      'CALLOHEN01' 'SILVER'
      56  1896       'Shooting'           'Foil, Individual'      'PIERRPER01' 'BRONZE'
      57  1896       'Swimming'          'Sabre, Individual'      'GEORGIOA01'   'GOLD'
      58  1896       'Tennis'          'Sabre, Individual'      'KARAKTEL01' 'SILVER'
      58  1896       'Weightlifting'          'Sabre, Individual'      'KARAKTEL01' 'SILVER'
      59  1896       'Wrestling'          'Sabre, Individual'      'NIELSHOL01' 'BRONZE'", 
                      header=TRUE, stringsAsFactors=TRUE)  # since R 4.0.0 strings are not converted to factors by default
      
      levels(d$sport)
      # [1] "Cycling"       "Fencing"       "Gymnastics"    "Shooting"     
      # [5] "Swimming"      "Tennis"        "Track & Field" "Weightlifting"
      # [9] "Wrestling"    
      
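      level.order <- function(var){
        l <- levels(var)
        o <- numeric(length(l))        # first row at which each level occurs
        for(i in seq_along(l)){        # seq_along() is safe when there are no levels
          o[i] <- min(which(var==l[i]))
        }
        return(l[order(o)])            # levels sorted by first occurrence
      }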
      level.order(d$sport)
      # [1] "Track & Field" "Cycling"       "Fencing"       "Gymnastics"   
      # [5] "Shooting"      "Swimming"      "Tennis"        "Weightlifting"
      # [9] "Wrestling"    
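
      The same which/min/order idea can be written without a loop, because match() returns the position of the first match for each level. This is only a sketch; `level_order_vec` is a hypothetical name, and the toy factor `f` (with an unused "Tennis" level) is made up to show how unused levels are handled.

```r
# Hypothetical helper name; a loop-free variant of level.order().
level_order_vec <- function(var) {
  l <- levels(var)
  first_seen <- match(l, var)   # index of the first row holding each level
  l[order(first_seen)]          # NA (unused levels) sort to the end
}

f <- factor(c("Track & Field", "Cycling", "Track & Field", "Fencing"),
            levels = c("Cycling", "Fencing", "Tennis", "Track & Field"))
level_order_vec(f)
```

      Unlike factor(f, levels = unique(f)), this variant keeps unused levels and places them last.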
      

      From here, if you want to change the default order (alphabetical) to the order in which the levels appear in the dataset, you can use factor. Consider:

      levels(d$sport)
      # [1] "Cycling"       "Fencing"       "Gymnastics"    "Shooting"     
      # [5] "Swimming"      "Tennis"        "Track & Field" "Weightlifting"
      # [9] "Wrestling"    
      d$sport <- factor(d$sport, levels=level.order(d$sport))
      levels(d$sport)
      # [1] "Track & Field" "Cycling"       "Fencing"       "Gymnastics"   
      # [5] "Shooting"      "Swimming"      "Tennis"        "Weightlifting"
      # [9] "Wrestling"    
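
      The level order matters downstream because summaries such as table() report groups in level order, and the integer codes behind the factor follow it too. A brief sketch on a made-up vector `x` (not the data above):

```r
# Hypothetical toy data, levels set in order of first appearance.
x <- c("Track & Field", "Cycling", "Track & Field", "Fencing")
f <- factor(x, levels = unique(x))

tab <- table(f)
names(tab)       # groups are reported in level order
as.integer(f)    # integer codes follow the new level order
```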
      

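      【Comments】:

      • Instead of the level.order() function, you could also use d$sport <- factor(d$sport, levels = unique(d$sport))
      • Good point, @KenS. I didn't know that unique() always lists the values in the order in which they appear. Why don't you post that as an official answer?
      • You're welcome, @Lee. If KenS. posts an answer using unique, you should probably move the check mark to him. It would be a simpler and more elegant solution than the one I came up with.
      • @Lee, it's your call of course, but you may want to consider switching the check mark to KenS.'s answer, which is more concise than my solution.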