【Question title】: Plotting character and numeric columns from a data frame and mapping in R
【Posted】: 2020-10-02 15:28:13
【Question description】:

So I have a medium-sized data set of 113K rows x 14 columns:

Month District   Age Gender Education Disability Religion                          Occupation JobSeekers
1 2020-01      Dan   U17   Male      None       None   Jewish              Unprofessional workers          2
2 2020-01      Dan   U17   Male      None       None  Muslims          Sales and costumer service          1
3 2020-01      Dan   U17 Female      None       None    Other                           Undefined          1
4 2020-01      Dan 18-24   Male      None       None   Jewish         Production and construction          1
5 2020-01      Dan 18-24   Male      None       None   Jewish                     Academic degree          1
6 2020-01      Dan 18-24   Male      None       None   Jewish Practical engineers and technicians          1
  GMI ACU NACU NewSeekers NewFiredSeekers
1   0   0    2          0               0
2   0   0    1          0               0
3   0   0    1          0               0
4   0   0    1          0               0
5   0   0    1          0               0
6   0   0    1          1               1

I grouped it into a smaller table that contains only the data I need to work with:

Sorta <- datac %>% 
  group_by(District, Month,Gender, Occupation) %>% 
  summarise(JobSeekers=sum(JobSeekers))

Result:

  District Month   Gender Occupation                    JobSeekers   GMI   ACU  NACU NewSeekers NewFiredSeekers
  <chr>    <chr>   <chr>  <chr>                              <int> <int> <int> <int>      <int>           <int>
1 Dan      2020-01 Female Academic degree                     4560   120  2622  1818        863             597
2 Dan      2020-01 Female Agriculture, forestry and fi~         14     9     2     3          1               0
3 Dan      2020-01 Female Machine Operators and drivers         57     6    10    41          9               6
4 Dan      2020-01 Female Managers                            1913    36   969   908        390             310
5 Dan      2020-01 Female Officials and clerks                1702   120   263  1319        344             243
6 Dan      2020-01 Female Practical engineers and tech~       2847    66  1125  1656        671             504

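I then tried to plot data from this table that should show trends, such as the number of unemployed people by district, a timeline showing how unemployment grows over time, and so on. Every time I try, I get various errors about the character columns. So I am asking for your help with plotting character and numeric values together.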

The structure is as follows:

structure(
  list(
    District = c(
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan"
    ),
    Month = c(
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01"
    ),
    Gender = c(
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male"
    ),
    Occupation = c(
      "Academic degree",
      "Agriculture, forestry and fishing",
      "Machine Operators and drivers",
      "Managers",
      "Officials and clerks",
      "Practical engineers and technicians",
      "Production and construction",
      "Sales and costumer service",
      "Undefined",
      "Unprofessional workers",
      "Academic degree",
      "Agriculture, forestry and fishing",
      "Machine Operators and drivers",
      "Managers",
      "Officials and clerks",
      "Practical engineers and technicians",
      "Production and construction",
      "Sales and costumer service",
      "Undefined",
      "Unprofessional workers"
    ),
    JobSeekers = c(
      4560L,
      14L,
      57L,
      1913L,
      1702L,
      2847L,
      480L,
      3086L,
      893L,
      1985L,
      2605L,
      44L,
      1276L,
      2236L,
      247L,
      2249L,
      1258L,
      2233L,
      924L,
      2462L
    ),
    GMI = c(
      120L,
      9L,
      6L,
      36L,
      120L,
      66L,
      47L,
      396L,
      155L,
      998L,
      119L,
      26L,
      240L,
      101L,
      30L,
      111L,
      322L,
      359L,
      309L,
      1124L
    ),
    ACU = c(
      2622L,
      2L,
      10L,
      969L,
      263L,
      1125L,
      99L,
      392L,
      259L,
      52L,
      1549L,
      1L,
      49L,
      797L,
      44L,
      829L,
      102L,
      202L,
      124L,
      58L
    ),
    NACU = c(
      1818L,
      3L,
      41L,
      908L,
      1319L,
      1656L,
      334L,
      2298L,
      479L,
      935L,
      937L,
      17L,
      987L,
      1338L,
      173L,
      1309L,
      834L,
      1672L,
      491L,
      1280L
    ),
    NewSeekers = c(
      863L,
      1L,
      9L,
      390L,
      344L,
      671L,
      83L,
      622L,
      201L,
      325L,
      550L,
      5L,
      239L,
      469L,
      53L,
      525L,
      233L,
      432L,
      212L,
      324L
    ),
    NewFiredSeekers = c(
      597L,
      0L,
      6L,
      310L,
      243L,
      504L,
      60L,
      375L,
      123L,
      150L,
      447L,
      4L,
      196L,
      405L,
      41L,
      429L,
      162L,
      316L,
      124L,
      190L
    )
  ),
  row.names = c(NA,-20L),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  groups = structure(
    list(
      District = c("Dan", "Dan"),
      Month = c("2020-01", "2020-01"),
      Gender = c("Female", "Male"),
      .rows = list(1:10, 11:20)
    ),
    row.names = c(NA,-2L),
    class = c("tbl_df", "tbl", "data.frame"),
    .drop = TRUE
  )
)

The second question is how to make a map of "hot spot" areas by unemployed people / occupation / age.

Please help!

Update:

dist.oc.mo <- Cdata %>% 
  group_by(District,Gender,Occupation,Month) %>% 
  summarise(JobSeekers=sum(JobSeekers),GMI=sum(GMI), ACU=sum(ACU), NACU=sum(NACU), NewSeekers=sum(NewSeekers), NewFiredSeekers=sum(NewFiredSeekers))


p <- ggplot(data = dist.oc.mo) +
  geom_bar(mapping = aes(x = Occupation, y = JobSeekers, fill=factor(District)), 
           stat = "identity", position = "dodge", alpha=0.7 ) + 
  labs(title = "March-April Jobseekers", subtitle = "This barchart describes unemployment trend for March and April sorted by jobseekers number and occupation type", fill = "District", 
       x = "Occupation", y = "JobSeekers") +
  scale_x_discrete(labels = wrap_format(10)) +
  scale_fill_brewer(palette="Set1") +
  theme(legend.position = "bottom")
p

[https://i.stack.imgur.com/v0R0V.jpg][1]

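Regards, Moses

【Comments】:

  • It would be helpful if you could provide a reprex (reproducible example) to replicate your problem. Please see the link: stackoverflow.com/questions/5963269/…
  • Hi @YBS, thanks for your attention. I pasted rows 1:20 from the table; by the way, in case it makes a difference, it shows up as a tibble. I hope you can help me, thanks again!
  • What type of plot are you looking for? Is a bar chart sufficient, or do you have something specific in mind?
  • I need a bar chart and a "hot spot map" example showing the districts with large unemployed populations. Thank you very much for your help!

Tags: r database ggplot2 plot statistics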


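【Solution 1】:

Consider your data as df. I then added some dummy districts named Bob and John. Also, I only considered the first 5 occupations for this example. The bar chart code is as follows: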

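library(dplyr)    # mutate(), row_number()
library(ggplot2)  # ggplot(), geom_bar(), ...
library(scales)   # wrap_format()

myoccupation = c(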
    "Academic degree",
    "Agriculture, forestry and fishing",
    "Machine Operators and drivers",
    "Managers",
    "Officials and clerks")

  df1 <- mutate(df, District="Bob", JobSeekers=(JobSeekers+50*row_number()*row_number()),
                GMI=(GMI-row_number()), ACU=(ACU+row_number()), NACU=(NACU-row_number()),
                NewSeekers=(NewSeekers+row_number()), NewFiredSeekers=(NewFiredSeekers+row_number()))
  df2 <- mutate(df, District="John", JobSeekers=(JobSeekers+88*row_number()),
                GMI=(GMI-row_number()+25), ACU=(ACU+5*row_number()-1), NACU=(NACU-row_number()+1),
                NewSeekers=(NewSeekers+row_number()-2), NewFiredSeekers=(NewFiredSeekers+row_number()-3))         

  df3 <- rbind(df,df1,df2)
  df4 <- df3[df3$Occupation %in% myoccupation,]

  p <- ggplot(data = df4) +
    geom_bar(mapping = aes(x = Occupation, y = JobSeekers, fill=factor(District)), 
             stat = "identity", position = "dodge", alpha=0.7 ) + 
    labs(title = "Bar Chart", fill = "District", 
         x = "Occupation", y = "JobSeekers") +
    scale_x_discrete(labels = wrap_format(10)) +
    scale_fill_brewer(palette="Set1") +
    theme(legend.position = "bottom")
  p

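You will get the following output:

Note that in this plot the male and female bars overlap each other. The darker shade is the lower of the two values. You can plot them separately.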
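One way to plot the genders separately is to give each gender its own panel. The sketch below is only an illustration: the `toy` data frame is hypothetical (the "Dan" values are taken from the question's table, the "Bob" values are made up), and `facet_wrap()` is just one of several options.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the question's table.
# "Bob" is a dummy district with invented values.
toy <- data.frame(
  District   = rep(c("Dan", "Bob"), each = 4),
  Gender     = rep(c("Female", "Male"), times = 4),
  Occupation = rep(rep(c("Managers", "Officials and clerks"), each = 2), times = 2),
  JobSeekers = c(1913, 2236, 1702, 247, 2113, 2436, 1902, 447)
)

p_gender <- ggplot(toy, aes(x = Occupation, y = JobSeekers, fill = District)) +
  geom_col(position = "dodge", alpha = 0.7) +
  facet_wrap(~ Gender) +  # one panel per gender, so the bars no longer overlap
  theme(legend.position = "bottom")
p_gender
```

With real data, replacing `toy` by the summarised table should be enough, as long as it has a `Gender` column.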

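For hot spots, you need to do some research on density plots and similar techniques. You need to use the raw data, not the summarised data. An example 2D density plot is given below: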

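library(ggplot2)
library(tibble)
library(viridis)  # scale_fill_viridis()

dfa <- tibble(x_variable = rnorm(5000), y_variable = rnorm(5000))
p2d <- ggplot(dfa, aes(x = x_variable, y = y_variable)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  stat_density2d(aes(fill = after_stat(density)), contour = FALSE, geom = "tile") +
  scale_fill_viridis()
p2d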
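If only summarised counts are available, a tile heat map of District against Occupation can give a rough "hot spot" view. This is a different technique from the density plot above, shown here only as a sketch: the `heat` data frame is hypothetical (the "Dan" totals are the female and male counts from the question added together, the "Bob" values are made up).

```r
library(ggplot2)

# Hypothetical summarised counts; "Bob" is a dummy district with invented values
heat <- expand.grid(
  District   = c("Dan", "Bob"),
  Occupation = c("Academic degree", "Managers", "Undefined"),
  stringsAsFactors = FALSE
)
# expand.grid() varies District fastest: Dan, Bob, Dan, Bob, Dan, Bob
heat$JobSeekers <- c(7165, 7400, 4149, 4300, 1817, 1900)

p_heat <- ggplot(heat, aes(x = District, y = Occupation, fill = JobSeekers)) +
  geom_tile(colour = "white") +
  scale_fill_viridis_c()  # continuous viridis scale that ships with ggplot2
p_heat
```

Brighter tiles mark the District and Occupation combinations with the most job seekers.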

Note that on SO we can only answer questions about your R code.

Update: subsetting the data to include only females

df5 <- subset(df4, Gender=="Female")

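Then use df5 in the ggplot code above, and you will get the following output:

Note that I assigned the colors manually with scale_fill_manual(values=c("blue","green","purple")), because I know there are 3 districts in my data.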
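A named vector makes the color assignment explicit, so each district keeps its color even if the order of the levels changes. A minimal sketch, assuming the three district names used in this answer; the `toy` data frame and its values are illustrative only.

```r
library(ggplot2)

# One color per district; the names must match the values in the District column
district_colours <- c(Dan = "blue", Bob = "green", John = "purple")

# Illustrative values only
toy <- data.frame(
  District   = names(district_colours),
  JobSeekers = c(4560, 4610, 4648)
)

p_col <- ggplot(toy, aes(x = District, y = JobSeekers, fill = District)) +
  geom_col(alpha = 0.7) +
  scale_fill_manual(values = district_colours)
p_col
```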

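【Discussion】:

  • Hi @YBS, I get an error when trying to run this code, saying the wrap_format function is unknown: "Error in wrap_format(10) : could not find function "wrap_format"". Two more questions, if you don't mind: 1. How do I use this in general to plot other values (such as x = District, y = NewSeekers)? I understand everything except the row_number() function and the numbers 50, 88, etc. 2. What does the District bar (the red one) actually represent? Sorry if my questions are silly. I am a student trying to complete an above-average project in a data science course. Thank you very much for your attention and work!
  • I created dummy data for the districts Bob and John. To create the dummy data, I used row_number() to change the values of some variables. row_number() is the row number in the data frame. In your plot, you should use your own data frame instead of df4. You will then have the districts that exist in your real data. If you only have one district, that should be fine too. You are probably missing a package, which causes that error. Please install and load the package named scales, which provides wrap_format().
  • Hi @YBS, I hope you remember this case. I managed to use your plot as described, but I cannot get rid of the red District bar. It does not actually represent anything. How do I do that? One more thing: if the Gender column is not mentioned in the code, where do the darker and lighter bars come from? Let me thank you again for your help. I have accepted the answer; the approach you provided works fine, and I appreciate it!
  • If red is the first color in the default palette, you will get that red no matter which district you have. If you want different colors, you can specify them. Since District is the fill factor, you can specify one color or as many colors as there are districts. Just add scale_fill_manual(values=c("blue", "green")) ## this is for two colors, if you have two districts
  • I added an example and a visual from my code so you can see what I am trying to explain. There is another category called "District", which seems to compute the sum of the other occupations and display it. I want to get rid of this value and keep the rest as it is.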
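
The thread asks what row_number() does and how to plot other columns, such as x = District and y = NewSeekers. The sketch below is an illustration under assumptions: `df_small` is a hypothetical three-row extract of the question's data, and "Bob" is a dummy district built the same way as in the answer.

```r
library(dplyr)
library(ggplot2)

# Hypothetical three-row extract of the question's data
df_small <- tibble(
  District   = "Dan",
  Occupation = c("Academic degree", "Managers", "Officials and clerks"),
  NewSeekers = c(863L, 390L, 344L)
)

# row_number() is 1, 2, 3, ... so each row gets a slightly different offset
dummy <- mutate(df_small, District = "Bob", NewSeekers = NewSeekers + row_number())

both <- bind_rows(df_small, dummy)

# One total per district, so each district gets exactly one bar
totals <- both %>%
  group_by(District) %>%
  summarise(NewSeekers = sum(NewSeekers), .groups = "drop")

p_new <- ggplot(totals, aes(x = District, y = NewSeekers)) +
  geom_col()
p_new
```

On a grouped data frame such as the one in the question, row_number() restarts at 1 within each group, so the offsets repeat per group.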