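【Posted】:2016-11-14 00:15:37
【Question】:
My two data frames share a character column. Joining them on this column with dplyr::full_joint would be easy, but the problem is that the common column has subtle but noticeable spelling differences. The differences are small compared with the length of each string that defines a skill: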
Skill          Grade_Judge_A
pack & ship    1
pack & store   5
sell           3
Design a room  9

Skill           Grade_Judge_B
pack and store  3
pack & ship     7
sell            2
Design room     6
How can I get the desired output below?
Skill          Grade_Judge_A  Grade_Judge_B
pack & ship    1              3
pack & store   5              7
sell           3              2
Design a room  9              6
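I am thinking of matching rows in the two data frames by the distance between the strings in the Skill column, for example with the stringdist package. If two strings differ only slightly, they describe the same skill.
I would prefer a dplyr/tidyverse solution.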
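The distance idea can be sketched roughly as follows. This is only an illustration on the toy skills above: the vector names a and b are hypothetical, and lowercasing plus the Jaro-Winkler method ("jw") are assumptions, not something settled in the question.

```r
library(stringdist)

# Toy skill names from the two example tables (hypothetical vector names)
a <- c("pack & ship", "pack & store", "sell", "Design a room")
b <- c("pack and store", "pack & ship", "sell", "Design room")

# Pairwise distances: rows follow `a`, columns follow `b`.
# Lowercasing first so that case differences do not count as distance.
d <- stringdistmatrix(tolower(a), tolower(b), method = "jw")

# For each skill in `a`, pick the spelling in `b` with the smallest distance
closest <- b[apply(d, 1, which.min)]
data.frame(skill_a = a, closest_in_b = closest)
```

Picking the minimum always returns some match, so a distance threshold would still be needed to reject skills that have no real counterpart.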
Here is the actual dput output for data frame A:
> dput(df_A)
structure(list(skill = c(" [Assess abdomen for a floating mass]",
" [Assess Nerve Root Compression]", " [Evaluate breathing effort (rate, patterns, chest expansions)]",
" [Evaluate Plantar Reflex/Babinski sign]", " [Evaluate Speech]",
" [External palpation of a uterus]", " [Heel to Shin test]",
" [Inspect anterior chamber of eye with ophthalmoscope or penlight]",
" [Inspect breast]", " [Inspect Overall Skin Color/Tone]", " [Inspect Skin Lesions]",
" [Inspect Wounds]", " [Mental Status/level of consciousness]",
" [Nose/index finger]", " [Percuss abdomen to determine spleen size]",
" [Percuss costovertebral angle for kidney tenderness]", " [Percuss for diaphragmatic excursion]",
" [Percuss the abdomen for abdominal tones]", " [Percuss the abdomen to determine liver span]"
), `2016-09-17 13:41:08` = c(1, 1, 5, 3, 4, 0, 4, 3, 3, 5, 4,
5, 5, 3, 1, 1, 2, 4, 1)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-17 13:41:08"
))
And for data frame B:
> dput(df_B)
structure(list(skill = c(" [Assess abdomen for floating mass]",
" [Assess nerve root compression]", " [Evaluate breathing effort (rate, patterns, chest expansion)]",
" [Evaluate plantar reflex/Babinski sign]", " [Evaluate speech]",
" [External palpation of uterus]", " [Heel to shin test]", " [Inspect anterior chamber of the eye with opthalmoscope or penlight]",
" [Inspect breasts]", " [Inspect overall skin color/tone]", " [Inspect skin lesions]",
" [Inspect wounds]", " [Mental status/level of consciousness]",
" [Nose/Index finger]", " [Percuss costovertebral angle for kidney tenderness]",
" [Percuss for diaphragmatic excursion]", " [Percuss the abdomen for abdominal tones]",
" [Percuss the abdomen to determine liver span]", " [Percuss the abdomen to determine spleen size]"
), `2016-09-21 07:58:43` = c(0, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-21 07:58:43"
))
Here is the head of each data frame:
> head(df_A)
# A tibble: 6 × 2
skill `2016-09-17 13:41:08`
<chr> <dbl>
1 [Assess abdomen for a floating mass] 1
2 [Assess Nerve Root Compression] 1
3 [Evaluate breathing effort (rate, patterns, chest expansions)] 5
4 [Evaluate Plantar Reflex/Babinski sign] 3
5 [Evaluate Speech] 4
6 [External palpation of a uterus] 0
And the second:
> head(df_B)
# A tibble: 6 × 2
skill `2016-09-21 07:58:43`
<chr> <dbl>
1 [Assess abdomen for floating mass] 0
2 [Assess nerve root compression] 2
3 [Evaluate breathing effort (rate, patterns, chest expansion)] 2
4 [Evaluate plantar reflex/Babinski sign] 2
5 [Evaluate speech] 2
6 [External palpation of uterus] 1
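【Comments】:
- You could resolve the systematic differences before merging (e.g., replace "&" with "and" and convert all strings to lowercase). But are there also random misspellings?
- Also, if you want to merge with the cosmic consciousness, full_joint might be appropriate, but full_join will be more effective when applied to data.
- I don't know which kinds of misspellings will occur. What is known is that they are small relative to the length of the string. The strings defining each skill are 20 to 60 characters long, but the differences are minor, such as a missing article. Could stringdist be used to pair up the skills? If the string distance is small, assume it is the same skill.
- Yes, stringdist can probably resolve most of the issues, although you may run into cases where correctly spelled skills are similar to each other. Can you provide a larger data sample? Use dput to provide the sample data.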
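A minimal sketch of how the two suggestions in the comments could be combined, shown on the toy tables from the question. The helper normalize_skill, the Jaro-Winkler method, and the maxDist threshold of 0.1 are assumptions chosen for illustration; the threshold in particular would need tuning on real data.

```r
library(dplyr)
library(stringdist)

df_A <- tibble(Skill = c("pack & ship", "pack & store", "sell", "Design a room"),
               Grade_Judge_A = c(1, 5, 3, 9))
df_B <- tibble(Skill = c("pack and store", "pack & ship", "sell", "Design room"),
               Grade_Judge_B = c(3, 7, 2, 6))

# Hypothetical helper: remove systematic differences before comparing
normalize_skill <- function(x) {
  x <- tolower(trimws(x))
  gsub("&", "and", x, fixed = TRUE)
}

# For each row of df_B, the index of the closest df_A skill,
# or NA when nothing lies within maxDist (assumed threshold)
idx <- amatch(normalize_skill(df_B$Skill), normalize_skill(df_A$Skill),
              method = "jw", maxDist = 0.1)

# Adopt df_A's spelling where a match was found, keep df_B's otherwise
df_B_matched <- df_B %>%
  mutate(Skill = coalesce(df_A$Skill[idx], Skill))

joined <- full_join(df_A, df_B_matched, by = "Skill")
joined
```

With the sample values as posted, this pairs "pack & ship" with a Grade_Judge_B of 7 and "pack & store" with 3. If two rows of df_B map to the same df_A skill, the join produces duplicate rows, so it is worth checking idx for repeats before trusting the result.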
Tags: r dataframe dplyr tidyverse