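【Question Title】: Replace text using values from a lookup table without a for loop
【Posted】: 2021-01-04 19:46:16
【Question Description】:

I am writing a spelling-correction function. I scraped the spelling variants page from Wikipedia and converted it into a table. I now want to use it as a lookup table (spellings) and replace the values in my documents (skills.db). Note: the skills data frame below is only a sample. Ignore the second column. I will do the spelling correction much earlier in the résumé-processing pipeline. The résumés are large, so I thought I would share this instead.

I can do this with the for loop below, but I would like to know whether there is a better solution.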

spellings = structure(list(preferred_spellings = c("organisation", "acknowledgement", 
"cypher", "anaesthesia", "analyse"), other_spellings = c(" organization", 
" acknowledgment", " cipher", " anesthesia", " analyze")), row.names = c(NA, 
5L), class = "data.frame")

skills.db = structure(list(skills = c("variance analysis static", "analyze kpi", 
"financial analysis", "variance analysis", "organizational", 
"analysis", "organize", "result analysis", "analytic", "datum analysis", 
"analytics", "business analysis", "organized", "quantitative analysis", 
"train need analysis", "analytic think", "analysis trial preparation", 
"analyze statue", "google analytics", "service analysis", "organize individual", 
"account analysis", "analyze department work", "pareto analysis train", 
"organization", "ratio analysis", "statistical analysis", "project organization", 
"organize client's file", "with good analytic", "nielsen analytics", 
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics", 
"market analysis", "analyse", "analytic skill", "superb analytic", 
"financial statement analysis", "credit analysis", "quick analysis", 
"organizational development", "outstanding financial analytic", 
"organization design", "organize conference", "business analytics", 
"industry analysis", "fs analysis", "analyze", "cash flow analysis", 
"investment analysis", "technical analysis bloomberg", "community organize", 
"monthly financial analysis", "expense variance analysis", "stock analysis"
), level1 = c("variance analysis static", "analyze kpi", "financial analysis", 
"variance analysis", "organizational", "analysis", "organize", 
"result analysis", "analytic", "datum analysis", "analytics", 
"business analysis", "organized", "quantitative analysis", "train need analysis", 
"analytic think", "analysis trial preparation", "analyze statue", 
"google analytics", "service analysis", "organize individual", 
"account analysis", "analyze department work", "pareto analysis train", 
"organization", "ratio analysis", "statistical analysis", "project organization", 
"organize client's file", "with good analytic", "nielsen analytics", 
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics", 
"market analysis", "analyse", "analytic skill", "superb analytic", 
"financial statement analysis", "credit analysis", "quick analysis", 
"organizational development", "outstanding financial analytic", 
"organization design", "organize conference", "business analytics", 
"industry analysis", "fs analysis", "analyze", "cash flow analysis", 
"investment analysis", "technical analysis bloomberg", "community organize", 
"monthly financial analysis", "expense variance analysis", "stock analysis"
)), row.names = c(49L, 65L, 77L, 82L, 155L, 190L, 215L, 244L, 
246L, 260L, 287L, 300L, 311L, 323L, 349L, 356L, 378L, 386L, 447L, 
607L, 622L, 664L, 686L, 766L, 824L, 832L, 895L, 922L, 928L, 949L, 
1020L, 1054L, 1079L, 1080L, 1081L, 1088L, 1146L, 1158L, 1228L, 
1248L, 1319L, 1366L, 1385L, 1440L, 1468L, 1475L, 1509L, 1554L, 
1584L, 1606L, 1635L, 1658L, 1660L, 1696L, 1760L, 1762L, 1798L
), class = "data.frame")

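# Seed TEST with the original text so each pass edits the running result
library(dplyr)
skills.db = skills.db %>% mutate(TEST = skills)
for(i in seq_len(nrow(spellings))){
    skills.db = skills.db %>% mutate(TEST = gsub(spellings$other_spellings[i], spellings$preferred_spellings[i], TEST))
}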

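【Question Discussion】:

  • I would probably start with names(spellings)[1] <- "preferred_spellings" ;-)
  • @r2evans Good catch. That is exactly why I need this function :D
  • Also, do you really intend to have a leading space in front of all the replacement words?
  • @r2evens No. I will run trims() on the spellings to remove the extra spaces.
  • A bit OCD, @marc_s? :-)

Tags: r text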


【Solution 1】:

Here is one approach, using Reduce (it could just as easily be purrr::reduce) to iterate over each spelling and correct it.

spellings_list <- asplit(spellings, 1)
skills.db %>%
  mutate(TEST = Reduce(function(txt, spl) gsub(spl[2], spl[1], txt), spellings_list, init = skills), changed = (skills != TEST))
#                             skills                          level1                            TEST changed
# 1         variance analysis static        variance analysis static        variance analysis static   FALSE
# 2                      analyze kpi                     analyze kpi                     analyse kpi    TRUE
# 3               financial analysis              financial analysis              financial analysis   FALSE
# 4                variance analysis               variance analysis               variance analysis   FALSE
# 5                   organizational                  organizational                  organisational    TRUE
# 6                         analysis                        analysis                        analysis   FALSE
# 7                         organize                        organize                        organize   FALSE
# 8                  result analysis                 result analysis                 result analysis   FALSE
# 9                         analytic                        analytic                        analytic   FALSE
# 10                  datum analysis                  datum analysis                  datum analysis   FALSE
# 11                       analytics                       analytics                       analytics   FALSE
# 12               business analysis               business analysis               business analysis   FALSE
# 13                       organized                       organized                       organized   FALSE
# 14           quantitative analysis           quantitative analysis           quantitative analysis   FALSE
# 15             train need analysis             train need analysis             train need analysis   FALSE
# 16                  analytic think                  analytic think                  analytic think   FALSE
# 17      analysis trial preparation      analysis trial preparation      analysis trial preparation   FALSE
# 18                  analyze statue                  analyze statue                  analyse statue    TRUE
# 19                google analytics                google analytics                google analytics   FALSE
# 20                service analysis                service analysis                service analysis   FALSE
# 21             organize individual             organize individual             organize individual   FALSE
# 22                account analysis                account analysis                account analysis   FALSE
# 23         analyze department work         analyze department work         analyse department work    TRUE
# 24           pareto analysis train           pareto analysis train           pareto analysis train   FALSE
# 25                    organization                    organization                    organisation    TRUE
# 26                  ratio analysis                  ratio analysis                  ratio analysis   FALSE
# 27            statistical analysis            statistical analysis            statistical analysis   FALSE
# 28            project organization            project organization            project organisation    TRUE
# 29          organize client's file          organize client's file          organize client's file   FALSE
# 30              with good analytic              with good analytic              with good analytic   FALSE
# 31               nielsen analytics               nielsen analytics               nielsen analytics   FALSE
# 32                 datum analytics                 datum analytics                 datum analytics   FALSE
# 33               textual analytics               textual analytics               textual analytics   FALSE
# 34                social analytics                social analytics                social analytics   FALSE
# 35 business intelligence analytics business intelligence analytics business intelligence analytics   FALSE
# 36                 market analysis                 market analysis                 market analysis   FALSE
# 37                         analyse                         analyse                         analyse   FALSE
# 38                  analytic skill                  analytic skill                  analytic skill   FALSE
# 39                 superb analytic                 superb analytic                 superb analytic   FALSE
# 40    financial statement analysis    financial statement analysis    financial statement analysis   FALSE
# 41                 credit analysis                 credit analysis                 credit analysis   FALSE
# 42                  quick analysis                  quick analysis                  quick analysis   FALSE
# 43      organizational development      organizational development      organisational development    TRUE
# 44  outstanding financial analytic  outstanding financial analytic  outstanding financial analytic   FALSE
# 45             organization design             organization design             organisation design    TRUE
# 46             organize conference             organize conference             organize conference   FALSE
# 47              business analytics              business analytics              business analytics   FALSE
# 48               industry analysis               industry analysis               industry analysis   FALSE
# 49                     fs analysis                     fs analysis                     fs analysis   FALSE
# 50                         analyze                         analyze                         analyse    TRUE
# 51              cash flow analysis              cash flow analysis              cash flow analysis   FALSE
# 52             investment analysis             investment analysis             investment analysis   FALSE
# 53    technical analysis bloomberg    technical analysis bloomberg    technical analysis bloomberg   FALSE
# 54              community organize              community organize              community organize   FALSE
# 55      monthly financial analysis      monthly financial analysis      monthly financial analysis   FALSE
# 56       expense variance analysis       expense variance analysis       expense variance analysis   FALSE
# 57                  stock analysis                  stock analysis                  stock analysis   FALSE

I added changed purely as a litmus test, on the assumption that you know which inputs should differ.
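The lookup table in the question still carries leading spaces in other_spellings, and a pattern such as " analyze" cannot match at the very start of a string, so the output above presumably reflects trimmed values. Below is a minimal base-R sketch of the same Reduce approach on a small subset, assuming the trimming is done with trimws(). The two-row lookup table and the three-element skills vector are illustrative stand-ins, not the full data.

```r
# Tiny stand-ins for the question's spellings and skills.db (assumed subset)
spellings <- data.frame(
  preferred_spellings = c("organisation", "analyse"),
  other_spellings     = c(" organization", " analyze"),
  stringsAsFactors    = FALSE
)
skills <- c("analyze kpi", "project organization", "financial analysis")

# Strip the leading spaces so a pattern can match at the start of a string
spellings$other_spellings <- trimws(spellings$other_spellings)

# One c(preferred, other) character vector per lookup row
spellings_list <- asplit(spellings, 1)

# Fold the corrections over the text: each gsub() sees the previous result
fixed <- Reduce(
  function(txt, spl) gsub(spl[2], spl[1], txt),
  spellings_list,
  init = skills
)

result <- data.frame(skills, fixed, changed = skills != fixed)
result
```

Only the first two rows should come back with changed set to TRUE; "financial analysis" contains neither variant and is left alone.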

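Walkthrough:

  1. Reduce walks through the entire skills column once for each spelling correction. The input to one iteration of its function is the output of the previous iteration, which is exactly the property we need in order to retain the changes.

  2. Unfortunately, we cannot easily use Vectorize here, and Reduce generally prefers a simple two-argument function (not easily Map-able), so I split the spellings frame into a list of length-2 vectors: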

    spellings_list <- asplit(spellings, 1)
    spellings_list
    # $`1`
    # preferred_spellings     other_spellings 
    #      "organisation"     " organization" 
    # $`2`
    # preferred_spellings     other_spellings 
    #   "acknowledgement"   " acknowledgment" 
    # $`3`
    # preferred_spellings     other_spellings 
    #            "cypher"           " cipher" 
    # $`4`
    # preferred_spellings     other_spellings 
    #       "anaesthesia"       " anesthesia" 
    # $`5`
    # preferred_spellings     other_spellings 
    #           "analyse"          " analyze" 
    

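    This makes it easier to use gsub(spl[2], spl[1], ...)

  3. The art of Reduce lies in knowing which argument to use where, and when to use init=. It is an art. When I found myself unsure of what was being fed where, I inserted a browser() at the start of the anon-func and stepped through a few iterations of the reduction.

  4. Suggestion: you may want to sandwich each other_spellings string between \\b on either side, to prevent partial-match replacements. For example, your spellings will also replace organizational, even though it is not actually in the list. While that may be desired, depending on your larger list it could easily produce false positives. (For example, color/colour and Colorado.)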
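The word-boundary suggestion can be sketched as follows. This is an illustration under assumptions: the color/colour row is hypothetical (it does not appear in the sample lookup table), replace_all is a helper name made up for this example, and the text is lowercase to match the skills data.

```r
# Hypothetical lookup rows, chosen to show the partial-match problem
preferred <- c("organisation", "colour")
other     <- c("organization", "color")
txt <- c("organization design", "organizational", "color chart", "colorado")

# Illustrative helper: fold each other -> preferred pair over the text
replace_all <- function(txt, patterns) {
  Reduce(function(acc, i) gsub(patterns[i], preferred[i], acc),
         seq_along(patterns), init = txt)
}

# Plain patterns also rewrite the inside of longer words
loose <- replace_all(txt, other)

# \\b on both sides restricts each pattern to a whole word
strict <- replace_all(txt, paste0("\\b", other, "\\b"))

data.frame(txt, loose, strict)
```

With the plain patterns, "organizational" and "colorado" are rewritten as well. With the \\b-wrapped patterns they are left untouched, so whether to wrap depends on whether derived forms such as organizational should be corrected too.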

(Edited: I originally had spl[1] and spl[2] swapped in the gsub. Apparently there is some "logic" to this art too :-)
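As a non-interactive alternative to stepping through with browser(), Reduce can return every intermediate value via accumulate = TRUE, which makes it easy to see what is fed where. A small sketch, assuming two already-trimmed lookup rows written out by hand as c(preferred, other) pairs:

```r
# Two trimmed lookup rows as c(preferred, other) pairs (illustrative subset)
pairs <- list(
  c("organisation", "organization"),
  c("analyse", "analyze")
)

# accumulate = TRUE keeps the init value plus the result after every step
steps <- Reduce(
  function(txt, spl) gsub(spl[2], spl[1], txt),
  pairs,
  "analyze organization",
  accumulate = TRUE
)
steps
```

The first element is the init value, the second is the text after the organization correction, and the third is the text after both corrections, which shows each iteration receiving the previous iteration's output.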

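【Discussion】:

  • The logic makes sense to me. I will test it on the résumés. I like how Reduce uses the output of the previous iteration, which is why I could not use the apply family of functions instead of a for loop.
  • Thanks for the suggestion. I will test it on my dataset.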