Switch to new row based on string and extract info within parentheses in R: answers

【Question title】: Switch to new row based on string and extract info within parentheses in R
【Posted】: 2021-05-04 13:12:00
【Question description】:

I have a string that looks like the following:

[1] "1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/Rosemount) — 19th May 2011 (Dyce/Bucksburn/Danestone) — 23rd June 2011 (Airyhall/Broomhill/Garthdee) — 30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove) — 1st October 2015 (George St/Harbour; Midstocket/Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/Ferryhill) — 5th November 2020 (Kincorth/Nigg/Cove)"
[2] "9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)"

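There are more than 300 rows in total. I want to do two things:

First, start a new row wherever the — separator appears, so that the result looks like this: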

goal_1
 [1] "1st April 2004 (Queens Cross)"                                   
 [2] "16th August 2007 (Midstocket/Rosemount)"                         
 [3] "19th May 2011 (Dyce/Bucksburn/Danestone)"                        
 [4] "23rd June 2011 (Airyhall/Broomhill/Garthdee)"                    
 [5] "30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove)"
 [6] "1st October 2015 (George St/Harbour; Midstocket/Rosemount)"      
 [7] "3rd October 2019 (Bridge of Don)"                                
 [8] "21st November 2019 (Torry/Ferryhill)"                            
 [9] "5th November 2020 (Kincorth/Nigg/Cove)"                          
[10] "9th June 2005 (Huntly E.)"                                       
[11] "1st May 2008 (Troup)"                                            
[12] "23rd April 2009 (Aboyne, Upper Deeside & Donside)"               
[13] "27th November 2014 (Troup)"                                      
[14] "5th November 2015 (Huntly, Strathbogie & Howe of Alford)"        
[15] "3rd November 2016 (Banff & District; Inverurie & District)"      
[16] "12th October 2017 (Inverurie & District)"                        
[17] "15th October 2020 (Ellon & District)"

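Then I want to put the bit between the parentheses into a new object, but (and this is where I am really having trouble) if a ; separates it, I want it to become two different rows. Taking just the first 6 as an example, this is what I am looking for:

goal_2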
  date             name                       
  <chr>            <chr>                      
1 1st April 2004   Queens Cross               
2 16th August 2007 Midstocket/Rosemount       
3 19th May 2011    Dyce/Bucksburn/Danestone   
4 23rd June 2011   Airyhall/Broomhill/Garthdee
5 30th July 2015   Hilton/Woodside/Stockethill
6 30th July 2015   Kincorth/Nigg/Cove

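Apologies for posting two tasks at once, but I have been struggling with this for a while, so any help would be great. Thank you very much!

【Discussion】:

Please use dput on your original string so that we can reproduce it more easily.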
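As a minimal sketch of what that comment asks for: dput() prints R code that recreates an object, which can then be pasted into a question. The vector x below is a hypothetical, shortened stand-in for the real 300+ element data.

```r
# Hypothetical two-element stand-in for the real character vector;
# "\u2014" is the em dash that separates the dated entries
x <- c(
  "1st April 2004 (Queens Cross) \u2014 16th August 2007 (Midstocket/Rosemount)",
  "9th June 2005 (Huntly E.) \u2014 1st May 2008 (Troup)"
)

# Print a copy-pasteable definition of the first few elements only,
# which keeps the question short while preserving the exact structure
dput(head(x, 2))
```

Sharing only head() of the data is usually enough, as long as the sample contains every pattern the answer must handle (here, the em dash separator and a ; inside the parentheses).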

Tags: r string dplyr stringr

【Solution 1】:

Base R:

# Split each string on the em dash (U+2014) and trim the surrounding whitespace
newtxt <- unlist(lapply(strsplit(txt, "\u2014"), trimws))
newtxt
#  [1] "1st April 2004 (Queens Cross)"                                       
#  [2] "16th August 2007 (Midstocket/ Rosemount)"                            
#  [3] "19th May 2011 (Dyce/ Bucksburn/ Danestone)"                          
#  [4] "23rd June 2011 (Airyhall/ Broomhill/ Garthdee)"                      
#  [5] "30th July 2015 (Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove)"
#  [6] "1st October 2015 (George St/ Harbour; Midstocket/ Rosemount)"        
#  [7] "3rd October 2019 (Bridge of Don)"                                    
#  [8] "21st November 2019 (Torry/ Ferryhill)"                               
#  [9] "5th November 2020 (Kincorth/ Nigg/ Cove)"                            
# [10] "9th June 2005 (Huntly E.)"                                           
# [11] "1st May 2008 (Troup)"                                                
# [12] "23rd April 2009 (Aboyne, Upper Deeside & Donside)"                   
# [13] "27th November 2014 (Troup)"                                          
# [14] "5th November 2015 (Huntly, Strathbogie & Howe of Alford)"            
# [15] "3rd November 2016 (Banff & District; Inverurie & District)"          
# [16] "12th October 2017 (Inverurie & District)"                            
# [17] "15th October 2020 (Ellon & District)"

Then

# Capture the date (text before the parenthesis) and the name (text inside it)
out <- strcapture("^(.*)\\(([^)]*)\\)", newtxt, list(date = "", name = ""))
out
#                   date                                                name
# 1      1st April 2004                                         Queens Cross
# 2    16th August 2007                                Midstocket/ Rosemount
# 3       19th May 2011                           Dyce/ Bucksburn/ Danestone
# 4      23rd June 2011                        Airyhall/ Broomhill/ Garthdee
# 5      30th July 2015  Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove
# 6    1st October 2015            George St/ Harbour; Midstocket/ Rosemount
# 7    3rd October 2019                                        Bridge of Don
# 8  21st November 2019                                     Torry/ Ferryhill
# 9   5th November 2020                                 Kincorth/ Nigg/ Cove
# 10      9th June 2005                                            Huntly E.
# 11       1st May 2008                                                Troup
# 12    23rd April 2009                      Aboyne, Upper Deeside & Donside
# 13 27th November 2014                                                Troup
# 14  5th November 2015                 Huntly, Strathbogie & Howe of Alford
# 15  3rd November 2016               Banff & District; Inverurie & District
# 16  12th October 2017                                 Inverurie & District
# 17  15th October 2020                                     Ellon & District

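To split the entries separated by ; into rows of their own:

# Split each name on ";" and repeat the date once per resulting piece
semis <- strsplit(out$name, ";")
data.frame(date = rep(out$date, lengths(semis)), name = unlist(semis))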
#                   date                                 name
# 1      1st April 2004                          Queens Cross
# 2    16th August 2007                 Midstocket/ Rosemount
# 3       19th May 2011            Dyce/ Bucksburn/ Danestone
# 4      23rd June 2011         Airyhall/ Broomhill/ Garthdee
# 5      30th July 2015         Hilton/ Woodside/ Stockethill
# 6      30th July 2015                  Kincorth/ Nigg/ Cove
# 7    1st October 2015                    George St/ Harbour
# 8    1st October 2015                 Midstocket/ Rosemount
# 9    3rd October 2019                         Bridge of Don
# 10 21st November 2019                      Torry/ Ferryhill
# 11  5th November 2020                  Kincorth/ Nigg/ Cove
# 12      9th June 2005                             Huntly E.
# 13       1st May 2008                                 Troup
# 14    23rd April 2009       Aboyne, Upper Deeside & Donside
# 15 27th November 2014                                 Troup
# 16  5th November 2015  Huntly, Strathbogie & Howe of Alford
# 17  3rd November 2016                      Banff & District
# 18  3rd November 2016                  Inverurie & District
# 19  12th October 2017                  Inverurie & District
# 20  15th October 2020                      Ellon & District
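Since the question mentions more than 300 input strings, it may be useful to remember which input string each output row came from. The following is a sketch under assumptions, not part of the answer above: the shortened txt, the src column, and the res object are illustrative names, and the split patterns additionally swallow the whitespace around each separator.

```r
# Hypothetical shortened input; "\u2014" is the em dash separator
txt <- c(
  "1st April 2004 (Queens Cross) \u2014 30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove)",
  "9th June 2005 (Huntly E.) \u2014 3rd November 2016 (Banff & District; Inverurie & District)"
)

# Step 1: split on the em dash, consuming the spaces around it
parts  <- strsplit(txt, "\\s*\u2014\\s*")
src    <- rep(seq_along(txt), lengths(parts))  # index of the originating string
newtxt <- unlist(parts)

# Step 2: capture the date and the text inside the parentheses
out <- strcapture("^(.*)\\(([^)]*)\\)", newtxt, list(date = "", name = ""))
out$date <- trimws(out$date)  # the capture keeps the space before "("
out$src  <- src

# Step 3: one row per semicolon-separated name, repeating src and date
semis <- strsplit(out$name, ";\\s*")
n     <- lengths(semis)
res <- data.frame(
  src  = rep(out$src, n),
  date = rep(out$date, n),
  name = trimws(unlist(semis))
)
res
```

The same rep(..., lengths(...)) idiom is used twice: once to carry the input index through the em dash split, and once to carry both the index and the date through the semicolon split.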

Data

# The entries within each string are separated by an em dash (U+2014)
txt <- c(
  "1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/ Rosemount) — 19th May 2011 (Dyce/ Bucksburn/ Danestone) — 23rd June 2011 (Airyhall/ Broomhill/ Garthdee) — 30th July 2015 (Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove) — 1st October 2015 (George St/ Harbour; Midstocket/ Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/ Ferryhill) — 5th November 2020 (Kincorth/ Nigg/ Cove)",
  "9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)"
)

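【Discussion】:

Great, thank you so much! It is only missing the last bit, which is creating new rows to separate the instances where there is a ;. For example, for "[15] 3rd November 2016 Banff & District; Inverurie & District", there should be "[15] 3rd November 2016 Banff & District" and "[16] 3rd November 2016 Inverurie & District". Do you know how I would get that using base R? Thanks!
Forgot that, see my edit. (Looks like we all missed it :-)
That's true, so maybe I didn't explain it properly. But it works now! Thank you so much!
I think you explained it well. Often we skim the question, think we have got it (based on having seen so many related questions), and run with it ... missing a second or third sentence in a paragraph that contains some key detail. I can't speak for others, but I know I am sometimes too hasty (to get back to my day job).
Great, good to know. Don't worry, you helped me a lot, so I hope it was worth the break from your day job :)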

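【Solution 2】:

Base R

# Split on the em dash once and reuse the pieces (strsplit keeps this base R)
pieces <- unlist(strsplit(original_string, '—'))

# part1: drop the parenthesised part, leaving the date
part1 <- trimws(gsub('\\s\\((.*)\\)', '', pieces))

# part2: drop everything from the first to the last digit (the date), leaving "(name)"
part2 <- trimws(gsub('[0-9](.*)[0-9]|\v', '', pieces))

data.frame(part1, part2)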


               part1                                             part2
1      1st April 2004                                    (Queens Cross)
2    16th August 2007                            (Midstocket/Rosemount)
3       19th May 2011                        (Dyce/Bucksburn/Danestone)
4      23rd June 2011                     (Airyhall/Broomhill/Garthdee)
5      30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove)
6    1st October 2015         (George St/Harbour; Midstocket/Rosemount)
7    3rd October 2019                                   (Bridge of Don)
8  21st November 2019                                 (Torry/Ferryhill)
9   5th November 2020                              (Kincorth/Nigg/Cove)
10      9th June 2005                                       (Huntly E.)
11       1st May 2008                                           (Troup)
12    23rd April 2009                 (Aboyne, Upper Deeside & Donside)
13 27th November 2014                                           (Troup)
14  5th November 2015            (Huntly, Strathbogie & Howe of Alford)
15  3rd November 2016          (Banff & District; Inverurie & District)
16  12th October 2017                            (Inverurie & District)
17  15th October 2020                                (Ellon & District)

Data

original_string <- 
c( "1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/Rosemount) — 19th May 2011 (Dyce/Bucksburn/Danestone) — 23rd June 2011 (Airyhall/Broomhill/Garthdee) — 30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove) — 1st October 2015 (George St/Harbour; Midstocket/Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/Ferryhill) — 5th November 2020 (Kincorth/Nigg/Cove)"   ,                                                       
"9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)" )
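The part2 column above still carries the parentheses and does not yet split on ;. One possible way to finish the job in base R is sketched below. It assumes every piece contains exactly one parenthesised group; the shortened original_string and the names name_list and result are illustrative only.

```r
# Hypothetical shortened input; "\u2014" is the em dash separator
original_string <- c(
  "30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove) \u2014 3rd October 2019 (Bridge of Don)",
  "3rd November 2016 (Banff & District; Inverurie & District)"
)

pieces <- trimws(unlist(strsplit(original_string, "\u2014")))

# Date: everything before the opening parenthesis
part1 <- trimws(sub("\\(.*$", "", pieces))

# Name(s): the text between "(" and ")", without the parentheses themselves
part2 <- sub("^.*\\(([^)]*)\\).*$", "\\1", pieces)

# Split on ";" and repeat each date once per resulting name
name_list <- strsplit(part2, ";")
result <- data.frame(
  date = rep(part1, lengths(name_list)),
  name = trimws(unlist(name_list))
)
result
```

Anchoring the date on the opening parenthesis rather than on digits avoids trouble if a place name ever contains a number.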

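【Discussion】:

Great, thank you so much! It is only missing the last bit, which is creating new rows to separate the instances where there is a ;. For example, for "[15] 3rd November 2016 Banff & District; Inverurie & District", there should be "[15] 3rd November 2016 Banff & District" and "[16] 3rd November 2016 Inverurie & District". Do you know how I would get that using base R? Thanks!

【Solution 3】:

Doing this in tidyverse fashion: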

# The entries within each string are separated by an em dash (U+2014).
# The output below shows <U+200B> after each "/", so the original strings
# presumably had a zero-width space there instead of the plain space shown here.
df <- data.frame(str = c(
  "1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/ Rosemount) — 19th May 2011 (Dyce/ Bucksburn/ Danestone) — 23rd June 2011 (Airyhall/ Broomhill/ Garthdee) — 30th July 2015 (Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove) — 1st October 2015 (George St/ Harbour; Midstocket/ Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/ Ferryhill) — 5th November 2020 (Kincorth/ Nigg/ Cove)",
  "9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)"
))

suppressMessages(library(tidyverse))

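df %>% separate_rows(str, sep = '—') %>%                    # one row per dated entry
  separate(str, into = c('date', 'name'), sep = '\\(') %>%  # split date from name at "("
  separate_rows(name, sep = ";") %>%                        # one row per name
  mutate(name = str_remove_all(name, "\\)")) %>%            # drop the closing ")"
  mutate(across(everything(), ~str_trim(.)))                # trim stray whitespace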


# A tibble: 20 x 2
   date               name                                
   <chr>              <chr>                               
 1 1st April 2004     Queens Cross                        
 2 16th August 2007   Midstocket/<U+200B>Rosemount                
 3 19th May 2011      Dyce/<U+200B>Bucksburn/<U+200B>Danestone            
 4 23rd June 2011     Airyhall/<U+200B>Broomhill/<U+200B>Garthdee         
 5 30th July 2015     Hilton/<U+200B>Woodside/<U+200B>Stockethill         
 6 30th July 2015     Kincorth/<U+200B>Nigg/<U+200B>Cove                  
 7 1st October 2015   George St/<U+200B>Harbour                   
 8 1st October 2015   Midstocket/<U+200B>Rosemount                
 9 3rd October 2019   Bridge of Don                       
10 21st November 2019 Torry/<U+200B>Ferryhill                     
11 5th November 2020  Kincorth/<U+200B>Nigg/<U+200B>Cove                  
12 9th June 2005      Huntly E.                           
13 1st May 2008       Troup                               
14 23rd April 2009    Aboyne, Upper Deeside & Donside     
15 27th November 2014 Troup                               
16 5th November 2015  Huntly, Strathbogie & Howe of Alford
17 3rd November 2016  Banff & District                    
18 3rd November 2016  Inverurie & District                
19 12th October 2017  Inverurie & District                
20 15th October 2020  Ellon & District

Created on 2021-05-04 by the reprex package (v2.0.0)
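As a possible variation on the pipeline above, tidyr::extract() can pull the date and the name out with a single regular expression, which removes the need for the separate str_remove_all() and str_trim() steps. This is only a sketch on a hypothetical shortened data frame; the regex and the res object are assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical shortened input; "\u2014" is the em dash separator
df <- data.frame(str = c(
  "30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove) \u2014 3rd October 2019 (Bridge of Don)",
  "3rd November 2016 (Banff & District; Inverurie & District)"
))

res <- df %>%
  # one row per dated entry, consuming the spaces around the em dash
  separate_rows(str, sep = "\\s*\u2014\\s*") %>%
  # group 1: the date; group 2: everything inside the parentheses
  tidyr::extract(str, into = c("date", "name"),
                 regex = "^(.*?)\\s*\\((.*)\\)\\s*$") %>%
  # one row per semicolon-separated name, consuming the space after ";"
  separate_rows(name, sep = ";\\s*")
res
```

The explicit tidyr:: prefix guards against the name clash with magrittr::extract(), which can mask the tidyr function depending on which packages are attached.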

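【Discussion】:

Great, thank you so much! It is only missing the last bit, which is creating new rows to separate the instances where there is a ;. For example, for "[15] 3rd November 2016 Banff & District; Inverurie & District", there should be "[15] 3rd November 2016 Banff & District" and "[16] 3rd November 2016 Inverurie & District". Do you know how I would do that in the tidyverse? Thanks!
See the edit. If it helped, please consider upvoting.
It did help! Thanks a lot. Will upvote.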