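Extracting repeated characters

【Question title】: Extracting repeated characters
【Posted】: 2021-09-23 08:10:14
【Question description】:

I am trying to extract the artist and title names. However, it is a bit complicated. Here is the list: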

nlist <- c(
"Lil' SlimLil' Slim feat. PxMxWxPxMxWx Where Your Ward At!!",               
"I Like It (Mannie Fresh Style)I Like It (Mannie Fresh Style)Ms. Tee",
"Bella VistaBella Vista Mister Wong",
"Tom WareTom WareChina Town",                                        
"Race 'N RhythmRace 'N Rhythm Teenage Girls",                                    
"Ronald MarquisseRonald MarquisseElectro Link 7",
"PleasurePleasure Thoughts Of Old Flames",
"OM, OM, Dom Um RomaoDom Um Romao Chipero",
"HookfaceHookface4 07 181221"
)

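This is the pattern in the strings.

Explanation:

There are three different patterns (entry 1, entries 2-7, entry 8).
RED: artist (repeated),
BLUE: title (not repeated),
GREEN: conjunction (between artist names, not repeated)

Entries 1 and 8 are very hard and I could not solve them. For entries 2 to 7, however, the code below solves my problem.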

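library(stringr)  # provides str_trim() and str_match()
title = str_trim(gsub('(.+?)\\1','', nlist))
# str_match() returns a matrix; column 2 holds capture group 1
artist = str_match(nlist, '(.+?)\\1')[,2]
data = cbind(title,artist);data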

Here is the output of the code above.

     title                                     artist                          
[1,] "feat. PxMxWxPxMxWx Where Your Ward At!!" "Lil' Slim"                     
[2,] "Ms. Tee"                                 "I Like It (Mannie Fresh Style)"
[3,] "Mister Wong"                             "Bella Vista"                   
[4,] "China Town"                              "Tom Ware"                      
[5,] "Teenage Girls"                           "Race 'N Rhythm"                
[6,] "Electro Link 7"                          "Ronald Marquisse"              
[7,] "Thoughts Of Old Flames"                  "Pleasure"                      
[8,] "Chipero"                                 "OM, "  
[9,] "4 07 181221"                             "Hookface"

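Problem: when there is a "feat." or a "," in the string, it cuts the repeated sequence short. Question: how can I extract the full artist names, as shown below?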
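A minimal sketch of why this happens, using only base R. The variable `s` is just an illustrative copy of entry 8. The lazy pattern `(.+?)\\1` stops at the shortest repeated block it can find, so the first match covers only the doubled "OM, " and the doubled "Dom Um Romao" shows up as a separate, second match.

```r
# Illustrative copy of entry 8 from nlist
s <- "OM, OM, Dom Um RomaoDom Um Romao Chipero"

# regexpr() reports only the first match of the pattern
first <- regmatches(s, regexpr("(.+?)\\1", s, perl = TRUE))

# gregexpr() reports every non-overlapping match
all_matches <- regmatches(s, gregexpr("(.+?)\\1", s, perl = TRUE))[[1]]

print(first)        # "OM, OM, "
print(all_matches)  # "OM, OM, "  "Dom Um RomaoDom Um Romao"
```

Taking only group 1 of the first match therefore yields "OM, " as the artist, which is what row 8 of the output above shows.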

My expected result is here (check rows 1 and 8):

     title                                     artist                          
[1,] "Where Your Ward At!!"                    "Lil' Slim feat. PxMxWx"                     
[2,] "Ms. Tee"                                 "I Like It (Mannie Fresh Style)"
[3,] "Mister Wong"                             "Bella Vista"                   
[4,] "China Town"                              "Tom Ware"                      
[5,] "Teenage Girls"                           "Race 'N Rhythm"                
[6,] "Electro Link 7"                          "Ronald Marquisse"              
[7,] "Thoughts Of Old Flames"                  "Pleasure"                      
[8,] "Chipero"                                 "OM, Dom Um Romao"                             
[9,] "4 07 181221"                             "Hookface"

Thanks...

【Question comments】:

Shouldn't 4 07 181221 be the title and Hookface the artist?

Tags: r regex gsub text-extraction

【Solution 1】:

You can get the titles with:

title <- sub("^(?:.*?\\b(.+?)\\1(?=\\b|\\p{Lu}))*\\s*(.*)", "\\2", nlist, perl=TRUE)
## (output shown for the first eight entries of nlist)
## => [1] "Where Your Ward At!!"   "Ms. Tee"                "Mister Wong"           
## => [4] "China Town"             "Teenage Girls"          "Electro Link 7"        
## => [7] "Thoughts Of Old Flames" "Chipero"

You can get the artists with the following code:

rx <- "^(?:.*?\\b(.+?)\\1(?=\\b|\\p{Lu}))*"
# Take the leading run of repeated blocks, then collapse each doubled block
artists <- regmatches(nlist, regexpr(rx, nlist, perl=TRUE))
artists <- gsub("(.+?)\\1", "\\1", artists, perl=TRUE)
## => [1] "Lil' Slim feat. PxMxWx"        
## => [2] "I Like It (Mannie Fresh Style)"
## => [3] "Bella Vista"                   
## => [4] "Tom Ware"                      
## => [5] "Race 'N Rhythm"                
## => [6] "Ronald Marquisse"              
## => [7] "Pleasure"                      
## => [8] "OM, Dom Um Romao"

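See the regex demo. Details:

^ - start of string
(?:.*?\b(.+?)\1(?=\b|\p{Lu}))* - zero or more occurrences of:
- .*? - any zero or more characters other than line break characters, as few as possible
- \b - a word boundary
- (.+?) - Group 1: any one or more characters other than line break characters, as few as possible
- \1 - the same value as captured in Group 1
- (?=\b|\p{Lu}) - the next position must be a word boundary or be followed by an uppercase letter
\s* - zero or more whitespace characters
(.*) - Group 2 (\2): the rest of the line.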
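The breakdown above can be traced on a single string. This is only a sketch in base R: `s` is an illustrative copy of entry 1, and the names `prefix`, `artist` and `title` are chosen here for clarity rather than taken from the answer.

```r
# Illustrative copy of entry 1 from nlist
s  <- "Lil' SlimLil' Slim feat. PxMxWxPxMxWx Where Your Ward At!!"
rx <- "^(?:.*?\\b(.+?)\\1(?=\\b|\\p{Lu}))*"

# Leading part of the string, up to the end of the last doubled block
prefix <- regmatches(s, regexpr(rx, s, perl = TRUE))

# Collapse each doubled block to a single copy
artist <- gsub("(.+?)\\1", "\\1", prefix, perl = TRUE)

# Whatever follows the prefix is the title
title <- trimws(substring(s, nchar(prefix) + 1))

print(prefix)  # "Lil' SlimLil' Slim feat. PxMxWxPxMxWx"
print(artist)  # "Lil' Slim feat. PxMxWx"
print(title)   # "Where Your Ward At!!"
```

The trailing "!!" is not treated as a repeated block because the lookahead requires a word boundary or an uppercase letter after the repetition, and neither exists at the end of this string.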

See the R demo online.

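【Comments】:

Thank you very much! I only get the following error. Am I missing a library? Error in regexpr(rx, nlist, perl = TRUE) : object 'rx' not found
@podious No, you did not copy the rx variable definition.
@podious Get the code from ideone.com/32DZQZ
Thanks. But I have now added a ninth line to "nlist". In that case the digits get collapsed.
@podious Are you planning any more "additions"? Once you provide a complete set of requirements, I will update the answer.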

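【Solution 2】:

Maybe the following extracts what you want. I remove everything up to and including the last repetition and store what remains in title. To get the artist, I use substr to cut off a trailing part as long as the title found before, then use gsub with (.{2,})\\1 to collapse the repetition of the artist; note that this also collapses repetitions in the conjunction.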

title <- sub(".*(.{2,})\\1\\s*", "", nlist)
artist <- trimws(gsub("(.{2,})\\1", "\\1"
              , substr(nlist, 1, nchar(nlist) - nchar(title)), perl=TRUE))
cbind(title,artist)
#      title                    artist                          
# [1,] "Where Your Ward At!!"   "Lil' Slim feat. PxMxWx"        
# [2,] "Ms. Tee"                "I Like It (Mannie Fresh Style)"
# [3,] "Mister Wong"            "Bella Vista"                   
# [4,] "China Town"             "Tom Ware"                      
# [5,] "Teenage Girls"          "Race 'N Rhythm"                
# [6,] "Electro Link 7"         "Ronald Marquisse"              
# [7,] "Thoughts Of Old Flames" "Pleasure"                      
# [8,] "Chipero"                "OM, Dom Um Romao"              
# [9,] "4 07 181221"            "Hookface"
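
To make the intermediate values visible, here is a small sketch that runs the same steps on one string. `s` is an illustrative copy of entry 8 and `head_part` is a name introduced here; `perl = TRUE` is passed to every call, which is an assumption made for consistency and is not in the code above.

```r
# Illustrative copy of entry 8 from nlist
s <- "OM, OM, Dom Um RomaoDom Um Romao Chipero"

# Greedy .* pushes the match to the LAST repeated block of 2+ characters;
# removing everything up to and including it leaves the title
title <- sub(".*(.{2,})\\1\\s*", "", s, perl = TRUE)

# Everything before the title still contains the doubled blocks
head_part <- substr(s, 1, nchar(s) - nchar(title))

# Collapse every doubled block of 2+ characters, then trim whitespace
artist <- trimws(gsub("(.{2,})\\1", "\\1", head_part, perl = TRUE))

print(title)      # "Chipero"
print(head_part)  # "OM, OM, Dom Um RomaoDom Um Romao "
print(artist)     # "OM, Dom Um Romao"
```

Requiring at least two characters in the repeated block keeps doubled letters such as the "ee" in "Tee" from being treated as a repeated artist name.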

Another possibility is:

x <- sub("^(.*)\\1\\s*", "", nlist)     #Remove the first repetition of artist
title <- sub(".*?(.{2,})\\1\\s*", "", x) #Remove Conjunction and repetition of Artist if there is one
artist <- trimws(gsub("(.{2,})\\1", "\\1"
              , substr(nlist, 1, nchar(nlist) - nchar(title)), perl=TRUE))
cbind(title,artist)
#      title                    artist                          
# [1,] "Where Your Ward At!!"   "Lil' Slim feat. PxMxWx"        
# [2,] "Ms. Tee"                "I Like It (Mannie Fresh Style)"
# [3,] "Mister Wong"            "Bella Vista"                   
# [4,] "China Town"             "Tom Ware"                      
# [5,] "Teenage Girls"          "Race 'N Rhythm"                
# [6,] "Electro Link 7"         "Ronald Marquisse"              
# [7,] "Thoughts Of Old Flames" "Pleasure"                      
# [8,] "Chipero"                "OM, Dom Um Romao"              
# [9,] "4 07 181221"            "Hookface"
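
Both variants rely on sub() finding a repeated block. When a string contains none, sub() returns it unchanged, so the whole string ends up in title and artist comes out empty. A hedged sketch of how such rows could be flagged for manual review follows; the vector `x`, its second entry, and the name `has_rep` are made up for illustration.

```r
# Hypothetical input: the second string contains no repeated block
x <- c("Tom WareTom WareChina Town", "No Repeat Here")

# TRUE where a block of 2+ characters is immediately repeated
has_rep <- grepl("(.{2,})\\1", x, perl = TRUE)

# Extract a title only where a repetition exists; mark the rest as NA
title <- ifelse(has_rep,
                sub(".*(.{2,})\\1\\s*", "", x, perl = TRUE),
                NA_character_)

print(has_rep)  # TRUE FALSE
print(title)    # "China Town" NA
```

Rows with NA can then be inspected by hand instead of silently receiving a wrong title.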

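【Comments】:

Thanks @GKİ. But in the first row, "Lil' Slimfeat.PxMxWx", the "feat." needs to be separated.
Maybe the update now gives you what you want?
Maybe the new update still gives the expected result, but looks simpler than the previous version.
Hello @GKi, I checked your code and tried it on my main data file. I can now say that its accuracy is around 95%. (There are some names for which it is very hard to define a specific logic.) So I can say this works for me. Thank you very much!!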