[Question title]: Match partial text with full text and replace
[Posted]: 2021-10-04 02:49:12
[Question]:

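I have a dataset of tweets, some of which are original and some of which are retweets. For some reason the retweets are truncated with "...", so the full text is missing. The original tweets are (hopefully) always present in my dataset, so I would like to find each original tweet and use it to replace the truncated one.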

For example:

my_data <- tribble(
  ~user, ~text,
  "Peter", "Hello, this is Peter, I like ice cream!",
  "John", "RT @Peter: Hello, this is Peter, I like ...",
  "Martha", "RT @Peter: Hello, this is Peter, I like ...",
  "Julia", "Hi, I really like apples!",
  "Bjorn", "RT @Julia: I really like ..."
)
# A tibble: 5 x 2
  user   text                                       
  <chr>  <chr>                                      
1 Peter  Hello, this is Peter, I like ice cream!    
2 John   RT @Peter: Hello, this is Peter, I like ...
3 Martha RT @Peter: Hello, this is Peter, I like ...
4 Julia  Hi, I really like apples!                  
5 Bjorn  RT @Julia: I really like ... 

I want to find every instance of "RT @username: some text..." and replace it with the full tweet. Basically:

# A tibble: 5 x 2
  user   text                                              
  <chr>  <chr>                                             
1 Peter  Hello, this is Peter, I like ice cream!           
2 John   RT @Peter: Hello, this is Peter, I like ice cream!
3 Martha RT @Peter: Hello, this is Peter, I like ice cream!
4 Julia  Hi, I really like apples!                         
5 Bjorn  RT @Julia: Hi, I really like apples!     

I have already extracted the retweeted handle and split each retweet into groups:

retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"
str_match(my_data$text, retweet_pattern)
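
For reference, the same pattern can be taken apart with base R. This is only a minimal sketch, assuming the pattern above and a few of the sample tweets: regexec()/regmatches() stand in for str_match() here, and the extra sub() step that strips the trailing "..." is an addition, not part of the original attempt.

```r
texts <- c(
  "Hello, this is Peter, I like ice cream!",
  "RT @Peter: Hello, this is Peter, I like ...",
  "Hi, I really like apples!",
  "RT @Julia: I really like ..."
)

retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"

# One list element per tweet: c(full match, handle, text) for retweets,
# character(0) for tweets that do not match the pattern
parts <- regmatches(texts, regexec(retweet_pattern, texts))

is_rt    <- lengths(parts) == 3
handle   <- vapply(parts[is_rt], `[`, character(1), 2)
fragment <- vapply(parts[is_rt], `[`, character(1), 3)

# Drop the trailing "..." (and any whitespace before it) so that only
# the text shared with the original tweet remains
fragment <- sub("\\s*\\.{3}$", "", fragment)
```

After this step, handle says whose tweets to search and fragment is the piece of text to look for in them.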

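However, I am not quite sure how to proceed. Since user/text pairs are not necessarily unique (i.e., one user may have several tweets that were retweeted), simply looking up the retweeted handle and swapping in that user's text will not work. Maybe I need to use a string metric such as Levenshtein distance?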
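
On the Levenshtein idea: a plain edit distance between a truncated fragment and a full tweet is dominated by the missing tail, so a substring-aware distance is probably the better fit. A minimal sketch, assuming base R's adist() with partial = TRUE; the misspelled fragment is a hypothetical example, not taken from the dataset.

```r
originals <- c(
  "Hello, this is Peter, I like ice cream!",
  "Hi, I really like apples!"
)

# Hypothetical truncated fragment containing a typo ("realy")
fragment <- "I realy like"

# partial = TRUE compares the fragment with the best-matching substring
# of each original instead of with the whole string.
# Result: one row per fragment, one column per candidate original.
d <- adist(fragment, originals, partial = TRUE)

# Pick the candidate with the smallest distance
best <- originals[which.min(d)]
best
```

When the truncated text is a literal piece of the original, as in the sample data, an exact substring test is enough and no distance metric is needed.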

Thanks.

[Discussion]:

    Tags: r regex pattern-matching string-matching


    [Solution 1]:

    Since the truncated retweet text appears verbatim in the original (non-retweet) tweet, you can try this.

    library(dplyr)
    library(tidyr)
    
    # Create a separate data frame for the retweets and split the
    # "RT @user" prefix and the tweet text into separate columns.
    # extra = 'merge' splits only at the first colon.
    rt_data <- my_data %>% 
      filter(grepl('^RT @', text)) %>%
      separate(text, c('name', 'text'), sep = ':\\s*', extra = 'merge')
    
    # Create a separate data frame for tweets that are not retweets
    no_rt_data <- my_data %>% filter(!grepl('^RT @', text))
      
    
    # Strip the trailing "..." from each retweet and find the original
    # tweet that contains the remaining text. fixed = TRUE treats the
    # text literally; [1] keeps the first match (NA if there is none).
    rt_data$text <- sapply(gsub('\\s*\\.+$', '', rt_data$text), 
                           function(x) no_rt_data$text[grepl(x, no_rt_data$text, fixed = TRUE)][1],
                           USE.NAMES = FALSE)
    
    # Recombine the "RT @user" prefix and the full tweet
    rt_data <- rt_data %>% unite(text, name, text, sep = ': ')
    
    # Combine the two data frames
    bind_rows(no_rt_data, rt_data)
    
    #   user   text                                              
    #  <chr>  <chr>                                             
    #1 Peter  Hello, this is Peter, I like ice cream!           
    #2 Julia  Hi, I really like apples!                         
    #3 John   RT @Peter: Hello, this is Peter, I like ice cream!
    #4 Martha RT @Peter: Hello, this is Peter, I like ice cream!
    #5 Bjorn  RT @Julia: Hi, I really like apples!              
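
The lookup in this answer searches every original tweet. One possible variation, sketched here in base R, only considers tweets written by the retweeted handle, which addresses the concern that one user may have several retweeted tweets. restore_retweets() is a hypothetical helper, not part of any package. It assumes the truncated text is a literal substring of the original, and it leaves a retweet unchanged when zero or several originals match.

```r
restore_retweets <- function(user, text) {
  prefix <- "^RT @([A-Za-z0-9_]+): "
  is_rt  <- grepl(prefix, text)
  out    <- text

  for (i in which(is_rt)) {
    # Handle of the retweeted user
    handle <- sub(paste0(prefix, ".*$"), "\\1", text[i])
    # Truncated text without the "RT @user: " prefix and trailing "..."
    fragment <- sub("\\s*\\.{3}$", "", sub(prefix, "", text[i]))
    if (!nzchar(fragment)) next

    # Candidate originals: non-retweets written by the retweeted user
    cand <- text[!is_rt & user == handle]
    # Literal substring test, so punctuation is not read as regex syntax
    hit <- cand[grepl(fragment, cand, fixed = TRUE)]

    # Replace only when exactly one original matches
    if (length(hit) == 1) {
      out[i] <- paste0("RT @", handle, ": ", hit)
    }
  }
  out
}

# Sample data from the question plus a second (made-up) tweet by Peter
user <- c("Peter", "John", "Martha", "Julia", "Bjorn", "Peter")
text <- c(
  "Hello, this is Peter, I like ice cream!",
  "RT @Peter: Hello, this is Peter, I like ...",
  "RT @Peter: Hello, this is Peter, I like ...",
  "Hi, I really like apples!",
  "RT @Julia: I really like ...",
  "Hello again, I like cake!"
)

restored <- restore_retweets(user, text)
restored
```

Unlike the split-and-recombine approach, this keeps the rows in their original order, so the result can be assigned straight back to the text column.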
    
