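[Posted]: 2021-10-04 02:49:12
[Question]:
I have a dataset of tweets, some of which are original tweets and others retweets. For some reason the retweets are truncated with "...", so their full text is missing. The original tweets are (hopefully) always present in my dataset, so I would like to find each original tweet and use it to replace the truncated retweet.
For example: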
library(tidyverse)  # provides tribble(); also loads stringr, used further down
my_data <- tribble(
~user, ~text,
"Peter", "Hello, this is Peter, I like ice cream!",
"John", "RT @Peter: Hello, this is Peter, I like ...",
"Martha", "RT @Peter: Hello, this is Peter, I like ...",
"Julia", "Hi, I really like apples!",
"Bjorn", "RT @Julia: I really like ..."
)
# A tibble: 5 x 2
user text
<chr> <chr>
1 Peter Hello, this is Peter, I like ice cream!
2 John RT @Peter: Hello, this is Peter, I like ...
3 Martha RT @Peter: Hello, this is Peter, I like ...
4 Julia Hi, I really like apples!
5 Bjorn RT @Julia: I really like ...
I want to find every instance of "RT @username: some text..." and replace it with the full tweet. Essentially:
# A tibble: 5 x 2
user text
<chr> <chr>
1 Peter Hello, this is Peter, I like ice cream!
2 John RT @Peter: Hello, this is Peter, I like ice cream!
3 Martha RT @Peter: Hello, this is Peter, I like ice cream!
4 Julia Hi, I really like apples!
5 Bjorn RT @Julia: Hi, I really like apples!
I have already extracted the retweeted handle and split each match into groups:
retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"
str_match(my_data$text, retweet_pattern)
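However, I am not sure how to proceed from here. Because user/text pairs are not necessarily unique (that is, one user may have several tweets that were retweeted), simply looking up the retweeted handle and swapping in that user's text will not work. Perhaps I need a string metric such as Levenshtein distance?
Thanks.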
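For reference, here is a minimal sketch of one way the lookup could go before reaching for a string metric. It assumes that the surviving fragment of a retweet appears verbatim somewhere in exactly one original tweet by the retweeted user. The helper `find_original()` and the names `originals`, `fragments`, and `fixed_data` are hypothetical, introduced only for illustration. The sketch uses substring containment rather than a strict prefix test because, in the example data, Bjorn's retweet drops the leading "Hi, ".

```r
library(tibble)
library(stringr)

my_data <- tribble(
  ~user, ~text,
  "Peter", "Hello, this is Peter, I like ice cream!",
  "John", "RT @Peter: Hello, this is Peter, I like ...",
  "Martha", "RT @Peter: Hello, this is Peter, I like ...",
  "Julia", "Hi, I really like apples!",
  "Bjorn", "RT @Julia: I really like ..."
)

retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"
parts <- str_match(my_data$text, retweet_pattern)

# Rows that do not match the retweet pattern are treated as originals.
originals <- my_data[is.na(parts[, 1]), ]

handles <- parts[, 2]
# Drop the trailing "..." (and any whitespace before it) from the retweeted text.
fragments <- str_remove(parts[, 3], "\\s*\\.{3}$")

# Hypothetical helper: return the full text only when exactly one original
# tweet by `handle` contains `fragment`; otherwise return NA so that
# ambiguous or missing cases are left untouched.
find_original <- function(handle, fragment, originals) {
  if (is.na(handle) || is.na(fragment) || !nzchar(fragment)) {
    return(NA_character_)
  }
  candidates <- originals$text[originals$user == handle]
  hits <- candidates[str_detect(candidates, fixed(fragment))]
  if (length(hits) == 1) hits else NA_character_
}

full <- vapply(
  seq_along(handles),
  function(i) find_original(handles[i], fragments[i], originals),
  character(1)
)

# Rebuild only the retweets for which a unique original was found.
fixed_data <- my_data
found <- !is.na(full)
fixed_data$text[found] <- paste0("RT @", handles[found], ": ", full[found])
```

Retweets with zero or several candidate originals keep their truncated text here. Those leftover rows are where a distance-based fallback (for example, Levenshtein distance from the stringdist package) might be worth trying, though that is not shown in this sketch.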
[Discussion]:
Tags: r regex pattern-matching string-matching