【Title】: Split two words connected by a dot
【Posted】: 2020-03-15 03:31:02
【Question】:

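I have a large data frame containing news articles. I noticed that some articles have two words joined by a dot, as in this example: The government.said it was important to quit. I am going to do some topic modeling, so I need every word to be separated.

This is the code I am using to separate these words: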

    #String example
    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")

    #Code to separate the words
    test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))

    #This is what I get
    > test
    [1] "i need to separate the words connected by dots  however, I need to keep having the dots separating sentences"

As you can see, this removes every dot (period) from the text. How can I get the following result instead:

"i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"

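Final note

My data frame consists of 17,000 articles, and all of the text is lowercase. The string above is only a small example of the problem I run into when trying to separate two words joined by a dot. Also, is there a way to use strsplit on a list?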

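【Comments on the question】:

  • Try gsub("\\b\\.\\b", " ", test, perl=TRUE). It replaces dots that sit between letters/digits/underscores with a space. If that is not what you need, could you explain in more detail the context in which you want to remove dots?
  • It works. Is there any chance I can apply this code without modifying URLs? My data frame is made up of different news articles, and some of them contain URLs. I would like to keep them, but this code will certainly change them.
  • Please provide an example and update the question.
  • I will post a new question, since you have already answered this one correctly. Thanks!
  • No, please don't, I will post an answer here.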

Tags: r regex string strsplit


【Solution 1】:

You can use

    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences. Look at http://google.com for s.0.m.e more details.")
    # Replace each dot that is in between word characters
    gsub("\\b\\.\\b", " ", test, perl=TRUE)
    # Replace each dot that is in between letters
    gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", test, perl=TRUE)
    # Replace each dot that is in between word characters, but not in URLs
    gsub("(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b", " ", test, perl=TRUE)

See the R demo online.

Output:

    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s 0 m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s.0.m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google.com for s 0 m e more details."
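To see which dots each of the three patterns actually targets, one option is to list the match positions with gregexpr() instead of replacing anything. This is only a minimal sketch: the string x is a shortened, made-up sample rather than the one used above, and the names patterns and dot_positions are illustrative.

```r
# Shortened, made-up sample string (an assumption, not the string from the answer)
x <- "need.to stop. see http://a.com for s.0.m.e"

# The three patterns used in the gsub() calls
patterns <- c(
  word_boundary = "\\b\\.\\b",
  letters_only  = "(?<=\\p{L})\\.(?=\\p{L})",
  skip_urls     = "(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b"
)

# For each pattern, collect the character positions of the dots it matches
dot_positions <- lapply(patterns, function(p) {
  as.integer(gregexpr(p, x, perl = TRUE)[[1]])
})

print(dot_positions)
```

The dot inside the URL appears in the first two position vectors but not in the third, and the sentence-ending dot after "stop" appears in none of them.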

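Details

  • \b\.\b - a dot enclosed by word boundaries, i.e. a dot that has a word character (a letter, digit or underscore) both immediately before and immediately after it
  • (?<=\p{L})\.(?=\p{L}) - a dot that is immediately preceded and immediately followed by a letter ((?<=\p{L}) is a positive lookbehind, (?=\p{L}) is a positive lookahead)
  • (?:ht|f)tps?://\S*(*SKIP)(*F)|\b\.\b - the first alternative matches http/ftp or https/ftps, then ://, then zero or more non-whitespace characters; the (*SKIP)(*F) PCRE verbs then discard that match and resume the search from the position where (*SKIP) was reached, so dots inside URLs are never replaced. Everywhere else, the second alternative, \b\.\b, matches as described above.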
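As for applying this to the whole data set: gsub() is vectorized, so the same call can be run on an entire column, and lapply() covers the case where the texts are stored in a list. The sketch below assumes a data frame called articles with a text column; both names and the two sample sentences are hypothetical stand-ins for the real data.

```r
# Hypothetical data frame standing in for the real articles (names are assumptions)
articles <- data.frame(
  text = c("the government.said it was important to quit.",
           "read more at http://example.com today.it is free."),
  stringsAsFactors = FALSE
)

# Pattern from the answer: skip URLs, otherwise match dots between word characters
dot_between_words <- "(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b"

# gsub() is vectorized over its input, so one call cleans the whole column
articles$clean <- gsub(dot_between_words, " ", articles$text, perl = TRUE)

# If the texts are stored in a list, lapply() runs the same call per element
article_list <- as.list(articles$text)
clean_list <- lapply(article_list, function(s) {
  gsub(dot_between_words, " ", s, perl = TRUE)
})

# strsplit() takes a character vector and returns one list element per input string
tokens <- strsplit(articles$clean, " ", fixed = TRUE)

print(articles$clean)
```

Because strsplit() expects a character vector, a list of texts would need unlist() first, or a per-element lapply() call like the one used for gsub() here.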

【Comments】:
