How to separate string from numbers in R? Answers

【Question title】: How to separate string from numbers in R?
【Posted】: 2019-10-03 03:22:57
【Question】:

I have a wild and crazy text file, the head of which looks like this:

2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted> ????
2016-07-01 02:59:34 <name redacted> ????????????
2016-07-01 03:02:48 <name > British security is a little more rigorous...

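It goes on like this for a while. It's a big file, and I suspect it will be hard to annotate with the coreNLP library or package. I'm doing natural language processing. In other words, I'm curious how I would go about shaving off at least the dates, if not both the dates and the names.

But I think I will need the names, because eventually I want to be able to say that this person spoke 50 times and that person spoke 75 times, and so on, though that may be getting a bit ahead of myself.

Does this call for regex? I am working in R.

I haven't tried anything yet because I have no idea where to start. How would I write code in R that reads text selectively, picking out only the phrases and sentences that belong together meaningfully?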

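【Question comments】:

Are the names consistent in length? Are they wrapped in carets as shown? Is there a delimiter?
There are inconsistencies. For example, most of the time there is no space between the end of one line and the start of the next, but occasionally, rarely, there is: 2016-01-27 09:15:20 Hey 2016-01-27 09:15:22. So there is a space between Hey and 2016, but that is because the space is part of the message itself. If the message itself has no trailing space, the lines are jammed together like this: 2016-07-01 02:50:35 hey2016-07-01 02:51:26 waiting for plane to Edinburgh2016-07-01 02:51:45 Note that hey is right next to 2016. No space.
But there is always a space between the name and the carets. This is Google Hangouts data, by the way. The structure is this: there is always a date, separated by a space from the time that follows it, which is separated by a space from the name, which is separated by a space from the message itself. But again, the message itself may or may not end with a space.
Funnily enough, as soon as I paste it into an email or even into these Stack Overflow boxes, the structure is recognized immediately and the text box formats the text properly. In the text file itself, though, it looks like this.
2016-07-01 23:59:27 We're both signing off at the same time2016-07-02 00:00:04 :-)2016-07-02 00:00:28 I live you supercalagraa...phragrlous...esp..dociois2016-07-02 00:12:23 I love you :)2016-07-02 08:57:33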

Tags: r regex stanford-nlp data-cleaning regex-greedy

【Solution 1】:

This might not require an expression, but if you wish to use one, this expression might help you do that:

(.*)(\s<name.*)

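RegEx

If this wasn't your desired expression, you can modify or change it at regex101.com. You can add more boundaries if necessary.

RegEx Circuit

You can also visualize your expressions at jex.im:

JavaScript Demo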

const regex = /(.*)(\s<name.*)/gm;
const str = `2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name redacted> thinking about my boo
2016-07-01 02:52:07 <name reda> nothing crappy has happened, not really
2016-07-01 02:52:20 <name redac> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <name r> no idea what time it is or where I am really
2016-07-01 02:54:17 <name redacted> just know it's london
2016-07-01 02:56:44 <name redacted> you are probably asleep
2016-07-01 02:58:45 <name redacted> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name redacted> ?
2016-07-01 02:59:34 <name redacted> ???
2016-07-01 03:02:48 <name > British security is a little more rigorous...`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
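
Since the question is about R, here is a minimal sketch of how the same pattern might be applied there with regexec() and regmatches(). It is not part of the answer above. The two sample lines are copied from the question, and the names chat, m, stamps and rest are made up for illustration. Note that backslashes must be doubled inside an R string.

```r
# Two sample lines taken from the question's excerpt.
chat <- c(
  "2016-07-01 02:50:35 <name redacted> hey",
  "2016-07-01 03:02:48 <name > British security is a little more rigorous..."
)

# Same pattern as the JavaScript demo, with the backslash doubled for R.
m <- regmatches(chat, regexec("(.*)(\\s<name.*)", chat, perl = TRUE))

# Each element of m holds the full match followed by the two groups.
stamps <- vapply(m, function(g) g[2], character(1))  # date and time
rest   <- vapply(m, function(g) g[3], character(1))  # " <name ...> message"

print(stamps)
print(trimws(rest))
```

A line that does not match yields NA in both vectors, which makes unparsed lines easy to spot.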

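【Comments】:

JavaScript is not the expected language according to the tags and the question body. Also, regex conventions in R differ from those in JavaScript.

【Solution 2】:

Using base R regular expressions with the gsub function, each piece of information can be extracted. Take this file as an example: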

2016-07-01 02:50:35 <name1 surname1> hey
2016-07-01 02:51:26 <name1 surname1> waiting for plane to Edinburgh
2016-07-01 02:51:45 <name1 surname1> thinking about
2016-07-01 02:52:07 <name2 surname2> nothing crappy 
2016-07-01 02:52:20 <name2 surname2> plane went by pretty fast
2016-07-01 02:54:08 <name2 surname2> no idea
2016-07-01 02:54:17 <name2 surname2> just know it's london
2016-07-01 02:56:44 <name1 surname1> you are probably asleep
2016-07-01 02:58:45 <name1 surname1> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <name2 surname2> x
2016-07-01 02:59:34 <name1 surname2> y
2016-07-01 03:02:48 <name2 > British security is a little more rigorous...

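Now, in the R console, read the file as plain text and process the lines with a regular expression. The second argument of gsub selects which captured group of the pattern to return.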

your_data <- readLines(your_text_file)  # Reading 
pattern <- "(.*) <(\\S*) (\\S*)>(.*)" # The regex pattern
times <- gsub(pattern,"\\1",your_data) # Get Time and date
person_name <- gsub(pattern,"\\2 \\3",your_data) # Get name
message <- gsub(pattern,"\\4",your_data) # Get message
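
As a rough illustration of where this leads, the sketch below applies the same pattern to a few in-memory lines and counts messages per speaker, which is the tally the question asks about. The sample vector and the names chat, matched and counts are hypothetical. One detail worth knowing: gsub() returns any element that does not match the pattern unchanged, so lines without a <name> part come back whole from all three calls. Filtering with grepl() first avoids that.

```r
# Hypothetical sample lines; the last one has no <name> part at all.
chat <- c(
  "2016-07-01 02:50:35 <name1 surname1> hey",
  "2016-07-01 02:52:07 <name2 surname2> nothing crappy",
  "2016-07-01 02:56:44 <name1 surname1> you are probably asleep",
  "2016-01-27 09:15:20 hey"
)
pattern <- "(.*) <(\\S*) (\\S*)>(.*)"

# Keep only the lines that actually follow the expected layout.
matched <- grepl(pattern, chat)
person_name <- gsub(pattern, "\\2 \\3", chat[matched])

# Number of messages per speaker.
counts <- table(person_name)
print(counts)

# A non-matching line is returned as-is by gsub().
print(gsub(pattern, "\\1", chat[!matched]))
```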

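【Comments】:

I tried this, but I can't see any difference between the outputs. They all look the same, and the regex in your example doesn't look that different either. Isn't this meant to be tested? Is it just a hypothetical example? As in, I should write my own regex? But I agree with you that this should be possible. It would be nice to write a regex that does the job, that is, one that pulls each aspect of this data into its own column.
When I run that code, it looks like this: > head (person) [1] "ï»¿2016-01-27 09:14:40 *** Jane Doe started a video chat" [2] "2016-01-27 09:15:20 lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/…" [3] "2016-01-27 09:15:20 hey"
In fact, that is one of the annoying parts of this data. Line 2 (though it is not the only one like it) is very long: > test [2] [1] "2016-01-27 09:15:20 lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/…"
> test [150] [1] "2016-07-01 08:17:47 dinner"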

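【Solution 3】:

Using your pasted example text, we can do the following. Note that your description of how the text behaves when copy-pasted suggests to me that there actually are newline characters \n in the text, but without a reproducible example it's hard to say.

Split the single long string into lines by splitting on the boundary before a date. If people regularly type dates into their messages, you could extend the pattern to include the time and the name. If people type all of that into their messages, it gets complicated, but hopefully that only affects a few messages, which can then be dealt with line by line.
Put the lines into a data frame column and split them into name and message at the spaces before or after the carets (the angle brackets < and >).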

library(tidyverse)
text <- "2016-07-01 23:59:27 <John Doe> We're both signing off at the same time2016-07-02 00:00:04 <John Doe> :-)2016-07-02 00:00:28 <John Doe> I live you supercalagraa...phragrlous...esp..dociois2016-07-02 00:12:23 <Jane Doe> I love you :)2016-07-02 08:57:33"
text %>%
  str_split("(?=\\d{4}-\\d{2}-\\d{2})") %>%
  pluck(1) %>%
  enframe(name = NULL, value = "message") %>%
  separate(message, c("datetime", "name", "message"), sep = "\\s(?=<)|(?<=>)\\s", extra = "merge")
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1,
#> 6].
#> # A tibble: 6 x 3
#>   datetime           name      message                                     
#>   <chr>              <chr>     <chr>                                       
#> 1 ""                 <NA>      <NA>                                        
#> 2 2016-07-01 23:59:… <John Do… We're both signing off at the same time     
#> 3 2016-07-02 00:00:… <John Do… :-)                                         
#> 4 2016-07-02 00:00:… <John Do… I live you supercalagraa...phragrlous...esp…
#> 5 2016-07-02 00:12:… <Jane Do… I love you :)                               
#> 6 2016-07-02 08:57:… <NA>      <NA>

Created on 2019-05-16 by the reprex package (v0.2.1)
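
For anyone who prefers not to load the tidyverse, the same split-before-each-date idea can be sketched in base R. This is an alternative formulation, not the answer's code: it puts a newline in front of every timestamp and then splits on that marker, because base strsplit() treats zero-width lookahead matches differently from str_split(). The sample string and the names stamp, marked and chat_lines are made up for illustration, and the sketch assumes the text contains no newlines of its own.

```r
# Hypothetical run-together text in the style of the question.
text <- paste0(
  "2016-07-01 23:59:27 <John Doe> We're both signing off at the same time",
  "2016-07-02 00:00:04 <John Doe> :-)",
  "2016-07-02 00:12:23 <Jane Doe> I love you :)"
)

# A full timestamp: date, one space, time.
stamp <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})"

# Insert a newline before every timestamp, then split on the newlines.
marked <- gsub(stamp, "\n\\1", text)
chat_lines <- strsplit(marked, "\n", fixed = TRUE)[[1]]

# The leading newline produces one empty element; drop it.
chat_lines <- chat_lines[nzchar(chat_lines)]
print(chat_lines)
```

Matching the time as well as the date makes a stray date typed inside a message less likely to trigger a split.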

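【Comments】:

I was able to reproduce what you just did (since I simply cut and pasted it into R). However, when I try it with the whole file, which is huge, it returns only two broken lines. That may be because the first and second lines of the whole file are odd: 2016-01-27 09:14:40 *** Jane Doe started a video chat 2016-01-27 09:15:20 lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/… 2016-01-27 09:15:20 hey
The first 84 lines are structured like this: > test2 [84] [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> test [85] [1] "2016-07-01 02:50:35 hey"

【Solution 4】:

With some help, I was able to figure it out.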

# Read the export as UTF-8 text, one element per line
a <- readLines("hangouts-conversation-6.txt", encoding = "UTF-8")
# Rewrite "*** First Last" system lines into the "<First Last> " form
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
c <- gsub(b, "\\1<\\2> ", a)
# Capture date, time, name and message text
d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
e <- data.frame(date = character(),
                time = character(),
                name = character(),
                text = character(),
                stringsAsFactors = TRUE)
f <- strcapture(d, c, e)
# Drop the first row
f <- f[-c(1), ]

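The first row was all NA, hence the final step that drops it with -c(1).

【Comments】:

Again, this should work for other files of Google Hangouts data. That is where the code came from.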
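
To see the two steps on something small, here is a sketch that runs the same two patterns over three made-up lines. The sample data and the names sample_lines, system_pat, full_pat, proto and parsed are hypothetical. It also differs from the code above in one respect: instead of dropping row 1 by position, it drops every row that failed to parse, on the assumption that unparsed lines are not wanted. strcapture() returns a row of NA for any line the pattern does not match.

```r
# Hypothetical lines: a system message, a normal message, and junk.
sample_lines <- c(
  "2016-01-27 09:14:40 *** Jane Doe started a video chat",
  "2016-07-01 02:50:35 <John Doe> hey",
  "this line does not follow the layout"
)

# Step 1: rewrite "*** First Last" into the "<First Last> " form.
system_pat <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
normalised <- gsub(system_pat, "\\1<\\2> ", sample_lines)

# Step 2: capture date, time, name and text into typed columns.
full_pat <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
proto <- data.frame(date = character(), time = character(),
                    name = character(), text = character(),
                    stringsAsFactors = FALSE)
parsed <- strcapture(full_pat, normalised, proto)

# Lines that did not match come back as all-NA rows; remove them.
parsed <- parsed[complete.cases(parsed), ]
print(parsed)

# Messages per speaker.
print(table(parsed$name))
```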