【Question title】: lookbehind in str_extract with R
【Posted】: 2014-02-06 15:27:05
【Question】:

I have the following text file:

[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:43:00, 10.100.120.120, unknown]: spatial_monitor: Kurt entered Conference Room (Computer desk contains Person role)
[01/29/14 16:43:02, 10.100.120.120, unknown]: spatial_monitor: Kurt left Conference Room (Computer desk contains Person role)
[01/29/14 16:43:03, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:43:08, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:46:07, 10.100.120.120, unknown]: spatial_monitor: Fred entered Conference Room (Zone Role contains Person role)
[01/29/14 16:46:08, 10.100.120.120, unknown]: spatial_monitor: Fred left Conference Room (Zone Role contains Person role)

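I am trying to use str_extract (from the stringr library) in R to extract the name of the location ("Conference Room" in the example above). The logic is to pull out the part of the string that follows "entered" or "left". For this I have the following regular expression: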

(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+

This works fine in Notepad++, but when I embed it in R I get the following error:

> tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
> str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+')
Error in regexpr("(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+", "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",  : 
  invalid regular expression '(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+', reason 'Invalid regexp'

Other answers told me that lookahead and lookbehind only work with Perl. So the question is: how do I enable Perl with str_extract? Or is there a better way to do this? Thanks in advance.

【Comments】:

  • This works and does not use lookahead/lookbehind. Wrap the part you want to extract in parentheses, like this: library(gsubfn); strapplyc(tt, 'entered\\s([A-Z][a-z]+\\s[A-Z][a-z]+)', simplify = TRUE)
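The capture-group idea from the comment above can also be sketched with stringr itself, assuming a reasonably recent stringr is installed. `str_match()` returns a character matrix whose first column is the full match and whose later columns are the capture groups. The `(?:entered|left)` alternation is an addition for illustration, following the question's stated logic; it is not part of the original comment.

```r
library(stringr)

tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"

# Non-capturing group for the verb, capturing group for the location.
# No lookbehind is needed because the location is read from the group.
m <- str_match(tt, "(?:entered|left)\\s([A-Z][a-z]+\\s[A-Z][a-z]+)")

# Column 1 is the whole match, column 2 is the first capture group.
location <- m[, 2]
location
# [1] "Conference Room"
```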

Tags: regex r perl


【Solution 1】:
library(stringr)
tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
str_extract(tt, perl('(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+'))
# [1] "Conference Room"

Update: as of stringr 1.3.0 (2018-02-19), perl() has been removed. You can now simply call str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+').
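As a rough sketch of how this might look on several log lines at once, assuming a current stringr (whose regex engine supports bounded lookbehind): `str_extract()` is vectorised over its input, and the lookbehind can list both "entered" and "left" as alternatives. The names `log_lines` and `location_pattern` are made up for this example.

```r
library(stringr)

# Two lines taken from the log in the question.
log_lines <- c(
  "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",
  "[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)"
)

# Lookbehind with two alternatives, so either verb can precede the location.
location_pattern <- "(?<=entered\\s|left\\s)[A-Z][a-z]+\\s[A-Z][a-z]+"

# One result per input line; lines without a match would give NA.
locations <- str_extract(log_lines, location_pattern)
locations
# [1] "Conference Room" "Conference Room"
```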

【Discussion】:

    【Solution 2】:

    Your regular expression is valid. It works with sub if you specify perl = TRUE, so you can also use the sub function for your task:

    sub('.*(?<=entered\\s)([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt, perl = TRUE)
    # [1] "Conference Room"
    

    Or, without perl = TRUE:

    sub('.*entered\\s([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt)
    # [1] "Conference Room"
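If the goal is to extract the match rather than rewrite the string, a base R alternative along the same lines is `regmatches()` combined with `regexpr(..., perl = TRUE)`. This is a minimal sketch rather than part of the original answer; the variable names are illustrative.

```r
# Two lines taken from the log in the question.
log_lines <- c(
  "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",
  "[01/29/14 16:43:02, 10.100.120.120, unknown]: spatial_monitor: Kurt left Conference Room (Computer desk contains Person role)"
)

pattern <- "(?<=entered\\s|left\\s)[A-Z][a-z]+\\s[A-Z][a-z]+"

# perl = TRUE switches regexpr() to the PCRE engine, which accepts lookbehind.
match_positions <- regexpr(pattern, log_lines, perl = TRUE)

# regmatches() returns only the matched substrings; lines without a match
# are dropped rather than returned as NA.
locations <- regmatches(log_lines, match_positions)
locations
# [1] "Conference Room" "Conference Room"
```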
    

    【Discussion】:
