Extract part of string: date and times (answers)

[Question title]: Extract part of string: date and times
[Posted]: 2019-07-30 11:28:50
[Question]:

I have a variable that usually contains some gibberish, for example:

\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n

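I am trying to extract the date (30.07.2019) and the time (12:00 - 14:30). I am not very good at parsing, so some help with implementing this in R would be much appreciated.

[Discussion]:

Tags: r regex date datetime stringr

[Solution 1]:

If you can rely on the date and time parts appearing only once in the data, you can use regular expressions to extract them (a data frame is used here):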

library(tidyverse)
data <-
   tibble(gibberish_string = "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n")

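# date: first dd.mm.yyyy match; time: the full "hh:mm - hh:mm" range,
# not just the start time
data %>% mutate(date = str_extract(gibberish_string,
                                   pattern = "\\d{1,2}\\.\\d{1,2}\\.\\d{4}"),
                time = str_extract(gibberish_string,
                                   pattern = "\\d{1,2}:\\d{2}\\s*-\\s*\\d{1,2}:\\d{2}"))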
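
If the start and end times are needed as separate values, one option is a single pattern with capture groups. The following is only a sketch: the combined pattern, the variable names, and the choice of `tz = "UTC"` are assumptions for illustration and are not part of the original answer. It also assumes the date always precedes the time range, with no digits in between.

```r
library(stringr)

x <- "\n\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\tTau ski 2342342 2342342\n"

# Three capture groups: date, start time, end time.
# "\\D*?" lazily skips the non-digit text between them (here " klo ").
pattern <- "(\\d{1,2}\\.\\d{1,2}\\.\\d{4})\\D*?(\\d{1,2}:\\d{2})\\s*-\\s*(\\d{1,2}:\\d{2})"

m <- str_match(x, pattern)  # column 1 is the full match, then one column per group
date_chr  <- m[, 2]
start_chr <- m[, 3]
end_chr   <- m[, 4]

# Convert to proper date/time classes; the time zone is an assumption.
date  <- as.Date(date_chr, format = "%d.%m.%Y")
start <- as.POSIXct(paste(date_chr, start_chr), format = "%d.%m.%Y %H:%M", tz = "UTC")
end   <- as.POSIXct(paste(date_chr, end_chr),   format = "%d.%m.%Y %H:%M", tz = "UTC")
```

With real classes in hand, something like `difftime(end, start, units = "mins")` can give the duration.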

[Discussion]:

I like this. Thanks!

[Solution 2]:

Split the string, then extract the date and the times:

x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"

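lapply(strsplit(x, "[\n\t ]"), function(i){
  # escape the dots so they match a literal "." and require a 4-digit year
  dd <- i[ grepl("[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}", i) ]
  tt <- i[ grepl("[0-9]{2}:[0-9]{2}", i) ]
  # join start and end time into a single "hh:mm-hh:mm" string
  c(dd, paste(tt, collapse = "-"))
})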

# [[1]]
# [1] "30.07.2019"  "12:00-14:30"
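
When there are several strings and some may contain no date or time, the same split-and-filter idea can be wrapped in a helper that returns one row per input. This is a rough sketch, not part of the original answer: the function name `extract_date_time` and the choice to return `NA` for missing parts are assumptions.

```r
# Hypothetical helper: split each string into tokens, keep the tokens that
# look like a date or a time, and return one data frame row per input string.
extract_date_time <- function(strings) {
  rows <- lapply(strsplit(strings, "[\n\t ]"), function(tokens) {
    dd <- tokens[grepl("^[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}$", tokens)]
    tt <- tokens[grepl("^[0-9]{2}:[0-9]{2}$", tokens)]
    data.frame(
      date = if (length(dd)) dd[1] else NA_character_,
      time = if (length(tt)) paste(tt, collapse = "-") else NA_character_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

x <- c("\n\tti 30.07.2019 klo 12:00 - 14:30\n\tTau ski 2342342 2342342\n",
       "no date or time here")
res <- extract_date_time(x)
```

Anchoring the patterns with `^` and `$` means a token must be exactly a date or a time, which avoids partial matches inside longer tokens.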

[Discussion]:

[Solution 3]:

For the date:

(\d{1,2}[\.\/]){2}((\d{4})|(\d{2}))


For the time:

\d{1,2}:\d{2}\s?-\s?\d{1,2}:\d{2}
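
These patterns can be applied in R roughly as follows. This is a minimal sketch using base R's `regexpr()` and `regmatches()`; the variable names are illustrative, and the sample string is shortened. Note that backslashes must be doubled inside R string literals, and that `[\.\/]` can be written as `[./]` because neither character needs escaping inside a character class.

```r
x <- "\n\t\tSeuat eselyt\n\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\tTau ski 2342342 2342342\n"

# Same patterns as above, written as R string literals
date_pattern <- "(\\d{1,2}[./]){2}(\\d{4}|\\d{2})"
time_pattern <- "\\d{1,2}:\\d{2}\\s?-\\s?\\d{1,2}:\\d{2}"

# regexpr() finds the first match; regmatches() returns the matched text
date <- regmatches(x, regexpr(date_pattern, x, perl = TRUE))
time <- regmatches(x, regexpr(time_pattern, x, perl = TRUE))

date
time
```

If a string has no match, `regmatches()` returns a zero-length result for it, so it is worth checking the length before using the value.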


[Discussion]:

This is not bad, but I don't understand how to implement it in R.

[Solution 4]:

A verbose, step-by-step base/stringr approach:

tst<-"\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"
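# Strip newlines and tabs
cleaner <- gsub("\\n|\\t", "", tst)
# Split at whitespace that is followed by a lowercase letter
split_txt <- strsplit(cleaner, "\\s(?=[a-z])", perl = TRUE)
dates <- stringr::str_extract_all(unlist(split_txt),
                                  "\\d{1,}\\.\\d{2,}\\.\\d{4}")
# Remove letters, then keep the pieces that contain a hyphen (the time range)
times <- stringr::str_extract_all(stringr::str_remove_all(unlist(split_txt),
                                                          "[A-Za-z]"),
                                  ".*\\-.*")
dates[lengths(dates) > 0]
# [[1]]
# [1] "30.07.2019"

trimws(times[lengths(times) > 0])
# [1] "12:00 - 14:30"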

[Discussion]: