Sparklyr/Hive: how to use regex (regexp_replace) correctly?

【Question title】: Sparklyr/Hive: how to use regex (regexp_replace) correctly?
【Posted】: 2017-11-23 08:09:13
【Question description】:

Consider the following example:

dataframe_test<- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))

# A tibble: 2 x 1
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)

> sdf
# Source:   table<dataframe_test> [?? x 1]
# Database: spark_connection
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

I would like to modify the character timestamp so that it has a more conventional format. I tried to do so using regexp_replace, but it fails.

> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source:   lazy query [?? x 2]
# Database: spark_connection
                    mydate                    regex
                     <chr>                    <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z

Any ideas? What is the correct syntax?

【Comments】:

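The pattern is correct (though you could use a literal . instead of the wildcard); you are just using the wrong function.
Wait a moment. Which function should I use? Your link actually specifies the same function I used.
Look closer - it is regexp_replace, not regexp_extract :)
I believe this is still a duplicate - I was just wrong about the pattern. Note that it has to match the whole string and that you didn't escape everything: sdf %>% mutate(regex = regexp_replace(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', '$1-$2-$3 $4:$5:$6.$7')). You could use regexp_extract, but it would require enumerating all the fields: sdf %>% mutate(regex = regexp_extract(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', 1))
I'm afraid you have to escape once for R and once for Java. If you think this should be a separate answer, I can reopen it.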

Tags: r apache-spark hive sparklyr

【Solution 1】:

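Spark SQL and Hive provide two different functions:

regexp_extract - takes a string, a pattern, and the index of the group to extract.
regexp_replace - takes a string, a pattern, and a replacement string.

The former can be used to extract a single group, with the same index semantics as java.util.regex.Matcher.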

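regexp_replace replaces only the part of the string that the pattern matches, so to reduce the value to a captured group the pattern has to match the whole string. If there is no match, the input string is returned unchanged: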

sdf %>% mutate(
 regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
 regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))

## Source:   query [2 x 3]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 3
##                     mydate regex               regexp_bad
##                      <chr> <chr>                    <chr>
## 1 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z
## 2 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z
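
The difference between the two columns above can be reproduced locally without Spark. The following is a minimal base R sketch, offered only as an analogue: it assumes that gsub, like regexp_replace, rewrites every match and leaves unmatched text alone, and it uses R's \\1 backreference where Spark uses $1. The variable names are illustrative.

```r
# Base R analogue of the two regexp_replace calls (not Spark code).
x <- "2011-03-01T00:00:04.226Z"

# The pattern covers the whole string, so the whole string
# collapses to the first captured group.
whole <- gsub("^([0-9]{4}).*", "\\1", x)

# The pattern covers only the year, so the year is replaced by
# itself and the rest of the string is left untouched.
partial <- gsub("([0-9]{4})", "\\1", x)

print(whole)
print(partial)
```

Here whole holds only the year, while partial is identical to the input, which mirrors the regex and regexp_bad columns.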

regexp_extract does not require the pattern to match the whole string:

sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))

## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate regex
##                      <chr> <chr>
## 1 2011-03-01T00:00:04.226Z  2011
## 2 2011-03-01T00:00:04.226Z  2011
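
To make the group index concrete, here is a small base R sketch (an analogue, not Spark code). It assumes the java.util.regex.Matcher convention that group 0 is the entire match and group n is the n-th parenthesised group; because R vectors are 1-based, every position below is shifted by one.

```r
# Base R analogue of capture-group indexing (not Spark code).
x <- "2011-03-01T00:00:04.226Z"

# regexec returns the whole match followed by each captured group.
m <- regmatches(x, regexec("([0-9]{4})-([0-9]{2})", x))[[1]]

whole_match <- m[1]  # corresponds to group 0 in Java
year        <- m[2]  # corresponds to group 1
month       <- m[3]  # corresponds to group 2

print(m)
```

With regexp_extract, passing 1 as the last argument therefore selects what this sketch calls year.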

In addition, because of the indirect execution (R -> Java), you have to escape twice:

sdf %>% mutate(
  regex = regexp_replace(
    mydate, 
    '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
    '$1-$2-$3 $4:$5:$6.$7'))
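
The reason for the four backslashes can be checked in plain R. The sketch below is illustrative only: sql_unescape is a hypothetical helper, a simplified stand-in for the way a SQL string literal is unescaped before the regex engine sees it (it ignores special escapes such as \n). It also assumes that sparklyr passes the R string into the generated SQL without adding backslashes of its own.

```r
# Count the backslashes that survive each layer (illustrative only).
r_once  <- '(\\d{4})'    # after R parsing: (\d{4})
r_twice <- '(\\\\d{4})'  # after R parsing: (\\d{4})

writeLines(r_once)
writeLines(r_twice)

# Hypothetical helper: drop one backslash and keep the character
# that follows it, as a simplified model of SQL literal unescaping.
sql_unescape <- function(s) gsub("\\\\(.)", "\\1", s)

seen_once  <- sql_unescape(r_once)   # (d{4}), i.e. four literal "d"s
seen_twice <- sql_unescape(r_twice)  # (\d{4}), i.e. four digits

# Only the doubly escaped pattern still matches digits.
print(grepl(seen_once,  "2011", perl = TRUE))
print(grepl(seen_twice, "2011", perl = TRUE))
```

Under these assumptions the singly escaped pattern from the question never matches, which is why the input came back unchanged.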

Normally one would use the Spark datetime functions instead:

spark_session(sc) %>%  
  invoke("sql",
    "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed from dataframe_test") %>% 
  sdf_register


## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate                  parsed
##                      <chr>                   <chr>
## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226

Sadly, sparklyr seems to be extremely limited in this area, and treats timestamps as strings.

See also change string in DF using hive command and mutate with sparklyr.

【Discussion】:

Very interesting solution

【Solution 2】:

I had some difficulty replacing "." with "", but in the end it worked with:

mutate(myvar2=regexp_replace(myvar, "[.]", ""))
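
A likely reason the character class works is that it contains no backslashes, so it sidesteps the escaping layers entirely. The base R sketch below (an analogue, not Spark code; variable names are illustrative) compares the ways of writing a dot in a pattern.

```r
# Base R analogue of replacing a literal dot (not Spark code).
x <- "a.b.c"

all_gone   <- gsub(".", "", x)                # bare dot matches every character
dots_class <- gsub("[.]", "", x)              # character class: literal dot only
dots_esc   <- gsub("\\.", "", x)              # escaped dot (one R-level escape)
dots_fixed <- gsub(".", "", x, fixed = TRUE)  # no regex interpretation at all

print(c(all_gone, dots_class, dots_esc, dots_fixed))
```

In sparklyr the escaped form would presumably need four backslashes for the reason given in Solution 1, whereas "[.]" can be written the same way at every layer.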

【Discussion】: