【Question title】: How to merge rows by a similar column via Levenshtein distance
【Posted】: 2019-09-02 12:08:26
【Question】:

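I'm using AWS Athena, and I'm trying to merge all rows whose values in a particular column are within a levenshtein_distance of less than 5 of each other, summing their normalized percentages.

The table structure is as follows: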

CREATE EXTERNAL TABLE `actions`(
  `id` string COMMENT 'from deserializer', 
  `text` string COMMENT 'from deserializer',
  `normalizedpercentage` float COMMENT 'from deserializer', 
  `timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://xxxxxx/db/actions'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1566991410')

This is what I would like to do:

WITH t AS 
    (SELECT id,
         text,
         normalizedPercentage
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
            AND timestamp '2019-08-31 23:59:59' )
SELECT *,
         SUM(normalizedPercentage)
    OVER (PARTITION BY levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5) AS cumulative
FROM t

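Unfortunately, the PARTITION BY clause only accepts column names.

I was thinking of defining a function and using it to iterate over all the rows, but that doesn't seem to be possible in Presto.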

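【Comments】:

    Tags: sql presto amazon-athena


    【Solution 1】:

    You can compute a new column in the temporary table based on your function, then partition by that column in the main query: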

    WITH t AS 
    (SELECT id,
         text,
         normalizedPercentage,
         -- EVERY_OTHER_TEXT_COLUMN is a placeholder, not a real column
         CASE
             WHEN levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5 THEN 'groupA'
             ELSE 'groupB'
         END AS classification
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
            AND timestamp '2019-08-31 23:59:59' )
    SELECT *,
         SUM(normalizedPercentage)
    OVER (PARTITION BY classification) AS cumulative
    FROM t
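
    The EVERY_OTHER_TEXT_COLUMN placeholder in the query above has no direct equivalent in a window function, because a PARTITION BY expression is evaluated one row at a time. One possible way to compare each row against every other row is a self-join, sketched below. This is only a sketch under some assumptions: id is assumed to be unique per row, the aliases a and b and the names pairs and other_pct are made up for illustration, and the query has not been run against the original data.

    ```sql
    WITH t AS
        (SELECT id,
             text,
             normalizedpercentage
        FROM actions
        WHERE actions.timestamp
            BETWEEN timestamp '2019-08-01 00:00:01'
                AND timestamp '2019-08-31 23:59:59' ),
    pairs AS
        -- Pair every row (a) with every row (b) whose text is within distance 5.
        -- A row always pairs with itself (distance 0), so its own value is included.
        (SELECT a.id,
             a.text,
             b.normalizedpercentage AS other_pct
        FROM t a
        CROSS JOIN t b
        WHERE levenshtein_distance(a.text, b.text) < 5 )
    -- One output row per original row, with the sum over all of its near matches.
    SELECT id,
         text,
         SUM(other_pct) AS cumulative
    FROM pairs
    GROUP BY id, text
    ```

    Note that this yields a per-row total over its neighbours rather than disjoint groups: similarity by edit distance is not transitive, so row A may be close to B and B close to C while A and C are not. The self-join also compares every pair of rows, so the cost grows quadratically with the number of rows in the selected date range.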
    

    【Comments】:

    • Thanks, but the question is still unanswered. How do you reference every row in the levenshtein function? (EVERY_OTHER_TEXT_COLUMN)