【Question title】: Is there a UDF that can trim whitespace from every STRING column in a table?
【Posted】: 2021-11-20 01:01:57
【Question description】:

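I am a BigQuery user and need to trim the whitespace from every column of data type STRING in a table.

Each table typically has hundreds of fields, so applying TRIM() to them one by one is very inefficient.

Is there a better solution, or can a UDF be created to handle this?

Sample table: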

with `project.dataset.tablename` as 
(
select " 1" as number," Male" as Sex,12 as Age,"A " as level union all 
select " 2" as number," Male" as Sex,11 as Age,"A- " as level union all 
select " 3" as number,"Female " as Sex,9 as Age," A " as level union all 
select "4 " as number,"Female" as Sex,13 as Age,"   A " as level union all 
select "5 " as number,"Male " as Sex,10 as Age," B" as level
)

【Comments】:

    Tags: google-bigquery bigquery-udf


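    【Solution 1】:

    If you need to trim whitespace from the STRING columns of a BigQuery table, you can use the TRIM() function, which removes leading and trailing whitespace.

    Using your sample table: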

    WITH table_A as 
    (
    select " 1" as number," Male" as Sex,12 as Age,"A " as level union all 
    select " 2" as number," Male" as Sex,11 as Age,"A- " as level union all 
    select " 3" as number,"Female " as Sex,9 as Age," A " as level union all 
    select "4 " as number,"Female" as Sex,13 as Age,"   A " as level union all 
    select "5 " as number,"Male " as Sex,10 as Age," B" as level
    )
    
    SELECT TRIM(number) AS number, TRIM(Sex) AS Sex, Age, TRIM(level) AS level FROM table_A
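
With hundreds of columns, listing every column by hand becomes tedious. BigQuery's `SELECT * REPLACE (...)` syntax allows the query to name only the STRING columns and pass all other columns through unchanged. The following is a minimal Python sketch that assembles such a statement; the helper name `build_trim_select` and the hard-coded column list are assumptions for illustration and are not part of the original answer.

```python
def build_trim_select(table_id, string_columns):
    """Build a SELECT that trims the given STRING columns and keeps the rest.

    Hypothetical helper: table_id and string_columns are supplied by the
    caller; nothing is looked up in BigQuery here.
    """
    if not string_columns:
        # Nothing to trim, so a plain SELECT * is enough.
        return "SELECT * FROM `{}`".format(table_id)
    replacements = ", ".join(
        "TRIM(`{0}`) AS `{0}`".format(col) for col in string_columns
    )
    return "SELECT * REPLACE ({}) FROM `{}`".format(replacements, table_id)


if __name__ == "__main__":
    # Column names taken from the sample table in the question.
    print(build_trim_select("project.dataset.tablename", ["number", "Sex", "level"]))
```

Only the STRING columns appear in the generated text, so the statement stays short even when most columns of a wide table are of other types.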
    

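    On the other hand, according to this StackOverflow post, a UDF cannot be used to build dynamic SQL statements.

    However, you can use one of the following approaches:

    1- Stored procedure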

    DECLARE i INT64 DEFAULT 1;
    DECLARE j INT64 DEFAULT 0;
    DECLARE x INT64 DEFAULT 1;
    DECLARE z INT64 DEFAULT 0;
    DECLARE update_query STRING;
    DECLARE string_query STRING;
    DECLARE column_name STRING;
    DECLARE project_id STRING;
    DECLARE dataset_name STRING;
    DECLARE table_name STRING;
    
    SET project_id = '<project_id>';
    SET dataset_name = '<dataset_name>';
    SET table_name = '<table_name>';
    
    SET string_query = 'CREATE OR REPLACE TEMP TABLE temp_tables AS '||
    'SELECT table_name, RANK() OVER(ORDER BY table_name) rownum '||
    'FROM (SELECT table_name '||
            'FROM '||project_id||'.'||dataset_name||'.INFORMATION_SCHEMA.TABLES '||
            'WHERE table_schema = "'||dataset_name||'"';
    
    IF table_name <> '' THEN
        SET string_query = string_query || ' AND table_name="'||table_name||'"';
    END IF;
    
    SET string_query = string_query ||')';
    
    SELECT string_query;
    EXECUTE IMMEDIATE string_query;
    
    SET string_query = 'SELECT COUNT(*) FROM temp_tables';
    EXECUTE IMMEDIATE string_query INTO j;
    
          WHILE i<=j DO
                SET table_name = (SELECT table_name FROM temp_tables WHERE rownum = i);
                
                SET string_query = 'CREATE OR REPLACE TEMP TABLE temp_columns AS '||
                'SELECT column_name, RANK() OVER(ORDER BY column_name) rownum '||
                'FROM ('||
                    'SELECT column_name '||
                    'FROM '||project_id||'.'||dataset_name||'.INFORMATION_SCHEMA.COLUMNS '||
                    'WHERE table_name = "'||table_name||'" AND data_type="STRING")';
    
                EXECUTE IMMEDIATE string_query;
    
                SET string_query = 'SELECT COUNT(*) FROM temp_columns';
                EXECUTE IMMEDIATE string_query INTO z;
    
                SET update_query = "";
    
                WHILE x<=z DO
                    SET column_name = (SELECT column_name FROM temp_columns WHERE rownum = x);
                    -- Backtick-quote the column name so reserved words and unusual names still parse.
                    SET update_query = update_query||'`'||column_name||'`=TRIM(`'||column_name||'`)';
    
                    IF x<z THEN
                        SET update_query = update_query || ',';
                    END IF;
                    SET x=x+1;
                END WHILE;
                SET x = 1;
                -- Skip tables without STRING columns; an empty SET clause would be invalid.
                IF z > 0 THEN
                    SET string_query = 'UPDATE `'||project_id||'.'||dataset_name||'.'||table_name||'` SET '||update_query||' WHERE 1=1';
                    EXECUTE IMMEDIATE string_query;
                END IF;
                DROP TABLE temp_columns;
                SET i = i+1;
        END WHILE;
    DROP TABLE temp_tables;
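
The inner loop of the script above concatenates one `column=TRIM(column)` assignment per STRING column and then wraps the result in an UPDATE. The same string assembly can be sketched in a few lines of Python, which may make the generated statement easier to inspect before running it. The function name `build_trim_update` is hypothetical, and the sketch only builds text; it does not contact BigQuery.

```python
def build_trim_update(project_id, dataset_name, table_name, string_columns):
    """Return an UPDATE that trims every listed column, or None if the list is empty.

    Hypothetical helper mirroring the script's string concatenation.
    """
    if not string_columns:
        # An UPDATE with an empty SET clause is not valid SQL.
        return None
    assignments = ", ".join(
        "`{0}` = TRIM(`{0}`)".format(col) for col in string_columns
    )
    # BigQuery requires a WHERE clause on UPDATE; WHERE 1=1 matches every row.
    return "UPDATE `{}.{}.{}` SET {} WHERE 1=1".format(
        project_id, dataset_name, table_name, assignments
    )


if __name__ == "__main__":
    print(build_trim_update("<project_id>", "<dataset_name>", "<table_name>", ["Sex", "level"]))
```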
    

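    2- Use the BigQuery client libraries to build the statement dynamically:

    • Query the table metadata to get the column names and column types

    • Use that information to build the trim query dynamically

    For example, with the Python library: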

    from google.cloud import bigquery
    
    client = bigquery.Client()
    
    def get_trim_query(project, dataset, table):
        
        query_job = client.query(
            """
            SELECT column_name, data_type
          FROM {}.{}.INFORMATION_SCHEMA.COLUMNS
          WHERE table_name = '{}'""".format(project,dataset,table)
        )
    
        results = query_job.result()  # Waits for job to complete.
        
        metadata = []
    
        for row in results:
            if row.data_type == "STRING":
                # Alias the trimmed column so it keeps its original name.
                metadata.append("TRIM({0}) AS {0}".format(row.column_name))
            else:
                metadata.append(row.column_name)
        
        query = "SELECT {} FROM `{}.{}.{}`".format(', '.join(metadata), project, dataset, table)
    
        return query
    
    if __name__ == "__main__":
        query = get_trim_query("<PROJECT>","<DATASET>","<ORIGIN_TABLE>")
        print(query)
    
        destination_table_id = "<PROJECT>.<DATASET>.<DESTINATION_TABLE>"
    
        job_config = bigquery.QueryJobConfig(destination=destination_table_id)
    
        sql = query
    
        query_job = client.query(sql, job_config=job_config)  # Make an API request.
        query_job.result()  # Wait for the job to complete.
    
        print("Query results loaded to the table {}".format(destination_table_id))
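
As a possible variation, the column names and types could be read from the table schema returned by `client.get_table(...)` instead of querying INFORMATION_SCHEMA. The sketch below keeps the expression-building step as a pure function over `(name, type)` pairs so it can be tried without credentials; the function name `trim_expressions` and the sample pairs are assumptions for illustration.

```python
def trim_expressions(fields):
    """Map (column_name, field_type) pairs to SELECT-list expressions.

    Hypothetical helper: STRING columns are wrapped in TRIM() and aliased
    back to their own name; all other columns are passed through.
    """
    expressions = []
    for name, field_type in fields:
        if field_type == "STRING":
            expressions.append("TRIM(`{0}`) AS `{0}`".format(name))
        else:
            expressions.append("`{}`".format(name))
    return expressions


if __name__ == "__main__":
    # Hard-coded pairs mirroring the sample table in the question.
    sample = [("number", "STRING"), ("Sex", "STRING"), ("Age", "INTEGER"), ("level", "STRING")]
    print(", ".join(trim_expressions(sample)))
    # With a real client, the pairs could come from the table schema instead:
    #   table = client.get_table("<PROJECT>.<DATASET>.<ORIGIN_TABLE>")
    #   fields = [(f.name, f.field_type) for f in table.schema]
```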
    

    【Comments】:
