【问题标题】:python SQL parser to get the column name and data typepython SQL解析器获取列名和数据类型
【发布时间】:2020-08-04 12:53:43
【问题描述】:

我正在使用 python SQL 解析器来获取表信息。我能够得到表名和模式名。

import sqlparse
line = '''
CREATE TABLE public.actor (
    actor_id integer DEFAULT nextval('public.actor_actor_id_seq'::regclass) NOT NULL,
    first_name character varying(45) NOT NULL,
    last_name character varying(45) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

CREATE TABLE public.category (
    category_id integer DEFAULT nextval('public.category_category_id_seq'::regclass) NOT NULL,
    name character varying(25) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

CREATE TABLE IF NOT EXISTS "sample_schema"."sample_table"
(
    "div_cd" VARCHAR(2) NOT NULL
    ,"div_name" VARCHAR(30) NOT NULL
    ,"org_cd" VARCHAR(8) NOT NULL
    ,"org_name" VARCHAR(60) NOT NULL
    ,"team_cd" VARCHAR(2) NOT NULL
    ,"team_name" VARCHAR(120) NOT NULL
    ,"personal_cd" VARCHAR(7) NOT NULL
    ,"personal_name" VARCHAR(300) NOT NULL
    ,"username" VARCHAR(6) NOT NULL
    ,"staff_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;

CREATE TABLE IF NOT EXISTS "sample_schema"."ref_table"
(
     "staff_flg" CHAR(1)  DEFAULT '0'::bpchar SORTKEY ENCODE lzo 
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;
'''
parse = sqlparse.parse(line)
print([str(t) for t in parse[0].tokens if t.ttype is None][0])

Output: 
public.actor

但如果我想返回列名和数据类型,我可以使用哪个标记来打印这两个 DDL。

输出是这样的,[不完全相同:)]

table: public.actor

一一打印列名和数据类型(可能在for循环中)

column: actor_id
date type: integer
column: first_name
data type: character varying

【问题讨论】:

    标签: python sql-parser


    【解决方案1】:

    内联记录的代码

    import sqlparse
    
    line = '''
    CREATE TABLE public.actor (
        actor_id integer DEFAULT nextval('public.actor_actor_id_seq'::regclass) NOT NULL,
        first_name character varying(45) NOT NULL,
        last_name character varying(45) NOT NULL,
        last_update timestamp without time zone DEFAULT now() NOT NULL
    );
    
    CREATE TABLE public.category (
        category_id integer DEFAULT nextval('public.category_category_id_seq'::regclass) NOT NULL,
        name character varying(25) NOT NULL,
        last_update timestamp without time zone DEFAULT now() NOT NULL
    );
    
    CREATE TABLE IF NOT EXISTS "sample_schema"."sample_table"
    (
        "div_cd" VARCHAR(2) NOT NULL
        ,"div_name" VARCHAR(30) NOT NULL
        ,"org_cd" VARCHAR(8) NOT NULL
        ,"org_name" VARCHAR(60) NOT NULL
        ,"team_cd" VARCHAR(2) NOT NULL
        ,"team_name" VARCHAR(120) NOT NULL
        ,"personal_cd" VARCHAR(7) NOT NULL
        ,"personal_name" VARCHAR(300) NOT NULL
        ,"username" VARCHAR(6) NOT NULL
        ,"staff_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
        ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
    )
    DISTSTYLE EVEN
    ;
    
    CREATE TABLE IF NOT EXISTS "sample_schema"."ref_table"
    (
         "staff_flg" CHAR(1)  DEFAULT '0'::bpchar SORTKEY ENCODE lzo 
        ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
    )
    DISTSTYLE EVEN
    ;
    '''
    
    def get_table_name(tokens):
        for token in reversed(tokens):
            if token.ttype is None:
                return token.value
        return " "
    
    parse = sqlparse.parse(line)
    for stmt in parse:
        # Get all the tokens except whitespaces
        tokens = [t for t in sqlparse.sql.TokenList(stmt.tokens) if t.ttype != sqlparse.tokens.Whitespace]
        is_create_stmt = False
        for i, token in enumerate(tokens):
            # Is it a create statements ?
            if token.match(sqlparse.tokens.DDL, 'CREATE'):
                is_create_stmt = True
                continue
            
            # If it was a create statement and the current token starts with "("
            if is_create_stmt and token.value.startswith("("):
                # Get the table name by looking at the tokens in reverse order till you find
                # a token with None type
                print (f"table: {get_table_name(tokens[:i])}")
    
                # Now parse the columns
                txt = token.value
                columns = txt[1:txt.rfind(")")].replace("\n","").split(",")
                for column in columns:
                    c = ' '.join(column.split()).split()
                    c_name = c[0].replace('\"',"")
                    c_type = c[1]  # For condensed type information 
                    # OR 
                    #c_type = " ".join(c[1:]) # For detailed type information 
                    print (f"column: {c_name}")
                    print (f"date type: {c_type}")
                print ("---"*20)
                break
    

    输出:

    table: public.actor
    column: actor_id
    date type: integer
    column: first_name
    date type: character
    column: last_name
    date type: character
    column: last_update
    date type: timestamp
    ------------------------------------------------------------
    table: public.category
    column: category_id
    date type: integer
    column: name
    date type: character
    column: last_update
    date type: timestamp
    ------------------------------------------------------------
    table: "sample_schema"."sample_table"
    column: div_cd
    date type: VARCHAR(2)
    column: div_name
    date type: VARCHAR(30)
    column: org_cd
    date type: VARCHAR(8)
    column: org_name
    date type: VARCHAR(60)
    column: team_cd
    date type: VARCHAR(2)
    column: team_name
    date type: VARCHAR(120)
    column: personal_cd
    date type: VARCHAR(7)
    column: personal_name
    date type: VARCHAR(300)
    column: username
    date type: VARCHAR(6)
    column: staff_flg
    date type: CHAR(1)
    column: leader_flg
    date type: CHAR(1)
    ------------------------------------------------------------
    table: "sample_schema"."ref_table"
    column: staff_flg
    date type: CHAR(1)
    column: leader_flg
    date type: CHAR(1)
    ------------------------------------------------------------
    

    【讨论】:

    • 是否可以打印完整的数据类型?比如character varyingtimestamp without time zone
    • @Bhuvanesh 如代码中所述使用c_type = " ".join(c[1:]) # For detailed type information (取消注释该行)打印详细的类型信息
    • 是的,但是它在列名之后打印所有内容,SQLparser 是否有任何令牌来获取确切的数据类型?
    • 不,你必须对令牌进行字符串解析,我认为没有办法识别数据类型,因为解析器没有数据类型的语义。
    【解决方案2】:

    您可以使用 python-sqlparse 示例中的script 中的函数来提取数据:

    def extract_definitions(token_list):
        # assumes that token_list is a parenthesis
        definitions = []
        tmp = []
        par_level = 0
        for token in token_list.flatten():
            if token.is_whitespace:
                continue
            elif token.match(sqlparse.tokens.Punctuation, '('):
                par_level += 1
                continue
            if token.match(sqlparse.tokens.Punctuation, ')'):
                if par_level == 0:
                    break
                else:
                    par_level += 1
            elif token.match(sqlparse.tokens.Punctuation, ','):
                if tmp:
                    definitions.append(tmp)
                tmp = []
            else:
                tmp.append(token)
        if tmp:
            definitions.append(tmp)
        return definitions
    

    并使用 for 循环打印有关列名及其数据类型的信息:

    parsed = sqlparse.parse(line)[0]
    
    # extract the parenthesis which holds column definitions
    _, par = parsed.token_next_by(i=sqlparse.sql.Parenthesis)
    columns = extract_definitions(par)
    
    for column in columns:
        print(f"column: {column[0]}")
        print(f"data type: {' '.join(str(t) for t in column[1:])}")
    

    此代码产生以下输出:

    column: actor_id
    data type: integer DEFAULT nextval 'public.actor_actor_id_seq' :: regclass NOT NULL
    column: first_name
    data type: character varying 45 NOT NULL
    column: last_name
    data type: character varying 45 NOT NULL
    column: last_update
    data type: timestamp without time zone DEFAULT now NOT NULL
    

    它实际上比您想要的信息多一点。但是,这些字符串应该很容易“正则表达式”以仅提取基本数据类型。

    正如 cmets 中已经提到的,使用特殊信息(例如 DISTSTYLE)会导致解析器无法识别 sqlparse.sql.Parenthesis 实例并被报告为 bug。因此,需要在解析之前从 SQL 查询中删除此类信息。

    【讨论】:

    • 是的,我看到了,但是如果我有一些特殊信息,例如 DISTRIBUTION STYLE(DDL 查询中的最后一条 SQL 语句),它就不起作用,它会给出这个错误 AttributeError: 'NoneType' object has no attribute 'flatten'
    • 你是对的。这种情况需要从line 中删除此类特殊信息才能正常工作。注意这个错误has already been reported.
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 2012-06-16
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2015-09-06
    • 2013-11-05
    • 1970-01-01
    相关资源
    最近更新 更多