python SQL解析器获取列名和数据类型答案

【问题标题】：python SQL parser to get the column name and data typepython SQL解析器获取列名和数据类型
【发布时间】：2020-08-04 12:53:43
【问题描述】：

我正在使用 python SQL 解析器来获取表信息。我能够得到表名和模式名。

import sqlparse
line = '''
CREATE TABLE public.actor (
    actor_id integer DEFAULT nextval('public.actor_actor_id_seq'::regclass) NOT NULL,
    first_name character varying(45) NOT NULL,
    last_name character varying(45) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

CREATE TABLE public.category (
    category_id integer DEFAULT nextval('public.category_category_id_seq'::regclass) NOT NULL,
    name character varying(25) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

CREATE TABLE IF NOT EXISTS "sample_schema"."sample_table"
(
    "div_cd" VARCHAR(2) NOT NULL
    ,"div_name" VARCHAR(30) NOT NULL
    ,"org_cd" VARCHAR(8) NOT NULL
    ,"org_name" VARCHAR(60) NOT NULL
    ,"team_cd" VARCHAR(2) NOT NULL
    ,"team_name" VARCHAR(120) NOT NULL
    ,"personal_cd" VARCHAR(7) NOT NULL
    ,"personal_name" VARCHAR(300) NOT NULL
    ,"username" VARCHAR(6) NOT NULL
    ,"staff_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;

CREATE TABLE IF NOT EXISTS "sample_schema"."ref_table"
(
     "staff_flg" CHAR(1)  DEFAULT '0'::bpchar SORTKEY ENCODE lzo 
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;
'''
parse = sqlparse.parse(line)
print([str(t) for t in parse[0].tokens if t.ttype is None][0])

Output: 
public.actor

但如果我想返回列名和数据类型，我可以使用哪个标记来打印这两个 DDL。

输出是这样的，[不完全相同:)]

table: public.actor

一一打印列名和数据类型（可能在for循环中）

column: actor_id
date type: integer
column: first_name
data type: character varying

【问题讨论】：

标签： python sql-parser

【解决方案1】：

内联记录的代码

import sqlparse

line = '''
CREATE TABLE public.actor (
    actor_id integer DEFAULT nextval('public.actor_actor_id_seq'::regclass) NOT NULL,
    first_name character varying(45) NOT NULL,
    last_name character varying(45) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

CREATE TABLE public.category (
    category_id integer DEFAULT nextval('public.category_category_id_seq'::regclass) NOT NULL,
    name character varying(25) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);

CREATE TABLE IF NOT EXISTS "sample_schema"."sample_table"
(
    "div_cd" VARCHAR(2) NOT NULL
    ,"div_name" VARCHAR(30) NOT NULL
    ,"org_cd" VARCHAR(8) NOT NULL
    ,"org_name" VARCHAR(60) NOT NULL
    ,"team_cd" VARCHAR(2) NOT NULL
    ,"team_name" VARCHAR(120) NOT NULL
    ,"personal_cd" VARCHAR(7) NOT NULL
    ,"personal_name" VARCHAR(300) NOT NULL
    ,"username" VARCHAR(6) NOT NULL
    ,"staff_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;

CREATE TABLE IF NOT EXISTS "sample_schema"."ref_table"
(
     "staff_flg" CHAR(1)  DEFAULT '0'::bpchar SORTKEY ENCODE lzo 
    ,"leader_flg" CHAR(1)  DEFAULT '0'::bpchar ENCODE lzo
)
DISTSTYLE EVEN
;
'''

def get_table_name(tokens):
    for token in reversed(tokens):
        if token.ttype is None:
            return token.value
    return " "

parse = sqlparse.parse(line)
for stmt in parse:
    # Get all the tokens except whitespaces
    tokens = [t for t in sqlparse.sql.TokenList(stmt.tokens) if t.ttype != sqlparse.tokens.Whitespace]
    is_create_stmt = False
    for i, token in enumerate(tokens):
        # Is it a create statements ?
        if token.match(sqlparse.tokens.DDL, 'CREATE'):
            is_create_stmt = True
            continue
        
        # If it was a create statement and the current token starts with "("
        if is_create_stmt and token.value.startswith("("):
            # Get the table name by looking at the tokens in reverse order till you find
            # a token with None type
            print (f"table: {get_table_name(tokens[:i])}")

            # Now parse the columns
            txt = token.value
            columns = txt[1:txt.rfind(")")].replace("\n","").split(",")
            for column in columns:
                c = ' '.join(column.split()).split()
                c_name = c[0].replace('\"',"")
                c_type = c[1]  # For condensed type information 
                # OR 
                #c_type = " ".join(c[1:]) # For detailed type information 
                print (f"column: {c_name}")
                print (f"date type: {c_type}")
            print ("---"*20)
            break

输出：

table: public.actor
column: actor_id
date type: integer
column: first_name
date type: character
column: last_name
date type: character
column: last_update
date type: timestamp
------------------------------------------------------------
table: public.category
column: category_id
date type: integer
column: name
date type: character
column: last_update
date type: timestamp
------------------------------------------------------------
table: "sample_schema"."sample_table"
column: div_cd
date type: VARCHAR(2)
column: div_name
date type: VARCHAR(30)
column: org_cd
date type: VARCHAR(8)
column: org_name
date type: VARCHAR(60)
column: team_cd
date type: VARCHAR(2)
column: team_name
date type: VARCHAR(120)
column: personal_cd
date type: VARCHAR(7)
column: personal_name
date type: VARCHAR(300)
column: username
date type: VARCHAR(6)
column: staff_flg
date type: CHAR(1)
column: leader_flg
date type: CHAR(1)
------------------------------------------------------------
table: "sample_schema"."ref_table"
column: staff_flg
date type: CHAR(1)
column: leader_flg
date type: CHAR(1)
------------------------------------------------------------

【讨论】：

是否可以打印完整的数据类型？比如character varying，timestamp without time zone
@Bhuvanesh 如代码中所述使用c_type = " ".join(c[1:]) # For detailed type information （取消注释该行）打印详细的类型信息
是的，但是它在列名之后打印所有内容，SQLparser 是否有任何令牌来获取确切的数据类型？
不，你必须对令牌进行字符串解析，我认为没有办法识别数据类型，因为解析器没有数据类型的语义。

【解决方案2】：

您可以使用 python-sqlparse 示例中的script 中的函数来提取数据：

def extract_definitions(token_list):
    # assumes that token_list is a parenthesis
    definitions = []
    tmp = []
    par_level = 0
    for token in token_list.flatten():
        if token.is_whitespace:
            continue
        elif token.match(sqlparse.tokens.Punctuation, '('):
            par_level += 1
            continue
        if token.match(sqlparse.tokens.Punctuation, ')'):
            if par_level == 0:
                break
            else:
                par_level += 1
        elif token.match(sqlparse.tokens.Punctuation, ','):
            if tmp:
                definitions.append(tmp)
            tmp = []
        else:
            tmp.append(token)
    if tmp:
        definitions.append(tmp)
    return definitions

并使用 for 循环打印有关列名及其数据类型的信息：

parsed = sqlparse.parse(line)[0]

# extract the parenthesis which holds column definitions
_, par = parsed.token_next_by(i=sqlparse.sql.Parenthesis)
columns = extract_definitions(par)

for column in columns:
    print(f"column: {column[0]}")
    print(f"data type: {' '.join(str(t) for t in column[1:])}")

此代码产生以下输出：

column: actor_id
data type: integer DEFAULT nextval 'public.actor_actor_id_seq' :: regclass NOT NULL
column: first_name
data type: character varying 45 NOT NULL
column: last_name
data type: character varying 45 NOT NULL
column: last_update
data type: timestamp without time zone DEFAULT now NOT NULL

它实际上比您想要的信息多一点。但是，这些字符串应该很容易“正则表达式”以仅提取基本数据类型。

正如 cmets 中已经提到的，使用特殊信息（例如 DISTSTYLE）会导致解析器无法识别 sqlparse.sql.Parenthesis 实例并被报告为 bug。因此，需要在解析之前从 SQL 查询中删除此类信息。

【讨论】：

是的，我看到了，但是如果我有一些特殊信息，例如 DISTRIBUTION STYLE（DDL 查询中的最后一条 SQL 语句），它就不起作用，它会给出这个错误 AttributeError: 'NoneType' object has no attribute 'flatten'
你是对的。这种情况需要从line 中删除此类特殊信息才能正常工作。注意这个错误has already been reported.