【Question title】: How to insert data into a Hive table from arrays returned by XPath
【Posted】: 2017-02-25 00:34:58
【Question】:

I have a Hive query that uses XPath to return a set of arrays from XML. I want to insert the elements of these arrays into a Hive table.

The XML content in the hivexml table is:

<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>

The query that returns the set of arrays is:

select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

The output of the above query (a set of arrays) is:

["1","2","3","4","5"] [".net","html","css","php","c"] ["244006","602809","434937","1009113","236386"] ["3624959","3673183","3644670","3624936","3624961"] ["3607476","3673182","3644669","3607050","3607013"]

I want to insert these values into a Hive table like this:

1    .net    244006     3624959    3607476
2    html    602809     3673183    3673182
3    css     434937     3644670    3644669
4    php     1009113    3624936    3607050
5    c       236386     3624961    3607013

If I wrap the above select query in an insert:

insert into newhivexml select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

Then I get an error:

NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToInteger with (array). Possible choices: FUNC(bigint) FUNC(boolean) FUNC(decimal(38,18)) FUNC(double) FUNC(float) FUNC(smallint) FUNC(string) FUNC(struct) FUNC(timestamp) FUNC(tinyint) FUNC(void)

I don't think we can insert directly like this, so I must be missing something. Can anyone tell me how to do this, that is, how to insert the values from these arrays into the table?

【Comments】:

  • Just to make sure: the XML is only one column in a row, not the entire data, right?

Tags: xml powershell hadoop xpath hive


【Solution 1】:

xpath_... (str,concat('/tag/row[',pe.pos+1,']/@...'))

create table hivexml (str string);

insert into hivexml values ('<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>');

select  xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Id'           )) as Id  
       ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName'      )) as TagName
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Count'        )) as Count
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId')) as ExcerptPostId
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId'   )) as WikiPostId

from    hivexml
        lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
;

+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
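
For readers who want to see the idea outside Hive, here is a small Python sketch (not Hive code) of what the query above does: take the position of every row, then address each row as `row[pos + 1]` and read its attributes as scalars. The function name `rows_by_position` and the shortened three-row copy of the XML are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (three of the six rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)


def rows_by_position(xml_text):
    """Return one tuple per <row>, addressing each row as row[pos + 1]."""
    root = ET.fromstring(xml_text)  # root is the <tag> element
    # Plays the role of xpath(str,'/tag/row/@Id'): one entry per row.
    ids = [row.get("Id") for row in root.findall("row")]
    result = []
    # Plays the role of posexplode: pos is 0-based.
    for pos, _ in enumerate(ids):
        # XPath positions are 1-based, hence pos + 1.
        row = root.find(f"row[{pos + 1}]")
        result.append((
            int(row.get("Id")),
            row.get("TagName"),
            int(row.get("Count")),
            int(row.get("ExcerptPostId")),
            int(row.get("WikiPostId")),
        ))
    return result


for record in rows_by_position(XML):
    print(record)
```

Because every value is now a scalar rather than an array, each tuple matches the column types of the target table, which is what the failed insert in the question was missing.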

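【Comments】:

  • Thanks, it works! One small glitch, though: I can't put line breaks in the query. It fails with the error "The syntax of the command is incorrect." If I put everything on one line, it works!

【Solution 2】: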

xpath (str,concat('/tag/row[',pe.pos+1,']/@*'))

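This is a very neat way to extract all the values of an element together.
To my surprise, the attributes do not seem to come back in the order they appear in the XML, but in alphabetical order of their names:
@Count,@ExcerptPostId,@Id,@TagName,@WikiPostId

Unfortunately, unless I know that the alphabetical attribute order is guaranteed, I can't consider this a legitimate solution.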

select  xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values

from    hivexml
        lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
;

--

["244006","3624959","1",".net","3607476"]
["602809","3673183","2","html","3673182"]
["1274350","3624960","3","javascript","3607052"]
["434937","3644670","4","css","3644669"]
["1009113","3624936","5","php","3607050"]
["236386","3624961","8","c","3607013"]

select  row_values[2] as Id
       ,row_values[3] as TagName
       ,row_values[0] as Count    
       ,row_values[1] as ExcerptPostId
       ,row_values[4] as WikiPostId

from   (select  xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values

        from    hivexml
                lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
        ) x
;

+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
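
The ordering concern can be sidestepped by looking attributes up by name instead of by index, which is effectively what Solution 1 does. As a rough illustration in Python (not Hive), the sketch below returns the same rows no matter how the attributes are ordered inside each element. `rows_by_name` and the reordered sample `XML_REORDERED` are hypothetical names and data made up for this example.

```python
import xml.etree.ElementTree as ET

# One row from the question, as written there.
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '</tag>'
)

# Hypothetical variant: same values, attributes written in a different order.
XML_REORDERED = (
    '<tag>'
    '<row Count="244006" WikiPostId="3607476" Id="1" ExcerptPostId="3624959" TagName=".net" />'
    '</tag>'
)

COLUMNS = ("Id", "TagName", "Count", "ExcerptPostId", "WikiPostId")


def rows_by_name(xml_text, columns=COLUMNS):
    """Build each output row by attribute name, so attribute order is irrelevant."""
    root = ET.fromstring(xml_text)
    return [
        tuple(row.attrib[name] for name in columns)
        for row in root.findall("row")
    ]


print(rows_by_name(XML))
print(rows_by_name(XML_REORDERED))
```

Both calls print the same list, which is the property the index-based `row_values[2]`, `row_values[3]`, ... lookup cannot promise.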

【Comments】:

  • You are a true Hive guru. I never imagined that something like this could be done in a single Hive query. +1 for each solution.

【Solution 3】:

split + str_to_map

select  vals["Id"]              as Id
       ,vals["TagName"]         as TagName
       ,vals["Count"]           as Count    
       ,vals["ExcerptPostId"]   as ExcerptPostId
       ,vals["WikiPostId"]      as WikiPostId

from   (select  str_to_map(e.val,' ','=') as vals

        from    hivexml 
                lateral view  posexplode(split(translate(str,'"',''),'/?><row')) e

        where   e.pos <> 0
        ) x
;

+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
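
To make the string manipulation easier to follow, here is a rough Python imitation (not Hive code) of the three steps: strip the double quotes, split on the regular expression `/?><row`, and turn each piece into a key/value map. The helper `str_to_map` below is only a stand-in I wrote for illustration: unlike Hive's function, it silently drops tokens that contain no `=`.

```python
import re

# Shortened copy of the XML from the question (first and last rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)


def str_to_map(text, pair_delim, kv_delim):
    """Stand-in for Hive's str_to_map; tokens without kv_delim are skipped."""
    pairs = (
        token.split(kv_delim, 1)
        for token in text.split(pair_delim)
        if kv_delim in token
    )
    return {key: value for key, value in pairs}


def rows_from_split(xml_text):
    """Parse rows with plain string operations, without an XML parser."""
    unquoted = xml_text.replace('"', '')       # translate(str,'"','')
    pieces = re.split(r'/?><row', unquoted)    # split(..., '/?><row')
    # pieces[0] is '<tag', which the query filters out with e.pos <> 0.
    return [str_to_map(piece, ' ', '=') for piece in pieces[1:]]


for row in rows_from_split(XML):
    print(row)
```

Note that this approach treats the XML as plain text, so it only works while attribute values contain neither spaces nor `=`. A hypothetical tag name such as `visual studio` would come back as just `visual`.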

【Comments】:

【Solution 4】:

If the data is XML documents

The XML SerDe can be downloaded from https://github.com/01org/graphbuilder/blob/master/src/com/intel/hadoop/graphbuilder/preprocess/inputformat/XMLInputFormat.java

    add jar /home/cloudera/hivexmlserde-1.0.5.3.jar;
    
    create external table hivexml_ext
    (
        Id              string
       ,TagName         string
       ,Count           string
       ,ExcerptPostId   string
       ,WikiPostId      string
    )
    row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties 
    (
        "column.xpath.Id"            = "/row/@Id"
       ,"column.xpath.TagName"       = "/row/@TagName"
       ,"column.xpath.Count"         = "/row/@Count"
       ,"column.xpath.ExcerptPostId" = "/row/@ExcerptPostId"
       ,"column.xpath.WikiPostId"    = "/row/@WikiPostId"
    )
    stored as
    inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    outputformat    'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    location        '/user/hive/warehouse/hivexml'
    tblproperties 
    (
        "xmlinput.start" = "<row"
       ,"xmlinput.end"   = "/>"
    )
    ;
    
    select * from hivexml_ext as x
    ;
    

    +------+------------+---------+-----------------+--------------+
    | x.id | x.tagname  | x.count | x.excerptpostid | x.wikipostid |
    +------+------------+---------+-----------------+--------------+
    |    1 | .net       |  244006 |         3624959 |      3607476 |
    |    2 | html       |  602809 |         3673183 |      3673182 |
    |    3 | javascript | 1274350 |         3624960 |      3607052 |
    |    4 | css        |  434937 |         3644670 |      3644669 |
    |    5 | php        | 1009113 |         3624936 |      3607050 |
    |    8 | c          |  236386 |         3624961 |      3607013 |
    +------+------------+---------+-----------------+--------------+
    

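【Comments】:

• I don't have Java on my machine. If I copy the code above as is, will it run in PowerShell? I'm worried about the top line that adds the jar file.
• After downloading the jar, the add jar command should be executed in Hive. Put the jar anywhere you like and change the path accordingly.
• Should the jar file be on my local machine or in Azure? I put it on my local machine, but it says the file does not exist.
• I'm not really into this stuff. Try Google or open a new SO post.

【Solution 5】:

The problem is that each XPath function call returns all matching results for its own expression, and the resulting independent arrays are never joined together. If it suits you, you can use Pig, since its batch model lets you break the process down into individual steps: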

    REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;
    DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
    
    A = LOAD '/tmp/text.xml' using org.apache.pig.piggybank.storage.XMLLoader('tag') as (x:chararray);
    
    B = FOREACH A GENERATE XPathAll(x, 'row/@Id',false,false).$0,
        XPathAll(x, 'row/@TagName',false,false).$0,
        XPathAll(x, 'row/@Count',false,false).$0,
        XPathAll(x, 'row/@ExcerptPostId',false,false).$0,
        XPathAll(x, 'row/@WikiPostId',false,false).$0;
    
    DUMP B;
    
    (1,.net,244006,3624959,3607476)
    (2,html,602809,3673183,3673182)
    (3,javascript,1274350,3624960,3607052)
    (4,css,434937,3644670,3644669)
    (5,php,1009113,3624936,3607050)
    (8,c,236386,3624961,3607013)
    
    STORE B INTO 'YourTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
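
The "independent arrays" point is easy to see in a few lines of Python (an illustration only, not Pig or Hive code). Each attribute query yields its own column array, and turning them into table rows means pairing the elements up by position. `column_arrays` and `to_rows` are names I made up for this sketch, and the XML is a shortened copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (three of the six rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)

NAMES = ("Id", "TagName", "Count", "ExcerptPostId", "WikiPostId")


def column_arrays(xml_text, names=NAMES):
    """One array per attribute, like five separate /tag/row/@Name queries."""
    rows = ET.fromstring(xml_text).findall("row")
    # Rows lacking an attribute contribute nothing, as with an XPath match.
    return [[row.get(name) for row in rows if name in row.attrib] for name in names]


def to_rows(arrays):
    """Join the independent arrays position by position."""
    return list(zip(*arrays))


arrays = column_arrays(XML)
print(arrays[0])            # the Id column on its own
for record in to_rows(arrays):
    print(record)
```

Pairing by position is only safe when every row carries every attribute. If one row lacked, say, `Count`, that array would be shorter and the values after the gap would line up with the wrong rows.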
    

【Comments】:
