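How to insert data into a Hive table from arrays returned by XPath

【Question title】: How to insert data into hive table from arrays returned by XPath
【Posted】: 2017-02-25 00:34:58
【Question description】:

I have a Hive query that uses XPath to return a set of arrays from XML. I want to insert the elements of these arrays into a Hive table.

The XML content in the hivexml table is: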

<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>

The query that returns the set of arrays is:

select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

And the output of the above query (a set of arrays) is:

["1","2","3","4","5"]   [".net","html","css","php","c"]   ["244006","602809","434937","1009113","236386"]   ["3624959","3673183","3644670","3624936","3624961"]   ["3607476","3673182","3644669","3607050","3607013"]

I want to insert these values into a Hive table like this:

1    .net    244006     3624959    3607476
2    html    602809     3673183    3673182
3    css     434937     3644670    3644669
4    php     1009113    3624936    3607050
5    c       236386     3624961    3607013

If I run an insert with the above select query:

insert into newhivexml select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

then I get an error:

NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToInteger with (array). Possible choices: FUNC(bigint) FUNC(boolean) FUNC(decimal(38,18)) FUNC(double) FUNC(float) FUNC(smallint) FUNC(string) FUNC(struct) FUNC(timestamp) FUNC(tinyint) FUNC(void)

I think we cannot insert directly like this and that I am missing something here. Can anyone tell me how to do this, that is, how to insert these values from the arrays into the table?

【Question comments】:

Just to make sure: the XML is only one column in the row, not the whole data, right?

Tags: xml powershell hadoop xpath hive

【Solution 1】:

xpath_... (str,concat('/tag/row[',pe.pos+1,']/@...))

create table hivexml (str string);

insert into hivexml values ('<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>');

select  xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Id'           )) as Id  
       ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName'      )) as TagName
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Count'        )) as Count
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId')) as ExcerptPostId
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId'   )) as WikiPostId

from    hivexml
        lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
;

+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
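
To finish what the question actually asks for, the same select can presumably be wrapped in an insert. This is only a sketch: the question never shows the DDL of newhivexml, so the column names and types below are assumptions (the UDFToInteger error suggests that at least the first column is an int).

```sql
-- Assumed target table; the question does not show its definition.
-- `Count` is backtick-quoted only as a precaution, since it is also a function name.
create table if not exists newhivexml
(
    Id              int
   ,TagName         string
   ,`Count`         int
   ,ExcerptPostId   int
   ,WikiPostId      int
);

-- Same query as above: one output row per <row> element, scalar values only.
insert into table newhivexml
select  xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Id'           ))
       ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName'      ))
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@Count'        ))
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId'))
       ,xpath_int    (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId'   ))

from    hivexml
        lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
;
```

Each select expression is now a scalar instead of an array, so the conversion to the target column types that failed in the question should no longer be an issue.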

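【Comments】:

Thanks, it worked! One small glitch, though: we cannot put line breaks in the query. It fails with the error "The syntax of the command is incorrect." If I put everything on one line, it works!

【Solution 2】: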

xpath (str,concat('/tag/row[',pe.pos+1,']/@*'))

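This is a very neat way to extract all the values of an element together.
To my surprise, the attributes do not seem to come back in the order in which they appear in the XML, but in the alphabetical order of their names:
@Count,@ExcerptPostId,@Id,@TagName,@WikiPostId

Unfortunately, unless I know that the alphabetical attribute order is guaranteed, I cannot consider this a legitimate solution.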

select  xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values

from    hivexml
        lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
;

--

["244006","3624959","1",".net","3607476"]
["602809","3673183","2","html","3673182"]
["1274350","3624960","3","javascript","3607052"]
["434937","3644670","4","css","3644669"]
["1009113","3624936","5","php","3607050"]
["236386","3624961","8","c","3607013"]

select  row_values[2] as Id
       ,row_values[3] as TagName
       ,row_values[0] as Count    
       ,row_values[1] as ExcerptPostId
       ,row_values[4] as WikiPostId

from   (select  xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values

        from    hivexml
                lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
        ) x
;

+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
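
Given the ordering concern raised above, one way to gain some confidence on a particular data set is to compare the positional lookup against a lookup by attribute name. The following is a minimal sketch, not a proof: the aliases id_by_name and tagname_by_name and the expected size of 5 are assumptions based on the sample data, and a clean result only says something about the rows that were checked.

```sql
-- Counts <row> elements where the positional guess (index 2 = Id, index 3 = TagName)
-- disagrees with a lookup by attribute name, or where the number of attributes is not 5.
select  count(*) as mismatches

from   (select  xpath        (str,concat('/tag/row[',pe.pos+1,']/@*'      )) as row_values
               ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@Id'     )) as id_by_name
               ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName')) as tagname_by_name

        from    hivexml
                lateral view  posexplode (xpath(str,'/tag/row/@Id')) pe
        ) x

where   size(row_values) <> 5
   or   row_values[2]    <> id_by_name
   or   row_values[3]    <> tagname_by_name
;
```

If this returns anything other than 0, the positional mapping is not safe for that data and the by-name variant from Solution 1 is the better choice.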

【Comments】:

You are a true Hive master. I never imagined something like this could be done in Hive in a single query. +1 for each solution.

【Solution 3】:

split + str_to_map

select  vals["Id"]              as Id
       ,vals["TagName"]         as TagName
       ,vals["Count"]           as Count    
       ,vals["ExcerptPostId"]   as ExcerptPostId
       ,vals["WikiPostId"]      as WikiPostId

from   (select  str_to_map(e.val,' ','=') as vals

        from    hivexml 
                lateral view  posexplode(split(translate(str,'"',''),'/?><row')) e

        where   e.pos <> 0
        ) x
;

+----+------------+---------+---------------+------------+
| id |  tagname   |  count  | excerptpostid | wikipostid |
+----+------------+---------+---------------+------------+
|  1 | .net       |  244006 |       3624959 |    3607476 |
|  2 | html       |  602809 |       3673183 |    3673182 |
|  3 | javascript | 1274350 |       3624960 |    3607052 |
|  4 | css        |  434937 |       3644670 |    3644669 |
|  5 | php        | 1009113 |       3624936 |    3607050 |
|  8 | c          |  236386 |       3624961 |    3607013 |
+----+------------+---------+---------------+------------+
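
To see why this works, it can help to look at the intermediate values before they are turned into columns. The query below is a sketch for inspection only and reuses the expressions from the answer unchanged: translate removes the double quotes, split cuts the string on the regular expression /?><row, and str_to_map breaks each piece into key=value pairs on spaces.

```sql
-- Shows each piece produced by the split, together with the map built from it.
-- pos = 0 is the leading '<tag' fragment, which the answer filters out with e.pos <> 0.
select  e.pos
       ,e.val
       ,str_to_map(e.val,' ','=') as vals

from    hivexml
        lateral view  posexplode(split(translate(str,'"',''),'/?><row')) e
;
```

Because the pieces are split on spaces and on =, an attribute value that itself contains a space or an equals sign would not survive this approach intact.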

【Comments】:

【Solution 4】:

If the data consists of XML documents

The XML SerDe can be downloaded from https://github.com/01org/graphbuilder/blob/master/src/com/intel/hadoop/graphbuilder/preprocess/inputformat/XMLInputFormat.java

add jar /home/cloudera/hivexmlserde-1.0.5.3.jar;

create external table hivexml_ext
(
    Id              string
   ,TagName         string
   ,Count           string
   ,ExcerptPostId   string
   ,WikiPostId      string
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties 
(
    "column.xpath.Id"            = "/row/@Id"
   ,"column.xpath.TagName"       = "/row/@TagName"
   ,"column.xpath.Count"         = "/row/@Count"
   ,"column.xpath.ExcerptPostId" = "/row/@ExcerptPostId"
   ,"column.xpath.WikiPostId"    = "/row/@WikiPostId"
)
stored as
inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat    'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location        '/user/hive/warehouse/hivexml'
tblproperties 
(
    "xmlinput.start" = "<row"
   ,"xmlinput.end"   = "/>"
)
;

select * from hivexml_ext as x
;

+------+------------+---------+-----------------+--------------+
| x.id | x.tagname  | x.count | x.excerptpostid | x.wikipostid |
+------+------------+---------+-----------------+--------------+
|    1 | .net       |  244006 |         3624959 |      3607476 |
|    2 | html       |  602809 |         3673183 |      3673182 |
|    3 | javascript | 1274350 |         3624960 |      3607052 |
|    4 | css        |  434937 |         3644670 |      3644669 |
|    5 | php        | 1009113 |         3624936 |      3607050 |
|    8 | c          |  236386 |         3624961 |      3607013 |
+------+------------+---------+-----------------+--------------+
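
The external table exposes every attribute as a string. If the values should end up in a typed table such as the newhivexml table from the question, casting on the way in is one option. This is a sketch under assumptions: the question does not show the DDL of newhivexml, so the column layout (int, string, int, int, int) is a guess.

```sql
-- Assumes newhivexml already exists with columns (int, string, int, int, int).
-- `Count` is backtick-quoted only as a precaution, since it is also a function name.
insert into table newhivexml
select  cast(Id            as int)
       ,TagName
       ,cast(`Count`       as int)
       ,cast(ExcerptPostId as int)
       ,cast(WikiPostId    as int)

from    hivexml_ext
;
```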

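【Comments】:

I don't have Java on my computer. If I copy the above code as it is, will it run in PowerShell? I am worried about the top line that adds the jar file.
After downloading the jar, the add jar command should be executed in Hive. Put the jar anywhere you like and change the path accordingly.
Should the jar file be on my local machine or in Azure? I have put it on my local machine, but it says the file does not exist.
I'm not really familiar with these things. Try Google or open a new SO post.

【Solution 5】:

The problem is that the XPath function returns all matching results for each request as independent arrays, without joining them. If it suits you, you can use Pig, since its batch model lets you break the process down into individual steps: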

REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

A = LOAD '/tmp/text.xml' using org.apache.pig.piggybank.storage.XMLLoader('tag') as (x:chararray);

B = FOREACH A GENERATE XPathAll(x, 'row/@Id',false,false).$0,
    XPathAll(x, 'row/@TagName',false,false).$0,
    XPathAll(x, 'row/@Count',false,false).$0,
    XPathAll(x, 'row/@ExcerptPostId',false,false).$0,
    XPathAll(x, 'row/@WikiPostId',false,false).$0;

DUMP B;

(1,.net,244006,3624959,3607476)
(2,html,602809,3673183,3673182)
(3,javascript,1274350,3624960,3607052)
(4,css,434937,3644670,3644669)
(5,php,1009113,3624936,3607050)
(8,c,236386,3624961,3607013)

STORE B INTO 'YourTable' USING org.apache.hive.hcatalog.pig.HCatStorer();

【Comments】: