【Question title】:Using "Attribute to Attribute" map in Hive XML SerDe
【Posted】:2017-05-24 20:06:27
【Question description】:

I have an XML document that looks like this:

<root>
 <unwanted>
  ...
 </unwanted>
 <wanted version="A">
  <unwanted2 type='1'>
   ...
  </unwanted2>
  <unwanted2 type='2'>
   ...
  </unwanted2>
  <unwanted2 type='3'>
   ...
  </unwanted2>
  <wanted2>
   <detail>
    <row date="Jan-17" price="100" inventory="50"/>
    <row date="Feb-17" price="101" inventory="40"/>
    <row date="Mar-17" price="102" inventory="30"/>
   </detail>
  </wanted2>
 </wanted>
 <wanted version="B">
  <unwanted2 type='1'>
   ...
  </unwanted2>
  <unwanted2 type='2'>
   ...
  </unwanted2>
  <unwanted2 type='3'>
   ...
  </unwanted2>
  <wanted2>
   <detail>
    <row date="Jan-17" price="200" inventory="60"/>
    <row date="Feb-17" price="201" inventory="70"/>
    <row date="Mar-17" price="202" inventory="80"/>
   </detail>
  </wanted2>
 </wanted>
</root>

I want to import the file into a Hive table, ideally in this format:

Version | Date   | Price | Inventory
A         Jan-17   100     50
A         Feb-17   101     40
A         Mar-17   102     30
B         Jan-17   200     60
B         Feb-17   201     70
B         Mar-17   202     80

For now, though, I would settle for importing it as a map of dates and prices:

version | spot_date
A         {Date: Jan-17, Price: 100, Inventory: 50}
A         {Date: Feb-17, ...}
A         {Date: Mar-17, ...}
B         {Date: Jan-17, ...}
B         {Date: Feb-17, ...}
B         {Date: Mar-17, ...}

I am trying to use the XML SerDe for Hive (XmlSerDe) with its "attribute to attribute" map feature.

My CREATE TABLE statement looks like this:

CREATE EXTERNAL TABLE ppa_test(
    version        STRING, 
    spot_date      MAP<STRING,STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
    "column.xpath.version"="/wanted/@version",
    "column.xpath.spot_date"="/wanted/wanted2/detail/row",
    "xml.map.specification.row"="date->@date"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<wanted ",
"xmlinput.end"="</wanted>"
);

But when I load my data, I get:

version | spot_date
A         {"row":"Mar-17"}
B         {"row":"Mar-17"}

If I instead change the xml.map.specification entry to:

"xml.map.specification.row"="@date->@price"

I can read every XML row, but they all end up in the same Hive table row, and I would prefer to use the attribute names as keys:

Version | spot_date
A         {"Mar-17":"102", "Feb-17":"101", "Jan-17":"100"}
B         {"Mar-17":"202", "Feb-17":"201", "Jan-17":"200"}
  1. How do I record each XML row node as its own Hive record?
  2. How do I use the attribute names (or custom strings) as the keys?

Edit

So changing spot_date from MAP<STRING,STRING> to...

CREATE EXTERNAL TABLE ppa_test(
    scenario    STRING, 
    spot_date   array<struct<
        date:      string, 
        price:     string, 
        inventory: string
    >>
)...

gives me an array of objects:

Version | spot_date
A         [{date: Jan-17, price: 100, inventory: 50},
           {date: Feb-17, price: 101, inventory: 40},
           {date: Mar-17, price: 102, inventory: 30}]
B         [{date: Jan-17, ... ]

That takes care of #2 above, but I am still not sure about #1.

【Question comments】:

    Tags: xml hadoop xpath hive hiveql


    【Solution 1】:

    You can explode the array of structs created for #2 to get #1.

    CREATE EXTERNAL TABLE ppa_test(
        scenario    STRING, 
        spot_date   ARRAY<STRUCT<spotdates: struct<
            date:      string, 
            price:     string, 
            inventory: string
        >>>
    )
    

    You can use a lateral view for this:

    DROP TABLE IF EXISTS ppa_test_exploded;
    CREATE TABLE ppa_test_exploded as
     SELECT scenario,
      SD.spotdates.date as date,
      SD.spotdates.price as price,
      SD.spotdates.inventory as inventory
      FROM ppa_test
      LATERAL VIEW EXPLODE(spot_date) exploded as SD;
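
The same idea can be sketched as a plain query against the flat struct layout from the question's edit (no `spotdates` wrapper), without materializing a new table. This is an untested sketch: it assumes `spot_date` is `ARRAY<STRUCT<date, price, inventory>>`, the aliases `spot_dt`, `exploded`, `sd`, and `r` are illustrative names, and the backticks around `date` are a precaution because DATE is a reserved word in newer Hive versions.

```sql
-- Variant 1: explode() yields one struct per output row; fields are read from the alias.
-- Assumption: spot_date is ARRAY<STRUCT<date:string, price:string, inventory:string>>.
SELECT
    t.scenario                 AS version,
    sd.`date`                  AS spot_dt,
    CAST(sd.price AS INT)      AS price,
    CAST(sd.inventory AS INT)  AS inventory
FROM ppa_test t
LATERAL VIEW explode(t.spot_date) exploded AS sd;

-- Variant 2: inline() expands an array of structs straight into columns,
-- one column per struct field, in declaration order.
SELECT
    t.scenario  AS version,
    r.spot_dt,
    r.price,
    r.inventory
FROM ppa_test t
LATERAL VIEW inline(t.spot_date) r AS spot_dt, price, inventory;
```

Either variant should produce one row per `<row>` element. The casts are optional and only make sense if every price and inventory value is numeric.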
    

    Hope this helps.
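
For the MAP<STRING,STRING> version of the table from the question (the one using `"xml.map.specification.row"="@date->@price"`), `explode` also accepts a map and returns one row per entry with a key column and a value column. The sketch below is untested and assumes that table definition; `m`, `spot_dt`, and `price` are illustrative aliases. Note that the inventory attribute is not available this way, because the map only holds date and price.

```sql
-- Assumption: ppa_test has columns (version STRING, spot_date MAP<STRING,STRING>),
-- where each map entry is date -> price.
SELECT
    t.version,
    m.spot_dt,   -- map key   (e.g. Jan-17)
    m.price      -- map value (e.g. 100)
FROM ppa_test t
LATERAL VIEW explode(t.spot_date) m AS spot_dt, price;
```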

    【Comments】:
