【Question title】: AWS Glue Job writes Null to Redshift
【Posted】: 2021-12-06 12:56:07
【Question description】:

I have multiple JSON files in an S3 bucket folder. Each file is an array/list of JSON objects with the same schema as the samples below.

file1

[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":53501.08,"coin1hrChange":-0.6,"coin24hrChange":-6.0,"coin7dChange":-9.2,"coin24hrVol":38266934579,"coinMarketCap":1012650219321,"fetchTime":"2021-12-03 23:55:42.654921","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":21934372.7784000002},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4225.28,"coin1hrChange":-0.3,"coin24hrChange":-7.2,"coin7dChange":-6.4,"coin24hrVol":27395766224,"coinMarketCap":502376237337,"fetchTime":"2021-12-03 23:55:42.655698","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":1732280.2944},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":593.95,"coin1hrChange":-0.7,"coin24hrChange":-4.9,"coin7dChange":-6.9,"coin24hrVol":2379210538,"coinMarketCap":100022794436,"fetchTime":"2021-12-03 23:55:42.656393","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":243507.621}]

file2

[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":52936.1,"coin1hrChange":-1.5,"coin24hrChange":-6.5,"coin7dChange":-1.7,"coin24hrVol":38241025550,"coinMarketCap":998999157967,"fetchTime":"2021-12-04 02:33:23.182164","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":21702742.2780000009},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4159.85,"coin1hrChange":-1.4,"coin24hrChange":-8.1,"coin7dChange":2.8,"coin24hrVol":28661534477,"coinMarketCap":493429600914,"fetchTime":"2021-12-04 02:33:23.182785","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":1705455.3030000003},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":582.32,"coin1hrChange":-1.9,"coin24hrChange":-5.4,"coin7dChange":-0.6,"coin24hrVol":1059743631,"coinMarketCap":97824378011,"fetchTime":"2021-12-04 02:33:23.183415","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":238739.5536}]

file3

[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":49375.27,"coin1hrChange":-0.7,"coin24hrChange":4.3,"coin7dChange":-9.5,"coin24hrVol":35860857801.0,"coinMarketCap":932932346783,"fetchTime":"2021-12-05 14:34:49.339803","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":20281615.4014549591},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4218.99,"coin1hrChange":-0.7,"coin24hrChange":7.1,"coin7dChange":3.3,"coin24hrVol":27778808883.0,"coinMarketCap":500688046117,"fetchTime":"2021-12-05 14:34:49.340495","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":1733011.9422655201},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":574.23,"coin1hrChange":-0.5,"coin24hrChange":5.2,"coin7dChange":-4.0,"coin24hrVol":2265817636.0,"coinMarketCap":96576091895,"fetchTime":"2021-12-05 14:34:49.341177","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":235873.38382104}]

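Using an AWS Glue crawler with a classifier that splits the JSON objects on the path $[*], I have separated the records, and I can confirm that the number of records in the Data Catalog matches the number of records in the files. However, when I push the data to Redshift, some columns show up as null. I can also share my Glue script if necessary.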

【Question discussion】:

    Tags: amazon-redshift aws-glue-data-catalog


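    【Solution 1】:

    I found the problem in the dataset: the DataFrame was inferring different data types, int64 and float64, for the columns. When Glue creates the table in Redshift, it creates the numeric columns as double precision (float64), so the integer records were not converted correctly in Redshift.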
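The mismatch is visible in the sample files themselves: coin24hrVol is written as an integer in file1 and as a float in file3. A minimal sketch for spotting such columns, using only the standard library and two abbreviated records copied from the question (the helper name column_types is hypothetical, not part of Glue or pandas):

```python
import json
from collections import defaultdict

# Abbreviated records copied from file1 and file3 in the question.
file1 = '[{"coinId":"bitcoin","coinPrice":53501.08,"coin24hrVol":38266934579}]'
file3 = '[{"coinId":"bitcoin","coinPrice":49375.27,"coin24hrVol":35860857801.0}]'


def column_types(*json_docs):
    """Map each key to the set of Python type names seen across all records."""
    seen = defaultdict(set)
    for doc in json_docs:
        for record in json.loads(doc):
            for key, value in record.items():
                seen[key].add(type(value).__name__)
    return dict(seen)


types = column_types(file1, file3)
# Columns whose values do not share a single type across the files.
mixed = sorted(key for key, names in types.items() if len(names) > 1)
print(mixed)  # ['coin24hrVol']
```

Any column reported here is a candidate for an explicit cast before the files are written to S3.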

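    1. I manually specified the column types in the Pandas DataFrame using the .astype() function.
    2. I dropped the table in Redshift and also deleted the table from the Data Catalog database.
    3. I re-crawled the database and re-ran the job.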

    Now every data point shows up correctly in Redshift.
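Step 1 might look roughly like the sketch below. The answer does not show the actual script, so the choice of columns and target types here is an assumption based on the sample data, and enforce_schema and NUMERIC_SCHEMA are hypothetical names. It requires pandas.

```python
import pandas as pd

# Abbreviated records from the question: coin24hrVol is an int on one day
# and a float on another, so pandas infers int64 and float64 respectively.
day1 = pd.DataFrame([{"coinId": "bitcoin", "coinRank": 1, "coin24hrVol": 38266934579}])
day3 = pd.DataFrame([{"coinId": "bitcoin", "coinRank": 1, "coin24hrVol": 35860857801.0}])

# Assumed target types; adjust to the real columns of the dataset.
NUMERIC_SCHEMA = {"coinRank": "int64", "coin24hrVol": "float64"}


def enforce_schema(df, schema=NUMERIC_SCHEMA):
    """Return a copy of df with the listed columns cast to fixed dtypes."""
    return df.astype(schema)


day1_fixed = enforce_schema(day1)
day3_fixed = enforce_schema(day3)

# Both frames now agree on every column type before they are serialized.
print(day1_fixed.dtypes.equals(day3_fixed.dtypes))
```

Casting every frame with the same mapping before writing it out keeps the type of each column identical across files, which is what the crawler and the Redshift table creation then see.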

    【Comments】:
