【Question title】: BigQuery join and UDF
【Posted】: 2017-05-17 21:02:05
【Question】:

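How do I join two tables in a SELECT statement that also uses a UDF? I keep the SQL query and the UDF in two separate files and invoke them through the bq command-line tool. When I run it, however, I get the following error:

BigQuery error in query operation: Error processing job '[projectID]:bqjob_[error_number]': Table name cannot be resolved: dataset name is missing.

Note that I am logged in to the correct project via gcloud auth. My SQL statement: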

SELECT
  substr(date,1,6) as date,
  device,
  channelGroup,
  COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
  COUNT(DISTINCT fullVisitorId) AS users,
FROM
  defaultChannelGroup(
    SELECT
      a.date,
      a.device.deviceCategory AS device,
      b.hits.page.pagePath AS page,
      a.fullVisitorId,
      a.visitId,
      a.trafficSource.source AS trafficSourceSource,
      a.trafficSource.medium AS trafficSourceMedium,
      a.trafficSource.campaign AS trafficSourceCampaign
    FROM FLATTEN(
      SELECT date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
    ,hits) as a
    LEFT JOIN FLATTEN(
      SELECT hits.page.pagePath,hits.time,visitID,fullVisitorId
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
      WHERE
        hits.time = 0
        and trafficSource.medium = 'organic'
    ,hits) as b
    ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
  )
GROUP BY
  date,
  device,
  channelGroup
ORDER BY sessions DESC

I did, of course, replace datasetname with the real dataset name. Here is the UDF, which works with another query:

function defaultChannelGroup(row, emit)
{
  function output(channelGroup) {
    emit({channelGroup:channelGroup,
      fullVisitorId: row.fullVisitorId, 
      visitId: row.visitId,
      device: row.device,
      date: row.date
      });
  }
  computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',
  ['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
  //['device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
  [{'name': 'channelGroup', 'type': 'string'},
  {'name': 'fullVisitorId', 'type': 'string'},
  {'name': 'visitId', 'type': 'integer'},
  {'name': 'device', 'type': 'string'},
  {'name': 'date', 'type': 'string'}
],
  defaultChannelGroup
);

【Comments】:

  • I can't reproduce this. (If you also post the full job ID, someone on the BigQuery team can look at the logs.)
  • Thanks @FelipeHoffa. I re-ran it this morning with the following command: bq query --udf_resource=Desktop/bq.js "$(cat Desktop/bq-sd-mkt-channels.sql)" and got the same error message, namely: bqjob_r324a276c6f5130bc_000001596949a59f_1': Table name cannot be resolved: dataset name is missing

Tags: sql join google-bigquery udf


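【Solution 1】:

The SELECT statement inside each FLATTEN call needs its own pair of parentheses, i.e. FLATTEN((SELECT ...), hits) rather than FLATTEN(SELECT ..., hits). Without them, legacy SQL tries to read the first argument as a table name, which is why the error complains about a missing dataset name.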
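
If the query text is assembled programmatically, the extra parentheses are easy to lose. The following is a minimal sketch of a helper that always adds them. The name flattenSubquery and its parameters are hypothetical and are not part of BigQuery or the bq tool; it only builds a string.

```javascript
// Hypothetical helper (not a BigQuery or bq API): builds a legacy SQL
// FLATTEN clause and always wraps the subquery in its own parentheses.
function flattenSubquery(subquery, repeatedField, alias) {
  // Normalise indentation so the generated SQL stays readable.
  const indented = subquery
    .trim()
    .split('\n')
    .map((line) => '  ' + line.trim())
    .join('\n');
  // Note the double opening parenthesis: one for FLATTEN, one for the subquery.
  return 'FLATTEN((\n' + indented + '\n), ' + repeatedField + ') AS ' + alias;
}

const clause = flattenSubquery(
  `SELECT hits.page.pagePath, visitId, fullVisitorId
   FROM TABLE_DATE_RANGE([datasetname.ga_sessions_],
     TIMESTAMP('2016-10-01'), TIMESTAMP('2016-10-31'))`,
  'hits',
  'b'
);
console.log(clause);
```

The printed clause can be pasted after FROM or LEFT JOIN in a query like the one below.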

Run the bq command in a shell: bq query --udf_resource=udf.js "$(cat query.sql)"

query.sql contains the following script:

SELECT
  substr(date,1,6) as date,
  device,
  channelGroup,
  COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
  COUNT(DISTINCT fullVisitorId) AS users,
  COUNT(DISTINCT transactionId) as orders,
  CAST(SUM(transactionRevenue)/1000000 AS INTEGER) as sales
FROM
  defaultChannelGroup(
    SELECT
      a.date as date,
      a.device.deviceCategory AS device,
      b.hits.page.pagePath AS page,
      a.fullVisitorId as fullVisitorId,
      a.visitId as visitId,
      a.trafficSource.source AS trafficSourceSource,
      a.trafficSource.medium AS trafficSourceMedium,
      a.trafficSource.campaign AS trafficSourceCampaign,
      a.hits.transaction.transactionRevenue as transactionRevenue,
      a.hits.transaction.transactionID as transactionId
    FROM FLATTEN((
      SELECT  date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID,
              hits.transaction.transactionID, hits.transaction.transactionRevenue
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
    ),hits) as a
    LEFT JOIN FLATTEN((
      SELECT hits.page.pagePath,hits.time,trafficSource.medium,visitID,fullVisitorId
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
      WHERE
        hits.time = 0
        and trafficSource.medium = 'organic'
    ),hits) as b
    ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
  )
GROUP BY
  date,
  device,
  channelGroup
ORDER BY sessions DESC

udf.js contains the following function (the computeDefaultChannelGroup function is not shown here):

function defaultChannelGroup(row, emit)
{
  function output(channelGroup) {
    emit({channelGroup:channelGroup,
      date: row.date,
      fullVisitorId: row.fullVisitorId, 
      visitId: row.visitId,
      device: row.device,
      transactionId: row.transactionId,
      transactionRevenue: row.transactionRevenue,
      });
  }
  computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',
  ['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId', 'transactionId', 'transactionRevenue'],
  [{'name': 'channelGroup', 'type': 'string'},
  {'name': 'date', 'type': 'string'},
  {'name': 'fullVisitorId', 'type': 'string'},
  {'name': 'visitId', 'type': 'integer'},
  {'name': 'device', 'type': 'string'},
  {'name': 'transactionId', 'type': 'string'},
  {'name': 'transactionRevenue', 'type': 'integer'}
],
  defaultChannelGroup
);
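
Because computeDefaultChannelGroup is omitted, the file above cannot be run on its own. The sketch below shows one way to exercise the UDF locally in Node before submitting a job. It rests on several assumptions: the bigquery object is a local stand-in for the global that BigQuery provides to legacy UDF files, runUdf is a hypothetical test helper, and computeDefaultChannelGroup is a simplified placeholder, not the author's function and not Google's full channel grouping rules.

```javascript
// Local stand-in for the `bigquery` global available inside legacy UDF files.
const registered = {};
const bigquery = {
  defineFunction(name, inputColumns, outputSchema, fn) {
    registered[name] = { inputColumns, outputSchema, fn };
  },
};

// HYPOTHETICAL placeholder for the omitted function: a few simplified rules only.
function computeDefaultChannelGroup(row, output) {
  const medium = (row.trafficSourceMedium || '').toLowerCase();
  const source = (row.trafficSourceSource || '').toLowerCase();
  if (source === '(direct)' && (medium === '(none)' || medium === '(not set)')) {
    output('Direct');
  } else if (medium === 'organic') {
    output('Organic Search');
  } else if (/^(cpc|ppc|paidsearch)$/.test(medium)) {
    output('Paid Search');
  } else if (medium === 'referral') {
    output('Referral');
  } else {
    output('(Other)');
  }
}

// Same shape as the UDF in the answer: pass selected input fields through
// and add the computed channelGroup.
function defaultChannelGroup(row, emit) {
  function output(channelGroup) {
    emit({
      channelGroup: channelGroup,
      date: row.date,
      fullVisitorId: row.fullVisitorId,
      visitId: row.visitId,
      device: row.device,
      transactionId: row.transactionId,
      transactionRevenue: row.transactionRevenue,
    });
  }
  computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',
  ['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource',
   'trafficSourceCampaign', 'fullVisitorId', 'visitId', 'transactionId',
   'transactionRevenue'],
  [{ name: 'channelGroup', type: 'string' },
   { name: 'date', type: 'string' },
   { name: 'fullVisitorId', type: 'string' },
   { name: 'visitId', type: 'integer' },
   { name: 'device', type: 'string' },
   { name: 'transactionId', type: 'string' },
   { name: 'transactionRevenue', type: 'integer' }],
  defaultChannelGroup
);

// Hypothetical test helper: feeds rows to a registered UDF and collects what it emits.
function runUdf(name, rows) {
  const emitted = [];
  rows.forEach((row) => registered[name].fn(row, (out) => emitted.push(out)));
  return emitted;
}

// Made-up sample rows, for illustration only.
const sampleRows = [
  { date: '20161001', device: 'desktop', page: '/', trafficSourceMedium: 'organic',
    trafficSourceSource: 'google', trafficSourceCampaign: '(not set)',
    fullVisitorId: '123', visitId: 1, transactionId: null, transactionRevenue: null },
  { date: '20161002', device: 'mobile', page: '/cart', trafficSourceMedium: '(none)',
    trafficSourceSource: '(direct)', trafficSourceCampaign: '(not set)',
    fullVisitorId: '456', visitId: 2, transactionId: 'T1', transactionRevenue: 5000000 },
];

const results = runUdf('defaultChannelGroup', sampleRows);
console.log(results);
```

A check like this only confirms that the emitted field names line up with the output schema passed to bigquery.defineFunction; it says nothing about how the query itself parses.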

This runs without errors, and the results match the data in Google Analytics.
