Join on file_fdw foreign table and postgres_fdw foreign table: answers

【Question title】: Join on file_fdw foreign table and postgres_fdw foreign table
【Posted】: 2017-04-06 09:31:37
【Question】:

In PostgreSQL 9.5:

I have a foreign table named sheetheight (created with file_fdw) and a foreign table named dzlog (created with postgres_fdw).

1- To join the two foreign tables, I have the following query:

SELECT * from dzlog INNER JOIN sheetheight ON dzlog.ullid = sheetheight.ullid;

EXPLAIN ANALYZE returns this for the above query:

-------------------------------------------------
 Hash Join  (cost=111.66..13688.18 rows=20814 width=2180) (actual time=7670.872..8527.844 rows=2499 loops=1)
   Hash Cond: (sheetheight.ullid = dzlog.ullid)
   ->  Foreign Scan on sheetheight  (cost=0.00..12968.10 rows=106741 width=150) (actual time=0.116..570.571 rows=223986 loops=1)
         Foreign File: D:\code\sources\sheetHeight_20151025_221244_00000000049876878996.csv
         Foreign File Size: 18786370
   ->  Hash  (cost=111.17..111.17 rows=39 width=2030) (actual time=7658.661..7658.661 rows=34107 loops=1)
         Buckets: 2048 (originally 1024)  Batches: 32 (originally 1)  Memory Usage: 4082kB
         ->  Foreign Scan on dzlog  (cost=100.00..111.17 rows=39 width=2030) (actual time=47.162..7578.990 rows=34107 loops=1)
 Planning time: 8.755 ms
 Execution time: 8530.917 ms
(10 rows)

The output of the query has two columns named ullid:

ullid,date,color,sheetid,dz0,dz1,dz2,dz3,dz4,dz5,dz6,dz7,ullid,sheetid,pass,...

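2- I ran the same query without FDW, accessing the csv file and the PostgreSQL local table directly from a Python application and merging them as Pandas dataframes. This is a raw join: I first read the csv file, then fetch the SQL table with the pandas library, and then merge the two dataframes on the common column: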

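import pandas as pd
import psycopg2

def rawjoin(query="SELECT * FROM dzlog;", connection=None):
    # Open the connection on first use rather than at function-definition time.
    if connection is None:
        connection = psycopg2.connect("dbname='mydb' user='qfsa' host='localhost' password='123' port=5433")
    # Raw string so the backslashes in the Windows path are not read as escapes.
    firstTable = pd.read_csv(r'.\sources\sheetHeight_20151025_221244_000000000498768789.csv', delimiter=';', header=0)
    secondTable = pd.read_sql(query, connection)
    # Inner merge on the shared key; pandas keeps a single 'ullid' column.
    merged = pd.merge(firstTable, secondTable, on='ullid', how='inner')
    return merged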

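The result is a joined dataframe with a single ullid column.

Any idea what causes this difference? I ran other kinds of joins, and for those the results of raw access and FDW access are the same. The other queries are: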

 q7=("SELECT dzlog.color FROM dzlog,sheetheight WHERE dzlog.ullid = sheetheight.ullid;")
 q8=("SELECT sheetheight.defectfound FROM dzlog, sheetheight WHERE dzlog.ullid = sheetheight.ullid;")
 q9=("SELECT dzlog.color, sheetheight.defectfound FROM dzlog, sheetheight WHERE dzlog.ullid= sheetheight.ullid;")

【Question comments】:

Tags: postgresql join foreign-data-wrapper

【Solution 1】:

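I don't know what your second example does, so it is hard to say. Which library does it use? Does it generate SQL, or does it perform the join in the application (which almost always costs performance)? If it results in an SQL statement, what is that statement?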

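The first query returns the column twice because you asked it to return all columns from all tables involved, both tables have that column, and the join condition forces the two values to be equal.

You can write an SQL statement that outputs the column only once, like this: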

SELECT *
FROM dzlog
   JOIN sheetheight
      USING (ullid);

This looks a lot like the code in your second example, doesn't it?
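The difference between the two join forms can be checked without the foreign tables. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the two tiny tables, their rows, and the helper joined_columns are hypothetical, invented for illustration. It only compares which column names SELECT * produces for ON versus USING.

```python
import sqlite3

def joined_columns(sql):
    """Run `sql` against two tiny in-memory tables and return the output column names."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical miniature stand-ins for the two tables in the question.
        conn.execute("CREATE TABLE dzlog (ullid INTEGER, color TEXT)")
        conn.execute("CREATE TABLE sheetheight (ullid INTEGER, defectfound INTEGER)")
        conn.execute("INSERT INTO dzlog VALUES (1, 'red'), (2, 'blue')")
        conn.execute("INSERT INTO sheetheight VALUES (1, 0), (3, 1)")
        cursor = conn.execute(sql)
        # cursor.description holds one tuple per output column; item 0 is its name.
        return [col[0] for col in cursor.description]
    finally:
        conn.close()

# Join with ON: both tables contribute their own ullid column.
on_cols = joined_columns(
    "SELECT * FROM dzlog INNER JOIN sheetheight ON dzlog.ullid = sheetheight.ullid"
)
# Join with USING: the shared join column is emitted only once.
using_cols = joined_columns("SELECT * FROM dzlog JOIN sheetheight USING (ullid)")

print(on_cols)
print(using_cols)
```

With ON, the column list contains ullid twice; with USING, it appears once, which matches what pd.merge(..., on='ullid') returns in the question. PostgreSQL suppresses the redundant column for JOIN ... USING in the same way, but the sketch does not exercise file_fdw or postgres_fdw themselves.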

【Comments】:

Yes, the query was the problem. Your query is correct.