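WAL is the counterpart of the redo log in Oracle and the redo log in MySQL. In 9.6 and earlier it was called xlog and lived in the pg_xlog directory; from version 10 on it lives in the pg_wal directory. A WAL segment is 16 MB by default. The size can be chosen at initdb time (--wal-segsize, available since version 11) and in principle cannot be changed afterwards. The contents of the binary log can be inspected with pg_waldump. For a walkthrough of the WAL structure see https://www.cnblogs.com/abclife/p/13708947.html (it is not entirely correct, though: for example, its parsing of the physical file ID from the LSN is wrong). The physical structure of the WAL is as follows:
The execution flow of WAL archiving is described at https://wiki.moritetu.xyz/?PostgreSQL/%E8%A7%A3%E6%9E%90/WAL%E3%82%A2%E3%83%BC%E3%82%AB%E3%82%A4%E3%83%96.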
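On the LSN-to-file mapping mentioned above, here is a minimal sketch of how an LSN can be mapped to a WAL segment file name. It assumes the default 16 MB segment size and the usual 24-hex-digit file name layout (timeline, then the segment number split into two 8-digit parts). The function names parse_lsn, wal_file_name and offset_in_segment are hypothetical helpers written for this illustration, not PostgreSQL APIs.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(lsn_text: str) -> int:
    """Convert the textual form 'hi/lo' (two hex numbers) into a 64-bit integer."""
    hi, lo = lsn_text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline: int, lsn_text: str,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the segment file that contains the byte at the given LSN."""
    segno = parse_lsn(lsn_text) // segment_size
    # Number of segments covered by one 4 GB "xlogid" (256 for 16 MB segments).
    segs_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (timeline,
                             segno // segs_per_xlogid,
                             segno % segs_per_xlogid)


def offset_in_segment(lsn_text: str,
                      segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Byte offset of the LSN inside its segment file."""
    return parse_lsn(lsn_text) % segment_size
```

For example, wal_file_name(1, "0/16B3748") gives 000000010000000000000001, and the record sits at offset 0x6B3748 inside that file. With a non-default segment size, pass segment_size explicitly.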
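clog (short for Commit Log, the PostgreSQL transaction-commit-log manager, implemented mainly in clog.c) records the state of every transaction. The state has to be updated each time a transaction commits or rolls back (via CommitTransactionCommand(void)), and the PostgreSQL server reads these files to determine a transaction's state. The files are stored in the pg_xact directory. Each file is 256 KB and each transaction takes 2 bits, so one file holds 1,048,576 transactions (262,144 bytes × 4 transactions per byte). For a row that has just been modified, the transaction state is stored only in clog, so the first visibility check after the modification has to go to clog. Accessing clog is expensive, which is why there has been a very long discussion about optimizing clog access.
The possible transaction states in clog are: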
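The capacity of a clog file follows from a few constants. The sketch below redoes the arithmetic, assuming the default BLCKSZ of 8192 and 32 pages per SLRU segment (the values behind the 256 KB file size); clog_capacity is a hypothetical helper written for this illustration.

```python
CLOG_BITS_PER_XACT = 2        # two status bits per transaction
SLRU_PAGES_PER_SEGMENT = 32   # assumption: pages per clog segment file


def clog_capacity(blcksz: int = 8192) -> dict:
    """Return the derived sizes for one clog page and one clog segment file."""
    xacts_per_byte = 8 // CLOG_BITS_PER_XACT
    xacts_per_page = blcksz * xacts_per_byte
    segment_bytes = blcksz * SLRU_PAGES_PER_SEGMENT
    return {
        "xacts_per_byte": xacts_per_byte,
        "xacts_per_page": xacts_per_page,
        "segment_bytes": segment_bytes,
        "xacts_per_segment": xacts_per_page * SLRU_PAGES_PER_SEGMENT,
    }
```

With the defaults this yields 4 transactions per byte, 32768 per page, 262144 bytes per file and 1048576 transactions per file.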
/*
 * Possible transaction statuses --- note that all-zeroes is the initial
 * state.
 *
 * A "subcommitted" transaction is a committed subtransaction whose parent
 * hasn't committed or aborted yet.
 */
typedef int XidStatus;

#define TRANSACTION_STATUS_IN_PROGRESS		0x00
#define TRANSACTION_STATUS_COMMITTED		0x01
#define TRANSACTION_STATUS_ABORTED			0x02
#define TRANSACTION_STATUS_SUB_COMMITTED	0x03
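These are defined in clog.h.
Because PostgreSQL's MVCC keeps old row versions in the data files themselves instead of in a separate undo log, rows created by a transaction are not deleted even if that transaction rolls back. Since clog records the outcome of every transaction, other transactions can use it while checking xmin and xmax to decide whether or not to filter such rows out (mainly the xmax = 0 case, where the inserting transaction may have committed or may not have committed yet).
pg_xact (named pg_clog in 9.6 and earlier, although the source file is still clog.c)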
[postgres@hs-10-20-30-194 pg_xact]$ ll
total 13208
-rw------- 1 postgres postgres 262144 May 24 17:26 0000
-rw------- 1 postgres postgres 262144 May 24 17:26 0001
-rw------- 1 postgres postgres 262144 May 24 17:27 0002
-rw------- 1 postgres postgres 262144 May 24 17:27 0003
-rw------- 1 postgres postgres 262144 May 24 17:27 0004
-rw------- 1 postgres postgres 262144 May 24 17:28 0005
-rw------- 1 postgres postgres 262144 May 24 17:28 0006
-rw------- 1 postgres postgres 262144 May 24 17:28 0007
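The file names in the listing are segment numbers written as four hex digits. As a rough sketch, assuming the default 8 KB block size and 32 pages per segment, the file that holds a given xid and the xid range covered by a file can be computed as follows; clog_segment_file and segment_xid_range are hypothetical helper names.

```python
CLOG_XACTS_PER_PAGE = 8192 * 4    # assumption: BLCKSZ = 8192, 4 xacts per byte
SLRU_PAGES_PER_SEGMENT = 32       # assumption: pages per segment file
XACTS_PER_SEGMENT = CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT


def clog_segment_file(xid: int) -> str:
    """File name under pg_xact that stores the status bits of this xid."""
    pageno = xid // CLOG_XACTS_PER_PAGE
    segno = pageno // SLRU_PAGES_PER_SEGMENT
    return "%04X" % segno


def segment_xid_range(file_name: str) -> tuple:
    """First and last xid whose status lives in the given segment file."""
    segno = int(file_name, 16)
    first = segno * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1
```

Under these assumptions, file 0007 in the listing covers xids 7340032 through 8388607.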
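Interaction between clog and WAL: this first requires understanding the complete life cycle of a transaction.
At the AM (access method) layer, the xlog interfaces are called to write WAL records into the WAL file. After PortalDrop has cleaned up the finished execution, the main entry point exec_simple_query()->finish_xact_command() calls, in order, CommitTransactionCommand()->CommitTransaction()->RecordTransactionCommit()->XactLogCommitRecord(), which calls XLogInsert() to write the commit WAL record into the WAL file. RecordTransactionCommit() then calls XLogFlush to flush the commit WAL record, and afterwards calls TransactionIdCommitTree() to update clog. The chain there is TransactionIdCommitTree->TransactionIdSetTreeStatus->TransactionIdSetPageStatus->TransactionIdSetPageStatusInternal, which finds the slotno for the pageno (the buffers are managed by SLRU, a simple least-recently-used algorithm) and calls TransactionIdSetStatusBit, which derives the offset from the xid and updates the transaction state with bit operations.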
/*
 * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
 * CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
 * and CLOG segment numbering at
 * 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
 * explicit notice of that fact in this module, except when comparing segment
 * and page numbers in TruncateCLOG (see CLOGPagePrecedes).
 */

/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT	2
#define CLOG_XACTS_PER_BYTE	4	/* transactions per byte */
#define CLOG_XACTS_PER_PAGE	(BLCKSZ * CLOG_XACTS_PER_BYTE)	/* transactions per block: 32768 */
#define CLOG_XACT_BITMASK	((1 << CLOG_BITS_PER_XACT) - 1)	/* 0x03, i.e. binary 11 */

/* page that holds the xid: xid integer-divided by 32768 */
#define TransactionIdToPage(xid)	((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
/* position of the xid within its page: xid modulo 32768 */
#define TransactionIdToPgIndex(xid)	((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
/* byte offset within the page */
#define TransactionIdToByte(xid)	(TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
/* position of the xid within its byte: xid modulo 4 */
#define TransactionIdToBIndex(xid)	((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
/* We store the latest async LSN for each group of transactions */
#define CLOG_XACTS_PER_LSN_GROUP 32 /* keep this a power of 2 */
#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)
#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
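The addressing macros above, together with the bit update done by TransactionIdSetStatusBit, can be sketched as follows. This is a simplified model that treats one clog page as a bytearray and assumes BLCKSZ = 8192. The helpers locate, set_status and get_status are hypothetical names for illustration; the real code also handles locking, subtransactions and LSN tracking.

```python
BLCKSZ = 8192                     # assumption: default block size
CLOG_BITS_PER_XACT = 2
CLOG_XACTS_PER_BYTE = 4
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE
CLOG_XACT_BITMASK = (1 << CLOG_BITS_PER_XACT) - 1   # 0x03

IN_PROGRESS, COMMITTED, ABORTED, SUB_COMMITTED = 0x00, 0x01, 0x02, 0x03


def locate(xid: int) -> tuple:
    """Return (page number, byte offset in page, bit shift in byte) for an xid."""
    pageno = xid // CLOG_XACTS_PER_PAGE
    byteno = (xid % CLOG_XACTS_PER_PAGE) // CLOG_XACTS_PER_BYTE
    bshift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT
    return pageno, byteno, bshift


def set_status(page: bytearray, xid: int, status: int) -> None:
    """Overwrite the two status bits of xid inside its page."""
    _, byteno, bshift = locate(xid)
    byteval = page[byteno]
    byteval &= ~(CLOG_XACT_BITMASK << bshift) & 0xFF   # clear the old two bits
    byteval |= status << bshift                        # store the new status
    page[byteno] = byteval


def get_status(page: bytearray, xid: int) -> int:
    """Read the two status bits of xid from its page."""
    _, byteno, bshift = locate(xid)
    return (page[byteno] >> bshift) & CLOG_XACT_BITMASK
```

For instance, on a zeroed page, marking xid 100 as committed and xid 101 as aborted changes only byte 25, which becomes binary 1001; xid 102 in the same byte still reads as in progress.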
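The clog update happens in memory and is not flushed to disk right away, which raises two questions. 1. Where is it used during restart recovery? 2. Where is it consulted when tuple visibility is checked?
Whenever a tuple is fetched, its xmin and xmax are checked for commit status. If t_infomask carries no hint bit for them, the clog buffers are consulted, as the following call stack shows: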
TransactionIdGetStatus clog.c:654
TransactionLogFetch transam.c:79
TransactionIdDidCommit transam.c:129
HeapTupleSatisfiesMVCC heapam_visibility.c:1058
HeapTupleSatisfiesVisibility heapam_visibility.c:1695
heapgetpage heapam.c:476
heapgettup_pagemode heapam.c:917
heap_getnextslot heapam.c:1390
table_scan_getnextslot tableam.h:906
SeqNext nodeSeqscan.c:80
ExecScanFetch execScan.c:133
ExecScan execScan.c:182
ExecSeqScan nodeSeqscan.c:112
ExecProcNodeFirst execProcnode.c:454
ExecProcNode executor.h:248
ExecutePlan execMain.c:1632
standard_ExecutorRun execMain.c:350
CitusExecutorRun multi_executor.c:214
pgss_ExecutorRun pg_stat_statements.c:1043
pgsk_ExecutorRun pg_stat_kcache.c:1034
pgqs_ExecutorRun pg_qualstats.c:661
explain_ExecutorRun auto_explain.c:334
ExecutorRun execMain.c:292
PortalRunSelect pquery.c:912
PortalRun pquery.c:756
exec_simple_query postgres.c:1325
PostgresMain postgres.c:4415
BackendRun postmaster.c:4527
BackendStartup postmaster.c:4211
ServerLoop postmaster.c:1740
PostmasterMain postmaster.c:1413
main main.c:231
__libc_start_main 0x00007f3353efd555
_start 0x0000000000483aa9
/*
 * information stored in t_infomask:
 */
#define HEAP_HASNULL			0x0001	/* has null attribute(s) */
#define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
#define HEAP_HASOID_OLD			0x0008	/* has an object-id field */
#define HEAP_XMAX_KEYSHR_LOCK	0x0010	/* xmax is a key-shared locker */
#define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
#define HEAP_XMAX_LOCK_ONLY		0x0080	/* xmax, if valid, is only a locker */

 /* xmax is a shared locker */
#define HEAP_XMAX_SHR_LOCK	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)

#define HEAP_LOCK_MASK	(HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
						 HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN		(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
#define HEAP_XMAX_INVALID		0x0800	/* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI		0x1000	/* t_xmax is a MultiXactId */
#define HEAP_UPDATED			0x2000	/* this is UPDATEd version of row */
#define HEAP_MOVED_OFF			0x4000	/* moved to another place by pre-9.0
										 * VACUUM FULL; kept for binary
										 * upgrade support */
#define HEAP_MOVED_IN			0x8000	/* moved from another place by pre-9.0
										 * VACUUM FULL; kept for binary
										 * upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HEAP_XACT_MASK			0xFFF0	/* visibility-related bits */
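To make the role of the hint bits concrete, here is a small sketch that decodes the xmin and xmax hint bits from a t_infomask value and reports whether a clog lookup would still be needed. It only mirrors the flag values listed above; xmin_hint, xmax_hint and needs_clog_lookup are hypothetical helpers, and the real visibility rules (snapshots, the current transaction, multixacts) are considerably more involved.

```python
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200
HEAP_XMIN_FROZEN = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID
HEAP_XMAX_COMMITTED = 0x0400
HEAP_XMAX_INVALID = 0x0800


def xmin_hint(infomask: int):
    """Return what the hint bits say about xmin, or None if nothing is cached."""
    bits = infomask & HEAP_XMIN_FROZEN
    if bits == HEAP_XMIN_FROZEN:
        return "frozen"
    if bits == HEAP_XMIN_COMMITTED:
        return "committed"
    if bits == HEAP_XMIN_INVALID:
        return "invalid"
    return None


def xmax_hint(infomask: int):
    """Return what the hint bits say about xmax, or None if nothing is cached."""
    if infomask & HEAP_XMAX_INVALID:
        return "invalid"
    if infomask & HEAP_XMAX_COMMITTED:
        return "committed"
    return None


def needs_clog_lookup(infomask: int) -> bool:
    """True when xmin has no hint bit, so its state must come from clog."""
    return xmin_hint(infomask) is None
```

A tuple with t_infomask 0x0902, for example, has xmin marked committed and xmax marked invalid, so no clog access is needed, whereas 0x0002 carries no hint at all.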
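Whenever a new clog page (BLCKSZ bytes like every other page in PostgreSQL, 8 KB by default) is initialized to zeroes, clog.c generates a WAL record. The other writes to clog come from the recording of commits and aborts in xact.c, which writes its own WAL records for these events, so the status update is performed again on redo. For synchronous commit, XLOG is guaranteed to be flushed before the commit is recorded in clog, so the WAL rule is satisfied automatically. For asynchronous commit, the latest LSN affecting each CLOG page has to be tracked so that the corresponding xlog can be flushed before the page is written. A detailed description of clog is at https://www.interdb.jp/pg/pgsql05.html. For clog cleanup see https://www.interdb.jp/pg/pgsql06.html#_6.4.; it is handled by vacuum freeze.
A partly structured description can be found at https://blog.csdn.net/weixin_39540651/article/details/115677138.
Other directories: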
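The per-page LSN bookkeeping for asynchronous commits can be modelled roughly as below. The sketch mirrors the GetLSNIndex macro (one LSN per group of 32 transactions) and keeps, for each buffer slot, the highest commit LSN seen, which is the point up to which WAL would have to be flushed before that clog page may be written. ClogLsnTracker and its methods are hypothetical names, LSNs are plain integers here, and BLCKSZ = 8192 is assumed.

```python
CLOG_XACTS_PER_PAGE = 8192 * 4          # assumption: BLCKSZ = 8192
CLOG_XACTS_PER_LSN_GROUP = 32
CLOG_LSNS_PER_PAGE = CLOG_XACTS_PER_PAGE // CLOG_XACTS_PER_LSN_GROUP  # 1024


def get_lsn_index(slotno: int, xid: int) -> int:
    """Index of the LSN group for xid when its page sits in buffer slot slotno."""
    return (slotno * CLOG_LSNS_PER_PAGE
            + (xid % CLOG_XACTS_PER_PAGE) // CLOG_XACTS_PER_LSN_GROUP)


class ClogLsnTracker:
    """Toy model: remembers the latest async-commit LSN per transaction group."""

    def __init__(self, num_slots: int):
        self.group_lsn = [0] * (num_slots * CLOG_LSNS_PER_PAGE)

    def record_async_commit(self, slotno: int, xid: int, commit_lsn: int) -> None:
        """Keep the maximum commit LSN seen for the group that contains xid."""
        idx = get_lsn_index(slotno, xid)
        if self.group_lsn[idx] < commit_lsn:
            self.group_lsn[idx] = commit_lsn

    def lsn_to_flush_before_write(self, slotno: int) -> int:
        """WAL must be flushed at least this far before the page is written."""
        start = slotno * CLOG_LSNS_PER_PAGE
        return max(self.group_lsn[start:start + CLOG_LSNS_PER_PAGE])
```

Grouping 32 transactions under one LSN trades a little precision (WAL may be flushed slightly further than strictly necessary) for a much smaller array than one LSN per transaction.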
pg_logical
pg_commit_ts
pg_multixact
pg_subtrans
pg_snapshots
pg_replslot
pg_dynshmem
Directory layout for 10.0+ (no further changes up to and including 14)
https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND