Revisiting Linux Traditional Huge Pages and Transparent Huge Pages: Study Notes

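Linux has two kinds of large pages: standard huge pages (Huge Pages) and transparent huge pages (Transparent Huge Pages, THP). Huge Pages are also referred to as large pages, standard huge pages, or traditional huge pages; these are all names for the same thing, mentioned here so that the terminology does not cause confusion. Huge Pages were introduced with Linux kernel 2.6. The goal is to use a larger memory page size to suit ever-growing system memory and to let the operating system take advantage of the large-page support in modern hardware. THP is a feature introduced in RHEL 6 (and in other distributions: SUSE Linux Enterprise Server 11, and Oracle Linux 6 with earlier releases of Oracle Linux Unbreakable Enterprise Kernel 2 (UEK2)); see the official documentation for details.

The difference between the two lies in how the pages are allocated: standard huge pages are preallocated, whereas transparent huge pages are allocated dynamically. Many people conflate Huge Pages with Transparent Huge Pages. Using THP together with traditional HugePages currently causes problems, including performance degradation and system reboots, so Oracle recommends disabling Transparent Huge Pages. In Oracle Linux 6.5, Transparent HugePages have been removed.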
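To check which THP mode a running kernel is using, one can read the bracketed word in the "enabled" file under sysfs. The sketch below is only illustrative: the helper name thp_mode is made up here, and both sysfs paths are assumptions about the running kernel (upstream kernels use the transparent_hugepage directory, RHEL 6 kernels use redhat_transparent_hugepage), so either may be absent.

```shell
# thp_mode: print the active mode, i.e. the word in [brackets],
# from text such as "[always] madvise never".
thp_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# Try both assumed locations; neither may exist on a given system.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "THP mode ($f): $(thp_mode "$(cat "$f")")"
    fi
done
```

A result of "never" means THP is disabled, which is the setting Oracle recommends for database hosts.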

Standard Huge Pages (HugePages): English Description

HugePages is a feature integrated into the Linux kernel with release 2.6. It is a method to have larger pages where it is useful for working with very large memory. It can be useful for both 32-bit and 64-bit configurations. HugePage sizes vary from 2MB to 256MB, depending on the kernel version and the hardware architecture. For Oracle Databases, using HugePages reduces the operating system maintenance of page states, and increases TLB (Translation Lookaside Buffer) hit ratio.

The official RHEL documentation describes traditional huge pages (Huge Pages) and transparent huge pages (Transparent Huge Pages) as follows (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html):

Huge pages can be difficult to manage manually, and often require significant changes to code in order to be used effectively. As such, Red Hat Enterprise Linux 6 also implemented the use of transparent huge pages(THP). THP is an abstraction layer that automates most aspects of creating, managing, and using huge pages.

THP hides much of the complexity in using huge pages from system administrators and developers. As the goal of THP is improving performance, its developers (both from the community and Red Hat) have tested and optimized THP across a wide range of systems, configurations, applications, and workloads. This allows the default settings of THP to improve the performance of most system configurations. However, THP is not recommended for database workloads.

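In short: traditional huge pages are hard to manage by hand and usually need significant code changes to be used effectively, so RHEL 6 introduced THP as an abstraction layer that automates most of the work of creating, managing, and using huge pages. Its developers, from both the community and Red Hat, have tested and tuned it across many systems and workloads, so the default settings help most configurations. Even so, THP is not recommended for database workloads.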

Note: THP currently maps only anonymous memory regions, such as heap and stack space.

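As we know, the x86 architecture uses virtual memory, which allows the addressable range to exceed the physical memory available in the hardware. Each process gets its own addressable memory and sees it as reserved for its exclusive use; this is the process's virtual memory. In reality that memory may reside in physical memory on RAM chips, or in a dedicated area on disk called the swap or paging area. The process does not know whether its virtual memory is in RAM or on disk; the operating system manages it. When the memory required exceeds the available physical memory, the operating system moves some memory out to the paging area. Because disk access is far slower than RAM, this activity is extremely inefficient and a common cause of performance problems: a process that is paging suffers noticeably.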

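In addition, as hardware advances rapidly, servers carry more and more memory, and the more memory a system uses, the more resources it takes to manage. On Linux, memory management is carried out by the kswapd process together with the page table, an in-memory structure with a record for every process on the system. Each record lists every page of virtual memory the process uses and its physical address (in RAM or on disk). The processor's TLB (Translation Lookaside Buffer, a small cache inside the CPU) assists with this. The operating system uses page table entries to manage the memory used by processes; the Linux process that performs this management is called kswapd and is visible in operating system tools. The TLB caches page table entries to improve performance. A typical TLB holds between 4 and 4,096 entries, which is not enough when there are millions or even billions of page table entries.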

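When a large amount of memory is used by an Oracle database or another application, the operating system spends considerable resources managing virtual-to-physical address translation, and the result is often a very large page table. Because a process's page table holds the virtual-to-physical translation for every memory page the process is using, a very large System Global Area (SGA) can make each process's page table very large. For example, on one of our test servers with 64 GB of memory and SGA_TARGET set to 32 GB, the page tables (PageTables) occupied 1573080 kB, close to 1.5 GB, when traditional huge pages were not in use. The number of pages to manage is enormous, and that leads to significant performance overhead.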

# grep PageTables /proc/meminfo

PageTables: 1573080 kB
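To put that figure in more familiar units, the PageTables line can be converted from kB to MiB. This is a minimal sketch: the helper name pagetables_mib is made up for illustration, and it reads meminfo-formatted text from standard input so that it can be tried on the sample line as well as on /proc/meminfo.

```shell
# pagetables_mib: read meminfo-style text on stdin and print the
# PageTables value converted from kB to whole MiB (truncated).
pagetables_mib() {
    awk '/^PageTables:/ { printf "%d\n", $2 / 1024 }'
}

# Sample line taken from the output above; prints 1536.
printf 'PageTables: 1573080 kB\n' | pagetables_mib

# On a live Linux system one would run:
#   pagetables_mib < /proc/meminfo
```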

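These are the reasons traditional huge pages were introduced. What problem do they solve? Memory is managed in blocks known as pages. On 64-bit Linux (x86-64), memory is managed in 4 KB pages by default, that is, 4,096 bytes per page: 1 MB of memory is 256 pages, and 2 MB is 512 pages. Managing that many pages is expensive. The CPU's memory management unit includes a TLB that caches page table entries, and every page needs its own entry. The page table is the in-memory structure that stores the mapping between virtual pages and physical pages, so the smaller the page size, the larger the page table. The default huge page size is 2 MB, 512 times the 4 KB base page, which greatly shrinks the page table. With HugePages enabled, a single page table entry represents one large page instead of many entries representing small pages, so more memory can be managed, the operating system maintains less page state, and the TLB hit rate rises. Note that 2 MB is only the default Hugepagesize; other sizes are possible, in a range the Oracle description quoted earlier gives as 2 MB to 256 MB depending on kernel version and hardware architecture.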
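The page-count arithmetic is easy to reproduce. The sketch below uses a made-up helper, pages_needed, and applies it to the 32 GB SGA from the test server example; it only counts pages and says nothing about how a real kernel lays out its page tables.

```shell
# pages_needed <memory_kib> <page_size_kib>
# Number of pages required to cover the memory, rounded up.
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

sga_kib=$(( 32 * 1024 * 1024 ))   # 32 GiB expressed in KiB

echo "4 KiB pages: $(pages_needed "$sga_kib" 4)"      # 8388608
echo "2 MiB pages: $(pages_needed "$sga_kib" 2048)"   # 16384
```

The same 32 GiB needs about 8.4 million entries with 4 KiB pages but only 16,384 with 2 MiB pages, a reduction by a factor of 512.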

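If the explanation above is still not clear or thorough enough, consider the following excerpt:

Most operating systems manage memory by segmentation or paging. Segmentation is coarse-grained, whereas paging is fine-grained and avoids wasting memory space. With these come the notions of physical and virtual addresses. Under either scheme, the CPU must translate a virtual address into a physical address before it can actually access memory. To speed up that translation, the CPU caches the most recent virtual-to-physical mappings in a table that it maintains itself, and for the fastest memory access this table should hold as many mappings as possible. Linux uses paged memory management; to make full use of physical memory, the kernel uses an LRU algorithm to move infrequently used pages out of physical memory to swap space at suitable times, keeping frequently used data in RAM. By default a Linux page is 4 KB, so when physical memory is large the mapping table has a great many entries, which slows the CPU's lookups. Since the amount of memory is fixed, the only way to reduce the number of entries is to increase the page size. That is where huge pages come from: they break with the traditional small-page scheme and use large pages of 2 MB, 4 MB, and so on. The number of mapping entries drops markedly, and the TLB hit rate rises substantially.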

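Why does Oracle use standard huge pages (Huge Pages) to improve performance? The Oracle database uses shared memory (the SGA) to manage shareable resources: for example, the shared pool stores shared SQL statements and execution plans, and the buffer cache stores data blocks. Accessing these resources amounts to Oracle calling OS APIs to access memory. Memory operations are normally fast, and the database works well. In some situations, however, performance problems arise:

a) If part of the SGA has been swapped out to disk, accessing it again takes a very long time.

b) If the OS has a very large amount of memory, managing it and reaching the memory we need takes longer.

In these situations we often run into latch, mutex, library cache lock/pin, and row cache lock problems.

HugePages on Linux can resolve the performance fluctuations caused by both of these problems.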

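As noted, 64-bit Linux manages memory in 4 KB pages by default, and on a system with a great deal of memory the cost of managing those pages is high; HugePages uses 2 MB pages to reduce that overhead. Memory managed by HugePages cannot be swapped, which avoids swap-induced database performance problems. So a system that frequently suffers performance problems caused by swapping should certainly enable HugePages, and so should a system with a very large amount of memory. How large is large enough to require HugePages? There is no settled answer. Some documents suggest enabling it above 12 GB; we strongly recommend testing thoroughly in a test environment before deciding whether to use HugePages in production.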
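A rough first check for whether a system is swapping at all is to compare SwapTotal and SwapFree in /proc/meminfo. The helper below, swap_used_kib, is a made-up name for illustration and reads meminfo-style text from standard input. Nonzero swap usage is only a hint: it does not show that the SGA itself was swapped out or that swapping is causing the performance problem.

```shell
# swap_used_kib: read meminfo-style text on stdin and print
# SwapTotal - SwapFree in kB.
swap_used_kib() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { printf "%d\n", total - free }'
}

# Sample values, invented for the example; prints 388604.
printf 'SwapTotal: 8388604 kB\nSwapFree: 8000000 kB\n' | swap_used_kib

# On a live Linux system one would run:
#   swap_used_kib < /proc/meminfo
```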

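Of course, everything has two sides, and HugePages has some minor drawbacks. First, it requires extra configuration, though this is negligible. Second, with HugePages in use, the 11g feature AMM (Automatic Memory Management) can no longer be used, although ASMM (Automatic Shared Memory Management) still can.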

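Below are some related terms and the main characteristics of Huge Pages. Most of this material comes from English documentation on the Red Hat site and My Oracle Support (MOS):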

· Page Table: A page table is the data structure of a virtual memory system in an operating system to store the mapping between virtual addresses and physical addresses. This means that on a virtual memory system, the memory is accessed by first accessing a page table and then accessing the actual memory location implicitly.

· TLB: A Translation Lookaside Buffer (TLB) is a buffer (or cache) in a CPU that contains parts of the page table. This is a fixed size buffer being used to do virtual address translation faster.

· hugetlb: This is an entry in the TLB that points to a HugePage (a large/big page larger than regular 4K and predefined in size). HugePages are implemented via hugetlb entries, i.e. we can say that a HugePage is handled by a "hugetlb page entry". The "hugetlb" term is also (and mostly) used synonymously with a HugePage (See Note 261889.1). In this document the term "HugePage" is going to be used but keep in mind that mostly "hugetlb" refers to the same concept.

· hugetlbfs: This is a new in-memory filesystem like tmpfs and is presented by 2.6 kernel. Pages allocated on hugetlbfs type filesystem are allocated in HugePages.

HugePages in 2.4 Kernels

The HugePages feature is backported to some 2.4 kernels. Kernel versions 2.4.21-* have this feature (See Note 311504.1 for the distributions with 2.4.21 kernels) but it is implemented in a different way. The feature is completely available. The difference from the 2.6 implementation is the organization within the source code and the kernel parameters that are used for configuring HugePages. See Parameters/Setup section below.

Advantages of HugePages Over Normal Sharing Or AMM (see below)

· Not swappable: HugePages are not swappable. Therefore there is no page-in/page-out mechanism overhead. HugePages are universally regarded as pinned in RAM.

· Relief of TLB pressure:

o HugePages use fewer pages to cover the physical address space, so the amount of “bookkeeping” (mapping from virtual to physical addresses) decreases, requiring fewer entries in the TLB.

o TLB entries cover a larger part of the address space when HugePages are used, so there will be fewer TLB misses before all or most of the SGA is mapped in the TLB.

o Fewer TLB entries for the SGA also means more entries available for other parts of the address space.

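The TLB is a cache that maps virtual addresses directly to physical addresses; it improves performance by avoiding page table lookups. A large number of TLB misses inevitably hurts system performance, especially for sequential reads. Using huge pages greatly reduces the number of page table entries (PTEs), so the same amount of data needs fewer PTEs to access. Since TLB slots are limited (typically only 512), fewer PTEs means a higher TLB hit rate.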

· Decreased page table overhead: Each page table entry can be as large as 64 bytes, so if we are trying to handle 50GB of RAM, the page table will be approximately 800MB in size, which in practice will not fit in the 880MB lowmem (in 2.4 kernels; the page table is not necessarily in lowmem in 2.6 kernels) considering the other uses of lowmem. When 95% of memory is accessed via 256MB hugepages, this can work with a page table of approximately 40MB in total. See also Document 361468.1.

· Eliminated page table lookup overhead: Since the pages are not subject to replacement, page table lookups are not required.

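Put differently, page table lookup overhead drops: with fewer PTEs, many lookups are no longer needed and the remaining ones are faster. On a TLB miss, translating a linear address to a physical address may take three additional memory reads.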

· Faster overall memory performance: On virtual memory systems each memory operation is actually two abstract memory operations. Since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.
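The numbers in the list above can be checked with simple arithmetic. The sketch below is illustrative only: the helper names tlb_reach_kib and page_table_mib are made up, the 512-slot TLB and the 64-byte entry size are the figures quoted in the text rather than properties of any particular CPU, and a real multi-level page table is not a flat array of entries.

```shell
# tlb_reach_kib <entries> <page_size_kib>
# Memory covered by a fully populated TLB, in KiB.
tlb_reach_kib() {
    echo $(( $1 * $2 ))
}

# page_table_mib <memory_mib> <page_size_kib> <entry_bytes>
# Size of a flat page table covering the memory, in whole MiB.
page_table_mib() {
    echo $(( $1 * 1024 / $2 * $3 / 1024 / 1024 ))
}

echo "TLB reach, 512 x 4 KiB pages: $(tlb_reach_kib 512 4) KiB"
echo "TLB reach, 512 x 2 MiB pages: $(tlb_reach_kib 512 2048) KiB"
echo "Page table, 50 GiB in 4 KiB pages: $(page_table_mib 51200 4 64) MiB"
echo "Page table, 5% of 50 GiB in 4 KiB pages: $(page_table_mib 2560 4 64) MiB"
```

With 4 KiB pages, 512 TLB entries cover only 2 MiB, while with 2 MiB pages they cover 1 GiB. The last two lines reproduce the 800 MB and 40 MB figures: once 95% of the memory is reached through huge pages, only the remaining 5% still needs 4 KiB entries.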

HugePages Reservation

The HugePages reservation feature is fully implemented in 2.6.17 kernel, and thus EL5 (based on 2.6.18) has this feature. The alloc_huge_page() is improved for this. (See kernel source mm/hugetlb.c)

From /usr/share/doc/kernel-doc-2.6.18/Documentation/vm/hugetlbpage.txt:

HugePages_Rsvd is short for "reserved," and is the number of hugepages for which a commitment to allocate from the pool has been made, but no allocation has yet been made. It's vaguely analogous to overcommit.

This feature in the Linux kernel enables the Oracle Database to be able to allocate hugepages for the sublevels of the SGA on-demand. The same behaviour is expected for various Oracle Database versions that are certified on EL5.
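The reservation counter appears in /proc/meminfo next to HugePages_Total and HugePages_Free. One way to read the three together is to treat Total - Free + Rsvd as the number of huge pages that are either allocated or already promised. The sketch below assumes those standard field names; the helper name hugepages_committed and the sample numbers are invented for the example.

```shell
# hugepages_committed: read meminfo-style text on stdin and print
# HugePages_Total - HugePages_Free + HugePages_Rsvd, i.e. pages that
# are allocated plus pages reserved but not yet allocated.
hugepages_committed() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^HugePages_Free:/  { free = $2 }
         /^HugePages_Rsvd:/  { rsvd = $2 }
         END { printf "%d\n", total - free + rsvd }'
}

# Invented sample: a pool of 16384 pages where the SGA has touched
# only 384 pages so far but holds reservations for 15900 more.
printf 'HugePages_Total: 16384\nHugePages_Free: 16000\nHugePages_Rsvd: 15900\n' \
    | hugepages_committed
```

In this sample the result is 16284, so only 100 pages of the pool remain unclaimed even though HugePages_Free still reports 16000.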

HugePages and Oracle 11g Automatic Memory Management (AMM)

The AMM and HugePages are not compatible. One needs to disable AMM on 11g to be able to use HugePages. See Document 749851.1 for further information.

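On Linux, kswapd is the daemon responsible for kernel page swapping, and its job is to keep memory management efficient. When physical memory runs short it becomes very aggressive and can in some cases consume 100% of a single CPU core. kswapd makes sure free memory keeps being released by monitoring the kernel's pages_high and pages_low thresholds. If free memory falls below pages_low, kswapd starts scanning and tries to free 32 pages on each pass, repeating until free memory rises above pages_high. For each page, kswapd does one of the following:

o If the page is unmodified, put it back on the free list.

o If the page is modified and can be written back to the file system, write its contents to disk.

o If the page is modified but has no backing on disk, write its contents to the swap device.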
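The watermark behaviour described above can be pictured with a toy model. This is not kernel code: the function name kswapd_passes is made up, the thresholds are arbitrary numbers, and the model assumes every pass frees exactly 32 pages, which is a simplification of the description in the text.

```shell
# kswapd_passes <free_pages> <pages_low> <pages_high>
# Toy model: if free pages are below pages_low, free 32 pages per
# pass until free pages exceed pages_high; print the pass count.
# The body runs in a subshell so its variables do not leak.
kswapd_passes() (
    free=$1; low=$2; high=$3; passes=0
    if [ "$free" -lt "$low" ]; then
        while [ "$free" -le "$high" ]; do
            free=$(( free + 32 ))
            passes=$(( passes + 1 ))
        done
    fi
    echo "$passes"
)

kswapd_passes 100 128 256   # below pages_low: prints 5
kswapd_passes 200 128 256   # above pages_low: prints 0
```

Memory backed by HugePages never enters this loop, since those pages are not candidates for swapping.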

Viewing the Standard Huge Pages Configuration

Check the page size of standard huge pages (Huge Pages):

[root@DB-Server ~]$ grep Hugepagesize /proc/meminfo

Hugepagesize:     2048 kB
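Combining Hugepagesize with HugePages_Total gives the total memory set aside for the huge page pool. A minimal sketch, assuming the standard /proc/meminfo field names; the helper name hugepage_pool_mib and the sample pool size are made up for illustration.

```shell
# hugepage_pool_mib: read meminfo-style text on stdin and print the
# size of the huge page pool (HugePages_Total x Hugepagesize) in MiB.
hugepage_pool_mib() {
    awk '/^HugePages_Total:/ { pages = $2 }
         /^Hugepagesize:/    { size_kb = $2 }
         END { printf "%d\n", pages * size_kb / 1024 }'
}

# Invented sample: 16384 pages of 2048 kB each; prints 32768.
printf 'HugePages_Total: 16384\nHugepagesize: 2048 kB\n' | hugepage_pool_mib

# On a live Linux system one would run:
#   hugepage_pool_mib < /proc/meminfo
```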