【Game Engine Architecture 3】
1、Computing performance—typically measured in millions of instructions per second (MIPS) or floating-point operations per second (FLOPS)—has been improving at a staggeringly rapid and consistent rate over the past four decades.
In the late 1970s, the Intel 8087 floating-point coprocessor could muster only about 50 kFLOPS(5 x104^ FLOPS), while at roughly the same time a Cray-1 supercomputer the size of a large refrigerator could opeate at a rate of roughly 160 MFLOPS (1.6 x10^8 FLOPS).
Today, the CPU in game consoles like the Playstation 4 or the Xbox One produces roughly 100 GFLOPS (10^11 FLOPS) of processing power, and the fastest supercomputer, currently China’s Sunway TaihuLight, has a LINPACK benchmark score of 93 PFLOPS (peta-FLOPS, or a staggering 9.3x10^16 floating-point operations per second).
This is an improvement of seven orders of magnitude for personal computers, and eight orders of magnitude for supercomputers.
Writing software that runs correctly and efficiently on parallel computing hardware is significantly more difficult than writing software for the serial computers of yesteryear. It requires a deep understanding of how the hardware actually works.
2、Implicit parallelism & Explicit parallelism
Implicit parallelism refers to the use of parallel hardware components within a CPU for the purpose of improving the performance of a single instruction stream. This is also known as instruction level parallelism (ILP), because the CPU executes instructions from a single stream (a single thread) but each instruction is executed with some degree of hardware parallelism.
隐式并行,只有一个 instruction stream,通过硬件的并行处理业加速该指令流的执行。
Explicit parallelism refers to the use of duplicated hardware components within a CPU, computer or computer system, for the purpose of running more than one instruction stream simultaneously. In other words, explicitly parallel hardware is designed to run concurrent software more efficiently than would be possible on a serial computing platform.
显式并行,同时运行多个 instruction tream.
3、Task Parallelism & Data Parallelism
4、Flynn’s Taxonomy
SISD,Instruction指的是ALU,D指的是 InstructionStream。单个ALU,多个 InsStream。
MIMD,Instruction指的是ALU,D指的是 InstructionStream。多个 ALU,多个 InsStream | 单个分时 ALU,多个 InsStream。
SIMD,Instructino指的是ALU,D指的是 ALU的位宽可以容纳多个数据。单个 ALU,单个InsStream,但ALU位宽增加。
MISD,Instructino指的是ALU,D指的是 InstructionStream。多个ALAU,单个 DataStream。
上述四种并行架构中,第三种SIMD的D比较特殊,它指的并不是多个Stream,而是ALU位宽增加。
• Single instruction, single data (SISD): A single instruction stream operating on a single data stream.
• Multiple instruction, multiple data (MIMD): Multiple instruction streams operating on multiple independent data streams.
• Single instruction, multiple data (SIMD): A single instruction stream operating on multiple data streams (i.e., performing the same sequence of operations on multiple independent streams of data simultaneously).
single “wide ALU” known as a vector processing unit (VPU) performs aciton.
SIMD是比较特殊的一类,并包含多个 instruction stream。
• Multiple instruction, single data (MISD): Multiple instruction streams all operating on a single data stream. (MISD is rarely used in games, but one common application is to provide fault tolerance via redundancy.)
5、Concurrency vs Parallelism
concurrent software doesn’t require parallel hardware, and parallel hardware isn’t only for running concurrent software.
As long as our system involves multiple readers and/or multiple writers of a shared data object, we have a concurrent system. Concurrency can be achieved via preemptive multitasking (on serial or parallel hardware) or via true parallelism (in which each thread executes on a distinct core)
6、4.2 Implicit Parallelism
1)pipeline architecture
Pipelining is a form of parallelism known as instruction-level parallelism (ILP).
Latency,Denoting latencies with the time variable T, we can write for a pipeline with N stages:
Throughput,The throughput or bandwidth of a pipeline is a measure of how many instructions it can process per unit time. The throughput of a pipeline is determined by the latency of its slowest stage:
2)supersscalar
3)very long intruction word(VLIW) architecture
8、Pipeline Dependencies
1)data dependencies
Ideally, we’d like to issue the mov, imul and add instructions on three consecutive clock cycles, to keep the pipeline as busy as possible. But in this case, the results of the imul instruction are used by the add instruction that follows it, so the CPU must wait until the imul has made it all the way through the pipeline before issuing the add. This is called stall.
data-dependencies stall 问题的解决方案
<1> instruction reordering,compiler 特性,programmer、compiler
<2> out-of-order execution,CPU 特性,compiler optimizations and out-of-order execution can cause bugs in a concurrent program.
2)control dependencies(branch dependencies)
branch dependencies 问题的解决方案
<1> speculative execution, also known as branch predication. 先计算其中一条分支,有50%算对的概率,从而提前完成。
<2> predication. 使用mask,将分支取消。
3)structual dependencies(resource dependencies)
11、Superscalar CPUs
a scalar processor. This means that it can start executing at most one instruction per clock cycle. Yes, multiple instructions are “in flight” at any given moment, but only one new instruction is sent down the pipeline every clock cycle.
duplicate most of the components on the chip, in such a way that two instructions could be launched each clock cycle. This is called a superscalar architecture.
The CPU still fetches instructions from a single instruction stream. the next two instructions are fetched and dispatched during each clock cycle.
Specular CPU 只有一个 instruction stream,即一条pipeline,通过扩展 pipeline上每一个stage的组件,以及每次从instruction stream中读取多个指令,来提升CPU性能。
In addition to data and branch dependencies, a superscalar CPU is prone to a third kind of dependency known as a resource dependency. 比如有两个 ALU,但只有一个FPU。
A two-way superscalar CPU requires roughly two times the silicon real-estate of a comparable scalar CPU design. In order to free up transistors, most superscalar CPUs are therefore reduced instruction set (RISC) processors.
12、Very Long Instruction Word
a CPU that has multiple compute elements (ALUs, FPUs, VPUs) on-chip.
13、Hyperthreading
what if the CPU could select its instructions from two separate instruction streams at once? This is the principle behind a hyperthreaded (HT) CPU core.
14、Multicore CPUs
When more than one core is included on a single CPU die, we call it a multicore CPU.
each core might employ a simple serial design, a pipelined design, a superscalar architecture, a VLIW design, or might be a hyperthreaded core. Figure 4.17 illustrates a simple example of a multicore CPU design.
下面是 ps4、xbox one x 的处理器架构。
15、Symmetric vs Asymmetric Multiprocessing
In symmetric multiprocessing (SMP), the available CPU cores in the machine (provided by any combination of hyperthreading, multicore CPUs or multiple CPUs on a single motherboard) are homogeneous in terms of design and ISA, and are treated equally by the operating system. Any thread can be scheduled to execute on any core.
The PlayStation 4 and the Xbox One are examples of SMP. Both of these consoles contain eight cores, of which seven are available for use by the programmer, and the application is free to run threads on any of the available cores.
asymmetric multiprocessing (AMP), the CPU cores are not necessarily homogeneous, and the operating system does not treat them equally. In AMP, one “master” CPU core typically runs the operating system, and the other cores are treated as “slaves” to which workloads are distributed by the master core.
The cell broadband engine (CBE) used in the PlayStation 3 is an example of AMP; it employs a main CPU known as the “power processing unit” (PPU) which is based on the PowerPC ISA, along with eight coprocessors known as “synergystic processing units” (SPUs) which are based around a completely different ISA.
16、Kernel
The “core” of the operating system—the part that handles all of the most fundamental and lowest-level operations—is called the kernel.
17、Kernel Mode vs User Mode
The kernel and its device drives run in a special mode called protected mode, privileged mode or kernel mode, while all other programs in the system (including all other parts of the operating system that aren’t part of the kernel) operate in user mode.
privileged mode has full access to all of the hardware in the computer, whereas user mode software is restricted in various ways in order to ensure stability of the computer system as a whole.
operating systems may implement multiple protection rings.
The kernel runs in ring 0, which is the most trusted ring and has all possible privileges within the system.
Device drivers might run in ring 1, trusted programs with I/O permissions might run in ring 2, while all other “untrusted” user programs run in ring 3.
18、Interrupts
An interrupt is a signal sent to the CPU in order to notify it of an important low-level event. When such an event occurs, an interrupt request (IRQ) is raised. If the operating system wishes to respond to the event, it pauses (interrupts) whatever processing had been going on, and calls a special kind of function called an interrupt service routine (ISR).
There are two kinds of interrupt: hardware interrupts and software interrupts.
For hardware interrupts there may be a tiny delay between the moment when a hardware interrupt is physically raised and when the CPU is in a suitable state to handle it.
A software interrupt can be triggered explicitly by executing an “interrupt” machine language instruction. Or one may be triggered in response to an erroneous condition detected by the CPU while running a piece of software—these are called traps or sometimes exceptions.
19、kernel Calls
On most modern operating systems, the user program doesn’t execute a software interrupt or system call instruction manually. Instead, a user program calls a kernel API function, which in turn marshalls the arguments and trips the software interrupt.
20、Preemptive Multitasking
multiprogramming prior to Windows NT and Windows 95.
in multiprogramming Windows.
21、Anatomy of a Process
user group it belongs;
• a reference to the process’s parent process, if any,
• the current working directory for the process,
22、Virtual Memory Map of a Process
Every process has its own virtual page table. pages owned by the kernel are protected from inadvertent or deliberate corruption by a user process because they are mapped to a special range of addresses known as kernel space which can only be accessed by code running in kernel mode.
23、Shared Library
Most operating systems also support the concept of shared libraries. In this case, the program contains only references to the library’s API functions, not a copy of the library’s machine code.
dynamic link libraries (DLL) under Windows.
The first time a shared library is needed by a process, the OS loads that library into physical memory, and maps a view of it into the process’s virtual address space. The addresses of the functions and global variables provided by the shared library are patched into the program’s machine code, allowing it to call them as if they had been statically linked into the executable.
a second process is run that uses the same shared library. Rather than loading a copy of the library’s code and global variables, the already-loaded physical pages are simply mapped into the virtual address space of the new process.
24、Kernel Pages
the address space of a process is actually divided into two large contiguous blocks—user space and kernel space.
For example, on 32-bit Windows, user space corresponds to the address range from address 0x0 through 0x7FFFFFFF (the lower 2 GiB of the address space), while kernel space corresponds to addresses between 0x80000000 and 0xFFFFFFFF (the upper 2 GiB of the space).
consistent “view” of the kernel’s internal data.
25、Process & Thread
26、Thread libraries
5)Yield. A function that yields the remainder of the thread’s time slice so other threads can get a chance to run.
6) Join. A function that puts the calling thread to sleep until another thread or group of threads has terminated.
27、Polling, Blocking and Yielding
a thread might need to wait for a time-consuming operation to complete, or for some resource to become available. In such a situation, we have three options:
1)The thread can poll
2)it can block,
put our thread to sleep so that it doesn’t waste CPU resources and rely on the kernel to wake it back up when the condition becomes true at some future time.
3)it can yield while polling.
28、sorts of OS functions that block.
the calling thread until the file has actually been opened
• Explicit sleeping. standard library) and pthread_sleep() (POSIX threads).
• Joining with another thread.
• Waiting for a mutex lock.
书上少说一种情况,即等待网络回包。
29、yielding
SwitchToThread() (Windows), or an equivalent system call.
30、Context Switching
• Running. The thread is actively running on a core.
on a core.
• Blocked. The thread is asleep, waiting for some condition to become true.
block that has been reserved for the thread by the kernel.
the CPU’s registers with that thread’s saved register contents.
31、Thread Priorities and Affinity
are used when scheduling threads.
when scheduling the thread.
32、Thread Local Storage
system call is provided that allows any one thread to obtain the address of its private TLS block.
33、Fiber
cooperatively, by each other.
“bootstraps” the process so that it can create and schedule more fibers.
a function that will serve as its entry point.
the Active and Inactive states.
DeleteFiber().
34、Fiber State
fiber can be in one of two states: Active or Inactive.
sleep waiting on a condition. Only its thread can be put to sleep.
other fibers to run cooperatively.
35、Fiber Migration
Inactive state.
fiber F has effectively migrated from thread A to thread B.
provided by the kernel.
36、User-Level Threads and Coroutines
knows nothing about them.
When a coroutine yields, its execution being yielded to by some other coroutine) it continues from where it left off.
37、kernel thread 的两种含意。
is a “user thread.”
a coroutine.