Cache Coherence for GPU Architectures

Contents

Abstract
1 Introduction
2 Related Work
3 Background

3.1 Baseline GPU Architecture

Abstract

Scalable coherence has been studied extensively in chip multiprocessors (CMPs), but GPUs pose new challenges.
Conventional directory protocols add unnecessary coherence traffic overhead to existing GPU applications.
These protocols also increase the verification complexity of the GPU memory system.
Recent research, Library Cache Coherence (LCC) [34, 54], explored the use of time-based approaches in CMP coherence protocols.

This paper describes a time-based coherence framework for GPUs, Temporal Coherence (TC), which exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol.
Synchronized counters enable all coherence transitions, such as the invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races.
We present an implementation of TC, called TC-Weak, which eliminates LCC's trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic.

By providing coherent L1 caches, TC-Weak improves the performance of GPU applications with inter-workgroup communication by 85% over disabling the non-coherent L1 caches in the baseline GPU.
We also find that write-through protocols outperform a writeback protocol on a GPU, because the latter suffers from increased traffic due to unnecessary refills of write-once data.

1 Introduction

GPU programming interfaces ease programming by abstracting away the SIMD hardware and providing the illusion of independent scalar threads executing in parallel.
GPUs were traditionally limited to regular parallelism, but recent studies [21, 41] show that highly irregular algorithms can also attain significant speedups on a GPU.
The multi-level cache hierarchy in recent GPUs [6, 44] lifts the burden of software-managed caches.
It further increases the GPU's attractiveness as a platform for accelerating applications with irregular memory access patterns [22, 40].

GPUs lack cache coherence and require disabling private caches if an application requires memory operations to be visible across all cores [6, 44, 45].
CMPs, in contrast, employ hardware cache coherence [17, 30, 32, 50] to enforce strict memory consistency models.
These consistency models form the basis of memory models for high-level languages [10, 35] and provide the synchronization primitives employed by multithreaded CPU applications.
Coherence greatly simplifies supporting well-defined consistency and memory models for high-level languages on GPUs.
It also helps enable a unified address space in heterogeneous architectures with single-chip CPU-GPU integration [11, 26].
This paper focuses on coherence among the GPU cores; CPU-GPU cache coherence is left as future work.

Disabling L1 caches provides coherence at the cost of application performance.
Figure 1(a) shows the potential improvement for applications that contain inter-workgroup communication and require coherent L1 caches.
Compared to disabling L1 caches, an ideally coherent GPU, where coherence does not incur any latency or traffic costs, improves the performance of these applications by 88%.

GPUs present three main challenges for coherence.
Figure 1(b) depicts the first of these challenges by comparing the interconnect traffic of the baseline non-coherent GPU system (NO-COH) to three directory protocols (described in Section 4):
- writeback MESI,
- inclusive write-through GPU-VI,
- non-inclusive write-through GPU-VIni.
These protocols introduce unnecessary coherence traffic overheads for GPU applications containing data that does not require coherence.

Second, on a GPU, CPU-like worst-case sizing [18] would require an impractical amount of storage for tracking thousands of in-flight coherence requests.
Third, existing coherence protocols introduce complexity in the form of transient states and additional message classes.
They require additional virtual networks [58] on GPU interconnects to ensure forward progress, which increases power consumption.
By contrast, tracking a large number of sharers [28, 64] is not a problem for current GPUs, which have only tens of cores.
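
To make the sharer-tracking remark concrete, here is a small back-of-the-envelope sketch in Python. It assumes a full-map directory that keeps one presence bit per core for every cache block; the function name `full_map_overhead` and the example values (16 cores, 128-byte blocks) are illustrative assumptions, not figures taken from these notes.

```python
def full_map_overhead(num_cores: int, block_bytes: int) -> float:
    """Return sharer-vector bits as a fraction of the data bits in one block.

    Assumes a full-map directory: one presence bit per core per block.
    """
    sharer_bits = num_cores          # one bit for each core that may share the block
    data_bits = block_bytes * 8      # payload bits in the cache block
    return sharer_bits / data_bits


# Hypothetical GPU-like configuration: tens of cores, wide cache blocks.
gpu_like = full_map_overhead(num_cores=16, block_bytes=128)

# Hypothetical many-core configuration where the vector is as large as the data.
many_core = full_map_overhead(num_cores=1024, block_bytes=128)

print(f"16 cores:   {gpu_like:.2%} extra storage per block")
print(f"1024 cores: {many_core:.2%} extra storage per block")
```

Under these assumed numbers the sharer vector adds under 2% per block at 16 cores, which is consistent with the point that sharer tracking is not the bottleneck at GPU core counts.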

We propose using a time-based coherence framework to minimize the overheads of GPU coherence without introducing design complexity.
Traditional coherence protocols rely on explicit messages to inform others when an address needs to be invalidated.
We describe a time-based coherence framework, TC, which uses synchronized counters to self-invalidate cache blocks and maintain coherence invariants without explicit messages.
Existing hardware implements counters synchronized across components [23, Section 17.12.1] to provide efficient timer services.
Leveraging these counters allows TC to:
- eliminate coherence traffic,
- lower area overheads,
- reduce protocol complexity for GPU coherence.
TC requires prediction of cache block lifetimes for self-invalidation.
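
The self-invalidation idea can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware design from the paper: the names `L1Block`, `L1Cache`, `fill`, `load` and `lifetime` are hypothetical, and the globally synchronized counter is modelled as a plain integer `now` passed into each call.

```python
from dataclasses import dataclass


@dataclass
class L1Block:
    data: int
    expires_at: int  # counter value at which this copy invalidates itself


class L1Cache:
    """Toy private cache whose blocks expire by comparing against a shared counter."""

    def __init__(self) -> None:
        self.blocks: dict[int, L1Block] = {}

    def fill(self, addr: int, data: int, now: int, lifetime: int) -> None:
        # The predicted lifetime fixes the expiry time when the block is fetched.
        self.blocks[addr] = L1Block(data, now + lifetime)

    def load(self, addr: int, now: int):
        block = self.blocks.get(addr)
        if block is None or now >= block.expires_at:
            # Expired copies are dropped locally; no invalidation message is needed.
            self.blocks.pop(addr, None)
            return None  # miss: the caller would refetch from the shared cache
        return block.data


cache = L1Cache()
cache.fill(addr=0xA0, data=7, now=100, lifetime=10)
hit = cache.load(0xA0, now=105)    # still inside the lifetime
miss = cache.load(0xA0, now=110)   # lifetime elapsed, block self-invalidated
```

The lifetime is the only knob in this model: a short one raises the miss rate, while a long one delays the point at which other cores can safely assume the copy is gone.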

A recently proposed time-based hardware coherence protocol, LCC [34, 54], implements sequential consistency (SC) on CMPs by stalling writes to cache blocks until they have been self-invalidated by all sharers.
We describe one implementation of the TC framework, TC-Strong, which is similar to LCC.
Section 8.3 shows that TC-Strong performs poorly on a GPU.
Our second implementation, TC-Weak, uses a novel timestamp-based memory fence to eliminate the stalling of writes.
TC-Weak uses timestamps to drive all consistency operations.
It implements release consistency (RC) [19], enabling full support of C++ and Java memory models [58] on GPUs.

Figure 2 shows the high-level operation of TC-Strong and TC-Weak.
Cores C2 and C3 have addresses A and B, respectively, cached in their private L1 caches.
In TC-Strong, C1's write to A stalls completion until C2 self-invalidates its locally cached copy of A.
Likewise, C1's write to B stalls completion until C3 self-invalidates its copy of B.
In TC-Weak, C1's writes to A and B do not stall waiting for other copies to be self-invalidated.
Instead, the fence operation ensures that all previously written addresses have been self-invalidated in other local caches.
This ensures that all previous writes from this core will be globally visible after the fence completes.
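
The contrast described for Figure 2 can be mimicked with a toy timing model in Python. It is a sketch under assumptions rather than the protocol itself: the class names `SharedL2` and `Core`, the field `visible_after`, and all time values are hypothetical, and time is again a plain integer standing in for the synchronized counter.

```python
class SharedL2:
    """Toy shared cache that records when the last private copy of an address expires."""

    def __init__(self) -> None:
        self.lease_expiry: dict[str, int] = {}

    def grant_lease(self, addr: str, now: int, lifetime: int) -> int:
        expiry = now + lifetime
        self.lease_expiry[addr] = max(self.lease_expiry.get(addr, 0), expiry)
        return expiry

    def write_strong(self, addr: str, now: int) -> int:
        # TC-Strong-like: the write completes only after every copy has expired.
        return max(now, self.lease_expiry.get(addr, 0))

    def write_weak(self, addr: str, now: int) -> tuple[int, int]:
        # TC-Weak-like: acknowledge at once, and report when the write
        # will have become visible to every core.
        return now, max(now, self.lease_expiry.get(addr, 0))


class Core:
    def __init__(self, l2: SharedL2) -> None:
        self.l2 = l2
        self.visible_after = 0  # latest visibility time over all earlier writes

    def store(self, addr: str, now: int) -> int:
        ack_time, visible_at = self.l2.write_weak(addr, now)
        self.visible_after = max(self.visible_after, visible_at)
        return ack_time

    def fence(self, now: int) -> int:
        # The fence, not each store, waits for the outstanding copies to expire.
        return max(now, self.visible_after)


l2 = SharedL2()
l2.grant_lease("A", now=0, lifetime=50)  # C2 holds A until t=50
l2.grant_lease("B", now=0, lifetime=80)  # C3 holds B until t=80

strong_a = l2.write_strong("A", now=10)  # stalls until C2's copy expires
strong_b = l2.write_strong("B", now=10)  # stalls until C3's copy expires

c1 = Core(l2)
weak_a = c1.store("A", now=10)           # acknowledged immediately
weak_b = c1.store("B", now=11)           # acknowledged immediately
fence_done = c1.fence(now=12)            # waits for the later of the two expiries
```

In this model the two TC-Strong writes each wait for their own sharer, whereas the TC-Weak core pays a single wait at the fence, and pays nothing if the copies have already expired by then.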

This paper makes the following contributions:
- It discusses the challenges of introducing existing coherence protocols to GPUs, and introduces two optimizations to a VI protocol [30] to make it more suitable for GPUs.
- It provides detailed complexity and performance evaluations of inclusive and non-inclusive directory protocols on a GPU.
- It describes Temporal Coherence, a GPU coherence framework for exploiting synchronous counters in single-chip systems to eliminate coherence traffic and protocol races.
- It proposes the TC-Weak coherence protocol, which employs timestamp-based memory fences to implement Release Consistency [19] on a GPU.
- It proposes a simple lifetime predictor for TC-Weak that performs well across a range of GPU applications.

TC-Weak with a simple lifetime predictor improves the performance of applications with inter-workgroup communication by 85% over the baseline non-coherent GPU.
It performs as well as the VI protocols and is 23% faster than MESI across all benchmarks.
For applications with intra-workgroup communication, it reduces the traffic overheads of MESI, GPU-VI and GPU-VIni by 56%, 23% and 22%, respectively, reducing interconnect energy usage by 40%, 12% and 12%.
Compared to TC-Strong, TC-Weak performs 28% faster with 26% lower interconnect traffic across all applications.

The rest of the paper is organized as follows.
Section 2 discusses related work.
Section 3 reviews GPU architectures and cache coherence.
Section 4 describes the directory protocols.
Section 5 describes the challenges of GPU coherence.
Section 6 details the implementations of TC-Strong and TC-Weak.
Sections 7 and 8 present our methodology and results.
Section 9 concludes.

2 Related Work

The use of timestamps has been explored in software coherence [42, 63].
Nandy [43] first considered timestamps for hardware coherence.
Library Cache Coherence (LCC) [34, 54] is a time-based hardware coherence proposal that stores timestamps in a directory and delays stores to unexpired blocks to enforce SC on CMPs.
TC-Strong is similar to LCC in that both enforce write atomicity by stalling writes at the shared last-level cache.
Unlike LCC, TC-Strong supports multiple outstanding writes from a core and implements an RC model.
TC-Strong also includes optimizations to eliminate stalls due to private writes and L2 evictions.
Even so, the stalling of writes in TC-Strong causes poor performance on a GPU.
We propose TC-Weak and a novel time-based memory fence to eliminate all write stalling, improve performance, and reduce interconnect traffic compared to TC-Strong.
Unlike for CPU applications [34, 54], the fixed timestamp prediction proposed by LCC is not suited for GPU applications.
We propose a simple yet effective lifetime predictor that can accommodate a range of GPU applications.
Lastly, we present a full description of our proposed protocol, including state transition tables that describe the implementation in detail.

3 Background

This section describes the memory system and cache hierarchy of the baseline non-coherent GPU that we evaluate in this paper, which is similar to NVIDIA's Fermi [44].
Cache coherence is also briefly discussed.

3.1 Baseline GPU Architecture

Figure 3 shows the organization of the baseline non-coherent GPU.
An OpenCL [29] or CUDA [46] application begins execution on a CPU and launches compute kernels onto a GPU.
Each kernel launches a hierarchy of threads (an NDRange of workgroups of wavefronts of work items/scalar threads) onto a GPU.
Each workgroup is assigned to a multi-threaded GPU core.
Scalar threads are managed as a SIMD execution group consisting of 32 threads, called a warp (NVIDIA terminology) or wavefront (AMD terminology).
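
The thread hierarchy above reduces to simple integer arithmetic, sketched here in Python. The helper names `warps_per_workgroup` and `locate`, the linear thread numbering, and the example sizes are assumptions for illustration; the group size of 32 is the one given in the text and is left as a parameter.

```python
WARP_SIZE = 32  # SIMD execution group size stated in the text


def warps_per_workgroup(workgroup_size: int, warp_size: int = WARP_SIZE) -> int:
    """Number of warps needed to hold one workgroup (rounded up)."""
    return -(-workgroup_size // warp_size)  # ceiling division on integers


def locate(global_id: int, workgroup_size: int, warp_size: int = WARP_SIZE):
    """Map a linear thread id to (workgroup, warp within workgroup, lane within warp)."""
    workgroup, local_id = divmod(global_id, workgroup_size)
    warp, lane = divmod(local_id, warp_size)
    return workgroup, warp, lane


full = warps_per_workgroup(256)      # 256 threads fill 8 warps exactly
partial = warps_per_workgroup(100)   # 100 threads need 4 warps, the last one partly empty
position = locate(300, workgroup_size=256)  # thread 300 sits in workgroup 1, warp 1, lane 12
```

A workgroup whose size is not a multiple of the group size still occupies whole warps, so the last warp runs with some lanes idle.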