Chapter 5 DLP and TLP
Overview
- SISD
- Single instruction stream single data stream
- uniprocessor
- Can exploit instruction-level parallelism
- SIMD
- Single instruction stream multiple data stream
- The same instruction is executed by multiple processors using different data streams
- Exploits data-level parallelism
- Each processor has its own data memory, while there is a single instruction memory and control processor
- MISD
- Multiple instruction streams single data stream
- Rare; no commercial processor of this type has been built
- MIMD
- Multiple instruction streams multiple data streams
- Each processor fetches its own instructions and operates on its own data.
- Exploits task-level parallelism
This chapter focuses on SIMD and MIMD.
5.1 SIMD: vector processor
vector processor
Definition - vector processor
- A pipelined processor that provides a vector data representation and a corresponding set of vector instructions is called a vector processor
- Differences from a scalar processor
- Vector elements are rarely dependent on one another
- Element-by-element scalar execution causes frequent switching between functional units
- Vector execution can maximize the performance of vector operations
- Vertical and horizontal processing methods
- Horizontal: process from left to right, element by element
- Vertical: process from top to bottom, operation by operation
- The two can be combined
Example: D = A × (B + C), where A, B, C, and D are all vectors of length N
Horizontal
- Each cycle: \(k_i=b_i+c_i,\ d_i=a_i\times k_i\)
- Data dependences: N; functional-unit switches: 2N
- RAW hazards occur and the functional units switch frequently, so efficiency is low
Vertical
- K = B + C, D = A × K
- Data dependences: 1; functional-unit switches: 1
Vertical and horizontal (grouping) processing method
- Split the N elements into S groups of n elements each, with remainder r (handled as one extra group)
- \(N = S \times n + r\)
- Data dependences: S+1; functional-unit switches: 2(S+1) (see the C sketch below)
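A minimal C sketch of the three processing orders for D = A × (B + C); the length N, the group size VLEN, and the array names are illustrative, and the comments restate the dependence and switching counts from above:

#include <stddef.h>

#define N    300  /* example vector length */
#define VLEN 64   /* group size, e.g. the CRAY-1 vector register length */

/* Horizontal: left to right, element by element. Every iteration has a
   RAW dependence on k and switches add -> multiply -> add ...:
   N dependences, 2N functional-unit switches. */
void horizontal(const double *a, const double *b, const double *c, double *d) {
    for (size_t i = 0; i < N; i++) {
        double k = b[i] + c[i];  /* add unit */
        d[i] = a[i] * k;         /* multiply unit, RAW on k */
    }
}

/* Vertical: top to bottom, operation by operation. K = B + C completes
   before D = A * K begins: 1 dependence, 1 switch, but K must hold all
   N intermediate elements. */
void vertical(const double *a, const double *b, const double *c,
              double *d, double *k) {
    for (size_t i = 0; i < N; i++) k[i] = b[i] + c[i];
    for (size_t i = 0; i < N; i++) d[i] = a[i] * k[i];
}

/* Grouped: N = S*VLEN + r; each group is processed vertically, so the
   counts become S+1 dependences and 2(S+1) switches, and the temporary
   fits in one vector register. */
void grouped(const double *a, const double *b, const double *c, double *d) {
    double k[VLEN];
    for (size_t base = 0; base < N; base += VLEN) {
        size_t len = (N - base < VLEN) ? N - base : VLEN;  /* last group: r */
        for (size_t i = 0; i < len; i++) k[i] = b[base + i] + c[base + i];
        for (size_t i = 0; i < len; i++) d[base + i] = a[base + i] * k[i];
    }
}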
A vector processor can use a memory-memory structure or a register-register structure
- STAR-100 and CYBER-205 both use memory-memory operation pipelines
CRAY-1
register-register-oriented vector pipeline processing machine
Example of a vector processor: CRAY-1
The CRAY-1 has 12 single-function pipelines that can work in parallel, performing address, scalar, and vector operations in pipelined fashion.
- Each vector register has 6 buses connecting it to the different functional units
- Instructions can execute in parallel as long as there is no Vi conflict and no functional conflict
Vi conflict: vector instructions working in parallel use the same vector register Vi as a source or result vector.
- Write-then-read dependence (V0 conflict):
- \(V_0 \leftarrow V_1 + V_2\)
- \(V_3 \leftarrow V_0 \times V_4\)
- Read-read conflict (both instructions read V1):
- \(V_0 \leftarrow V_1 + V_2\)
- \(V_3 \leftarrow V_1 \times V_4\)
Functional conflict: vector instructions working in parallel need the same functional unit.
Instruction types for CRAY-1
Improve performance
- Set up multiple functional components and make them work in parallel.
- The CRAY-1 vector processor has 12 single-function pipelined units organized into 4 groups
- Use link technology to speed up the execution of a string of vector instructions.
- Link feature: when one instruction writes a vector register that the next reads (write-then-read), and there is no functional-unit conflict or source-vector conflict, execution can be accelerated by linking (chaining)
- Key idea: pipelining across vector instructions
- Use strip mining (processing a long vector in sections) to speed up loop processing
- Use a multiprocessor system to further improve performance
Example: Use link technology to perform vector operations on CRAY-1: D=A×(B+C)
- Assume the vector length is N ≤ 64, the vector elements are floating-point numbers, and vectors B and C are already stored in V0 and V1.
- Draw a link diagram and compare the execution times with and without linking.
- Solution: use the following three vector instructions:
(1) V3 ← memory // load vector A
(2) V2 ← V0 + V1 // floating-point add of vectors B and C
(3) V4 ← V2 × V3 // floating-point multiply; result stored in V4
// (2) and (3) can be linked
Let the vector length be N
Each bracketed term below is 1 cycle to ship operands to the unit, plus the unit latency (6 for memory access and floating-point add, 7 for floating-point multiply), plus 1 cycle to write the result, plus N-1 cycles to stream the remaining elements:
- Serial execution: [(1+6+1)+N-1] + [(1+6+1)+N-1] + [(1+7+1)+N-1] = 3N + 22
- (1) and (2) in parallel, then (3): max{[(1+6+1)+N-1], [(1+6+1)+N-1]} + [(1+7+1)+N-1] = 2N + 15
- With linking: max{(1+6+1), (1+6+1)} + (1+7+1) + N - 1 = N + 16
Other machines
RV64V
- Loosely based on Cray-1
- 32 64-bit vector registers
- Register file has 16 read ports and 8 write ports
- Vector functional units
- Fully pipelined
- Data and control hazards are detected
- Vector load-store unit
- Fully pipelined
- One word per clock cycle after initial latency
- Scalar registers
- 31 general-purpose registers
- 32 floating-point registers
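The canonical loop for machines like RV64V is DAXPY (Y = a·X + Y). A plain C version is below; on a vector machine the loop maps to a handful of vector instructions per strip of up to 64 elements (vector loads, a scalar-vector multiply, a vector add, a vector store) rather than a full instruction sequence per element — this instruction mix is an assumption, not quoted from the RV64V definition:

#include <stddef.h>

/* DAXPY: Y = a*X + Y. A vector machine executes this in strips of up
   to the vector register length, amortizing instruction fetch over
   many elements. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}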
NVIDIA GPU and vector machines
GPUs are really just multithreaded SIMD Processors
- Similarities to vector machines
- Works well with data-level parallel problems
- Scatter-gather transfers
- Mask registers (mask off or select specific vector elements according to a given mask value)
- Large register files
- thread (private memory) -> block (local memory) -> grid (GPU memory)
- Differences
- No scalar processor
- Uses multithreading to hide memory latency
- Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Loop-Level Parallelism
Loop-level parallelism can be exploited with DLP and TLP, as well as with the more aggressive static ILP approaches (e.g., VLIW)
Key question: how to transform loops to expose this parallelism
Loop-carried dependence - whether data accesses in later iterations depend on data values produced in earlier iterations
- No loop-carried dependence: iterations can run in parallel
- Loop-carried dependence
- A loop that is not parallel as written can sometimes be transformed into a parallel one, as in the sketch below
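Two illustrative loops in the style of the renaming examples below (array names are hypothetical). In the first, the dependences form a cycle across iterations, so the loop cannot be parallelized; in the second, the loop-carried dependence is not part of a cycle, so the loop can be restructured:

/* Not parallel: iteration i+1 needs A[i+1] and B[i+1], both produced
   by iteration i - a dependence cycle across iterations. */
for (i = 0; i < 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

/* Loop-carried but transformable: S1 of iteration i+1 reads B[i+1],
   produced by S2 of iteration i; there is no cycle. */
for (i = 0; i < 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

/* After restructuring, no loop-carried dependence remains. */
A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];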
Kinds of dependence
- Name dependence - two instructions use the same register or memory name, but no data actually flows between them
- Antidependence - a later instruction writes a name that an earlier instruction reads; no data is actually passed
- True data dependence - a later instruction needs the result produced by an earlier one
Eliminating name dependences by renaming:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
// After renaming (T and P are fresh arrays)
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;  /* S1 */
    P[i] = X[i] + c;  /* S2 */
    Z[i] = T[i] + c;  /* S3 */
    Y[i] = c - T[i];  /* S4 */
}
Another example - the original loop:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] * B[i];
    B[i] = A[i] + c;
    A[i] = C[i] * c;
    C[i] = D[i] + A[i];
}
// After renaming (the results now live in T, Q, S, and R)
for (i=0; i<100; i=i+1) {
    T[i] = A[i] * B[i];
    Q[i] = T[i] + c;
    S[i] = C[i] * c;
    R[i] = D[i] + S[i];
}
5.2 SIMD: array processor
Definition - array processor (parallel processors)
- N processing elements, \(PE_0\) to \(PE_{N-1}\)
- Interconnected in a certain way to form an array
- Under a single control unit, the operations specified by one instruction are performed in parallel on the data allocated to the processing elements
Each processing element is assigned a block of memory; the memories can be attached in different ways
- Distributed memory
- N Memories <-> N Processors
- Centralized shared memory
- N Processors <-> ICN <-> K Memories
Structure of ICN
- Connecting all n processing units directly to each other requires n(n-1)/2 links, i.e. \(O(n^2)\) (too many)
- Use multistage connections instead
Parallel computer design
An interconnection network generally involves the following five parts
- CPU
- memory
- interface - connects one CPU/memory to another CPU/memory (network interface cards)
- link - the physical channel that transmits data bits
- switch node - the information-exchange and control station of the interconnection network
Classification of interconnection networks
- Topology: static vs. dynamic - connections are fixed vs. changeable by switches
- Stages: single-stage vs. multi-stage
Cube function
- \(Cube_i\) flips bit i of the n-bit node address; at most n applications are needed
PM2I function
- \(PM2I_{\pm i}(j) = (j \pm 2^i) \bmod N\); the farthest distance is \(\log_2 n\)
Shuffle function
- \(shuffle(P_2P_1P_0)=P_1P_0P_2\) - a cyclic left shift of the address bits
- After \(n\) shuffles (\(n = \log_2 N\)), an address returns to its original value
- From all 0s to all 1s: n exchanges and n-1 shuffles
- Maximum distance: 2n-1
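A minimal C sketch of these three single-stage functions for N = 8 nodes (the helper names and demo values are mine):

#include <stdio.h>

#define NBITS 3u                /* n = log2(N) */
#define NODES (1u << NBITS)     /* N = 8 nodes */

/* Cube_i: flip bit i of the node address. */
unsigned cube(unsigned p, unsigned i) { return p ^ (1u << i); }

/* PM2I: p + 2^i or p - 2^i, modulo N. */
unsigned pm2_plus(unsigned p, unsigned i)  { return (p + (1u << i)) % NODES; }
unsigned pm2_minus(unsigned p, unsigned i) { return (p + NODES - (1u << i)) % NODES; }

/* Perfect shuffle: cyclic left shift of the n address bits,
   shuffle(P2 P1 P0) = P1 P0 P2. */
unsigned shuffle(unsigned p) {
    return ((p << 1) | (p >> (NBITS - 1))) & (NODES - 1);
}

int main(void) {
    printf("cube_0(3)  = %u\n", cube(3, 0));      /* 011 -> 010 = 2 */
    printf("pm2_+1(3)  = %u\n", pm2_plus(3, 1));  /* 3 + 2 = 5      */
    printf("pm2_-1(3)  = %u\n", pm2_minus(3, 1)); /* 3 - 2 = 1      */
    printf("shuffle(3) = %u\n", shuffle(3));      /* 011 -> 110 = 6 */
    return 0;
}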
Network characteristics
(This does not seem to be a key point.)
buses, crossbar switches, and multi-stage interconnection networks
Advantages of SIMD
(This is worth copying down.)
- SIMD architectures can exploit significant data-level parallelism for:
- Matrix-oriented scientific computing
- Media-oriented image and sound processors
- SIMD is more energy efficient than MIMD
- Only needs to fetch one instruction per data operation
- Makes SIMD attractive for personal mobile devices
- SIMD allows the programmer to continue to think sequentially
5.3 MIMD
ILP -> TLP -> MIMD
MIMD architectures fall into two broad classes
- Multi-processor system - based on shared memory
- All CPUs share a single memory address space
- Managed by a single OS
- There is not necessarily only one physical memory module
- Multi-computer system - based on message passing
- Each node runs its own OS
Different memory access models
- Uniform Memory Access (UMA)
- Physical memory is uniformly shared by all processors
- It takes the same time for all processors to access any memory word
- Each processor can be equipped with private cache or private memory
- Based on Bus
- Non Uniform Memory Access (NUMA)
- All CPUs share a single uniform address space
- Use LOAD and STORE instructions to access remote memory
- Access to remote memory is slower than access to local memory
- The processor in the NUMA system can use cache
- Cache Only Memory Access (COMA)
- Use the distributed cache directory for remote cache access
Further division of MIMD multi-computer systems
- Massively Parallel Processors (MPP)
- Cluster of Workstations (COW)
Memory Consistency and Cache Coherence
Memory Consistency - Need Memory Consistency Model
- When a written value will be returned by a read
- If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
Cache Coherence - Need Cache Coherence Protocol
- All reads by any processor must return the most recently written value
- Writes to the same location by any two processors are seen in the same order by all processors
- With correct coherence, analyzing the results of loads and stores cannot tell you whether the system has caches or where they are. This is because correct coherence ensures that the caches never enable new or different functional behavior.
Multiple cores holding copies of the same data at the same time creates the cache coherence problem
Preventing different versions of the data from appearing in multiple caches is the job of the cache coherence protocol.
UMA (SMP) - Snoopy coherence protocol
UMA machines are also called symmetric (shared-memory) multiprocessors (SMP) or centralized shared-memory multiprocessors.
Snoopy coherence protocol:
- Every cache snoops the reads and writes that other processors issue on the shared bus
- Whenever a processor modifies shared data (e.g., writes it), the operation is broadcast on the system bus
- Other processors update (or invalidate) their own cached copies based on what they observe
Write-through Cache Coherency Protocol
- write-through
- write-back
Write Invalidation Protocol
three block states (MSI protocol)
- Invalid
- shared - other caches may also hold this block
- modified
- indicates that the block has been updated in the private cache;
- implies that the block is exclusive
MESI - adds an exclusive state:
- Invalid
- shared
- exclusive - indicates that the cache block is resident only in this single cache and is clean (no other cache holds it, and the copy in memory is up to date)
- modified
- exclusive->read by others->shared
- exclusive->write->modified
MESI protocol state transition rules
MESI protocol working process
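The transition rules are normally given as a state diagram; as a rough substitute, here is a minimal C sketch of a next-state function for one block in one cache, simplified to ignore bus arbitration and write-back timing (the event names are mine, not from any specification):

/* MESI next-state sketch for a single cache block in one cache. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

typedef enum {
    CPU_READ, CPU_WRITE,   /* requests from the local CPU      */
    BUS_READ, BUS_WRITE    /* requests snooped from other CPUs */
} event_t;

/* other_copies: whether the snoop response shows that another cache
   also holds the block (chooses SHARED vs EXCLUSIVE on a read miss). */
mesi_t mesi_next(mesi_t s, event_t e, int other_copies) {
    switch (e) {
    case CPU_READ:   /* a read hit keeps the state; a miss fills S or E */
        return (s == INVALID) ? (other_copies ? SHARED : EXCLUSIVE) : s;
    case CPU_WRITE:  /* other copies are invalidated first if S or I */
        return MODIFIED;
    case BUS_READ:   /* another CPU reads: M supplies data; M/E/S -> S */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE:  /* another CPU writes: the local copy is stale */
        return INVALID;
    }
    return s;
}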
MOESI
- owned: indicates that the associated block is owned by that cache and out-of-date in memory
- Modified -> Owned without writing the shared block to memory
NUMA (DSM) - Directory protocol
NUMA machines are called distributed shared-memory multiprocessors (DSM)
Directory protocol:
- A directory tracks the state of every cache block and which processors have cached it
- Whenever a processor needs to modify shared data, it sends invalidate signals to the other processors caching that block through the directory, in a point-to-point way
- Those other cached copies then become invalid
Three directory states
- Shared
- One or more nodes have the block cached, value in memory is up-to-date
- Set of node IDs
- Invalid
- Modified
- Exactly one node has a copy of the cache block, value in memory is out-of-date
- Owner node ID
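A minimal C sketch of what a directory entry and the point-to-point invalidation step could look like (all field and function names are illustrative, not from a real machine):

#include <stdint.h>

#define MAX_NODES 64

typedef enum { DIR_INVALID, DIR_SHARED, DIR_MODIFIED } dir_state_t;

/* One directory entry per memory block, kept at the block's home node. */
typedef struct {
    dir_state_t state;
    uint64_t    sharers;  /* bit i set => node i caches the block */
    int         owner;    /* meaningful only in DIR_MODIFIED      */
} dir_entry_t;

/* On a write request the home node invalidates every sharer
   point-to-point, then records the writer as exclusive owner. */
void handle_write(dir_entry_t *e, int writer,
                  void (*send_invalidate)(int node)) {
    for (int node = 0; node < MAX_NODES; node++)
        if (((e->sharers >> node) & 1u) && node != writer)
            send_invalidate(node);
    e->state   = DIR_MODIFIED;
    e->sharers = 1ull << writer;
    e->owner   = writer;
}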
Cache Coherence Protocols
(Being able to read and explain the state diagrams is enough; memorizing them is not required.)
- Memory Consistency
- defines the behavior of reads and writes with respect to accesses to other memory locations
- Need Memory Consistency Model
- Cache Coherence
- defines the behavior of reads and writes to the same memory location
- Need Cache Coherence Protocol
Relaxed Consistency Model
- Reads and writes may proceed out of order; synchronization operations preserve correctness
- Rules:
- X → Y
- Operation X must complete before operation Y is done
- Sequential consistency requires: R → W, R → R, W → R, W → W
- Relax W → R
- “Total store ordering”
- Relax W → W
- “Partial store order”
- Relax R → W and R → R
- “Weak ordering” and “release consistency”
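A minimal C11 sketch of the usual contract under relaxed models: data is published with a release store and consumed with an acquire load, so the reader cannot see the flag without also seeing the data, whichever of the orderings above the hardware relaxes (the flag/data names are mine):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data;         /* ordinary shared variable */
atomic_int flag;  /* synchronization variable */

void *producer(void *arg) {
    (void)arg;
    data = 42;                                             /* plain write */
    atomic_store_explicit(&flag, 1, memory_order_release); /* publish     */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                       /* spin until the flag is seen */
    printf("%d\n", data);       /* guaranteed to print 42      */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}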
MPP
- Massive hardware parallelism
COW
- Built from commodity, off-the-shelf workstations and networks