Chapter 5 DLP and TLP
Overview
- SISD
 - Single instruction stream, single data stream
 - Uniprocessor
 - Can exploit instruction-level parallelism
- SIMD
 - Single instruction stream, multiple data streams
 - The same instruction is executed by multiple processors using different data streams
 - Exploits data-level parallelism
 - Each processor has its own data memory, but there is a single instruction memory and control processor
- MISD
 - Multiple instruction streams, single data stream
 - Rare; no commercial processor of this type has been built
- MIMD
 - Multiple instruction streams, multiple data streams
 - Each processor fetches its own instructions and operates on its own data
 - Exploits task-level parallelism
 
 
This chapter focuses on SIMD and MIMD.

5.1 SIMD: vector processor
Vector processor
Definition - vector processor
- A pipelined processor that provides a vector data representation and corresponding vector instructions is called a vector processor.
 - Difference from a scalar processor:
- Vector elements are rarely dependent on one another; element-by-element scalar processing of such code causes frequent functional-unit switching
 - Operating on whole vectors avoids this and can maximize the performance of vector operations
 
- Vertical and horizontal processing methods
 - Horizontal: process left to right, element by element
 - Vertical: process top to bottom, operation by operation
 - The two can be combined (grouped processing)
 
 
Example: D = A × (B + C), where A, B, C, D are all vectors of length N (a C sketch of the first two orders follows below)
Horizontal
- Each cycle: \(k_i = b_i + c_i,\ d_i = a_i \times k_i\)
 - Data related: N times; function switching: 2N times
 - RAW hazards occur and the functional units switch frequently, so efficiency is low
Vertical
- K = B + C, then D = A × K
 - Data related: 1; function switching: 1
Vertical and horizontal (grouping) processing method
- Divide the N elements into S groups of n elements each, with remainder r: \(N = S \times n + r\)
 - Data related: S + 1; function switching: 2(S + 1)
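To make the two orders concrete, here is a minimal C sketch for D = A × (B + C); the function names and the value of N are illustrative:

#define N 1024

/* Horizontal order: each iteration issues an add immediately followed by a
   dependent multiply, so a RAW hazard occurs N times and the functional
   unit switches 2N times. */
void horizontal(const double *a, const double *b, const double *c, double *d) {
    for (int i = 0; i < N; i++) {
        double k = b[i] + c[i];   /* k_i = b_i + c_i */
        d[i] = a[i] * k;          /* d_i = a_i * k_i, depends on k_i */
    }
}

/* Vertical order: one full pass of adds, then one full pass of multiplies,
   giving one data-dependence point and one functional-unit switch. */
void vertical(const double *a, const double *b, const double *c,
              double *k, double *d) {
    for (int i = 0; i < N; i++) k[i] = b[i] + c[i];   /* K = B + C */
    for (int i = 0; i < N; i++) d[i] = a[i] * k[i];   /* D = A * K */
}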
 
A vector processor adopts either a memory-memory structure or a register-register structure.

- STAR-100 and CYBER-205 both use memory-memory operation pipelines
 
CRAY-1
A register-register-oriented vector pipelined processing machine
Example vector processor - CRAY-1
It has 12 single-function pipelines that can work in parallel, performing address, scalar, and vector operations in pipelined fashion.
- Each vector register has 6 buses connecting it to the functional units
 - Instructions can execute in parallel as long as there is no Vi conflict and no functional conflict
 
Vi conflict: vector instructions working in parallel use the same vector register Vi as a source or result.
- Write-then-read data dependence (V0 conflict):
 - \(V_0 \leftarrow V_1 + V_2\)
 - \(V_3 \leftarrow V_0 \times V_4\)
- Read-read dependence (both read V1):
 - \(V_0 \leftarrow V_1 + V_2\)
 - \(V_3 \leftarrow V_1 \times V_4\)
 
 
Functional conflict: vector instructions working in parallel need the same functional unit.
Instruction types for CRAY-1

Improve performance
- Set up multiple functional units and make them work in parallel.
 - The CRAY-1 vector processor has 12 single-function pipelined units organized into 4 groups
- Use link (chaining) technology to speed up the execution of a string of vector instructions.
 - Link feature: when a write-then-read dependence exists and there are no functional-unit or source-vector conflicts, execution can be accelerated by linking
 - Key idea: pipelining across vector instructions
- Adopt strip mining (loop segmentation) to speed up the processing of loops longer than the vector registers; see the sketch after this list.
- Use a multiprocessor system to further improve performance
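A minimal strip-mining sketch in C, assuming a maximum vector length of 64 as on CRAY-1 (function and variable names are illustrative): a loop of arbitrary length n is cut into segments that each fit in a vector register.

#define MVL 64   /* maximum vector length of the machine */

void strip_mined_add(const double *b, const double *c, double *k, long n) {
    long low = 0;
    long len = n % MVL;              /* odd-sized first strip */
    while (low < n) {
        for (long i = low; i < low + len; i++)
            k[i] = b[i] + c[i];      /* one vector instruction's worth */
        low += len;
        len = MVL;                   /* all later strips are full length */
    }
}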
 
Example: use link technology to perform the vector operation D = A × (B + C) on CRAY-1
- Assume the vector length is N ≤ 64, the vector elements are floating-point numbers, and vectors B and C are already stored in V0 and V1.
- Draw a link diagram and analyze the execution times for non-linked and linked execution.
- Solution: use the following three vector instructions to perform the operation:

V3 ← memory    // (1) load vector A from memory
V2 ← V0 + V1   // (2) floating-point add of vectors B and C
V4 ← V2 × V3   // (3) floating-point multiply; the result is stored in V4
// (1), (2), and (3) are linkable

Let the vector length be N. Each term (1 + k + 1) below counts one cycle to route operands to the functional unit, k pipeline stages (6 for memory access and floating add, 7 for floating multiply), and one cycle to write the result back; the remaining N − 1 results then emerge one per cycle.
- Serial method: [(1+6+1)+N-1] + [(1+6+1)+N-1] + [(1+7+1)+N-1] = 3N + 22
- Execute (1) and (2) in parallel, then (3): max{[(1+6+1)+N-1], [(1+6+1)+N-1]} + [(1+7+1)+N-1] = 2N + 15
- Link technology: max{(1+6+1), (1+6+1)} + (1+7+1) + N - 1 = N + 16
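The three formulas can be checked mechanically; a small C sketch (pipeline depths taken directly from the example above) prints all three times for a given N:

#include <stdio.h>

int main(void) {
    int N = 64;                       /* vector length, N <= 64 */
    int load = 1 + 6 + 1;             /* memory access pipeline */
    int add  = 1 + 6 + 1;             /* floating add pipeline  */
    int mul  = 1 + 7 + 1;             /* floating multiply pipeline */
    int par  = load > add ? load : add;
    int serial   = (load + N - 1) + (add + N - 1) + (mul + N - 1); /* 3N + 22 */
    int parallel = (par + N - 1) + (mul + N - 1);                  /* 2N + 15 */
    int linked   = par + mul + N - 1;                              /* N + 16  */
    printf("serial=%d parallel=%d linked=%d\n", serial, parallel, linked);
    return 0;
}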
 

Other machines
RV64V
- Loosely based on the Cray-1
 - 32 64-bit vector registers
 - Register file has 16 read ports and 8 write ports
- Vector functional units
 - Fully pipelined
 - Data and control hazards are detected
- Vector load-store unit
 - Fully pipelined
 - One word per clock cycle after the initial latency
- Scalar registers
 - 31 general-purpose registers
 - 32 floating-point registers
 
 

NVIDIA GPUs and vector machines
GPUs are really just multithreaded SIMD processors
- Similarities to vector machines
 - Work well on data-level parallel problems
 - Scatter-gather transfers
 - Mask registers (mask off or select specific lanes according to a given mask value); see the sketch after this list
 - Large register files
 - Memory hierarchy: thread (private memory) -> block (local memory) -> grid (GPU memory)
- Differences
 - No scalar processor
 - Uses multithreading to hide memory latency
 - Has many functional units, as opposed to a few deeply pipelined units like a vector processor
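To make the mask-register idea concrete, here is a scalar C loop whose body is predicated per element; on a vector machine or GPU the whole loop maps to one masked SIMD operation with no per-element branch (all names are illustrative):

void masked_scale(const int *mask, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        if (mask[i])              /* one mask-register bit per lane */
            y[i] = 2.0f * x[i];   /* executed only in unmasked lanes */
}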
 
 
Loop-Level Parallelism
Can exploit DLP and TLP, as well as the more aggressive static ILP approaches (e.g., VLIW)
Key question: how to transform loops to expose the parallelism
Loop-carried dependence: whether data accesses in later iterations depend on data values produced in earlier iterations
- No loop-carried dependence
- Loop-carried dependence
 - A loop that is not parallel as written can sometimes be transformed into one that is (see the example below)
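The classic textbook example of a loop-carried dependence that can be transformed away: S1 uses the B[i] that S2 produced in the previous iteration.

for (i = 0; i < 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
// Transformed: peel the first S1 and the last S2; the new loop body has
// no dependence between iterations and can run in parallel.
A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];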
 
Dependence types
- Name dependence: two instructions merely use the same name
 - Anti-dependence: no data actually flows from one instruction to the other
- Data (true) dependence: a later instruction needs the result of an earlier one
 
Removing dependences by renaming:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
// After renaming
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;  /* S1 */
    P[i] = X[i] + c;  /* S2 */
    Z[i] = T[i] + c;  /* S3 */
    Y[i] = c - T[i];  /* S4 */
}
Another example, before and after renaming:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] * B[i];
    B[i] = A[i] + c;
    A[i] = C[i] * c;
    C[i] = D[i] + A[i];
}
for (i=0; i<100; i=i+1) {
    T[i] = A[i] * B[i];
    Q[i] = T[i] + c;
    S[i] = C[i] * c;
    R[i] = D[i] + S[i];
}
5.2 SIMD: array processor
Definition - array processor (parallel processor)
- N processing elements \(PE_0\) to \(PE_{N-1}\)
 - Interconnected in a certain way to form an array
 - A single control unit: the operations specified by one instruction are completed in parallel by the processing elements, each on the data allocated to it
 
Each processor is given a block of memory; the memories can be connected in different ways:

- Distributed memory
 - N memories <-> N processors (each PE has its own local memory)
- Centralized shared memory
 - N processors <-> ICN <-> K memories (shared through an interconnection network)
 
Structure of the ICN
- Directly connecting n processing units pairwise needs n(n-1)/2 links, i.e. \(O(n^2)\) (too many)
 - Use multi-stage connections instead
 
Parallel computer design
The interconnection structure generally consists of the following five parts:
- CPU
- Memory
- Interface: connects a CPU/memory to another CPU/memory (network interface cards)
- Link: the physical channel that transmits data bits
- Switch node: the information exchange and control station of the interconnection network
 
Classification of interconnection networks
- By topology: static vs. dynamic (connections are fixed vs. changeable by switches)
- By number of stages: single-stage vs. multi-stage

 
Cube function
- \(\mathrm{Cube}_i(P_{n-1}\cdots P_{i+1}P_iP_{i-1}\cdots P_0) = P_{n-1}\cdots P_{i+1}\bar{P}_iP_{i-1}\cdots P_0\): complement address bit i
- At most n applications are needed (n = \(\log_2 N\) address bits)
 
PM2I function
- \(PM2_{\pm i}(j) = (j \pm 2^i) \bmod N\), for \(0 \le i < \log_2 N\)
- Maximum distance: \(\log_2 N\)
 
Shuffle function
- Cyclic left shift of the address bits: \(\mathrm{shuffle}(P_{n-1}P_{n-2}\cdots P_0) = P_{n-2}\cdots P_0P_{n-1}\); e.g. \(\mathrm{shuffle}(P_2P_1P_0) = P_1P_0P_2\)
 - After n shuffles (n = \(\log_2 N\)) an address returns to its original value
 - All 0s to all 1s: n exchanges and n-1 shuffles
 - Maximum distance: 2n-1
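The three routing functions are easy to state in C; a minimal sketch for an n-bit PE address (N = 2^n; all names are illustrative):

/* Cube_i: complement address bit i. */
unsigned cube(unsigned p, unsigned i) {
    return p ^ (1u << i);
}

/* PM2I: move plus or minus 2^i modulo N. */
unsigned pm2_plus(unsigned p, unsigned i, unsigned N)  { return (p + (1u << i)) % N; }
unsigned pm2_minus(unsigned p, unsigned i, unsigned N) { return (p + N - (1u << i)) % N; }

/* Perfect shuffle: cyclic left shift of the n address bits,
   e.g. shuffle(P2 P1 P0) = P1 P0 P2 for n = 3. */
unsigned shuffle(unsigned p, unsigned n) {
    unsigned msb = (p >> (n - 1)) & 1u;
    return ((p << 1) | msb) & ((1u << n) - 1u);
}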
 

Network characteristics
(This does not seem to be a key point.)
Buses, crossbar switches, and multi-stage interconnection networks

Advantages of SIMD
(Worth writing down.)
- SIMD architectures can exploit significant data-level parallelism for:
 - Matrix-oriented scientific computing
 - Media-oriented image and sound processing
- SIMD is more energy-efficient than MIMD
 - Only needs to fetch one instruction per data operation
 - Makes SIMD attractive for personal mobile devices
- SIMD allows the programmer to continue to think sequentially
 
5.3 MIMD

ILP -> TLP -> MIMD
MIMD architectures fall into two broad classes
- Multiprocessor system - based on shared memory
 - All CPUs share a single memory address space
 - All are managed by one operating system
 - There is not necessarily only one physical memory
- Multicomputer system - based on message passing
 - Each node runs its own operating system
 
 
Different memory access models
- Uniform Memory Access (UMA)
 - Physical memory is uniformly shared by all processors
 - It takes the same time for any processor to access any memory word
 - Each processor can be equipped with a private cache or private memory
 - Typically bus-based
- Non-Uniform Memory Access (NUMA)
 - All CPUs share one uniform address space
 - Remote memory is accessed with ordinary LOAD and STORE instructions
 - Access to remote memory is slower than access to local memory
 - Processors in a NUMA system can use caches
- Cache-Only Memory Access (COMA)
 - Uses a distributed cache directory for remote cache access
 
 
Further division of MIMD multicomputer systems
- Massively Parallel Processors (MPP)
- Cluster of Workstations (COW)
 
Memory Consistency and Cache Coherence
Memory consistency - needs a memory consistency model
- Determines when a written value will be returned by a read
- If a processor writes location A and then writes location B, any processor that sees the new value of B must also see the new value of A (a C sketch of this idiom follows)
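The A-then-B property above is exactly the flag/data idiom, sketched here in plain C (illustrative and deliberately unsynchronized; real code needs atomics or fences, as discussed under the relaxed consistency model later in this section):

int data = 0, flag = 0;

void producer(void) {    /* runs on processor P1 */
    data = 42;           /* write A */
    flag = 1;            /* write B */
}

void consumer(void) {    /* runs on processor P2 */
    while (flag == 0)
        ;                /* saw the new value of B ... */
    /* ... so the consistency model must guarantee data == 42 here */
}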
 
Cache coherence - needs a cache coherence protocol
- All reads by any processor must return the most recently written value
- Writes to the same location by any two processors are seen in the same order by all processors
- By analyzing the results of loads and stores, one cannot tell whether the system has caches or where they are; this is because correct coherence ensures that the caches never enable new or different functional behavior
 
Multiple cores holding copies of the same data at the same time creates the cache coherence problem.
Preventing different versions of the data from appearing in multiple caches is the job of the cache coherence protocol.
UMA (SMP) - Snoopy coherence protocol
UMA machines are also called symmetric (shared-memory) multiprocessors (SMP) or centralized shared-memory multiprocessors.

Snoopy coherence protocol:
- Every cache snoops the reads and writes that other processors perform on the shared bus
- Whenever a processor modifies shared data (e.g., writes it), it broadcasts the operation on the system bus
- The other processors update their own caches according to the snooped operations
 
Write-through cache coherency protocol
- Write-through: every write goes out to memory, so snooping caches see it directly
- Write-back: writes stay in the cache until the block is replaced, so ownership must be tracked
 

Write Invalidation Protocol
Three block states (MSI protocol); a transition sketch follows this list
- Invalid
- Shared: other caches may also hold this block
- Modified
 - Indicates that the block has been updated in the private cache
 - Implies that the block is exclusive
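A minimal sketch of the MSI transitions for one block, seen from a single snooping cache (illustrative; a real protocol also issues the bus transactions and write-backs noted in the comments):

typedef enum { INVALID, SHARED, MODIFIED } msi_state;

/* Events from this processor: */
msi_state on_cpu_read(msi_state s)  { return s == INVALID ? SHARED : s; }  /* read miss fetches a shared copy */
msi_state on_cpu_write(msi_state s) { return MODIFIED; }  /* broadcast an invalidate, then own the block */

/* Events snooped from other processors on the bus: */
msi_state on_bus_read(msi_state s)  { return s == MODIFIED ? SHARED : s; } /* supply the dirty data, demote */
msi_state on_bus_write(msi_state s) { return INVALID; }   /* another cache's write invalidates our copy */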
 
 

MESI adds an Exclusive state:
- Invalid
- Shared
- Exclusive: the cache block resides only in this one cache and is clean (no other cache holds it, and the copy in memory is up to date)
- Modified
- Exclusive -> read by others -> Shared
- Exclusive -> write -> Modified
 
MESI protocol state transition rules
MESI protocol working process

MOESI
- Owned: indicates that this cache owns the block and the copy in memory is out of date
- Modified -> Owned without writing the shared block back to memory
 
NUMA (DSM) - Directory protocol
NUMA machines are called distributed shared-memory (DSM) multiprocessors

Directory protocol:
- A directory tracks the state of every cache block and which processors have cached it
- Whenever a processor needs to modify shared data, it sends invalidate signals, through the directory, point-to-point to the other processors caching that block
- Those cached copies then become invalid
 
Three directory states (a sketch of a directory entry follows this list)
- Shared
 - One or more nodes have the block cached; the value in memory is up to date
 - The directory records the set of node IDs
- Invalid
- Modified
 - Exactly one node has a copy of the cache block; the value in memory is out of date
 - The directory records the owner node ID
 

Cache Coherence Protocols
(It is enough to be able to read and explain the state diagrams; memorizing them is not required.)
- Memory consistency
 - Defines the behavior of reads and writes with respect to accesses to other memory locations
 - Needs a memory consistency model
- Cache coherence
 - Defines the behavior of reads and writes to the same memory location
 - Needs a cache coherence protocol
 
 
Relaxed Consistency Model
- Reads and writes may complete out of order; synchronization operations preserve correctness (a C11-atomics sketch follows this list)
- Rules:
 - X → Y: operation X must complete before operation Y is done
 - Sequential consistency requires: R → W, R → R, W → R, W → W
- Relax W → R
 - "Total store ordering"
- Relax W → W
 - "Partial store ordering"
- Relax R → W and R → R
 - "Weak ordering" and "release consistency"
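As one concrete instance of relaxed ordering plus synchronization, here is the flag/data idiom again, written with C11 atomics (a sketch; the release/acquire pairing corresponds to the release-consistency idea above):

#include <stdatomic.h>

int data;                    /* ordinary data, ordered by the flag */
atomic_int flag;

void producer(void) {
    data = 42;
    /* release: all earlier memory operations complete before this store */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void) {
    /* acquire: no later memory operation moves before this load */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* data == 42 is guaranteed here, even on weakly ordered hardware */
}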
 
 
 
MPP
- Massive hardware parallelism
 
COW
