Chapter 5 DLP and TLP
Overview
- SISD
 - Single instruction stream, single data stream
 - Uniprocessor
 - Can exploit instruction-level parallelism
- SIMD
 - Single instruction stream, multiple data streams
 - The same instruction is executed by multiple processors using different data streams
 - Exploits data-level parallelism
 - Each processor has its own data memory, but there is a single instruction memory and control processor
- MISD
 - Multiple instruction streams, single data stream
 - Rare; no commercial processor of this type has been built
- MIMD
 - Multiple instruction streams, multiple data streams
 - Each processor fetches its own instructions and operates on its own data
 - Exploits task-level parallelism
 
 
This chapter focuses on SIMD and MIMD.

5.1 SIMD: vector processor
Vector processor
Definition - vector processor
- A pipelined processor that provides a vector data representation and corresponding vector instructions is called a vector processor.
 - Difference from a scalar processor:
- Vector elements are rarely dependent on one another; element-by-element scalar processing of such code causes frequent functional-unit switching
 - Operating on whole vectors avoids this and can maximize the performance of vector operations
 
- Vertical and horizontal processing methods
 - Horizontal: process left to right, element by element
 - Vertical: process top to bottom, operation by operation
 - The two can be combined (grouped processing)
 
 
Example: D = A × (B + C), where A, B, C, D are all vectors of length N (a C sketch of the first two orders follows below)
Horizontal
- Each cycle: \(k_i = b_i + c_i,\ d_i = a_i \times k_i\)
 - Data related: N times; function switching: 2N times
 - RAW hazards occur and the functional units switch frequently, so efficiency is low
Vertical
- K = B + C, then D = A × K
 - Data related: 1; function switching: 1
Vertical and horizontal (grouping) processing method
- Divide the N elements into S groups of n elements each, with remainder r: \(N = S \times n + r\)
 - Data related: S + 1; function switching: 2(S + 1)
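To make the two orders concrete, here is a minimal C sketch for D = A × (B + C); the function names and the value of N are illustrative:

#define N 1024

/* Horizontal order: each iteration issues an add immediately followed by a
   dependent multiply, so a RAW hazard occurs N times and the functional
   unit switches 2N times. */
void horizontal(const double *a, const double *b, const double *c, double *d) {
    for (int i = 0; i < N; i++) {
        double k = b[i] + c[i];   /* k_i = b_i + c_i */
        d[i] = a[i] * k;          /* d_i = a_i * k_i, depends on k_i */
    }
}

/* Vertical order: one full pass of adds, then one full pass of multiplies,
   giving one data-dependence point and one functional-unit switch. */
void vertical(const double *a, const double *b, const double *c,
              double *k, double *d) {
    for (int i = 0; i < N; i++) k[i] = b[i] + c[i];   /* K = B + C */
    for (int i = 0; i < N; i++) d[i] = a[i] * k[i];   /* D = A * K */
}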
 
A vector processor adopts either a memory-memory structure or a register-register structure.

- STAR-100 and CYBER-205 both use memory-memory operation pipelines
 
CRAY-1
A register-register-oriented vector pipelined processing machine
Example vector processor - CRAY-1
It has 12 single-function pipelines that can work in parallel, performing address, scalar, and vector operations in pipelined fashion.
- Each vector register has 6 buses connecting it to the functional units
 - Instructions can execute in parallel as long as there is no Vi conflict and no functional conflict
 
Vi conflict: vector instructions working in parallel use the same vector register Vi as a source or result.
- Write-then-read data dependence (V0 conflict):
 - \(V_0 \leftarrow V_1 + V_2\)
 - \(V_3 \leftarrow V_0 \times V_4\)
- Read-read dependence (both read V1):
 - \(V_0 \leftarrow V_1 + V_2\)
 - \(V_3 \leftarrow V_1 \times V_4\)
 
 
Functional conflict: vector instructions working in parallel need the same functional unit.
Instruction types for CRAY-1

Improve performance
- Set up multiple functional units and make them work in parallel.
 - The CRAY-1 vector processor has 12 single-function pipelined units organized into 4 groups
- Use link (chaining) technology to speed up the execution of a string of vector instructions.
 - Link feature: when a write-then-read dependence exists and there are no functional-unit or source-vector conflicts, execution can be accelerated by linking
 - Key idea: pipelining across vector instructions
- Adopt strip mining (loop segmentation) to speed up the processing of loops longer than the vector registers; see the sketch after this list.
- Use a multiprocessor system to further improve performance
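A minimal strip-mining sketch in C, assuming a maximum vector length of 64 as on CRAY-1 (function and variable names are illustrative): a loop of arbitrary length n is cut into segments that each fit in a vector register.

#define MVL 64   /* maximum vector length of the machine */

void strip_mined_add(const double *b, const double *c, double *k, long n) {
    long low = 0;
    long len = n % MVL;              /* odd-sized first strip */
    while (low < n) {
        for (long i = low; i < low + len; i++)
            k[i] = b[i] + c[i];      /* one vector instruction's worth */
        low += len;
        len = MVL;                   /* all later strips are full length */
    }
}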
 
Example: use link technology to perform the vector operation D = A × (B + C) on CRAY-1
- Assume the vector length is N ≤ 64, the vector elements are floating-point numbers, and vectors B and C are already stored in V0 and V1.
- Draw a link diagram and analyze the execution times for non-linked and linked execution.
- Solution: use the following three vector instructions to perform the operation:

V3 ← memory    // (1) load vector A from memory
V2 ← V0 + V1   // (2) floating-point add of vectors B and C
V4 ← V2 × V3   // (3) floating-point multiply; the result is stored in V4
// (1), (2), and (3) are linkable

Let the vector length be N. Each term (1 + k + 1) below counts one cycle to route operands to the functional unit, k pipeline stages (6 for memory access and floating add, 7 for floating multiply), and one cycle to write the result back; the remaining N − 1 results then emerge one per cycle.
- Serial method: [(1+6+1)+N-1] + [(1+6+1)+N-1] + [(1+7+1)+N-1] = 3N + 22
- Execute (1) and (2) in parallel, then (3): max{[(1+6+1)+N-1], [(1+6+1)+N-1]} + [(1+7+1)+N-1] = 2N + 15
- Link technology: max{(1+6+1), (1+6+1)} + (1+7+1) + N - 1 = N + 16
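The three formulas can be checked mechanically; a small C sketch (pipeline depths taken directly from the example above) prints all three times for a given N:

#include <stdio.h>

int main(void) {
    int N = 64;                       /* vector length, N <= 64 */
    int load = 1 + 6 + 1;             /* memory access pipeline */
    int add  = 1 + 6 + 1;             /* floating add pipeline  */
    int mul  = 1 + 7 + 1;             /* floating multiply pipeline */
    int par  = load > add ? load : add;
    int serial   = (load + N - 1) + (add + N - 1) + (mul + N - 1); /* 3N + 22 */
    int parallel = (par + N - 1) + (mul + N - 1);                  /* 2N + 15 */
    int linked   = par + mul + N - 1;                              /* N + 16  */
    printf("serial=%d parallel=%d linked=%d\n", serial, parallel, linked);
    return 0;
}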
 

Other machines
RV64V
- Loosely based on the Cray-1
 - 32 64-bit vector registers
 - Register file has 16 read ports and 8 write ports
- Vector functional units
 - Fully pipelined
 - Data and control hazards are detected
- Vector load-store unit
 - Fully pipelined
 - One word per clock cycle after the initial latency
- Scalar registers
 - 31 general-purpose registers
 - 32 floating-point registers
 
 

NVIDIA GPUs and vector machines
GPUs are really just multithreaded SIMD processors
- Similarities to vector machines
 - Work well on data-level parallel problems
 - Scatter-gather transfers
 - Mask registers (mask off or select specific lanes according to a given mask value); see the sketch after this list
 - Large register files
 - Memory hierarchy: thread (private memory) -> block (local memory) -> grid (GPU memory)
- Differences
 - No scalar processor
 - Uses multithreading to hide memory latency
 - Has many functional units, as opposed to a few deeply pipelined units like a vector processor
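To make the mask-register idea concrete, here is a scalar C loop whose body is predicated per element; on a vector machine or GPU the whole loop maps to one masked SIMD operation with no per-element branch (all names are illustrative):

void masked_scale(const int *mask, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        if (mask[i])              /* one mask-register bit per lane */
            y[i] = 2.0f * x[i];   /* executed only in unmasked lanes */
}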
 
 
Loop-Level Parallelism
Can exploit DLP and TLP, as well as the more aggressive static ILP approaches (e.g., VLIW)
Key question: how to transform loops to expose the parallelism
Loop-carried dependence: whether data accesses in later iterations depend on data values produced in earlier iterations
- No loop-carried dependence
- Loop-carried dependence
 - A loop that is not parallel as written can sometimes be transformed into one that is (see the example below)
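The classic textbook example of a loop-carried dependence that can be transformed away: S1 uses the B[i] that S2 produced in the previous iteration.

for (i = 0; i < 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
// Transformed: peel the first S1 and the last S2; the new loop body has
// no dependence between iterations and can run in parallel.
A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];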
 
Dependence types
- Name dependence: two instructions merely use the same name
 - Anti-dependence: no data actually flows from one instruction to the other
- Data (true) dependence: a later instruction needs the result of an earlier one
 
Removing dependences by renaming:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
// After renaming
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;  /* S1 */
    P[i] = X[i] + c;  /* S2 */
    Z[i] = T[i] + c;  /* S3 */
    Y[i] = c - T[i];  /* S4 */
}
Another example, before and after renaming:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] * B[i];
    B[i] = A[i] + c;
    A[i] = C[i] * c;
    C[i] = D[i] + A[i];
}
for (i=0; i<100; i=i+1) {
    T[i] = A[i] * B[i];
    Q[i] = T[i] + c;
    S[i] = C[i] * c;
    R[i] = D[i] + S[i];
}
5.2 SIMD: array processor
Definition - array processor (parallel processor)
- N processing elements \(PE_0\) to \(PE_{N-1}\)
 - Interconnected in a certain way to form an array
 - A single control unit: the operations specified by one instruction are completed in parallel by the processing elements, each on the data allocated to it
 
Each processor is given a block of memory; the memories can be connected in different ways:

- Distributed memory
 - N memories <-> N processors (each PE has its own local memory)
- Centralized shared memory
 - N processors <-> ICN <-> K memories (shared through an interconnection network)
 
Structure of the ICN
- Directly connecting n processing units pairwise needs n(n-1)/2 links, i.e. \(O(n^2)\) (too many)
 - Use multi-stage connections instead
 
Parallel computer design
The interconnection structure generally consists of the following five parts:
- CPU
- Memory
- Interface: connects a CPU/memory to another CPU/memory (network interface cards)
- Link: the physical channel that transmits data bits
- Switch node: the information exchange and control station of the interconnection network
 
Classification of interconnection networks
- By topology: static vs. dynamic (connections are fixed vs. changeable by switches)
- By number of stages: single-stage vs. multi-stage

 
Cube function
- \(\mathrm{Cube}_i(P_{n-1}\cdots P_{i+1}P_iP_{i-1}\cdots P_0) = P_{n-1}\cdots P_{i+1}\bar{P}_iP_{i-1}\cdots P_0\): complement address bit i
- At most n applications are needed (n = \(\log_2 N\) address bits)
 
PM2I function
- \(PM2_{\pm i}(j) = (j \pm 2^i) \bmod N\), for \(0 \le i < \log_2 N\)
- Maximum distance: \(\log_2 N\)
 
Shuffle function
- Cyclic left shift of the address bits: \(\mathrm{shuffle}(P_{n-1}P_{n-2}\cdots P_0) = P_{n-2}\cdots P_0P_{n-1}\); e.g. \(\mathrm{shuffle}(P_2P_1P_0) = P_1P_0P_2\)
 - After n shuffles (n = \(\log_2 N\)) an address returns to its original value
 - All 0s to all 1s: n exchanges and n-1 shuffles
 - Maximum distance: 2n-1
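The three routing functions are easy to state in C; a minimal sketch for an n-bit PE address (N = 2^n; all names are illustrative):

/* Cube_i: complement address bit i. */
unsigned cube(unsigned p, unsigned i) {
    return p ^ (1u << i);
}

/* PM2I: move plus or minus 2^i modulo N. */
unsigned pm2_plus(unsigned p, unsigned i, unsigned N)  { return (p + (1u << i)) % N; }
unsigned pm2_minus(unsigned p, unsigned i, unsigned N) { return (p + N - (1u << i)) % N; }

/* Perfect shuffle: cyclic left shift of the n address bits,
   e.g. shuffle(P2 P1 P0) = P1 P0 P2 for n = 3. */
unsigned shuffle(unsigned p, unsigned n) {
    unsigned msb = (p >> (n - 1)) & 1u;
    return ((p << 1) | msb) & ((1u << n) - 1u);
}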
 

Network characteristics
(This does not seem to be a key point.)
Buses, crossbar switches, and multi-stage interconnection networks

Advantages of SIMD
(Worth writing down.)
- SIMD architectures can exploit significant data-level parallelism for:
 - Matrix-oriented scientific computing
 - Media-oriented image and sound processing
- SIMD is more energy-efficient than MIMD
 - Only needs to fetch one instruction per data operation
 - Makes SIMD attractive for personal mobile devices
- SIMD allows the programmer to continue to think sequentially
 
5.3 MIMD

ILP -> TLP -> MIMD
MIMD architectures fall into two broad classes
- Multiprocessor system - based on shared memory
 - All CPUs share a single memory address space
 - All are managed by one operating system
 - There is not necessarily only one physical memory
- Multicomputer system - based on message passing
 - Each node runs its own operating system
 
 
Different memory access models
- Uniform Memory Access (UMA)
 - Physical memory is uniformly shared by all processors
 - It takes the same time for any processor to access any memory word
 - Each processor can be equipped with a private cache or private memory
 - Typically bus-based
- Non-Uniform Memory Access (NUMA)
 - All CPUs share one uniform address space
 - Remote memory is accessed with ordinary LOAD and STORE instructions
 - Access to remote memory is slower than access to local memory
 - Processors in a NUMA system can use caches
- Cache-Only Memory Access (COMA)
 - Uses a distributed cache directory for remote cache access
 
 
Further division of MIMD multicomputer systems
- Massively Parallel Processors (MPP)
- Cluster of Workstations (COW)
 
Memory Consistency and Cache Coherence
Memory consistency - needs a memory consistency model
- Determines when a written value will be returned by a read
- If a processor writes location A and then writes location B, any processor that sees the new value of B must also see the new value of A (a C sketch of this idiom follows)
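The A-then-B property above is exactly the flag/data idiom, sketched here in plain C (illustrative and deliberately unsynchronized; real code needs atomics or fences, as discussed under the relaxed consistency model later in this section):

int data = 0, flag = 0;

void producer(void) {    /* runs on processor P1 */
    data = 42;           /* write A */
    flag = 1;            /* write B */
}

void consumer(void) {    /* runs on processor P2 */
    while (flag == 0)
        ;                /* saw the new value of B ... */
    /* ... so the consistency model must guarantee data == 42 here */
}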
 
Cache coherence - needs a cache coherence protocol
- All reads by any processor must return the most recently written value
- Writes to the same location by any two processors are seen in the same order by all processors
- By analyzing the results of loads and stores, one cannot tell whether the system has caches or where they are; this is because correct coherence ensures that the caches never enable new or different functional behavior
 
Multiple cores holding copies of the same data at the same time creates the cache coherence problem.
Preventing different versions of the data from appearing in multiple caches is the job of the cache coherence protocol.
UMA (SMP) - Snoopy coherence protocol
UMA machines are also called symmetric (shared-memory) multiprocessors (SMP) or centralized shared-memory multiprocessors.

Snoopy coherence protocol:
- Every cache snoops the reads and writes that other processors perform on the shared bus
- Whenever a processor modifies shared data (e.g., writes it), it broadcasts the operation on the system bus
- The other processors update their own caches according to the snooped operations
 
Write-through cache coherency protocol
- Write-through: every write goes out to memory, so snooping caches see it directly
- Write-back: writes stay in the cache until the block is replaced, so ownership must be tracked
 

Write Invalidation Protocol
Three block states (MSI protocol); a transition sketch follows this list
- Invalid
- Shared: other caches may also hold this block
- Modified
 - Indicates that the block has been updated in the private cache
 - Implies that the block is exclusive
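A minimal sketch of the MSI transitions for one block, seen from a single snooping cache (illustrative; a real protocol also issues the bus transactions and write-backs noted in the comments):

typedef enum { INVALID, SHARED, MODIFIED } msi_state;

/* Events from this processor: */
msi_state on_cpu_read(msi_state s)  { return s == INVALID ? SHARED : s; }  /* read miss fetches a shared copy */
msi_state on_cpu_write(msi_state s) { return MODIFIED; }  /* broadcast an invalidate, then own the block */

/* Events snooped from other processors on the bus: */
msi_state on_bus_read(msi_state s)  { return s == MODIFIED ? SHARED : s; } /* supply the dirty data, demote */
msi_state on_bus_write(msi_state s) { return INVALID; }   /* another cache's write invalidates our copy */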
 
 

MESI adds an Exclusive state:
- Invalid
- Shared
- Exclusive: the cache block resides only in this one cache and is clean (no other cache holds it, and the copy in memory is up to date)
- Modified
- Exclusive -> read by others -> Shared
- Exclusive -> write -> Modified
 
MESI protocol state transition rules
MESI protocol working process

MOESI
- Owned: indicates that this cache owns the block and the copy in memory is out of date
- Modified -> Owned without writing the shared block back to memory
 
NUMA (DSM) - Directory protocol
NUMA machines are called distributed shared-memory (DSM) multiprocessors

Directory protocol:
- A directory tracks the state of every cache block and which processors have cached it
- Whenever a processor needs to modify shared data, it sends invalidate signals, through the directory, point-to-point to the other processors caching that block
- Those cached copies then become invalid
 
Three directory states (a sketch of a directory entry follows this list)
- Shared
 - One or more nodes have the block cached; the value in memory is up to date
 - The directory records the set of node IDs
- Invalid
- Modified
 - Exactly one node has a copy of the cache block; the value in memory is out of date
 - The directory records the owner node ID
 

Cache Coherence Protocols
(It is enough to be able to read and explain the state diagrams; memorizing them is not required.)
- Memory consistency
 - Defines the behavior of reads and writes with respect to accesses to other memory locations
 - Needs a memory consistency model
- Cache coherence
 - Defines the behavior of reads and writes to the same memory location
 - Needs a cache coherence protocol
 
 
Relaxed Consistency Model
- Reads and writes may complete out of order; synchronization operations preserve correctness (a C11-atomics sketch follows this list)
- Rules:
 - X → Y: operation X must complete before operation Y is done
 - Sequential consistency requires: R → W, R → R, W → R, W → W
- Relax W → R
 - "Total store ordering"
- Relax W → W
 - "Partial store ordering"
- Relax R → W and R → R
 - "Weak ordering" and "release consistency"
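As one concrete instance of relaxed ordering plus synchronization, here is the flag/data idiom again, written with C11 atomics (a sketch; the release/acquire pairing corresponds to the release-consistency idea above):

#include <stdatomic.h>

int data;                    /* ordinary data, ordered by the flag */
atomic_int flag;

void producer(void) {
    data = 42;
    /* release: all earlier memory operations complete before this store */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void) {
    /* acquire: no later memory operation moves before this load */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* data == 42 is guaranteed here, even on weakly ordered hardware */
}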
 
 
 
MPP
- Massive hardware parallelism
 
COW
