
Chapter 5 DLP and TLP

Overview

  • SISD
    • Single instruction stream single data stream
    • uniprocessor
    • Can exploit instruction-level parallelism
  • SIMD
    • Single instruction stream multiple data stream
    • The same instruction is executed by multiple processors using different data streams
    • Exploits data-level parallelism
    • Each processor has its own data memory, but there is a single instruction memory and control processor
  • MISD
    • Multiple instruction streams single data stream
    • Rare; no commercial processor has implemented it
  • MIMD
    • Multiple instruction streams multiple data streams
    • Each processor fetches its own instructions and operates on its own data.
    • Exploits task-level parallelism

This chapter mainly covers SIMD and MIMD.

alt text

5.1 SIMD: vector processor

vector processor

Definition - vector processor

  • A pipelined processor that provides a vector data representation and the corresponding vector instructions is called a vector processor
  • Differences from a scalar processor
    • Vector elements are rarely dependent on one another
    • Scalar processing of vectors causes frequent switching of functional units
    • Vector processing can maximize the performance of vector operations
  • Vertical and horizontal processing method
    • Horizontal: process from left to right
    • Vertical: process from top to bottom
    • The two methods can be combined

Example: D = A × (B + C), where A, B, C, D are all vectors of length N

Horizontal

  • Each cycle: \(k_i=b_i+c_i,\ d_i=a_i\times k_i\)
  • Data dependences: N, function switches: 2N
  • RAW hazards occur and functional units switch frequently, so efficiency is low

Vertical

  • K = B + C, D = A × K
  • Data dependences: 1, function switches: 1

Vertical and horizontal (grouping) processing method

  • Divide the N elements into S groups of n elements each, with remainder r
  • \(N = S \times n + r\)
  • Data dependences: S+1, function switches: 2(S+1)
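The grouped method can be sketched in plain C (an illustrative sketch, not CRAY code; the group size n and the temporary array K are assumptions of this sketch). Within each group the whole add finishes before the multiply starts, so the functional unit switches only twice per group, which is where the 2(S+1) count comes from.

```c
#include <assert.h>
#include <stddef.h>

/* Grouped (strip-mined) computation of D = A * (B + C): the N elements are
 * split into groups of at most n elements; within a group the vector add
 * completes before the vector multiply begins. */
void vec_group(const double *A, const double *B, const double *C,
               double *D, size_t N, size_t n) {
    double K[64];                       /* per-group temporary; assumes n <= 64 */
    for (size_t base = 0; base < N; base += n) {
        size_t len = (N - base < n) ? N - base : n;  /* last group = remainder r */
        for (size_t i = 0; i < len; i++)             /* vector add for the group */
            K[i] = B[base + i] + C[base + i];
        for (size_t i = 0; i < len; i++)             /* vector multiply */
            D[base + i] = A[base + i] * K[i];
    }
}
```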

Vector processors come in two structures: memory-memory and register-register

alt text

  • STAR-100 and CYBER-205 both use a memory-memory operation pipeline

CRAY-1

A register-register-oriented vector pipeline processing machine.

An example of a vector processor - CRAY-1 alt text

There are 12 single-function pipelines that can work in parallel, performing address, vector, and scalar operations in pipelined fashion.

  • Each vector register has 6 buses connecting it to the different functional units
  • Instructions can execute in parallel as long as there is no Vi conflict and no functional conflict

Vi conflict: The source vector or result vector of each vector instruction working in parallel uses the same Vi.

  • Write-then-read data dependence (V0 conflict):
    • \(V_0 \leftarrow V_1 + V_2\)
    • \(V_3 \leftarrow V_0 \times V_4\)
  • Read-read dependence (both instructions read V1):
    • \(V_0 \leftarrow V_1 + V_2\)
    • \(V_3 \leftarrow V_1 \times V_4\)

Functional conflict: Each vector instruction working in parallel must use the same functional unit.

Instruction types for CRAY-1

alt text


Improve performance

  • Set up multiple functional components and make them work in parallel.
    • The CRAY-1 vector processor has 12 single-function pipeline units organized into 4 groups
  • Use link technology to speed up the execution of a string of vector instructions.
    • Link feature: when there is a write-then-read dependence but no functional-unit conflict and no source-vector conflict, execution can be accelerated by linking (chaining)
    • Key idea: pipelining across vector instructions
  • Use strip mining (loop sectioning) to speed up loop processing.
  • Using a multi-processor system to further improve the performance

Example: Use link technology to perform vector operations on CRAY-1: D=A×(B+C)

  • Assuming that the vector length is N ≤ 64, the vector elements are floating-point numbers, and the vectors B and C have been stored in V0 and V1.
  • Draw a link diagram and analyze the different execution times in the case of non-link execution and link execution.
  • Solution: Use the following three vector instructions to complete the operation:
(1) V3 ← memory   // load vector A
(2) V2 ← V0 + V1  // floating-point add of vectors B and C
(3) V4 ← V2 × V3  // floating-point multiply, result stored in V4
// (3) can be linked after (1) and (2)

Let the vector length be N

  • serial method: [(1+6+1)+N-1] + [(1+6+1)+N-1] + [(1+7+1)+N-1] = 3N + 22
  • After (1) and (2) are executed in parallel, execute (3): max{[(1+6+1)+N-1], [(1+6+1)+N-1]} + [(1+7+1)+N-1] = 2N + 15
  • link technology: max{(1+6+1), (1+6+1)} + (1+7+1)+N-1 = N + 16
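The three timing formulas can be checked with a few helper functions (function names are illustrative; the per-instruction startup costs (1+6+1) for load/add and (1+7+1) for multiply are taken from the example above):

```c
#include <assert.h>

/* Execution-time formulas (in clock cycles) for D = A * (B + C) on CRAY-1:
 * each instruction costs its startup cycles plus N-1 to stream N elements. */
int t_serial(int N)  { return ((1+6+1)+N-1) + ((1+6+1)+N-1) + ((1+7+1)+N-1); } /* 3N+22 */
int t_overlap(int N) { return ((1+6+1)+N-1) + ((1+7+1)+N-1); }                 /* (1),(2) in parallel: 2N+15 */
int t_linked(int N)  { return (1+6+1) + ((1+7+1)+N-1); }                       /* chained: N+16 */
```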

alt text


Other Machine

RV64V

  • Loosely based on Cray-1
    • 32 64-bit vector registers
    • Register file has 16 read ports and 8 write ports
  • Vector functional units
    • Fully pipelined
    • Data and control hazards are detected
  • Vector load-store unit
    • Fully pipelined
    • One word per clock cycle after initial latency
  • Scalar registers
    • 31 general-purpose registers
    • 32 floating-point registers

alt text

NVIDIA GPU and vector machines

GPUs are really just multithreaded SIMD Processors

  • Similarities to vector machines
    • Works well with data-level parallel problems
    • Scatter-gather transfers
    • Mask registers (mask off or select specific elements according to a specified mask value)
    • Large register files
    • thread (private memory) -> block (local memory) -> grid (GPU memory)
  • Differences
    • No scalar processor
    • Uses multithreading to hide memory latency
    • Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Loop-Level Parallelism

Loops can exploit DLP and TLP, as well as the more aggressive static ILP approaches (e.g., VLIW)

Key point - how to transform loops to expose parallelism

Loop-carried dependence - whether data accesses in later iterations are dependent on data values produced in earlier iterations

  • No loop-carried dependence
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;
    
  • loop-carried dependence
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i]; /* S1 */
        B[i+1] = B[i] + A[i+1]; /* S2 */
    }
    // S1 and S2 each use the result computed by the same statement in the previous iteration
    // S2 uses the result of S1 in the current iteration
    
  • loop-carried dependence that is not circular
    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i]; /* S1 */
        B[i+1] = C[i] + D[i]; /* S2 */
    }
    // S1 uses the value of B[i] computed by S2 in the previous iteration
    // but the dependence is not circular, so the loop can be made parallel
    
    This can be transformed into:
    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i]; /* S2 */
        A[i+1] = A[i+1] + B[i+1]; /* S1 */
    }
    B[100] = C[99] + D[99];
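A quick way to convince yourself the transformation is correct is to run both versions and compare results (a self-contained sketch with illustrative function names; B needs 101 elements because the loops write B[i+1]). Note the transformed loop runs only 99 iterations, with a prologue and epilogue handling the boundary elements.

```c
#include <assert.h>
#include <string.h>

/* Original loop: S1 reads B[i], which S2 wrote in the previous iteration.
 * The dependence is loop-carried but not circular, so it can be removed. */
void loop_original(double *A, double *B, const double *C, const double *D) {
    for (int i = 0; i < 100; i++) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i+1] = C[i] + D[i];     /* S2 */
    }
}

/* Transformed loop: each iteration is now independent. */
void loop_transformed(double *A, double *B, const double *C, const double *D) {
    A[0] = A[0] + B[0];
    for (int i = 0; i < 99; i++) {
        B[i+1] = C[i] + D[i];     /* S2 */
        A[i+1] = A[i+1] + B[i+1]; /* S1 */
    }
    B[100] = C[99] + D[99];
}
```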
    

Dependence types

  • Name dependence - the same name is reused
    • Antidependence - no actual data is transferred
  • Data dependence - a later instruction needs the result of an earlier instruction

Eliminating name dependences by renaming:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i]/c;  /* S1 */
    X[i] = X[i]+c;  /* S2 */
    Z[i] = Y[i]+c;  /* S3 */
    Y[i] = c-Y[i];  /* S4 */
}
// after renaming
for (i=0; i<100; i=i+1) {
    T[i] = X[i]/c;  /* S1 */
    P[i] = X[i]+c;  /* S2 */
    Z[i] = T[i]+c;  /* S3 */
    Y[i] = c-T[i];  /* S4 */
}
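The renaming can be checked by running both loops on the same input and comparing: the renamed loop leaves the original loop's final X values in P (a sketch; the function name is illustrative).

```c
#include <assert.h>

/* Renamed version of the loop above: the antidependence on X (S1 -> S2) and
 * the output dependence on Y (S1 -> S4) are removed by writing to the new
 * arrays T and P; the original loop's final X values end up in P. */
void loop_renamed(const double *X, double *Y, double *Z,
                  double *T, double *P, double c) {
    for (int i = 0; i < 100; i++) {
        T[i] = X[i] / c;   /* S1: Y renamed to T */
        P[i] = X[i] + c;   /* S2: X renamed to P */
        Z[i] = T[i] + c;   /* S3: reads T instead of Y */
        Y[i] = c - T[i];   /* S4 */
    }
}
```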
Practice:
for (i=0; i<100; i=i+1) {
    A[i] = A[i]*B[i];
    B[i] = A[i]+c;
    A[i] = C[i]*c;
    C[i] = D[i]+A[i];
}

// after renaming
for (i=0; i<100; i=i+1) {
    T[i] = A[i]*B[i];
    Q[i] = T[i]+c;
    S[i] = C[i]*c;
    R[i] = D[i]+S[i];
}

5.2 SIMD: array processor

Definition - array processor (parallel processors)

  • N processing elements \(PE_0\) to \(PE_{N-1}\)
  • Interconnected in a certain way to form an array.
  • A single control unit: the operations specified by the same instruction are completed in parallel by the processing elements, each on the data allocated to it

Each processor is allocated its own block of memory, and the memory blocks can be connected in different ways

alt text

  • Distributed memory
    • N Memories <-> N Processors alt text
  • Centralized shared memory
    • N Processors <-> ICN <-> K Memories alt text

Structure of ICN

  • Directly connecting all n processing units pairwise requires n(n-1)/2 links - \(O(n^2)\) (too many)
  • Use multi-stage connections instead

Parallel computer design

A parallel computer built around an interconnection network generally consists of the following five parts

  • CPU
  • memory
  • interface - connects one CPU/memory to another CPU/memory (network interface cards)
  • link - physical channel to transmit data bits
  • switch node - the information exchange and control station of the interconnected network

Classification of interconnection networks

  • Topology: static vs. dynamic - the connections are fixed / can be changed by switches
  • Connection stages: single-stage / multi-stage alt text

Cube function

alt text

alt text

  • At most n steps
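The figures defining the cube function are not reproduced here; under the usual definition for an n-cube network (an assumption of this sketch), Cube_k connects node P to the node whose binary address differs from P only in bit k, so routing between any two nodes takes at most n steps.

```c
#include <assert.h>

/* Cube interconnection function: flip bit k of the node address. */
unsigned cube(unsigned p, unsigned k) { return p ^ (1u << k); }

/* Routing flips each differing address bit once, so the number of steps
 * from src to dst is popcount(src ^ dst) - at most n for n-bit addresses. */
int cube_distance(unsigned src, unsigned dst) {
    int steps = 0;
    for (unsigned x = src ^ dst; x != 0; x >>= 1) steps += x & 1u;
    return steps;
}
```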

PM2I function

alt text

alt text

  • Maximum distance: \(\log_2 n\)

Shuffle function

  • \(shuffle(P_2P_1P_0)=P_1P_0P_2\)
  • After n shuffles, the address returns to its original value alt text
  • all 0 to all 1: n exchanges and n-1 shuffles
  • Maximum distance: 2n-1
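Based on the cyclic-shift definition above, a small sketch; the exchange helper (flip the lowest address bit), used together with shuffle to reach all-1 from all-0, is an assumption of this sketch.

```c
#include <assert.h>

/* Perfect shuffle on an n-bit PE address: a cyclic left shift, so for n = 3
 * shuffle(P2 P1 P0) = P1 P0 P2. */
unsigned shuffle(unsigned p, int n) {
    unsigned msb = (p >> (n - 1)) & 1u;          /* bit shifted out on the left */
    return ((p << 1) | msb) & ((1u << n) - 1u);  /* rotate it back in on the right */
}

/* Exchange: flip the lowest address bit. */
unsigned exchange(unsigned p) { return p ^ 1u; }
```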

alt text

Network characteristics

This does not seem to be a key point

buses, crossbar switches, and multi-stage interconnection networks

alt text

Advantages of SIMD

This part is worth copying down

  • SIMD architectures can exploit significant data-level parallelism for:
    • Matrix-oriented scientific computing
    • Media-oriented image and sound processors
  • SIMD is more energy efficient than MIMD
    • Only needs to fetch one instruction per data operation
    • Makes SIMD attractive for personal mobile devices
  • SIMD allows programmer to continue to think sequentially

5.3 MIMD

alt text

ILP -> TLP -> MIMD

MIMD architectures fall into two major classes

  • Multi-processor system - based on shared memory
    • All CPUs share a single memory address space
    • All managed by a single OS
    • There is not necessarily only one physical memory
  • Multi-computer system - based on message passing
    • Each node runs its own OS

Different memory access models

  • Uniform Memory Access (UMA)alt text
    • Physical memory is uniformly shared by all processors
    • It takes the same time for all processors to access any memory word
    • Each processor can be equipped with private cache or private memory
    • Based on Bus alt text
  • Non Uniform Memory Access (NUMA)alt text
    • All CPUs share a uniform address space
    • Use LOAD and STORE instructions to access remote memory
    • Access to remote memory is slower than access to local memory
    • The processor in the NUMA system can use cache
    • alt text
  • Cache Only Memory Access (COMA) alt text
    • Use the distributed cache directory for remote cache access

Further division of MIMD multi-computer system

  • Massively Parallel Processors (MPP)
  • Cluster of Workstations(COW)

Memory Consistency and Cache Coherence

Memory Consistency - Need Memory Consistency Model

  • When a written value will be returned by a read
  • If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A

Cache Coherence - Need Cache Coherence Protocol

  • All reads by any processor must return the most recently written value
  • Writes to the same location by any two processors are seen in the same order by all processors
  • With correct coherence, analyzing the results of loads and stores cannot reveal whether the system has caches or where they are. This is because correct coherence ensures that the caches never enable new or different functional behavior.

Multiple cores holding copies of the same data at once - the cache coherence problem

Preventing different versions of the data from appearing in multiple caches - the job of the cache coherence protocol.

UMA(SMP)- Snoopy coherence protocol

UMA is also called symmetric (shared-memory) multiprocessors (SMP) or centralized shared-memory multiprocessors.

alt text

Snoopy coherence protocol:

  • Each cache monitors ("snoops") the reads and writes that other processors issue on the shared bus
  • Whenever a processor modifies shared data (e.g., performs a write), it broadcasts the operation on the system bus
  • The other processors update their own caches according to these operations

Write-through Cache Coherency Protocol

  • write-through
  • write-back

alt text

Write Invalidation Protocol

three block states (MSI protocol)

  • Invalid
  • shared - other caches may also hold this block
  • modified
    • indicates that the block has been updated in the private cache;
    • implies that the block is exclusive

alt text

MESI (adds the exclusive state):

  • Invalid
  • shared
  • exclusive - indicates that a cache block is resident only in a single cache but is clean (no other cache holds this block, and the copy in memory is up-to-date)
  • modified
  • exclusive->read by others->shared
  • exclusive->write->modified
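The transitions listed above (E→S on a remote read, E→M on a local write) can be captured in a toy state machine. This is a simplified sketch: write-backs and bus signalling are omitted, and all type and function names are illustrative.

```c
#include <assert.h>

/* Toy model of MESI state transitions for one cache line in one cache.
 * Events: accesses by the local core, and snooped bus traffic from others. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } mesi_event_t;

mesi_t mesi_next(mesi_t s, mesi_event_t e, int other_copies_exist) {
    switch (e) {
    case LOCAL_READ:   /* read miss: E if no other cache has it, else S */
        return (s == INVALID) ? (other_copies_exist ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:  /* a local write always ends in Modified */
        return MODIFIED;
    case BUS_READ:     /* another core reads: M/E copies downgrade to Shared */
        return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    case BUS_WRITE:    /* another core writes: our copy is invalidated */
        return INVALID;
    }
    return s;
}
```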

MESI protocol state transition rules

alt text alt text MESI protocol working process

alt text

MOESI

  • owned: indicates that the associated block is owned by that cache and out-of-date in memory
  • Modified -> Owned without writing the shared block to memory

NUMA (DSM) - Directory protocol

NUMA is called distributed shared-memory multiprocessor (DSM)

alt text

Directory protocol:

  • A directory tracks the state of each cache line and which processors have cached it
  • Whenever a processor needs to modify shared data, it sends an invalidate signal to the other processors caching that data through the directory in a "point-to-point" way
  • The copies in those other caches become invalid

Three block states

  • Shared
    • One or more nodes have the block cached, value in memory is up-to-date
    • Set of node IDs
  • Invalid
  • Modified
    • Exactly one node has a copy of the cache block, value in memory is out-of-date
    • Owner node ID
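The three directory states with their sharer set and owner node ID can be sketched as a per-block directory entry (illustrative names; the actual invalidation and write-back messages are omitted).

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a directory entry: for each memory block the home node keeps a
 * state plus a bit vector of sharer node IDs, so invalidations can be sent
 * point-to-point instead of being broadcast. */
typedef enum { DIR_INVALID, DIR_SHARED, DIR_MODIFIED } dir_state_t;
typedef struct {
    dir_state_t state;
    uint64_t sharers;   /* bit i set => node i holds a copy */
} dir_entry_t;

void dir_read(dir_entry_t *d, int node) {
    /* if DIR_MODIFIED, the owner would first write the block back (omitted) */
    d->state = DIR_SHARED;
    d->sharers |= 1ull << node;
}

void dir_write(dir_entry_t *d, int node) {
    /* point-to-point invalidations would go to every other set bit (omitted) */
    d->state = DIR_MODIFIED;
    d->sharers = 1ull << node;   /* only the owner node ID remains */
}
```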

alt text alt text

Cache Coherence Protocols

Being able to read and explain the state diagrams is enough; memorizing them is not required

  • Memory Consistency
    • defines the behavior of reads and writes with respect to accesses to other memory locations
    • Need Memory Consistency Model
  • Cache Coherence
    • defines the behavior of reads and writes to the same memory location
    • Need Cache Coherence Protocol

Relaxed Consistency Model

  • Reads and writes may be reordered, while synchronization operations preserve correctness
  • Rules:
    • X → Y
      • Operation X must complete before operation Y is done
      • Sequential consistency requires: R → W, R → R, W → R, W → W
    • Relax W → R
      • “Total store ordering”
    • Relax W → W
      • “Partial store order”
    • Relax R → W and R → R
      • “Weak ordering” and “release consistency”

MPP

  • Massive hardware-level parallelism

COW