Computer Architecture
- Final exam 40%
- Homework 6%
- Presentation 6%
- Labs 48%
- Forwarding+Pipeline - 8%
- Interrupt exception - 8%
- Branch prediction - 8%
- Cache design - 10%
- Out-of-order execution - 14%
Key topics
- Memory hierarchy
- Out-of-order execution algorithms
Chapter 1: Overview
1.1 Introduction
Von Neumann architecture
Big Names in Architecture
- Mateo Valero - out-of-order, parallel, large-scale processors
- John LeRoy Hennessy - MIPS
- David Andrew Patterson - RISC-I, RISC-V
- Brooks
- ……
Classification of computers
- Desktop computers
- Servers
- Embedded computers
- Personal smart devices
- Supercomputers
Flynn's taxonomy
- SISD - single instruction stream, single data stream
- SIMD - single instruction stream, multiple data streams
- MISD - multiple instruction streams, single data stream
- MIMD - multiple instruction streams, multiple data streams
Understanding performance
- Algorithm - determines the number of operations
- Programming language, compiler, architecture - determine how many machine instructions each operation becomes
- Processor and memory system - determine how fast instructions are executed
- I/O system (including the OS) - determines how fast I/O operations are performed
Based on how data is used, computers are developing in three fields:
- Speed up processing (parallelism)
- Speed up transmission (accuracy)
- Increase storage capacity and speed up storage (reliability)
The central issues discussed and studied in this course are
- Processing
- Storage
- Transmission
1.2 Performance
Performance metrics
- Latency (Response Time) - how long it takes to complete one task
- Throughput (bandwidth) - how much work can be done per unit of time
Definition
"X is n times faster than Y" - X's Performance/Y's Performance = n
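Restated as a formula (with performance defined as the reciprocal of execution time):
\[ \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n \]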
Measuring execution time
- Elapsed time - Total response time, including all aspects (Processing, I/O, OS overhead, idle time)
- CPU time - Time spent processing a given job
- User CPU time : CPU time spent in the program itself
- System CPU time: CPU time spent in the OS, performing tasks on behalf of the program
The goal of improving the architecture - better performance
1.3 Technology trend
- Hardware keeps getting faster and cheaper
A major turn in computer architecture
- From instruction-level parallelism to developing thread-level parallelism and data-level parallelism.
- Computer architecture plays an important role in the development of computers.
1.4 Quantitative approaches
CPU performance formula
Average cycles per instruction (CPI)
Hard to measure for multicore processors
Amdahl's Law - make the common case fast
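The standard formulas behind this subsection (a restatement of the usual definitions, not anything course-specific):
\[ \text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \]
Amdahl's Law, for a fraction \(f\) of the execution time that is sped up by a factor \(s\):
\[ \text{Speedup}_{overall} = \frac{1}{(1-f) + f/s} \]
For example, speeding up 80% of the time by 4x gives \(1/(0.2 + 0.8/4) = 2.5\), which is why making the common case fast pays off most.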
1.5 GREAT ARCHITECTURE IDEAS
- Moore's law
- Abstraction to simplify design
- Make the common case fast
- Parallelism
- Pipelining
- Prediction
- Memory hierarchy
- Dependability via redundancy
1.6 ISA
Instruction Set Design Issues
- Where are operands stored?
- registers, memory, stack, accumulator
- How many explicit operands are there? (Classification of ISAs)
- How is the operand location specified? (Addressing Modes)
- register, immediate, indirect, . . .
- What type & size of operands are supported? (Data Representation)
- byte, int, float, double, string, vector. . .
- What operations are supported? (Types of Instructions)
- add, sub, mul, move, compare . . .
Instruction Set design principles:
- Compatibility
- Versatility - supports a variety of functions
- High efficiency
- Security
The Four ISA Design Principles
- Simplicity favors regularity
- Smaller is faster - use a small set of registers rather than main memory
- Make the common case fast
- Good design demands good compromises - e.g. a long jump (beyond a small constant)
GPR Classification
Registers are classified by their purpose
Chapter 2: Pipeline
2.1 What is pipelining
How to speed up
- Shorten the execution time of each instruction
- Reduce the execution time of a sequence of instructions - pipelining
Pipelining
- Definition - the next instruction starts executing before the previous one has finished
- Core idea - overlapped execution (overlap)
- Works best with a large number of independent instructions (e.g., a vector dot product)
- Pass time and empty time are hard to avoid
- Pass time: the time from when the first task enters the pipeline until it finishes
- Empty time: the time from when the last task enters the pipeline until its result is produced
- The longest stage is the bottleneck
- Each stage needs an intermediate register (latch)
Three execution modes
- Sequential execution
- Single overlapping execution - \(T = (2n+1)\Delta t\)
- Twice overlapping execution - \(T = (n+2)\Delta t\)
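These formulas assume every instruction goes through three stages (fetch, decode, execute) of equal length \(\Delta t\); under the same assumption, sequential execution takes \(T = 3n\,\Delta t\). Comparing the three modes for \(n\) instructions:
\[ T_{seq} = 3n\,\Delta t \;>\; T_{single} = (2n+1)\,\Delta t \;>\; T_{twice} = (n+2)\,\Delta t \]
so for large \(n\), twice-overlapping execution approaches one instruction completed per \(\Delta t\).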
2.2 Classification of pipelines
- Single-function - a pipeline with only one fixed function
- Multi-function - the stages can be connected in different ways
- Static - only one function can be performed at a time
- Dynamic - different functions can be performed at the same time
- Classification by granularity
- Component-level pipelining (intra-component, operation pipelining): logic and arithmetic operations are divided into different stages, so several kinds of operations can be carried out
- Processor-level pipelining (inter-component, instruction pipelining): each instruction is divided into sub-processes executed by different function units
- Inter-processor pipelining (macro pipelining): several processors work on the same task
- Linear & Nonlinear - whether there is a feedback loop
- In-order & Out-of-order
- Scalar & Vector
2.3 Performance evaluation of pipelining
- Throughput (TP) = number of instructions completed / total time
\(T = (m+n-1)\times \Delta t_0\)  (m pipeline stages, n instructions, stage time \(\Delta t_0\))
\(TP = n/((m+n-1)\times \Delta t_0)\)
If \(n \gg m\), \(TP \to TP_{max} = 1/\Delta t_0\)
Otherwise \(TP = \frac{n}{m+n-1}TP_{max}\)
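A small worked example, assuming \(m = 5\) stages and \(n = 100\) instructions:
\[ TP = \frac{100}{(5+100-1)\,\Delta t_0} = \frac{100}{104\,\Delta t_0} \approx 0.96\,TP_{max} \]
i.e., once the pipeline is full, the fill and drain overhead costs only a few percent of peak throughput.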
2.4 Hazards of Pipelining
Dependences
- Data dependence - an instruction needs the result of an earlier instruction (see the C sketch after this list)
- Causes RAW hazards - a source register of an instruction is the result computed by an earlier instruction
- Name dependences
- Anti-dependence - the destination register of an instruction is the same as a source register of an earlier instruction; the later instruction's write may happen before the earlier instruction's read (WAR)
- Output dependence - two instructions write their results to the same destination register; the earlier instruction's write may overwrite the later instruction's write (WAW)
- Control dependences
- Cause control hazards
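A minimal C sketch of the three kinds of dependence (the variables and values are an assumed example; the hardware sees the same relations on registers):

```c
#include <stdio.h>

int main(void) {
    int b = 1, c = 2, e = 3, f = 4;
    int a, d;
    a = b + c;   /* I1: writes a                                        */
    d = a + 1;   /* I2: RAW (true data dependence) - reads I1's result  */
    a = e + f;   /* I3: WAR with I2 (I3 writes the a that I2 reads) and
                  *     WAW with I1 (both write a) - name dependences,
                  *     removable by register renaming                  */
    printf("a=%d d=%d\n", a, d);
    return 0;
}
```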
Hazard - a situation that prevents the next instruction from executing in the next cycle
All of them can be resolved by stalling
- Structural hazards - a required resource is busy
- e.g., the memory is needed at the same time for instruction fetch and for data access
- Solution: separate instruction and data memories/caches
- Data hazards - need to wait for a previous instruction to complete its data read/write
- An instruction needs the result of an earlier instruction
- Solution: forwarding (bypassing)
- The load-use stall cannot be avoided even with forwarding
- Code Scheduling to Avoid Stalls:
- Reorder independent instructions so that no instruction uses a loaded value in the cycle right after the load (sketched below)
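A sketch of the classic code-scheduling example (after Patterson & Hennessy; the variable names and the instruction orders in the comments are illustrative, not from the course):

```c
#include <stdio.h>

int main(void) {
    int mem[5] = {1, 2, 3, 0, 0};
    int b = mem[0], e = mem[1], f = mem[2];
    /* Naive order:     lw e ; add a = b + e ; lw f ; add c = b + f
     *   -> each add uses a value loaded by the previous instruction,
     *      so even with forwarding there is one load-use stall per add.
     * Scheduled order:  lw e ; lw f ; add a = b + e ; add c = b + f
     *   -> no add directly follows the load it depends on: zero stalls. */
    int a = b + e;
    int c = b + f;
    mem[3] = a;
    mem[4] = c;
    printf("a=%d c=%d\n", mem[3], mem[4]);
    return 0;
}
```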
- Control hazards - the control decision depends on a previous instruction
- One stall is needed for the branch itself
- If the comparison operand comes from the instruction just before the branch, one extra stall is needed (two if that instruction is a load)
2.5 Dynamic Branch Prediction
- Branch History Table (BHT)
- Branch-Target Buffer (BTB) - Cache of target addresses
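A BHT entry is typically a saturating counter. Below is a minimal C sketch of a 2-bit-counter BHT (the table size, PC indexing, and the test loop are assumptions, not the course's exact scheme): for a loop branch taken 9 times and then not taken, it mispredicts only the loop exit once warmed up, whereas a 1-bit scheme would also mispredict the first iteration of every loop.

```c
#include <stdio.h>
#include <stdbool.h>

#define BHT_BITS 10
#define BHT_SIZE (1u << BHT_BITS)

/* 2-bit saturating counters: 0,1 = predict not taken; 2,3 = predict taken */
static unsigned char bht[BHT_SIZE];

static bool predict(unsigned pc) {
    return bht[(pc >> 2) & (BHT_SIZE - 1)] >= 2;
}

static void update(unsigned pc, bool taken) {
    unsigned char *c = &bht[(pc >> 2) & (BHT_SIZE - 1)];
    if (taken  && *c < 3) (*c)++;   /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;   /* saturate at 0 */
}

int main(void) {
    /* A loop-closing branch: taken 9 times, then not taken, repeated. */
    unsigned pc = 0x400100, hits = 0, total = 0;
    for (int rep = 0; rep < 4; rep++)
        for (int i = 0; i < 10; i++) {
            bool taken = (i < 9);
            hits += (predict(pc) == taken);
            update(pc, taken);
            total++;
        }
    printf("correct predictions: %u / %u\n", hits, total);
    return 0;
}
```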
2.6 Schedule of Nonlinear pipelining
Did not fully understand this - will it be on the exam?
Initial conflict vector
Conflict vector
State transition graph
Circular queue
Shortest average interval
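For reference, the usual procedure is: from the reservation table, find the forbidden latencies, form the initial collision vector, build the state-transition graph with the rule new_state = (state >> latency) | C0, and search its cycles for the shortest average launch interval. The C sketch below (the reservation table and bit conventions are assumptions for illustration, not from the course) computes the initial collision vector and lists the allowed first launch latencies:

```c
#include <stdio.h>

/* Assumed reservation table: 3 stages (rows) x 5 cycles (columns);
 * table[s][t] = 1 means stage s is busy at cycle t for one task. */
#define STAGES 3
#define CYCLES 5

static const int table[STAGES][CYCLES] = {
    {1, 0, 0, 0, 1},   /* stage 1 used at cycles 0 and 4 */
    {0, 1, 0, 1, 0},   /* stage 2 used at cycles 1 and 3 */
    {0, 0, 1, 0, 0},   /* stage 3 used at cycle 2        */
};

int main(void) {
    /* A latency d is forbidden if some stage is used at two cycles d apart:
     * launching a second task d cycles later would collide on that stage. */
    unsigned forbidden = 0;   /* bit d set => latency d forbidden */
    for (int s = 0; s < STAGES; s++)
        for (int t1 = 0; t1 < CYCLES; t1++)
            for (int t2 = t1 + 1; t2 < CYCLES; t2++)
                if (table[s][t1] && table[s][t2])
                    forbidden |= 1u << (t2 - t1);

    /* The initial collision vector C0 is this set of forbidden latencies,
     * written here from latency CYCLES-1 down to 1. */
    printf("Initial collision vector (latency %d..1): ", CYCLES - 1);
    for (int d = CYCLES - 1; d >= 1; d--)
        printf("%d", (int)((forbidden >> d) & 1u));
    printf("\n");

    /* One step of the transition rule: launching at an allowed latency d
     * from state S gives the new state (S >> d) | C0. */
    unsigned C0 = forbidden;
    for (int d = 1; d < CYCLES; d++)
        if (!((C0 >> d) & 1u))
            printf("latency %d allowed, next state = 0x%X\n",
                   d, (C0 >> d) | C0);
    return 0;
}
```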