These are CS61C notes. Since they were taken while following the lectures live, in many places I was lazy and just pasted screenshots...
Lecture 21 (Pipelining 1)
Pipelining
- Performance Measurement & Improvement
- “Our” Single-cycle RISC-V CPU executes instructions at 1.25 GHz
- 1 instruction every 800 ps
- Can we improve its performance?
- Quicker response time, so one job finishes faster?
- More jobs per unit time (e.g. web server returning pages, spoken words recognized)?
- Longer battery life?
Processor Performance Iron Law
- Iron Law: Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
- Instructions per Program
  - Determined by
    - Task
    - Algorithm
    - Programming language
    - Compiler
    - Instruction Set Architecture (ISA)
- (Average) Clock Cycles per Instruction (CPI)
  - Determined by
    - ISA
    - Processor implementation (or microarchitecture)
    - For "our" single-cycle RISC-V design, CPI = 1
    - Complex instructions (e.g. strcpy), CPI >> 1
    - Superscalar processors, CPI < 1 (next lectures)
- Time per Cycle (1/Frequency)
  - Determined by
    - Processor microarchitecture, which determines the critical path through logic gates
    - Technology (e.g., 5 nm versus 28 nm)
    - Power budget (lower voltages reduce transistor speed)
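To make the Iron Law concrete, here is a minimal C sketch that just multiplies the three factors together (the instruction count is a made-up placeholder; CPI = 1 and the 800 ps cycle time come from the single-cycle example above):

```c
#include <stdio.h>

int main(void) {
    // Iron Law: Time/Program = Instr/Program * Cycles/Instr * Time/Cycle
    double instructions = 1e9;     // hypothetical program: 1 billion instructions
    double cpi          = 1.0;     // single-cycle RISC-V: CPI = 1
    double cycle_time   = 800e-12; // 800 ps per cycle (1.25 GHz)

    double exec_time = instructions * cpi * cycle_time;
    printf("Execution time: %.3f s\n", exec_time);  // prints 0.800 s
    return 0;
}
```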
Energy Efficiency
- Voltage is a more powerful knob: dynamic energy scales roughly with the square of the supply voltage.
  But lowering the voltage also hurts performance, because transistors switch more slowly at a lower supply voltage.
- End of Scaling
- Industry has not been able to reduce supply voltage much, since reducing it further would increase "leakage power", where transistor switches don't fully turn off (more like a dimmer switch than an on-off switch)
- Size of transistors and hence capacitance, not shrinking as much as before between transistor generations
- Need to go to 3D
- Power becomes a growing concern – the “power wall”
- Energy Iron Law
- Energy efficiency (e.g., instructions/Joule) is key metric in all computing devices
- For power-constrained systems (e.g., 20MW datacenter), need better energy efficiency to get more performance at same power
- For energy-constrained systems (e.g., 1W phone), need better energy efficiency to prolong battery life
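A rough sketch of the energy side of the Iron Law, Energy/Program = Instructions/Program × Energy/Instruction, where dynamic energy per instruction scales roughly with C·V²; the capacitance value and voltages below are made-up placeholders, only meant to show the quadratic effect of supply voltage:

```c
#include <stdio.h>

int main(void) {
    // Energy Iron Law: Energy/Program = Instr/Program * Energy/Instr
    // Dynamic energy per instruction scales roughly with C * V^2.
    double instructions = 1e9;   // hypothetical program
    double c_eff        = 1e-9;  // placeholder effective switched capacitance (F)
    double voltages[]   = {1.0, 0.8};

    for (int i = 0; i < 2; i++) {
        double v      = voltages[i];
        double energy = instructions * c_eff * v * v;   // Joules
        printf("V = %.1f V -> energy = %.2f J, instr/Joule = %.2e\n",
               v, energy, instructions / energy);
    }
    // Lowering V from 1.0 to 0.8 cuts energy by ~36%, but transistors also
    // switch more slowly, so frequency (and performance) drops.
    return 0;
}
```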
Introduction to Pipelining
- Pipelining doesn’t help latency of single task, it helps throughput of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number of pipe stages
- Time to "fill" the pipeline and time to "drain" it reduce speedup:
- 2.3X v. 4X in this example
- Pipeline rate limited by slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
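A minimal sketch of where numbers like 2.3X vs. 4X come from, assuming the classic example of 4 equal-length stages and 4 tasks: the ideal speedup equals the number of stages, but filling and draining the pipeline cost (stages − 1) extra stage-times:

```c
#include <stdio.h>

int main(void) {
    int k = 4;  // pipeline stages (assumed equal length)
    int n = 4;  // number of tasks in the workload

    // Unpipelined: every task takes k stage-times back to back.
    double serial_time    = (double)n * k;
    // Pipelined: first task takes k stage-times, then one finishes per stage-time.
    double pipelined_time = (double)k + (n - 1);

    double speedup = serial_time / pipelined_time;
    printf("Speedup = %.2f (ideal = %d)\n", speedup, k);  // 2.29 vs 4
    return 0;
}
```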
Lecture 22 (Pipelining 2)
Pipelining RISC-V
- clk is 200ps to support all the instructions
- We lose on the latency of a single instruction, but we gain dramatically on throughput
Pipelining Datapath
Insert registers at appropriate places on the boundary between each of these execution phases.
- Pipeline registers separate stages, hold data for each instruction in flight
- Pipelined Control
- Control signals derived from instruction
- As in single-cycle implementation
- Information is stored in pipeline registers for use by later stages
Once we decode an instruction, we need to send all of its control bits down the pipeline along with the instruction itself.
So each of the pipeline registers (which already carry values such as the PC) is a bit wider, to also hold the necessary control signals for the later stages.
Pipeline Hazards
- A hazard is a situation that prevents starting the next instruction in the next clock cycle
- Structural hazard
- A required resource is busy (e.g. needed in multiple stages)
- Data hazard
- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write
- Control hazard
- Flow of execution depends on previous instruction
- Structural Hazard
- Problem: Two or more instructions in the pipeline compete for access to a single physical resource
- Solution 1: Instructions take it in turns to use resource, some instructions have to stall
- Solution 2: Add more hardware to machine
- (Can always solve a structural hazard by adding more hardware)
- Regfile Structural Hazards
- Each instruction:
- Can read up to two operands in decode stage
- Can write one value in writeback stage
- Avoid structural hazard by having separate “ports”
- Two independent read ports and one independent write port
- Three accesses per cycle can happen simultaneously
- Structural Hazard: Memory Access
- IMEM and DMEM are separate because, in a pipeline, two different memory accesses (an instruction fetch and a data access) can happen in the same cycle
- Instruction and Data Caches
- Fast, on-chip memory, separate for instructions and data
- Summary
- Conflict for use of a resource
- In RISC-V pipeline with a single memory
- Load/store requires data access
- Without separate memories, instruction fetch would have to stall for that cycle
- All other operations in pipeline would have to wait
- Pipelined datapaths require separate instruction/data memories
- Or separate instruction/data caches
- RISC ISAs (including RISC-V) designed to avoid structural hazards
- e.g. at most one memory access/instruction
Data Hazards
Separate ports, but what if write to same register as read?
Does sw in the example fetch the old or new value?
Solution
- Stalls reduce performance
- But stalls are required to get correct results
- Compiler can arrange code or insert nops(addi x0, x0, 0) to avoid hazards and stalls
- Requires knowledge of the pipeline structure
- Use result when it is computed
- Don’t wait for it to be stored in a register
- Requires extra connections in the datapath
Lecture 23 (Pipelining 3)
Load Data Hazard
- We can only stall
- Slot after a load is called a load delay slot
- If that instruction uses the result of the load, then the hardware will stall for one cycle
- Equivalent to inserting an explicit nop in the slot
- except the latter uses more code space
- Performance loss
- Idea:
- Put unrelated instruction into load delay slot
- No performance loss!
Control Hazards
We won't know whether a branch is taken or not for two cycles after we fetch it.
We should not be executing the two following instructions. So we should have two stall cycles after every single branch in the pipeline. (not quite)
- If branch not taken, then instructions fetched sequentially after branch are correct
- If branch or jump taken, then need to flush incorrect instructions from pipeline by converting to NOPs
- Reducing Branch Penalties
- Every taken branch in simple pipeline costs 2 dead cycles
- To improve performance, use “branch prediction” to guess which way branch will go earlier in pipeline
- Only flush pipeline if branch prediction was incorrect
Superscalar Processors
Increasing Processor Performance
- Clock rate
- Limited by technology and power dissipation
- Pipelining
- “Overlap” instruction execution
- Deeper pipeline: 5 => 10 => 15 stages
- Less work per stage ⇒ shorter clock cycle
- But more potential for hazards (CPI > 1)
- Multi-issue “superscalar” processor
Superscalar Processor
- Multiple issue “superscalar”
- Replicate pipeline stages ⇒ multiple pipelines
- Start multiple instructions per clock cycle
- CPI < 1, so use Instructions Per Cycle (IPC)
- E.g., 4GHz 4-way multiple-issue
- 16 BIPS, peak CPI = 0.25, peak IPC = 4
- Dependencies reduce this in practice
- “Out-of-Order” execution
- Reorder instructions dynamically in hardware to reduce impact of hazards
- (CS152 discusses these techniques!)
Pipelining and ISA Design
- RISC-V ISA designed for pipelining
- All instructions are 32-bits
- Easy to fetch and decode in one cycle
- Versus x86: 1- to 15-byte instructions
- Few and regular instruction formats
- Decode and read registers in one step
- Load/store addressing
- Calculate address in 3rd stage, access memory in 4th stage
- Alignment of memory operands
- Memory access takes only one cycle
Conclusion
Lecture 24 (Caches 1)
Binary Prefix
tricks to memorize
Library Analogy
- Why are Large Memories Slow? Library Analogy
- Time to find a book in a large library
- Search a large card catalog – (mapping title/author to index number)
- Round-trip time to walk to the stacks and retrieve the desired book
- Larger libraries worsen both delays
- Electronic memories have same issue, plus the technologies used to store a bit slow down as density increases (e.g., SRAM vs. DRAM vs. Disk)
Memory Hierarchy
(What To Do: Library Analogy)
- Write a report using library books
- E.g., works of J.D. Salinger
- Go to library, look up books, fetch from stacks, and place on desk in library
- If need more, check out, keep on desk
- But don’t return earlier books since might need them
- You hope this collection of ~10 books on your desk is enough to write the report, even though those 10 books are only about 0.00001% of the books in the UC Berkeley libraries
Memory Caching
- Mismatch between processor and memory speeds leads us to add a new level…
- Introducing a “memory cache”
- Implemented with same IC processing technology as the CPU (usually integrated on same chip)
- faster but more expensive than DRAM memory.
- Cache is a copy of a subset of main memory
- Most processors have separate caches for instructions and data.
Cache is right under the Registers
Faster, Expensive, Small capacity
Memory Hierarchy
- If level closer to Processor, it is:
- Smaller
- Faster
- More expensive
- subset of lower levels (contains most recently used data)
- Lowest Level (usually disk=HDD/SSD) contains all available data (does it go beyond the disk?)
- Memory Hierarchy presents the processor with the illusion of a very large & fast memory
Locality, Design, Management
Memory Hierarchy Basis
- Cache contains copies of data in memory that are being used.
- Memory contains copies of data on disk that are being used.
- Caches work on the principles of temporal and spatial locality.
- Temporal locality (locality in time): If we use it now, chances are we’ll want to use it again soon.
- Spatial locality (locality in space): If we use a piece of memory, chances are we’ll use the neighboring pieces soon.
What to Do About Locality
- Temporal Locality
- If a memory location is referenced then it will tend to be referenced again soon
Keep most recently accessed data items closer to the processor
- Spatial Locality
- If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
Move blocks consisting of contiguous words closer to the processor
Cache Design
- How to organize the cache
- Where does each memory address map to?
  (Remember that the cache is a subset of memory, so multiple memory addresses map to the same cache location.)
- How to know which elements are in the cache
- How to quickly locate them
Conclusion
- Caches provide an illusion to the processor that the memory is infinitely large and infinitely fast.
Lecture 25 (Caches 2)
Direct Mapped Caches
In a direct-mapped cache, each memory address is associated with exactly one possible block within the cache (the block's position is always determined, so no search is needed)
- Therefore, we only need to look in a single location in the cache for the data if it exists in the cache
- Block is the unit of transfer between cache and memory
Every group of 4 elements knows where it should go, encoded by the last two bits of the binary address.
There is a tag containing the corresponding memory address; the last three bits of the address are not needed in the tag (they are already determined by the position in the cache). The tag acts like a "cache number".
Issues with Direct-Mapped
- From the bits of a memory address we can determine both its location in the cache and its tag:
Direct-Mapped Cache Terminology
- All fields are read as unsigned integers.
- Index
- specifies the cache index (which “row”/block of the cache we should look in)
- Offset
- once we’ve found correct block, specifies which byte within the block we want
- Tag
- the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
Memorization tip:
Direct Mapped Example
- Suppose we have 8 B of data in a direct-mapped cache with 2-byte blocks
- Determine the size of the tag, index and offset fields if using a 32-bit architecture (RV32)
- Offset
  - need to specify the correct byte within a block
  - block contains 2 B = 2^1 bytes
  - need 1 bit to specify the correct byte
- Index: (~index into an "array of blocks")
  - need to specify the correct block in the cache
  - cache contains 8 B = 2^3 bytes
  - block contains 2 B = 2^1 bytes
  - # blocks/cache = 2^3 / 2^1 = 2^2 = 4 blocks
  - need 2 bits to specify this many blocks
- Tag: use remaining bits as tag
  - tag length = addr length − offset − index = 32 − 1 − 2 = 29 bits
  - so the tag is the leftmost 29 bits of the memory address
- Tag can be thought of as “cache number”
- Why not full 32-bit address as tag?
- All bytes within block need same address
- Index must be same for every address within a block, so it’s redundant in tag check, thus can leave off to save memory
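A small C sketch of the field breakdown worked out above (assuming the toy 8 B direct-mapped cache with 2-byte blocks and 32-bit addresses, so offset = 1 bit, index = 2 bits, tag = 29 bits); the example address is arbitrary:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 1   // 2-byte blocks  -> 1 offset bit
#define INDEX_BITS  2   // 4 blocks/cache -> 2 index bits

int main(void) {
    uint32_t addr = 0x0000042A;  // arbitrary example address

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  // remaining 29 bits

    printf("addr = 0x%08X -> tag = 0x%X, index = %u, offset = %u\n",
           addr, tag, index, offset);
    return 0;
}
```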
Memory Access without Cache
- Load word instruction: lw t0, 0(t1)
- t1 contains 1022, Memory[1022] = 99
- Processor issues address to Memory
- Memory reads word at address
- Memory sends 99 to Processor
- Processor loads 99 into register t0
Memory Access with Cache
- Load word instruction: lw t0, 0(t1)
- t1 contains 1022, Memory[1022] = 99
- With cache (similar to a hash)
- Processor issues address to Cache
- Cache checks to see if has copy of data at address
2a. If it finds a match (hit): cache reads 99 and sends it to the processor
2b. No match (miss): cache sends address 1022 to Memory
  1. Memory reads 99 at address 1022
  2. Memory sends 99 to the cache
  3. Cache replaces a word with the new 99
  4. Cache sends 99 to the processor
3. Processor loads 99 into register t0
(In the slide's figure, start reading from the top-right corner.)
Cache Terminology
- When reading memory, 3 things can happen:
cache hit:
- cache block is valid and contains proper address, so read desired word
cache miss:
- nothing in cache in appropriate block, so fetch from memory
cache miss, block replacement:
- wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory (the cache always holds a copy)
Cache Temperatures
- Cold
- Cache empty
- Warming
- Cache filling with values you'll hopefully be accessing again soon
- Warm
- Cache is doing its job, fair % of hits
- Hot
- Cache is doing very well, high % of hits
Cache Terms
- Hit rate: fraction of accesses that hit in the cache
- Miss rate: 1 − Hit rate
- Miss penalty: time to replace a block from a lower level of the memory hierarchy into the cache
- Hit time: time to access the cache (including tag comparison)
- Abbreviation: "$" = cache
Valid Bit (One More Detail)
- When start a new program, cache does not have valid information for this program
- Need an indicator whether this tag entry is valid for this program
- Add a “valid bit” to the cache tag entry
Valid bit = 0 ⇒ cache miss, even if by chance the address tag matches the stored tag
Valid bit = 1 ⇒ cache hit, if the processor's address tag matches the stored tag
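Putting the valid bit and tag check together, here is a rough C sketch of a direct-mapped byte read (toy sizes, not any particular hardware; the backing `memory` array stands in for DRAM):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_BLOCKS  4
#define BLOCK_BYTES 2
#define OFFSET_BITS 1
#define INDEX_BITS  2

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
} CacheLine;

static CacheLine cache[NUM_BLOCKS];
static uint8_t   memory[1 << 16];   // toy backing "DRAM"

// Returns one byte; on a miss (invalid line or tag mismatch) it first
// fills the whole block from memory, then reads from the cache.
uint8_t cache_read_byte(uint32_t addr) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    CacheLine *line = &cache[index];
    if (!line->valid || line->tag != tag) {                    // miss
        uint32_t block_start = addr & ~((1u << OFFSET_BITS) - 1u);
        memcpy(line->data, &memory[block_start], BLOCK_BYTES); // fetch whole block
        line->tag   = tag;
        line->valid = true;
    }
    return line->data[offset];                                 // hit path
}

int main(void) {
    memory[0x042A] = 99;
    printf("%u\n", (unsigned)cache_read_byte(0x042A));  // miss, fill, returns 99
    printf("%u\n", (unsigned)cache_read_byte(0x042A));  // hit this time
    return 0;
}
```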
Conclusion
- We have learned the operation of a direct-mapped cache
- Mechanism for transparent movement of data among levels of a memory hierarchy
- set of address/value bindings
- address index to set of candidates
- compare desired address with tag
- service hit or miss
- load new block and binding on miss
Lecture 26 (Caches 3)
Direct Mapped Example
Too many figures in this part, so I'm not pasting them...
- Index -> Valid -> (load data from memory if Valid is 0) -> Tag -> (load data from memory if the tags don't match) -> Offset -> return
If the dirty bit is set, the block's contents must be written back to memory when the block is replaced.
Writes, Block, Sizes, Misses
All 4 bytes of the block feed into a mux, and the offset selects which one to read.
The several checks (valid bit, tag comparison) happen at the same time.
What to do on a write hit
- Write-through (keeps cache and memory in sync)
- Update both cache and memory
- Write-back (updates the cache only)
- update word in cache block
- allow memory word to be “stale”
- add ‘dirty’ bit to block
- memory & Cache inconsistent
- needs to be updated when block is replaced
- …OS flushes cache before I/O…
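A rough C sketch contrasting the two write policies (the toy `Line` struct and `dram` array are assumptions for illustration; only write-back needs the dirty bit and the write-back-on-eviction step):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[4];
} Line;

static uint8_t dram[1 << 16];

// Write-through: update the cache block AND memory on every write.
void write_through(Line *line, uint32_t offset, uint32_t mem_addr, uint8_t value) {
    line->data[offset] = value;
    dram[mem_addr]     = value;   // cache and memory stay in sync
}

// Write-back: update only the cache block and mark it dirty;
// memory becomes "stale" until the block is evicted.
void write_back(Line *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty        = true;
}

// Called when a block is replaced: a dirty block must be copied back to memory.
void evict(Line *line, uint32_t block_start_addr) {
    if (line->dirty) {
        for (uint32_t i = 0; i < sizeof line->data; i++)
            dram[block_start_addr + i] = line->data[i];
        line->dirty = false;
    }
}

int main(void) {
    Line l = { .valid = true, .tag = 0 };
    write_back(&l, 0, 42);
    evict(&l, 0x100);   // 42 is only copied to dram[0x100] here
    return 0;
}
```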
Block Size Tradeoff
Benefits of Larger Block Size
- Spatial Locality: if we access a given word, we’re likely to access other nearby words soon
- Very applicable with Stored-Program Concept
- Works well for sequential array accesses
Drawbacks of Larger Block Size
- Larger block size means larger miss penalty
- on a miss, takes longer time to load a new block from next level
- If block size is too big relative to cache size, then there are too few blocks
- Result: miss rate goes up
Extreme Example: One Big Block
Block Size Tradeoff Conclusions
Types of Cache Misses
- “Three Cs” Model of Misses
- 1st C: Compulsory Misses
- occur when a program is first started
- cache does not contain any of that program’s data yet, so misses are bound to occur
- can’t be avoided easily, so won’t focus on these in this course
- Every block of memory will have one compulsory miss (NOT only every block of the cache)
- 2nd C: Conflict Misses
- miss that occurs because two distinct memory addresses map to the same cache location
- two blocks (which happen to map to the same location) can keep overwriting each other
- big problem in direct-mapped caches
- how do we lessen the effect of these?
- Dealing with Conflict Misses
Solution 1: Make the cache size bigger
- Fails at some point
Solution 2: Multiple distinct blocks can fit in the same cache Index?
Fully Associative Caches
- Memory address fields:
- Tag: same as before
- Offset: same as before
- Index: non-existent (must compare against every tag in the cache)
- What does this mean?
- no “rows”: any block can go anywhere in the cache
- must compare with all tags in entire cache to see if data is there
- Benefit of Fully Assoc Cache
- No Conflict Misses (since data can go anywhere)
- Drawbacks of Fully Assoc Cache
- Need hardware comparator for every single entry: if we have a 64KB of data in cache with 4B entries, we need 16K comparators: infeasible
Final Type of Cache Miss
- 3rd C: Capacity Misses
- miss that occurs because the cache has a limited size
- miss that would not occur if we increase the size of the cache
- sketchy definition, so just get the general idea
- This is the primary type of miss for Fully Associative caches.
How to categorize misses
- Run an address trace against a set of caches:
- First, consider an infinite-size, fully-associative cache. For every miss that occurs now, consider it a compulsory miss.
- Next, consider a finite-sized cache (of the size you want to examine) with full-associativity. Every miss that is not in #1 is a capacity miss.
- Finally, consider a finite-size cache with finite-associativity. All of the remaining misses that are not #1 or #2 are conflict misses.
Conclusion
Lecture 27 (Cache 4)
Set-Associative Caches
N-Way Set Associative Cache
- Memory address fields:
- Tag
- Offset: same as before
- Index: points us to the correct “row” (called a set in this case)
- each set contains multiple blocks
- once we’ve found correct set, must compare with all tags in that set to find our data
- Size of $ is # sets x N blocks/set x block size
- cache is direct-mapped w/respect to sets
- each set is fully associative with N blocks in it
- Find correct set using Index value.
- Compare Tag with all Tag values in that set.
- If a match occurs, hit!, otherwise a miss.
- Finally, use the offset field as usual to find the desired data within the block.
- even a 2-way set assoc cache avoids a lot of conflict misses
- hardware cost isn’t that bad: only need N comparators
- it’s Direct-Mapped if it’s 1-way set assoc
- it’s Fully Assoc if it’s M-way set assoc
- so these two are just special cases of the more general set associative design
Each row is one set; a mux selects which way's data to use
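A sketch of the N-way lookup just described, assuming a toy 2-way cache with 4 sets and 4-byte blocks: find the set from the index, then compare the tag against every way in that set:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS    4
#define WAYS        2
#define OFFSET_BITS 2           // 4-byte blocks
#define INDEX_BITS  2           // log2(NUM_SETS)

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << OFFSET_BITS];
} Way;

static Way cache[NUM_SETS][WAYS];

// Returns true on a hit and writes the requested byte to *out.
bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    for (int w = 0; w < WAYS; w++) {        // compare against ALL ways in the set
        Way *way = &cache[index][w];
        if (way->valid && way->tag == tag) {
            *out = way->data[offset];       // hit: the mux picks this way's data
            return true;
        }
    }
    return false;                           // miss: caller fetches from memory
}

int main(void) {
    cache[1][0] = (Way){ .valid = true, .tag = 0x5, .data = {10, 20, 30, 40} };
    uint8_t b;
    // address 0x56 has tag 0x5, index 1, offset 2
    if (lookup(0x56, &b)) printf("hit: %u\n", b);   // prints "hit: 30"
    else                  printf("miss\n");
    return 0;
}
```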
Block Replacement with Example
Block Replacement Policy
Direct-Mapped Cache
- index completely specifies which position a block can go in on a miss
N-Way Set Assoc
- index specifies a set, but the block can occupy any position within that set on a miss
Fully Associative
- block can be written into any position
Question: if we have the choice, where should we write an incoming block?
- If some way's valid bit is off, write the new block into the first invalid slot.
- If all are valid, pick a replacement policy
  - the rule for which block gets "cached out" on a miss
LRU (Least Recently Used)
- Idea: cache out the block which has been accessed (read or write) least recently
- Pro: temporal locality (recent past use implies likely future use); in fact, this is a very effective policy
- Con: with 2-way set assoc, easy to keep track (one LRU bit); with 4-way or greater, requires complicated hardware and much time to keep track of this
FIFO
- Idea: ignores accesses, just tracks the order blocks were brought in
Random
- If the workload has low temporal locality, works OK
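For the 2-way case, the single LRU bit per set mentioned above can be maintained roughly like this (hypothetical helper functions, not tied to any particular datapath):

```c
#include <stdio.h>
#include <stdint.h>

// One LRU bit per set: records which of the two ways was used least recently.
static uint8_t lru_bit[4];   // toy cache with 4 sets, 2 ways each

// Call on every access (hit or fill) to way `used` of set `set`.
void touch(int set, int used) {
    lru_bit[set] = (uint8_t)(1 - used);   // the *other* way is now least recently used
}

// On a miss with both ways valid, evict the way the LRU bit points at.
int victim(int set) {
    return lru_bit[set];
}

int main(void) {
    touch(0, 1);                              // way 1 of set 0 was just used
    printf("evict way %d\n", victim(0));      // -> way 0
    return 0;
}
```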
Average Memory Access Time (AMAT)
Big Idea
- How to choose between associativity, block size, replacement & write policy?
- Design against a performance model
- Minimize: Average Memory Access Time (AMAT)
  = Hit Time + Miss Rate × Miss Penalty
  Even on a miss, the data fetched from memory still has to be read out of the cache afterwards, so the hit time is paid on every access; that is why the formula does not multiply the hit time by the hit rate.
- influenced by technology & program behavior
- Create the illusion of a memory that is large, cheap, and fast - on average
- How can we improve miss penalty?
Analyzing Multi-level cache hierarchy
- Example
- Assume
- Hit Time = 1 cycle
- Miss rate = 5%
- Miss penalty = 20 cycles
- Calculate AMAT
  - Avg mem access time = 1 + 5% × 20 = 2 cycles
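A tiny C check of the AMAT formula with the numbers above (1-cycle hit time, 5% miss rate, 20-cycle miss penalty):

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;   // cycles
    double miss_rate    = 0.05;
    double miss_penalty = 20.0;  // cycles

    // AMAT = Hit Time + Miss Rate * Miss Penalty (hit time paid on every access)
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   // prints 2.0 cycles
    return 0;
}
```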
Ways to reduce miss rate
- Larger cache
- limited by cost and technology
- hit time of first level cache < cycle time (bigger caches are slower)
- More places in the cache to put each block of memory – associativity
- fully-associative
  - any block can go in any line
- N-way set associative
  - N places for each block
  - direct-mapped: N = 1
Typical Scale
- L1:
- size: tens of KB
- hit time: complete in one clock cycle
- miss rates: 1-5%
- L2:
- size: hundreds of KB
- hit time: few clock cycles
- miss rates: 10-20%
- L2 miss rate is fraction of L1 misses that also miss in L2
- why so high? (L1 already holds all the good stuff)
Example
Actual CPUs
- Early PowerPC
- Pentium M
- Intel core i7
Conclusion
Lecture 28 (OS & Virtual Memory Intro)
Operating System Basics
What Does the OS do?
- OS is the (first) thing that runs when computer starts
- Finds and controls all devices in the machine in a general way
- Relying on hardware specific “device drivers”
- Starts services (100+)
- File system,
- Network stack (Ethernet, WiFi, Bluetooth, …),
- TTY (keyboard),
- ...
- Loads, runs and manages programs:
- Multiple programs at the same time (time-sharing)
- Isolate programs from each other (isolation)
- Multiplex resources between applications (e.g., devices)
What Does the Core of the OS Do?
- Provide isolation between running processes
- Each program runs in its own little world
- Provide interaction with the outside world
- Interact with "devices": Disk, display, network, etc.
What Does OS Need from Hardware?
- Memory translation (different processes can use the same virtual address, mapped to different physical addresses)
- Each running process has a mapping from "virtual" to "physical" addresses that are different for each process
- When you do a load or a store, the program issues a virtual address... But the actual memory accessed is a physical address
- Protection and privilege
- Split the processor into at least two running modes: "User" and "Supervisor"
- RISC-V also has "Machine" below "Supervisor"
- Lesser privilege can not change its memory mapping
- But "Supervisor" can change the mapping for any given program
- And supervisor has its own set of mapping of virtual->physical
- Traps & Interrupts
- A way of going into Supervisor mode on demand
What Happens at Boot?
Operating System Functions
Launching Applications
- Applications are called “processes” in most OSs
- Thread: shared memory
- Process: separate memory
- Both threads and processes run (pseudo) simultaneously
- Apps are started by another process (e.g., shell) calling an OS routine (using a “syscall”)
- Depends on OS; Linux uses fork to create a new process, and execve (execute file command) to load application
- Loads executable file from disk (using the file system service) and puts instructions & data into memory (.text, .data sections), prepares stack and heap
- Set argc and argv, jump to start of main
- Shell waits for main to return (join)
Supervisor Mode
- If something goes wrong in an application, it could crash the entire machine. And what about malware, etc.?
- The OS enforces resource constraints to applications (e.g., access to memory, devices)
- To help protect the OS from the application, CPUs have a supervisor mode (e.g., set by a status bit in a special register)
- A process can only access a subset of instructions and (physical) memory when not in supervisor mode (user mode)
- Process can change out of supervisor mode using a special instruction, but not into it directly – only using an interrupt
- Supervisory mode is a bit like “superuser”
- But used much more sparingly (most of OS code does not run in supervisory mode)
- Errors in supervisory mode often catastrophic (blue “screen of death”, or “I just corrupted your disk”)
Syscalls
- What if we want to call an OS routine? E.g.,
- to read a file,
- launch a new process,
- ask for more memory (malloc),
- send data, etc.
- Need to perform a syscall:
- Set up function arguments in registers,
- Raise software interrupt (with special assembly instruction)
- The OS will take over, perform the operation, and return to user mode
- This way, the OS can mediate access to all resources, and devices
Interrupts, Exceptions
- We need to transition into Supervisor mode when "something" happens
- Interrupt: something external to the running program
  - Something happens in the outside world
- Exception: something (internal) done by the running program
  - Accessing memory it isn't "supposed" to, executing an illegal instruction, reading a CSR not allowed at that privilege level
- ECALL: Trigger an exception to the higher privilege
- How you communicate with the operating system: Used to implement "syscalls"
- EBREAK: Trigger an exception within the current privilege
Terminology (In 61C)
- Interrupt: caused by an event external to the current running program
  - E.g., key press, disk I/O
  - Asynchronous to the current program
  - Can handle the interrupt on any convenient instruction
  - "Whenever it's convenient, just don't wait too long"
- Exception: caused by some event during execution of one instruction of the current running program
  - E.g., memory error, bus error, illegal instruction, raised exception
  - Synchronous
  - Must handle the exception precisely on the instruction that causes it
  - "Drop whatever you are doing and act now"
- Trap: the action of servicing an interrupt or exception by a hardware jump to "interrupt or trap handler" code
Precise Traps
- Trap handler’s view of machine state is that every instruction prior to the trapped one (e.g., memory error) has completed, and no instruction after the trap has executed.
- Implies that handler can return from an interrupt by restoring user registers and jumping back to interrupted instruction
- Interrupt handler software doesn’t need to understand the pipeline of the machine, or what program was doing!
- More complex to handle trap caused by an exception than interrupt
- Providing precise traps is tricky in a pipelined superscalar out-of-order processor!
- But a requirement for things to actually work right!
Trap Handling
- Exceptions are handled like pipeline hazards
- Complete execution of instructions before exception occurred
- Flush instructions currently in pipeline (i.e., convert to nops or “bubbles”)
- Optionally store exception cause in status register
- Indicate type of exception
- Transfer execution to trap handler
- Optionally, return to original program and re-execute instruction
Multiprogramming
- The OS runs multiple applications at the same time
- But not really (unless you have a core per process)
- Switches between processes very quickly (on human time scale) – this is called a “context switch”
- When jumping into process, set timer (we will call this ‘interrupt’)
- When it expires, store PC, registers, etc. (process state)
- Pick a different process to run and load its state
- Set timer, change to user mode, jump to the new PC
- Deciding what process to run is called scheduling
Protection, Translation, Paging
- Supervisor mode alone is not sufficient to fully isolate applications from each other or from the OS
- Application could overwrite another application’s memory.
- Typically programs start at some fixed address, e.g. 0x8FFFFFFF
- How can 100’s of programs share memory at location 0x8FFFFFFF?
- Also, may want to address more memory than we actually have (e.g., for sparse data structures)
- Solution: Virtual Memory
- Gives each process the illusion of a full memory address space that it has completely for itself
Lecture 29 (Virtual Memory 1)
Virtual Memory Concepts
Virtual Memory
- Virtual memory - Next level in the memory hierarchy:
- Provides program with illusion of a very large main memory: Working set of “pages” reside in main memory - others are on disk
- Demand paging: Provides the ability to run programs larger than the primary memory (DRAM) (we can use disk as an extension of DRAM)
- Hides differences between machine configurations
- Also allows OS to share memory, protect programs from each other (every process is its own world and the OS is there to orchestrate that)
- Today, more important for protection than just another level of memory hierarchy
- Each process thinks it has all the memory to itself
- (Historically, it predates caches)
Address Spaces
- Address space = set of addresses for all available memory locations
- Now, two kinds of memory addresses:
- Virtual Address Space
- Set of addresses that the user program knows about
- Physical Address Space
- Set of addresses that map to actual physical locations in memory
- Hidden from user applications
- Memory manager maps (‘translates’) between these two address spaces
Analogy
- Book title like virtual address
- Library of Congress call number like physical address
- Card catalogue like page table, mapping from book title to call #
- On card for book, in local library vs. in another branch like valid bit indicating in main memory vs. on disk (storage)
- On card, available for 2-hour in library use (vs. 2-week checkout) like access rights
Memory Hierarchy Requirements
- Allow multiple processes to simultaneously occupy memory and provide protection – don’t let one program read/write memory from another
- Address space – give each program the illusion that it has its own private memory
- Suppose code starts at address 0x40000000. But different processes have different code, both residing at the same address. So each program has a different view of memory.
Physical Memory and Storage
- If we read or write nearby locations in DRAM, each successive access is much faster.
- DRAM usually supports a so-called burst mode, meaning we can typically do bursts of 8 or 16 reads or writes (delivering 8 or 16 times 64 bits, enough to fill a cache line).
- DRAM is a bit less stable than SRAM: SRAM retains its data as long as power is on, but DRAM forgets; the data is stored dynamically as charge, and that charge can leak out of a DRAM cell. (A controller must refresh it periodically to retain the data, which is why it is called dynamic random access memory.)
How a hard drive works
What About SSD?
- Made with transistors
- Nothing mechanical that turns
- Like “Ginormous” register file
- Does not ”forget” when power is off (non-volatile)
- Fast access to all locations, regardless of address
- Still much slower than register, DRAM
- Read/write blocks, not bytes
- Some unusual requirements:
- Can’t erase single bits – only entire blocks
Flash is a bit like DRAM in that it is built out of cells, but they are non-volatile: the charge is trapped inside the cell, so the data is retained after power is off.
Its organization is a lot like DRAM, but the cells are packed more densely. Operations work on blocks rather than on words or bytes (modifying a word or byte requires rewriting the whole block).
```mermaid
graph LR
  A[Flash memory] --> B[Charge-trap storage]
  C[DRAM - dynamic random-access memory] --> D[Capacitor charge/discharge storage]
```
Memory Manager
Virtual Memory
Memory access on a bare-metal system
Key problem: in a bare-metal environment with no OS, programs access memory directly with physical addresses.
Example risks:
- A malicious program can freely read and write kernel data structures (e.g., the interrupt vector table)
- A driver bug can overwrite other hardware-mapped regions (e.g., GPU memory)
- No isolation: all processes share the same physical address space
- No permission control: no way to distinguish user-mode from kernel-mode memory regions
Therefore we need an
address translation mechanism
- Virtualization: processes use virtual addresses, and the hardware automatically translates them to physical addresses
- Permission checking: the permission bits in each page table entry are checked
- Isolation: different processes' virtual address spaces are independent of each other
When context-switching frequently between processes that share a single memory, saving only each process's memory image is not enough, because all processes share the same physical address space and would clash. So an address translation mechanism (such as page tables) is needed to isolate each process's memory accesses through virtualization.
The memory manager is the thing that maps virtual addresses to physical addresses.
Each process thinks it runs from address zero up to the top of memory, but it actually uses only part of memory, and those parts need not be contiguous (in practice, they are interleaved with other processes' parts).
Responsibilities of Memory Manager
- Map virtual to physical addresses
- Protection:
  - Isolate memory between processes
  - Each process gets dedicated "private" memory
  - Errors in one program won't corrupt memory of another program
  - Prevent user programs from messing with the OS's memory
- Swap memory to disk
  - Give the illusion of larger memory by storing some content on disk
  - Disk is usually much larger and slower than DRAM
  - Use "clever" caching strategies
Paged Memory
How is swapping performed?
To keep things simple, we move fixed-size chunks of data between memory and disk; such a chunk is called a page.
Each process is given a page table (managed by the OS) that points to physical addresses.
If a page is shared by multiple processes, the OS manages that (a shared flag bit is needed).
Another bit marks a page as protected.
A write that is not allowed raises an exception.
Where Do Page Tables Reside?
- E.g., 32-Bit virtual address, 4-KiB pages
- Single page table size:
  - 2^(32−12) = 2^20 entries × 4 bytes each = 4 MiB
  - 0.1% of 4-GiB memory
  - But much too large for a cache!
- Store page tables in memory (DRAM)
  - Two (slow) memory accesses per lw/sw on a cache miss (1. fetch the page table entry, 2. perform the lw/sw using it)
  - How could we minimize the performance penalty?
    - Transfer blocks (not words) between DRAM and the processor cache
      - Exploit spatial locality
    - Use a cache for frequently used page table entries ...
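A quick check of the page table size arithmetic above (32-bit virtual addresses, 4 KiB pages, assuming 4-byte page table entries):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int      va_bits     = 32;
    int      page_bits   = 12;                              // 4 KiB pages
    uint64_t num_entries = 1ULL << (va_bits - page_bits);   // 2^20 entries
    uint64_t pte_bytes   = 4;                               // assumed 4-byte PTE

    uint64_t table_bytes = num_entries * pte_bytes;         // 4 MiB
    printf("page table: %llu entries, %llu KiB\n",
           (unsigned long long)num_entries,
           (unsigned long long)(table_bytes >> 10));
    return 0;
}
```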
Page Faults
Blocks vs. Pages
- In caches, we dealt with individual blocks
- Usually ~64B on modern systems
- In VM, we deal with individual pages
- Usually ~4 KiB on modern systems
- Common point of confusion:
- Bytes,
- Words,
- Blocks,
- Pages
- Are all just different ways of looking at memory!
Bytes, Words, Blocks, Pages
Review: Analogy
- Book title like virtual address
- Library of Congress call number like physical address
- Card catalogue like page table, mapping from book title to call #
- On card for book, in local library vs. in another branch like valid bit indicating in main memory vs. on disk (storage)
- On card, available for 2-hour in library use (vs. 2-week checkout) like access rights
Memory Access
- Check the page table entry:
  - Valid?
    - Yes, valid. In DRAM?
      - Yes, in DRAM: read/write the data
      - No, on disk: allocate a new page in DRAM (page fault, OS intervention)
        - If out of memory, evict a page from DRAM
          - Store the evicted page to disk
        - Read the page from disk into memory
        - Read/write the data
    - Not valid (page fault, OS intervention)
      - Allocate a new page in DRAM
        - If out of memory, evict a page
      - Read/write the data
Page Fault
- Page faults are treated as exceptions
  - The page fault handler (yet another function of the interrupt/trap handler) does the page table updates and initiates transfers
  - Updates status bits
  - (If a page needs to be swapped in from disk, perform a context switch)
  - Following the page fault, the instruction is re-executed
```mermaid
graph TD
  A[Page fault exception raised] --> B{Enter the page fault handler}
  B --> C[Check whether the access is legal]
  C -->|Illegal access| D[Terminate the process, SIGSEGV]
  C -->|Legal access| E{Is the page in memory?}
  E -->|Already loaded| F[Update page table entry status bits]
  E -->|Not loaded| G[Allocate a physical page frame]
  G --> H{Need to swap in from disk?}
  H -->|Yes| I[Issue a disk I/O request]
  I --> J[Block the current process]
  J --> K[Perform a context switch]
  K --> L[Schedule another process to run]
  H -->|No| M[Initialize a zero page]
  M --> F
  F --> N[Set the Present / Dirty / Accessed bits]
  N --> O[Return to user mode]
  O --> P[Re-execute the faulting instruction]
  classDef process fill:#f9f,stroke:#333;
  classDef exception fill:#f96,stroke:#333;
  classDef handler fill:#6f9,stroke:#333;
  classDef disk fill:#69f,stroke:#333;
  classDef memory fill:#9f6,stroke:#333;
  class A exception;
  class B,C,E,F,N handler;
  class I,J,K,L disk;
  class G,M memory;
```
Write-Through or Write-Back?
- DRAM acts like “cache” for disk
- Should writes go directly to disk (write-through)?
- Or only when page is evicted?
- Which option do you propose?
Lecture 30 (Virtual Memory 2)
Hierarchical Page Tables
Virtual address structure
- The virtual address is split into three parts:
  - p1 (10-bit L1 index): indexes into the level-1 page table.
  - p2 (10-bit L2 index): indexes into the level-2 page table.
  - offset (12-bit): locates the byte within the data page.
Page table structure
- Root of the current page table: stored in a processor register; it is the entry point into the current page table.
- Level 1 page table: indexed by the 10-bit p1 field (1024 entries); each entry points to a level-2 page table.
- Level 2 page tables: indexed by the 10-bit p2 field (1024 entries); each entry points to a 4096-byte data page.
- Physical memory: holds the data pages that are currently mapped in.
Key components:
- Supervisor Page Table Base Register (SPTBR): holds the root address of the current page table.
- Main memory (DRAM): holds the pages that are currently resident.
- Secondary storage (disk): holds the pages that have been swapped out.
- The PTE of a non-resident page indicates that the page is not mapped into memory.
Because the paging scheme is so rigid (an exact number of bits for the offset and for each level of the page table), the whole tree can be walked by fixed rules as long as we know where to start.
On a context switch, when we want to switch to a new process, all we have to do is restore that process's saved state and point SPTBR (a privileged register that holds the base address of the current process's page table) at its page table.
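A sketch of how the 10/10/12 split carves up a 32-bit virtual address (pure bit manipulation on an arbitrary example address; a real page-table walk would then index the L1 table with p1, follow that entry to an L2 table, index it with p2, and append the offset):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t va = 0x00403ABC;   // arbitrary example virtual address

    uint32_t offset = va & 0xFFF;          // low 12 bits: byte within the 4-KiB page
    uint32_t p2     = (va >> 12) & 0x3FF;  // next 10 bits: index into the L2 page table
    uint32_t p1     = (va >> 22) & 0x3FF;  // top 10 bits: index into the L1 page table

    printf("va = 0x%08X -> p1 = %u, p2 = %u, offset = 0x%03X\n", va, p1, p2, offset);
    return 0;
}
```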
Translation Lookaside Buffers (speed up memory access)
cache some of the memory translations in the TLB.
TLB Designs
- Typically 32-128 entries, usually fully associative
- Each entry maps a large page, hence there is less spatial locality across pages ⇒ it is more likely that two entries conflict
- Sometimes larger TLBs (256-512 entries) are 4-8 way set-associative
- Larger systems sometimes have multi-level (L1 and L2) TLBs
- Random or FIFO replacement policy
- “TLB Reach”: Size of largest virtual address space that can be simultaneously mapped by TLB
TLBs in Datapath
Review: Context Switching
- How does a single processor run many programs at once?
- Context switch: Changing of internal state of processor (switching between processes)
- Save register values (and PC) and change value in Supervisor Page Table Base register (SPTBR)
- What happens to the TLB?
- Current entries are for different process
- Set all entries to invalid on context switch
VM Performance
VM Performance
- Virtual Memory is the level of the memory hierarchy that sits below main memory
- TLB comes before cache, but affects transfer of data from disk to main memory
- Previously we assumed main memory was lowest level, now we just have to account for disk accesses
- Same CPI, AMAT equations apply, but now treat main memory like a mid-level cache
Impact of Paging on AMAT
- Memory Parameters:
- L1 cache hit = 1 clock cycle, hit on 95% of accesses
- L2 cache hit = 10 clock cycles, hit 60% of L1 misses
- DRAM = 200 clock cycles (≈100 nanoseconds)
- Disk = 20,000,000 clock cycles (≈10 milliseconds)
- Average Memory Access Time (no paging):
- 1 + 5%×10 + 5%×40%×200 = 5.5 clock cycles
- Average Memory Access Time (with paging):
- 5.5 (AMAT with no paging) + ?
- Average Memory Access Time (with paging) =
  - 5.5 + 5% × 40% × (1 − HR_Mem) × 20,000,000, where HR_Mem is the fraction of DRAM accesses that do not page fault
- AMAT if HR_Mem = 99%?
  - 5.5 + 0.02 × 0.01 × 20,000,000 = 4005.5 clock cycles (≈728x slower)
  - 1 in 5,000 memory accesses goes to disk: a 10-second program takes 2 hours!
- AMAT if HR_Mem = 99.9%?
  - 5.5 + 0.02 × 0.001 × 20,000,000 = 405.5
- AMAT if HR_Mem = 99.9999%?
  - 5.5 + 0.02 × 0.000001 × 20,000,000 = 5.9
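A small C version of the calculation above, with the page hit rate (the fraction of DRAM accesses that do not page fault) as a parameter; the three rates match the cases worked out on the slide:

```c
#include <stdio.h>

int main(void) {
    double l1_hit  = 1.0;     // cycles
    double l2_hit  = 10.0;    // cycles
    double dram    = 200.0;   // cycles
    double disk    = 20e6;    // cycles (~10 ms)
    double l1_miss = 0.05;    // 5% of accesses miss in L1
    double l2_miss = 0.40;    // 40% of L1 misses also miss in L2

    // AMAT with no paging: 1 + 0.05*10 + 0.05*0.40*200 = 5.5 cycles
    double base = l1_hit + l1_miss * l2_hit + l1_miss * l2_miss * dram;

    double page_hit_rates[] = {0.99, 0.999, 0.999999};
    for (int i = 0; i < 3; i++) {
        double hr   = page_hit_rates[i];
        double amat = base + l1_miss * l2_miss * (1.0 - hr) * disk;
        printf("page hit rate %.6f -> AMAT = %.1f cycles\n", hr, amat);
    }
    return 0;
}
```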
That's a lot of notes already, so this one stops here. I'll start a new note for what comes next.