These are notes for CS61C. Since they were taken live during lectures, I often got lazy and pasted screenshots in many places...
Lecture 31 (I/O Devices)
I/O Devices
Connect I/O devices in a general way (connected through some hierarchical bus, with a standardized interface made up of command and status registers)
- Instruction Set Architecture for I/O
- What must the processor do for I/O?
- Input: Read a sequence of bytes
- Output: Write a sequence of bytes
- Interface options
a) Special input/output instructions & hardware
b) Memory mapped I/O – Portion of address space dedicated to I/O (command and status registers)
- I/O device registers there (no memory)
- Use normal load/store instructions, e.g. lw/sw
- Very common, used by RISC-V
- What must the processor do for I/O?
Each I/O device has a copy of its registers mapped into this region of the address space.
Designs that exceed a single core's read/write bandwidth exist to support higher-end processors.
I/O Polling
Polling: Processor Checks Status, Then Acts
Needed because of the speed mismatch between processor and I/O device
- Device registers generally serve two functions:
  - Control Register, says it’s OK to read/write (I/O ready) [think of a flagman on a road]
  - Data Register, contains data
- Processor reads from Control Register in loop
- Waiting for device to set Ready bit in Control reg (0 → 1)
- Indicates “data available” or “ready to accept data”
- Processor then loads from (input) or writes to (output) data register
- I/O device resets control register bit (1 → 0)
- Procedure called “Polling” (the processor repeatedly and periodically checks the control register) — a minimal polling-loop sketch follows
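A minimal C sketch of such a polling loop. The register addresses and the position of the Ready bit are invented placeholders for illustration, not a real device spec:

#include <stdint.h>

// Hypothetical memory-mapped device registers (addresses are placeholders)
#define DEV_CONTROL ((volatile uint32_t *)0x10000000)
#define DEV_DATA    ((volatile uint32_t *)0x10000004)
#define READY_BIT   0x1   // assumed position of the Ready bit

// Polling read: spin on the control register until the device sets Ready,
// then read the data register with an ordinary load (lw)
uint32_t poll_read(void) {
    while ((*DEV_CONTROL & READY_BIT) == 0)
        ;                  // busy-wait: this loop is the cost of polling
    return *DEV_DATA;      // device resets Ready (1 -> 0) once the data is consumed
}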
Cost of Polling? (overhead)
- Assume for a processor with
- 1 GHz clock rate
- Taking 400 clock cycles for a polling operation
- Call polling routine
- Check device (e.g., keyboard or WiFi input available)
- Return
- What’s the percentage of processor time spent polling?
- Example:
- Mouse
- Poll 30 times per second
- Set by requirement not to miss any mouse motion (which would lead to choppy motion of the cursor on the screen)
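Working out the mouse case with the numbers above: 30 polls/s × 400 cycles/poll = 12,000 cycles/s, which on a 1 GHz processor is 12,000 / 10^9 ≈ 0.0012% of processor time — polling a mouse is essentially free. Devices that must be polled far more often would eat a noticeable fraction, which is what motivates interrupts and DMA below.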
I/O Interrupts
Alternatives to Polling: Interrupts
- Polling wastes processor resources
- Akin to waiting at the door for guests to show up
- What about a bell?
- Computer lingo for bell:
- Interrupt
- Occurs when I/O is ready or needs attention
- Interrupt current program
- Transfer control to the trap handler in the operating system
- Interrupts:
- No I/O activity: Nothing to do
- Lots of I/O: Expensive – thrashing caches, VM, saving/restoring state
Polling, Interrupts and DMA
- Low data rate (e.g. mouse, keyboard)
- Use interrupts. Could poll with the timer interrupt, but why?
- Overhead of interrupts ends up being low
- High data rate (e.g. network, disk)
- Start with interrupts...
- If there is no data, you don't do anything!
- Once data starts coming... Switch to Direct Memory Access (DMA)
Aside: Programmed I/O
- “Programmed I/O”:
- Standard for ATA hard-disk drives
- CPU execs lw/sw instructions for all data movement to/from devices
- CPU spends time doing two things:
- Getting data from device to main memory
- Using data to compute
- Not ideal because …
- CPU has to execute all transfers, could be doing other work
- Device speeds don’t align well with CPU speed
- Energy cost of using beefy general-purpose CPU where simpler hardware would suffice
- Until now CPU has sole control of main memory
- 5% of CPU cycles on Google Servers spent in memcpy() and memmove() library routines!*
*Kanev et al., “Profiling a warehouse-scale computer,” ISCA 2015, (June 2015), Portland, OR.
DMA
Direct Memory Access (DMA)
The DMA engine's registers hold the memory address where data should be placed, the number of bytes to transfer, which I/O device is involved, the direction of the transfer (to or from memory), the unit of transfer, and how much should be transferred per burst.
- Allows I/O devices to directly read/write main memory
- New hardware: The DMA Engine
- DMA engine contains registers written by CPU:
- Memory address to place data
- # of bytes
- I/O device #, direction of transfer
- unit of transfer, amount to transfer per burst
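A hedged C sketch of the registers listed above as the CPU might program them. The field names and layout are illustrative only, not a real DMA controller's specification:

#include <stdint.h>

// Illustrative register block of a DMA engine (layout is an assumption)
typedef struct {
    volatile uint64_t mem_addr;    // memory address to place / fetch data
    volatile uint32_t num_bytes;   // # of bytes to transfer
    volatile uint32_t device_id;   // which I/O device
    volatile uint32_t direction;   // 0 = device -> memory, 1 = memory -> device
    volatile uint32_t burst_size;  // unit of transfer / amount per burst
    volatile uint32_t start;       // write 1 to kick off the transfer
} dma_engine_t;

// CPU side of "DMA: Incoming Data": program the engine, then go run other work;
// the device/DMA engine interrupts the CPU again when the transfer completes.
void dma_start_incoming(dma_engine_t *dma, void *buf, uint32_t len, uint32_t dev) {
    dma->mem_addr   = (uint64_t)(uintptr_t)buf;
    dma->num_bytes  = len;
    dma->device_id  = dev;
    dma->direction  = 0;
    dma->burst_size = 64;
    dma->start      = 1;
}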
DMA: Incoming Data
- Receive interrupt from device
- CPU takes interrupt, initiates transfer
- Instructs DMA engine/device to place data @ certain address
- Device/DMA engine handle the transfer
- CPU is free to execute other things
- Upon completion, Device/DMA engine interrupt the CPU again
DMA: Outgoing Data
- CPU decides to initiate transfer, confirms that external device is ready
- CPU begins transfer
- Instructs DMA engine/device that data is available @ certain address
- Device/DMA engine handle the transfer
- CPU is free to execute other things
- Device/DMA engine interrupt the CPU again to signal completion
DMA: Some New Problems
- Where in the memory hierarchy do we plug in the DMA engine? Two extremes:
- Between L1$ and CPU:
- Pro: Free coherency
- Con: Trash the CPU’s working set with transferred data
- Between Last-level cache and main memory:
- Pro: Don’t mess with caches
- Con: Need to explicitly manage coherency
Challenge: requires implementing a new coherency engine.
Networking
Networks: Talking to the Outside World
- Originally sharing I/O devices between computers
- E.g., printers
- Then communicating between computers
- E.g., file transfer protocol
- Then communicating between people
- E.g., e-mail
- Then communicating between networks of computers
- E.g., file sharing, www, …
Conclusion
- We have figured out how computers work!
- And figured out how the OS works and how to interact with it
- We have built a virtual memory system
- And have developed understanding of physical memory, storage devices
- And we can attach peripherals for I/O!
Lecture 32 (Flynn Taxonomy, SIMD Instructions)
Parallelism
Reference Problem: Matrix Multiplication
- Matrix multiplication
- Basic operation in many engineering, data, and image processing tasks
- Image filtering, noise reduction, machine learning…
- Many closely related operations
Matrix Multiplication
C
void dgemm_scalar(int N, double *a, double *b, double *c)
{
for (int i=0; i<N; i++)
{
for (int j=0; j<N; j++)
{
double cij = 0;
for (int k=0; k<N; k++)
{
// a[i][k] * b[k][j]
cij += a[i+k*N] * b[k+j*N];
}
// c[i][j]
c[i+j*N] = cij;
}
}
}
Timing Program Execution
- GFLOPS (Giga Floating-point Operations Per Second): an important measure of computer performance; 1 GFLOPS = 10^9 floating-point operations per second
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(void){
    // start time
    // Note: clock() measures execution time, not real time
    // big difference in shared computer environments
    // and with heavy system load
    clock_t start = clock();
    // task to time goes here:
    // dgemm(N, ...);
    // "stop" the timer
    clock_t end = clock();
    // compute execution time in seconds
    double delta_time = (double)(end - start) / CLOCKS_PER_SEC;
    printf("%f seconds\n", delta_time);
    return 0;
}
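To turn the measured time into the GFLOPS figure defined above, note that an N×N dgemm performs 2N³ floating-point operations (one multiply and one add per innermost-loop iteration). A small helper, as a sketch:

#include <stdio.h>

// GFLOPS for an N x N matrix multiply: 2*N^3 floating-point operations
void report_gflops(int N, double seconds) {
    double flops = 2.0 * (double)N * N * N;
    printf("%.2f GFLOPS\n", flops / seconds / 1e9);
}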
C is much faster than Python.
Flynn's Taxonomy
SIMD Architectures
SIMD Array Processing
Matrix Multiply Example
#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h> // header for the SSE2 instruction set
// NOTE: vector registers will be represented in
//comments as v1 = [ a | b]
// where v1 is a variable of type __m128d and a, b are doubles
int main(void) {
// allocate A,B,C aligned on 16-byte boundaries
double A[4] __attribute__ ((aligned (16))); // aligned to a 16-byte boundary
double B[4] __attribute__ ((aligned (16)));
double C[4] __attribute__ ((aligned (16)));
int lda = 2;
int i = 0;
// declare several 128-bit vector variables
__m128d c1,c2,a,b1,b2;
// Initialize A, B, C for example
/* A = (note column order!)
1 0
0 1
*/
A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
/* B = (note column order!)
1 3
2 4
*/
B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
/* C = (note column order!)
0 0
0 0
*/
C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
// use aligned loads to set
c1 = _mm_load_pd(C+0*lda); // c1 = [c_11 | c_21]
c2 = _mm_load_pd(C+1*lda); // c2 = [c_12 | c_22]
for (i = 0; i < 2; i++) { /* a =
i = 0: [a_11 | a_21]
i = 1: [a_12 | a_22]
*/
a = _mm_load_pd(A+i*lda); /* b1 =
i = 0: [b_11 | b_11]
i = 1: [b_21 | b_21]
*/
b1 = _mm_load1_pd(B+i+0*lda); /* b2 =
i = 0: [b_12 | b_12]
i = 1: [b_22 | b_22]
*/
b2 = _mm_load1_pd(B+i+1*lda);
/* c1 =
i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
*/
c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
/* c2 =
i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
*/
c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
}
// store c1,c2 back into C for completion
_mm_store_pd(C+0*lda,c1);
_mm_store_pd(C+1*lda,c2);
// print C
printf("%g,%g\n%g,%g\n",C[0],C[2],C[1],C[3]);
return 0;
}
Conclusion
- Flynn Taxonomy of Parallel Architectures
- SIMD: Single Instruction Multiple Data
- MIMD: Multiple Instruction Multiple Data
- SISD: Single Instruction Single Data
- MISD: Multiple Instruction Single Data (unused)
- Intel AVX SIMD Instructions
- One instruction fetch that operates on multiple operands simultaneously
- 512/256/128/64-bit AVX registers
- Use C intrinsics
Lecture 33 (Thread-level Parallelism 1)
Parallel Computer Architectures
Improving Performance
- Increase clock rate
- Reached practical maximum for today’s technology
- < 5GHz for general purpose computers
- Lower CPI (cycles per instruction)
- SIMD, “instruction level parallelism”
- Perform multiple tasks simultaneously
- Multiple CPUs, each executing different program
- Tasks may be related
- E.g. each CPU performs part of a big matrix multiplication
- or unrelated
- E.g. distribute different web http requests over different computers
- E.g. run pptx (view lecture slides) and browser (youtube) simultaneously
- Do all of the above:
- High clock rate, SIMD, multiple parallel tasks
Multiprocessor Execution Model
- Each processor (core) executes its own instructions
- Separate resources (not shared)
  - Datapath (PC, registers, ALU)
  - Highest level caches (e.g., 1st and 2nd)
- Shared resources
  - Memory (DRAM)
  - Often 3rd level cache
    - Often on same silicon chip, but not a requirement
- Nomenclature
- “Multiprocessor Microprocessor”
- Multicore processor
- E.g., four core CPU (central processing unit)
- Executes four different instruction streams simultaneously
Multicore
Multiprocessor Execution Model
- Shared memory
- Each “core” has access to the entire memory in the processor
- Special hardware keeps caches consistent (next lecture!)
- Advantages:
- Simplifies communication in program via shared variables
- Drawbacks
- Does not scale well:
- “Slow” memory shared by many “customers” (cores)
- May become bottleneck (Amdahl’s Law)
- Two ways to use a multiprocessor:
- Job-level parallelism
- Processors work on unrelated problems
- No communication between programs
- Partition work of single task between several cores
- E.g., each performs part of large matrix multiplication
Parallel Processing
- It's difficult
- It's inevitable
- Only path to increase performance
- Only path to lower energy consumption (improve battery life)
- In mobile systems (e.g., smart phones, tablets)
- Multiple cores
- Dedicated processors, e.g.,
- Motion processor, image processor, neural processor in iPhone 8 + X
- GPU (graphics processing unit)
- Warehouse-scale computers (next week!)
- Multiple “nodes”
- “Boxes” with several CPUs, disks per box
- MIMD (multi-core) and SIMD (e.g. AVX) in each node
Threads
Programs Running on a typical Computer
unix% ps -x
Threads
Thoughts about Threads
“Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly non-deterministic, and the job of the programmer becomes one of pruning that nondeterminism.”
— The Problem with Threads, Edward A. Lee, UC Berkeley, 2006
Operating System Threads
Give illusion of many “simultaneously” active threads
- Multiplex software threads onto hardware threads:
a) Switch out blocked threads (e.g., cache miss, user input, network access)
b) Timer (e.g., switch active thread every 1 ms) — avoids the cost of one thread stalling (other ready work can run in the meantime)
- Remove a software thread from a hardware thread by
a) Interrupting its execution
b) Saving its registers and PC to memory
- Start executing a different software thread by
a) Loading its previously saved registers into a hardware thread’s registers
b) Jumping to its saved PC
Multithreading
Multithreading
- Typical scenario:
- Active thread encounters cache miss
- Active thread waits ~1000 cycles for data from DRAM → switch it out and run a different thread until the data is available
- Problem
  - Must save current thread state and load new thread state
    - PC, all registers (could be many, e.g. AVX)
  - Must perform switch in ≪ 1000 cycles
- Can hardware help?
- Moore’s Law: transistors are plenty
one core, two threads
Multithreading
- Logical threads
- ≈ 1% more hardware
- ≈ 10% (?) better performance
- Separate registers
- Share datapath, ALU(s), caches
- Multicore
- => Duplicate Processors
- 50% more hardware
- 2X better performance?
- Modern machines do both
- Multiple cores with multiple threads per core
Review: Definitions
- Thread Level Parallelism
- Thread: sequence of instructions, with own program counter and processor state (e.g., register file)
- Multicore:
- Physical CPU: one thread (at a time) per CPU; the OS switches threads in software, typically in response to I/O events like disk read/write
- Logical CPU: Fine-grain thread switching, in hardware, when thread blocks due to cache miss/memory access
- Hyper-Threading aka Simultaneous Multithreading(SMT): Exploit superscalar architecture to launch instructions from different threads at the same time!
Conclusion
- Sequential software execution speed is limited
- Clock rates flat or declining
- Parallelism the only path to higher performance
- SIMD: instruction level parallelism
- Implemented in all high perf. CPUs today (x86, ARM, …)
- Partially supported by compilers
- 2X width every 3-4 years
- MIMD: thread level parallelism
- Multicore processors
- Supported by Operating Systems (OS)
- Requires programmer intervention to exploit at single program level (we see later)
- Add 2 cores every 2 years (2, 4, 6, 8, 10, …)
- Intel Xeon W-3275: 28 Cores, 56 Threads
- SIMD & MIMD for maximum performance
- Key challenge: craft parallel programs with high performance on multiprocessors as # of processors increase – i.e., that scale
- Scheduling, load balancing, time for synchronization, overhead communication
Lecture 34 (Thread-level Parallelism 2)
Parallel Programming Languages
Why So Many Parallel Programming Languages?
- Why “intrinsics”?
- TO Intel: fix your #()&$! compiler, thanks…
- It’s happening ... but
- SIMD features are continually added to compilers (Intel, gcc)
- Intense area of research
- Research progress:
- 20+ years to translate C into good (fast!) assembly
- How long to translate C into good (fast!) parallel code?
- General problem is very hard to solve
- Your opportunity to become famous!
Parallel Programming Languages
- Number of choices is indication of
- No universal solution
- Needs are very problem specific
- E.g.,
- Scientific computing/machine learning (matrix multiply)
- Webserver: handle many unrelated requests simultaneously
- Input / output: it’s all happening simultaneously!
- Specialized languages for different tasks
- Some are easier to use (for some problems)
- None is particularly “easy” to use
- 61C
- Parallel language examples for high-performance computing
- OpenMP
OpenMP
Parallel for in OpenMP
#include <omp.h>
#pragma omp parallel for
for (int i=0; i<100; i++) {
…
}
OpenMP
- C extension: no new language to learn
- Multi-threaded, shared-memory parallelism
- Compiler Directives, #pragma
- Runtime Library Routines, #include <omp.h>
- #pragma
- Ignored by compilers unaware of OpenMP
- Same source for multiple architectures
- E.g., same program for 1 & 16 cores
- Only works with shared memory
What Kind of Threads?
- OpenMP threads are operating system (software) threads
- OS will multiplex requested OpenMP threads onto available hardware threads
- Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing
- But other tasks on machine compete for hardware threads!
- 5AM?
- Job queue?
Computing π
Multiple threads update pi at the same time, so their writes overwrite each other (a data race).
Solution: a lock — see the sketch below.
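A minimal sketch of the problem (my own illustration, not the lecture's exact code): every thread accumulates into the shared variable pi, and without protection the read-modify-write updates can overwrite each other. Wrapping the update in a critical section (or a lock, next section) serializes it:

#include <stdio.h>
#include <omp.h>

// Approximate pi by integrating 4/(1+x^2) over [0,1] with the midpoint rule
int main(void) {
    const long num_steps = 1000000;
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;                      // shared accumulator: the racy variable

    #pragma omp parallel for
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        #pragma omp critical              // without this, "pi += ..." is a data race
        pi += 4.0 / (1.0 + x * x) * step;
    }

    printf("pi ~= %f\n", pi);
    return 0;
}

Serializing every update this way throws away most of the parallel benefit (Amdahl); OpenMP's reduction clause, mentioned later, is the usual fix.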
Synchronization
Locks
- Computers use locks to control access to shared resources
- Serves purpose of microphone in example
- Also referred to as “semaphore”
- Usually implemented with a variable
- int lock;
- 0 for unlocked
- 1 for locked
Synchronization with Locks
// wait for lock released
while (lock!=0);
// lock == 0 now (unlocked)
lock = 1;
// access shared resource ...
// e.g. pi
// sequential execution! (Amdahl ...)
// release lock
lock = 0;
Conclusion
- OpenMP as simple parallel extension to C
- Threads level programming with parallel for pragma
- ≈ C: small so easy to learn, but not very high level and it’s easy to get into trouble
- Race conditions – result of program depends on chance (bad)
- Need assembly-level instructions to help with lock synchronization
- …next time
Lecture 35 (Thread-level Parallelism 3)
Hardware Synchronization
Review: OpenMP Building Block: for loop
- OpenMP as simple parallel extension to C
- Threads level programming with parallel for pragma
- ≈ C: small so easy to learn, but not very high level and it’s easy to get into trouble
for (i=0; i<max; i++) zero[i] = 0;
- Breaks for loop into chunks, and allocate each to a separate thread
- e.g. if max = 100 with 2 threads:
assign 0-49 to thread 0, and 50-99 to thread 1
- Must have relatively simple “shape” for an OpenMP-aware compiler to be able to parallelize it
- Necessary for the run-time system to be able to determine how many of the loop iterations to assign to each thread
- No premature exits from the loop allowed
- i.e. No break, return, exit, goto statements
In general, don’t jump outside of any pragma block
Review: Data Races and Synchronization
- Two memory accesses form a data race if they come from different threads, access the same location, at least one is a write, and they occur one after another
- If there is a data race, result of program varies depending on chance (which thread first?)
- Avoid data races by synchronizing writing and reading to get deterministic behavior
- Synchronization done by user-level routines that rely on hardware synchronization instructions
Hardware Synchronization
- Solution:
- Atomic read/write
- Read & write in single instruction
- No other access permitted between read and write
- Note:
- Must use shared memory (multiprocessing)
- Common implementations:
- Atomic swap of register ↔ memory
- Pair of instructions for “linked” read and write
- write fails if memory location has been “tampered” with after linked read
- RISC-V has variations of both, but for simplicity we will focus on the former
RISC-V Atomic Memory Operations (AMOs)
rd receives the original value stored at the address in rs1; the memory at that address is updated to *rs1 + rs2 (for amoadd; other AMOs apply a different operation, e.g. swap).
The read-modify-write happens atomically, so concurrent threads cannot race on the lock variable — a sketch of a lock built on an atomic swap follows.
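A hedged C sketch of how an atomic swap repairs the broken lock from the previous lecture. It uses C11 atomics; on RISC-V with the A extension, the exchange typically compiles down to an amoswap instruction:

#include <stdatomic.h>

atomic_int lock = 0;   // 0 = unlocked, 1 = locked

void acquire(void) {
    // Atomically swap 1 into the lock and get the old value back.
    // If the old value was 1, someone else holds the lock: keep retrying.
    while (atomic_exchange(&lock, 1) != 0)
        ;   // spin
}

void release(void) {
    atomic_store(&lock, 0);   // hand the lock back
}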
Lock Synchronization
OpenMP Locks
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(void){
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel
{
int id = omp_get_thread_num();
// parallel section
// ...
omp_set_lock(&lock);
// start sequential section
// ...
printf("id = %d\n", id);
// end sequential section
omp_unset_lock(&lock);
// parallel section
// ...
}
omp_destroy_lock(&lock);
}
Synchronization in OpenMP
- Locks are typically used inside libraries of higher-level parallel programming constructs
- E.g. OpenMP offers #pragmas for common cases:
- critical
- atomic
- barrier
- ordered
- OpenMP offers many more features
- E.g., private variables, reductions
- See online documentation
- Or tutorial at
If threads end up waiting on each other indefinitely, the whole program stalls (deadlock).
OpenMP Timing
- Elapsed wall clock time:
double omp_get_wtime(void);
- Returns elapsed wall clock time in seconds
- Time is measured per thread, no guarantee can be made that two distinct threads measure the same time
- Time is measured from “some time in the past”, so subtract results of two calls to omp_get_wtime to get elapsed time
Shared Memory and Caches
Multiprocessor Key Questions
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How many processors can be supported?
Shared Memory Multiprocessor (SMP)
- Q1 – Single address space shared by all processors/cores
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
- Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data to only one processor at a time
- All multicore computers today are SMP
Cache Coherency
Keeping Multiple Caches Coherent
- Architect’s job: keep cache values coherent across the shared memory
- Idea: When any processor has cache miss or writes, notify other processors via interconnection network
- If only reading, many processors can have copies
- If a processor writes, invalidate any other copies (snoopy protocol)
- On write transactions from one processor, the other caches “snoop” the common interconnect, checking for tags they hold
- Invalidate any copies of same address modified in other cache
Each cache block is tagged with one of several states.
How Does HW Keep $ Coherent?
- Each cache tracks state of each block in cache:
- Shared: up-to-date data, other caches may have a copy
- Modified: up-to-date data, changed (dirty), no other cache has a copy, OK to write, memory out-of-date (i.e., write back)
Two Optional Performance Optimizations of Cache Coherency via New States
- Each cache tracks state of each block in cache:
- Exclusive: up-to-date data, no other cache has a copy, OK to write, memory up-to-date
- Avoids writing to memory if block replaced
- Supplies data on read instead of going to memory
- Owner: up-to-date data, other caches may have a copy (they must be in Shared state)
- This cache is one of several with a valid copy of the cache line, but has the exclusive right to make changes to it. It must broadcast those changes to all other caches sharing the line. The introduction of owned state allows dirty sharing of data, i.e., a modified cache block can be moved around various caches without updating main memory. The cache line may be changed to the Modified state after invalidating all shared copies, or changed to the Shared state by writing the modifications back to main memory. Owned cache lines must respond to a snoop request with data.
Coherency Tracked by Cache Block
- Block ping-pongs between two caches even though processors are accessing disjoint variables
- Effect called false sharing
- How can you prevent it?
Solution: use smaller cache blocks (see also the padding sketch below).
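Another common mitigation, not necessarily the one shown in lecture, is to pad per-thread data so that variables written by different threads land in different cache blocks. A sketch, assuming 64-byte blocks and that the array starts on a block boundary:

#include <omp.h>

#define NUM_THREADS 4
#define CACHE_LINE  64

// Each thread's counter gets its own cache block, so writes by different
// threads no longer invalidate the same block back and forth (no ping-pong).
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_counter counts[NUM_THREADS];

void count_positive(int *data, int n) {
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (data[i] > 0)
                counts[id].value++;   // each thread touches only its own block
    }
}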
Remember The 3Cs?
- Compulsory (cold start or process migration, 1st reference):
- First access to block, impossible to avoid; small effect for long-running programs
- Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
- Capacity (not compulsory and…)
- Cache cannot contain all blocks accessed by the program even with perfect replacement policy in fully associative cache
- Solution: increase cache size (may increase access time)
- Conflict (not compulsory or capacity and…):
- Multiple memory locations map to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity (may increase access time)
- Solution 3: improve replacement policy, e.g., LRU
Fourth “C” of Cache Misses! Coherence Misses
- Misses caused by coherence traffic with other processor
- Also known as communication misses because represents data moving between processors working together on a parallel program
- For some parallel programs, coherence misses can dominate total misses
Conclusion
- OpenMP as simple parallel extension to C
- Threads level programming with parallel for pragma, private variables, reductions, …
- ≈ C: small so easy to learn, but not very high level and it’s easy to get into trouble
- TLP
- Cache coherency implements shared memory even with multiple copies in multiple caches
- False sharing a concern; watch block size!
Lecture 36 (MapReduce, Spark)
Amdahl's Law
Speedup is limited by the portion of the program that cannot be sped up.
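For reference, with serial fraction s and p processors: Speedup(p) = 1 / (s + (1 − s)/p). As p → ∞ this approaches 1/s, the “Speedup = 1/s” bound quoted in the conclusion below.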
Request-Level and Data-Level Parallelism
Request-Level Parallelism (RLP)
- Hundreds or thousands of requests/sec
- Not your laptop or cell-phone, but popular Internet services like web search, social networking, …
- Such requests are largely independent
- Often involve read-mostly databases
- Rarely involve strict read–write data sharing or synchronization across requests
- Computation easily partitioned within a request and across different requests
Data-Level Parallelism (DLP)
- Two kinds:
- Lots of data in memory that can be operated on in parallel (e.g. adding together 2 arrays)
- Lots of data on many disks that can be operated on in parallel (e.g. searching for documents)
- Today’s lecture: DLP across many servers and disks using MapReduce
MapReduce
What is MapReduce? (An abstraction)
- Simple data-parallel programming model designed for scalability and fault-tolerance
- Pioneered by Google
- Processes > 25 petabytes of data per day
- Open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon, …
Hadoop
What is MapReduce used for?
- At Google:
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation
- For computing multi-layer street maps
- At Yahoo!:
- “Web map” powering Yahoo! Search
- Spam detection for Yahoo! Mail
- At Facebook:
- Data mining
- Ad optimization
- Spam detection
MapReduce Design Goals
- Scalability to large data volumes:
- 1000’s of machines, 10,000’s of disks
- Cost-efficiency:
- Commodity machines (cheap, but unreliable)
- Commodity network
- Automatic fault-tolerance via re-execution (fewer administrators)
- Easy, fun to use (fewer programmers)
- Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” 6th USENIX Symposium on Operating Systems Design and Implementation, 2004.
- optional reading, linked on course homepage – a digestible CS paper at the 61C level
MapReduce Programming Model
Input & Output: each a set of key/value pairs (tagging data with keys makes it easy to distribute)
Programmer specifies two functions:
map(in_key, in_value) → list(interm_key, interm_value)
- Processes input key/value pair
- Slices data into “shards” or “splits”; distributed to workers
- Produces set of intermediate pairs
reduce(interm_key, list(interm_value)) → list(out_value)
- Combines all intermediate values for a particular key
- Produces a set of merged output values (usu just one)
code.google.com/edu/parallel/mapreduce-tutorial.html
Spark
DAG: Directed Acyclic Graph
Word Count in Spark’s Python API
file.flatMap(lambda line: line.split())
.map(lambda word: (word,1))
.reduceByKey(lambda a, b: a+b)
flatMap in Spark’s Python API
def neighbor(n):
return [n-1, n, n+1]
R = sc.parallelize(range(5))  # sc: spark context
R.collect()
# [0, 1, 2, 3, 4]
R.map(neighbor).collect()
# [[-1, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
R.flatMap(neighbor).collect()
# [-1, 0, 1, 0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5]
Conclusion
- 4th big idea is parallelism
- Amdahl’s Law constrains performance wins
- With infinite parallelism, Speedup = 1/s (s=serial %)
- MapReduce is a wonderful abstraction for programming thousands of machines
- Hides details of machine failures, slow machines
- File-based
- Spark does it even better
- Memory-based
- Lazy evaluation
Lecture 37 (Data Centers, Cloud Computing)
Eras of Computer Hardware
skip
Warehouse Scale Computer
Why Cloud Computing Now?
- “The Web Space Race”: Build-out of extremely large datacenters (10,000’s of commodity PCs)
- Build-out driven by growth in demand (more users)
- Infrastructure software and Operational expertise
- Discovered economy of scale: 5-7x cheaper than provisioning a medium-sized (1000 servers) facility
- More pervasive broadband Internet so can access remote computers efficiently
- Commoditization of HW & SW
- Standardized software stacks
Warehouse Scale Computers
- Massive scale datacenters: 10,000 to 100,000 servers + networks to connect them together
- Emphasize cost-efficiency
- Attention to power: distribution and cooling
- (relatively) homogeneous hardware/software
- Offer very large applications (Internet services): search, social networking, video sharing
- Very highly available: < 1 hour down/year
- Must cope with failures common at scale
- “…WSCs are no less worthy of the expertise of computer systems architects than any other class of machines”
-Barroso and Hoelzle 2009
Design Goals of a WSC
- Unique to Warehouse-scale
- Ample Parallelism:
  - Batch apps: large number of independent data sets with independent processing
  - Also known as Data-Level Parallelism
- Scale and its Opportunities/Problems
  - Relatively small number of these make design cost expensive and difficult to amortize: only a handful of hyperscale data centers are ever built, so the design effort cannot be spread over many copies
  - But price breaks are possible from purchases of very large numbers of commodity servers: buying tens of thousands of machines at once brings big per-unit discounts
  - Must also prepare for high # of component failures: with tens of thousands of disks, CPUs, and DIMMs, hardware failure becomes an everyday event
- Operational Costs Count:
  - Cost of equipment purchases << cost of ownership
Defining Performance
- What does it mean to say X is faster than Y?
- 2009 Ferrari 599 GTB
- 2 passengers, 11.1 secs for quarter mile (call it 10sec)
- 2009 Type D school bus
- 54 passengers, quarter mile time? (let’s guess 1 min)
- Response Time or Latency
- time between start and completion of a task
- E.g., time to move vehicle ¼ mile
- Throughput or Bandwidth
- total amount of work in a given time
- E.g., passenger-miles in 1 hour
Power Usage Effectiveness (PUE)
Impact of latency, bandwidth, failure, varying workload on WSC software?
- WSC Software must take care where it places data within an array to get good performance
- WSC Software must cope with failures gracefully
- WSC Software must scale up and down gracefully in response to varying demand
- More elaborate hierarchy of memories, failure tolerance, workload accommodation makes WSC software development more challenging than software for single computer
Power Usage Effectiveness
- Overall WSC Energy Efficiency: amount of computational work performed divided by the total energy used in the process
- Power Usage Effectiveness (PUE): Total building power / IT equipment power
- A power efficiency measure for WSC, not including efficiency of servers, networking gear
- 1.0 = perfection
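Worked example: a facility with PUE = 1.24 (like Google's WSC A below) draws 1.24 MW from the grid for every 1 MW that actually reaches the IT equipment; the extra 0.24 MW goes to cooling, power distribution losses, and other overhead.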
Google WSC A PUE: 1.24
- Careful air flow handling
- Don’t mix server hot air exhaust with cold air (separate warm aisle from cold aisle)
- Short path to cooling so little energy spent moving cold or hot air long distances
- Keeping servers inside containers helps control air flow
- Elevated cold aisle temperatures
- 81°F instead of traditional 65°- 68°F
- Found reliability OK if run servers hotter
- Use of free cooling
- Cool warm water outside by evaporation in cooling towers
- Locate WSC in moderate climate so not too hot or too cold
- Per-server 12-V DC UPS
- Rather than WSC wide UPS, place single battery per server board
- Increases WSC efficiency from 90% to 99%
- Measure vs. estimate PUE, publish PUE, and improve operation
Computing in the News
- 2011 (www.nytimes.com/2011/09/09/technology/google-details-and-defends-its-use-of-electricity.html)
- Google disclosed that it continuously uses enough electricity to power 200,000 homes, but it says that in doing so, it also makes the planet greener.
- Search cost per day (per person) same as running a 60-watt bulb for 3 hours
- 2018 (techcrunch.com/2018/04/04/google-matches-100-percent-of-its-power-consumption-with-renewables/)
- Google: “Over the course of 2017, across the globe, for every kilowatt-hour of electricity we consumed, we purchased a kilowatt-hour of renewable energy from a wind or solar farm that was built specifically for Google. This makes us the first public Cloud, and company of our size, to have achieved this feat”
Summary
- Parallelism is one of the Great Ideas
- Applies at many levels of the system – from instructions to warehouse scale computers
- Post PC Era: Parallel processing, smart phone to WSC
- WSC SW must cope with failures, varying load, varying HW latency bandwidth
- WSC HW sensitive to cost, energy efficiency
- WSCs support many of the applications we have come to depend on
Lecture 38 (Dependability)
6 Great Ideas in Computer Architecture
- Abstraction (Layers of Representation/Interpretation)
- Moore’s Law
- Principle of Locality/Memory Hierarchy
- Parallelism
- Performance Measurement & Improvement
- Dependability via Redundancy
We will discuss hardware failures and methods to mitigate them
Great Idea #6: Dependability via Redundancy
Dependability Metrics
Dependability via Redundancy: Time vs. Space
- Spatial Redundancy – replicated data or check information or hardware to handle hard and soft (transient) failures (redundancy through copies of hardware, data, or check information)
- Temporal Redundancy – redundancy in time (retry) to handle soft (transient) failures (redundancy by repeating the operation)
Dependability Measures
- Reliability: Mean Time To Failure (MTTF)
- Service interruption: Mean Time To Repair (MTTR)
- Mean time between failures (MTBF)
- MTBF = MTTF + MTTR
- Availability = MTTF / (MTTF + MTTR)
- Improving Availability
- Increase MTTF: More reliable hardware/software + Fault Tolerance
- Reduce MTTR: improved tools and processes for diagnosis and repair
Availability Measures
- Availability = MTTF / (MTTF + MTTR) as %
- MTTF, MTBF usually measured in hours
- Since we hope the system is rarely down, the shorthand is “number of 9s of availability per year”
- 1 nine: 90% => 36 days of repair/year
- 2 nines: 99% => 3.6 days of repair/year
- 3 nines: 99.9% => 526 minutes of repair/year
- 4 nines: 99.99% => 53 minutes of repair/year
- 5 nines: 99.999% => 5 minutes of repair/year
Reliability Measures
- Another reliability measure is the average number of failures per year: Annualized Failure Rate (AFR)
- E.g., 1000 disks with 100,000 hour MTTF
- 365 days * 24 hours = 8760 hours
- (1000 disks * 8760 hrs/year) / 100,000 hrs = 87.6 failed disks per year on average
- 87.6 / 1000 = 8.76% annual failure rate
- Google’s 2007 study* found that actual AFRs for individual drives ranged from 1.7% for first year drives to over 8.6% for three-year old drives
research.google.com/archive/disk_failures.pdf
Failures In Time (FIT) Rate
- The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (10^9) device-hours of operation
- Or 1000 devices for 1 million hours, 1 million devices for 1000 hours each
- MTBF = 1,000,000,000 x 1/FIT
- Relevant: Automotive safety integrity level (ASIL) defines FIT rates for different classes of components in vehicles
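Worked example: a component with a FIT rate of 100 fails 100 times per 10^9 device-hours, so its MTBF = 10^9 / 100 = 10^7 hours (over 1,000 years per device) — yet a fleet of 10,000 such devices still averages one failure every 1,000 hours, roughly every six weeks.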
Dependability Design Principle
- Design Principle: No single points of failure
- “Chain is only as strong as its weakest link”
- Dependability corollary of Amdahl’s Law
- Doesn’t matter how dependable you make one portion of system
- Dependability limited by part you do not improve
Error Detection
Error Detection/Correction Codes
- Memory systems generate errors (accidentally flipped bits)
- DRAMs store very little charge per bit
- “Soft” errors occur occasionally when cells are struck by alpha particles or other environmental upsets
- “Hard” errors can occur when chips permanently fail
- Problem gets worse as memories get denser and larger
- Memories protected against soft errors with EDC/ECC
- Extra bits are added to each data-word
- Used to detect and/or correct faults in the memory system
- Each data word value mapped to unique code word
- A fault changes valid code word to invalid one, which can be detected
Error Detection and Correction
Suppose Want to Correct One Error?
- Hamming came up with simple to understand mapping to allow Error Correction at minimum distance of three
- Single error correction, double error detection
- Called “Hamming ECC”
- Worked weekends on relay computer with unreliable card reader, frustrated with manual restarting
- Got interested in error correction; published 1950
- R. W. Hamming, “Error Detecting and Correcting Codes,” The Bell System Technical Journal, Vol. XXVI, No 2 (April 1950) pp 147-160.
ECC Example
Set parity bits to create even parity for each group
- A byte of data: 10011010
- Create the coded word, leaving spaces for the parity bits:
- _ _ 1 _ 0 0 1 _ 1 0 1 0
  (bit positions: 1 2 3 4 5 6 7 8 9 a b c)
- Calculate the parity bits
- Final code word: 011100101010
- Data word: 1 001 1010
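Working the parity bits out explicitly (each parity bit at position 2^k covers the positions whose binary index has that bit set, and even parity is used):
- p1 (covers positions 1,3,5,7,9,11): data bits 1,0,1,1,1 → four 1s → p1 = 0
- p2 (covers positions 2,3,6,7,10,11): data bits 1,0,1,0,1 → three 1s → p2 = 1
- p4 (covers positions 4,5,6,7,12): data bits 0,0,1,0 → one 1 → p4 = 1
- p8 (covers positions 8,9,10,11,12): data bits 1,0,1,0 → two 1s → p8 = 0
- Filling positions 1, 2, 4, 8 gives 0 1 1 1 0 0 1 0 1 0 1 0 = 011100101010, the code word above.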
What if More Than 2-Bit Errors?
- Use double-error correction, triple-error detection (DECTED)
- For network transmissions, disks, and distributed storage, the common failure mode is bursts of bit errors, not just one- or two-bit errors
- Contiguous sequence of B bits in which first, last and any number of intermediate bits are in error
- Caused by impulse noise or by fading in wireless
- Effect is greater at higher data rates
- Solve with Cyclic Redundancy Check (CRC), interleaving or other more advanced codes
Redundancy with RAID
RAID: Redundant Arrays of (Inexpensive) Disks
- Data is stored across multiple disks
- Files are "striped" across multiple disks
- Redundancy yields high data availability
- Availability: service still provided to user, even if some components failed
- Disks will still fail
- Contents reconstructed from data redundantly stored in the array
- Capacity penalty to store redundant info
- Bandwidth penalty to update redundant info
Conclusion
- Great Idea: Redundancy to Get Dependability
- Spatial (extra hardware) and Temporal (retry if error)
- Reliability: MTTF, Annualized Failure Rate (AFR), and FIT
- Availability: % uptime (MTTF / (MTTF + MTTR))
- Memory
- Hamming distance 2: Parity for Single Error Detect
- Hamming distance 3: Single Error Correction Code + encode bit position of error
- Treat disks like memory, except you know when a disk has failed—erasure makes parity an Error Correcting Code
- RAID-2, -3, -4, -5 (and -6, -10): Interleaved data and parity
Lecture 39
What is a GPU
- Compared with a CPU: maximizes throughput by scheduling many parallel threads.
- More cores, at lower frequency, each simpler; coherency is software-managed.
- Higher memory latency, but higher memory bandwidth.
- High throughput by leveraging SIMD.
- Threads are grouped into threadgroups (aka blocks or workgroups).
Graphics Pipeline Overview
Parallel Programming Example
- As the data gets larger, the GPU becomes much faster than the CPU.
That's a wrap! I'll tidy up and post the lab and project notes when I have time.