CS61C NOTES 1-20 (Great Ideas in Computer Architecture)

这是一个CS61C的笔记，由于是边上课边记的，很多地方选择偷懒进行截图，，，

Lecture 01 (Intro)

6 Great Ideas in Computer Architecture

1. Abstraction (Layers of Representation/Interpretation)

Maintain all these layers of abstraction that pass relevant things up to the upper layer while obfuscating things that we really don't care about (e.g. how the wires work inside microprocessor).

2. Moore's Law & the Principle of Scaling

moore's law

3. Principle of Locality/Memory Hierarchy

局部性原理：

时间局部性：最近访问的数据很可能再次被访问
空间局部性：相邻数据很可能被一起访问
内存层次结构（缓存体系）正是基于此原理设计

principle of locality

4. Parallelism

并行化是突破单核性能瓶颈的关键方向，包含：

线程级并行（多核CPU） Thread-level parallelism
数据级并行（SIMD指令集） Data-level parallelism
任务级并行（GPU架构） Executed on a processor concurrently

Amdahl's Law: the principle of the longest pole.

阿姆达尔定律：系统加速比受限于串行部分的比例
公式：S = 1 / [(1-P) + P/N] （P为并行比例，N为处理器数量）
举例：若程序90%可并行，使用10核的极限加速比为 1/(0.1+0.9/10) ≈ 5.26倍

5. Performance Measurement & Feedback of Improvement Based on the Measurements

Matching application to underlying hardware to exploit.

Locality
Parallelism
Special hardware features, like specialized instructions (e.g. matrix manipulation)
Latency/Throughput (metricize what we care about)

How long to set the problem up and complete it (or how many tasks can be completed in given time)
How much faster does it execute once it gets going
Latency is all about time to finish

6. Dependability via Redundancy

Redundancy so that a failing piece doesn't make the whole system fail.
Applies to everything from datacenters to storage to memory to instructions.

Lecture 02 (Number Representation)

数据从模拟到数字的转换流程

Tipically the data begins in the analog domain, and we need to then convert it to the digital domain. (Sample -> Quantize -> encode)

Binary(b,0b10011010010)

Octal(0,02322)

Decimal(d)

Hex(x,0x4d2)

Two's Complement（二进制补码）

Used by all modern hardware.

Roughly evenly split between positive and negative. (One more negative)

Can still use MSB as sign bit.

To negate: Flip the bits and add one.

Overflow (溢出与负溢出)

The result of an arithmetic operation can't be represented by the (finite) hardware bits.

Solution: add more bits.

Sign Extension（符号扩展）

Want to represent the same number using more bits.

For sign and magnitude: add 0's after the sign bit.

For One's complement: copy MSB.

For Two's complement: copy MSB.

Base Translation（进制转换）

skip.

Biased notation（偏置表示法）

Take some unsigned value and add some bias to it to get the "actual" value. (Same range but change the location)

Two's complement tricks（补码技巧）

-1 means all bits are set.

Flip bits and add 1.

May cause overflow when adding two positive or two negative numbers.

Lecture 03 (C Intro:Basic)

Why C?

We can write programs that allow us to exploit underlying features of the architecture.

memory management, spacial instructions, parallelism.

Compilation: Overview

C compilers map C programs direactly into architecture-specific machine code.
For C, generally a two part process of compiling: .c files to .o files, then linking the .o files into executables.

C Syntax: main

To get the main function to accept arguments, use this:
```
int main(int argc, char *argv[])
```
- argc will contain the number of strings on the command line (the executable counts as one, plus one for each argument).
- argv is a pointer to an array containing the arguments as strings (more on pointers later).

C Syntax: True or False

False in C

0
NULL
Bollean types provided by C99's stdbool.h

True in C

...everything else... (Same idea as in Scheme: Only #f is false, everything else is true)

Typed Variables in C

Must declare the type of data a variable will hold, E.g:
```
int var = 2;
```
C: int should be integer type that target processor works with most efficiently
Only guarantee:

sizeof(long long) $\geq$ sizeof(long) $\geq$ sizeof(int) $\geq$ sizeof(short)
- Also, short $\geq$ 16 bits, long $\geq$ 32 bits.
- All could be 64 bits.
- encourage you to use intN_t and uintN_t
Constant is assigned a typed value once in the declaration; value can't change during entire execution of program
```
const float golden_ratio = 1.618;
const int days_in_week = 7;
const double the_law = 2.99792458e8;
```

Enums: a group of related integer constants.

enum cardsuit {CLUBS,DIAMONDS,HEARTS,SPADES};
enum color {RED, GREEN, BLUE};

Typed Function in C

You have to declare the type of data you plan to return from a function
Return type can be any C variable type, and is placed to the left of the function name
You can also specify the return type as void
Also need to declare types for values passed into a function

Variables and functions MUST be declared before they are used

int number_of_people () { return 3; }
float dollars_and_cents () { return 10.33; }

Struct in C

Typedef allows you to define new types.
```
typedef uint8_t BYTE;
BYTE b1, b2;
```

Structs are structured groups of variables e.g.,

typedef struct {
  int length_in_seconds;
  int year_recorded;
  } SONG;

SONG song1;
song1.length_in_seconds = 213;
song1.year_recorded = 1994;

SONG song2;
song2.length_in_seconds = 248;
song2.year_recorded = 1988;

Control Flow

if-else

if (x == 0) y++;
if (x == 0) {y++;}
if (x == 0) {y++; j = j + y;}

while
- while (expression) statement
- do statement while (expression);

for

for (initialize; check; update) statement

switch

switch (expression){
case const1: statements
case const2: statements
default: statements
}
break;

Note: until you do a break statement things keep executing in the switch statement

goto
- But it can result in spectacularly bad code if you use it, so don't!

An example

#include <stdio.h>
#include <math.h>
int main(void)
{
  int angle_degree;
  double angle_radian, pi, value;

  printf("Compute a table of the sine function\n\n");
  pi = 4.0*atan(1.0); /* could also just use pi = M_PI */
  printf("Value of PI = %f \n\n", pi);
  printf("Angle\tSine\n");
  angle_degree = 0;/* initial angle value */
  while (angle_degree <= 360) { /* loop til angle_degree > 360 */
      angle_radian = pi * angle_degree / 180.0;
      value = sin(angle_radian);
      printf ("%3d\t%f\n ", angle_degree, value);
      angle_degree += 10; /* increment the loop index */
  }
return 0;
}

Lecture 04 (C Intro: Pointers, Arrays, Strings)

Undefined Behavior

Heisenbugs: Bugs that seem random/hard to reproduce, and seem to disappear or change when debugging.
Bohrbugs: repeatable.

Pointers

An address refers to a particular memory location. In other words, it points to a memory location.

Pointers: A variable that contains the address of a variable.

int *p;
// Tells compiler that variable p is address of an int.
p = &y;
// Tells compiler to assign address of y to p.
// & called the “address operator” in this context
z = *p;
// Tells compiler to assign value at address in p to z
// * called the “dereference operator” in this context

Using Pointers Effectively

Normally a pointer can only point to one type (int, char, a struct, etc.)
- void * is a type that can point to anything (generic pointer)
- Use sparingly to help avoid program bugs… and security issues… and a lot of other bad things!
You can even have pointers to functions...
- int (*fn) (void *, void *) = &foo // function that takes two void* and return an int
- fn is a function that accepts two void * pointers and returns an int and is initially pointing to the function foo
- (*fn)(x, y) will then call the function
NULL pointers: The pointer of all 0's, should never read or write a null pointer.

Arrays

Declaration and initialization:
```
int ar[2];
int arr[] = {795, 635};
```
Accessing elements
- ar[num]
- returns the $\text{num}^{ \text{th} }$ element.
Arrays are (almost) identical to pointers
- char *string and char string[] are nearly identical declarations
- They differ in very subtle ways: incrementing, declaration of filled arrays
Key Concept: An array variable is a “pointer” to the first element
Consequences:
- ar is an array variable but looks like a pointer in many respects (though not all)
- ar[0] is the same as *ar
- ar[2] is the same as *(ar+2)
- We can use pointer arithmetic to access arrays more conveniently.

Declared arrays are only allocated while the scope is valid

\\ Incorrect code
char *foo() {
char string[32]; ...;
return string;
}

Pitfall: An array in C does not know its own length, and bounds not checked.
- Consequence: We must pass the array and its size to a procedure which is going to traverse it.

Function Pointer Example

#include <stdio.h>

int x10(int), x2(int);
void mutate_map(int [], int n, int(*)(int));
void print_array(int [], int n);

int x2 (int n) { return 2*n; }
int x10(int n) { return 10*n; }

void mutate_map(int A[], int n, int(*fp)(int)) {
    for (int i = 0; i < n; i++) 
        A[i] = (*fp)(A[i]);
}

void print_array(int A[], int n) {
    for (int i = 0; i < n; i++) 
        printf("%d ",A[i]);
    printf("\n");
}

int main(void)
{
    int A[] = {3,1,4}, n = 3;
    print_array(A, n);
    mutate_map (A, n, &x2);
    print_array(A, n);
    mutate_map (A, n, &x10);
    print_array(A, n);
}

Lecture 05 (C Intro: Memory Management)

Dynamic Memory Allocation

C has operator sizeof() which gives size in bytes (of type or variable), and it knows the size of arrays.
To allocate room for something new to point to, use malloc() (with the help of a typecast and sizeof)
ptr = (int *) malloc(sizeof(int));
- Now, ptr points to a space somewhere in memory of size (sizeof(int)) in bytes.
- (int *) simply tells the compiler what will go into that space (called a typecast)
malloc is almost never used for 1 var
ptr = (int ) malloc (nsizeof(int));
- This allocates an array of n integers.
Once malloc() is called, the memory location contains garbage, so don’t use it until you’ve set its value.
After dynamically allocating space, we must dynamically free it:
```
free(ptr);
```
- Even though the program frees all memory on exit(or when main returns), don’t be lazy!
- You never know when your main will get transformed into a subroutine!
Two things that will cause your program to crash or behave strangely later on, and cause VERY VERY hard to figure out bugs:
- free() the same piece of memory twice
- calling free() on something you didn’t get back from malloc()
managing the Heap: realloc(p, size)
- Resize a previously allocated block at p to a new size
- If p is NULL, then realloc behaves like malloc
- If size is 0, then realloc behaves like free, deallocating the block from the heap
- Returns new address of the memory block; NOTE: it is likely to have moved!
```
int *ip; 
ip = (int *) malloc(10*sizeof(int));
/* always check for ip == NULL */
… … …
ip = (int *) realloc(ip,20*sizeof(int));
/* always check NULL, contents of first 10 
elements retained */
… … …
realloc(ip,0); /* identical to free(ip) */
```

Linked List Example

Let’s look at an example of using structures, pointers, malloc(), and free() to implement a linked list of strings.

struct Node {
char *value;
struct Node *next; 
};
typedef struct Node *List;

/* Create a new (empty) list */
List ListNew(void)
{ return NULL; }

/* add a string to an existing list */
List list_add(List list, char *string)
{
  struct Node *node =
    (struct Node*) malloc(sizeof(struct Node));
  node->value =
    (char*) malloc(strlen(string) + 1);
  strcpy(node->value, string);
  node->next = list;
  return node;
}

Memory Locations

globals: Data declared outside of any procedure; Similar to #1 above, but has “global” scope.
C has 3 pools of memory
- Static storage: global variable storage, basically permanent, entire program run (can't change the size,e.g.)
- The Stack: local variable storage, parameters, return address (location of “activation records” in Java or “stack frame” in C)
- The Heap (dynamic malloc storage): data lives until deallocated by programmer

The Heap (Dynamic memory)

Large pool of meemory, not allocated in contiguous order
- back-to-back requests for heap memory could result blocks very far apart
In C, specify number of bytes of memory explicitly to allocate item
Heap Management Requirements
- want malloc() and free() to run quickly
- want minimal memory overhead
- want to avoid fragmentaation* - when most of our free memory is in many small chunks

K&R Malloc/Free Implementation

From Section 8.7 of K&R
- Code in the book uses some C language features we haven't discussed and is written in a very terse style, don't worry if you can't decipher the code
Each block of memory is preceded by a header that has two fields:
size of the block and
a pointer to the next block
All free blocks are kept in a circular linked list, the pointer field is unused in an allocated block

Choosing a block in malloc()

If there are multiple free blocks of memory that are big enough for some request, how do we choose which one to use?
- best-fit: choose the smallest block that is big enough for the request
- first-fit: choose the first block we see that is big enough
- next-fit: like first-fit but remember where we finished searching and resume searching from there

When Memory Goes Bad

Don't return pointer into the stack
Don't forget realloc() can move data
Don't free the wrong stuff
Don't do double-free
Don't lose the initial pointer (Memory Leak)
Valgrind to the rescue

Lecture 06 (Floating Pointer)

Basics & Fixed Point

Representation of Fractions with Fixed Pt. (e.g. 10.0101 = 2 + 0 + 0 + 1/4 + 0 + 1/16)

Floating Point

pass in another input which tells where the binary point might be.

IEEE 754 Floating Point Standard
Note: 0 has no leading 1, so reserve exponent value 0 just for number 0
IEEE 754 uses "biased exponent" representation

Calculate: (-1)^ S $\times$ (1+ Significand) $\times$ 2 ^(Exponent-127)

Special Numbers

Representation for $\pm \infty$ : all Exponent is all 1's and Significand is all 0's.
Representation for 0 : all Exponent and Significand is 0.

Exponent Significand Object

0 0 0

0 nonzero Denorms

1-254 anything +/-fl. pt. #

255 255 +/- $\infin$

255 nonzero NaN
- Can use the significand bits of NaN to identify which NaN it is.
- The gap between 0 and the least number (with Exponent not all 0's) is $2^{-126}$ , which is $2^{-149}$ between the least number and the second least number, that's why we have denorms.
- Denorms don't have an implicit leading 1 but a leading 0, and the inplicit exponent is -126, then the least number is $2^{-149}$ .
- Everytime we cross to a bigger Exponent, we double the stride. (don't forget $2^{-23}$ )

Exponent	Significand	Object
0	0	0
0	nonzero	Denorms
1-254	anything	+/-fl. pt. #
255	255	+/- $\infin$
255	nonzero	NaN

Examples, Discussion

representation of $\frac{1}{3}$ ( $\frac{1}{4} + \frac{1}{16} + \frac{1}{64} + \dots$ ) : 0 | 0 1 1 1 1 1 0 1 | 0 1 0 1 $\dots$ |
The FP don't have add associative. (e.g. $(-1.5 \times 10^{38} + 1.5 \times 10^{38}) + 1.0 = 1.0 \neq 0.0 = -1.5 \times 10^{38} + (1.5 \times 10^{38} +1.0)$ ) Because PF approximates real result.

Precision and Accuracy

Precision is a count of the number bits in used to represent a value.
Accuracy is the difference between the actual value of a # and its computer representation.
High precision permits high accuracy but don't guarantee it.

Rounding

Addition

More difficult, can't just add significand.
- De-normalize to match exponents
- Add significands to get resulting one
- Keep the same exponent
- Normalize (possibly changing exponent)

Other Floating Point Representations

Double Precision (binary64)
Quad-Precision (binary128)
Oct-Precision (binary256)
Half-Precision (binary16 / bfloat16 which keeps the same bits for Exponent)
u-bit : tell whether the number is exact or range

Lecture 07 (RISC-V: Intro)

The set of instructions a particular CPU implements is a Instruction Set Architecture (ISA). (e.g. ARM, Intel x86, IBM Power, RISC-V...)
Instruction set for a particular architecture (e.g. RISC-V) is represented by the Assembly language.
Each line of assembly code represents one instruction for the computer.

Assembly Variables: Registers

Assembly cannot use variables.
Assembly operands are registers. (operations are performed on the data that sits inside registers)

Benefits: Since registers are directly in hardware, they're very fast

Processor is the one that orchestrates loading memory from the I/O devices and sending the data from the memory to the output.
Inside the datapath, the main elements are those registers and the execution unit, typically called the artihmetic logic unit.
The processer communicates to the memory by issuing addresses and reading data from the memory or writing the data to the memory.
Assert the enable signal to decide whether to write to the memory.
Since registers are in hardware and right next to the execution, there is a predetermined number of them.
32 registers in RISC-V
Each RISC-V register is 32 bits wide (in RV32 variant)
- Groups of 32 bits called a word in RV32
Registers are numbered from 0 to 31
- Referred to by number x0 - x31
x0 is special, always holds value zero
- So only 31 registers able to hold variable values
Each register can be referred to by number or name
In assembly language, the registers have no type
- Operation determines how register contents are treated

Add/Sub Instructiond

Syntax of Instruction: one two, three, four (rigid: 1 operator, 3 operands)
- one: operation name
- two: destination
- three: first source
- four: second source
```
add x1, x2, x3
sub x3 ,x4, x5
```

Immediates

Immediates are numerical constants

addi x3, x4, 10 # the immediates should be less the 32 bits

Syntax similar to add instruction, expect that last argument is a number instead of a register
There is no Subtract Immediate in RISC-V (substitution is not commutative)
We can change the sign of the immediate in add to implement sub immediate
If we set the destination to x0, will do nothing

Lecture 08 (RISC-V: lw, sw, Decision 1)

Register Footprint For a Particular Kernel: The certain number of registers we have to use for a paticular computational routine or for a compute kernel.

Loading from and Storing to Memory

We work with laeger amounts of data, so we have to spill out of registers to memory.
Memory can be viewed as a large one-dimensional array with the address acting as the index to that array starting at 0.
To access a word in memory, processor must supply an address.
Address is calculated by a processor and is used to point to a particular word in the memory.
The address is typically specified as an offset to a base pointer.
Little-endian: When we store bytes in words in memory, the least-significant byte in a word gets the smallest address.

Data Transfer Instructions

Load from memory to register

lw x10, 12(x15) 
# Reg x10 gwts A[3], 3 words is 12 bytes, x15 points to A[0]

Store from Register to Memory

sw x10, 40(x15)
# A[10] = h + A[3], the offsets must be multiple of 4

Loading and Storing bytes

lb x10, 3(x11) # copy the content to the low byte position fo the register x10, and the other 3 bytes is copied to "sign-extend"
lbu x10, 3(x11) # copy with zero-extend to the other 3 bytes
sb x10, 7(x11)
# the same syntax as lw and sw but jest deal with Bytes not word

Decision 1: Decision Making

if-statement

beq reg1, reg2, L1 # the label is essential an immediate value
# go to statement labeled L1 if (value in reg1) == (value in reg2), otherwise go to the next statement
beq reg1, reg2, L1 
# go to statement labeled L1 if (value in reg1) != (value in reg2)

Types of Branches (Branch - change of control flow)
- Conditional Branch - change control flow depending on outcome of comparison
  - Branch if equal (beq) or not equal (bne)
  - Branch if less than (blt) and be greater than or equal (bge)
  - the unsigned versions (bltu,bgeu)
- Unconditional Branch - always branch
  - jump(j)
```
j label 
```
Example

    bne x13, x15, Exit  
    add x10, x11, x12  
Exit:

# if (i == j)
#   h = g + h;


    bne x13, x14, Else
    add x10, x11, x12
    j Exit
Else:sub x10, x11, x12
Exit:

# if (i == j)
#   f = g + h;
# else
#   f = g - h；

Example

    add x9, x8, x0  # x9 = &A[0]
    add x10, x0, x0  # sum
    add x11, x0, x0  # i
    addi x13, x0, 20  
Loop:
    bge x11, x13, Done  
    lw x12 , 0(x9)  # x12 A[i]
    add x10, x10 , x12  # sum
    addi x9, x9, 4  # &A[i+1]
    addi x11, x11, 1  # i++
    j Loop
Done:

# int A[20];
# int sum = 0;
# for (int i=0 ; i<20 ; i++)
#    sum += A[i];

Lecture 09 (RISC-V: Decision 2)

Logocal Instruction

and x5, x6, x7  # x5 = x6 & x7
andi x5, x6, 3  # x5 = x6 & 3

There is no logical NOT in RISC-V (use xor with $11111111_{ \text{two} }$ )
Shift Left Logical (sll) and immediate (slli)
```
slli x11, x12, 2  # x11=x12<<2
```
- store in x11 the value from x12 shifted by 2 bits to the left, inserting 0's on right. (multiplication)

Program Execution

Helpful RISC-V Assembler Features

Symbolic register names
- a0-a7 for argument registers (x10-x17) for function calls
- zero for x0
Pseudo-instructions
- Shorthand syntax for common assembly idioms
- e.g. mv rd, rs = addi rd, rs, 0
- e.g. li rd, 13 = addi ra, x0 13
- e.g. nop = addi x0, x0, 0 # no operation

RISC-V Function Calls

Six Fundamental Steps in Calling a Function
1. Put arguments in a place where function can access them
2. Tranfer control to function (jal)
3. Acquire (local) storage resources needed for function
4. Perform desired task of the function
5. Put return value in a place where calling code can access it and restore any registers you used; release local storage
6. Return control to point of origin, since a function can be called from several points in a program
RISC-V Function Call Conventions
- Registers faster than memory, so use them
- a0-a7 (x10-x17): eight argument registers to pass parameters and two return calues (a0-a1)
- ra: one return address register to return to the point of origin (x1)
- Also s0-s1 (x8-x9) and s2-s11 (x18-x27): saved registers
  
  jr: jump register, shorthand: ret = jr ra
  ra is the return address
  we don't use jump because it needs us to code the address for many times
- Before:
```
1008 addi ra, zero, 1016  # ra = 1016
1012 j sum  # goto sum
```
- After:
```
1008 jal sum  # ra = 1012. goto sum, jump and link
```
Syntax

jal rd, offset # rd: return destination(noamally PC+4)
jal rd, Label
jalr rd, rs1, offset

Lecture 10 (RISC-V Procedures)

Function Call Example

Stack: where the old register values saved to restore them after function call

Stack in memory, so need register to point to it: sp (stack pointer, x2)
Stack

int Leaf(int g, int h, int i, int j)
{
  int f;
  f = (g + h) - (i + j);
  return f;
}

prologue - operation - epilogue

Nested Calls and Register Conventions

Resister Conventions

CalleR: the calling function
CallE: the function being called
When callee returns from executing, the caller needs to know which registers may have changed and which are guaranteed to be unchanged.
Register Conventions: A set of generally accepted rules as to which registers will be unchanged after a procedure call (jal) and which may be changed.
RISC-V function-calling conventions divides registers into two categories:
- Preserved across function call
  - Caller can rely on values being unchanged
  - sp (stack pointer), gp (global pointer), tp (thread pointer), "saved registers" s0-s11 (s0 is also sp)
- Not preserved across function call
  - Caller cannot rely on values being unchanged
  - Argument/return registers a0-a7, ra, "temporary registers" t0-t6

Memory Allocation

C ahs two storage classes: automatic and static
- Automatic variables are local to function and discarded when function exits
- Static variables exist across exits from and entries to procedures
Use stack for automatic (local) variables that don't fit in registers
Procedure frame or activation record: segment of stack with saved registers and local variables

Using the Stack

Memory Allocation

When a C program is run, there are three important memory areas allocated:
- Static: Variables declared once per program, cease to exist only after execution completes - e.g., C globals
- Heap: Variables decleared dynamically via malloc
- Stack: Space to be used by procedure during execution; this is where we can save register values
Where is the Stack in Memory
- RV32 convention (RV64/RV128 have different memory layouts)
- Stack starts in high memory and grows down
  - Hexadecimal: $\text{bfff} \_ \text{fff} 0_{ \text{hex} }$
  - Stack must be aligned on 16-byte boundary (not ture in previous examples)
- RV32 programs (text segment) in low end
  - $0001 \_ 0000_{ \text{hex} }$
- static data segment (constants and other static variables) above text for static variables
  - RISC-V convention global pointer (gp) points to static
  - RV32 gp = $1000 \_ 0000_{ \text{hex} }$
- Heap above static for data structures that grow and shrink; grows up to high addresses

Summary

Lecture 11 (RISC-V Instruction Formats 1: Intro)

Big idea: Stored-Program Computer
- Instructions are represented as bit patterns -> entire programs can be stored in memory to be read or written juat like data in seconds ("Von Neumann" computers)
- Everything like instructions and data are stored in memory, so they are addressable, we can access our data by loads and stores.
C pointers can point to anything, keep the memory of program and data separate.
One register keeps address of instruction being executed: "Program Counter" (PC)
- Intel calls it Instruction Pointer (IP)
Binary Compatibility
- Programs are distributed in binary form
  - Programs bound to specific instruction set
  - Different version for phones and PCs
- New machines want to run old programs ("binaries") as well as programs compiled to new instructions -> leads to "backward-compatible" instruction set evolving over time.
Instructions as Numbers
- Since data is in words, make instructions be fixed-size 32-bit words also (same 32-bit instructions used for RV32, RV64, RV128)
- One word is 32 bits, so divide instruction word into "fields"
- Each field tells processor clues what that instruction is -- what type is it and which registers does it operate on
- RISC-V define 6 basic types of instruction formats:
  - R-format for register-register arithmetic operations
  - I-format for register-immediate arithmetic operations and loads
  - S-format for stores
  - B-format for branches (minor variant of S-format)
  - U-format for 20-bit upper immediate instructions
  - J-format for jumps (minor variant of U-format)

R-Format Layout

sub and sra need to perform sign extension, so their func7 are the same
Example

I-Format Layout

Loads

S-Format Layout

combine two parts of offset together to get 12 bits offset

Lecture 12 (RISC-V Instruction Formats 2)

B-Format Layout

Branches read two registers but don't write to a register (similar to stores)
Branches typically used for loops
Recall: Instructions stored in a localized area of memory
PC-Relative Addressing: Use the immediate field as a two's-complement offset to PC.
- Branches generally change the PC by a small amount
- Can specify $\pm 2^{11}$ 'unit' addresses from the PC (we can encode 12-bit offsets as immediates)
- It's an idea to multiply the offset by four bytes before adding to PC, thus we have 4 times greater reach than using byte offset
Branch Caculation
- If we don't take the branch: PC = PC + 4
- If we do take the branch: PC = PC + immediate*4 (not ture in RISC-V)
RISC-V Feature, n $\times$ $\times$ 16-bit Instruction
- Extension to RISC-V base ISA support 16-bit compressed instructions and also variable-length instructions that are multiples of 16-bits in length (save the memory)
- To enable this, RISC-V scales the branch offset by 2 bytes even when there are no 16-bit instructions
- Reduces branch reach by half and means that $\frac{1}{2}$ of possible targets will be errors on RISC-V processors that only support 13-bit instructions
- RISC-V conditional branches can only reach $\pm 2^{10} \times$ 32-bit instructions on either side of PC
Actual RV32 Branch Caculation
- If we don't take the branch: PC = PC + 4
- If we do take the branch: PC = PC + immediate*2

RISC-V Immediate Encodeing

Upper Immediates

Long Immediates

Does the value in branch immediate field change if we move the code?
- If moving individual lines of code, then yes
- If moving all of the code, then no (benefit of PC-relative addressing)
What do we do if destination is $> 2^{10}$ $> 2^{10}$ instructions away from branch
- wisely use j to replace it (wider range)

U-Format for "Upper Immediate" Instructions

lui writes the upper 20 bits of the destination with the immediate value, and clears the lower 12 bits
Together with addi to set low 12 bits, can create any 32-bit value in a register using two instructions
```
lui x10, 0x87654  # x10 = 0x87654000
addi x10, x10, 0x321  # x10 = 0x87654321
```
addi 12-bit ummdiate is always sign-extended, if top bit is set, will subtract -1 from upper 20 bits
- Pre-increment value placed in upper 20 bits, if sign bit will be set on immediate in lower 12 bits
- li x10, 0xDEADBEEF will de all of these
auipc adds upper immediate value to PC and places result in destination register (used for PC-relative addressing)
```
auipc x10, 0  # just store the current PC address in x10
```

J-Format Layout

`jal`

`jalr`

Immediates here are not like in jal and in branches, because we are using the immediate format. We can't just assume that the least significant bit is 0, so there is no multiplication by two bytes.
ret = jr ra = jalr x0, ra, 0

Lecture 13 (Compilation, Assembly, Linking, Loading)

Interpretation vs Translation

Interpreter is a program that executes other programs
- Directly executes a program in the source language.
- Interpret a high-level language when efficiency is not critical.
- Interpret machine code: simulate something to have more debugging information and make itself backward-compatible.
- Interpreter closer to high-level, so can give better error messages
- Interpreter slower, code smaller.
- Interpreter provides instruction set independence: run on any machine.
Language translation gives us another option.
- Converts a program from the source language to an equivalent program in another language.
- Translate to a lower-level language to increase performance.
- Translated/compiled code almost always more efficient and therefore higher performance.
- Translation/compilation helps “hide” the program “source” from the users.

`Compiler`

Input: High-level language code (.c)
Output: Assembly language code (.s)
Note: output may contain pseudo-instructions

`Assembler`

Input: Assembly language code (includes pseudo ops) (.s)
Output: Object code, information tables (true assembly only) (.o)
Reads ans uses directives
Replace pseudo-instructions
Produce machine language
Creates object file
Producing Mechine Language
- "Forward Reference" problem
  - Branch instructions can refer to labels that are "forward" in the program
  - Solved by taking two passes over the program
    - First pass replaces the pseudo instructions and remembers position of labels
    - Second pass uses label positions to generate code
- PC-relative jumps and branches
  - j offset pseudo instruction expands to jal zero, offset
  - Just count the number of instruction half-words between target and jump to determine the offset: position-independent code (PIC)
- References to static data
  - la gets broken up into lui and addi (use auipc/addi for PIC)
  - These require the full 32-bit address of the data (the final resting place in a.out, not relative to where it is in .o file)
  - These can't be determined yet, so we create two tables
    - Symbol Table: List of "items" in this file that may be used by other files (Labels / function calling, Data / anything in the .data section; variables which may be accessed across files)
    - Relocation Table: List of "items" whise address this file needs (any absolute label jumped to: jal, jalr ; Any piece of data in static section)
Object File Format
- object file header: size and position of the other pieces of the object file
- text segment: the machine code
- data segment: binary representation of the static data in the source file
- relocation information: identifies lines of code that need to be fixed up later
- symbol table: list of this file’s labels and static data that can be referenced
- debugging information
- A standard format is ELF (except MS)

`Linker`

Input: Object coed files, information tables (.o)
Output: Executable code (a.out)
Combines several object (.o) files into a single executable ("linking")
Enable separate compilation of filse
- Changes to one file do not require recompilation of the whole program
Steps (static link)
1. Take text segment from each .o file and put them together
2. Take data segment from each .o file, put them together, and concatenate this onto end of text segments
3. Resolve references:Go through Relocation Table; handle each entry; file in all absolute address
Resolving References
- Linker assumes first word of first text segment is at address 0x10000 for RV32
- Linker knows:
  - Length of each text and data segment
  - Ordering of text and data segments
- Linker calculates:
  - Absolute address of each label to be jumped to (internal or external) and each piece of data being referenced
- To resolve references:
  - Search for reference (data or label) in all “user” symbol tables
  - If not found, search library files (e.g., for printf)
  - Once absolute address is determined, fill in the machine code appropriately
- Output of linker: executable file containing text and data (plus header)
Static vs. Dynamically linked library
- statically-linked approach
  - Library is now part of the executable, so if the library updates, we don’t get the fix (have to recompile if we have source)
  - Includes the entire library even if not all of it will be used
  - Executable is self-contained
- Dynamically Linked Libraries (link at runtime)
  - Space/time issues
    - +Storing a program requires less disk space
    - +Sending a program requires less time
    - +Executing two programs requires less memory (if they share a library)
    - –At runtime, there’s time overhead to do link
  - Upgrades
    - +Replacing one file (libXYZ.so) upgrades every program that uses library “XYZ“
    - -Having the executable isn’t enough anymore
  - Prevailing approach to dynamic linking uses machine code as the “lowest common denominator”
    - Linker does not use information about how the program or library was compiled (i.e., what compiler or language)
    - Can be described as “linking at the machine code level”
    - This isn’t the only way to do it ...

`Loader`

Input: Executable Code (e.g., a.out for RISC-V)
Output: (program is run)
Executable files are stored on disk
When one is run, loader’s job is to load it into memory and start it running
In reality, loader is the operating system (OS)
What Dose Loader Do:
- Reads executable file’s header to determine size of text and data segments
- Creates new address space for program large enough to hold text and data segments, along with a stack segment
- Copies instructions + data from executable file into the new address space
- Copies arguments passed to the program onto the stack
- Initializes machine registers
  - Most registers cleared, but stack pointer assigned address of 1st free stack location
- Jumps to start-up routine that copies program’s arguments from stack to registers & sets the PC
  - If main routine returns, start-up routine terminates program with exit system call

Example

C Program
Compiled
Assembled
Linked
Address Calculation

Summary

Lecture 14 (Intro to Synchronous DIgital Systems)

Synchronous Digital Systems

Hardware of a processor (e.g., RISC-V) is a Synchronous Digital System
Synchronous:
- All operations coordinated by a central clock
- "Heartbeat" of the system
Digital:
- All values represented by discrete values
- Electrical signals are treated as 1's and 0's; grouped together to form words

Switches

Transistors

Use it as an amplifier or as a switcher
Modern digital systems designed in CMOS act as voltage-controlled switches
- MOS: Metal-Oxide on Semiconductor
- C for complementary: normally-open and normally-closed switches

Signals and Waveforms

Signals is only treated as 1 or 0, is transmitted over wires continuously
Real world signals have noisy and delay
Type of Circuits(Synchronous Digital Systems are made up
of two basic types of circuits):
- Combinational Logic (CL) circuits
  - Our previous adder circuit is an example.
  - Output is a function of the inputs only.
  - Similar to a pure function in mathematics, y = f(x). (No way to store information from one invocation to the next. No side effects)
- State Elements
  - circuits that store information.

Conclusion

Clocks control pulse of our circuits
Voltages are analog, quantized to 0/1
Circuit delays are fact of life
Two types of circuits:
- Stateless Combinational Logic (&,|,~)
- State circuits (e.g., registers)

Lecture 15 (State, State Machine)

Accumulator

Uses for State Elements
- As a place to store values for some indeterminate amount of time:
  - Register files (like x0-x31 on the RISC-V)
  - Memory (caches, and main memory)
- Help control the flow of information between combinational logic blocks.
  - State elements are used to hold up the movement of information at the inputs to combinational logic blocks and allow for orderly passage.

Register Details: Flip-flops

Inside register: n instances of a "Flip-Flop"
Plip-flop name because the output flips and flops between and 0, 1
D is "data", Q is "output". Also called "D-type Flip-Flop" (threr are other types)

Accumulator Revisited

Pipelining for Performance

Maximum Clock Frequency

Max Delay = CLK-to-Q Delay + CL Delay + Setup Time
Extra Registers are often added to help speed up the clock rate
Insertion of register allows higher clock frequency

Recap of Timing Terms

Clock (CLK) : steady square wave that synchronizes system
Setup Time : when the input must be stable before the rising edge of the CLK
Hold Time : when the input must be stable after the rising edge of the CLK
“CLK-to-Q” Delay : how long it takes the output to change, measured from the rising edge of the CLK
Flip-flopone bit of state that samples every rising edge of the CLK (positive edge-triggered)
Register : several bits of state that samples on rising edge of CLK or on LOAD (positive edge-triggered)

Finite State Machine

you have some states, and the arrows are the transitions

General Model for Synchronous Systems

Collection of CL blocks separated by registers.
Registers may be back-to-back and CL blocks may be back-to-back.
Feedback is optional.
Clock signal(s) connects only to clock input of registers.

Conclusion

lecture 16 (Combinational Logic)

高中学过了，不多赘述

Truth Table

Logic Gates

Boolean Algebra

$+$ means or, $*$ means and, $^-$ means not
Used to see if two circuits are equiv

Laws of Boolean Algebra

Canonical forms

对真值表每个1输出的情况做或组合。(Sum of products / ORs of ANDs)，然后用布尔运算简化。(distribution complementarity identy)

Lecture 17 (Combinational Logic Blocks)

Data Multiplexors

通过S的值决定A与B谁来驱动C的输出。
$C = \bar{S} A + SB$

Arithmetic Logic Unit (ALU)

Adder/Subtractor

输入，输出，进位，很好理解
- 在two's complement里，当结果溢出表示范围时，计算会出现问题。
- XOR( $c_{ \text{out} }$ , $c_{ \text{in} }$ )=1时需要考虑overflow的情况。
减法器
- 当SUB是1，相当于既反转了B的输入，同时又加了1。

Lecture 18 (Single-Cycle CPU Datapath 1)

RISC-V Processor Design

connect the software story with the hardware story

The CPU

Processor (CPU): the active part of the computer that does all the work (data manipulation and decision-making)
Datapath: portion of the processor that contains hardware necessary to perform operations required by the processor (the brawn)
Control: portion of the processor (also in hardware) that tells the datapath what needs to be done (the brain)

Building a RISC-V Processor

Solution: break up the process of “executing an instruction” into stages, and then connect the stages to create the whole datapath
- smaller stages are easier to design
- easy to optimize (change) one stage without touching the others (modularity)

5 Stages of the Datapath

1. Instruction Fetch (IF)
get the instruction from the memory and stores it in the processor
1. Instruction Decode (ID)
looks at that instruction and determines what it is
1. Execute (EX) - ALU (Arithmetic-Logic Unit)
perform that operation (commonly may be done by the ALU)
1. Memory Access (MEM)
access the memory if need
1. Write Back to Register (WB)

The mux here is to bypass that increment of PC and write a branch target if the branch is be taken. (根据"分支条件"决定是继续还是转向新的branch)

be executed in one clock cycle

State and Sequencing

RISC-V: two-read, single-write

State Required by RV32I ISA

R-Type Add Datapath

Datapath with Immediates

Lecture 19 (Single-Cycle CPU Datapath 2)

Supporting Loads

select (by WBSel) whrther it comes from ALU or memory

Datapath for Stores

determined by MemRW

Implementing Branches

Adding jalr to datapath

Adding jal

Adding U-Types

Imm.Gen 的使用变了，没什么结构上的大变动

Conclusion

5 Phases of execution
- IF, ID, EX, MEM, WB

Lecture 20 (Single-Cycle CPU Control)

Control and Status Registers

Control and status registers (CSRs) are separate from the register file (x0-x31)
- Used for monitoring the status and performance
- There can be up to 4096 CSRs
Not in the base ISA, but almost mandatory in every implementation
- ISA is modular
- Necessary for counters and timers, and communication with peripherals
The CSRRW (Atomic Read/Write CSR) instruction ‘atomically’ swaps values in the CSRs and integer registers.
CSRRW reads the previous value of the CSR and writes it to integer register rd. Then writes rs1 to CSR
Pseudoinstruction csrw csr, rs1 is csrrw x0, csr, rs1
- rd=x0, just writes rs1 to CSR
Pseudoinstruction csrwi csr, uimm is csrrwi x0, csr, uimm
- rd=x0, just writes uimm to CSR
Hint: Use write enable and clock…

Datapath Control

Instruction Timing

Control Logic Design

Control Logic Truth Table

Control Realization Options

ROM
- “Read-Only Memory”
- Regular structure
- Can be easily reprogrammed
  - fix errors
  - add instructions
- Popular when designing control logic manually
Combinatorial Logic
- Today, chip designers use logic synthesis tools to convert truth tables to networks of gates

Combinational Logic Control

做一些逻辑运算,,,

记得太多了，这个笔记先到这里，接下来会新开一个笔记做