这是一个CS61C的笔记,由于是边上课边记的,很多地方选择偷懒进行截图,,,
Lecture 01 (Intro)
6 Great Ideas in Computer Architecture
1. Abstraction (Layers of Representation/Interpretation)
- Maintain all these layers of abstraction that pass relevant things up to the upper layer while obfuscating things that we really don't care about (e.g. how the wires work inside microprocessor).
2. Moore's Law & the Principle of Scaling
3. Principle of Locality/Memory Hierarchy
局部性原理:
- 时间局部性:最近访问的数据很可能再次被访问
- 空间局部性:相邻数据很可能被一起访问
内存层次结构(缓存体系)正是基于此原理设计
4. Parallelism
并行化是突破单核性能瓶颈的关键方向,包含:
- 线程级并行(多核CPU) Thread-level parallelism
- 数据级并行(SIMD指令集) Data-level parallelism
- 任务级并行(GPU架构) Executed on a processor concurrently
Amdahl's Law: the principle of the longest pole.
阿姆达尔定律:系统加速比受限于串行部分的比例
公式:S = 1 / [(1-P) + P/N] (P为并行比例,N为处理器数量)
举例:若程序90%可并行,使用10核的极限加速比为 1/(0.1+0.9/10) ≈ 5.26倍
5. Performance Measurement & Feedback of Improvement Based on the Measurements
-
Matching application to underlying hardware to exploit.
Locality
Parallelism
Special hardware features, like specialized instructions (e.g. matrix manipulation) -
Latency/Throughput (metricize what we care about)
How long to set the problem up and complete it (or how many tasks can be completed in given time)
How much faster does it execute once it gets going
Latency is all about time to finish
6. Dependability via Redundancy
- Redundancy so that a failing piece doesn't make the whole system fail.
- Applies to everything from datacenters to storage to memory to instructions.
Lecture 02 (Number Representation)
数据从模拟到数字的转换流程
Tipically the data begins in the analog domain, and we need to then convert it to the digital domain. (Sample -> Quantize -> encode)
- Binary(b,0b10011010010)
- Octal(0,02322)
- Decimal(d)
- Hex(x,0x4d2)
Two's Complement(二进制补码)
- Used by all modern hardware.
- Roughly evenly split between positive and negative. (One more negative)
- Can still use MSB as sign bit.
- To negate: Flip the bits and add one.
Overflow (溢出与负溢出)
- The result of an arithmetic operation can't be represented by the (finite) hardware bits.
- Solution: add more bits.
Sign Extension(符号扩展)
- Want to represent the same number using more bits.
- For sign and magnitude: add 0's after the sign bit.
- For One's complement: copy MSB.
- For Two's complement: copy MSB.
Base Translation(进制转换)
skip.
Biased notation(偏置表示法)
- Take some unsigned value and add some bias to it to get the "actual" value. (Same range but change the location)
Two's complement tricks(补码技巧)
- -1 means all bits are set.
- Flip bits and add 1.
- May cause overflow when adding two positive or two negative numbers.
Lecture 03 (C Intro:Basic)
Why C?
- We can write programs that allow us to exploit underlying features of the architecture.
memory management, spacial instructions, parallelism.
Compilation: Overview
- C compilers map C programs direactly into architecture-specific machine code.
- For C, generally a two part process of compiling: .c files to .o files, then linking the .o files into executables.
C Syntax: main
- To get the main function to accept arguments, use this:
int main(int argc, char *argv[])
argc
will contain the number of strings on the command line (the executable counts as one, plus one for each argument).argv
is a pointer to an array containing the arguments as strings (more on pointers later).
C Syntax: True or False
False in C
- 0
- NULL
- Bollean types provided by C99's stdbool.h
True in C
- ...everything else... (Same idea as in Scheme: Only #f is false, everything else is true)
Typed Variables in C
- Must declare the type of data a variable will hold, E.g:
int var = 2;
- C:
int
should be integer type that target processor works with most efficiently - Only guarantee:
sizeof(long long) sizeof(long) sizeof(int) sizeof(short)
- Also,
short
16 bits,long
32 bits. - All could be 64 bits.
- encourage you to use
intN_t
anduintN_t
- Also,
- Constant is assigned a typed value once in the declaration; value can't change during entire execution of program
const float golden_ratio = 1.618; const int days_in_week = 7; const double the_law = 2.99792458e8;
- Enums: a group of related integer constants.
enum cardsuit {CLUBS,DIAMONDS,HEARTS,SPADES}; enum color {RED, GREEN, BLUE};
Typed Function in C
- You have to declare the type of data you plan to return from a function
- Return type can be any C variable type, and is placed to the left of the function name
- You can also specify the return type as void
- Also need to declare types for values passed into a function
- Variables and functions MUST be declared before they are used
int number_of_people () { return 3; } float dollars_and_cents () { return 10.33; }
Struct in C
- Typedef allows you to define new types.
typedef uint8_t BYTE; BYTE b1, b2;
- Structs are structured groups of variables e.g.,
typedef struct { int length_in_seconds; int year_recorded; } SONG; SONG song1; song1.length_in_seconds = 213; song1.year_recorded = 1994; SONG song2; song2.length_in_seconds = 248; song2.year_recorded = 1988;
Control Flow
- if-else
if (x == 0) y++; if (x == 0) {y++;} if (x == 0) {y++; j = j + y;}
- while
- while (expression) statement
- do statement while (expression);
- for
for (initialize; check; update) statement
- switch
switch (expression){ case const1: statements case const2: statements default: statements } break;
- Note: until you do a
break
statement things keep executing in the switch statement
- Note: until you do a
- goto
- But it can result in spectacularly bad code if you use it, so don't!
- An example
#include <stdio.h> #include <math.h> int main(void) { int angle_degree; double angle_radian, pi, value; printf("Compute a table of the sine function\n\n"); pi = 4.0*atan(1.0); /* could also just use pi = M_PI */ printf("Value of PI = %f \n\n", pi); printf("Angle\tSine\n"); angle_degree = 0;/* initial angle value */ while (angle_degree <= 360) { /* loop til angle_degree > 360 */ angle_radian = pi * angle_degree / 180.0; value = sin(angle_radian); printf ("%3d\t%f\n ", angle_degree, value); angle_degree += 10; /* increment the loop index */ } return 0; }
Lecture 04 (C Intro: Pointers, Arrays, Strings)
Undefined Behavior
- Heisenbugs: Bugs that seem random/hard to reproduce, and seem to disappear or change when debugging.
- Bohrbugs: repeatable.
Pointers
- An address refers to a particular memory location. In other words, it points to a memory location.
- Pointers: A variable that contains the address of a variable.
int *p; // Tells compiler that variable p is address of an int. p = &y; // Tells compiler to assign address of y to p. // & called the “address operator” in this context z = *p; // Tells compiler to assign value at address in p to z // * called the “dereference operator” in this context
Using Pointers Effectively
- Normally a pointer can only point to one type (int, char, a struct, etc.)
- void * is a type that can point to anything (generic pointer)
- Use sparingly to help avoid program bugs… and security issues… and a lot of other bad things!
- You can even have pointers to functions...
- int (*fn) (void *, void *) = &foo // function that takes two void* and return an int
- fn is a function that accepts two void * pointers and returns an int and is initially pointing to the function foo
- (*fn)(x, y) will then call the function
- NULL pointers: The pointer of all 0's, should never read or write a null pointer.
Arrays
- Declaration and initialization:
int ar[2]; int arr[] = {795, 635};
- Accessing elements
- ar[num]
- returns the element.
- Arrays are (almost) identical to pointers
- char *string and char string[] are nearly identical declarations
- They differ in very subtle ways: incrementing, declaration of filled arrays
- Key Concept: An array variable is a “pointer” to the first element
- Consequences:
- ar is an array variable but looks like a pointer in many respects (though not all)
- ar[0] is the same as *ar
- ar[2] is the same as *(ar+2)
- We can use pointer arithmetic to access arrays more conveniently.
- Declared arrays are only allocated while the scope is valid
\\ Incorrect code char *foo() { char string[32]; ...; return string; }
- Pitfall: An array in C does not know its own length, and bounds not checked.
- Consequence: We must pass the array and its size to a procedure which is going to traverse it.
Function Pointer Example
#include <stdio.h>
int x10(int), x2(int);
void mutate_map(int [], int n, int(*)(int));
void print_array(int [], int n);
int x2 (int n) { return 2*n; }
int x10(int n) { return 10*n; }
void mutate_map(int A[], int n, int(*fp)(int)) {
for (int i = 0; i < n; i++)
A[i] = (*fp)(A[i]);
}
void print_array(int A[], int n) {
for (int i = 0; i < n; i++)
printf("%d ",A[i]);
printf("\n");
}
int main(void)
{
int A[] = {3,1,4}, n = 3;
print_array(A, n);
mutate_map (A, n, &x2);
print_array(A, n);
mutate_map (A, n, &x10);
print_array(A, n);
}
Lecture 05 (C Intro: Memory Management)
Dynamic Memory Allocation
- C has operator
sizeof()
which gives size in bytes (of type or variable), and it knows the size of arrays. - To allocate room for something new to point to, use
malloc()
(with the help of a typecast and sizeof)
ptr = (int *) malloc(sizeof(int));- Now, ptr points to a space somewhere in memory of size (sizeof(int)) in bytes.
- (int *) simply tells the compiler what will go into that space (called a typecast)
- malloc is almost never used for 1 var
- ptr = (int ) malloc (nsizeof(int));
- This allocates an array of n integers.
- Once malloc() is called, the memory location contains garbage, so don’t use it until you’ve set its value.
- After dynamically allocating space, we must dynamically free it:
free(ptr);
- Even though the program frees all memory on exit(or when main returns), don’t be lazy!
- You never know when your main will get transformed into a subroutine!
- Two things that will cause your program to crash or behave strangely later on, and cause VERY VERY hard to figure out bugs:
- free() the same piece of memory twice
- calling free() on something you didn’t get back from malloc()
- managing the Heap: realloc(p, size)
- Resize a previously allocated block at p to a new size
- If p is NULL, then realloc behaves like malloc
- If size is 0, then realloc behaves like free, deallocating the block from the heap
- Returns new address of the memory block; NOTE: it is likely to have moved!
int *ip; ip = (int *) malloc(10*sizeof(int)); /* always check for ip == NULL */ … … … ip = (int *) realloc(ip,20*sizeof(int)); /* always check NULL, contents of first 10 elements retained */ … … … realloc(ip,0); /* identical to free(ip) */
- Linked List Example
- Let’s look at an example of using structures, pointers, malloc(), and free() to implement a linked list of strings.
struct Node { char *value; struct Node *next; }; typedef struct Node *List; /* Create a new (empty) list */ List ListNew(void) { return NULL; } /* add a string to an existing list */ List list_add(List list, char *string) { struct Node *node = (struct Node*) malloc(sizeof(struct Node)); node->value = (char*) malloc(strlen(string) + 1); strcpy(node->value, string); node->next = list; return node; }
Memory Locations
- globals: Data declared outside of any procedure; Similar to #1 above, but has “global” scope.
- C has 3 pools of memory
- Static storage: global variable storage, basically permanent, entire program run (can't change the size,e.g.)
- The Stack: local variable storage, parameters, return address (location of “activation records” in Java or “stack frame” in C)
- The Heap (dynamic malloc storage): data lives until deallocated by programmer
The Heap (Dynamic memory)
- Large pool of meemory, not allocated in contiguous order
- back-to-back requests for heap memory could result blocks very far apart
- In C, specify number of bytes of memory explicitly to allocate item
- Heap Management Requirements
- want malloc() and free() to run quickly
- want minimal memory overhead
- want to avoid fragmentaation* - when most of our free memory is in many small chunks
K&R Malloc/Free Implementation
- From Section 8.7 of K&R
- Code in the book uses some C language features we haven't discussed and is written in a very terse style, don't worry if you can't decipher the code
- Each block of memory is preceded by a header that has two fields:
size of the block and
a pointer to the next block - All free blocks are kept in a circular linked list, the pointer field is unused in an allocated block
Choosing a block in malloc()
- If there are multiple free blocks of memory that are big enough for some request, how do we choose which one to use?
- best-fit: choose the smallest block that is big enough for the request
- first-fit: choose the first block we see that is big enough
- next-fit: like first-fit but remember where we finished searching and resume searching from there
When Memory Goes Bad
- Don't return pointer into the stack
- Don't forget realloc() can move data
- Don't free the wrong stuff
- Don't do double-free
- Don't lose the initial pointer (Memory Leak)
- Valgrind to the rescue
Lecture 06 (Floating Pointer)
Basics & Fixed Point
- Representation of Fractions with Fixed Pt. (e.g. 10.0101 = 2 + 0 + 0 + 1/4 + 0 + 1/16)
Floating Point
- pass in another input which tells where the binary point might be.
IEEE 754 Floating Point Standard
- Note: 0 has no leading 1, so reserve exponent value 0 just for number 0
- IEEE 754 uses "biased exponent" representation
Calculate: (-1)^ S (1+ Significand) 2 ^(Exponent-127)
Special Numbers
- Representation for : all Exponent is all 1's and Significand is all 0's.
- Representation for 0 : all Exponent and Significand is 0.
Exponent Significand Object 0 0 0 0 nonzero Denorms 1-254 anything +/-fl. pt. # 255 255 +/- 255 nonzero NaN - Can use the significand bits of NaN to identify which NaN it is.
- The gap between 0 and the least number (with Exponent not all 0's) is , which is between the least number and the second least number, that's why we have denorms.
- Denorms don't have an implicit leading 1 but a leading 0, and the inplicit exponent is -126, then the least number is .
- Everytime we cross to a bigger Exponent, we double the stride. (don't forget )
Examples, Discussion
- representation of ( ) : 0 | 0 1 1 1 1 1 0 1 | 0 1 0 1 |
- The FP don't have add associative. (e.g. ) Because PF
approximates
real result.
Precision and Accuracy
Precision
is a count of the number bits in used to represent a value.Accuracy
is the difference between the actual value of a # and its computer representation.- High precision permits high accuracy but don't guarantee it.
Rounding
Addition
- More difficult, can't just add significand.
- De-normalize to match exponents
- Add significands to get resulting one
- Keep the same exponent
- Normalize (possibly changing exponent)
Other Floating Point Representations
- Double Precision (binary64)
- Quad-Precision (binary128)
- Oct-Precision (binary256)
- Half-Precision (binary16 / bfloat16 which keeps the same bits for Exponent)
- u-bit : tell whether the number is exact or range
Lecture 07 (RISC-V: Intro)
- The set of instructions a particular CPU implements is a Instruction Set Architecture (ISA). (e.g. ARM, Intel x86, IBM Power, RISC-V...)
- Instruction set for a particular architecture (e.g. RISC-V) is represented by the Assembly language.
- Each line of assembly code represents one instruction for the computer.
Assembly Variables: Registers
- Assembly cannot use variables.
- Assembly operands are registers. (operations are performed on the data that sits inside registers)
Benefits: Since registers are directly in hardware, they're very fast
- Processor is the one that orchestrates loading memory from the I/O devices and sending the data from the memory to the output.
- Inside the datapath, the mainn elements are those registers and the execution unit, typically called the artihmetic logic unit.
- The processer communicates to the memory by issuing addresses and reading data from the memory or writing the data to the memory.
- Assert the enable signal to decide whether to write to the memory.
- Since registers are in hardware and right next to the execution, there is a predetermined number of them.
- 32 registers in RISC-V
- Each RISC-V register is 32 bits wide (in RV32 variant)
- Groups of 32 bits called a word in RV32
- Registers are numbered from 0 to 31
- Referred to by number x0 - x31
- x0 is special, always holds value zero
- So only 31 registers able to hold variable values
- Each register can be referred to by number or name
- In assembly language, the registers have no type
- Operation determines how register contents are treated
- Operation determines how register contents are treated
Add/Sub Instructiond
- Syntax of Instruction: one two, three, four (rigid: 1 operator, 3 operands)
- one: operation name
- two: destination
- three: first source
- four: second source
add x1, x2, x3 sub x3 ,x4, x5
Immediates
- Immediates are numerical constants
addi x3, x4, 10 # the immediates should be less the 32 bits
- Syntax similar to add instruction, expect that last argument is a number instead of a register
- There is no Subtract Immediate in RISC-V (substitution is not commutative)
- We can change the sign of the immediate in add to implement sub immediate
- If we set the destination to x0, will do nothing
Lecture 08 (RISC-V: lw, sw, Decision 1)
- Register Footprint For a Particular Kernel: The certain number of registers we have to use for a paticular computational routine or for a compute kernel.
Loading from and Storing to Memory
- We work with laeger amounts of data, so we have to spill out of registers to memory.
- Memory can be viewed as a large one-dimensional array with the address acting as the index to that array starting at 0.
- To access a word in memory, processor must supply an address.
- Address is calculated by a processor and is used to point to a particular word in the memory.
- The address is typically specified as an offset to a base pointer.
- Little-endian: When we store bytes in words in memory, the least-significant byte in a word gets the smallest address.
Data Transfer Instructions
- Register is faster than memory (about 50-500 times faster /in terms of latency of one access - tens of ns)
- Load from memory to register
lw x10, 12(x15) # Reg x10 gwts A[3], 3 words is 12 bytes, x15 points to A[0]
- Store from Register to Memory
sw x10, 40(x15) # A[10] = h + A[3], the offsets must be multiple of 4
- Loading and Storing bytes
lb x10, 3(x11) # copy the content to the low byte position fo the register x10, and the other 3 bytes is copied to "sign-extend" lbu x10, 3(x11) # copy with zero-extend to the other 3 bytes sb x10, 7(x11) # the same syntax as lw and sw but jest deal with Bytes not word
Decision 1: Decision Making
- if-statement
beq reg1, reg2, L1 # the label is essential an immediate value # go to statement labeled L1 if (value in reg1) == (value in reg2), otherwise go to the next statement beq reg1, reg2, L1 # go to statement labeled L1 if (value in reg1) != (value in reg2)
- Types of Branches (Branch - change of control flow)
- Conditional Branch - change control flow depending on outcome of comparison
- Branch if equal (
beq
) or not equal (bne
) - Branch if less than (
blt
) and be greater than or equal (bge
) - the unsigned versions (
bltu
,bgeu
)
- Branch if equal (
- Unconditional Branch - always branch
- jump(
j
)j label
- jump(
- Conditional Branch - change control flow depending on outcome of comparison
- Example
bne x13, x15, Exit
add x10, x11, x12
Exit:
# if (i == j)
# h = g + h;
bne x13, x14, Else
add x10, x11, x12
j Exit
Else:sub x10, x11, x12
Exit:
# if (i == j)
# f = g + h;
# else
# f = g - h;
- Example
add x9, x8, x0 # x9 = &A[0]
add x10, x0, x0 # sum
add x11, x0, x0 # i
addi x13, x0, 20
Loop:
bge x11, x13, Done
lw x12 , 0(x9) # x12 A[i]
add x10, x10 , x12 # sum
addi x9, x9, 4 # &A[i+1]
addi x11, x11, 1 # i++
j Loop
Done:
# int A[20];
# int sum = 0;
# for (int i=0 ; i<20 ; i++)
# sum += A[i];
Lecture 09 (RISC-V: Decision 2)
and x5, x6, x7 # x5 = x6 & x7
andi x5, x6, 3 # x5 = x6 & 3
- There is no logical NOT in RISC-V (use
xor
with ) - Shift Left Logical (
sll
) and immediate (slli
)slli x11, x12, 2 # x11=x12<<2
- store in x11 the value from x12 shifted by 2 bits to the left, inserting 0's on right. (multiplication)
Program Execution
Helpful RISC-V Assembler Features
- Symbolic register names
- a0-a7 for argument registers (x10-x17) for function calls
- zero for x0
- Pseudo-instructions
- Shorthand syntax for common assembly idioms
- e.g. mv rd, rs = addi rd, rs, 0
- e.g. li rd, 13 = addi ra, x0 13
- e.g. nop = addi x0, x0, 0 # no operation
RISC-V Function Calls
- Six Fundamental Steps in Calling a Function
- Put arguments in a place where function can access them
- Tranfer control to function (jal)
- Acquire (local) storage resources needed for function
- Perform desired task of the function
- Put return value in a place where calling code can access it and restore any registers you used; release local storage
- Return control to point of origin, since a function can be called from several points in a program
- RISC-V Function Call Conventions
- Registers faster than memory, so use them
- a0-a7 (x10-x17): eight argument registers to pass parameters and two return calues (a0-a1)
- ra: one return address register to return to the point of origin (x1)
- Also s0-s1 (x8-x9) and s2-s11 (x18-x27): saved registers
jr: jump register, shorthand: ret = jr ra
ra is the return address
we don't use jump because it needs us to code the address for many times - Before:
1008 addi ra, zero, 1016 # ra = 1016 1012 j sum # goto sum
- After:
1008 jal sum # ra = 1012. goto sum, jump and link
- Syntax
jal rd, offset # rd: return destination(noamally PC+4)
jal rd, Label
jalr rd, rs1, offset
Lecture 10 (RISC-V Procedures)
Function Call Example
- Stack: where the old register values saved to restore them after function call
- Stack in memory, so need register to point to it:
sp
(stack pointer, x2)
int Leaf(int g, int h, int i, int j) { int f; f = (g + h) - (i + j); return f; }
- prologue - operation - epilogue
Nested Calls and Register Conventions
Resister Conventions
- CalleR: the calling function
- CallE: the function being called
- When callee returns from executing, the caller needs to know which registers may have changed and which are guaranteed to be unchanged.
Register Conventions
: A set of generally accepted rules as to which registers will be unchanged after a procedure call (jal) and which may be changed.- RISC-V function-calling conventions divides registers into two categories:
- Preserved across function call
- Caller can rely on values being unchanged
- sp (stack pointer), gp (global pointer), tp (thread pointer), "saved registers" s0-s11 (s0 is also sp)
- Not preserved across function call
- Caller cannot rely on values being unchanged
- Argument/return registers a0-a7, ra, "temporary registers" t0-t6
- Preserved across function call
Memory Allocation
- C ahs two storage classes: automatic and static
- Automatic variables are local to function and discarded when function exits
- Static variables exist across exits from and entries to procedures
- Use stack for automatic (local) variables that don't fit in registers
- Procedure frame or activation record: segment of stack with saved registers and local variables
Using the Stack
Memory Allocation
- When a C program is run, there are three important memory areas allocated:
Static
: Variables declared once per program, cease to exist only after execution completes - e.g., C globalsHeap
: Variables decleared dynamically via mallocStack
: Space to be used by procedure during execution; this is where we can save register values
- Where is the Stack in Memory
- RV32 convention (RV64/RV128 have different memory layouts)
- Stack starts in high memory and grows down
- Hexadecimal:
- Stack must be aligned on 16-byte boundary (not ture in previous examples)
- RV32 programs (text segment) in low end
- static data segment (constants and other static variables) above text for static variables
- RISC-V convention global pointer (gp) points to static
- RV32 gp =
- Heap above static for data structures that grow and shrink; grows up to high addresses
Summary
Lecture 11 (RISC-V Instruction Formats 1: Intro)
- Big idea: Stored-Program Computer
- Instructions are represented as bit patterns -> entire programs can be stored in memory to be read or written juat like data in seconds ("Von Neumann" computers)
- Everything like instructions and data are stored in memory, so they are addressable, we can access our data by loads and stores.
- C pointers can point to anything, keep the memory of program and data separate.
- One register keeps address of instruction being executed:
"Program Counter" (PC)
- Intel calls it Instruction Pointer (IP)
- Binary Compatibility
- Programs are distributed in binary form
- Programs bound to specific instruction set
- Different version for phones and PCs
- New machines want to run old programs ("binaries") as well as programs compiled to new instructions -> leads to "backward-compatible" instruction set evolving over time.
- Programs are distributed in binary form
- Instructions as Numbers
- Since data is in words, make instructions be fixed-size 32-bit words also (same 32-bit instructions used for RV32, RV64, RV128)
- One word is 32 bits, so divide instruction word into "fields"
- Each field tells processor clues what that instruction is -- what type is it and which registers does it operate on
- RISC-V define 6 basic types of instruction formats:
- R-format for register-register arithmetic operations
- I-format for register-immediate arithmetic operations and loads
- S-format for stores
- B-format for branches (minor variant of S-format)
- U-format for 20-bit upper immediate instructions
- J-format for jumps (minor variant of U-format)
R-Format Layout
- sub and sra need to perform sign extension, so their func7 are the same
- Example
I-Format Layout
Loads
S-Format Layout
- combine two parts of offset together to get 12 bits offset
Lecture 12 (RISC-V Instruction Formats 2)
B-Format Layout
- Branches read two registers but don't write to a register (similar to stores)
- Branches typically used for loops
- Recall: Instructions stored in a localized area of memory
- PC-Relative Addressing: Use the immediate field as a two's-complement offset to PC.
- Branches generally change the PC by a small amount
- Can specify 'unit' addresses from the PC (we can encode 12-bit offsets as immediates)
- It's an idea to multiply the offset by four bytes before adding to PC, thus we have 4 times greater reach than using byte offset
- Branch Caculation
- If we don't take the branch: PC = PC + 4
- If we do take the branch: PC = PC + immediate*4 (not ture in RISC-V)
- RISC-V Feature, n 16-bit Instruction
- Extension to RISC-V base ISA support 16-bit compressed instructions and also variable-length instructions that are multiples of 16-bits in length (save the memory)
- To enable this, RISC-V scales the branch offset by 2 bytes even when there are no 16-bit instructions
- Reduces branch reach by half and means that of possible targets will be errors on RISC-V processors that only support 13-bit instructions
- RISC-V conditional branches can only reach 32-bit instructions on either side of PC
- Actual RV32 Branch Caculation
- If we don't take the branch: PC = PC + 4
- If we do take the branch: PC = PC + immediate*2
RISC-V Immediate Encodeing
Upper Immediates
Long Immediates
- Does the value in branch immediate field change if we move the code?
- If moving individual lines of code, then yes
- If moving all of the code, then no (benefit of PC-relative addressing)
- What do we do if destination is instructions away from branch
- wisely use
j
to replace it (wider range)
- wisely use
U-Format for "Upper Immediate" Instructions
lui
writes the upper 20 bits of the destination with the immediate value, and clears the lower 12 bits- Together with
addi
to set low 12 bits, can create any 32-bit value in a register using two instructionslui x10, 0x87654 # x10 = 0x87654000 addi x10, x10, 0x321 # x10 = 0x87654321
addi
12-bit ummdiate is always sign-extended, if top bit is set, will subtract -1 from upper 20 bits- Pre-increment value placed in upper 20 bits, if sign bit will be set on immediate in lower 12 bits
- li x10, 0xDEADBEEF will de all of these
auipc
adds upper immediate value to PC and places result in destination register (used for PC-relative addressing)auipc x10, 0 # just store the current PC address in x10
J-Format Layout
jal
jalr
- Immediates here are not like in jal and in branches, because we are using the immediate format. We can't just assume that the least significant bit is 0, so there is no multiplication by two bytes.
- ret = jr ra = jalr x0, ra, 0
Lecture 13 (Compilation, Assembly, Linking, Loading)
Interpretation vs Translation
Interpreter
is a program that executes other programs- Directly executes a program in the source language.
- Interpret a high-level language when efficiency is not critical.
- Interpret machine code: simulate something to have more debugging information and make itself backward-compatible.
- Interpreter closer to high-level, so can give better error messages
- Interpreter slower, code smaller.
- Interpreter provides instruction set independence: run on any machine.
- Language
translation
gives us another option.- Converts a program from the source language to an equivalent program in another language.
- Translate to a lower-level language to increase performance.
- Translated/compiled code almost always more efficient and therefore higher performance.
- Translation/compilation helps “hide” the program “source” from the users.
Compiler
- Input: High-level language code (.c)
- Output: Assembly language code (.s)
- Note: output may contain pseudo-instructions
Assembler
- Input: Assembly language code (includes pseudo ops) (.s)
- Output: Object code, information tables (true assembly only) (.o)
- Reads ans uses directives
- Replace pseudo-instructions
- Produce machine language
- Creates object file
- Producing Mechine Language
- "Forward Reference" problem
- Branch instructions can refer to labels that are "forward" in the program
- Solved by taking two passes over the program
- First pass replaces the pseudo instructions and remembers position of labels
- Second pass uses label positions to generate code
- PC-relative jumps and branches
- j offset pseudo instruction expands to jal zero, offset
- Just count the number of instruction half-words between target and jump to determine the offset: position-independent code (PIC)
- References to static data
- la gets broken up into lui and addi (use auipc/addi for PIC)
- These require the full 32-bit address of the data (the final resting place in a.out, not relative to where it is in .o file)
- These can't be determined yet, so we create two tables
- Symbol Table: List of "items" in this file that may be used by other files (Labels / function calling, Data / anything in the .data section; variables which may be accessed across files)
- Relocation Table: List of "items" whise address this file needs (any absolute label jumped to: jal, jalr ; Any piece of data in static section)
- "Forward Reference" problem
- Object File Format
object file header
: size and position of the other pieces of the object filetext segment
: the machine codedata segment
: binary representation of the static data in the source filerelocation information
: identifies lines of code that need to be fixed up latersymbol table
: list of this file’s labels and static data that can be referenceddebugging information
- A standard format is ELF (except MS)
Linker
- Input: Object coed files, information tables (.o)
- Output: Executable code (a.out)
- Combines several object (.o) files into a single executable ("linking")
- Enable separate compilation of filse
- Changes to one file do not require recompilation of the whole program
- Steps (static link)
- Take text segment from each .o file and put them together
- Take data segment from each .o file, put them together, and concatenate this onto end of text segments
- Resolve references:Go through Relocation Table; handle each entry; file in all absolute address
- Resolving References
- Linker assumes first word of first text segment is at address 0x10000 for RV32
- Linker knows:
- Length of each text and data segment
- Ordering of text and data segments
- Linker calculates:
- Absolute address of each label to be jumped to (internal or external) and each piece of data being referenced
- To resolve references:
- Search for reference (data or label) in all “user” symbol tables
- If not found, search library files (e.g., for printf)
- Once absolute address is determined, fill in the machine code appropriately
- Output of linker: executable file containing text and data (plus header)
- Static vs. Dynamically linked library
- statically-linked approach
- Library is now part of the executable, so if the library updates, we don’t get the fix (have to recompile if we have source)
- Includes the entire library even if not all of it will be used
- Executable is self-contained
- Dynamically Linked Libraries (link at runtime)
- Space/time issues
- +Storing a program requires less disk space
- +Sending a program requires less time
- +Executing two programs requires less memory (if they share a library)
- –At runtime, there’s time overhead to do link
- Upgrades
- +Replacing one file (libXYZ.so) upgrades every program that uses library “XYZ“
- -Having the executable isn’t enough anymore
- Prevailing approach to dynamic linking uses machine code as the “lowest common denominator”
- Linker does not use information about how the program or library was compiled (i.e., what compiler or language)
- Can be described as “linking at the machine code level”
- This isn’t the only way to do it ...
- Space/time issues
- statically-linked approach
Loader
- Input: Executable Code (e.g., a.out for RISC-V)
- Output: (program is run)
- Executable files are stored on disk
- When one is run, loader’s job is to load it into memory and start it running
- In reality, loader is the operating system (OS)
- What Dose Loader Do:
- Reads executable file’s header to determine size of text and data segments
- Creates new address space for program large enough to hold text and data segments, along with a stack segment
- Copies instructions + data from executable file into the new address space
- Copies arguments passed to the program onto the stack
- Initializes machine registers
- Most registers cleared, but stack pointer assigned address of 1st free stack location
- Jumps to start-up routine that copies program’s arguments from stack to registers & sets the PC
- If main routine returns, start-up routine terminates program with exit system call
Example
Summary
Lecture 14 (Intro to Synchronous DIgital Systems)
Synchronous Digital Systems
- Hardware of a processor (e.g., RISC-V) is a Synchronous Digital System
- Synchronous:
- All operations coordinated by a central clock
- "Heartbeat" of the system
- Digital:
- All values represented by discrete values
- Electrical signals are treated as 1's and 0's; grouped together to form words
Switches
Transistors
- Use it as an amplifier or as a switcher
- Modern digital systems designed in CMOS act as voltage-controlled switches
- MOS: Metal-Oxide on Semiconductor
- C for complementary: normally-open and normally-closed switches
Signals and Waveforms
- Signals is only treated as 1 or 0, is transmitted over wires continuously
- Real world signals have noisy and delay
- Type of Circuits(Synchronous Digital Systems are made up
of two basic types of circuits):- Combinational Logic (CL) circuits
- Our previous adder circuit is an example.
- Output is a function of the inputs only.
- Similar to a pure function in mathematics, y = f(x). (No way to store information from one invocation to the next. No side effects)
- State Elements
- circuits that store information.
- Combinational Logic (CL) circuits
Conclusion
- Clocks control pulse of our circuits
- Voltages are analog, quantized to 0/1
- Circuit delays are fact of life
- Two types of circuits:
- Stateless Combinational Logic (&,|,~)
- State circuits (e.g., registers)
Lecture 15 (State, State Machine)
Accumulator
- Uses for State Elements
- As a place to store values for some indeterminate amount of time:
- Register files (like x0-x31 on the RISC-V)
- Memory (caches, and main memory)
- Help control the flow of information between combinational logic blocks.
- State elements are used to hold up the movement of information at the inputs to combinational logic blocks and allow for orderly passage.
- State elements are used to hold up the movement of information at the inputs to combinational logic blocks and allow for orderly passage.
- As a place to store values for some indeterminate amount of time:
Register Details: Flip-flops
- Inside register: n instances of a "Flip-Flop"
- Plip-flop name because the output flips and flops between and 0, 1
- D is "data", Q is "output". Also called "D-type Flip-Flop" (threr are other types)
Accumulator Revisited
Pipelining for Performance
- Maximum Clock Frequency
Max Delay = CLK-to-Q Delay + CL Delay + Setup Time
- Extra Registers are often added to help speed up the clock rate
- Insertion of register allows higher clock frequency
Recap of Timing Terms
- Clock (CLK) : steady square wave that synchronizes system
- Setup Time : when the input must be stable before the rising edge of the CLK
- Hold Time : when the input must be stable after the rising edge of the CLK
- “CLK-to-Q” Delay : how long it takes the output to change, measured from the rising edge of the CLK
- Flip-flopone bit of state that samples every rising edge of the CLK (positive edge-triggered)
- Register : several bits of state that samples on rising edge of CLK or on LOAD (positive edge-triggered)
Finite State Machine
- you have some states, and the arrows are the transitions
General Model for Synchronous Systems
- Collection of CL blocks separated by registers.
- Registers may be back-to-back and CL blocks may be back-to-back.
- Feedback is optional.
- Clock signal(s) connects only to clock input of registers.
Conclusion
lecture 16 (Combinational Logic)
高中学过了,不多赘述
Truth Table
Logic Gates
Boolean Algebra
- means or, means and, means not
- Used to see if two circuits are equiv
Laws of Boolean Algebra
Canonical forms
对真值表每个1输出的情况做或组合。(Sum of products / ORs of ANDs),然后用布尔运算简化。(distribution complementarity identy)
Lecture 17 (Combinational Logic Blocks)
Data Multiplexors
- 通过S的值决定A与B谁来驱动C的输出。
Arithmetic Logic Unit (ALU)
Adder/Subtractor
- 输入,输出,进位,很好理解
- 在two's complement里,当结果溢出表示范围时,计算会出现问题。
- XOR( , )=1时需要考虑overflow的情况。
- 减法器
- 当SUB是1,相当于既反转了B的输入,同时又加了1。
Lecture 18 (Single-Cycle CPU Datapath 1)
RISC-V Processor Design
- connect the software story with the hardware story
The CPU
- Processor (CPU): the active part of the computer that does all the work (data manipulation and decision-making)
- Datapath: portion of the processor that contains hardware necessary to perform operations required by the processor (the brawn)
- Control: portion of the processor (also in hardware) that tells the datapath what needs to be done (the brain)
Building a RISC-V Processor
- Solution: break up the process of “executing an instruction” into stages, and then connect the stages to create the whole datapath
- smaller stages are easier to design
- easy to optimize (change) one stage without touching the others (modularity)
5 Stages of the Datapath
-
- Instruction Fetch (IF)
get the instruction from the memory and stores it in the processor
-
- Instruction Decode (ID)
looks at that instruction and determines what it is
-
- Execute (EX) - ALU (Arithmetic-Logic Unit)
perform that operation (commonly may be done by the ALU)
-
- Memory Access (MEM)
access the memory if need
-
- Write Back to Register (WB)
- Write Back to Register (WB)
- The mux here is to bypass that increment of PC and write a branch target if the branch is be taken. (根据"分支条件"决定是继续还是转向新的branch)
- be executed in one clock cycle
State and Sequencing
- RISC-V: two-read, single-write
State Required by RV32I ISA
R-Type Add Datapath
Datapath with Immediates
Lecture 19 (Single-Cycle CPU Datapath 2)
Supporting Loads
select (by WBSel) whrther it comes from ALU or memory
Datapath for Stores
determined by MemRW
Implementing Branches
Adding jalr to datapath
Adding jal
Adding U-Types
- Imm.Gen 的使用变了,没什么结构上的大变动
Conclusion
- 5 Phases of execution
- IF, ID, EX, MEM, WB
Lecture 20 (Single-Cycle CPU Control)
Control and Status Registers
- Control and status registers (CSRs) are separate from the register file (x0-x31)
- Used for monitoring the status and performance
- There can be up to 4096 CSRs
- Not in the base ISA, but almost mandatory in every implementation
- ISA is modular
- Necessary for counters and timers, and communication with peripherals
- The CSRRW (Atomic Read/Write CSR) instruction ‘atomically’ swaps values in the CSRs and integer registers.
- CSRRW reads the previous value of the CSR and writes it to integer register rd. Then writes rs1 to CSR
- Pseudoinstruction csrw csr, rs1 is csrrw x0, csr, rs1
- rd=x0, just writes rs1 to CSR
- Pseudoinstruction csrwi csr, uimm is csrrwi x0, csr, uimm
- rd=x0, just writes uimm to CSR
- Hint: Use write enable and clock…
Datapath Control
Instruction Timing
Control Logic Design
Control Realization Options
- ROM
- “Read-Only Memory”
- Regular structure
- Can be easily reprogrammed
- fix errors
- add instructions
- Popular when designing control logic manually
- Combinatorial Logic
- Today, chip designers use logic synthesis tools to convert truth tables to networks of gates
- Today, chip designers use logic synthesis tools to convert truth tables to networks of gates
Combinational Logic Control
- 做一些逻辑运算,,,
记得太多了,这个笔记先到这里,接下来会新开一个笔记做