Thoroughness of the Outline That being said, it did a fairly good job of preparing me for the Aug. '96 Exam, although at least one term appeared on the exam which was not present anywhere in the assigned syllabus ("data parallelism"). In addition, one other term was used on the exam which was not the standard term used in the readings ("line", referring to an entry in a cache). This outline was an attempt to summarize all of the general ideas in the Aug. 1996 syllabus reading BUT DOES NOT COVER the details of example systems. Some example systems are mentioned, but for others, the details were too important to try to summarize. For my purposes, the outline was made during a first phase of learning the basic concepts, and in a second phase, I went back and read about all of the example systems in the reading. None of those was mentioned on the actual exam, and most seem to have been removed from the Aug. '97 syllabus.
Changes in the Exam Syllabus The syllabus for the Aug. '97
exam differs significantly from the Aug. '96 one, but for the most
part, is a subset thereof.
Aug. '96 syllabus, but not on Aug. '97 syllabus:
Hwang & Briggs - very poorly-written text on parallelism and
interconnection networks with many examples of parallel hardware
Lipovski & Malek - a nice text on graph theory as applied to
networking with three examples of non-conventional architectures
Hennessy & Patterson Ch. 6 - Advanced Pipelining Techiniques
Hennessy & Patterson Ch. 9 - Input/Output
Hennessy & Patterson App. A - Computer Arithmetic
Tanenbaum Ch. 7 - UNIX
Tanenbaum Ch. 9-14 - Distributed Systems, occupying about the last 25%
of the outline This is the most substantial deletion of material from
the new syllabus without adding additional reading to cover the same
material.
Aug. '97 syllabus, but not on Aug. '96 syllabus:
Hennessy & Patterson Ch. 4 - Instruction-Level Parallelism - probably
a briefer and better account of topics from Hwang & Briggs
Hennessy & Patterson Ch. 7 - Interconnection Networks - probably
a briefer and better account of topics from Hwang & Briggs and
Lipovski & Malek
Hennessy & Patterson Ch. 8 - Multiprocessors - probably
a briefer and better account of topics from Hwang & Briggs
I'm offering this up as an act of altruism, with gratitude in mind
to the group that put together the C451 Table that helped me study for
that exam. If you appreciate this, and find it helpful, do something
for those who follow you. As with the C451 Table, this outline can't
replace the reading, but can only help check your knowledge.
Comments? Complaints? Compliments?
John Rehling's
Home Page
rehling@cogsci.indiana.edu
Good luck!
Hennessy and Patterson
Design
X in n% faster than Y means that Ty/Tx = 1 + n/100; performance = 1/T
Make the Common Case Fast
Amdahl's Law: Speedup = Tbefore/Tafter = 1/(%Same + %Diff/Speedup)
90-10 Rule: 90% of execution occurs in 10% of code
Locality of Reference- items close in memory or last access will be
close in next access
Trends: Transistors, Speed, Disk Density double in 3yrs; RAM density
quadruples; RAM delay and Disk delay decrease by 1/3 per decade
Memory Hierarchy:
Size Access MB/sec
Regstrs <1K 10ns 800
Cache <512K 20ns 200
RAM <512M 100ns 133
Disk >1GB 20ms 4
Software is flexible for special cases; can be faster than Hardware
Performance Measures and Cost
CPUtime = (#instructions * CPI)/Clock-Rate
MIPS: ignores code density issues
Relative MIPS: fix program and input
MFLOPS:FP not always relevant; need to relativize, because not all FP
operations are equal; peak MFLOPS rarely relevant
Evaluation: Kernels, Toy Benchmarks, Synthetic Benchmarks
Compiler can distort with Benchmark-specific optimizations
Marketer can distort with Hardware-specific optimizations
Performance ranking of 2 machines may reverse with new task
Cost: RAM cost drops exponentially; Cost of die varies with area^3;
Cost to Test, Cost to Package; Higher Performance/Cost usually means
higher cost
Hardware-independent metrics (code size) can't be reliable
Must involve all 3 of #instructions, CPI, Clock-Rate
CPI can't be looked up in table; Cache and RAM Availability &
Pipelining are key factors
Parallelism may increase throughput without helping response time
Instruction Sets
Operations - which?
Operands - How many? type? where? size?
Instruction Types
ALU, Data Transfer, Control, System (OS, VM), FP, Decimal, String
3 Instruction Set Types
Stack: 0 explicit operands; push & pop stack
Good model of nested calls; Short instructions
Random-access hard; Stack may be in slow memory; Memory Bottleneck
Accumulator: 1 explicit operand; load & store accumulator
Simple internal machine state; Short instructions
High memory traffic
Register: 2 or 3 explicit operands; load & store registers & RAM
Most general model of programming; Easy to program
Long instructions
Subtypes (memory operands, register operands)
Register-Register (0,3): uniform instructions; high #instructions
Register-Memory (1,1): few loads=good density; hogs encoding;
variable CPI
Memory-Memory (3,0): compact code; variation in instruction
length; memory bottleneck
Little-Endian vs. Big-Endian: only noticed with bytes & word access
Access Alignment: byte, halfword, word, doubleword
Small chunk: easy to program; tricky in hardware
Addressing Modes- example, meaning, common use
Register
ADD R4,R3 R4=R4+R3 registers
Immediate/Literal
ADD R4,#3 R4=R4+3 constants
Displacement/Based
ADD R4,100(R1) R4=R4+M[100+R1] local vars
Register Deferred/Indirect
ADD R4,(R1) R4=R4+M[R1] pointer
Indexed
ADD R3,(R1+R2) R3=R3+M[R1+R2] array
Direct/Absolute
ADD R1,(1001) R1=R1+M[1001] static data
Memory Indirect/Memory Deferred
ADD R1,@R3 R1=R1+M[M[R3]] pointer
Auto-Increment
ADD R1,(R2)+ R1=R1+M[R2]; R2=R2+D array
Auto-Decrement
ADD R1,-(R2) R2=R2-D; R1=R1+M[R2] array
Scaled/Index
ADD R1,100(R2)[R3] R1=R2+M{100+R2+R3*D] array
Frequency: Displacement, Immediate, Register, Deferred, Scaled
Sizes of operands for Displacement & Immediate big issues
Encoding Adressing Mode: In Opcode vs. Separate Address Specifier
Want many registers and modes, but small and uniform-size
Control: Jump, Branch, Call, Return
Relative vs. Absolute Addressing: Size of Displacement an Issue
Branch Implementation (may use more than one)
Condition Code: often set for free; extra states; forces order
Condition Register: easy to set; uses up a register
Compare & Branch: only one instruction, but it's a big one
EQ/NE most common case, so tempting to make it special
Branches are usually: EQ/NE, forward, and taken
Saving state of machine in registers: Callee or Caller?
Compilers & Optimization Types- Phase-Ordering an inherent problem
High-level: procedure integration
Local: do repeated computation once; propagate constants; reduce
stack height
Global: carry local optimizations across branches; tune loops
Register Allocation: graph coloring algorithm
Machine-Dependent Optimizations
More registers saves memory access, but complicates machine
Design Priorities: regularity; orthogonality; general primitives
(rather than special-case hacks); identify constants at compilation
Load/Store leads to less memory access than Memory-Memory
Optimizing can radically change instruction mix
Processor Implementation
Processor = Control + Datapath
State: Registers + PC + IAR + Status (flags) Register
Basic Steps of Execution
SEI = sign-extended immediate; Rx = register x; IR = instruction
register
1. IF : MAR <- PC; IR <- M[MAR]
2. ID : A <- Rs1; B <- Rs2; PC+=4
3. EX : MAR <- A + SEI; MDR <- Rd memory reference
3. EX : ALUoutput <- A op (B or SEI) ALU operation
3. EX : ALUoutput <- PC + SEI; cond <- A op 0 branch/jump
4. MEM: MDR <- M[MAR] or M[MAR] <- MDR load or store
4. MEM: if (cond) PC <- ALUoutput branch
5. WB : Rd <-ALUoutput or MDR
Hardwired Control
Finite-State Diagram
complexity = #states * #control inputs * #control outputs
Optimize by combining states
Number of possible states exponential, but a PLA serves in place of
a massive table; similar states should have similar table numbers
for more efficient encoding
Microprogrammed Control
Microintruction: string of bits specifying operands, operation, etc.
Control Store: the memory of the microprogram
Microprogram may have its own looping, or number of next instruction
may be included in each microinstruction
Reducing Hardware Costs: Encode Control Lines (2^k to k); Multiple
Microinstruction Formats with opcode for narrower
microinstructions; Some hardwired control (tricky)
Reducing CPI: Special-case microcode for exceptional cases;
Hardwired Control may get 2 microinstructions done at once;
Parallelism of wider instructions
Interrupts Complicate Control
State Diagram Must be modified to allow for this
Example Interrupts: I/O, OS, trace, breakpoint, overflow, page
fault, mis-aligned memory, memory protected, undefined instruction,
hardware/power failure
5 axes: synchronous/asynchronous, user/coerced, maskable/unmaskable,
within/between instruction, resume/terminate
Worst are within-resume interrupts (FP error, page fault)
Microcode is not always faster than Macrocode, due to caches
Filling up instruction space is not free (design costs; limit future
options)
Writable Control Store not too helpful (multiprogramming, VM,
difficulty of microprogramming)
Pipelining
Ideal Situation: 1 instruction/clock cycle, plus overhead
Changes to Datapath: PC updated perhaps as often as each clock cycle;
Separate MAR for instructions and data (IMAR & DMAR); Peak memory
bandwidth much higher, probably separate instruction/data caches
Structural Hazards
Some unit would be needed by two instructions at once, so one stalls
May be worth savings from not duplicating everything, especially if
structural hazard would be rare, and unit costly (floating point)
Data Hazards
WAW, WAR, RAW
Forwarding: if a register is changed after it is fetched, the new
value can be passed on directly to the instruction
Stall: if the new value has not been computed, wait until it has
Scheduling: Compiler (Static) or Hardware (Dynamic) can try to
minimize dependencies
Control Hazards
Instruction Pre-Fetch may be thwarted by a branch
Stall: wait until you know the new PC
Predict Taken, Using New PC (go back if you're wrong)
Predict Not Taken (go back if you're wrong); rarer but easier
Delayed Branch: have multiple next-instruction slots in pipeline,
and start all of them, keeping the right one
Pipelining and Interrupts
Precise- interrupt detected at ID stage, making a clean break
Imprecise can occur later
Save PC, and disable writes, letting pipeline clear
May take action as soon as interrupt occurs, or record all
interrupts in an interrupt vector, and check each instruction at WB
Pipelining and Instruction Set Complications
Condition Codes- must know last use of flags to predict branches
Multicycle Operations (Floating Point) likely to force stalls
Forwarding sometimes possible
Interrupts during multicycle operations- one operation may create
an interrupt after its successor has completed, possibly
destroying operands
Solutions: allow errors; queue all results and restore (history
file recalling change or future file delaying change); skip from
interrupt ahead, content with what was done out-of-order; never
issue a new instruction until old one is done if interrupts
possible
Dynamic Scheduling- Reduce data dependencies; Determine if and when
for branches early on
Scoreboard- issue each instruction as early as possible
4 Steps: Issue, Read Operands, Execution, Write
3 Parts of Scoreboard
Instruction Status- flag for each of four steps per instruction
Function Unit Status- busy flag, operation, destination, sources,
where and if sources are ready
Register Result Status- which operation will write to register
Tomasulo's Algorithm- distribution of hazard detection for complex
architectures (FP); WAR & WAW eliminated by copying registers
Reservation Stations at each functional unit,
Common Data Bus for all traffic
Instruction Set design and Pipelining Interact
Deeper Pipelines lead to Diminishing Returns
Optimized code should be the test case
Unusual instruction sequences may exacerbate hazards
Memory Hierarchy
Memory Stalls can significantly degrade CPI
Reads are much more common than writes
Miss Penalty = Access + Transfer; may be over 1000 clock cycles
Block-Frame Address: where block is
Block-Frame Offset: where in block data is
Larger Blocks leads to fewer misses, until Pollution Point
Key Questions For any Level of Hierarchy
Where can a block go?
How is the block found?
Which block is replaced on a miss?
How are changes written?
Cache- small; fast; upper level; ~64K, 64-byte blocks, 1-cycle access
Placement Options
Direct Map- only one possible home
Fully Associative- block can go into any frame
Set Associative- anywhere within fixed subset of all frames
Identification
Block has an Address Tag and Valid Bit (initially 0)
Address Tags are stored in an Associative Memory and all are
checked simultaneously at each memory access
May have three parts: Index (for set), Tag and Offset
Only Tag needed for identification
Word read in parallel with address check; used on hit
Replacement Algorithms
Random- easy to program and debug
Least-Recently Used- makes use of locality; overhead
First-In-First-Out- lackluster performance
Write Strategy
Write-Through- easy; no delay at read miss; acheives coherence
Write-Back- less total traffic; cache speed for typical write
Write Miss Strategy
Write Allocate loads whole block; makes use of locality
Write Around just does the write to lower level
Sources of Cache Misses
Compulsory- in the beginning, cache is vacant
Capacity- when cache is full
Conflict- there's space, but limited associativity wasted it
Caches may handle Data and Instructions separately or be Unified
Main Memory- DRAM, as opposed to SRAM for caches
Optimizations for Increasing Bandwidth
Wider Main Memory- get more per access
Interleaved Memory- speeds sequential or parallel independent
access; requires that expansion is a doubling or quadrupling
DRAM Row Buffer- square root of memory size in bits; standard
Quadruples Bandwidth in One of Four Modes:
Nibble Mode- DRAM gives three extra bits per row access
SCRAM (Static Column)- get random bits while row is active
Page Mode- similar to SCRAM, with a strobe needed per access
Virtual Memory
Relocation swaps Pages from disk to Main Memory Page Frames; On Page
Fault, appropriate page must be swapped in; Control by OS; Pages
0.5K-8K; Protection keeps processes from intruding on each other;
Page Size set to balance Internal Fragmentation and External
Fragmentation
Placement- Always Fully Associative because miss penalty is huge
Identification- Page/Segment Number looked up in Table
(Segments usually have a two-word address to just one for Pages)
OS makes VM invisible; Page Table holds physical addresses for each
virtual page; Inverted Page Table saves space; Segment provides
separate address space for Data, Text or Stack
Replacement- LRU, to minimize misses; approximation has "Use Bit"
defining subset of old pages; cleared periodically
Segments have nonuniform size, and may be difficult to fit in
Writing- always Write Back with Dirty Bit, to minimize disk access
Larger Page Sizes- Efficient Transfer; Smaller Table; Internal
Fragmentation; Slow Startup
If Page Table requires paging, access time is doubled!
Translation-Lookaside Buffer holds common Page Table Entries in
Associative Memory/Cache; TLB checked first, before table
Protection- Base & Bound; User/Kernel Bits; Permission Bits; Rings
More Optimizations
Instruction Prefetch Buffers- fetch is faster than CPU use
Register Windows- separate memory to save at procedure call
Cache Optimizations- many possibilities, including Two-Level Caches;
Write Buffer for parallel read/write
Set Address Space Ceiling with Future in Mind
Set-Associative over Direct-Map if misses match overhead
Traces from Another Architecture Unreliable for Designing the Next
Adding Address Space Segments (slow) vs. New Ceiling (development)
Caches are not always register-fast; else, why have registers?
Input/Output
Liable to be the Bottleneck in the Architecture
Bandwidth can be helped by Parallelism without helping Response Time
Response Time only improved if queues can be processed faster
Faster Response Time leads to much faster User Interaction
Measures: I/O rate (#accesses/second) vs. Data Rate (bytes/second)
Types Of I/O Devices
3 Axes: Behavior(read once/read-write/write-only); Partner
(human/device); Data Rate (peak)
Magnetic Disks; DRAMs; Optical; Arrays (reliability an issue)
Graphics Display- CRT plus buffer
Networks
RS232- 0.3 to 19.2 Kbits/sec; terminals in star around central unit
Ethernet- 10Mbit/sec; no central control; 128b-1530b packets
Collision- Wait a Random Interval and Retry
Connect LANs with Bridge (2) or Gateway (many)
Circuit-Switch small networks; Packet-Switch large, long ones
Buses- CPU/memory (short and fast) or I/O (long, slow, maybe wide)
Bus width: may multiplex; Data Width; Transfer Size
Bus Masters: Single or Multiple; Synchronous/Asynchronous;
Split-Transaction can operate in Bursts or Packets
CPU and I/O
Memory-mapped I/O vs. Special Opcodes, registers, ports, etc.
Polling (busy waiting) vs. Interrupts (overhead can bog CPU down)
Hybrid- poll, but rarely
Direct Memory Access- block transfer via separate processing unit
Other Issues- Stale Caches (flush if tag is set); VM and DMA (Virtual
DMA one solution); Main Memory can be a cache for disks
Poor I/O Design can seriously impair throughput
I/O Processor to take load may not improve upon CPU
Cost of Media includes cost of Whole System
Seek Time: not just random arm position; Inertia & Locality factors
Computer Arithmetic
Unsigned Integer Arithmetic
Ripple-Carry Addition
Full-Adder: 3 bits in, 2 bits out
n Full Adders; O(n), but low constant
Radix-2 Multiplication: PA=A*B
P=0
Do n times:
If leftmost-bit(A)=1 then P+=B
Shift-right(PA)
Radix-2 Division: A=A/B; P=A mod B
Do n times:
Shift-left(PA)
P-=B
If P>=0 Then
low-bit(A)=1
Else
low-bit(A)=0
P+=B
Signed Integers
Sign Magintude: First bit is sign; rest is magnitude
One's Complement: to get -x, bitwise complement
Two's Complement: One's Complement + 1; x + (-x) = 2^n
Biased: fixed bias is defined as zero
Two's Complement Arithmetic
Addition: same algorithm as for unsigned integer; ignore carry out
Subtraction: x-y = x + (-y); -y: complement each bit of y and add 1
Multiplication: Hardware for Signed numbers such that Sign bit of P
shifts into High-order bit of P
Booth Recoding used for Add to P:
lo-bit(A) hi-bit(a) Add
0 0 0
0 1 B
1 0 -B
1 1 0
Overflow: Flag, Trap or Ignore
For unsigned, ignore so address math is wraparound
High bits of multiplication often ignored by high-level languages
Multiply/Divide Usually Slower than Add Subtract
May offer Multiply-Step instruction; Integer Multiply in FP or
autonomous unit
For multiple precisions of addition, carry may be put into flag, and
ADD instuction will add that to the next word up
Floating-Point Arithmetic- favored over Fixed-Point and Logarithmic
Significand & Exponent; precise standard not in place
IEEE Standard- 4, with various bits of precision & max exponent
Single (24, 127); Single Extended (>=32, >=1023),
Double (53, 1023), Double Extended (>=64, >=16363)
Exponent=255: zero fraction means infinity; else NaN
Exponent=0: zero fraction means zero; else denormal (tiny)
Gradual Underflow- minimum exponent holds firm
small numbers slowly lose significand
if x!=y, then(x-y)!=0
non-negatives ordered like unsigned integers
Most all of IEEE should be put into hardware, for speed
Rounding- extra bits: guard (high), round (low), sticky (overflow)
FP Addition
Put Biggest First
Shift Smaller to align
Add Fractions
If Carry Out, increment Exponent
Round
FP Multiply
Multiply Fractions
Add Exponents
Test for Overflow
May Trap to Software for Denormals
FP Divide: 2 laborious algorithms
Newton's iteration
Follow tangent of 1/x to intercept, iterating repeatedly
Goldschmidt's Algorithm
x=a; y=b; multiply both by r repeatedly; when y=1, x=a/b
Remainder: may do in software
Rounding Double Precision Answers to Single Precision
Accurate if the number of digits is less than half (as in IEEE)
Exceptions: Inexact (overflow); Invalid (sqrt(0), 0/0, etc.)
Optimizing Integer Addition
Carry Lookahead- shallower, wider combinatorial circuit
Carry-Skip- Lookahead, but in smaller blocks chained together
Carry-Select
for each block, assume both carry-in possibilities; pick winners
Time Space
Ripple O(n) O(n)
CLA O(ln n) O(n ln n)
Carry-Skip O(sqrt(n)) O(n)
Carry-Select O(sqrt(n)) O(n)
Optimizing Integer Multiply and Divide
Shift Over Zeros; Carry-Save Adder avoids duplicating work;
Higher-Radix; Many Adders; Division with CSA
FP Underflow is Frequent
Integer-FP Conversions are Frequent
Memory Bandwidth can be bottleneck of FP Unit
IEEE 0 and -0 are unequal
Hwang and Briggs
Introduction
Parallelism- Multiprogramming; Time-Sharing; Pipelining; Concurrency:
Simultaneity; Vectorization; Arrays
Multiplicity of Functional Units: Pipeline in CPU; I/O Processors;
Hierarchical Memory; Balancing Bandwidths; Multiprogramming
Speedup with n processors: log n to n/log n
Flynn's Classification: SISD, MISD, SIMD, MIMD
Feng: Bit-Slice vs. Word Parallelism: WSBS, WPBS, WSBP, WPBP
Haendler: Classify with
CPUs x pipelinable CPUs, ALUs/CPU x pipelinable ALUs/CPU,
word length x pipeline stages
Applications: Weather, Oceanography, Astrophysics, Economics,
Government Files, Engineering, Aerodynamics, AI, Remote Sensing,
Seismology, Seismic Petroleum Exploration, Fusion and Fission
Reactors, CAT, Genetics, Weapons, Chemistry, Physics, VLSI
Pipelining and Vector Processing
Pipelines- Arithmetic, Instruction, Multiple Processors (rare)
3 Axes: Unifunction/Multifunction, Static/Dynamic, Scalar/Vector
Reservation Tables: Unit use vs. Time
Interleaved Memory
S-Access: Get an access from each of several modules at once
C-Access: Activate a bank read from each of several modules on it
C/S Access: Many C-Accesses at once
Bandwidth: approaches maximum exponentially with probability that
next access will be to sequentially-following address
Types of Pipelines
Instruction Pipelines: fetch new instruction each clock cycle; pass
them down assembly line
Arithmetic Pipelines
CSA (Carry-Save Adder), as opposed to a Carry-Propagate Adder
Tree of CSAs ending in CPA speeds multiplication
Multifunction Pipelines
Complex Example does FP Multiply, SQRT, Divide, Remainder
Array Pipelines: matrix arithmetic
Principles of Designing Pipeline Processors
Instruction Prefetch and Branch Handling
Buffer can get instructions from beyond potential branch
Data Buffering and Busing Segments
Internal Forwarding: if register holds same thing as memory, use it
If first store result is immediately overwritten, don't perform it
Hazard Detection and Resolution- Data Forward with registers; stall
with memory
Job Sequencing and Collision Prevention: Stall as little as possible
Reservation Table derives: Forbidden Set of Latencies, Collision
Vector, State Diagram, which in turn finds Cycles (including Simple
Cycles and Greedy Cycles) which have an Average Latency each
Use the Cycle with the Minimum Average Latency
Multifunction Pipelines' Reservation Tables are overlays of each
supported function
Dynamic Pipelines and Reconfigurability- tricky design
Vector Processors
Several Operands produce a Vector or Scalar Result
Instruction: Operation, Base Address, Increment, Offset, Length
Masking Vector and Compress and Merge Operations provide control
Optimized by: Enrich Vector Instruction Set; Group Scalar Operations
Together; Suitable Vector Algorithms; Vectorized Compiler
Multiple Vector Task Dispatching: bin-packing heuristics
Pipelined Vector Processor Methods
Horizontal, Vertical, 2-D Blocks
SIMD Array Processors
SIMD Organization
Control Unit handles Scalar Operations and controls multiple
Processing Elements
CU may control Alignment Network between PEs and Memory
CU may control Interconnection Network, each PE has its own Memory,
CU has Masking Register to turn PEs on and off
Communication Axes: Operation Mode (Synchronous, Asynchronous,
Combined); Control Strategy (Centralized usually, Distributed);
Switching (Circuit, Packet); Topology (Static, Dynamic)
Interconnection Networks
Static: Linear, Ring, Star, Tree, Mesh, Systolic Array, Chordal
Ring, Completely Connected, n-Cube, n-Cube Connected Cycle
Dynamic
Single-Stage
N Input Selectors: demultiplexer, N Output Selectors: multiplexer
AKA Recirculating Networks: Low Connectivity (n log n)
Multistage- Feedforward Stages
Switch Boxes have 2 inputs and 2 outputs; may allow 2 or all 4
interconnection states; n log n size
Control: individual stage or individual box
Blocking- connections may impair further connections
Rearrangeable- enough routes can be found to enable all possible
connections (example: Benes)
Nonblocking- all possible connections possible without blocking
(examples: Clos, Crossbar Switch)
Examples: Mesh, Cube, Barrel-Shifter, Omega
Parallel Algorithms
Matrix Multiplication: do whole row at a time
MergeSort in O(n) time
Associative Array Processors
Content-Addressible Memory in 2-D alignment
Mask selects bits in Comparand; matches tagged in Indicator
Bit Parallel- one cycle does everything (PEPE)
Bit Serial- one bit slice at a time (STARAN)
Distributed Logic- character fields (PEPE)
Word-Organized- all bits are equal
Algorithms: Maxima, Minima, Mean; Equal, Not-Equal, Similar,
Proximate (within range); Less-Than, Greater-Than; Nearest-Below;
In-Range (open or closed); Asceding/Descending Sort
SIMD Performance Enhancement
Parallel Memory Organization- number of modules should be relatively
prime to distance between successive accesses
Array Processing Languages
Multi-SIMD Machines not found yet
Multiprocessor Architecture and Programming
MIMD Architecture
Loosely Coupled (message-transfers in a Distributed System)
Example: Cm* with Hierarchical Architecture
Tightly Coupled (shared main memory)
Examples: Crossbar Switch, Fully-Connected Bipartite
Design Issues
Process Recoverability; Efficient Context Switching; Large VM and
Main Memory; Synchronization Primitives; Good Communication
Instructions for: Procedure Linking; Looping; Parameter
Manipulation; Multidimensional Arrays; Address Range-Checking
Interconnection Networks
Buses: Time-Shared; cheap, low capacity, catastrophic failure
Priority; Round-Robin; Least-Recently-Used; Daisy Chain; FCFS;
Polling; Independent Requests
Crossbar Switch- high bandwidth, efficient, expensive
Multiport Memory- all switches at memory units; high rate, very
expensive
Multistage Networks- feedforward with 2 x 2 switches
Banyan: Partial-Order in Distinct Levels
Complete connection at O(n log n) size
Delta Network: a^n x b^n has n stages of a x b Crossbar Modules
High cost
Parallel Memory Organizations
Interleaving best when sharing is high
L-M Organization: l C-Accesses with m modules each
high bandwidth of a cache fairly cheap
Cache Coherence a problem if caches are private
Static Coherence Check: tag items as Shared or Private; only cache
Private
Dynamic Coherence Check: at write to shared cache, Write-Through
and invalidate all other cached versions; global tables for
Present and Modified info for each block by each processor
Multiprocessor Operating Systems
Master-Slave- one shall rule; slaves make requests; eliminates
problems of shared tables; single point of failure
Separate Supervisors- to each his own; sharing is in files; some
tables are shared, some global; replicating code wastes memory;
failure restart tricky
Floating Supervisor- take turns; best load balancing; most
complicated; table access conflicts a problem; may divide services
between many supervisors
Exploiting Multiprocessing for Parallelism
Languge Features
FORK starts child; JOIN waits for child to end
cobegin... coend
parfor handles linear row of array
Critical Sections set off by: csect VAR do A
Explicit Parallelism: user decides
Implicit Parallelism: compiler finds
Partitioning divides program into tasks
Assignment hands them to processors
Lipovski and Malek
Introduction
SIMD and MIMD
Granularity: within instruction is fine; multi-instruction is coarse
Communication
Direct (wires), Indirect (CPUs) or Switched (other hardware)
Computational Energy
Unit of Computational Power: UCP
Measure UCPs in Gates or Processors, but don't mix
Passive Circuits, as in memory, may inflate #gates
Efficiency: ratio of input bits to output bits
Useful only in comparison of Equivalent systems
Theorem: Efficiency of two Equivalent systems is inversely
proportional to the ratio of input energies of the systems
Theorem: if all UCPs are being used, Efficiency varies with unit
speed
First-Order Analysis: add Computation
Second-Order Analysis: add Computation, Communication and Control
Pipelining superior to Parallelism if units are specialized
if units are the same, Parallelism is better
Starvation Energy: unit unavailable for use
Examples: the end of the pipeline when pipeline starts up; FP unit
when no FP operations are being performed
Use specialized unit if Starvation Energy is less than Energy of
general unit computing it; obviously, depends upon frequency of use
Inductive Architecture: scales up with speed, topology, energy all in
ratio to increased size of system
Graphs and Switches
Graph Definitions
Permutation Network: allows all paths from one set to another (or
itself); Nonblocking allows a whole permutation at once; Blocking
requires some paths to break for others to be formed
Circuit-Switched: able/disable flags create stable circuits for bulk
transmission
Packet-Switched: register holds data and passes it on when possible
Packet Train: several packets sent together
Message-Switched: Packet Trains of variable length
Active Node (nodes are UCPs) vs. Passive Node (nodes are wires)
Simplex (one-way), Half-Duplex (either way), Duplex (both ways)
Subgraph (some nodes, all their arcs); Partial "Spanning" Graph (all
nodes, some arcs); Partial Subgraph (some nodes, some arcs)
Elementary: no repeat nodes in path
Delay: length of longest elementary path from source to sink
Homomorphism: many-to-one, combining multiple units
kth Correspondence Set: k steps away from a set of nodes
Energy and Graphs
Switch Energy: delay * power; add Control Energy to get Total Energy
Theorem: given fixed elementary paths and probability of blocking,
packet-switching uses no more energy than circuit-switching;
nonblocked circuit-switching uses no more energy outside the switch
than packet-switching with same elementary paths
Theorem: in circuit-switching network with N UCPs, if delay>sqrt(N)
then architecture is not inductive
Blocking networks have higher energy due to unavailable resources
Circuit-Switching is much faster than Packet-Switching, because it
uses the Interconnection Network more efficiently;
Circuit-Switching is better if resources outside the network are
most expensive; Packet-Switching if Switching Energy is the main
cost; Packet Trains can be the worst of both
Transitive Graphs are fast and flexible but use a lot of power; the
Minimum-Power graph will always be intransitive
Weak Ordering: transitive reflexive
Partial Order: transitive reflexive antisymmetric
Equivalence: transitive reflexive symmetric
Hasse Diagram: intransitive irreflexive asymmetric
Trans. Red.: make it not make it not
Trans. Clos:. make it so make it so
Rings are Transitive Reductions and Homomorphic Reductions of
Fully-Connected graphs; have delays O(n); work well if
communication is localized to subsegments of the ring, or if each
node can be useful while waiting (like stand-alone machines);
lowest power consumption besides Trees
Trees are Hasse Diagrams of in-degree 1; great for addressing;
lowest possible power; congestion near root
Indirect Communication (passing through a third arc) saves
connections but causes congestion; Transitive Reduction a good
candidate: Single Path from each source to each sink a tempting
option...
Banyans and Their Kin
Lattices
Greatest Lower Bound (GLB): descendent of all lower-bounds
Least Upper Bound (LUB): ancestor of all upper-bounds
Lattice: a partial order, and any two nodes have a GLB and a LUB
Mesh: a grid rotated 45 degrees; sqrt(N) delays; well-suited for
some tasks, like those involving pixels
Cube and various hyper-meshes reduce energies and delays; not
inductive
Trees: lowest power; low energy; can be used as a bus
Banyan: a Hasse Diagram of a partial order, with exactly one path
from each source to each sink; fully-connect with sparse power
Examples: Tree, Crossbar Switch, Bus, Omega Network, joined pair of
Clos Networks
Regular: in-degree is constant (besides sources) and out-degree is
constant (besides sinks)
k-Level: each path from source to sink is of length k
Rectangular: same number of nodes at each level
(i, o, l) Banyan: in-degree, out-degree, levels
SW-Banyan: regular, k-level banyan such that any two nodes in the
same level project to and are projected to by the same nodes
Important Properties
Set of all paths from a source is a tree
Recursive Locality: 1-(o/i) of connections stay within block
Examples: Omega Network, Crossbar Switch, Bus/Star, Bitonic Sort
CC-Banyan: composition of fundamental graphs, each having i sources
and o sinks
Important Properties
Inverse tree addressing
Relative, distributed, Locality; Suited to applications where
communication into sets distributed around a circle
Examples: Barrel Switch, Crossbar Switch, Bus/Star
Derivatives of Banyans
Batcher Sort: many Bitonic Sorts
Clos Network: pair of Rectangular SW-Banyans
Banyan performs a "half" permutation; Clos a full permuatation
Double-Ended Banyan (resources at sinks) vulnerable to a single
failure; Single-Ended Banyan (resources at sources only) very
robust
Tanenbaum
Introduction
Operating System
Provides Clean Virtual Machine
Manages Resources
May run in Kernel Mode
Multiprogramming- so CPU isn't wasted on I/O wait; Timesharing
Processes- Process Table; Core Image; Shell; Child Processes; Signal;
uid and gid
Files- Directory (Root and Working); Path; Permission Bits; File
Descriptor handle; Special Files (Block or Character); stdin,
stdout, stderr; pipes
System Calls- library procedure lays out parameters, and makes a Trap
to the OS; Errors detected there; Results passed back
Architectures: Monolithic; Layered (layers of permission); Virtual
Machines (simulates the ugliness below); Client-Server (kernel just
handles message-passing)
Processes
Multiprogramming provides Pseudoparallelism
Race Conditions: 2 processes access shared memory, and result depends
upon which acts first
Deadlock: 2 or more processes wait for something that only one of the
other waiting processes can do
Starvation: processes are executing, but one or more never get to
some point in their execution
Mutual Exclusion: no 2 processes in Critical Section at once; with no
assumptions about speed or number of CPUs; no blocking by
non-critical section; no process should wait forever to run Critical
Section
Poor Solutions: Disable Interrupts (too powerful); Lock Variables
(race condition); Strict Alternation, TSL Instruction, and
Peterson's Solution (busy waiting); Sleep and Wakeup (race
condition)
Semaphores- generalize SLEEP and WAKE with decrement/increment of
variable; Binary Semaphore a toggle; others count; be sure to
decrement resource to be taken before taking it, increment resource
to be added after adding it, take control only immediately before
actually making the change; list of processes sleeping on each
Semaphore
Event Counters- Condition Variables manipulated with Read(E) probes,
Advance(E) increments, Await(E,v) compares; Event Counters cannot
decrease; again, bookkeeping is changed before "greedy" event, after
"generous" one Monitors- high-level language construct, rarely
available; easy to program, because compiler does the work;
condition variables can be WAITed upon, or SIGNALed
Message Passing- SEND(dest, &message) and RECEIVE(source, &message);
receiver may send Acknowledgement, messages numbered so repetition
is not harmful; Domains eliminate machine name ambiguity;
encrypted Authorization may be included; Mailboxes buffer messages
Choice of Race-Condition Prevention Method: Semaphores and Event
Counters low-level; Monitors rarely supported; Message Passing
inefficient on one machine, the others can't work in distributed
systems without shared memory
Equivalence of Primitives
Semaphores simulate entering and leaving Monitor; Semaphore and
mailbox per process simulate Message Passing
Monitors keep counter and list of processes to simulate Semaphore;
counter for Condition Variables of Message Passing
Messages can use a special Synchronization Process to simulate
Semaphores and Monitors
Classical Problems
Producer-Consumer: one provides items, one takes them away
Dining Philosophers: each of 5 thinks, then tries to get two forks
and eat before thinking anew
Readers and Writers: one writer, or multiple readers at a time
Sleeping Barber: customers wait for barber, who sleeps when idle
Process Scheduling
Run-to-Completion, AKA Nonpreemptive Scheduling
Round Robin- Quantum size main issue, about 100ms is right
Priority Scheduling- highest priority runs; can be set by who user
is, by least recent CPU time, pick a class and have Round Robin
within it; priorities may decay
Multiple Queues- several groups, with promotion and demotion
possibly based on recent CPU time; may give longer quanta (doubling
exponentially) to different queues
Shortest-First- optimal response time if all jobs arrive at once;
may estimate run length by past examples
Guaranteed Scheduling- meet needs, demands and requests if possible
Policy vs. Mechanism- user or parent process may set priority/need
Two-Level Scheduling- to get swapped into memory, and to run
Memory Management
Monoprogramming is for PCs only
Multiprogramming seeks right Quantum size
CPU Utilization- I/O wait reduced by switching to other processes
Utilization = 1 - p^n
p is probability of I/O wait; n is Degree of Multiprogramming
Fixed Partitions- one queue, or several for different size
partitions; keep at least one small one around
Relocation and Protection- Base and Limit hardware registers to keep
bounds on access
Swapping
Variable Partitions- Fixed can't handle growing
Compaction can eliminate Fragmentation, fast with special hardware
If processes tend to grow, maybe allocate extra initially
If processes have two segments that can grow, have them grow towards
each other
Methods
Bit Map- smaller unit: less waste, bigger map; simple; search for
free space slow
Linked List, in address order- find hole with First Fit, better
than Next, Best and Worst Fits; Quick Fit has a list of holes,
quick to find, but slow to merge
Buddy System- all blocks are powers of two, list kept; quick to fit
and quick to merge; Internal and External Fragmentation
Some systems take disk space from processes in memory
Fifty-Percent Rule: about one hole per process
Unused Memory Rule: fraction = k/(k+2), k is ave. hole/ave. process
Virtual Memory: if Page Fault occurs, get Page from Disk
Paging- a Page is a virtual thing; Page Frame is in Memory
Page Table- entry for each Page, in order
Present/Absent Bit; Page Frame for the Page, if P/A Bit is set;
Modified/Referenced bits; Caching-Disable for memory-mapped I/O;
Protection/Locking flags
Can be very large; mapping must be very fast
Multilevel- k Level Table has entries pointing to k+1 Level
Tables, only some of which need actually exist; allows
huge Address Space with limited storage, but requires multiple
memory accesses
Translation-Lookaside Buffer holds common Page Table Entries in
Associative Memory/Cache; TLB checked first, before table
Zero-Level Paging: all TLB; trap on a miss
Inverted Page Table: entry per Page Frame; always in Associative
Memory; more overhead on miss, but saves space with huge Page
Table for huge Address Space
Page-Replacement Algorithms
Optimal: dispose of longest-until-needed Page; impossible, but
simulated for experimentation
Not-Recently-Used: Reference and Modified Bits (former cleared by
ticks) define 4 classes; pick one from oldest class
FIFO- may well throw out useful pages; not used
Second Chance/Clock: first pass clears bit; kills if bit is unset
Least-Recently-Used- too much overhead to fully implement
Aging can shift right and set leftmost bit to 1; kill lowest
Hardware can maintain N x N matrix of usage
Modelling Page Replacement Algorithms
Reference String: sequence of page references
Distance String: time since last use of each page, or infinity
Flat distribution means no locality, and Page Faults plentiful
Stack Algorithms: top of memory at time T+1 same if you add another
Page Frame; FIFO not so
Belady's Anomaly: more memory can cause more faults without a
Stack Algorithm
Faults: F = SUMk[m+1...n](Ck) + Cinfinity, m Page Frames, n Pages
Design Issues for Paging Systems
Thrashing: incessant Page Faults; Locality helps prevent it
Demand Paging: get a Page when and if you need it
Working Set Model- Prepaging when swapping in reloads the pages in
use last time; 1 in high bits of Aging give Working Set
WSCLOCK uses Working Set non-membership to swap out
Global Page Allocation makes more sense than local, guaranteeing
each process an equal number of page frames; Page Fault Frequency
algorithm balances each process's share to minimize Page Faults
Page Size- small wastes less, especially for process with small
requirements, but needs large table; transfer is mainly disk
rotation so small pages are nearly as slow to load as large ones
Optimal for process size s and table entry size e: sqrt(2*s*e)
Implementation Issues for Paging Systems
Instruction Backup after Fault- hardware or software must save PC
Locking Pages in Memory- Memory-Mapped I/O
Shared Pages- especially program text
Backing Store- Process Table points to disk space; to deal with
growth, may allocate page-by-page or separate segments
Paging Daemons- periodically free up some page frames
Page Fault Handling: hardware saves PC and traps to kernel; kernel
saves registers and calls OS; find victim if necessary, and
swap victim out if dirty; context switch to other processes during
transfers; swap in and mark Page Table; back up faulting
instruction; schedule faulting process; kernel restores volatiles
Segmentation- many Address Spaces, distinguished by purpose, visible
to programmer; size variable and flexible; facilitates sharing and
protection; Checkerboarding fragmentation may be cured by
Compaction; may have Segmentation with Paging
File Systems: large, permanent storage of information
Naming- length, case-sensitivity, file extensions (.suffix)
Structure- sequence of Bytes, fixed-length Record Fields, Tree of
records searchable by a key
File Types- ASCII or Binary, Directory, Special: Block or Character
Sequential Access or Random Access
Attributes- owner, size, age, last change, creator, protection, etc.
Operations- CREATE, DELETE, OPEN, CLOSE, READ, WRITE, APPEND, SEEK,
COPY, RENAME, SEEK, GET-ATTRIBUTES, SET-ATTRIBUTES, etc.
Memory-Mapped Files: a copy in Main Memory instead of on disk; if
evicted by Page-Replacement, send the data to a real file; meshes
well with Segmentation; Problems: finding end of write, coherence,
size of file
Directories: a table of entries; Name-Attributes or Name-Pointer (to
Attributes); Hierarchy or DAG (trickier for deletion); Path Name,
Relative to Working Directory or Absolute
Operations- CREATE, DELETE, OPEN, CLOSE, READ, RENAME, LINK, etc.
File Implementation
Contiguous- easy bookkeeping; fast access; must know maximum size;
much Fragmentation
Linked List- pointer to next block (ruins power-of-2 size); slow
Linked List with Index- table in Memory holds pointers to next
block; faster, because pointers aren't on disk; table can be large
I-Nodes- each file has its own fixed-length table for blocks; if
more slots are needed, Indirect Blocks (single, double, etc.)
Directories- just a list, attributes and blocks as above, via
pointers or I-Nodes
Shared Files- Links make DAGs of the directory; may be a pointer to
the file's block information, or a Symbolic Link: just a pathname
Disk Space Management- because Contiguous is a bad idea
Block Size- small is less wasteful but slower with larger table;
512b, 1K or 2K common choices
Free Blocks recorded in Linked List on disk or the more-concise Bit
Map in memory
Disk Quotas- in UNIX, soft and hard quotas for each user
Reliability- bugs can happen in factory or field
Bad Blocks- keep a list or dummy file containing them
Backup with Incremental Dumps, tag Archive Bit, cleared with use
Consistency- check at boot; each block should be in some list
Performance Enhancements
Block Cache in memory; needs to be written back often; Write
Through for removable media
Position associated blocks close together, especially I-Nodes
Security- disasters, malfunctions, human errors, Trojan Horses,
Worms, Viruses, Trapdoors; need Authentication and to seek and
remove holes in security; have secure backup of data
Protection
Domain: a set of (Object, Rights) pairs
Access Control Lists: for each object, who has what rights
RWX bits an offshoot
Capabilities: for each user/process, what Objects they can
manipulate, and with what Rights; Generic Rights exist; Rights
Amplification lets system calls do more
Preventing Capability-List Tampering: hardware provided Kernel Bit
for all Memory Locations; keep Capabilities List in OS; encrypt it
Type Manager Module may define Rights
Revoke Access by having them point to an indirect object which can
be moved, or changing a long Object password, implicitly invoked
with each access
Protection Models- in model, seek paths to Unauthorized States
Covert Channels- vary CPU or device usage; sneak data into output
Input-Output
Device Controller the hardware; Device Driver the software
I/O Channel: optional additional bus
OS gives Controller minimal data in parameters; Controller returns
preamble, sector data, checksum
Memory-Mapped I/O: modest number of Controller registers mapped
directly to Main Memory
Direct-Memory Access: controller stores data in its own buffer, and
writes the buffer to memory; time lag in reads leads to Interleaved
blocks on disk (0, 4, 1, 5, 2, 6, 3, 7) so rotate coincides with
read time
I/O Software Principles
Information Hiding- Device-Independence and Uniform Naming;
Error-Handling as low-level as possible
Mounting: giving a pathname to a disk or other device
Synchronous (Blocking) vs. Asynchronous (Interrupt-Driven)
Synchronous easier to program, but Asynchronous more efficient, so
Asynchronous implementations that look Synchronous are ideal
Sharable (disks) vs. Dedicated (printers) Devices
Four-Level I/O Software
Interrupt Handlers- hide deep, maybe block with DOWN on a Semaphore,
WAIT on a Condition Variable or RECEIVE on a message; interrupt
procedure must tag back
Device Drivers- device-dependent code; abstract and concise command
converted to arcane and lengthy commands to Controller; may wait
for device to finish (drive) or not (screen)
Device-Independent I/O Software- uniform interface and naming;
Protection; Buffering; Mutual Exclusion; Error codes; May shunt
some to Device Driver for efficiency
User-Space I/O Software- stdio-type libraries and stand-alone
applications; Spooling and Spool Daemons
Disks- physical Sectors and abstract Blocks; may have RAM disks
Arm Scheduling- First-Come, First-Served; Shortest Seek First;
Elevator Algorithm (best)
Track-at-a-Time Caching- Driver can handle for CPU reads or
Controller for DMA reads
Clocks- slow ones tied to AC cycle or fast ones in crystal
Tick- interrupt that maintains Time of Day, Preemptive Scheduling,
Profiling, CPU Usage accounting
Watchdog Timer: shuts off disk motor or gives up on network wait
Terminals- interface hides details
RS-232 passes changes back and forth
Memory-Mapped Interface (character or bit) represents whole screen
Input comes in Raw (character) or Cooked (line); Echoed to screen
Output- cursor motion and escape sequences
Deadlock: 2 or more processes wait for an event that only one of the
other waiting processes can cause
Resources may be Preemptable or Nonpreemptable; process needing a
Nonpreemptable Resource may Block or get error code and fail
Deadlock Conditions- need all 4 for Resource-Related Deadlock
Mutual Exclusion: at most one process can hold resource at a time
Hold and Wait: processes holding resources can request more
No Preemption: resources cannot forcibly be taken from a process
Circular Wait: a circular chain of processes; each awaits the next
Graph Models
Process holds Resource: R -> P
Process requests Resource: P -> R
Cycle indicates Deadlock
Dealing with Deadlock
Ostrich Method: ignore problem and hope it doesn't happen
avoids limiting functionality
Detection and Recovery
With one Resource of each kind: find cycle in graph
With Multiple Resources
Matrices C (current allocation), A (available), R (requests)
LOOP:
Look for process, Pi, such that the ith row of R < ith row of A
If found, {add ith row of C to A; mark process i; goto LOOP}
All unmarked processes, if any, are Deadlocked
Recovery- Preemption (rob Peter to pay Paul); Rollback in time via
Checkpointing; Kill processes until Deadlock is gone
Avoidance: avoid Deadlock with clever scheduling
Resource Trajectories- some deadends in 2-D grid of
process-resource states
Safe: there is a way out; Unsafe: cannot avoid Deadlock
Banker's Algorithm: grant request if it leads to a Safe state;
else, postpone request
Requires that maximum needs are known in advance, so irrelevant
Prevention: eliminate one of the 4 conditions
Mutual Exclusion- never-past-capacity Spool, if possible
Hold and Wait- rarely possible; release & re-request may help
No Preemption- unacceptable for some devices (printers)
Circular Wait- strict priorities to processes
Other Issues
Two-Phase Locking: take all resources before giving any back
Non-Resource Deadlocks- mistake with Semaphore, for instance
Starvation- without proper scheduling, some process may never finish
Introduction to Distributed Systems
Advantages over Centralized System: Cost/Performance, Speed,
Reliability, Growth, Inherent Distribution for some tasks
Advantages over Isolated Multiple Units: Data Sharing, Device
Sharing, Communication, Flexibility in where work is done
Disadvantages: Software & Networking difficult; Security
Flynn: SISD, MISD (never), SIMD, MIMD
MIMD: Tightly-Coupled (Multiprocessors, Shared Memory) or
Loosely-Coupled (Multicomputers, Private Memory)
Bus-Based Multiprocessors- Cache Coherence via Write-Through or
(better) Snooping; 32-64 CPUs at most
Switched Multiprocessors- Crossbar Switch or Omega Network; NUMA =
Non-Uniform Memory Access lets local memory to work faster
Bus-Based Multicomputers- CPUs communicate; less traffic than
Shared Memory
Switched Multicomputers- Grid or n-Cube; Topology should suit task
Software- Tightly-Coupled- subtasks; Loosely-Coupled- separate tasks
Loosely-Coupled Software & Loosely-Coupled Hardware
NFS; Stateless Servers tempting
Tightly-Coupled Software & Loosely-Coupled Hardware
true Distributed Systems; Single-System Image; Global, Uniform
Communication; Consistent OS Kernel, System Calls and Protection
Tightly-Coupled Software & Tightly-Coupled Hardware
Multiprocessor Timesharing Systems; seem like general-purpose SISD
machines; single run queue; shared Main Memory; private caches
Design Issues
Transparency- Location, Migration, Replication, Concurrency,
Parallelism
Flexibility- Microkernel (Interprocess Communication, Memory
Management, Scheduling, I/O) shunts rest to User Space
Reliability- Fault Tolerance; 1 failure or n wreck system?;
Coherence; Security; Clean Transmission
Performance- Large Grain Size deemphasizes Overhead; in conflict
with Reliability
Scalability- centralized units, data or routing begs massive
failure; total Synchronization unlikely
Communication in Distributed Systems
Layered Protocols- OSI standard; add a header at each protocol;
layers added by Sender and peeled by Receiver
Physical (bits); Data Link (fixed-length frames; Sender-Receiver
bicker; Checksum at end); Network (routing and hops; X.25 or IP);
Transport (message to packets; TCP); Session, Presentation and
Application rarely supported
Client-Server Model: Clients use service Servers provide; less
overhead on simple, reliable LANs
Physical and Data Link levels, simple Request-Reply for Messages
SEND(dest, &message); RECEIVE(source, &message)
Headers define Sizes, Operations, Error Codes, Fields of Messages
Addressing- machine ASCII name or number; big random number
Blocking (CPU idle; still best); Nonblocking with copy (slow);
Nonblocking with Interrupt (hard to program)
Unbuffered (may lose a message); Temporary-Buffered (still may lose
a message); Mailboxes
Reliability: how many Acknowledgments and when
Remote Procedure Call- gets away from I/O-style communication
Marshall parameters into a Message; Kernel sends Message to Server;
Timer starts; Server Kernel unpacks parameters and executes call;
Result is returned
Parameter Passing- if representations are not uniform (Big Endian
vs. Little Endian), conversion must be done, to standard or host
format; Pointers: not supported or whole array passed, or as much
as is necessary (effectively Copy/Restore passing)
Dynamic Binding: Client finds a suitable Server, with Name, Version,
Handle (system-specific), Unique ID (sparse process identifier)
agreed upon
RPC Semantics in the Presence of Failures
Client cannot locate Server- exception via Signal Handler
Lost Request Messages- wait and resend
Lost Replay Messages- if Message is not Idempotent (capable of
re-transmissions without error), add identification number
Server Crashes- if the RPC call caused the crash, it would be nice
to know that; Exactly-Once semantics ideal; At-Most-Once or
At-Least-Once implementable
Client Crashes- Orphan processes left behind; may go back after
recovery and kill them all; have Epoch associated with each
re-boot, and kill processes from prior Epochs; Server may kill
processes if Epoch Broadcast is heard; limited amount of time may
be granted to each process in the first place, renewable
RPC Implementation Issues
Protocol, of so many possible ones
Connection (large network) or Connectionless (small, reliable)
Standard (free) or Custom (specialized) Protocol
Packet Size?
Acknowledgements- Stop-and-Wait vs. Blast; Selective Repeat asks
for resend only of what is lost; Overrun Errors the major hazard
Critical Path- some cost varies with Message size; others constant
Copying- in worst case, 8 copies necessary; Scatter-Gather chip can
concatenate two buffers; with VM, just changing the pointers may
do a copy; headers a big part of the overhead here
Timer Management- precision not too important; periodic Sweep of
Process Table's Timer Fields good answer
Problem Areas
Transparency with Global variables and large parameters; may have
to set maximum size; often only One-Way Pass is needed
Pipelines may translate poorly to Client-Server model
User I/O with Remote Process causes much traffic and overhead
Group Communication
Multicasting (to group address); Broadcasting (to all machines;
recipient decides whether or not to look at it); Unicasting (to
each machine in group, one-by-one)
Design Issues, beyond those for ordinary communication
Closed (only members can send; parallel processing) vs. Open
(replicated servers)
Peer (symmetric; no single point of failure; complex control)
vs. Hierarchical (subject to failure; matches task-decomposition)
Membership- in Server (easy; single point of failure) or
Distributed (each machine has own copy of list)
Member crash- hard for rest of network to know about
Synchronization of membership changes and message receipt
May so many machines go down that the group can't function?
Addressing- Group ID; Explicit List; Predicate
Send & Receive Primitives- for RPC, may separate Group and Single
primitives, since n replies for a procedure call is strange;
simple to merge the two primitive sets
Atomicity: all receive a message or none do; makes system easy to
program; can be implmented by requiring universal Reply
Message Ordering- difficult to have Global Time Ordering;
Consistent Time Ordering (Loosely Synchronous; either A or B is
seen first by all recipients) or Virtually Synchronous (A and B
are ordered only if it matters)
ISIS
ABCAST - Loosely Synchronous: send M1 with timestamp; get Replies
with timestamps; send COMMIT; all timestamps increasing
CBCAST- Virtually Synchronous: a vector of group machines is kept
in each machine, telling when last it has heard from the others;
the vector is sent out with each message; messages are buffered
if the Sender is more "experienced" than the Receiver
Overlapping Groups- issue is raised that two processes may get
Messages in different orders; hard problem to root out, and it may
not be worth the trouble
Scalability- with loops, must keep packets from circling and
multiplying endlessly; single points of failure very hazardous;
traffic congestion at centralized resource can be very bad;
synchrony too hard to attempt
Synchronization in Distributed Systems
Clock Synchronization
Logical Clocks: only sequence matters
Lamport's Algorithm: Clock = MAX(Messages, Clock) + 1
Physical Clocks: accurate within some bound
Cristian's Algorithm: query central Time Server periodically;
assume lag of 1/2 Reply-Time; adjust by changing milliseconds/tick
from 10 to 9 or 11
Berkeley Algorithm: Server daemon queries all machines; tells them
to get in line with average
Averaging Algorithms: each machine queries the others; gets in line
with average
Multiple External Time Sources- each server broadcasts a range
bounding the "correct" time; delays compensated for; outliers
thrown out; midpoint of intersection is The Time
Mutual Exclusion in Distributed Systems
Centralized: Coordinator gives permission to just one at a time;
Request, Reply, Confirmation; single point-of-failure; bottleneck;
dead Coordinator may resemble a busy one, and cause Blocking
Distributed: send request to everybody; unanimous approval (none
using or have dibs) required; n points-of-failure; unused!
Token-Ring: Token of permission is passed around; use it or pass it;
considerable traffic; detecting a dead process can lose Token, so
we may require announcing Token receipt
Election Algorithms
Bully: all processes ordered; Coordinator seems dead; some process
declares ELECTION to its superiors; higher living process holds
ELECTION itself; if no Reply, the Bully wins and announces that
Ring: ELECTION goes in a ring, like a chain letter; highest number
on list is the winner; second lap announces winner; quits
Atomic Transactions: higher-level abstraction for Fault Tolerance;
two processes work together and finally COMMIT work to Stable
Storage; if problem occurs, return to prior state (redundant disks);
BEGIN, END, ABORT, READ, WRITE; Serializable (Concurrent
Transactions do not interfere with each other); Atomicity
(Transaction is indivisible); Permanence (what's done is done)
Nested Transactions- sub-Transactions require nested copies of world
Storage Options
Private Workspace: pointers to parent blocks used for READ; local
copies made once WRITE starts; only at COMMIT is change to parent
blocks made
Writeahead Log: don't actually change data; record old value and
change you want to make; at COMMIT, make changes; at ABORT,
Rollback changes
Two-Phase Commit Protocol: Transaction Coordinator polls the lackies
and if a unanimous response comes in, a COMMIT is sent out;
decisions are written to logs every step of the way, saving state
in the face of multiple crashes
Concurrency Control Algorithms
Locking- recorded centrally or locally; multiple readers, one
writer; Granularity- small: more precise, more parallelism, more
locks, more expensive, more likely to deadlock
Two-Phase Locking: acquire all resources; COMMIT; write; release;
next process steps up; does not eliminate Deadlock
Optimistic Concurrency Control- act freely and worry about problems
if they come up; works great with Private Workspaces; busy system
can cause huge overhead
Timestamps: if Transaction is older than last access of file, ABORT;
Deadlock impossible
Deadlocks in Distributed Systems- very nasty; usually just 2
processes involved; usually Resource-Oriented
Ostrich: ignore problem; fine solution here
Avoidance: never attempted in Distributed Systems
Detection- ABORTing a transaction here easier to recover from than
KILLing a process in Single-Processor systems
Centralized- Coordinator gets reports immediately, periodically, or
on its request; graph reveals Deadlock
False Deadlock can result from incomplete or late Replies; with
Timestamps, Coordinator can poll each machine and make sure that
no older messages that can resolve Deadlock are coming
Distributed- blocked process sends a probe forward; if it returns
home, highest process number along the way is killed
Prevention
Wait-Die: older can wait for younger; younger killed if it tries to
wait for older
Wound-Wait: older can ABORT younger's transaction and Preempt the
resource; younger must wait for older
Processes and Processors in Distributed Systems
Threads AKA Lightweight Processes
Separate: PC, Stack, Child Threads, Registers, State
Share: Address Space, Global Variables, Open Files, Child Processes,
Signals, Semaphores, Accounting
Use: allow a Process to continue parallel work in the face of
Blocking System Calls; Threads in 1 Process may differ in role
Dispatcher/Worker, Team, Pipeline
Design Issues
Static (permanent within Process) vs. Dynamic (exit or be killed)
Shared Data- Mutexes (Binary Semaphore) for short-term locking of
Critical Section; Condition Variables for long-term resource waits
Global Variables tricky; maybe make local copies and dash coherence
Scheduling Algorithm- maybe let user pick
Implementation
User Space- may be no Kernel Implmentation; faster than Trapping to
Kernel; Customize Scheduling; Table & Stack space open-ended
Kernel Space- Blocking tricky to implement (Jacket around System
Call); Thread may keep CPU forever; Blocking is the whole point of
Threads, and that requires a Trap, anyway
Threads may mutually interfere via Shared Memory (like Copy
Buffers); Signals (like F4 in X)
Threads and RPC- with Local (same-machine) calls, just shift
pointers to data and save on Marshalling; put incoming Parameters
right into the Server Thread's Address Space
System Models
Workstation Model- continuum of how much to have on local disks;
less: cheap, uniformity & transparency, upgrade easy; much traffic;
no coherency problem
Idle Workstations- Server can advertise or Clients seek; may need to
interact re: File data, I/O, Time, etc.
When Owner returns, process may: Continue; Migrate; Die
Processor Pool- advantage of queue service; response time
T = 1/(potential/sec-demand/sec); 1 n-speed unit better RT than n
1-speed units, by factor of n; Reliable & Fault-Tolerant; uniform
response; advantage over Workstation Model only for heavy workload
Processor Allocation
Design Issues
Migratory (flexible) vs. Nonmigratory (simple)
Goal: Response Time or Response Ratio
Deterministic, Optimal vs. Heuristic, Good
Centralized vs. Distributed
Local vs. Global information in decision
Locstion- always Global; Sender vs. Receiver to initiate
Implementation Issues
Difficult to Measure Load accurately
Added Overhead and Complexity hard to Recoup
Stability- don't want processors to play tag with a process
Allocation Algorithms
Graph Theory: Processors are Nodes; Traffic is Arcs; Find better
Partition; Require Complete Information in Advance
Centralized Up-Down: Give Priority to Needy; Take from Hogs
Hierarchical: Bosses and Sub-Bosses; Complex; Information may be
Outdated
Distributed Heuristic: Probe several machines, and run process
elsewhere if a suitable Server is found; works well
Bidding: Economic Inspiration
Scheduling in Distributed Systems
Local: no complications
Co-scheduling: have related, communicating processes run
simultaneously; same quanta slices across many machines
Distributed File Systems
File Service: the interface for programmers and users
As in Single-Processor Systems- Attributes; Protection
Immutable Files- guarantees Coherence at price of utility
Basic Models- choice depends upon workload
Upload-Download: Client gets whole file from Server at once
Remote-Access: each READ/WRITE requested individually
Directory Service Interface- Hierarchical Tree vs. General Graph
Transparency- view from all machines the same?
Two-Level Naming- sometimes Symbolic and Binary (System use) Names
Binary Name can hold Location of File, or a Symbolic Link to that
Location, Protection, etc.
Semantics of File Sharing
UNIX: every operation immediately visible to all
Session: changes visible to others only when file is closed
Immutable Files: no changes at all; make new copies
Transactions: changes are all or nothing
Implementation
Usage Trends: Most Files <10K; READs outnumber WRITEs; Sequential
Access more than Random; Short File Lifespans; Sharing Rare; Most
Processes access few Files; File Classes useful
Servers and Clients: Same (general) or Different (optimized)
Directories and Files at Same Server (less traffic) or Different
(simpler)
Name Caching: remember where Files were before
Path Lookup- Iterative (Client does work) or Automatic (Servers are
lackies)
States or Stateless?
States- shorter Requests; better performance; readahead possible;
idempotency easier; File Locking possible
Stateless- Fault-Tolerance; no OPEN/CLOSE calls; saves Server
space; no limit on how many Open Files; Client crash no problem
Caching
Server Memory vs. Client Disk vs. Client Memory
Client Memory- Process Memory vs. Kernel Memory vs. Cache Manager
Unit- File (fast) vs. Block (saves space)
Key Parameters- Size, Number, Locality of Accesses; Hardware Speed
Cache Consistency
Immutable Files (like Text Segments) eliminates problem
Write-Through (saves READ traffic only)
Delayed Write (faster, but may confuse semantics)
Write-on-Close (Session Semantics)
Centralized Control (UNIX Semantics; not robust; scales poorly)
Client-Driven or Server's Unsolicited Message announces change
Replication- adds Reliability; eases Server bottleneck; should be
Transparent
Implementation Options
Explicit (multiple copies & lookups); maybe Group Primitives
Lazy (Server copies on its own time)
Update Protocols- problem if crash after only some WRITEs done
Primary Server: head honcho (single point-of-failure)
Voting: majority rules; with Ghosts: whoever's alive takes part
Trends in Distributed Systems- WORM; Fiber Optics; Scalability;
Wide-Area Networks; mobile users; Fault-Tolerance
Lessons Learned
Use Workstation power to save Server; Cache where possible; Exploit
Usage properties; Minimize Centrality, perhaps with Hierarchy; Blast
Transfers when possible