Systems Qualifying Exam Outline

Disclaimer: This outline was made to prepare for the Aug. (actually, Sept.) 1996 I.U. Qualifying Exam in Systems. It may not cover material in subsequent revisions to the syllabus, and questions may appear on the exam that are not adequately covered by this outline. The strongest claim I can make of this outline is that if you are not at all familiar with some topic in it, then you are not adequately prepared for the Aug. '96 material (although such material may not be required for subsequent versions of the exam).

Thoroughness of the Outline That being said, it did a fairly good job of preparing me for the Aug. '96 Exam, although at least one term appeared on the exam which was not present anywhere in the assigned syllabus ("data parallelism"). In addition, one other term was used on the exam which was not the standard term used in the readings ("line", referring to an entry in a cache). This outline was an attempt to summarize all of the general ideas in the Aug. 1996 syllabus reading BUT DOES NOT COVER the details of example systems. Some example systems are mentioned, but for others, the details were too important to try to summarize. For my purposes, the outline was made during a first phase of learning the basic concepts, and in a second phase, I went back and read about all of the example systems in the reading. None of those was mentioned on the actual exam, and most seem to have been removed from the Aug. '97 syllabus.

Changes in the Exam Syllabus The syllabus for the Aug. '97 exam differs significantly from the Aug. '96 one, but for the most part, is a subset thereof.

Aug. '96 syllabus, but not on Aug. '97 syllabus:
Hwang & Briggs - very poorly-written text on parallelism and interconnection networks with many examples of parallel hardware
Lipovski & Malek - a nice text on graph theory as applied to networking with three examples of non-conventional architectures
Hennessy & Patterson Ch. 6 - Advanced Pipelining Techiniques
Hennessy & Patterson Ch. 9 - Input/Output
Hennessy & Patterson App. A - Computer Arithmetic
Tanenbaum Ch. 7 - UNIX
Tanenbaum Ch. 9-14 - Distributed Systems, occupying about the last 25% of the outline This is the most substantial deletion of material from the new syllabus without adding additional reading to cover the same material.

Aug. '97 syllabus, but not on Aug. '96 syllabus:
Hennessy & Patterson Ch. 4 - Instruction-Level Parallelism - probably a briefer and better account of topics from Hwang & Briggs
Hennessy & Patterson Ch. 7 - Interconnection Networks - probably a briefer and better account of topics from Hwang & Briggs and Lipovski & Malek
Hennessy & Patterson Ch. 8 - Multiprocessors - probably a briefer and better account of topics from Hwang & Briggs

I'm offering this up as an act of altruism, with gratitude in mind to the group that put together the C451 Table that helped me study for that exam. If you appreciate this, and find it helpful, do something for those who follow you. As with the C451 Table, this outline can't replace the reading, but can only help check your knowledge.

Comments? Complaints? Compliments?
John Rehling's Home Page

Good luck!
Hennessy and Patterson

 X in n% faster than Y means that Ty/Tx = 1 + n/100; performance = 1/T
 Make the Common Case Fast
 Amdahl's Law: Speedup = Tbefore/Tafter = 1/(%Same + %Diff/Speedup)
 90-10 Rule: 90% of execution occurs in 10% of code
 Locality of Reference- items close in memory or last access will be
  close in next access
 Trends: Transistors, Speed, Disk Density double in 3yrs; RAM density
  quadruples; RAM delay and Disk delay decrease by 1/3 per decade
 Memory Hierarchy:
		Size	Access	MB/sec
	Regstrs	<1K	10ns	800
	Cache	<512K	20ns	200
	RAM	<512M	100ns	133
	Disk	>1GB	20ms	4

 Software is flexible for special cases; can be faster than Hardware

Performance Measures and Cost
 CPUtime = (#instructions * CPI)/Clock-Rate
 MIPS: ignores code density issues
  Relative MIPS: fix program and input
 MFLOPS:FP not always relevant; need to relativize, because not all FP
  operations are equal; peak MFLOPS rarely relevant
 Evaluation: Kernels, Toy Benchmarks, Synthetic Benchmarks
  Compiler can distort with Benchmark-specific optimizations
  Marketer can distort with Hardware-specific optimizations
  Performance ranking of 2 machines may reverse with new task
 Cost: RAM cost drops exponentially; Cost of die varies with area^3;
  Cost to Test, Cost to Package; Higher Performance/Cost usually means
  higher cost
 Hardware-independent metrics (code size) can't be reliable
 Must involve all 3 of #instructions, CPI, Clock-Rate
 CPI can't be looked up in table; Cache and RAM Availability &
  Pipelining are key factors
 Parallelism may increase throughput without helping response time

Instruction Sets
 Operations - which?
 Operands - How many? type? where? size?
 Instruction Types
  ALU, Data Transfer, Control, System (OS, VM), FP, Decimal, String
 3 Instruction Set Types
  Stack: 0 explicit operands; push & pop stack
   Good model of nested calls; Short instructions
   Random-access hard; Stack may be in slow memory; Memory Bottleneck
  Accumulator: 1 explicit operand; load & store accumulator
   Simple internal machine state; Short instructions
   High memory traffic
  Register: 2 or 3 explicit operands; load & store registers & RAM
   Most general model of programming; Easy to program
   Long instructions
   Subtypes (memory operands, register operands)
    Register-Register (0,3): uniform instructions; high #instructions
    Register-Memory (1,1): few loads=good density; hogs encoding;
     variable CPI
    Memory-Memory (3,0): compact code; variation in instruction
     length; memory bottleneck
 Little-Endian vs. Big-Endian: only noticed with bytes & word access
 Access Alignment: byte, halfword, word, doubleword
  Small chunk: easy to program; tricky in hardware
Addressing Modes- example, meaning, common use
 ADD R4,R3		R4=R4+R3		registers
 ADD R4,#3		R4=R4+3			constants
 ADD R4,100(R1)		R4=R4+M[100+R1]		local vars
 Register Deferred/Indirect
 ADD R4,(R1)		R4=R4+M[R1]		pointer
 ADD R3,(R1+R2)		R3=R3+M[R1+R2]		array
 ADD R1,(1001)		R1=R1+M[1001]		static data
 Memory Indirect/Memory Deferred
 ADD R1,@R3		R1=R1+M[M[R3]]		pointer
 ADD R1,(R2)+		R1=R1+M[R2]; R2=R2+D	array
 ADD R1,-(R2)		R2=R2-D; R1=R1+M[R2]	array
 ADD R1,100(R2)[R3]	R1=R2+M{100+R2+R3*D]	array

 Frequency: Displacement, Immediate, Register, Deferred, Scaled
 Sizes of operands for Displacement & Immediate big issues
 Encoding Adressing Mode: In Opcode vs. Separate Address Specifier
  Want many registers and modes, but small and uniform-size

 Control: Jump, Branch, Call, Return
  Relative vs. Absolute Addressing: Size of Displacement an Issue
  Branch Implementation (may use more than one)
   Condition Code: often set for free; extra states; forces order
   Condition Register: easy to set; uses up a register
   Compare & Branch: only one instruction, but it's a big one
  EQ/NE most common case, so tempting to make it special
  Branches are usually: EQ/NE, forward, and taken
  Saving state of machine in registers: Callee or Caller?

 Compilers & Optimization Types- Phase-Ordering an inherent problem
  High-level: procedure integration
  Local: do repeated computation once; propagate constants; reduce
   stack height
  Global: carry local optimizations across branches; tune loops
  Register Allocation: graph coloring algorithm
  Machine-Dependent Optimizations

 More registers saves memory access, but complicates machine
 Design Priorities: regularity; orthogonality; general primitives
  (rather than special-case hacks); identify constants at compilation
 Load/Store leads to less memory access than Memory-Memory
 Optimizing can radically change instruction mix

Processor Implementation
 Processor = Control + Datapath
 State: Registers + PC + IAR + Status (flags) Register
 Basic Steps of Execution
  SEI = sign-extended immediate; Rx = register x; IR = instruction
  1. IF : MAR <- PC; IR <- M[MAR]
  2. ID : A <- Rs1; B <- Rs2; PC+=4
  3. EX : MAR <- A + SEI; MDR <- Rd                memory reference
  3. EX : ALUoutput <- A op (B or SEI)             ALU operation
  3. EX : ALUoutput <- PC + SEI; cond <- A op 0    branch/jump
  4. MEM: MDR <- M[MAR] or M[MAR] <- MDR           load or store
  4. MEM: if (cond) PC <- ALUoutput                branch
  5. WB : Rd <-ALUoutput or MDR

 Hardwired Control
  Finite-State Diagram
   complexity = #states * #control inputs * #control outputs
   Optimize by combining states
   Number of possible states exponential, but a PLA serves in place of
    a massive table; similar states should have similar table numbers
    for more efficient encoding

 Microprogrammed Control
  Microintruction: string of bits specifying operands, operation, etc.
  Control Store: the memory of the microprogram
  Microprogram may have its own looping, or number of next instruction
   may be included in each microinstruction
  Reducing Hardware Costs: Encode Control Lines (2^k to k); Multiple
   Microinstruction Formats with opcode for narrower
   microinstructions; Some hardwired control (tricky)
  Reducing CPI: Special-case microcode for exceptional cases;
   Hardwired Control may get 2 microinstructions done at once;
   Parallelism of wider instructions

 Interrupts Complicate Control
  State Diagram Must be modified to allow for this
  Example Interrupts: I/O, OS, trace, breakpoint, overflow, page
   fault, mis-aligned memory, memory protected, undefined instruction,
   hardware/power failure
  5 axes: synchronous/asynchronous, user/coerced, maskable/unmaskable,
   within/between instruction, resume/terminate
  Worst are within-resume interrupts (FP error, page fault)

 Microcode is not always faster than Macrocode, due to caches
 Filling up instruction space is not free (design costs; limit future
 Writable Control Store not too helpful (multiprogramming, VM,
  difficulty of microprogramming)

 Ideal Situation: 1 instruction/clock cycle, plus overhead
 Changes to Datapath: PC updated perhaps as often as each clock cycle;
  Separate MAR for instructions and data (IMAR & DMAR); Peak memory
  bandwidth much higher, probably separate instruction/data caches
 Structural Hazards
  Some unit would be needed by two instructions at once, so one stalls
  May be worth savings from not duplicating everything, especially if
   structural hazard would be rare, and unit costly (floating point)
 Data Hazards
  Forwarding: if a register is changed after it is fetched, the new
   value can be passed on directly to the instruction
  Stall: if the new value has not been computed, wait until it has
  Scheduling: Compiler (Static) or Hardware (Dynamic) can try to
   minimize dependencies
 Control Hazards
  Instruction Pre-Fetch may be thwarted by a branch
  Stall: wait until you know the new PC
  Predict Taken, Using New PC (go back if you're wrong)
  Predict Not Taken (go back if you're wrong); rarer but easier
  Delayed Branch: have multiple next-instruction slots in pipeline,
   and start all of them, keeping the right one
 Pipelining and Interrupts
  Precise- interrupt detected at ID stage, making a clean break
  Imprecise can occur later 
  Save PC, and disable writes, letting pipeline clear
  May take action as soon as interrupt occurs, or record all
   interrupts in an interrupt vector, and check each instruction at WB
 Pipelining and Instruction Set Complications
  Condition Codes- must know last use of flags to predict branches
  Multicycle Operations (Floating Point) likely to force stalls
   Forwarding sometimes possible
   Interrupts during multicycle operations- one operation may create
    an interrupt after its successor has completed, possibly
    destroying operands
    Solutions: allow errors; queue all results and restore (history
     file recalling change or future file delaying change); skip from
     interrupt ahead, content with what was done out-of-order; never
     issue a new instruction until old one is done if interrupts
 Dynamic Scheduling- Reduce data dependencies; Determine if and when
  for branches early on
  Scoreboard- issue each instruction as early as possible
   4 Steps: Issue, Read Operands, Execution, Write
   3 Parts of Scoreboard
    Instruction Status- flag for each of four steps per instruction
    Function Unit Status- busy flag, operation, destination, sources,
     where and if sources are ready
    Register Result Status- which operation will write to register
  Tomasulo's Algorithm- distribution of hazard detection for complex
    architectures (FP); WAR & WAW eliminated by copying registers
    Reservation Stations at each functional unit,
    Common Data Bus for all traffic

 Instruction Set design and Pipelining Interact
 Deeper Pipelines lead to Diminishing Returns
 Optimized code should be the test case
 Unusual instruction sequences may exacerbate hazards

Memory Hierarchy
 Memory Stalls can significantly degrade CPI
 Reads are much more common than writes
 Miss Penalty = Access + Transfer; may be over 1000 clock cycles
 Block-Frame Address: where block is
 Block-Frame Offset: where in block data is
 Larger Blocks leads to fewer misses, until Pollution Point
 Key Questions For any Level of Hierarchy
  Where can a block go?
  How is the block found?
  Which block is replaced on a miss?
  How are changes written?

 Cache- small; fast; upper level; ~64K, 64-byte blocks, 1-cycle access
  Placement Options
   Direct Map- only one possible home
   Fully Associative- block can go into any frame
   Set Associative- anywhere within fixed subset of all frames
   Block has an Address Tag and Valid Bit (initially 0)
   Address Tags are stored in an Associative Memory and all are
    checked simultaneously at each memory access
   May have three parts: Index (for set), Tag and Offset
    Only Tag needed for identification
   Word read in parallel with address check; used on hit
   Replacement Algorithms
    Random- easy to program and debug
    Least-Recently Used- makes use of locality; overhead
    First-In-First-Out- lackluster performance
  Write Strategy
   Write-Through- easy; no delay at read miss; acheives coherence
   Write-Back- less total traffic; cache speed for typical write
  Write Miss Strategy
   Write Allocate loads whole block; makes use of locality
   Write Around just does the write to lower level
  Sources of Cache Misses
   Compulsory- in the beginning, cache is vacant
   Capacity- when cache is full
   Conflict- there's space, but limited associativity wasted it
  Caches may handle Data and Instructions separately or be Unified

 Main Memory- DRAM, as opposed to SRAM for caches
  Optimizations for Increasing Bandwidth
   Wider Main Memory- get more per access
   Interleaved Memory- speeds sequential or parallel independent
    access; requires that expansion is a doubling or quadrupling
   DRAM Row Buffer- square root of memory size in bits; standard
    Quadruples Bandwidth in One of Four Modes:
     Nibble Mode- DRAM gives three extra bits per row access
     SCRAM (Static Column)- get random bits while row is active
     Page Mode- similar to SCRAM, with a strobe needed per access

 Virtual Memory
  Relocation swaps Pages from disk to Main Memory Page Frames; On Page
   Fault, appropriate page must be swapped in; Control by OS; Pages
   0.5K-8K; Protection keeps processes from intruding on each other;
   Page Size set to balance Internal Fragmentation and External
  Placement- Always Fully Associative because miss penalty is huge
  Identification- Page/Segment Number looked up in Table
   (Segments usually have a two-word address to just one for Pages)
   OS makes VM invisible; Page Table holds physical addresses for each
    virtual page; Inverted Page Table saves space; Segment provides
    separate address space for Data, Text or Stack
  Replacement- LRU, to minimize misses; approximation has "Use Bit"
   defining subset of old pages; cleared periodically
   Segments have nonuniform size, and may be difficult to fit in
  Writing- always Write Back with Dirty Bit, to minimize disk access
  Larger Page Sizes- Efficient Transfer; Smaller Table; Internal
   Fragmentation; Slow Startup
  If Page Table requires paging, access time is doubled!
  Translation-Lookaside Buffer holds common Page Table Entries in
   Associative Memory/Cache; TLB checked first, before table
  Protection- Base & Bound; User/Kernel Bits; Permission Bits; Rings

 More Optimizations
  Instruction Prefetch Buffers- fetch is faster than CPU use
  Register Windows- separate memory to save at procedure call
  Cache Optimizations- many possibilities, including Two-Level Caches;
   Write Buffer for parallel read/write

 Set Address Space Ceiling with Future in Mind
 Set-Associative over Direct-Map if misses match overhead
 Traces from Another Architecture Unreliable for Designing the Next
 Adding Address Space Segments (slow) vs. New Ceiling (development)
 Caches are not always register-fast; else, why have registers?

 Liable to be the Bottleneck in the Architecture
 Bandwidth can be helped by Parallelism without helping Response Time
  Response Time only improved if queues can be processed faster
 Faster Response Time leads to much faster User Interaction
 Measures: I/O rate (#accesses/second) vs. Data Rate (bytes/second)
 Types Of I/O Devices
  3 Axes: Behavior(read once/read-write/write-only); Partner
   (human/device); Data Rate (peak)
  Magnetic Disks; DRAMs; Optical; Arrays (reliability an issue)
  Graphics Display- CRT plus buffer
   RS232- 0.3 to 19.2 Kbits/sec; terminals in star around central unit
   Ethernet- 10Mbit/sec; no central control; 128b-1530b packets
    Collision- Wait a Random Interval and Retry
    Connect LANs with Bridge (2) or Gateway (many)
   Circuit-Switch small networks; Packet-Switch large, long ones
 Buses- CPU/memory (short and fast) or I/O (long, slow, maybe wide)
  Bus width: may multiplex; Data Width; Transfer Size
  Bus Masters: Single or Multiple; Synchronous/Asynchronous;
   Split-Transaction can operate in Bursts or Packets
 CPU and I/O
  Memory-mapped I/O vs. Special Opcodes, registers, ports, etc.
  Polling (busy waiting) vs. Interrupts (overhead can bog CPU down)
   Hybrid- poll, but rarely
  Direct Memory Access- block transfer via separate processing unit
 Other Issues- Stale Caches (flush if tag is set); VM and DMA (Virtual
  DMA one solution); Main Memory can be a cache for disks

 Poor I/O Design can seriously impair throughput
 I/O Processor to take load may not improve upon CPU
 Cost of Media includes cost of Whole System
 Seek Time: not just random arm position; Inertia & Locality factors

Computer Arithmetic
 Unsigned Integer Arithmetic
  Ripple-Carry Addition
   Full-Adder: 3 bits in, 2 bits out
   n Full Adders; O(n), but low constant
  Radix-2 Multiplication: PA=A*B
   Do n times:
    If leftmost-bit(A)=1 then P+=B
  Radix-2 Division: A=A/B; P=A mod B
   Do n times:
    If P>=0 Then
 Signed Integers
  Sign Magintude: First bit is sign; rest is magnitude
  One's Complement: to get -x, bitwise complement
  Two's Complement: One's Complement + 1; x + (-x) = 2^n
  Biased: fixed bias is defined as zero
 Two's Complement Arithmetic
  Addition: same algorithm as for unsigned integer; ignore carry out
  Subtraction: x-y = x + (-y); -y: complement each bit of y and add 1
  Multiplication: Hardware for Signed numbers such that Sign bit of P
   shifts into High-order bit of P
   Booth Recoding used for Add to P:
lo-bit(A) hi-bit(a) Add
0         0         0
0         1         B
1         0         -B
1         1         0
 Overflow: Flag, Trap or Ignore
  For unsigned, ignore so address math is wraparound
 High bits of multiplication often ignored by high-level languages
 Multiply/Divide Usually Slower than Add Subtract
  May offer Multiply-Step instruction; Integer Multiply in FP or
   autonomous unit
 For multiple precisions of addition, carry may be put into flag, and
  ADD instuction will add that to the next word up

 Floating-Point Arithmetic- favored over Fixed-Point and Logarithmic
  Significand & Exponent; precise standard not in place
  IEEE Standard- 4, with various bits of precision & max exponent
   Single (24, 127); Single Extended (>=32, >=1023),
   Double (53, 1023), Double Extended (>=64, >=16363)
   Exponent=255: zero fraction means infinity; else NaN
   Exponent=0: zero fraction means zero; else denormal (tiny)
   Gradual Underflow- minimum exponent holds firm
    small numbers slowly lose significand
    if x!=y, then(x-y)!=0
   non-negatives ordered like unsigned integers
  Most all of IEEE should be put into hardware, for speed

 Rounding- extra bits: guard (high), round (low), sticky (overflow)
 FP Addition
  Put Biggest First
  Shift Smaller to align
  Add Fractions
  If Carry Out, increment Exponent
 FP Multiply
  Multiply Fractions
  Add Exponents
  Test for Overflow
  May Trap to Software for Denormals
 FP Divide: 2 laborious algorithms
  Newton's iteration
  Follow tangent of 1/x to intercept, iterating repeatedly
  Goldschmidt's Algorithm
   x=a; y=b; multiply both by r repeatedly; when y=1, x=a/b
  Remainder: may do in software
 Rounding Double Precision Answers to Single Precision
  Accurate if the number of digits is less than half (as in IEEE)
 Exceptions: Inexact (overflow); Invalid (sqrt(0), 0/0, etc.)

 Optimizing Integer Addition
  Carry Lookahead- shallower, wider combinatorial circuit
  Carry-Skip- Lookahead, but in smaller blocks chained together
   for each block, assume both carry-in possibilities; pick winners
              Time       Space
Ripple        O(n)       O(n)
CLA           O(ln n)    O(n ln n)
Carry-Skip    O(sqrt(n)) O(n)
Carry-Select  O(sqrt(n)) O(n)

 Optimizing Integer Multiply and Divide
  Shift Over Zeros; Carry-Save Adder avoids duplicating work;
  Higher-Radix; Many Adders; Division with CSA

 FP Underflow is Frequent
 Integer-FP Conversions are Frequent
 Memory Bandwidth can be bottleneck of FP Unit
 IEEE 0 and -0 are unequal

Hwang and Briggs

 Parallelism- Multiprogramming; Time-Sharing; Pipelining; Concurrency:
  Simultaneity; Vectorization; Arrays
 Multiplicity of Functional Units: Pipeline in CPU; I/O Processors;
  Hierarchical Memory; Balancing Bandwidths; Multiprogramming
 Speedup with n processors: log n to n/log n
 Flynn's Classification: SISD, MISD, SIMD, MIMD
 Feng: Bit-Slice vs. Word Parallelism: WSBS, WPBS, WSBP, WPBP
 Haendler: Classify with 
  CPUs x pipelinable CPUs, ALUs/CPU x pipelinable ALUs/CPU,
  word length x pipeline stages
 Applications: Weather, Oceanography, Astrophysics, Economics,
  Government Files, Engineering, Aerodynamics, AI, Remote Sensing,
  Seismology, Seismic Petroleum Exploration, Fusion and Fission
  Reactors, CAT, Genetics, Weapons, Chemistry, Physics, VLSI

Pipelining and Vector Processing
 Pipelines- Arithmetic, Instruction, Multiple Processors (rare)
  3 Axes: Unifunction/Multifunction, Static/Dynamic, Scalar/Vector
  Reservation Tables: Unit use vs. Time
  Interleaved Memory
   S-Access: Get an access from each of several modules at once
   C-Access: Activate a bank read from each of several modules on it
   C/S Access: Many C-Accesses at once
   Bandwidth: approaches maximum exponentially with probability that
    next access will be to sequentially-following address
 Types of Pipelines
  Instruction Pipelines: fetch new instruction each clock cycle; pass
   them down assembly line
  Arithmetic Pipelines
   CSA (Carry-Save Adder), as opposed to a Carry-Propagate Adder
   Tree of CSAs ending in CPA speeds multiplication
  Multifunction Pipelines
   Complex Example does FP Multiply, SQRT, Divide, Remainder
  Array Pipelines: matrix arithmetic
 Principles of Designing Pipeline Processors
  Instruction Prefetch and Branch Handling
   Buffer can get instructions from beyond potential branch
  Data Buffering and Busing Segments
  Internal Forwarding: if register holds same thing as memory, use it
   If first store result is immediately overwritten, don't perform it
  Hazard Detection and Resolution- Data Forward with registers; stall
   with memory
  Job Sequencing and Collision Prevention: Stall as little as possible
   Reservation Table derives: Forbidden Set of Latencies, Collision
   Vector, State Diagram, which in turn finds Cycles (including Simple
   Cycles and Greedy Cycles) which have an Average Latency each
   Use the Cycle with the Minimum Average Latency
   Multifunction Pipelines' Reservation Tables are overlays of each
    supported function
  Dynamic Pipelines and Reconfigurability- tricky design
 Vector Processors
  Several Operands produce a Vector or Scalar Result
  Instruction: Operation, Base Address, Increment, Offset, Length
  Masking Vector and Compress and Merge Operations provide control
  Optimized by: Enrich Vector Instruction Set; Group Scalar Operations
   Together; Suitable Vector Algorithms; Vectorized Compiler
  Multiple Vector Task Dispatching: bin-packing heuristics
 Pipelined Vector Processor Methods
  Horizontal, Vertical, 2-D Blocks

SIMD Array Processors
 SIMD Organization
  Control Unit handles Scalar Operations and controls multiple
   Processing Elements
  CU may control Alignment Network between PEs and Memory
  CU may control Interconnection Network, each PE has its own Memory,
   CU has Masking Register to turn PEs on and off
  Communication Axes: Operation Mode (Synchronous, Asynchronous,
   Combined); Control Strategy (Centralized usually, Distributed);
   Switching (Circuit, Packet); Topology (Static, Dynamic)
 Interconnection Networks
  Static: Linear, Ring, Star, Tree, Mesh, Systolic Array, Chordal
   Ring, Completely Connected, n-Cube, n-Cube Connected Cycle
    N Input Selectors: demultiplexer, N Output Selectors: multiplexer
    AKA Recirculating Networks: Low Connectivity (n log n)
   Multistage- Feedforward Stages
    Switch Boxes have 2 inputs and 2 outputs; may allow 2 or all 4
     interconnection states; n log n size
    Control: individual stage or individual box
    Blocking- connections may impair further connections
    Rearrangeable- enough routes can be found to enable all possible
     connections (example: Benes)
    Nonblocking- all possible connections possible without blocking
     (examples: Clos, Crossbar Switch)
    Examples: Mesh, Cube, Barrel-Shifter, Omega
 Parallel Algorithms
  Matrix Multiplication: do whole row at a time
  MergeSort in O(n) time
 Associative Array Processors
  Content-Addressible Memory in 2-D alignment
  Mask selects bits in Comparand; matches tagged in Indicator
  Bit Parallel- one cycle does everything (PEPE)
  Bit Serial- one bit slice at a time (STARAN)
  Distributed Logic- character fields (PEPE)
  Word-Organized- all bits are equal
  Algorithms: Maxima, Minima, Mean; Equal, Not-Equal, Similar,
   Proximate (within range); Less-Than, Greater-Than; Nearest-Below;
   In-Range (open or closed); Asceding/Descending Sort

SIMD Performance Enhancement
 Parallel Memory Organization- number of modules should be relatively
  prime to distance between successive accesses
 Array Processing Languages
 Multi-SIMD Machines not found yet

Multiprocessor Architecture and Programming
 MIMD Architecture
  Loosely Coupled (message-transfers in a Distributed System)
   Example: Cm* with Hierarchical Architecture
  Tightly Coupled (shared main memory)
   Examples: Crossbar Switch, Fully-Connected Bipartite
  Design Issues
   Process Recoverability; Efficient Context Switching; Large VM and
   Main Memory; Synchronization Primitives; Good Communication
   Instructions for: Procedure Linking; Looping; Parameter
   Manipulation; Multidimensional Arrays; Address Range-Checking
 Interconnection Networks
  Buses: Time-Shared; cheap, low capacity, catastrophic failure
   Priority; Round-Robin; Least-Recently-Used; Daisy Chain; FCFS;
   Polling; Independent Requests
  Crossbar Switch- high bandwidth, efficient, expensive
  Multiport Memory- all switches at memory units; high rate, very
  Multistage Networks- feedforward with 2 x 2 switches
   Banyan: Partial-Order in Distinct Levels
    Complete connection at O(n log n) size
   Delta Network: a^n x b^n has n stages of a x b Crossbar Modules
    High cost
  Parallel Memory Organizations
   Interleaving best when sharing is high
   L-M Organization: l C-Accesses with m modules each
    high bandwidth of a cache fairly cheap
   Cache Coherence a problem if caches are private
    Static Coherence Check: tag items as Shared or Private; only cache
    Dynamic Coherence Check: at write to shared cache, Write-Through
     and invalidate all other cached versions; global tables for
     Present and Modified info for each block by each processor
 Multiprocessor Operating Systems
  Master-Slave- one shall rule; slaves make requests; eliminates
   problems of shared tables; single point of failure
  Separate Supervisors- to each his own; sharing is in files; some
   tables are shared, some global; replicating code wastes memory;
   failure restart tricky
  Floating Supervisor- take turns; best load balancing; most
   complicated; table access conflicts a problem; may divide services
   between many supervisors
 Exploiting Multiprocessing for Parallelism
  Languge Features
   FORK starts child; JOIN waits for child to end
   cobegin... coend
   parfor handles linear row of array
   Critical Sections set off by: csect VAR do A
   Explicit Parallelism: user decides
   Implicit Parallelism: compiler finds
    Partitioning divides program into tasks
    Assignment hands them to processors

Lipovski and Malek

 Granularity: within instruction is fine; multi-instruction is coarse
  Direct (wires), Indirect (CPUs) or Switched (other hardware)

Computational Energy
 Unit of Computational Power: UCP
 Measure UCPs in Gates or Processors, but don't mix
  Passive Circuits, as in memory, may inflate #gates
 Efficiency: ratio of input bits to output bits
  Useful only in comparison of Equivalent systems
 Theorem: Efficiency of two Equivalent systems is inversely
  proportional to the ratio of input energies of the systems
 Theorem: if all UCPs are being used, Efficiency varies with unit
 First-Order Analysis: add Computation
 Second-Order Analysis: add Computation, Communication and Control
 Pipelining superior to Parallelism if units are specialized
  if units are the same, Parallelism is better
 Starvation Energy: unit unavailable for use
  Examples: the end of the pipeline when pipeline starts up; FP unit
   when no FP operations are being performed
  Use specialized unit if Starvation Energy is less than Energy of
   general unit computing it; obviously, depends upon frequency of use
 Inductive Architecture: scales up with speed, topology, energy all in
  ratio to increased size of system

Graphs and Switches
 Graph Definitions
  Permutation Network: allows all paths from one set to another (or
   itself); Nonblocking allows a whole permutation at once; Blocking
   requires some paths to break for others to be formed
  Circuit-Switched: able/disable flags create stable circuits for bulk
  Packet-Switched: register holds data and passes it on when possible
   Packet Train: several packets sent together
   Message-Switched: Packet Trains of variable length
  Active Node (nodes are UCPs) vs. Passive Node (nodes are wires)
  Simplex (one-way), Half-Duplex (either way), Duplex (both ways)
  Subgraph (some nodes, all their arcs); Partial "Spanning" Graph (all
   nodes, some arcs); Partial Subgraph (some nodes, some arcs)
  Elementary: no repeat nodes in path
  Delay: length of longest elementary path from source to sink
  Homomorphism: many-to-one, combining multiple units
  kth Correspondence Set: k steps away from a set of nodes
 Energy and Graphs
  Switch Energy: delay * power; add Control Energy to get Total Energy
  Theorem: given fixed elementary paths and probability of blocking,
   packet-switching uses no more energy than circuit-switching;
   nonblocked circuit-switching uses no more energy outside the switch
   than packet-switching with same elementary paths
  Theorem: in circuit-switching network with N UCPs, if delay>sqrt(N)
   then architecture is not inductive
  Blocking networks have higher energy due to unavailable resources
  Circuit-Switching is much faster than Packet-Switching, because it
   uses the Interconnection Network more efficiently;
   Circuit-Switching is better if resources outside the network are
   most expensive; Packet-Switching if Switching Energy is the main
   cost; Packet Trains can be the worst of both 
  Transitive Graphs are fast and flexible but use a lot of power; the
   Minimum-Power graph will always be intransitive

Weak Ordering:  transitive   reflexive
Partial Order:  transitive   reflexive   antisymmetric
Equivalence:    transitive   reflexive   symmetric
Hasse Diagram:  intransitive irreflexive asymmetric
Trans. Red.:    make it not  make it not
Trans. Clos:.   make it so   make it so

   Rings are Transitive Reductions and Homomorphic Reductions of
    Fully-Connected graphs; have delays O(n); work well if
    communication is localized to subsegments of the ring, or if each
    node can be useful while waiting (like stand-alone machines);
    lowest power consumption besides Trees
   Trees are Hasse Diagrams of in-degree 1; great for addressing;
    lowest possible power; congestion near root
  Indirect Communication (passing through a third arc) saves
   connections but causes congestion; Transitive Reduction a good
   candidate: Single Path from each source to each sink a tempting

Banyans and Their Kin
  Greatest Lower Bound (GLB): descendent of all lower-bounds
  Least Upper Bound (LUB): ancestor of all upper-bounds
  Lattice: a partial order, and any two nodes have a GLB and a LUB
   Mesh: a grid rotated 45 degrees; sqrt(N) delays; well-suited for
    some tasks, like those involving pixels
   Cube and various hyper-meshes reduce energies and delays; not
   Trees: lowest power; low energy; can be used as a bus
 Banyan: a Hasse Diagram of a partial order, with exactly one path
  from each source to each sink; fully-connect with sparse power
  Examples: Tree, Crossbar Switch, Bus, Omega Network, joined pair of
   Clos Networks
  Regular: in-degree is constant (besides sources) and out-degree is
   constant (besides sinks)
  k-Level: each path from source to sink is of length k
  Rectangular: same number of nodes at each level
  (i, o, l) Banyan: in-degree, out-degree, levels
  SW-Banyan: regular, k-level banyan such that any two nodes in the
   same level project to and are projected to by the same nodes
   Important Properties
    Set of all paths from a source is a tree
    Recursive Locality: 1-(o/i) of connections stay within block
   Examples: Omega Network, Crossbar Switch, Bus/Star, Bitonic Sort
  CC-Banyan: composition of fundamental graphs, each having i sources
   and o sinks
   Important Properties
    Inverse tree addressing
    Relative, distributed, Locality; Suited to applications where
     communication into sets distributed around a circle
   Examples: Barrel Switch, Crossbar Switch, Bus/Star
  Derivatives of Banyans
   Batcher Sort: many Bitonic Sorts
   Clos Network: pair of Rectangular SW-Banyans
  Banyan performs a "half" permutation; Clos a full permuatation
  Double-Ended Banyan (resources at sinks) vulnerable to a single
   failure; Single-Ended Banyan (resources at sources only) very


 Operating System
  Provides Clean Virtual Machine
  Manages Resources
  May run in Kernel Mode
 Multiprogramming- so CPU isn't wasted on I/O wait; Timesharing
 Processes- Process Table; Core Image; Shell; Child Processes; Signal;
  uid and gid
 Files- Directory (Root and Working); Path; Permission Bits; File
  Descriptor handle; Special Files (Block or Character); stdin,
  stdout, stderr; pipes
 System Calls- library procedure lays out parameters, and makes a Trap
  to the OS; Errors detected there; Results passed back
 Architectures: Monolithic; Layered (layers of permission); Virtual
  Machines (simulates the ugliness below); Client-Server (kernel just
  handles message-passing)

 Multiprogramming provides Pseudoparallelism
 Race Conditions: 2 processes access shared memory, and result depends
  upon which acts first
 Deadlock: 2 or more processes wait for something that only one of the
  other waiting processes can do
 Starvation: processes are executing, but one or more never get to
  some point in their execution
 Mutual Exclusion: no 2 processes in Critical Section at once; with no
  assumptions about speed or number of CPUs; no blocking by
  non-critical section; no process should wait forever to run Critical
 Poor Solutions: Disable Interrupts (too powerful); Lock Variables
  (race condition); Strict Alternation, TSL Instruction, and
  Peterson's Solution (busy waiting); Sleep and Wakeup (race
 Semaphores- generalize SLEEP and WAKE with decrement/increment of
  variable; Binary Semaphore a toggle; others count; be sure to
  decrement resource to be taken before taking it, increment resource
  to be added after adding it, take control only immediately before
  actually making the change; list of processes sleeping on each
 Event Counters- Condition Variables manipulated with Read(E) probes,
  Advance(E) increments, Await(E,v) compares; Event Counters cannot
  decrease; again, bookkeeping is changed before "greedy" event, after
  "generous" one Monitors- high-level language construct, rarely
  available; easy to program, because compiler does the work;
  condition variables can be WAITed upon, or SIGNALed
 Message Passing- SEND(dest, &message) and RECEIVE(source, &message);
  receiver may send Acknowledgement, messages numbered so repetition
  is not harmful; Domains eliminate machine name ambiguity;
  encrypted Authorization may be included; Mailboxes buffer messages
 Choice of Race-Condition Prevention Method: Semaphores and Event
  Counters low-level; Monitors rarely supported; Message Passing
  inefficient on one machine, the others can't work in distributed
  systems without shared memory
 Equivalence of Primitives
  Semaphores simulate entering and leaving Monitor; Semaphore and
   mailbox per process simulate Message Passing
  Monitors keep counter and list of processes to simulate Semaphore;
   counter for Condition Variables of Message Passing
  Messages can use a special Synchronization Process to simulate
  Semaphores and Monitors
 Classical Problems
  Producer-Consumer: one provides items, one takes them away
  Dining Philosophers: each of 5 thinks, then tries to get two forks
   and eat before thinking anew
  Readers and Writers: one writer, or multiple readers at a time
  Sleeping Barber: customers wait for barber, who sleeps when idle

 Process Scheduling
  Run-to-Completion, AKA Nonpreemptive Scheduling
  Round Robin- Quantum size main issue, about 100ms is right
  Priority Scheduling- highest priority runs; can be set by who user
   is, by least recent CPU time, pick a class and have Round Robin
   within it; priorities may decay
  Multiple Queues- several groups, with promotion and demotion
   possibly based on recent CPU time; may give longer quanta (doubling
   exponentially) to different queues
  Shortest-First- optimal response time if all jobs arrive at once;
   may estimate run length by past examples
  Guaranteed Scheduling- meet needs, demands and requests if possible
  Policy vs. Mechanism- user or parent process may set priority/need
  Two-Level Scheduling- to get swapped into memory, and to run

Memory Management
 Monoprogramming is for PCs only
 Multiprogramming seeks right Quantum size
 CPU Utilization- I/O wait reduced by switching to other processes
  Utilization = 1 - p^n
   p is probability of I/O wait; n is Degree of Multiprogramming
 Fixed Partitions- one queue, or several for different size
  partitions; keep at least one small one around
 Relocation and Protection- Base and Limit hardware registers to keep
  bounds on access
  Variable Partitions- Fixed can't handle growing
  Compaction can eliminate Fragmentation, fast with special hardware
  If processes tend to grow, maybe allocate extra initially
  If processes have two segments that can grow, have them grow towards
   each other
   Bit Map- smaller unit: less waste, bigger map; simple; search for
    free space slow
   Linked List, in address order- find hole with First Fit, better
    than Next, Best and Worst Fits; Quick Fit has a list of holes,
    quick to find, but slow to merge
   Buddy System- all blocks are powers of two, list kept; quick to fit
    and quick to merge; Internal and External Fragmentation
  Some systems take disk space from processes in memory
  Fifty-Percent Rule: about one hole per process
  Unused Memory Rule: fraction = k/(k+2), k is ave. hole/ave. process

 Virtual Memory: if Page Fault occurs, get Page from Disk
  Paging- a Page is a virtual thing; Page Frame is in Memory
   Page Table- entry for each Page, in order
    Present/Absent Bit; Page Frame for the Page, if P/A Bit is set;
     Modified/Referenced bits; Caching-Disable for memory-mapped I/O;
     Protection/Locking flags
    Can be very large; mapping must be very fast
    Multilevel- k Level Table has entries pointing to k+1 Level
     Tables, only some of which need actually exist; allows
     huge Address Space with limited storage, but requires multiple
     memory accesses
    Translation-Lookaside Buffer holds common Page Table Entries in
     Associative Memory/Cache; TLB checked first, before table
    Zero-Level Paging: all TLB; trap on a miss
    Inverted Page Table: entry per Page Frame; always in Associative
     Memory; more overhead on miss, but saves space with huge Page
     Table for huge Address Space
  Page-Replacement Algorithms
   Optimal: dispose of longest-until-needed Page; impossible, but
    simulated for experimentation
   Not-Recently-Used: Reference and Modified Bits (former cleared by
    ticks) define 4 classes; pick one from oldest class
   FIFO- may well throw out useful pages; not used
   Second Chance/Clock: first pass clears bit; kills if bit is unset
   Least-Recently-Used- too much overhead to fully implement
    Aging can shift right and set leftmost bit to 1; kill lowest
    Hardware can maintain N x N matrix of usage
  Modelling Page Replacement Algorithms
   Reference String: sequence of page references
   Distance String: time since last use of each page, or infinity
    Flat distribution means no locality, and Page Faults plentiful
   Stack Algorithms: top of memory at time T+1 same if you add another
    Page Frame; FIFO not so
    Belady's Anomaly: more memory can cause more faults without a
     Stack Algorithm
   Faults: F = SUMk[m+1...n](Ck) + Cinfinity, m Page Frames, n Pages
  Design Issues for Paging Systems
   Thrashing: incessant Page Faults; Locality helps prevent it
   Demand Paging: get a Page when and if you need it
   Working Set Model- Prepaging when swapping in reloads the pages in
    use last time; 1 in high bits of Aging give Working Set
    WSCLOCK uses Working Set non-membership to swap out
   Global Page Allocation makes more sense than local, guaranteeing
    each process an equal number of page frames; Page Fault Frequency
    algorithm balances each process's share to minimize Page Faults
   Page Size- small wastes less, especially for process with small
    requirements, but needs large table; transfer is mainly disk
    rotation so small pages are nearly as slow to load as large ones
    Optimal for process size s and table entry size e: sqrt(2*s*e)
  Implementation Issues for Paging Systems
   Instruction Backup after Fault- hardware or software must save PC
   Locking Pages in Memory- Memory-Mapped I/O
   Shared Pages- especially program text
   Backing Store- Process Table points to disk space; to deal with
    growth, may allocate page-by-page or separate segments
   Paging Daemons- periodically free up some page frames
   Page Fault Handling: hardware saves PC and traps to kernel; kernel
    saves registers and calls OS; find victim if necessary, and
    swap victim out if dirty; context switch to other processes during
    transfers; swap in and mark Page Table; back up faulting
    instruction; schedule faulting process; kernel restores volatiles

  Segmentation- many Address Spaces, distinguished by purpose, visible
   to programmer; size variable and flexible; facilitates sharing and
   protection; Checkerboarding fragmentation may be cured by
   Compaction; may have Segmentation with Paging

File Systems: large, permanent storage of information
 Naming- length, case-sensitivity, file extensions (.suffix)
 Structure- sequence of Bytes, fixed-length Record Fields, Tree of
  records searchable by a key
 File Types- ASCII or Binary, Directory, Special: Block or Character
 Sequential Access or Random Access
 Attributes- owner, size, age, last change, creator, protection, etc.
 Memory-Mapped Files: a copy in Main Memory instead of on disk; if
  evicted by Page-Replacement, send the data to a real file; meshes
  well with Segmentation; Problems: finding end of write, coherence,
  size of file
 Directories: a table of entries; Name-Attributes or Name-Pointer (to
  Attributes); Hierarchy or DAG (trickier for deletion); Path Name,
  Relative to Working Directory or Absolute
 File Implementation
  Contiguous- easy bookkeeping; fast access; must know maximum size;
   much Fragmentation
  Linked List- pointer to next block (ruins power-of-2 size); slow
  Linked List with Index- table in Memory holds pointers to next
   block; faster, because pointers aren't on disk; table can be large
  I-Nodes- each file has its own fixed-length table for blocks; if
   more slots are needed, Indirect Blocks (single, double, etc.)
  Directories- just a list, attributes and blocks as above, via
   pointers or I-Nodes
  Shared Files- Links make DAGs of the directory; may be a pointer to
   the file's block information, or a Symbolic Link: just a pathname
  Disk Space Management- because Contiguous is a bad idea
   Block Size- small is less wasteful but slower with larger table;
    512b, 1K or 2K common choices
   Free Blocks recorded in Linked List on disk or the more-concise Bit
    Map in memory
   Disk Quotas- in UNIX, soft and hard quotas for each user
  Reliability- bugs can happen in factory or field
   Bad Blocks- keep a list or dummy file containing them
   Backup with Incremental Dumps, tag Archive Bit, cleared with use
   Consistency- check at boot; each block should be in some list
  Performance Enhancements
   Block Cache in memory; needs to be written back often; Write
    Through for removable media
   Position associated blocks close together, especially I-Nodes

 Security- disasters, malfunctions, human errors, Trojan Horses,
  Worms, Viruses, Trapdoors; need Authentication and to seek and
  remove holes in security; have secure backup of data

  Domain: a set of (Object, Rights) pairs
  Access Control Lists: for each object, who has what rights
   RWX bits an offshoot
  Capabilities: for each user/process, what Objects they can
   manipulate, and with what Rights; Generic Rights exist; Rights
   Amplification lets system calls do more
   Preventing Capability-List Tampering: hardware provided Kernel Bit
    for all Memory Locations; keep Capabilities List in OS; encrypt it
   Type Manager Module may define Rights
   Revoke Access by having them point to an indirect object which can
    be moved, or changing a long Object password, implicitly invoked
    with each access
  Protection Models- in model, seek paths to Unauthorized States
  Covert Channels- vary CPU or device usage; sneak data into output

 Device Controller the hardware; Device Driver the software
 I/O Channel: optional additional bus
 OS gives Controller minimal data in parameters; Controller returns
  preamble, sector data, checksum
 Memory-Mapped I/O: modest number of Controller registers mapped
  directly to Main Memory
 Direct-Memory Access: controller stores data in its own buffer, and
  writes the buffer to memory; time lag in reads leads to Interleaved
  blocks on disk (0, 4, 1, 5, 2, 6, 3, 7) so rotate coincides with
  read time
 I/O Software Principles
  Information Hiding- Device-Independence and Uniform Naming;
   Error-Handling as low-level as possible
  Mounting: giving a pathname to a disk or other device
  Synchronous (Blocking) vs. Asynchronous (Interrupt-Driven)
    Synchronous easier to program, but Asynchronous more efficient, so
     Asynchronous implementations that look Synchronous are ideal
  Sharable (disks) vs. Dedicated (printers) Devices
 Four-Level I/O Software
  Interrupt Handlers- hide deep, maybe block with DOWN on a Semaphore,
   WAIT on a Condition Variable or RECEIVE on a message; interrupt
   procedure must tag back
  Device Drivers- device-dependent code; abstract and concise command
   converted to arcane and lengthy commands to Controller; may wait
   for device to finish (drive) or not (screen)
  Device-Independent I/O Software- uniform interface and naming;
   Protection; Buffering; Mutual Exclusion; Error codes; May shunt
   some to Device Driver for efficiency
  User-Space I/O Software- stdio-type libraries and stand-alone
   applications; Spooling and Spool Daemons

 Disks- physical Sectors and abstract Blocks; may have RAM disks
  Arm Scheduling- First-Come, First-Served; Shortest Seek First;
   Elevator Algorithm (best)
  Track-at-a-Time Caching- Driver can handle for CPU reads or
   Controller for DMA reads
 Clocks- slow ones tied to AC cycle or fast ones in crystal
  Tick- interrupt that maintains Time of Day, Preemptive Scheduling,
   Profiling, CPU Usage accounting
  Watchdog Timer: shuts off disk motor or gives up on network wait
 Terminals- interface hides details
  RS-232 passes changes back and forth
  Memory-Mapped Interface (character or bit) represents whole screen
  Input comes in Raw (character) or Cooked (line); Echoed to screen
  Output- cursor motion and escape sequences

Deadlock: 2 or more processes wait for an event that only one of the
 other waiting processes can cause
 Resources may be Preemptable or Nonpreemptable; process needing a
  Nonpreemptable Resource may Block or get error code and fail
 Deadlock Conditions- need all 4 for Resource-Related Deadlock
  Mutual Exclusion: at most one process can hold resource at a time
  Hold and Wait: processes holding resources can request more
  No Preemption: resources cannot forcibly be taken from a process
  Circular Wait: a circular chain of processes; each awaits the next
 Graph Models
  Process holds Resource: R -> P
  Process requests Resource: P -> R
  Cycle indicates Deadlock
 Dealing with Deadlock
  Ostrich Method: ignore problem and hope it doesn't happen
   avoids limiting functionality
  Detection and Recovery
   With one Resource of each kind: find cycle in graph
    With Multiple Resources
     Matrices C (current allocation), A (available), R (requests)
     Look for process, Pi, such that the ith row of R < ith row of A
     If found, {add ith row of C to A; mark process i; goto LOOP}
     All unmarked processes, if any, are Deadlocked
   Recovery- Preemption (rob Peter to pay Paul); Rollback in time via
    Checkpointing; Kill processes until Deadlock is gone
  Avoidance: avoid Deadlock with clever scheduling
   Resource Trajectories- some deadends in 2-D grid of
    process-resource states
   Safe: there is a way out; Unsafe: cannot avoid Deadlock
   Banker's Algorithm: grant request if it leads to a Safe state;
    else, postpone request
    Requires that maximum needs are known in advance, so irrelevant 
  Prevention: eliminate one of the 4 conditions
   Mutual Exclusion- never-past-capacity Spool, if possible
   Hold and Wait- rarely possible; release & re-request may help
   No Preemption- unacceptable for some devices (printers)
   Circular Wait- strict priorities to processes
 Other Issues
  Two-Phase Locking: take all resources before giving any back
  Non-Resource Deadlocks- mistake with Semaphore, for instance
  Starvation- without proper scheduling, some process may never finish

Introduction to Distributed Systems
 Advantages over Centralized System: Cost/Performance, Speed,
  Reliability, Growth, Inherent Distribution for some tasks
 Advantages over Isolated Multiple Units: Data Sharing, Device
  Sharing, Communication, Flexibility in where work is done
 Disadvantages: Software & Networking difficult; Security
 Flynn: SISD, MISD (never), SIMD, MIMD
  MIMD: Tightly-Coupled (Multiprocessors, Shared Memory) or
   Loosely-Coupled (Multicomputers, Private Memory)
   Bus-Based Multiprocessors- Cache Coherence via Write-Through or
    (better) Snooping; 32-64 CPUs at most
   Switched Multiprocessors- Crossbar Switch or Omega Network; NUMA =
    Non-Uniform Memory Access lets local memory to work faster
   Bus-Based Multicomputers- CPUs communicate; less traffic than
    Shared Memory
   Switched Multicomputers- Grid or n-Cube; Topology should suit task
 Software- Tightly-Coupled- subtasks; Loosely-Coupled- separate tasks
  Loosely-Coupled Software & Loosely-Coupled Hardware
   NFS; Stateless Servers tempting
  Tightly-Coupled Software & Loosely-Coupled Hardware
   true Distributed Systems; Single-System Image; Global, Uniform
    Communication; Consistent OS Kernel, System Calls and Protection
  Tightly-Coupled Software & Tightly-Coupled Hardware
   Multiprocessor Timesharing Systems; seem like general-purpose SISD
    machines; single run queue; shared Main Memory; private caches
 Design Issues
  Transparency- Location, Migration, Replication, Concurrency,
  Flexibility- Microkernel (Interprocess Communication, Memory
   Management, Scheduling, I/O) shunts rest to User Space
  Reliability- Fault Tolerance; 1 failure or n wreck system?;
   Coherence; Security; Clean Transmission
  Performance- Large Grain Size deemphasizes Overhead; in conflict
   with Reliability
  Scalability- centralized units, data or routing begs massive
   failure; total Synchronization unlikely

Communication in Distributed Systems
 Layered Protocols- OSI standard; add a header at each protocol;
  layers added by Sender and peeled by Receiver
  Physical (bits); Data Link (fixed-length frames; Sender-Receiver
  bicker; Checksum at end); Network (routing and hops; X.25 or IP);
  Transport (message to packets; TCP); Session, Presentation and
  Application rarely supported

 Client-Server Model: Clients use service Servers provide; less
  overhead on simple, reliable LANs
  Physical and Data Link levels, simple Request-Reply for Messages
  SEND(dest, &message); RECEIVE(source, &message)
  Headers define Sizes, Operations, Error Codes, Fields of Messages
  Addressing- machine ASCII name or number; big random number
  Blocking (CPU idle; still best); Nonblocking with copy (slow);
   Nonblocking with Interrupt (hard to program)
  Unbuffered (may lose a message); Temporary-Buffered (still may lose
   a message); Mailboxes
  Reliability: how many Acknowledgments and when

 Remote Procedure Call- gets away from I/O-style communication
  Marshall parameters into a Message; Kernel sends Message to Server;
   Timer starts; Server Kernel unpacks parameters and executes call;
   Result is returned
  Parameter Passing- if representations are not uniform (Big Endian
   vs. Little Endian), conversion must be done, to standard or host
   format; Pointers: not supported or whole array passed, or as much
   as is necessary (effectively Copy/Restore passing)
  Dynamic Binding: Client finds a suitable Server, with Name, Version,
   Handle (system-specific), Unique ID (sparse process identifier)
   agreed upon
  RPC Semantics in the Presence of Failures
   Client cannot locate Server- exception via Signal Handler
   Lost Request Messages- wait and resend
   Lost Replay Messages- if Message is not Idempotent (capable of
    re-transmissions without error), add identification number
   Server Crashes- if the RPC call caused the crash, it would be nice
    to know that; Exactly-Once semantics ideal; At-Most-Once or
    At-Least-Once implementable
   Client Crashes- Orphan processes left behind; may go back after
    recovery and kill them all; have Epoch associated with each
    re-boot, and kill processes from prior Epochs; Server may kill
    processes if Epoch Broadcast is heard; limited amount of time may
    be granted to each process in the first place, renewable
  RPC Implementation Issues
   Protocol, of so many possible ones
    Connection (large network) or Connectionless (small, reliable)
    Standard (free) or Custom (specialized) Protocol
    Packet Size?
   Acknowledgements- Stop-and-Wait vs. Blast; Selective Repeat asks
    for resend only of what is lost; Overrun Errors the major hazard
   Critical Path- some cost varies with Message size; others constant
   Copying- in worst case, 8 copies necessary; Scatter-Gather chip can
    concatenate two buffers; with VM, just changing the pointers may
    do a copy; headers a big part of the overhead here
   Timer Management- precision not too important; periodic Sweep of
    Process Table's Timer Fields good answer
  Problem Areas
   Transparency with Global variables and large parameters; may have
    to set maximum size; often only One-Way Pass is needed
   Pipelines may translate poorly to Client-Server model
   User I/O with Remote Process causes much traffic and overhead

 Group Communication
  Multicasting (to group address); Broadcasting (to all machines;
   recipient decides whether or not to look at it); Unicasting (to
   each machine in group, one-by-one)
  Design Issues, beyond those for ordinary communication
   Closed (only members can send; parallel processing) vs. Open
    (replicated servers)
   Peer (symmetric; no single point of failure; complex control)
    vs. Hierarchical (subject to failure; matches task-decomposition)
   Membership- in Server (easy; single point of failure) or
    Distributed (each machine has own copy of list)
    Member crash- hard for rest of network to know about
    Synchronization of membership changes and message receipt
    May so many machines go down that the group can't function?
   Addressing- Group ID; Explicit List; Predicate
   Send & Receive Primitives- for RPC, may separate Group and Single
    primitives, since n replies for a procedure call is strange;
    simple to merge the two primitive sets
   Atomicity: all receive a message or none do; makes system easy to
    program; can be implmented by requiring universal Reply
   Message Ordering- difficult to have Global Time Ordering;
    Consistent Time Ordering (Loosely Synchronous; either A or B is
    seen first by all recipients) or Virtually Synchronous (A and B
    are ordered only if it matters)
     ABCAST - Loosely Synchronous: send M1 with timestamp; get Replies
      with timestamps; send COMMIT; all timestamps increasing
     CBCAST- Virtually Synchronous: a vector of group machines is kept
      in each machine, telling when last it has heard from the others;
      the vector is sent out with each message; messages are buffered
      if the Sender is more "experienced" than the Receiver
   Overlapping Groups- issue is raised that two processes may get
    Messages in different orders; hard problem to root out, and it may
    not be worth the trouble
   Scalability- with loops, must keep packets from circling and
    multiplying endlessly; single points of failure very hazardous;
    traffic congestion at centralized resource can be very bad;
    synchrony too hard to attempt

Synchronization in Distributed Systems
 Clock Synchronization
  Logical Clocks: only sequence matters
   Lamport's Algorithm: Clock = MAX(Messages, Clock) + 1
  Physical Clocks: accurate within some bound
   Cristian's Algorithm: query central Time Server periodically;
    assume lag of 1/2 Reply-Time; adjust by changing milliseconds/tick
    from 10 to 9 or 11
   Berkeley Algorithm: Server daemon queries all machines; tells them
    to get in line with average
   Averaging Algorithms: each machine queries the others; gets in line
    with average
   Multiple External Time Sources- each server broadcasts a range
    bounding the "correct" time; delays compensated for; outliers
    thrown out; midpoint of intersection is The Time

 Mutual Exclusion in Distributed Systems
  Centralized: Coordinator gives permission to just one at a time;
   Request, Reply, Confirmation; single point-of-failure; bottleneck;
   dead Coordinator may resemble a busy one, and cause Blocking
  Distributed: send request to everybody; unanimous approval (none
   using or have dibs) required; n points-of-failure; unused!
  Token-Ring: Token of permission is passed around; use it or pass it;
   considerable traffic; detecting a dead process can lose Token, so
   we may require announcing Token receipt

 Election Algorithms
  Bully: all processes ordered; Coordinator seems dead; some process
   declares ELECTION to its superiors; higher living process holds
   ELECTION itself; if no Reply, the Bully wins and announces that
  Ring: ELECTION goes in a ring, like a chain letter; highest number
   on list is the winner; second lap announces winner; quits

 Atomic Transactions: higher-level abstraction for Fault Tolerance;
  two processes work together and finally COMMIT work to Stable
  Storage; if problem occurs, return to prior state (redundant disks);
  BEGIN, END, ABORT, READ, WRITE; Serializable (Concurrent
  Transactions do not interfere with each other); Atomicity
  (Transaction is indivisible); Permanence (what's done is done)
  Nested Transactions- sub-Transactions require nested copies of world
  Storage Options
   Private Workspace: pointers to parent blocks used for READ; local
    copies made once WRITE starts; only at COMMIT is change to parent
    blocks made
   Writeahead Log: don't actually change data; record old value and
    change you want to make; at COMMIT, make changes; at ABORT,
    Rollback changes
  Two-Phase Commit Protocol: Transaction Coordinator polls the lackies
   and if a unanimous response comes in, a COMMIT is sent out;
   decisions are written to logs every step of the way, saving state
   in the face of multiple crashes

 Concurrency Control Algorithms
  Locking- recorded centrally or locally; multiple readers, one
   writer; Granularity- small: more precise, more parallelism, more
   locks, more expensive, more likely to deadlock
   Two-Phase Locking: acquire all resources; COMMIT; write; release;
    next process steps up; does not eliminate Deadlock
  Optimistic Concurrency Control- act freely and worry about problems
   if they come up; works great with Private Workspaces; busy system
   can cause huge overhead
  Timestamps: if Transaction is older than last access of file, ABORT;
   Deadlock impossible
 Deadlocks in Distributed Systems- very nasty; usually just 2
  processes involved; usually Resource-Oriented
  Ostrich: ignore problem; fine solution here
  Avoidance: never attempted in Distributed Systems
  Detection- ABORTing a transaction here easier to recover from than
   KILLing a process in Single-Processor systems
   Centralized- Coordinator gets reports immediately, periodically, or
    on its request; graph reveals Deadlock
    False Deadlock can result from incomplete or late Replies; with
     Timestamps, Coordinator can poll each machine and make sure that
     no older messages that can resolve Deadlock are coming
   Distributed- blocked process sends a probe forward; if it returns
    home, highest process number along the way is killed
   Wait-Die: older can wait for younger; younger killed if it tries to
    wait for older
   Wound-Wait: older can ABORT younger's transaction and Preempt the
    resource; younger must wait for older

Processes and Processors in Distributed Systems
 Threads AKA Lightweight Processes
  Separate: PC, Stack, Child Threads, Registers, State
  Share: Address Space, Global Variables, Open Files, Child Processes,
   Signals, Semaphores, Accounting
  Use: allow a Process to continue parallel work in the face of
   Blocking System Calls; Threads in 1 Process may differ in role
    Dispatcher/Worker, Team, Pipeline
  Design Issues
   Static (permanent within Process) vs. Dynamic (exit or be killed)
   Shared Data- Mutexes (Binary Semaphore) for short-term locking of
    Critical Section; Condition Variables for long-term resource waits
   Global Variables tricky; maybe make local copies and dash coherence
   Scheduling Algorithm- maybe let user pick
   User Space- may be no Kernel Implmentation; faster than Trapping to
    Kernel; Customize Scheduling; Table & Stack space open-ended
   Kernel Space- Blocking tricky to implement (Jacket around System
    Call); Thread may keep CPU forever; Blocking is the whole point of
    Threads, and that requires a Trap, anyway
   Threads may mutually interfere via Shared Memory (like Copy
    Buffers); Signals (like F4 in X)
  Threads and RPC- with Local (same-machine) calls, just shift
   pointers to data and save on Marshalling; put incoming Parameters
   right into the Server Thread's Address Space

 System Models
  Workstation Model- continuum of how much to have on local disks;
   less: cheap, uniformity & transparency, upgrade easy; much traffic;
   no coherency problem
  Idle Workstations- Server can advertise or Clients seek; may need to
   interact re: File data, I/O, Time, etc.
   When Owner returns, process may: Continue; Migrate; Die
  Processor Pool- advantage of queue service; response time
   T = 1/(potential/sec-demand/sec); 1 n-speed unit better RT than n
   1-speed units, by factor of n; Reliable & Fault-Tolerant; uniform
   response; advantage over Workstation Model only for heavy workload
 Processor Allocation
  Design Issues
   Migratory (flexible) vs. Nonmigratory (simple)
   Goal: Response Time or Response Ratio
   Deterministic, Optimal vs. Heuristic, Good
   Centralized vs. Distributed
   Local vs. Global information in decision
   Locstion- always Global; Sender vs. Receiver to initiate
  Implementation Issues
   Difficult to Measure Load accurately
   Added Overhead and Complexity hard to Recoup
   Stability- don't want processors to play tag with a process
  Allocation Algorithms
   Graph Theory: Processors are Nodes; Traffic is Arcs; Find better
    Partition; Require Complete Information in Advance
   Centralized Up-Down: Give Priority to Needy; Take from Hogs
   Hierarchical: Bosses and Sub-Bosses; Complex; Information may be
   Distributed Heuristic: Probe several machines, and run process
    elsewhere if a suitable Server is found; works well
   Bidding: Economic Inspiration
 Scheduling in Distributed Systems
  Local: no complications
  Co-scheduling: have related, communicating processes run
   simultaneously; same quanta slices across many machines

Distributed File Systems
 File Service: the interface for programmers and users
  As in Single-Processor Systems- Attributes; Protection
  Immutable Files- guarantees Coherence at price of utility
  Basic Models- choice depends upon workload
   Upload-Download: Client gets whole file from Server at once
   Remote-Access: each READ/WRITE requested individually
  Directory Service Interface- Hierarchical Tree vs. General Graph
   Transparency- view from all machines the same?
   Two-Level Naming- sometimes Symbolic and Binary (System use) Names
    Binary Name can hold Location of File, or a Symbolic Link to that
     Location, Protection, etc.
  Semantics of File Sharing
   UNIX: every operation immediately visible to all
   Session: changes visible to others only when file is closed
   Immutable Files: no changes at all; make new copies
   Transactions: changes are all or nothing
   Usage Trends: Most Files <10K; READs outnumber WRITEs; Sequential
    Access more than Random; Short File Lifespans; Sharing Rare; Most
    Processes access few Files; File Classes useful
   Servers and Clients: Same (general) or Different (optimized)
   Directories and Files at Same Server (less traffic) or Different
   Name Caching: remember where Files were before
   Path Lookup- Iterative (Client does work) or Automatic (Servers are
   States or Stateless?
    States- shorter Requests; better performance; readahead possible;
     idempotency easier; File Locking possible
    Stateless- Fault-Tolerance; no OPEN/CLOSE calls; saves Server
     space; no limit on how many Open Files; Client crash no problem
    Server Memory vs. Client Disk vs. Client Memory
     Client Memory- Process Memory vs. Kernel Memory vs. Cache Manager
    Unit- File (fast) vs. Block (saves space)
    Key Parameters- Size, Number, Locality of Accesses; Hardware Speed
   Cache Consistency
    Immutable Files (like Text Segments) eliminates problem
    Write-Through (saves READ traffic only)
    Delayed Write (faster, but may confuse semantics)
    Write-on-Close (Session Semantics)
    Centralized Control (UNIX Semantics; not robust; scales poorly)
     Client-Driven or Server's Unsolicited Message announces change
   Replication- adds Reliability; eases Server bottleneck; should be
    Implementation Options
     Explicit (multiple copies & lookups); maybe Group Primitives
     Lazy (Server copies on its own time)
    Update Protocols- problem if crash after only some WRITEs done
     Primary Server: head honcho (single point-of-failure)
     Voting: majority rules; with Ghosts: whoever's alive takes part
 Trends in Distributed Systems- WORM; Fiber Optics; Scalability;
  Wide-Area Networks; mobile users; Fault-Tolerance
 Lessons Learned
  Use Workstation power to save Server; Cache where possible; Exploit
  Usage properties; Minimize Centrality, perhaps with Hierarchy; Blast
  Transfers when possible