15CS72_ACA_Module3_chapter1_finalnotes
In a bus-based commercial multiprocessor system, many processors may request access to the bus at the same time, but only one of them is granted access at a time. The effective bandwidth available to each processor is inversely proportional to the number of processors contending for the bus. Hence a bus-based multiprocessor system should be small, with about 4 to 16 processors.
Functional Modules
The various functional modules are described below:
Arbiter: It accepts bus requests from the requester module and grants control of the
DTB to one requester at a time.
Bus Timer: It measures the time each data transfer takes on the DTB and terminates
the DTB cycle if a transfer takes too long.
Interrupter Module: It generates an interrupt request along with its status/ID information.
Location Monitor: It monitors data transfers over the DTB.
Power Monitor: It watches the status of the power source and signals when power
becomes unstable.
System Clock Driver: It provides a clock timing signal on the utility bus.
Physical Limitations
1. Due to electrical, mechanical, and packaging limitations, only a limited number of
boards can be plugged into a single backplane bus.
2. The bus system is difficult to scale and mainly limited by contention.
3. Multiple backplane buses can be mounted on the same backplane chassis.
5.1.2 Addressing and Timing Protocols
A bus system has two types of devices attached to it. Active devices such as processors can act as bus masters or as slaves at different times. Passive devices such as memories can act only as slaves. Only one master can control the bus at a time.
Bus Addressing
Every bus has an internal clock with a fixed bus cycle time, and the parallel lines of different buses can be used for data transfers in the same bus cycle. Not all bus cycles are used for data transfer. To optimize performance, the bus should be designed to minimize the time required for request handling, arbitration, addressing, and interrupts, so that most bus cycles are used for useful data transfer operations.
Synchronous Timing
All bus transactions take place over fixed-length clock intervals. The master uses a data-ready pulse to initiate a transfer, and the slave uses a data-accept pulse to signal completion of the information transfer. A synchronous bus is simple to control, requires less control circuitry, and thus costs less. It is suitable for connecting devices having relatively the same speed, as shown in the figure given below.
Asynchronous Timing
It is implemented using a handshaking or interlocking mechanism, and bus transactions take place over variable-length clock intervals. The rising edge (1) of the data-ready signal from the master triggers the rising edge (2) of the data-accept signal from the slave. The second signal triggers the falling edge (3) of the data-ready clock and the removal of data from the bus. The third signal triggers the trailing edge (4) of the data-accept clock. This four-edge handshaking process is repeated until all the data are transferred.
Because of the variable-length clock signals, fast and slow devices can be connected on the same bus, and the bus is less prone to noise. Overall, an asynchronous bus offers better application flexibility at the expense of increased complexity and cost.
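The four-edge sequence can be traced in a short simulation. This is a minimal sketch of the handshake described above, assuming a simple in-order transfer of words; the names and trace format are illustrative, not real bus hardware.

# Minimal sketch of the four-edge asynchronous handshake described above.
def asynchronous_transfer(words):
    trace = []                  # records the four edges of each cycle
    received = []
    bus = None                  # shared data lines
    for word in words:
        bus = word                              # master places data on the bus
        trace.append("1: data-ready rises")     # master asserts data-ready
        trace.append("2: data-accept rises")    # slave latches the data
        received.append(bus)
        bus = None                              # master removes the data
        trace.append("3: data-ready falls")
        trace.append("4: data-accept falls")    # cycle complete; repeat
    return received, trace

data, edges = asynchronous_transfer([1, 2])
print(data)        # [1, 2]
print(edges[:4])   # the four edges of the first handshake cycle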
Central Arbitration
1. It uses a central arbiter. The bus-grant signal is propagated from the first master (slot 1) to the last master (slot n).
2. Each master can send a bus request. However, all requests share the same bus-request line. As shown in the figure, a bus request triggers the rise of the bus-grant level, which in turn raises the bus-busy level.
3. The bus-grant signal propagates serially through each master until it encounters the first one that is requesting access to the bus. This master blocks further propagation of the bus-grant signal, activates the bus-busy line, and gains control of the bus (a sketch of this propagation follows the list).
4. Fixed priority is set from left to right such that the lower priority devices are at the
right. Thus lower priority device on right gets the bus control only when the
devices on the left do not request bus control.
5. When the bus transaction is complete, the bus-busy level is lowered, which
triggers the falling of the bus grant signal and the subsequent rising of the
bus-request signal. The advantage of this arbitration scheme is its simplicity.
6. The disadvantage is its fixed-priority sequence, which violates fairness. Another drawback is the slowness of propagating the bus-grant signal along the daisy chain.
7. Whenever a higher-priority device fails, all the lower-priority devices on the right
of the daisy chain cannot use the bus.
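As a rough behavioral sketch of points 3 and 4, assuming slot 0 is the leftmost (highest-priority) master, grant propagation can be modeled as a scan for the first requester; the function below is illustrative, not a model of the actual signal timing.

def daisy_chain_arbitrate(requesting):
    # requesting[i] is True if the master in slot i asserts bus-request.
    # The bus-grant signal enters at slot 0 and stops at the first requester.
    for slot, wants_bus in enumerate(requesting):
        if wants_bus:
            return slot           # this master blocks further grant propagation
    return None                   # no master requested the bus

# Slots 1 and 3 both request: the leftmost (higher-priority) one wins, so the
# device in slot 3 gets the bus only when no device to its left is requesting.
print(daisy_chain_arbitrate([False, True, False, True]))    # 1
print(daisy_chain_arbitrate([False, False, False, True]))   # 3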
Independent Requests and Grants: Instead of using shared request and grant lines as in the previous figure, multiple bus-request and bus-grant signal lines can be provided independently for each potential master, as shown in the figure; the total number of signal lines required is then larger. The arbitration among potential masters is still carried out by a central arbiter, but any priority-based or fairness-based bus allocation policy can be implemented. The advantages are flexibility and a faster arbitration time compared with the daisy-chained policy. The drawback is the large number of arbitration lines required.
Distributed Arbitration
1. Each potential master is equipped with its own arbiter and a unique arbitration
number.
2. When two or more devices compete for the bus, the winner is the one whose
arbitration number is the largest.
3. All potential masters can send their arbitration numbers to the shared bus-request/grant (SBRG) lines on the arbitration bus.
4. Each arbiter compares the resulting number on the SBRG lines with its own
arbitration number. If the SBRG number is greater, the requester is dismissed. At
the end, the winner’s arbitration number remains on the arbitration bus. After the
current bus transaction is completed, the winner seizes control of the bus.
5. This scheme is shown in the figure given below; a bit-serial sketch of the competition follows.
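The text states only that the largest arbitration number wins. One common realization on shared SBRG lines, assumed here for illustration, is a bit-serial competition from the most significant bit down: every contender drives its bit, the line carries their wired-OR, and a contender that sees a 1 while holding a 0 withdraws.

def sbrg_arbitrate(numbers, width=8):
    # All competing arbiters drive their arbitration numbers onto the lines;
    # at each bit position (MSB first) the line carries the wired-OR.
    contenders = set(numbers)
    for bit in reversed(range(width)):
        line = any((n >> bit) & 1 for n in contenders)   # wired-OR of this bit
        if line:
            # Arbiters whose own bit is 0 see a greater SBRG value and withdraw.
            contenders = {n for n in contenders if (n >> bit) & 1}
    return contenders.pop() if contenders else None

print(sbrg_arbitrate([5, 9, 3]))    # 9: the largest arbitration number wins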
Transaction Modes: There are three different data transfer modes
1. An address-only mode consists of an address transfer followed by no data.
2. A compelled data transfer consists of an address transfer followed by a block of
one or more data transfers to one or more contiguous addresses.
3. A packet data transfer mode consists of an address transfer followed by a fixed
length block of data transfers (packet) from a set of contiguous addresses.
A connected transaction carries out a master's request and a slave's response in a single bus transaction.
A split transaction splits the request and the response into separate bus transactions. Split transactions allow devices with a long data latency or access time to use the bus resources in a more efficient way.
Example: Physical address caches include the split cache design in the Silicon Graphics workstation and the unified cache designs in the VAX 8600 and the Intel i486.
The figure given above shows the split cache design in the Silicon Graphics 4-D Series workstation. Both the data cache and the instruction cache are accessed with a physical address issued from the on-chip MMU. A two-level data cache is implemented in this design: the first level uses a 64-KB write-through (WT) D-cache, and the second level uses a 256-KB write-back (WB) D-cache. The single-level I-cache is 64 KB. By the inclusion property, the first-level cache is always a subset of the second-level cache.
Advantages
1. No aliasing
2. Fewer cache bugs
3. No cache flushing
4. Simple design
Drawback: Cache access is slowed down until the MMU or TLB finishes translating the address.
Virtual address cache
1. The cache is indexed or tagged with a virtual address.
2. The cache can be accessed faster than a physical address cache, since no address translation is needed first.
Example: The virtually addressed split cache design in the Intel i860
The figure given below shows the split caches for instructions and data. Instructions are 32 bits wide, and both virtual and physical addresses are 32 bits. The data cache is 8 KB.
The Aliasing Problem
● The major problem associated with a virtual cache is aliasing, where several virtual addresses map to the same physical memory address.
● As a result, multiple copies of the data at that physical address may be cached in different cache lines.
● When one process updates one copy, the other copies of the same data become stale, and a subsequent read of the same data by another process returns the old value.
● This problem can be solved by flushing the cache when a context switch occurs; a flush writes the contents of the cache back to main memory.
● Excessive flushing results in poor cache performance, with a low hit ratio and too much time wasted in flushing. The cache must also be flushed before I/O writes or after I/O reads (a toy illustration of aliasing follows).
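The aliasing effect and its cure by flushing can be seen in a toy model. The virtual addresses, page table, and dictionary-based cache below are all assumptions for illustration; no real cache organization is implied.

physical_memory = {0x100: "old"}
page_table = {0x1000: 0x100, 0x2000: 0x100}    # two virtual aliases of 0x100
virtual_cache = {}                             # indexed by virtual address

def read(vaddr):
    if vaddr not in virtual_cache:             # miss: fill from physical memory
        virtual_cache[vaddr] = physical_memory[page_table[vaddr]]
    return virtual_cache[vaddr]

def write(vaddr, value):
    virtual_cache[vaddr] = value               # updates only this alias's line
    physical_memory[page_table[vaddr]] = value

read(0x1000); read(0x2000)      # both aliases now cached in different lines
write(0x1000, "new")
print(read(0x2000))             # "old": the other alias is stale
virtual_cache.clear()           # cache flush, e.g. at a context switch
print(read(0x2000))             # "new" after the flush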
In a direct-mapped cache, each memory block maps to exactly one cache block frame. Once the frame is identified, the tag stored in the cache block frame is compared with the tag of the memory address; a cache hit occurs when the two tags match. When a cache miss occurs, the entire memory address (s + w bits) is used to access main memory: the first s bits locate the addressed block, and the lower w bits locate the word within the block.
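A minimal sketch of this direct-mapped lookup, with assumed field widths: the block address is split into a frame index and a tag, and a hit requires the stored tag to match. The example also shows the rigid-mapping drawback noted below, since two blocks sharing a frame evict each other.

W_BITS = 4                       # w bits: word offset within a block (assumed)
N_FRAMES = 16                    # number of cache block frames (assumed)
tags = [None] * N_FRAMES         # one stored tag per block frame

def lookup(address):
    block = address >> W_BITS    # upper s bits: memory block address
    frame = block % N_FRAMES     # rigid mapping: block maps to one frame only
    tag = block // N_FRAMES      # remaining bits form the tag
    if tags[frame] == tag:
        return "hit"
    tags[frame] = tag            # miss: fetch the block and install its tag
    return "miss"

# Two blocks that share frame 3 evict each other:
print(lookup(0x1230), lookup(0x1230), lookup(0x5230))    # miss hit miss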
Advantages of Direct Mapped Cache
1. Simplicity in hardware
2. Lower cost
3. No search needed
Drawback
1. Rigid mapping may result in poor hit ratio.
Advantages of Fully Associative Cache
1. Higher hit ratio
2. Better block replacement policy with reduced block contention
Drawback
1. Expensive search process with additional hardware cost
Set Associative Cache
It is the most popular cache design and can offer the best performance/cost ratio. In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set. Each set is identified by a d-bit set number, where 2^d = v. Each cache block frame carries a tag of (s - d) bits. In a k-way associative search, the tag of the received memory address is compared with all k tags of the identified set. Thus a block Bj can be mapped into any of the k block frames of a set Si (see the address decomposition sketch after the list below).
Advantages
1. Yields higher hit ratio
2. k-way associative search is easier to implement.
3. The block replacement policy needs to consider only the few blocks within the same set.
4. For a fixed cache size, there exists a tradeoff between the set size and the number of sets.
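The address decomposition above can be made concrete with assumed sizes: m = 64 frames and k = 4 give v = 16 sets and d = 4. The sketch below splits a memory block address into its d-bit set number and (s - d)-bit tag.

m, k = 64, 4                # 64 block frames, 4-way set-associative (assumed)
v = m // k                  # v = m/k = 16 sets
d = v.bit_length() - 1      # d = 4 set-index bits, since 2**d = v
W_BITS = 4                  # w word-offset bits within a block (assumed)

def decompose(address):
    block = address >> W_BITS       # s-bit memory block address
    set_index = block & (v - 1)     # low d bits select the set Si
    tag = block >> d                # remaining (s - d) bits are the tag
    return set_index, tag

# A hit requires comparing this tag with all k tags stored in the chosen set.
print(decompose(0xABCD))    # (12, 171)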
5.2.4 Cache Performance Issues
Cache performance depends on two related measures:
1. Cycle Count: the number of basic machine cycles needed to access or update the cache.
2. Hit Ratio: it determines how effectively the cache can reduce the overall access time.
Cycle Count: It depends on the hit ratio, the write-through or write-back policy, the cache size, the block size, the number of sets, and the associativity. The cycle count is directly related to the hit ratio. With increasing values of the above cache parameters, the cycle count at first decreases, but the decreasing trend becomes flat and after a certain point turns into an increasing trend. The corresponding behavior of the hit ratio as a function of block size is shown in the figure below.
Figure: Hit ratio as a function of block size.
Low-order interleaving spreads contiguous memory locations across the m modules horizontally: the low-order a bits of the memory address identify the memory module, and the high-order b bits give the word address (displacement) within each module. This is also called the C-access memory scheme.
High-order interleaving spreads contiguous memory locations across the m modules vertically: the high-order a bits of the memory address identify the memory module, and the low-order b bits give the word address (displacement) within each module (see the sketch below).
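The two layouts can be contrasted directly. In this sketch a = 2 and b = 4 are assumed, giving m = 4 modules of 16 words each; low-order interleaving takes the module number from the low a bits, high-order interleaving from the high a bits.

a, b = 2, 4                 # assumed: m = 2**a = 4 modules, 2**b = 16 words each
m = 1 << a

def low_order(addr):        # C-access: low a bits select the module
    return addr & (m - 1), addr >> a            # (module, word within module)

def high_order(addr):       # high a bits select the module
    return addr >> b, addr & ((1 << b) - 1)     # (module, word within module)

for addr in range(4):
    print(addr, low_order(addr), high_order(addr))
# Consecutive addresses land in modules 0,1,2,3 with low-order interleaving,
# but all stay in module 0 with high-order interleaving.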
Memory Bandwidth
The memory bandwidth B of an m-way interleaved memory is upper-bounded by m and lower-bounded by 1. Hellerman's estimate of B is
B = m^0.56 ≈ √m
where m is the number of memory modules. Hellerman's estimate assumes a single-processor system.
Consider a vector processing system where a vector of n elements is stored in contiguous memory locations in an m-way interleaved memory system. The average time t1 required to access one element of the vector is estimated by
t1 = (θ/m)(1 + (m − 1)/n)
As n → ∞ (a very long vector), t1 → θ/m = τ. As n → 1 (scalar access), t1 → θ.
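Both estimates are easy to check numerically. The values θ = 80 (arbitrary time units) and m = 16 below are assumptions for illustration.

theta, m = 80.0, 16              # assumed major cycle theta, m-way interleaving

B = m ** 0.56                    # Hellerman's estimate for a single processor
print(round(B, 2))               # 4.72, close to sqrt(16) = 4

def t1(n):                       # average access time per vector element
    return (theta / m) * (1 + (m - 1) / n)

print(t1(1))                     # scalar access (n = 1): t1 = theta = 80.0
print(round(t1(10**6), 4))       # very long vector: t1 -> theta/m = tau = 5.0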
Fault Tolerance
In high-order interleaving, sequential addresses are assigned within a single memory module. This makes it easier to isolate faulty memory modules in a memory bank of m modules: when one module failure is detected, the remaining modules can still be used by opening a window in the address space. This fault isolation cannot be done with low-order interleaving, in which a single module failure paralyzes the entire memory bank. Thus a low-order interleaved memory is not fault tolerant.
5.4 Sequential and Weak Consistency Model
5.4.1 Atomicity and Event Ordering
The problem of memory inconsistency arises when the memory-access order differs from the program execution order. When the memory accesses (for instructions and data) are consistent with the program execution order, the system is said to be sequentially consistent. Sequential consistency for SISD systems is illustrated in the figure given below.
Primitive memory operations for multiprocessors include load(read), store (write), and
one or more synchronization operations such as swap.
Event Orderings :
On a multiprocessor, concurrent instruction streams (or threads) executing on different processors are called processes. Different processes perform memory operations on the shared memory, so it is necessary to determine the order in which the memory events performed by one process are observed by the other processes in the machine.
The event ordering can be used to declare whether a memory event is legal or illegal,
when several processes are accessing a common set of memory locations.
Maintaining the correctness and predictability of the execution results is rather complex
on an MIMD system for the following reasons:
1. The order in which instructions belonging to different streams are executed is not
fixed in a parallel program. If no synchronization among the instruction streams
exists, then a large number of different instruction interleavings is possible.
2. If, for performance reasons, the order of execution of instructions belonging to the same stream differs from the program order, then an even larger number of instruction interleavings is possible.
Example: The shared variables A, B, and C are set to 0 initially. If the outputs of all three processors are concatenated in the order P1, P2, P3, then the output forms a 6-tuple of binary values, so there are 2^6 = 64 possible output combinations. If all processors execute instructions in their own program orders, then the execution interleaving a, b, c, d, e, f is possible, yielding the output 001011. The interleaving a, c, e, b, d, f also preserves the program orders and yields the output 111111. The interleaving b, d, f, e, a, c does not preserve the program order and yields the output 000000. Of the 6! = 720 possible execution interleavings, 90 preserve the individual program orders (a brute-force check follows).
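These counts can be verified by brute force. Following the interleavings quoted above, the two instructions of P1 are labeled a and b, those of P2 are c and d, and those of P3 are e and f (this pairing is inferred from the example, not stated explicitly).

from itertools import permutations

program_order = [("a", "b"), ("c", "d"), ("e", "f")]   # per-processor orders

def preserves_program_order(seq):
    return all(seq.index(x) < seq.index(y) for x, y in program_order)

orderings = list(permutations("abcdef"))
legal = [s for s in orderings if preserves_program_order(s)]
print(len(orderings), len(legal))    # 720 90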
Atomicity
A shared memory access is atomic if the memory updates are known to all processors
at the same time. Thus a store is atomic if the value stored becomes readable to all the
processors at the same time. Hence, if memory accesses are atomic and all individual program orders are preserved, sequential consistency can be achieved with a strong consistency model.
If memory accesses are non-atomic, then the multiprocessor cannot be strongly ordered, and a weak consistency model is necessary.
5.4.2 Sequential Consistency Model
The sequential consistency memory model can be pictured as follows:
1. There is a single port that is able to service exactly one memory operation at a time, and a switch that connects this memory to one of the processors for the duration of each memory operation.
2. The order in which the switch is thrown from one processor to another
determines the global order of memory-access operations which is observed by
all processors.
3. Also the strong ordering of all the shared memory accesses in the sequential
consistency model preserves the program order in all processors.
4. Also another processor is not allowed to access a shared memory until the
recent value written to shared memory has been globally performed. This may
require the propagation of all shared-memory accesses to all processors, which
is rather time-consuming and costly.
5. Also sequential consistency model has poor scalability.
Strong ordering introduces unnecessary processor/cache waiting time and reduces concurrency.
5.4.3 Weak Consistency Model
In the TSO (total store order) weak consistency model, the memory events store and swap are placed in a dedicated store buffer in each processor. The order in which these memory events are performed is the same as the order in which the processor issued them (program order).
The memory order corresponds to the order in which the switch is thrown from one
processor to another.
A load by a processor first checks its store buffer to see if it contains a store to the same
location. If it does, then the load returns the value of the most recent such store.
Otherwise, the load goes directly to memory.
A swap is placed in the store buffer like a store and it blocks the processor like a load.
In other words, swap blocks until the store buffer is empty and then proceeds to the
memory.
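The store-buffer rules just described (FIFO stores, load forwarding from the local buffer, and a swap that drains the buffer before proceeding) can be sketched as follows; the class and method names are assumptions for illustration.

from collections import deque

memory = {}

class Processor:
    def __init__(self):
        self.store_buffer = deque()        # stores retire to memory in FIFO order

    def store(self, addr, value):
        self.store_buffer.append((addr, value))      # buffered in program order

    def load(self, addr):
        for a, v in reversed(self.store_buffer):     # check own buffer first
            if a == addr:
                return v                   # most recent buffered store to addr
        return memory.get(addr)            # otherwise the load goes to memory

    def drain(self):                       # a swap blocks until this completes
        while self.store_buffer:
            a, v = self.store_buffer.popleft()
            memory[a] = v                  # store is now globally performed

p = Processor()
p.store("x", 1)
print(p.load("x"), memory.get("x"))   # 1 None: visible locally, not globally
p.drain()
print(memory["x"])                    # 1: the store has reached memory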
The weak consistency model may offer better performance than the sequential consistency model, at the expense of more complex hardware and software support and greater programmer awareness of the imposed restrictions.