
Module 3: Chapter 1

5.1 Bus Systems

In bus-based commercial multiprocessor systems, many processors may request
access to the bus at the same time, but only one of them is granted access at a
time. The effective bandwidth available to each processor is inversely
proportional to the number of processors contending for the bus. Hence a
bus-based multiprocessor system should be kept small, with roughly 4 to 16 processors.
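
As a rough illustration of this scaling limit, the sketch below divides a total bus bandwidth evenly among the contending processors; the bandwidth figure and the processor counts are assumed values chosen only for illustration.

```python
# Sketch: effective per-processor bandwidth on a shared bus.
# The total bus bandwidth (800 MB/s) and the processor counts are assumed
# illustrative values, not figures from these notes.

TOTAL_BUS_BANDWIDTH_MBS = 800   # assumed peak bus bandwidth in MB/s

def per_processor_bandwidth(num_processors: int) -> float:
    """Effective bandwidth per processor when all contend equally for the bus."""
    return TOTAL_BUS_BANDWIDTH_MBS / num_processors

for n in (4, 8, 16, 32):
    print(f"{n:2d} processors -> {per_processor_bandwidth(n):6.1f} MB/s each")
```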

5.1.1 Backplane Bus Specification


1. The backplane bus interconnects processors, data storage, and peripheral
devices.
2. Timing protocols should be developed to arbitrate (decide which request is granted
bus access) among multiple requests.
3. Operational rules must be set to ensure orderly data transfers on the bus.
4. The backplane bus is shown in the given figure.
5. Various functional boards are plugged into slots on the backplane, which has one
or more connectors for inserting the boards.

Data Transfer Bus


It contains data, address, and control lines. The address lines are used to broadcast the
data and device addresses. The number of address lines is proportional to the logarithm of the
size of the address space. Address-modifier lines can be used to define special
addressing modes. The number of data lines is often proportional to the memory word length.
The VME bus, for example, has 32 address lines. The DTB control lines are used to indicate
read/write, timing control, and bus error conditions.
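
As a small illustration of the logarithmic relationship between address-space size and the number of address lines, the sketch below computes the minimum line count for a few assumed address-space sizes.

```python
# Sketch: the number of address lines needed grows as log2 of the address
# space size. The address-space sizes below are assumed examples.
import math

def address_lines_needed(address_space_size: int) -> int:
    """Minimum number of address lines to address every location uniquely."""
    return math.ceil(math.log2(address_space_size))

for size in (2**16, 2**24, 2**32):   # 64 K, 16 M, 4 G addressable locations
    print(f"{size:>12d} locations -> {address_lines_needed(size)} address lines")
```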

Bus Arbitration and Control


The process of assigning the DTB to a requesting processor is called arbitration.
The requester is called a master, and the device that receives and services the request is
called a slave. Dedicated lines are provided for managing the arbitration process. Utility
lines include signals for managing the power-up and power-down sequences of the system.
A special bus controller board houses the backplane control logic, such as the system
clock driver, arbiter, bus timer, and power driver.

Functional Modules
Various functional modules are given below.
Arbiter: It accepts bus requests from the requester modules and grants control of the
DTB to one requester at a time.
Bus Timer: It measures the time each data transfer takes on the DTB and terminates
the DTB cycle if a transfer takes too long.
Interrupter Module: It generates an interrupt request along with its status/ID
information.
Location Monitor: It monitors data transfers over the DTB.
Power Monitor: It watches the status of the power source and signals when power
becomes unstable.
System Clock Driver: It is a module that provides a clock timing signal on the utility bus.
Physical Limitations
1. Due to electrical, mechanical, and packaging limitations, only a limited number of
boards can be plugged into a single backplane bus.
2. The bus system is difficult to scale and mainly limited by contention.
3. Multiple backplane buses can be mounted on the same backplane chassis.
5.1.2 Addressing and Timing Protocols

The bus supports two types of devices. Active devices such as processors can act as bus
masters or as slaves at different times. Passive devices such as memories can act only as
slaves. Only one master can control the bus at a time.

Bus Addressing
Every bus has an internal clock with a fixed bus cycle time. Parallel lines on different
buses can be used for different operations during the same bus cycle, but not all bus
cycles are used for data transfer. To optimize performance, the bus should be
designed to minimize the time required for request handling, arbitration, addressing, and
interrupts so that most bus cycles are used for useful data transfer operations.

Broadcall and Broadcast


Broadcall is a read operation in which multiple slaves place data on the bus lines; it is
useful for detecting multiple interrupt sources. A broadcast is a write operation involving
multiple slaves and is useful for maintaining multiprocessor cache coherence.
The figure given below shows the typical timing sequence when information is transferred
over a bus from a source to a destination. Most bus timing protocols implement such a
sequence.

Synchronous Timing
All bus transactions take place with reference to a fixed-length clock signal. The master uses
a data-ready pulse to initiate the transfer. The slave uses a data-accept pulse to signal
completion of the information transfer. A synchronous bus is simple to control, requires less
control circuitry, and thus costs less. It is suitable for connecting devices having relatively
the same speed. It is shown in the figure given below.
Asynchronous Timing
It is implemented using a handshaking or interlocking mechanism, and the bus transaction
takes place with variable-length signalling. The rising edge (1) of the data-ready signal
from the master triggers the rising edge (2) of the data-accept signal from the slave. The
second signal triggers the falling edge (3) of the data-ready clock and the removal of data from
the bus. The third signal triggers the trailing edge (4) of the data-accept clock. This
four-edge handshaking process is repeated until all the data are transferred.

Because of the variable-length signalling, fast and slow devices can be connected on
the same bus, and the bus is less prone to noise. Overall, an asynchronous bus offers better
application flexibility at the expense of increased complexity and cost.

5.1.3 Arbitration, Transaction and Interrupt


The process of assigning the DTB to the requesting processor or master is called
arbitration. The duration of a master's control over the bus is called the bus tenure.
Schemes for arbitration
1. Central Arbitration
2. Distributed Arbitration

Central Arbitration
1. It uses a central arbiter. The bus-grant signal is propagated from the first master (slot
1) to the last master (slot n).
2. Each master can send a bus request. However, all requests share the same
bus-request line. As shown in the figure, the bus-request signal triggers the rise of the
bus-grant level, which in turn raises the bus-busy level.
3. The bus-grant signal serially propagates through each master until it encounters
the first one that is requesting access to the bus. This master blocks the
propagation of the bus-grant signal, activates the busy line, and gains control of
the bus (see the sketch after this list).
4. Fixed priority is set from left to right, so the lower-priority devices are on the
right. Thus a lower-priority device on the right gets bus control only when the
devices to its left do not request the bus.
5. When the bus transaction is complete, the bus-busy level is lowered, which
triggers the falling of the bus-grant signal and the subsequent rise of the
bus-request signal. The advantage of this arbitration scheme is its simplicity.
6. A disadvantage is that the fixed-priority sequence violates the fairness practice. Another
drawback is the slowness of propagating the bus-grant signal along the daisy
chain.
7. Whenever a higher-priority device fails, all the lower-priority devices to its right
on the daisy chain cannot use the bus.
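
A minimal sketch of the daisy-chained grant propagation described above is given below; it assumes slot 1 has the highest priority, and the request pattern is an assumed example.

```python
# Sketch: daisy-chained central arbitration.
# Slot 1 (index 0) has the highest priority; the grant propagates from left to
# right and stops at the first requesting master. The requests are assumed data.

def daisy_chain_arbitrate(requests: list[bool]) -> int | None:
    """Return the index of the master that wins the bus, or None if no request."""
    for slot, requesting in enumerate(requests):
        if requesting:
            return slot   # this master blocks further grant propagation
    return None           # no master requested the bus

requests = [False, True, False, True]   # masters in slots 2 and 4 request the bus
winner = daisy_chain_arbitrate(requests)
print(f"Bus granted to slot {winner + 1}")   # -> slot 2 (the higher-priority requester)
```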
Independent Request and Grant: Instead of using shared request and grant lines as in
the figure above, multiple bus-request and bus-grant signal lines can be provided independently
for each potential master, as shown in the figure, but the total number of signal lines
required is larger. The arbitration among potential masters is still carried out by a central
arbiter; however, any priority-based or fairness-based bus allocation policy can be
implemented. The advantage is flexibility and a faster arbitration time compared with
the daisy-chained policy. The drawback is the larger number of arbitration lines used.
Distributed Arbitration
1. Each potential master is equipped with its own arbiter and a unique arbitration
number.
2. When two or more devices compete for the bus, the winner is the one whose
arbitration number is the largest (a sketch follows this list).
3. All potential masters can send their arbitration numbers to the shared-bus
request/grant (SBRG) lines on the arbitration bus.
4. Each arbiter compares the resulting number on the SBRG lines with its own
arbitration number. If the SBRG number is greater, the requester is dismissed. At
the end, the winner's arbitration number remains on the arbitration bus. After the
current bus transaction is completed, the winner seizes control of the bus.
5. The scheme is shown in the figure given below.
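
The sketch below models this parallel competition in software. It assumes the wired-OR behaviour of the SBRG lines can be modelled bit by bit from the most significant bit downward, and the arbitration numbers used are assumed examples.

```python
# Sketch: distributed arbitration over shared SBRG lines.
# The shared lines are modelled as a bitwise OR of the numbers driven by the
# still-competing masters; a master withdraws as soon as the line shows a 1
# in a bit position where its own number has a 0. Numbers and width are assumed.

def distributed_arbitrate(arbitration_numbers: list[int], width: int = 4) -> int:
    """Return the arbitration number of the winner (the largest number)."""
    competing = set(arbitration_numbers)
    for bit in reversed(range(width)):              # resolve from MSB to LSB
        line = 0
        for number in competing:
            line |= number & (1 << bit)             # wired-OR of the driven bits
        # masters whose bit disagrees with the line value drop out of contention
        competing = {n for n in competing if (n & (1 << bit)) == (line & (1 << bit))}
    (winner,) = competing                           # the winner's number remains
    return winner

print(distributed_arbitrate([5, 12, 9]))            # -> 12, the largest number
```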
Transaction Modes: There are three different data transfer modes
1. An address-only mode consists of an address transfer followed by no data.
2. A compelled data transfer consists of an address transfer followed by a block of
one or more data transfers to one or more contiguous addresses.
3. A packet data transfer mode consists of an address transfer followed by a fixed
length block of data transfers (packet) from a set of contiguous addresses.
A connected transaction is used to carry out a master's request and a slave's response
in a single bus transaction.
A split transaction splits the request and the response into separate bus transactions. Split
transactions allow devices with a long data latency or access time to use the bus
resources in a more efficient way.

Interrupt Mechanism: An interrupt is a request from I/O or other devices to a processor
for service or attention. A priority interrupt bus is used to pass the interrupt signals. For
example, the VME bus uses seven interrupt-request lines, and up to seven interrupt
handlers can be used to handle multiple interrupts.

5.2 Cache Memory Organization


5.2.1 Cache Addressing Models
In a shared-memory multiprocessor system, each processor has its own private cache.
Caches can be addressed using either a physical address or a virtual address. This
leads to the two different cache design models given below.
1. Physical address cache
2. Virtual address cache

Physical Address cache


1. The cache is addressed using the physical memory address.
2. Initially the virtual address is translated to a physical address by the TLB or
the MMU, and then the cache lookup is performed.
3. If the data is found in the cache it is a cache hit; if the data is not found it is a
cache miss. In case of a cache miss, the required block of data is
loaded from the main memory into the cache. Since a whole cache block is loaded at
one time, some unwanted data may also be loaded, but locality of reference makes
most of the loaded data useful in subsequent instruction cycles.
4. A cache can be a write-through (WT) cache, where data is written through to the
main memory immediately, or a write-back (WB) cache, where the write to main
memory is delayed until the cache block is replaced.

Example: Split cache design in the Silicon Graphics workstation, and unified cache
design in the VAX 8600 and the Intel i486.

The figure given above shows the split cache design in the Silicon Graphics 4-D Series
workstation. Both the data cache and the instruction cache are accessed with a physical
address issued from the on-chip MMU. A two-level data cache is implemented in this
design: the first level uses 64 KBytes of WT D-cache, and the second level uses 256
KBytes of WB D-cache. The single-level I-cache is 64 KBytes. By the inclusion property,
the first-level cache is always a subset of the second-level cache.
Advantages
1. No aliasing
2. Fewer cache bugs
3. No cache flushing
4. Simple design
Drawback: Slowdown in accessing the cache until the MMU or TLB finishes translating
the address.
Virtual address cache
1. The cache is indexed or tagged with a virtual address.
2. Here the cache can be accessed faster compared to a physical address cache.
Example: The virtually addressed split cache design in the Intel i860.
The figure given below shows a split cache for instructions and data. Instructions are 32 bits
wide. Virtual and physical addresses are also 32 bits. The data cache is 8 KBytes.
The Aliasing Problem
● The major problem associated with a virtual cache is aliasing, where several
virtual addresses map to the same physical memory address.
● Hence multiple copies of the data found at that physical address may be cached in
different cache lines.
● When one copy of the data is updated by one process, the other copies of the
same data become stale, and a subsequent read of the same data by another
process will return the old value.
● This problem can be solved by cache flushing when a context switch occurs. A
flush writes back the contents of the cache to the main memory.
● Large amounts of flushing may result in poor cache performance, with a low hit
ratio and too much time wasted in flushing. Also, before I/O writes or after I/O
reads, the cache must be flushed.

5.2.2 Direct Mapping and Associative Caches


Blocks in caches are called block frames in order to distinguish them from the blocks in
main memory. Block frames are denoted B̄_i for i = 0, 1, 2, ..., m−1, and memory blocks are
denoted B_j for j = 0, 1, 2, ..., n−1, where n >> m, n = 2^s, and m = 2^r.
Each block or block frame contains b words, where b = 2^w. Thus the cache consists of
m·b = 2^(r+w) words, and the main memory has n·b = 2^(s+w) words.

Direct Mapping Cache


The block B_j is mapped to block frame B̄_i if
i = j (mod m).
Direct mapping is very rigid, but it is the simplest cache organization to
implement.
Direct mapping is shown in the figure given below, where each memory block contains 4
words, hence w = 2.
The memory address is divided into three fields. The lower w bits specify the word offset
within each block. The upper s − r bits form the tag to be matched. The block field of r
bits identifies the cache block frame for the main memory block. Once B̄_i is
identified, the tag of the cache block frame is compared with the tag of the memory
address. A cache hit occurs when the two tags match. When a cache miss occurs, the entire
memory address (s + w bits) is used to access the main memory. The first s bits locate the
addressed block, and the lower w bits locate the word within the block.
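
A minimal sketch of this address decomposition is given below; the field widths (w, r, s) are assumed example values, not figures from these notes.

```python
# Sketch: splitting a memory address for a direct-mapped cache.
# Field widths are assumed example values: a w-bit word offset, an r-bit
# block-frame field, and an (s - r)-bit tag, for an (s + w)-bit address.

W_BITS = 2          # word offset within a block (block size = 2**W_BITS words)
R_BITS = 3          # cache has 2**R_BITS block frames
S_BITS = 7          # main memory has 2**S_BITS blocks

def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, frame_index, word_offset) for a direct-mapped cache."""
    word_offset = addr & ((1 << W_BITS) - 1)
    block_number = addr >> W_BITS                      # s-bit memory block number
    frame_index = block_number & ((1 << R_BITS) - 1)   # i = j mod m
    tag = block_number >> R_BITS                       # upper s - r bits
    return tag, frame_index, word_offset

tag, frame, offset = split_address(0b101_0110_11)      # assumed example address
print(f"tag={tag:04b} frame={frame:03b} offset={offset:02b}")
```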
Advantages of Direct Mapped Cache
1. Simplicity in hardware
2. Lower cost
3. No search needed
Drawback
1. Rigid mapping may result in poor hit ratio.

Fully Associative Cache


Each block in main memory can be placed in any of the available block frames. The tag
contains s bits, and the tag of the incoming memory address is compared with the tags of all
block frames in the cache. The hit ratio is therefore higher.
The m-way comparison of all the tags is very time-consuming if the tags are compared
sequentially using RAMs. Thus an associative memory is needed to achieve a parallel
comparison with all tags simultaneously, which incurs a higher implementation cost. The
figure given below shows a fully associative cache with a 4-bit tag.

Advantages:
Higher hit ratio
Better block replacement policy with reduced block contention
Disadvantages
Expensive search process with additional hardware cost
Set Associative Cache
It is the most popular cache design and can offer the best performance/cost ratio. In a
k-way set-associative cache, the m cache block frames are divided into v = m/k sets,
with k block frames per set. Each set is identified by a d-bit set number, where 2^d = v.

Each cache block frame carries a tag of s − d bits. In a k-way associative search, the tag of the
incoming memory address is compared with all k tags of the identified set.
Thus a block B_j can be mapped into any of the block frames of a set S_i of the cache;
a sketch of this mapping is given below.
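
The sketch below shows how a memory block number selects a set and how the remaining bits form the tag in a k-way set-associative cache; the parameter values are assumed examples.

```python
# Sketch: set selection and tag formation in a k-way set-associative cache.
# The parameter values (m frames, k ways) are assumed examples.

M_FRAMES = 64                        # total cache block frames, m = 2**r
K_WAYS = 4                           # associativity k
V_SETS = M_FRAMES // K_WAYS          # v = m / k sets
D_BITS = V_SETS.bit_length() - 1     # d bits, where 2**d = v

def locate(block_number: int) -> tuple[int, int]:
    """Map a memory block number to (set_index, tag)."""
    set_index = block_number & (V_SETS - 1)    # lower d bits select the set
    tag = block_number >> D_BITS               # remaining s - d bits form the tag
    return set_index, tag

# The block may then occupy any of the K_WAYS frames within the selected set.
set_index, tag = locate(0b1011_0110)           # assumed example block number
print(f"set={set_index} tag={tag} (any of {K_WAYS} ways in this set)")
```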

Advantages
1. It yields a higher hit ratio than direct mapping.
2. The k-way associative search is easier to implement than a fully associative search.
3. The block replacement policy only needs to consider the few blocks within a set.
4. For a fixed cache size there exists a tradeoff between the set size and the
number of sets.
5.2.4 Cache Performance Issues
Cache performance depends on two related concepts:

1. Cycle count: the number of basic machine cycles needed to access or
update the cache.
2. Hit ratio: it determines how effectively the cache can reduce the overall access time.

Hit ratio = (number of cache hits) / (number of cache hits + number of cache misses)

3. There are two approaches to studying cache performance:

Program trace-driven simulation: snapshots of the program behavior and cache
responses are taken at regular intervals, but such studies provide only a
microscopic perspective.

Analytical modelling: it may deviate from reality because of simplifications, but it
provides macroscopic insight into the underlying process.

Cycle Count: It depends on the hit ratio, the write-through or write-back policy, the cache
size, the block size, the number of sets, and the associativity. The cycle count is directly
related to the hit ratio and decreases as the above cache parameters increase, but the
decreasing trend becomes flat and after a certain point turns into an increasing
trend, as shown in the figure below.
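
As a small illustration of how the hit ratio feeds into access cost, the sketch below applies the standard average-memory-access-time formula; the hit time and miss penalty are assumed values, not parameters from these notes.

```python
# Sketch: effect of the hit ratio on the average access cost (in cycles),
# using average = h * hit_time + (1 - h) * miss_penalty.
# The hit time and miss penalty below are assumed illustrative values.

HIT_TIME_CYCLES = 1          # assumed cache hit time
MISS_PENALTY_CYCLES = 20     # assumed main-memory penalty on a miss

def average_access_cycles(hit_ratio: float) -> float:
    return hit_ratio * HIT_TIME_CYCLES + (1 - hit_ratio) * MISS_PENALTY_CYCLES

for h in (0.80, 0.90, 0.95, 0.99):
    print(f"hit ratio {h:.2f} -> {average_access_cycles(h):5.2f} cycles on average")
```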
Hit Ratio

1. It is affected by cache size and block size in different ways.
2. The hit ratio increases with increasing cache size.
3. As the cache size approaches infinity, the hit ratio tends toward 100%.
4. But this is not achievable, because the cache size is bounded by a limited budget.

Block Size

1. With a fixed cache size, cache performance is sensitive to block size.
2. As the block size increases, the hit ratio rises due to spatial locality in referencing
larger instruction/data blocks.
3. The increase reaches a peak at an optimum block size.
4. Beyond that point the hit ratio starts decreasing, and it approaches zero when the
block size equals the entire cache size.
Effect of Set Number: For a fixed-capacity cache, the hit ratio may decrease as the number of
sets increases.

Other Performance Factors

1. In performance-directed design, tradeoffs exist among cache size, set number,
block size, and memory speed.
2. Independent blocks, fetch sizes, and fetch strategies also affect performance.
3. Sometimes multilevel cache hierarchies are used.
4. Very often a write-through policy is used for the first-level cache and a write-back
policy for the second-level cache.
5.3 Shared Memory Organization
Memory interleaving provides pipelined access to parallel memory modules and thereby a
higher memory bandwidth.
Memory Interleaving
The main memory is built with multiple memory modules. On receiving a memory
address, each memory module returns one word per memory cycle. It is possible to present
different addresses to different memory modules so that multiple words can be
accessed simultaneously or in a pipelined fashion.

Consider a main memory with m = 2^a memory modules, each containing w = 2^b words.

The total memory capacity is m·w = 2^(a+b) words. These memory words are assigned linear
addresses. Different ways of assigning linear addresses result in different memory
organizations, as given below.

Low-order interleaving spreads contiguous memory locations across the m modules
horizontally. Thus the lower-order a bits of the memory address are used to identify the
memory module, and the higher-order b bits are the word address (displacement) within
each module. This is also called the C-access memory scheme.
High-order interleaving assigns contiguous memory locations within one module at a time,
vertically. Thus the higher-order a bits of the memory address are used to identify the memory
module, and the lower-order b bits are the word address (displacement) within each
module.
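
A minimal sketch contrasting the two address decompositions is given below; the module count and the number of words per module are assumed example values.

```python
# Sketch: low-order vs. high-order interleaving of a linear memory address.
# m = 2**a modules, w = 2**b words per module; the values below are assumed.

A_BITS = 3                  # m = 8 memory modules
B_BITS = 5                  # w = 32 words per module

def low_order(addr: int) -> tuple[int, int]:
    """Low-order interleaving: the lower a bits select the module (C-access)."""
    module = addr & ((1 << A_BITS) - 1)
    offset = addr >> A_BITS
    return module, offset

def high_order(addr: int) -> tuple[int, int]:
    """High-order interleaving: the upper a bits select the module."""
    module = addr >> B_BITS
    offset = addr & ((1 << B_BITS) - 1)
    return module, offset

addr = 0b101_10110          # assumed example 8-bit linear address
print("low-order :", low_order(addr))   # consecutive addresses spread across modules
print("high-order:", high_order(addr))  # consecutive addresses stay in one module
```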

Pipelined Memory Access


Access to the m memory modules can be overlapped in a pipelined fashion, as shown in
the figure given below. For this purpose, the memory cycle (called the major cycle) is
subdivided into m minor cycles.
The figure given above is for an eight-way interleaved memory with m = w = 8 and a = b = 3.
Here θ is the major cycle and τ is the minor cycle, with

τ = θ/m

where m is the degree of interleaving.
The major cycle θ is the total time required to complete the access of a single word from
a module. The minor cycle τ is the actual time needed to produce one word. Even
though the total block access time is 2θ, the effective access time per word is
reduced to τ when the memory is accessed contiguously in a pipelined fashion.

5.3.2 Bandwidth and Fault Tolerance

Memory Bandwidth
The memory bandwidth B of an m-way interleaved memory is upper-bounded by m and
lower-bounded by 1. Hellerman's estimate of B is

B = m^0.56 ≈ √m

where m is the number of memory modules. Hellerman's estimate was derived assuming
a single-processor system.
Consider a vector processing system where a vector of n elements is stored in
contiguous memory locations in an m-way interleaved memory system. The average time
t1 required to access one element of the vector is estimated by

t1 = (θ/m) · (1 + (m − 1)/n)

As n → ∞ (a very long vector), t1 → θ/m = τ. As n → 1 (scalar access), t1 → θ.
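
The sketch below evaluates this estimate for a few vector lengths; the major cycle time and the degree of interleaving are assumed example values.

```python
# Sketch: average per-element access time in an m-way interleaved memory,
# t1 = (theta / m) * (1 + (m - 1) / n). Theta and m below are assumed values.

THETA_NS = 80                       # assumed major memory cycle time in ns
M_MODULES = 8                       # assumed degree of interleaving
TAU_NS = THETA_NS / M_MODULES       # minor cycle

def avg_access_time_ns(n_elements: int) -> float:
    return (THETA_NS / M_MODULES) * (1 + (M_MODULES - 1) / n_elements)

for n in (1, 4, 16, 1024):
    print(f"n={n:5d} -> t1 = {avg_access_time_ns(n):6.2f} ns (tau = {TAU_NS:.1f} ns)")
```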

Fault Tolerance

In high-order interleaving, sequential addresses are assigned within a single memory
module. This makes it easier to isolate faulty memory modules in a memory bank of m
memory modules. When one module failure is detected, the remaining modules can
still be used by opening a window in the address space. This fault isolation cannot be
done with low-order interleaving, in which a memory module failure will completely
paralyze the entire memory bank. Thus a low-order interleaved memory is not fault
tolerant.
5.4 Sequential and Weak Consistency Model
5.4.1 Atomicity and Event Ordering
The problem of memory inconsistency arises when the memory-access order differs
from the program execution order. When the memory accesses (for instructions
and data) are consistent with the program execution order, we have sequential
consistency. Sequential consistency for SISD systems is shown in the figure below.

In a shared-memory multiprocessor system, there are multiple instruction sequences in
different processors, as shown in the figure. Hence it is necessary to determine the global
memory access order for all the processors by interleaving (arranging) the instruction
sequences.

Primitive memory operations for multiprocessors include load (read), store (write), and
one or more synchronization operations such as swap.

Event Orderings:
On a multiprocessor, concurrent instruction streams (or threads) executing on different
processors are called processes. Different processes perform memory operations on a
shared memory, so it is necessary to determine the order in which the memory events
performed by one process are observed by the other processes in the machine.

The event ordering can be used to declare whether a memory event is legal or illegal
when several processes are accessing a common set of memory locations.

The conditions under which these primitive memory operations are considered performed
are given below:

1. A load by processor Pi is considered performed with respect to processor Pk at a
point in time when the issuing of a store to the same location by Pk can no longer affect
the value returned by the load.
2. A store by Pi is considered performed with respect to Pk at a point in time when an
issued load to the same address by Pk returns the value defined by this store.
3. A load is globally performed if it is performed with respect to all processors and if
the store that is the source of the returned value has been performed with
respect to all processors.

Maintaining the correctness and predictability of the execution results is rather complex
on an MIMD system for the following reasons:
1. The order in which instructions belonging to different streams are executed is not
fixed in a parallel program. If no synchronization among the instruction streams
exists, then a very large number of different instruction interleavings is possible.
2. If, for performance reasons, the order of execution of instructions belonging to the
same stream is allowed to differ from the program order, then an even larger number of
instruction interleavings is possible.
Example: The shared variables A, B, and C are set to 0 initially. If the outputs of all three
processors are concatenated in the order P1, P2, and P3, then the output forms a
6-tuple of binary values. There are 2^6 = 64 possible output combinations. If all
processors execute instructions in their own program orders, then the execution
interleaving a, b, c, d, e, f is possible, yielding the output 001011. Another
interleaving a, c, e, b, d, f also preserves the program orders and yields the output 111111.
The interleaving b, d, f, e, a, c does not preserve the program order and yields the
output 000000. Out of 6! = 720 possible execution interleavings, 90 preserve the
individual program orders.
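
The count of 90 order-preserving interleavings can be checked by a small enumeration, sketched below under the assumption that each of the three processors issues exactly two instructions (a, b on P1; c, d on P2; e, f on P3).

```python
# Sketch: count the interleavings of three 2-instruction streams that preserve
# each stream's own program order. Expected: 6! / (2! * 2! * 2!) = 90 of 720.
from itertools import permutations

streams = {"P1": ["a", "b"], "P2": ["c", "d"], "P3": ["e", "f"]}   # assumed split
instructions = [op for ops in streams.values() for op in ops]

def preserves_program_order(schedule: tuple[str, ...]) -> bool:
    """True if every stream's instructions appear in their issue order."""
    for ops in streams.values():
        positions = [schedule.index(op) for op in ops]
        if positions != sorted(positions):
            return False
    return True

total = order_preserving = 0
for schedule in permutations(instructions):
    total += 1
    order_preserving += preserves_program_order(schedule)

print(f"{order_preserving} of {total} interleavings preserve program order")  # 90 of 720
```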

Hence the behavior of the memory can be categorized as shown below.


1. Program order preserved and uniform observation sequence by all processors.
2. Out of program order allowed and uniform observation sequence by all
processors.
3. Out of program order allowed and nonuniform sequences observed by different
processors.

Atomicity
A shared-memory access is atomic if the memory updates are known to all processors
at the same time. Thus a store is atomic if the value stored becomes readable by all
processors at the same time. If the memory accesses are atomic and all individual
program orders are preserved, sequential consistency can be achieved with a strong
consistency (strong ordering) model.

If the memory accesses are not atomic, then the multiprocessor cannot be strongly ordered,
and a weak consistency model is needed.

5.4.2 Sequential Consistency Model


In this model, the loads, stores, and swaps of all processors appear to execute serially in a
single global memory order that conforms to the individual program orders of the
processors.

1. There is a single port that is able to service exactly one memory operation at a
time, and a switch that connects this memory to one of the processors for the
duration of each memory operation.
2. The order in which the switch is thrown from one processor to another
determines the global order of memory-access operations observed by
all processors.
3. The strong ordering of all shared-memory accesses in the sequential
consistency model preserves the program order in all processors.
4. Another processor is not allowed to access shared memory until the
most recent value written to shared memory has been globally performed. This may
require the propagation of all shared-memory accesses to all processors, which
is rather time-consuming and costly.
5. Consequently, the sequential consistency model has poor scalability.

Lamport's Definition: A multiprocessor system is sequentially consistent if the result of
any execution is the same as if the operations of all the processors were executed in
some sequential order, and the operations of each individual processor appear in this
sequence in the order specified by its program.
Dubois, Scheurich, and Briggs (DSB) (1986) have provided the following two sufficient
conditions to achieve sequential consistency in shared-memory access:
1. Before a load is allowed to perform with respect to any other processor, all
previous load accesses must be globally performed and all previous store
accesses must be performed with respect to all processors.
2. Before a store is allowed to perform with respect to any other processor, all
previous load accesses must be globally performed and all previous store
accesses must be performed with respect to all processors.

Strong ordering introduces unnecessary processor/cache waiting time and reduces
concurrency.

5.4.3 Weak Consistency Model


The weak consistency TSO (total store ordering) model developed for Sun Microsystems'
SPARC architecture is described below.

The memory events store and swap are placed in a dedicated store buffer of each
processor, and the order in which these memory events are performed is the same as
the order in which the processor issued them (program order).
The memory order corresponds to the order in which the switch is thrown from one
processor to another.

A load by a processor first checks its store buffer to see if it contains a store to the same
location. If it does, then the load returns the value of the most recent such store.
Otherwise, the load goes directly to memory.
A swap is placed in the store buffer like a store and it blocks the processor like a load.
In other words, swap blocks until the store buffer is empty and then proceeds to the
memory.

A TSO Formal Specification


1. A load access always returns the value of the latest store to the same memory
location issued by any processor in the system.
2. If two stores appear in a particular program order, then they must also appear in
the same memory order.
3. If a memory operation follows a load in program order, then it must also follow
the load in memory order.
4. A store operation is atomic with respect to other stores.
5. No other store can interleave between the load and store parts of a swap.
6. All stores and swaps must eventually terminate.
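
A minimal per-processor store-buffer model illustrating the load rule above (check the local store buffer first, then memory) is sketched below; the class layout, addresses, and values are assumptions made for illustration, not the SPARC specification itself.

```python
# Sketch: a per-processor FIFO store buffer under a TSO-like model.
# A load first checks the issuing processor's own store buffer (most recent
# matching store wins); otherwise it reads shared memory. Stores drain to
# memory in program (FIFO) order. All names and values are assumed.
from collections import deque

memory = {"A": 0, "B": 0}                     # assumed shared memory contents

class Processor:
    def __init__(self) -> None:
        self.store_buffer = deque()           # FIFO of (address, value) pairs

    def store(self, addr: str, value: int) -> None:
        self.store_buffer.append((addr, value))

    def load(self, addr: str) -> int:
        for buf_addr, value in reversed(self.store_buffer):   # newest first
            if buf_addr == addr:
                return value                  # forwarded from the local buffer
        return memory[addr]                   # otherwise read shared memory

    def drain_one(self) -> None:
        addr, value = self.store_buffer.popleft()   # oldest store drains first
        memory[addr] = value

p1, p2 = Processor(), Processor()
p1.store("A", 1)
print(p1.load("A"))   # 1: P1 sees its own store via the store buffer
print(p2.load("A"))   # 0: not yet visible to P2 until P1 drains its buffer
p1.drain_one()
print(p2.load("A"))   # 1: the store is now globally visible
```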

The weak consistency model may offer better performance than the sequential
consistency model, at the expense of more complex hardware and software support and
greater programmer awareness of the imposed restrictions.
