Parallel Implementation of John Conway's Game of Life
Abstract:
The above separation at first glance seems correct and optimal, since it aims
at an even distribution of work among the processing threads, which should give
the best possible result. A closer and more technical look, however, shows that
this is not the case.
We know that in all modern computer systems the main memory is implemented
with SDRAM technology. SDRAM is organized in banks (usually 4), where access to
each bank is given by supplying a row and a column address. The bank is selected
through separate pins on the memory chip, while the row and column selection
share the same pins: the address is multiplexed, with the row address sent first
and the column address after it. In burst mode, where we want to read many
memory locations in succession, it is advantageous to have them laid out along
columns, because that way we incur the smallest possible overhead. For more
details, refer to a related article on implementing and accessing SDRAM
memories. So if we allocate the table by columns and divide the sections by
rows, so that they are accessed by columns, we will have the best possible
performance for the shared-memory model of the threads. The separation is then
done as shown in the following image:
Finally, because we do not want the outline of the table (the peripheral cells)
to participate in the game (their status stays fixed at its initial value for
the entire duration of execution), we make the corresponding checks in our code
(the points are evident from the comments in the source file); a sketch of this
check follows.
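As an illustration, here is a minimal sketch (with hypothetical names; the actual code is in the source files) of how the rules can be applied while leaving the peripheral cells untouched: the loops simply run from 1 up to the next-to-last row and column.

/* Apply the Game of Life rules to the interior of the board only;
 * the peripheral cells (row 0, row rows-1, column 0, column cols-1)
 * are never written. Cells are assumed to hold 0 (dead) or 1 (alive). */
void apply_rules(int **current, int **next, int rows, int cols)
{
    for (int i = 1; i < rows - 1; i++) {
        for (int j = 1; j < cols - 1; j++) {
            /* Count the 8 neighbours of cell (i, j). */
            int alive = 0;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    if (di != 0 || dj != 0)
                        alive += current[i + di][j + dj];

            /* Conway's rules: a live cell survives with 2 or 3 live
             * neighbours; a dead cell is born with exactly 3. */
            if (current[i][j])
                next[i][j] = (alive == 2 || alive == 3);
            else
                next[i][j] = (alive == 3);
        }
    }
}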
• Synchronization
The threads are scheduled by the operating system according to the policy it
implements, and we cannot know the order in which the threads will execute. For
this reason we must ensure that the threads are synchronized, since some of the
data managed by each thread (more specifically, the elements adjacent to the
elements of the neighbouring thread) create a data dependency between them. To
achieve this synchronization we use the barrier functions of the <pthread.h>
library. PThread barriers work with the following logic: when a thread calls
pthread_barrier_wait(), it suspends its execution until the number of threads
that have also called pthread_barrier_wait() reaches the limit we set when
initializing the barrier. It is therefore reasonable to place a barrier at the
end of each round, to ensure that no thread proceeds to the next round before
all the threads have finished the current one. This way we avoid inconsistencies
in our data and therefore have valid reads during processing.
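For illustration, a minimal self-contained sketch of this barrier logic (the thread count, round count, and names here are only indicative, not the project's actual code):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ROUNDS 3

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    for (int round = 0; round < ROUNDS; round++) {
        /* ... apply the rules to this thread's piece of the board ... */
        printf("thread %ld finished round %d\n", id, round);
        /* Block here until all NUM_THREADS threads reach this point. */
        pthread_barrier_wait(&barrier);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    /* The third argument is the count the barrier waits for. */
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}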
2. Implementation Details
Our implementation is fully dynamic, both in the table dimensions and in the
number of threads, while at the same time there are some default values (80x25
and 1 thread) if the user sets none. So when main reaches the stage where the
table sizes are known, we allocate (malloc()) 2 two-dimensional tables. We need
2 tables so that one holds the current state and the other stores the new state
computed for each cell in each round. At the end of each round it is then enough
to copy the contents of the second table (next state) into the first (current
state); in practice, as described below, this is done by swapping the two
pointers, without any copying.
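A minimal sketch of this double-buffer allocation (illustrative names and dimensions, not necessarily the exact code of main):

#include <stdlib.h>

/* Allocate a rows x cols board: one pointer per row, then one
 * calloc'd row of cells each (calloc zero-initialises, so all
 * cells start dead). */
int **alloc_board(int rows, int cols)
{
    int **board = malloc(rows * sizeof(int *));
    for (int i = 0; i < rows; i++)
        board[i] = calloc(cols, sizeof(int));
    return board;
}

int main(void)
{
    int rows = 25, cols = 80;          /* the default 80x25 board */
    int **current = alloc_board(rows, cols);
    int **next    = alloc_board(rows, cols);
    /* ... run the game, exchanging current/next at each round ... */
    (void)current; (void)next;
    return 0;
}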
Regarding the execution of Game Of Life, we provide 2 modes for the overall
execution and 2 modes for the table dimensions. For the overall execution we
have the play and bench modes. In play mode, the current status of the game and
some accompanying information are displayed in the terminal. For a visual
representation of the execution to make sense, the table dimensions must stay
within reasonable limits; we set these limits to 100x50, so for larger table
sizes play mode is not allowed for the user. In bench mode we measure the
execution time using the gettimeofday() function. In this mode the status of
the table is not printed; instead, statistics on the executed run are presented
(dimensions, number of threads, execution time in us).
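A minimal sketch of such a measurement with gettimeofday(), reporting microseconds as bench mode does (the surrounding code is only indicative):

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    /* ... run the 100 rounds of the game ... */
    gettimeofday(&end, NULL);

    /* Elapsed time in microseconds, as printed in bench mode. */
    long elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L
                    + (end.tv_usec - start.tv_usec);
    printf("execution time: %ld us\n", elapsed_us);
    return 0;
}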
Regarding the table dimensions, we have the default and the non-default mode.
In default mode the dimensions are 80x25 and the initial state is given by an
input file of the same size. We have included 2 indicative input files, Shapes
and Random, where the first contains some recurring patterns and the second a
random initial state. In non-default mode the user can specify the dimensions
of the table, and the initial state is configured randomly using the rand()
function.
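A sketch of such a random initialization with rand() (the helper name is hypothetical; the peripheral cells are left dead, in line with the boundary rule described earlier):

#include <stdlib.h>
#include <time.h>

/* Fill the interior of the board with a random initial state: each
 * inner cell becomes alive or dead with equal probability. */
void randomize_board(int **board, int rows, int cols)
{
    /* Seed the generator; in the real code this happens once at startup. */
    srand((unsigned)time(NULL));
    for (int i = 1; i < rows - 1; i++)
        for (int j = 1; j < cols - 1; j++)
            board[i][j] = rand() % 2;
}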
Let us now look at some basic functions that have been implemented and are
common to both implementations; later we will see how the 2 codes differ.
At the beginning of each round the main play function is executed, with
arguments the pointers to the 2 tables (current & next) and the limits of the
piece of the table to which the rules of the game will be applied. Play()
therefore applies the rules and updates the values in the table pointed to by
the next pointer. Inside play there is the appropriate check so that the
peripheral elements remain unaffected. After play has run, a barrier_wait()
follows, so that all the threads have applied the rules and updated the values
in the next-state table. Then the thread with ID == 0 undertakes to swap the
pointers, so that current in the next round points to the next of the current
round, and therefore the updated values of the current round become the current
values of the next round. The thread with ID == 0 also undertakes some printing
on the console, if we are in play mode, to present the status of the table in
the current round. After the swap of the pointers, another barrier follows, to
ensure that no thread proceeds to the next round before the exchange of the
pointers has completed.
At the end of 100 rounds the thread finishes its processing and returns;
condensed, the loop looks like the sketch below.
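This is a condensed, illustrative sketch of the per-thread loop described above (the struct and variable names are assumptions, not the exact code of gol_threads.c):

#include <pthread.h>

/* play() applies the rules to the rows [first_row, last_row) of the
 * board; its definition lives in the source file. */
void play(int **cur, int **nxt, int first_row, int last_row);

int **current, **next;           /* shared current/next state tables */
pthread_barrier_t barrier;       /* initialised with the thread count */

typedef struct { int id, first_row, last_row; } thread_args;

void *thread_main(void *arg)
{
    thread_args *t = arg;

    for (int round = 0; round < 100; round++) {
        /* Apply the rules to this thread's piece of the table. */
        play(current, next, t->first_row, t->last_row);

        /* Wait until every thread has filled in the next-state table. */
        pthread_barrier_wait(&barrier);

        if (t->id == 0) {        /* only thread 0 swaps the pointers */
            int **tmp = current;
            current = next;
            next = tmp;
            /* (in play mode, thread 0 also prints the table here) */
        }

        /* No thread starts the next round before the swap completes. */
        pthread_barrier_wait(&barrier);
    }
    return NULL;
}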
In the OpenMP implementation the differences are not large. In main, once we
are done with the checks and initializations, we define a parallel region with
the appropriate OpenMP directive. We declare as shared variables the pointers
to the 2 tables as well as the number of threads, and as private variables the
thread ID, the processing limits and an auxiliary variable. Also, through the
num_threads() clause we specify with how many threads we want to execute the
parallel region.
From then on the philosophy is the same as with PThreads. Each thread sets its
limits based on its ID (as mentioned above) and starts the main operation for
100 rounds. In each round, after the execution of play we have a barrier (the
corresponding directive) for synchronization. Here too the thread with ID == 0
undertakes to swap the pointers and to display the status of the table if we
are in play mode. At the end of the round another barrier follows, here too to
make sure the swap is done before we move on to the next round.
After the end of the 100 rounds we close the parallel region and display the
corresponding messages before the program ends. A condensed sketch of this
structure follows.
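Condensed, the OpenMP structure described above looks roughly like the following sketch (the row-partition arithmetic and the names are illustrative assumptions, not the exact code of gol_openmp.c):

#include <omp.h>

void play(int **cur, int **nxt, int first_row, int last_row);

void run_game(int **current, int **next, int rows, int num_threads)
{
    int id, first_row, last_row;
    int **tmp;

    #pragma omp parallel num_threads(num_threads) \
        shared(current, next) private(id, first_row, last_row, tmp)
    {
        /* Each thread derives its processing limits from its ID;
         * the interior rows 1..rows-2 are split evenly, with the
         * last thread taking any remainder. */
        id = omp_get_thread_num();
        int chunk = (rows - 2) / num_threads;
        first_row = 1 + id * chunk;
        last_row  = (id == num_threads - 1) ? rows - 1
                                            : first_row + chunk;

        for (int round = 0; round < 100; round++) {
            play(current, next, first_row, last_row);

            #pragma omp barrier   /* all threads finished the round */

            if (id == 0) {        /* thread 0 swaps the pointers */
                tmp = current;
                current = next;
                next = tmp;
            }

            #pragma omp barrier   /* swap done before the next round */
        }
    }
}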
Part 2 - Measurements
1. Execution Systems
              Our system                  Xanthos
CPU Model     Intel Core 2 Duo T6400      AMD Opteron(tm) Processor 2210 HE
Frequency     2.00 GHz                    1.80 GHz
#Cores        2                           4
Memory        2.0 GB                      2.0 GB
2. Results - Commentary
First let us look at some measurements of the PThreads implementation for
various table sizes. We observe that, once the number of threads exceeds the
number of cores, the execution time, instead of decreasing, increases. This
makes sense because the overhead of switching threads on the cores is an order
of magnitude greater than the execution time of each thread's own piece. Also,
for small table sizes we see that the improvement over the serial version is
minimal, whereas for large table sizes the improvement grows significantly as
the number of threads increases. We also notice that there is an upper limit on
the improvement we can achieve: beyond that limit, no matter how much we
increase the threads, the execution time stays the same or even grows, as we
see for the small table sizes.
It is characteristic that we obtain the optimal time, in both systems, when the
number of threads equals the number that can run simultaneously on each system
(2 for ours and 4 for Xanthos respectively).
The above concerned the PThread implementation; let us now look at the
corresponding graphs for the OpenMP implementation. The tables with the numbers
are not included in the report, but can be found in the attached Excel sheet
(Measures.xls).
We take 2 fixed values for the height, 50 and 1000, and for each we take
measurements for different width values. We first observe that as the table
sizes increase, the execution time increases linearly.
A closer look at the graph data shows that for a large height (rows) the
slope of the increase is much steeper than for small height values. This
confirms what we said about the smaller overhead of transferring data from
memory when it is accessed along columns. For this reason it is more efficient
for the large dimension of the table to always be the width (columns) and the
small one the height (rows).
Annex:
Attached files:
• gol_threads.c – Pthread implementation source file
• gol_openmp.c – OpenMP implementation source file
• Makefile – Makefile to compile the sources
• Random – Random input board state file
• Shapes – Pattern input board state file
• Measures.xls – Excel sheet with the measurements
• Readme – Short txt file with details of the executable's parameters
• GOLReport.pdf – Project report