
Parallel Implementation of John Conway’s Game of Life

Behtsoudis Anestis - [email protected]

Abstract:

In the present work, parallel implementations of John Conway's Game of Life cellular automaton were developed for the shared-memory model. The code for the two parallel versions was written in C, using the POSIX Threads (Pthreads) and OpenMP APIs respectively. In order to evaluate the performance and the improvement of the implementations, execution-time measurements were taken on two systems: a personal laptop and the system provided to us by HPCLab (Xanthos). In the first part we present how we approached the parallelization of the cellular automaton and give the technical details of the two implementations. In the second part we present the measurements taken on the two systems and comment on the performance with the help of some auxiliary graphs.

Game Of Life Rules:

In the two-dimensional version of the cellular automaton we have an MxN array, where each position in the array is a cell. Each cell has 2 possible states, "dead" or "alive". During the initialization phase each cell is given an initial state and execution begins. In each round every cell checks the state of its 8 neighbors and, based on a set of rules, determines the state it will move to. The rules are:

• Every living cell with fewer than 2 living neighbors dies.
• Every living cell with more than 3 living neighbors dies.
• Every living cell with 2 or 3 living neighbors stays alive.
• Every dead cell with exactly 3 living neighbors comes to life.
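
For illustration, a minimal sketch in C of how these rules translate into a next-state decision for a single cell; the function name next_state and the convention that 1 means "alive" and 0 means "dead" are assumptions of this sketch, not taken from the accompanying sources.

/* Sketch: next state of one cell, given its current state and the
 * number of its living neighbors (1 = alive, 0 = dead). */
int next_state(int alive, int living_neighbors)
{
    if (alive) {
        /* A living cell survives only with 2 or 3 living neighbors. */
        return (living_neighbors == 2 || living_neighbors == 3);
    }
    /* A dead cell comes to life with exactly 3 living neighbors. */
    return (living_neighbors == 3);
}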

Part 1 - Implementation Details

1. Parallel implementation of Game of Life

• Separation of Parallel Areas

Game Of Life is an inherently data-parallel problem, where the next state of each element depends on the current state of its neighboring elements. So we have a table of dependent elements and processing that needs to be done on them. In the multi-threaded shared-memory model things are relatively simple: we have at our disposal a number of threads running pseudo-parallel, to each of which we can assign the processing of a part of the table. The problem is how to optimally partition the table. A first reasonable thought would be to simply divide the original table into equal sections, or as close to equal as possible when the dimensions do not divide exactly. Schematically, something like this:

The above separation at first glance seems quite correct and optimal, since it aims at an even distribution of work among the processing threads, which should give the best possible result. But a closer, more technical look shows that this is not the case.
We know that in all modern computer systems main memory is implemented with SDRAM technology. From the SDRAM implementation we know that it is organized in banks (usually 4), where access to each bank is given by providing a row and a column address. The bank selection comes from separate pins on the memory chip, while the row and column selection share the same pins. That is, we have a multiplexed address where the row address is sent first and then the column address. In a burst mode, where we want to read many memory locations in succession, it is advantageous to have them distributed along columns, because that way we have the smallest possible overhead. For more details refer to a related article on the implementation and access of SDRAM memories. So if we lay out the table by columns and divide the sections by rows, so that each section is accessed column-wise, we will have the best possible performance for the shared-memory model of the threads. The separation is then done as in the following image:

The above way of separating the parallel sections is followed in both the PThreads and the OpenMP implementation. The boundaries are defined as follows: as we know, each thread has a unique integer ID, with numbering starting from 0. So the part of the table that each thread processes results from the following relations:

Start = ID * height / NThreads     (top of the section)
Finish = Start + height / NThreads     (end of the section)

where NThreads is the number of threads and height is the height (number of rows) of the table.
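
In code, each thread could compute its boundaries roughly as follows; the variable names follow the relations above, while the handling of a non-exact division by giving the remaining rows to the last thread is an assumption of this sketch.

/* Sketch: each thread derives the rows of its section from its ID. */
void compute_bounds(int id, int height, int nthreads,
                    int *start, int *finish)
{
    *start  = id * height / nthreads;       /* top of the section */
    *finish = *start + height / nthreads;   /* end of the section */

    /* When height does not divide exactly, the last thread can
     * absorb the remaining rows (assumption of this sketch). */
    if (id == nthreads - 1)
        *finish = height;
}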

Finally, because we do not want the outline of the table (the peripheral cells) to participate in the game (their state remains the same as the initial one for the entire duration of the execution), we make the corresponding checks in our code (the points are obvious from the comments in the source files).

• Synchronization

The threads are scheduled by the operating system according to the policy it implements, and we cannot know the order in which the threads will be executed. For this reason we must ensure that the threads are synchronized, since some of the data managed by each thread, specifically the elements adjacent to the elements of the neighboring thread, create data dependencies between threads. To achieve this synchronization we use the barrier functions contained in the <pthread.h> library. PThread barriers work with the following logic: when a thread calls the pthread_barrier_wait() function, it suspends its execution until the number of threads that have also called pthread_barrier_wait() reaches the limit set during the initialization of the barrier. So it is reasonable to place a barrier at the end of each round, to ensure that no thread proceeds to the next round before all the threads have finished the current one. This way we avoid inconsistencies in our data and therefore have valid computations during processing.
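
A minimal sketch of this barrier usage, assuming a single global barrier that main() initializes with the total number of threads as its limit:

#include <pthread.h>

pthread_barrier_t barrier;   /* shared by all worker threads */

/* In main(), before creating the threads: the limit equals the
 * total number of threads that must reach the barrier. */
void setup_barrier(int nthreads)
{
    pthread_barrier_init(&barrier, NULL, nthreads);
}

/* Inside each thread, at the end of a round: block until every
 * thread has also called pthread_barrier_wait(). */
void end_of_round(void)
{
    pthread_barrier_wait(&barrier);
}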

2. Implementation Details

Our implementation is fully dynamic, both in terms of the table dimensions and the number of threads, while at the same time there are some default values (80x25 and 1 thread) if no values are set by the user. So in main, once we reach the point where the table dimensions are known, we allocate memory (malloc()) for 2 two-dimensional tables. We need 2 tables so that one contains the current state and the other stores the new state produced for each cell in each round. At the end of each round it would then be enough to copy the contents of the second table (next state) into the first (current state).
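
A minimal sketch of the allocation of one such two-dimensional table with malloc(); the helper name allocate_board is an assumption of this sketch, and the same allocation would be performed twice, once for each table.

#include <stdlib.h>

/* Sketch: allocate a height x width table of int cells. */
int **allocate_board(int height, int width)
{
    int **board = malloc(height * sizeof(int *));
    for (int i = 0; i < height; i++)
        board[i] = malloc(width * sizeof(int));
    return board;
}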

To avoid all this unnecessary reading and writing, we keep 2 pointers (current and next) that initially point to the 2 tables as we defined them, and at the end of each round we swap them in order to have the new values. This way we avoid the large overhead of accessing and copying so much data.
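
The swap at the end of a round can be sketched as follows; the helper name swap_boards is an assumption of this sketch.

int **current;   /* state read during the round    */
int **next;      /* state written during the round */

/* At the end of each round: instead of copying next into current,
 * simply exchange the two pointers. */
void swap_boards(void)
{
    int **tmp = current;
    current   = next;
    next      = tmp;
}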

• Operation Modes (Play – Bench) (Default – Non-Default)

Regarding the execution of Game Of Life, we have defined 2 modes for the overall execution and 2 modes for the type of table dimensions. For the overall execution we have the play and the bench mode. In play mode, the current state of the game and some accompanying information are displayed in the terminal. For a visual representation of the execution to make sense, the table dimensions must be within some reasonable limits; we set these limits to 100x50, so for larger table sizes the play mode is not allowed. In bench mode the execution time is measured using the gettimeofday() function. In this mode the state of the table is not printed; instead, statistics on the execution are presented (dimensions, number of threads, execution time in us).

Regarding the table dimensions, we have the default and the non-default mode. In default mode the dimensions are 80x25 and the initial state is given by an input file of the same size. We provide 2 indicative input files, Shapes and Random, where the first contains some recurring patterns and the second a random initial state. In non-default mode the user can specify the dimensions of the table, and the initial state is configured randomly using the rand() function.

To select the modes there are 2 flag variables, bench_flag and dflag, for the two mode choices mentioned respectively.
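
In bench mode the time measurement with gettimeofday() can be sketched roughly as follows; the wrapper function bench_run is an assumption of this sketch, not the structure of the actual sources.

#include <stdio.h>
#include <sys/time.h>

/* Sketch: measure the elapsed time of a run in microseconds. */
void bench_run(void (*run_game)(void))
{
    struct timeval t_start, t_end;

    gettimeofday(&t_start, NULL);
    run_game();                      /* the 100 rounds of the game */
    gettimeofday(&t_end, NULL);

    long elapsed_us = (t_end.tv_sec  - t_start.tv_sec) * 1000000L
                    + (t_end.tv_usec - t_start.tv_usec);
    printf("execution time: %ld us\n", elapsed_us);
}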

• Common Functions in PThread and OpenMP

Let us now look at some basic functions that have been implemented and are common to both implementations; later we will see how the two codes differ.

The functions are listed below with their arguments, return type and operation:

• initialize_board(int **current, int dflag) – void: If the default flag is 1, initializes the table with 0, otherwise with random 0/1 values.
• read_file(int **current, char *filename) – void: Reads the initial state from a file.
• copy_region(int **current, int **next) – void: Copies the periphery into the second table as well.
• adjacent_to(int **current, int i, int j) – int: Returns the number of living neighbors of the cell of current at position (i, j).
• play(int **current, int **next, int start, int finish) – void: Applies the rules of the game and updates the values in the next-state table (pointer next).
• print(int **current) – void: Prints the state of the current table.
• arg_check(int argc, char *argv[]) – int: Checks the parameters with which the executable was called. Returns 0 on a successful check, otherwise a non-zero int.
• print_help(void) – void: Shows the possible call parameters and their interpretation.
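
As an illustration, a sketch of how adjacent_to() could count the living neighbors of the cell at (i, j); the signature matches the list above, but the body is an assumption of this sketch rather than the actual source.

/* Sketch: count the living neighbors of the cell at (i, j).
 * Peripheral cells are never passed in, so all eight neighbors exist. */
int adjacent_to(int **current, int i, int j)
{
    int count = 0;

    for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++) {
            if (di == 0 && dj == 0)
                continue;                    /* skip the cell itself */
            count += current[i + di][j + dj];
        }

    return count;
}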

2.1 PThreads Implementation



In main, once the number of threads is known (default = 1 or user-specified), the barrier is first initialized with a limit equal to the total number of threads. This is because we want all the threads that will be assigned a part of the table to participate in the synchronization mentioned above. Then the threads are created with entry_function() as their entry function. At the beginning of the entry function, the boundaries of the section that each thread will process are determined in the way described above. Then we have a main loop of 100 repetitions, which corresponds to the 100 rounds of Game Of Life.

At the beginning of each round the main play function is executed, with arguments the pointers to the 2 tables (current & next) and the limits of the section of the table to which the rules of the game will be applied. play() therefore applies the rules and updates the values in the table pointed to by the next pointer. Inside play there is the appropriate check so that the peripheral elements remain unaffected. After play is executed, a pthread_barrier_wait() follows, so that all the threads have applied the rules and updated the values in the next-state table. Then the thread with ID == 0 swaps the pointers, so that current in the next round points to the next of the current round, and therefore the updated values of the current round become the current values in the next round. The thread with ID == 0 also prints to the console if we are in play mode, in order to present the state of the table for the current round. After the swap of the pointers, another barrier follows so that we can ensure that no thread proceeds to the next round before the exchange of the pointers has been completed.

At the end of the 100 rounds the thread finishes its processing and returns.
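
Putting the above together, the worker thread could look roughly like the following sketch; the names entry_function, play, print and the two barriers follow the description above, while details such as passing the ID through the thread argument and the play_mode flag are assumptions of this sketch.

#include <pthread.h>

#define ROUNDS 100

extern int **current, **next;        /* the two state tables        */
extern int height, nthreads;         /* table height, thread count  */
extern int play_mode;                /* 1 if we are in play mode    */
extern pthread_barrier_t barrier;    /* initialized in main()       */

void play(int **cur, int **nxt, int start, int finish);
void print(int **cur);

void *entry_function(void *arg)
{
    int id     = *(int *)arg;                 /* unique thread ID   */
    int start  = id * height / nthreads;      /* top of its section */
    int finish = start + height / nthreads;   /* end of its section */

    for (int round = 0; round < ROUNDS; round++) {
        /* Apply the rules to this thread's section of the table. */
        play(current, next, start, finish);

        /* Wait until every thread has updated the next-state table. */
        pthread_barrier_wait(&barrier);

        if (id == 0) {
            /* Only thread 0 swaps the pointers and prints in play mode. */
            int **tmp = current;
            current = next;
            next = tmp;
            if (play_mode)
                print(current);
        }

        /* Make sure the swap has completed before the next round. */
        pthread_barrier_wait(&barrier);
    }
    return NULL;
}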

2.2 OpenMP Implementation

In the OpenMP implementation the differences are not so big. In main, when we are done with the checks and the initializations, we define a parallel region with the appropriate OpenMP directive. We define as shared variables the pointers to the 2 tables as well as the number of threads, and as private variables the thread ID, the processing limits and an auxiliary variable. Also, through the num_threads() clause we define with how many threads we want to execute the parallel region.

The philosophy is then the same as that of the PThreads version. Each thread sets its limits based on its ID (as mentioned above) and starts the main loop of 100 rounds. In each round, after the execution of play, we have a barrier (the corresponding directive) for synchronization. Here too the thread with ID == 0 swaps the pointers and displays the state of the table if we are in play mode. At the end of the round another barrier follows, here too to make sure that the swap has been done before we move to the next round. After the end of the 100 rounds we close the parallel region and display the corresponding messages before the program terminates.
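
A corresponding sketch of the OpenMP parallel region; the shared/private clauses follow the description above, while the exact variable names and helper functions are assumptions of this sketch.

#include <omp.h>

#define ROUNDS 100

extern int **current, **next;   /* the two state tables     */
extern int height;              /* number of rows           */
extern int play_mode;           /* 1 if we are in play mode */

void play(int **cur, int **nxt, int start, int finish);
void print(int **cur);

void game_of_life_omp(int nthreads)
{
    int id, start, finish, round;
    int **tmp;                  /* auxiliary variable for the swap */

    #pragma omp parallel num_threads(nthreads) default(none) \
            shared(current, next, height, nthreads, play_mode) \
            private(id, start, finish, round, tmp)
    {
        id     = omp_get_thread_num();
        start  = id * height / nthreads;
        finish = start + height / nthreads;

        for (round = 0; round < ROUNDS; round++) {
            play(current, next, start, finish);

            #pragma omp barrier           /* all sections computed     */

            if (id == 0) {                /* thread 0 swaps and prints */
                tmp = current;
                current = next;
                next = tmp;
                if (play_mode)
                    print(current);
            }

            #pragma omp barrier           /* swap visible to everyone  */
        }
    }
}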

Part 2 - Measurements

1. Execution Systems

As mentioned above, the measurements were performed on 2 systems with different architectures, in order to get a more complete picture of the behavior of the parallel versions. First, let's look at the basic characteristics of the 2 systems.

Characteristic      My System                  Xanthos
CPU Model           Intel Core 2 Duo T6400     AMD Opteron(tm) Processor 2210 HE
Frequency           2.00 GHz                   1.80 GHz
#Cores              2                          4
L2 Cache / Core     2048 KB                    1024 KB
#Threads            2                          4
Main Memory         3.00 GB                    3473 MB
Swap                20 GB                      20 GB

2. Results - Commentary

Detailed tables can be found in the accompanying Excel file. Here we will list only the parts necessary for the commentary we want to make.

First, let's look at some measurements for the PThreads implementation for various table sizes:

PThreads 80x25 – Random Input


Nthreads    My System Time (us)    Xanthos Time (us)    Diff My-Xanthos (us)
1           21348                  60926                -39578
2           14379                  40243                -25864
3           18402                  31597                -13195
4           18757                  43582                -24825
5           22359                  42599                -20240
6           22856                  42963                -20107
7           30019                  51860                -21841
8           30170                  55658                -25488

PThreads 200x50 – Random Input


Nthreads    My System Time (us)    Xanthos Time (us)    Diff My-Xanthos (us)
1           109006                 319621               -210615
2           59771                  111583               -51812
3           77924                  91070                -13146
4           67249                  128418               -61169
5           77135                  103729               -26594
6           69014                  101019               -32005
7           76460                  104758               -28298
8           74866                  116670               -41804

PThreads 2000x500 – Random Input


Nthreads    My System Time (us)    Xanthos Time (us)    Diff My-Xanthos (us)
1           10477616               19892058             -9414442
2           5397030                10298023             -4900993
3           5698842                6975674              -1276832
4           5670513                5481174              189339
5           5637657                7562911              -1925254
6           5545523                7050784              -1505261
7           5636890                6422875              -785985
8           5615205                5900633              -285428

PThreads 2000x2000 – Random Input


Nthreads    My System Time (us)    Xanthos Time (us)    Diff My-Xanthos (us)
1           41790751               78177419             -36386668
2           21575731               39150984             -17575253
3           22081852               26587961             -4506109
4           21830961               20241626             1589335
5           21619917               27239080             -5619163
6           21454325               23992932             -2538607
7           21558108               23212022             -1653914
8           21597983               20926997             670986

We observe that for small table sizes there is a limit to the improvement as we increase the number of threads. Beyond this limit the execution time increases instead of decreasing. This makes sense, because the overhead of switching the threads on the cores is an order of magnitude greater than the execution time of each thread's own piece of work. Also, for small table sizes we see that the improvement over the serial execution is minimal. In contrast, for large table sizes the improvement is significantly greater as the number of threads increases. We also notice that there is an upper limit to the improvement we can achieve: beyond that limit, no matter how much we increase the threads, the execution time remains the same or even grows, as we see for the small table sizes.

It is characteristic that we obtain the optimal time in both systems when the number of threads equals the number that can run simultaneously on each system (2 for ours and 4 for Xanthos respectively).

The above concerned the PThreads implementation; let's now look at the corresponding graphs for the OpenMP implementation. The tables with the numbers were not included in the report, but can be found in the accompanying Excel file.

Similar observations to those for PThreads can be made for OpenMP. We see that for small table sizes there is an improvement limit as we increase the number of threads; beyond this limit, if we increase the number of threads, the execution time increases instead of decreasing, for the reason mentioned above. Conversely, for large table sizes, after a certain number of threads the execution time tends to become constant as we increase the number of threads. Of course, here too, if we exceed a reasonable number of threads, the overhead of the switching will again be very large and the total execution time will increase. We also notice that in the PThreads implementation our 2-core system always presented better performance, while for large sizes the execution times of the 2 systems converge. In contrast, in the OpenMP implementation for large sizes the superiority of the Xanthos system is evident, which leads us to the conclusion that it makes better use of the Xanthos architecture.

In the next graph we see the efficiency of the PThreads implementation against the OpenMP one for our system:

So we see that the PThreads implementation performs better than OpenMP, which is completely justified since OpenMP is a higher-level API for implementing parallel applications, compared to Pthreads which is a lower-level approach. Of course, it is worth noting here the platform independence of OpenMP in relation to Posix Threads.

Let's now look at some measurements in terms of table sizes. We take 2 fixed values for the height, 50 and 1000, and for these we take measurements for different width values. We first observe that as the table sizes increase, the execution time increases linearly.

Looking more closely at the graph data, we see that for a large height (rows) we have a much steeper increase compared to small height values. This confirms what we said about the smaller overhead of transferring data from memory when it is accessed by columns. For this reason it is more efficient for the large dimension of the table to always be the width (columns) and the small one the height (rows).

Annex:
Accompanying files:
• gol_threads.c – Source file of the Pthreads implementation
• gol_openmp.c – Source file of the OpenMP implementation
• Makefile – Makefile to compile the sources
• Random – Random input board state file
• Shapes – Pattern input board state file
• Measures.xls – Excel sheet with the measurements
• Readme – Short txt file with details of the executable parameters
• GOLReport.pdf – Project report
