Project Via Flowcontrol
1 DESIGN DOCUMENT
System overview
performed by an OS kernel and device driver. The ParamNet-II NIC fully ensures the
reliability of communication between connected VIs.
Virtual Interface.
A Virtual Interface is the mechanism that allows a VI Consumer to directly
access a VI Provider to perform data transfer operations. A VI consists of a pair of
Receive/Transmit work queues.
Completion queue.
1. A means to select among completions from several VIs.
2. The one-to-one connection model of the VI Architecture requires a large
number of VIs per user to achieve full connectivity between nodes in a
cluster. Polling or taking interrupts from each individual VI does not scale as
the number of VIs increases. Completion queues alleviate this problem by
providing a single notification point for an arbitrary number of work queues,
as the sketch below shows.
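A minimal sketch of that single notification point, assuming the VIPL calls used elsewhere in this document (VipCQDone, VipRecvDone, VipSendDone); error handling is elided:

void poll_cq_once(VIP_CQ_HANDLE cq)
{
    VIP_VI_HANDLE vi;
    VIP_BOOLEAN onrecvq;
    VIP_DESCRIPTOR *desc;

    /* one poll point for all VIs: the CQ names the VI that completed */
    while (VipCQDone(cq, &vi, &onrecvq) == VIP_SUCCESS) {
        if (onrecvq)
            VipRecvDone(vi, &desc);   /* dequeue the completed receive */
        else
            VipSendDone(vi, &desc);   /* dequeue the completed send */
        /* hand the descriptor back to the consumer here */
    }
}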
VI Provider
1. The VI Provider is the set of hardware and software components responsible for
instantiating a Virtual Interface. The VI Provider consists of the ParamNet-II
NIC, the Alteon Acenic, and the device driver.
However, the Alteon Acenic NIC is a VI-emulated device and does not have any
native hardware support for VI.
VI Consumer
1. The VI Consumer component consists of those objects that utilize the
resources of the VI Provider. The Consumer can either be a user application
that uses native VI primitives to conduct communication, or other
communication protocols (e.g. TCP/IP, Sockets, MPI) and OS communication
services layered over the VI Provider methods.
2. Integral to the VI Consumer is a VI user agent that is typically in the form of a
library, known as the Virtual Interface Provider Library (VIPL), linked into the
consumer application. The user agent abstracts the hardware and OS interface
details of different VI Provider implementations to provide a consistent
interface to the programmer. Legacy protocol stacks might employ the user agent
through native VI calls at their lower edge.
1.0 Acronyms, terms and definitions used in the CMPI module
CCP-III
Kernel Agent
VI Provider
The combination of VI NIC and a VI Kernel Agent.
VI Consumer
A software process that consists of an application program, an operating
system communication facility, and a VI User Agent.
CRE
Sun Cluster Runtime Environment (CRE) is a program
execution environment that provides basic job launching and
load balancing capabilities.
6.1 VIA
Constraints: The number of VIs supported and the total memory a user can
register depend on the capabilities of the NIC, as illustrated below.
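For illustration, a hedged sketch of registering one buffer against that limit, using the VIPL memory registration call VipRegisterMem with the protection tag created in do_setup later in this document; the buffer size and attribute values are illustrative:

#include <stdlib.h>

/* Returns 1 on success, 0 when the NIC's registration capability
 * (or any other VIPL error) prevents registering the buffer. */
int register_one_buffer(size_t len, void **bufp, VIP_MEM_HANDLE *mh)
{
    VIP_MEM_ATTRIBUTES attrs;

    *bufp = malloc(len);
    attrs.Ptag            = viadev.ptag;  /* protection tag from VipCreatePtag */
    attrs.EnableRdmaWrite = VIP_FALSE;
    attrs.EnableRdmaRead  = VIP_FALSE;

    return VipRegisterMem(viadev.nic, *bufp, (VIP_ULONG)len,
                          &attrs, mh) == VIP_SUCCESS;
}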
6.2 C-MPI
Software assumptions: MPI assumes the availability of a stable VIPL conforming to the VIA
specification.
4.0 Risks
Refer to the Risk Management Plan.
Hardware
1. IBM p630/p690 Power4 CPU based nodes
2. PARAMNet-II & Gigabit Ethernet (Alteon Acenic Tigon-II)
Data Communications
Software
Platform - P630/AIX5L.
Compiler - C compiler / Fortran compiler.
Microcode - Protocol Controller and Receiver microcode for the ParamNet-II NIC and
the Alteon Acenic Tigon-II.
Library - VIPL.
10.2 C-MPI Architecture.
[Architecture diagram: the MPI Application sits above a Plugin layer and the Progress Engine; below the engine are the transport modules TCP, SHMEM, and VIA. The TCP module is layered over the TCP stack and the VIA module over VIPL.]
10.3 C-MPI ADI architecture.
MPI Applications
This is the layer where users interact with the system. It consists of the user's MPI
application programs.
Point-to-Point Communication
The basic communication mechanisms of MPI are classified as point-to-point
communication and collective communication. In point-to-point communication the
transmittal of data takes place between a pair of processes, one side sending, the other
receiving. These calls are further classified as blocking and non-blocking, as the
fragment below illustrates.
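For illustration, a minimal fragment contrasting the two styles with standard MPI calls (not specific to this design):

#include <mpi.h>

void p2p_styles(int rank)
{
    int msg = 42, buf;
    MPI_Request rq;
    MPI_Status st;

    if (rank == 0) {
        /* blocking: returns only when the send buffer may be reused */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* non-blocking: post the receive, overlap work, then complete */
        MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &rq);
        /* ...computation may proceed here... */
        MPI_Wait(&rq, &st);
    }
}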
Collective Communication
Collective communications transmit data among all processes in a group specified by an
intracommunicator object. One function, the barrier, serves to synchronize processes without
passing data.
VI User Agent
A software component that enables an operating system communication facility to utilize
a particular VI Provider. The VI User Agent abstracts the details of the underlying VI NIC
hardware in accordance with an interface defined by the operating system communication
facility.
VI Kernel Agent
The Kernel Agent is a privileged part of the operating system, usually a driver, which
performs the setup and resource management functions needed to maintain a VI between
VI Consumers and VI NICs. These functions include the creation/destruction of VIs, VI
connection setup/teardown, management of the system memory used by the VI NIC, and
error handling.
Virtual Interface (VI)
An interface between a VI NIC and a process, allowing the VI NIC direct access to the
process's memory. A VI consists of a pair of Work Queues: one for send operations and one
for receive operations. The queues store a descriptor between the time it is posted and the
time it completes. A pair of VIs is associated using the connect operation to allow packets
sent at one VI to be received at the other.
Completion Queues
10.4 Flow chart for C-MPI - ADI
[Flow chart reconstructed as text from the figure.]
Top level: from START, branch on the request type: (1) transport operation, (2) initialize, or (3) finalize. A transport operation first checks that the module is initialized and then branches to (1) send (chart *, below) or (2) recv (chart +, below). Initialize and finalize check their own success; on failure the error is reported, and the flow ends at STOP.
Send (chart *): for a blocking call, if msg_len <= MAX_SIZE call mpip_via_fsend; otherwise post the request and call mpip_via_progress_blk. Any error is reported before STOP.
Recv (chart +): if the message is already on the unexpected queue, call mpip_via_urecv (blocking call) or mpip_via_uirecv (non-blocking, or when active requests are pending). Otherwise, for a blocking call with no active requests and msg_length <= MAX_SIZE call mpip_via_frecv; else post the request and call mpip_via_progress_blk (blocking) or mpip_via_progress_nblk (non-blocking). Depending on msg_length <= MAX_SIZE the data is received via eager_recv or long_recv. Any error is reported before STOP.
CONTROL FLOW FOR VIA MODULE
[Control-flow diagram: a USER TASK (MPI application program) calls the API (MPI Library, mpi/api, including collective communication); the API calls the PROTOCOL MODULE (mpi/engine/) with its VIA, TCP/IP, and SHARED MEMORY transports; the VIA transport uses the VIPL library (/usr/lib/libvipl.a).]
MPI_Init(&argc, &argv)
PMPI_Init(int *argc, char ***argv)
mpip_init(argc, argv)
initconns()
RTE_get_jtbl(&cid, &task.ti_jtbl, 0, -1)
makeconns(taskinfo_t *task)
mpip_rte_taskconns(task)
getconns(task, connif, connproto)
MPI_Send
int PMPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm
comm)
MPIP_Send(void *buf, size_t count, MPI_Datatype type, int dest, int tag,
MPI_Comm comm)
mpip_getproc(c, r)
mpip_send_self(void *buf, size_t count, MPI_Datatype type, int tag,
MPI_Comm comm, int mode)
mpip_req_set_sendready(r)
mpip_req_initlock(&req)
mpip_inc_nactvproto(proc->p_proto)
mpip_atominc(mpip_nactvreqs, 1);
mpip_enqsend(proc, &req);
mpip_progress_blk(&req)
mpip_pmname_progress_blk(MPI_Request rq,int pollall)
MPI_Recv
int PMPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag, MPI_Comm
comm, MPI_Status *status)
int MPIP_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
MPI_Comm comm, MPI_Status *status)
mpip_search_unexpected(src, tag, comm, TRUE)
mpip_getproc(comm, src)
mpip_recv_self(msg, buf, count, type, comm, status)
pt_urecv(process_t *, unexmsg_t *, char *, int, MPI_Datatype,
MPI_Comm, MPI_Status *)
mpip_pmname_urecv(process_t *, unexmsg_t *, char *, int,
MPI_Datatype, MPI_Comm, MPI_Status *)
mpip_inc_nactvproto(srcproc->p_proto)
mpip_atominc(mpip_nactvreqs, 1);
pt_postedreceive(srcproc)
mpip_pmname_postedreceive(srcproc)
mpip_postedrcvany(comm)
mpip_progress_blk(&req)
mpip_pmname_progress_blk(&req)
MPI_Wait
MPI_Finalize
PMPI_Finalize(void)
pt_closeconns(0, 0, 0)
mpip_pmname_closeconns(process_t **procs,int nprocs,void *pminfo)
All function names having "pmname" as a variable are mapped to the respective protocol
module by the function mpip_map_functions. For example, in the VIA protocol module
"pmname" is replaced by "via"; a sketch of the resulting mapping follows.
In the task_info structure, processes are stored according to the protocol they use, with
a unique array per protocol (e.g. via, tcp & shm). We therefore choose the processes from
the task structure that use VIA & build a new process table in VIA.
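A plausible sketch of that mapping for the VIA module, using the struct protocol function pointers defined in the data design section; the actual body of mpip_map_functions is not shown in this document:

static void map_via_functions(struct protocol *pt)
{
    /* each pt_* pointer is filled with the mpip_via_* entry point */
    pt->pt_fsend         = mpip_via_fsend;
    pt->pt_urecv         = mpip_via_urecv;
    pt->pt_progress_blk  = mpip_via_progress_blk;
    pt->pt_progress_nblk = mpip_via_progress_nblk;
    pt->pt_postedreceive = mpip_via_postedreceive;
    pt->pt_postedrcvany  = mpip_via_postedrcvany;
    pt->pt_addconns      = mpip_via_addconns;
    /* ...and so on for the remaining pt_* entries... */
}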
do_setup()
{
• Get the NodeID & node information structure from the process table and
cluster table of the task_info structure.
• Fill up the viadev structure with the rank & number of processes from the task
structure.
• VipOpenNic(VIADEV_DEVICE, &viadev.nic)
• VipQueryNic(viadev.nic, &nicattrs);
• VipCreateCQ(viadev.nic, VIADEV_CQ_SIZE, &viadev.cq) – create the
completion queue for receives
• VipCreateCQ(viadev.nic, VIADEV_CQ_SIZE, &viadev.scq) – create the
completion queue for sends
• VipCreatePtag(viadev.nic, &viadev.ptag)
• Allocate a pool of vbufs to work with and register the memory by calling
allocate_vbufs(viadev.nic, viadev.ptag, VIADEV_VBUF_POOL_SIZE).
This function allocates a chunk of vbufs, where
VIADEV_VBUF_POOL_SIZE = viadev.np*VIADEV_PREPOST_DEPTH +
VIADEV_VBUF_SPARES
so the pool size depends on the number of processes that use VIA. Connection
with another process is done through a single VI – one per process. We need a bit
of per-connection state in addition to the VI itself. This per-connection state is
kept in a viadev_connection_t structure.
• Create the VIs here, but they are not yet connected to other processes. This
is done below, in viadev_setup_communication.
c->remote_credit = VIADEV_PREPOST_DEPTH;
c->local_credit = 0;
c->preposts = VIADEV_PREPOST_DEPTH;
}
else
{
/* these should be ignored for self-communication */
c->vi = NULL;
c->remote_credit = 0;
c->local_credit = 0;
}
} end of for loop
• Set up the local & remote address lengths from the NIC attributes.
• Fill in the NIC address and connect.
• VipNSInit(viadev.nic, (VIP_PVOID)0)
• VipNSGetHostByName(viadev.nic, proc_table[viadev.me].hostname, local
_addr, 0)
• ESTABLISHING THE CONNECTION
for(i=0;i<mpip_nworld;i++)
{
if(t->ti_procs[i].pi_proto!=via_protocol && viadev.me!=i)
continue;
if(viadev.me == i) /* CONNECTWAIT_TIMEOUT */
{
for(j=i+1;j<mpip_nworld;j++)
{
if(t->ti_procs[j].pi_proto!=via_protocol)
continue;
remoteid=t->ti_procs[j].pi_proc->p_gid.pg_node;
remotenode=t->ti_ctbl.node[remoteid];
VipConnectRequest(viadev.connections[j].vi,local_addr,
remote_addr, VIP_INFINITE, &remote_attrs);
}
}
else if(viadev.me > i)
{
/* set up the discriminator length in the local address from the NIC
attributes */
VipConnectWait(viadev.nic, local_addr,
VIP_INFINITE, remote_addr, &remote_attrs, &conn) ;
VipConnectAccept(conn, viadev.connections[i].vi);
}
} end of for loop over i
} end of do_setup function
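The rank-ordered handshake above can be condensed into the following hedged sketch; address and discriminator setup are elided, and viadev comes from the data design section:

/* Sketch of the connection ordering in do_setup: every rank issues
 * VipConnectRequest toward higher ranks and accepts, via
 * VipConnectWait/VipConnectAccept, one connection from each lower
 * rank, so each VI pair is connected exactly once. */
void connect_all(int me, int np,
                 VIP_NET_ADDRESS *local_addr, VIP_NET_ADDRESS *remote_addr)
{
    VIP_VI_ATTRIBUTES remote_attrs;
    VIP_CONN_HANDLE conn;
    int peer;

    for (peer = 0; peer < np; peer++) {
        if (peer == me)
            continue;
        if (peer > me) {
            /* active side: ask the higher-ranked peer to connect */
            VipConnectRequest(viadev.connections[peer].vi, local_addr,
                              remote_addr, VIP_INFINITE, &remote_attrs);
        } else {
            /* passive side: wait for the lower-ranked peer, then accept
             * onto the VI reserved for it (the discriminator in the
             * local address identifies which peer is connecting) */
            VipConnectWait(viadev.nic, local_addr, VIP_INFINITE,
                           remote_addr, &remote_attrs, &conn);
            VipConnectAccept(conn, viadev.connections[peer].vi);
        }
    }
}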
int mpip_via_addconns(taskinfo_t *task,void *newproctbl)
{
Add connections to the internal structures.
For all task->ti_proto[via_protocol], check whether via_flag is set to
VIA_NOT_ATTACHED; if it is, then increment the count of new connections.
If the VIA process table itself is not yet created, then create it & make the connection;
otherwise expand the existing one.
}
} end of do_stop()
MPIP_Send(void *buf, size_t count, MPI_Datatype type,
int dest, int tag, MPI_Comm comm)
{
mpip_req_t req;
process_t *proc;
int rc;
Get the process from the remote group for intercommunication, else from the local
group, by calling mpip_getproc and passing the communicator as an argument.
If the returned process equals mpip_myproc, then call
mpip_send_self(buf, count, type, tag, comm, 0)
{
mpip_search_posted(comm, &env) – search the posted queue
copy data directly from the send buffer to the receive buffer using
mpip_typetype_copy(recvrq->r_buf, recvrq->r_count,
recvrq->r_type, buf, count, type, length)
set the done status in the MPIP_Req structure by calling mpip_req_done(&recvrq).
If there is no matching receive, pack the message and put it on the unexpected queue:
mpip_pack(packbuf, buf, count, type)
mpip_enqunexpected(comm, &env, packbuf, 0)
if (mode == MPIP_MODE_SYNC)
return mpip_progress_blk(&req)
}
else {
Fill up the request data structure as follows:
req.r_error = MPI_SUCCESS;
req.r_state = 0;
req.r_flags = MPIP_REQ_SEND;
req.r_tag = tag;
req.r_comm = comm;
req.r_peer = dest;
req.r_buf = buf;
req.r_count = count;
req.r_type = type;
req.r_length = type->d_size * count;
mpip_req_set_sendready(&req);
req.r_next = 0;
The request structure also contains the buffer to be sent, as a void pointer in the
request structure.
Increment the number of active requests by one in the protocol struct by calling
mpip_inc_nactvproto(p).
Increment mpip_nactvreqs (a global variable for the process) by one.
Enqueue the request in the send queue by calling mpip_enqsend(proc, &req),
which sets the sendhead & sendtail in the process struct.
Now call mpip_progress_blk(&req)
}
mpip_via_progress_nblk(int pollall){
This function attempts to progress all connections with requests, without blocking
on any one connection.
For all processes using VIA as a protocol (viaproctblsz – a global variable updated by
mpip_via_addconns)
{
Get the VIA process from the VIA proctable entry.
mpip_check_transport(process) – this function checks whether
process->p_flags is set to MPIP_PROC_TRANSPORTERR.
Otherwise receive the packets & process them by calling
rc = via_recv(viaproc_t *ap)
If rc == MPI_SUCCESS then set the progress flag.
Meanwhile, send packets, if any, from the process send head (p->p_sendhead) via
via_send_from_mpip_queue(process_t *destproc)
then return (err == MPI_SUCCESS) ? progress : err
}
via_recv(viaproc_t *ap)
{
rtn_status = MPID_VIA_DeviceCheck();
return rtn_status;
}
MPID_VIA_DeviceCheck()
{
• While there are entries in the receive completion queue, detected by calling
VipCQDone(viadev.cq, &vi, &onrecvq), call VipRecvDone(vi, (VIP_DESCRIPTOR
**)&desc) to dequeue the receive queue & get the descriptor.
• Get the data from the descriptor (desc->DS->Local.Data.Address) & store it in
the vbuf (expanded in the sketch below).
}
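A hedged expansion of MPID_VIA_DeviceCheck as just described; handle_incoming is a hypothetical dispatcher, and the data-segment access mirrors the desc->DS->Local.Data.Address usage above:

int device_check_sketch(void)
{
    VIP_VI_HANDLE vi;
    VIP_BOOLEAN onrecvq;
    VIP_DESCRIPTOR *desc;

    /* drain every completion currently queued on the receive CQ */
    while (VipCQDone(viadev.cq, &vi, &onrecvq) == VIP_SUCCESS) {
        if (!onrecvq)
            continue;                  /* send completions handled elsewhere */
        VipRecvDone(vi, &desc);        /* dequeue the completed receive */
        /* the payload lands in the vbuf registered in the descriptor */
        void *data = (void *)desc->DS->Local.Data.Address;
        handle_incoming(vi, data);     /* hypothetical dispatch into the engine */
    }
    return MPI_SUCCESS;
}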
int via_send_from_mpip_queue(process_t *destproc){
rq = destproc->p_sendhead
IF (
the requested message length + the size of the transfer type exceeds
MAX_DATA_SIZE
OR
viadev_eager_ok(rq->r_length + sizeof(txfr_type), &viadev.connections[remote proc
rank]) fails –
this checks whether the remote credits, after sending the buffer, would drop below
VIADEV_CREDIT_PRESERVE (=10)
)
(#define MAX_DATA_SIZE ((1024*16)-64-sizeof(viadev_packet_eager_start))
To handle the cancel reply message, set the envelope mode to CANCEL type like this:
env->ev_sndreq = rq->r_cancelreq
send the data using MPID_VIA_eager_send, then free the buffer.
Otherwise fill up the request struct.
Otherwise set the envelope mode depending on whether req->r_state is synchronous
or ready.
Send the data using
MPID_VIA_eager_send(&(tmp_buf->snd_buf), rq->r_length, rq->r_comm->c_rank,
rq->r_tag, rq->r_comm->c_sndctxt, destproc->p_gid.pg_wrank, env->ev_mode,
(reqaddr_t)rq, env->ev_rcvreq);
For a rendezvous send:
Set the envelope for rendezvous mode & fill up the other fields of the envelope, then call
send_req_long_data(rq->r_length, rq->r_comm->c_rank, rq->r_tag,
rq->r_comm->c_sndctxt, destproc->p_gid.pg_wrank, env->ev_mode, (reqaddr_t)rq);
After sending any of the above types, make the calls below for clearance:
rq->r_error = MPI_SUCCESS; – set the request's r_error field to MPI_SUCCESS
mpip_req_done(&rq); – set the status to done
mpip_atomdec(mpip_nactvreqs, 1); – decrement the number of active requests
mpip_dec_nactvproto(mpim_via_prototbl); – decrement the number of active
protocols
mpip_deqsend(destproc); – remove the message from the send queue
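Collecting the eager-versus-rendezvous test above into one predicate, a hedged sketch; the one-vbuf-per-message credit accounting is an assumption, and the real viadev_eager_ok may account for more:

#define VIADEV_CREDIT_PRESERVE 10   /* from the check described above */

/* A message goes eager only if it fits in one packet AND enough remote
 * credits would remain after the send; otherwise a long (rendezvous)
 * request is issued instead. */
static int eager_ok_sketch(mpip_req_t *rq, viadev_connection_t *c)
{
    int fits    = (rq->r_length + sizeof(txfr_type)) <= MAX_DATA_SIZE;
    /* assumption: one vbuf consumed per eager message */
    int credits = (c->remote_credit - 1) >= VIADEV_CREDIT_PRESERVE;
    return fits && credits;
}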
________________________()_________________________________
Get the vbuf buffer from the free pool by calling get_vbuf().
Structure of a vbuf (the BUFFER region, followed by the descriptor):
[ viadev_packet_header | viadev_packet_envelope | bytes_in_this_packet |
data to be transferred (buf from the API) ] [ VIP_DESCRIPTOR (64 bytes) ]
The packet header consists of the following fields:
typedef struct {
viadev_packet_type type; /* what type of packet is this? */
packet_sequence_t id; /* sequence number of packet. Always checked for sanity
* and later for reliability */
int vbuf_credit; /* piggybacked vbuf credit */
int source;
} viadev_packet_header;
typedef struct {
int context; /* context id -- identifies communicator */
int tag; /* MPI message tag. */
int64_t data_length; /* how many bytes in MPI message (not header). */
int src_lrank; /* to help receiver */
int packet_no; /* to help packet sequencing */
int mode ; /* Send or Recv mode */
reqaddr_t sreq;
reqaddr_t rreq;
} viadev_packet_envelope;
Fill in the envelope & header information, set the header type to
VIADEV_PACKET_EAGER_START.
Fill up the descriptor information in the vbuf (as shown above) by calling
vbuf_init_sendrecv(vbuf *v, unsigned long len)
________________________________()_________________________________
This function is the same as MPID_VIA_eager_send, but here we set the header type to
VIADEV_PACKET_LONG_REQUEST & we don't send the data; only a request is made
for the transfer.
MPI_Recv
int
PMPI_Recv(void *buf, int count, MPI_Datatype type,
int src, int tag, MPI_Comm comm, MPI_Status *status)
int
MPIP_Recv(void *buf, int count, MPI_Datatype type,
int src, int tag, MPI_Comm comm, MPI_Status *status)
{
check if (comm->c_unexhead), i.e. check whether the unexpected queue head is
present.
To search the unexpected queue list, call mpip_search_unexpected(src, tag,
comm, TRUE)
This function searches by taking src & tag as parameters; if they match, then it
returns the msg.
Then get the rank of the process to receive from out of the msg struct
(unexmsg_t).
Using the rank, get the process by calling srcproc = mpip_getproc(comm, rank)
if (srcproc == mpip_myproc)
then call mpip_recv_self(msg, buf, count, type, comm, status)
else
pt_urecv(srcproc, msg, buf, count, type, comm, status);
which is nothing but mpip_via_urecv
If the unexpected queue head is empty, then set up the receive request struct and call
mpip_req_set_recvhead(&req) to set the r_state field of the request struct to
MPIP_REQ_RECVHEAD
If ( rank >= 0)
{
if (srcproc != mpip_myproc) {
mpip_inc_nactvproto(srcproc->p_proto); – increment the active protocol count
mpip_atominc(mpip_nactvreqs, 1); – increment the active request count
srcproc->p_proto->pt_postedreceive(srcproc) – this calls
mpip_via_postedreceive
}
}
else
{
mpip_postedrcvany(comm)
Called when a receive is posted with a source of MPI_ANY_SOURCE. This
function handles MPI_ANY_SOURCE using two different methods. In the first method,
mpip_nrcvany is incremented and all the PMs that are part of the communicator are
instructed to turn on receives from everywhere. In the second method, the PMs are
instructed to turn on receives from all connections that are part of the communicator.
This function in turn calls pt_postedrcvany(comm) for all protocols that are part of that
communicator, that is, mpip_via_postedrcvany(comm)
} end of MPIP_Recv
________________________________()__________________________
mpip_via_postedreceive(process_t *p){
Update the polling state in response to the API posting a receive.
p->p_nrecvs++;
}
}
mpip_via_postedrcvany(MPI_Comm comm)
{
____________________________________()___________________________
if(mpip_mode_test_rts(env->ev_mode))
{
Allocate a new txfr_type buffer for receiving the data & set the envelope of
that buffer.
Depending on whether the data type is contiguous or otherwise, set up the send buffer
of the txfr_type buffer.
Now we send the reply for the long data message by calling
send_reply_for_long_data(int src_lrank, int dest_rank, viadev_connection_t
*c, int context_id, int mode, reqaddr_t sreq, reqaddr_t rreq)
This function sets the packet header type to
VIADEV_PACKET_LONG_REPLY
& sends the ack msg to req.r_peer by using
viadev_post_send(c, v); – which calls VipPostSend
rc = viadev_process_send(c->vi); – VipCQDone & VipSendDone
mpip_via_postedreceive(srcproc);
mpip_via_progress_blk(&req, 1);
Check the status & assign the status field from the request struct returned by the
progress_blk call.
int mpip_via_fsend(process_t *dest, char *buf, size_t count, MPI_Datatype dtype, int tag,
MPI_Comm comm, int mode)
{
Do a fast send of data. Do not return until the send completes. Note that when fsend is
called, no active request is outstanding.
• Check whether the group is the local group or the remote group and,
depending on that, assign the group from the comm struct.
• Among all the group processes, find the destination process.
• Get the VIA process information from the protocol-private information of
the destination process.
• Calculate the length from count & data type.
• Fill up the request structure from the arguments.
• Set the request r_state to MPIP_REQ_SENDREADY by calling
mpip_req_set_sendready(&req)
• Check whether the mode is sync or ready send and set the request flag
accordingly.
• mpip_inc_nactvproto(mpim_via_prototbl);
• mpip_atominc(mpip_nactvreqs, 1);
• mpip_enqsend(dest, &req);
• via_send_from_mpip_queue(dest)
• Until the request flag is set to MPIP_REQ_DONE, call via_recv(via_proc)
}
{
This function will do blocking receives of data. It will not return until all the data has
been read. This function will not be called from the threadsafe library.
5.0 Data Design
11.1b C-MPI.
typedef struct{
char buffer[VBUF_BUFFER_SIZE];
VIP_DESCRIPTOR desc; /* 64 Bytes*/
}vbuf;
/* send handle list is a queue of send handles representing
* in-progress rendezvous transfers. It is processed in FIFO
* order (because of MPI ordering rules) so there is both a head
* and a tail.
* The receive handle is a pointer to a single
* in-progress eager receive. We require that an eager sender
* send *all* packets associated with an eager receive before
* sending any others, so when we receive the first packet of
* an eager series, we remember it by caching the rhandle
* on the connection. */
MPIR_SHANDLE *shandle_head; /* "queue" of send handles to process */
MPIR_SHANDLE *shandle_tail;
MPIR_RHANDLE *rhandle; /* current eager receive "in progress" */
/* these two fields are used *only* by MPID_DeviceCheck to
* build up a list of connections that have received new
* flow control credit so that pending operations should be
* pushed. nextflow is a pointer to the next connection on the
* list, and inflow is 1 (true) or 0 (false) to indicate whether
* the connection is currently on the flowlist. This is needed
* to prevent a circular list.
*/
struct _viadev_connection_t *nextflow;
int inflow;
struct unexpected *unexpectedQ;
} viadev_connection_t;
typedef struct
{
int np; /* number of processes total */
int me; /* my process rank */
int global_id; /* global id of this parallel app */
VIP_NIC_HANDLE nic; /* single NIC */
VIP_CQ_HANDLE cq; /* single completion queue for this nic */
VIP_PROTECTION_HANDLE ptag; /* single protection tag for all memory
registration */
VIP_ULONG maxtransfersize; /* max send/rdma size */
viadev_connection_t *connections; /* array of VIs connected to other processes */
} viadev_info_t;
viadev_info_t viadev;
typedef struct
{
viadev_packet_type type; /* what type of packet is this? */
packet_sequence_t id; /* sequence number of packet. Always checked for */
/* sanity and later for reliability */
int buf_credit; /* piggybacked vbuf credit */
int source;
} viadev_packet_header;
typedef struct {
int context; /* context id -- identifies communicator */
int tag; /* MPI message tag. */
int data_length; /* how many bytes in MPI message (not header). */
int src_lrank; /* to help receiver */
int packet_no; /* to help packet sequencing */
int mode ; /* Send or Recv mode */
reqaddr_t sreq;
reqaddr_t rreq;
} viadev_packet_envelope;
typedef struct {
viadev_packet_header header;
viadev_packet_envelope envelope;
int bytes_in_this_packet;
} viadev_packet_eager_start;
typedef struct
{
viadev_packet_header header;
viadev_packet_envelope envelope;
#ifdef VIADEV_RGET_SUPPORT
void *buffer_address; /* address of pinned buffer on sender.
* NULL if not pinned */
VIP_MEM_HANDLE memhandle; /* memory handle for buffer (one handle
* for whole buffer) */
#endif
reqaddr_t sreq; /* identifies the send handle and is sent
* by receiver in reply */
int descriptor_required ;
} viadev_packet_rendezvous_start;
typedef struct {
viadev_packet_header header;
#ifdef VIADEV_RPUT_SUPPORT
void *buffer_address; /* address of pinned buffer on receiver.
* NULL if not pinned */
VIP_MEM_HANDLE memhandle; /* memory handle for buffer */
#endif
} viadev_packet_rendezvous_reply;
typedef struct {
viadev_packet_header header;
reqaddr_t rreq; /* identifies receive handle */
reqaddr_t sreq;
int context_id;
int src_lrank;
int mode;
} viadev_packet_long_reply;
• Global data structure
11.2b C-MPI.
struct envelope {
int ev_ctxt ;
int ev_mode ;
int ev_rank ;
int ev_tag ;
int64_t ev_len;
reqaddr_t ev_sendreq ;
reqaddr_t ev_rcvreq ;
};
struct procinfo {
struct process *pi_proc; /* process table entry */
char pi_addr[RTE_NETADD_LEN]; /* IP address */
struct procinfo *pi_next; /* next in protocol */
struct procinfo *pi_master; /* next host master */
int pi_proto; /* protocol */
int pi_ptrank; /* rank in protocol */
int pi_nodeid; /* cluster nodeid */
};
struct taskinfo {
int ti_flags;
#define TASK_WORLD 0x1
#define TASK_MERGE 0x2
#define TASK_CLIENT 0x4
#define TASK_SERVER 0x8
#define TASK_PARENT 0x10
#define TASK_CHILD 0x20
#define TASK_ROBUST 0x40
MPI_Comm ti_comm;
MPI_Comm ti_icomm;
int ti_sd;
rte_ctbl_t ti_ctbl;
rte_proc_t *ti_jtbl;
rte_port_t *ti_port;
int ti_index;
int ti_nlocal;
int ti_nremote;
int ti_nhosts;
int ti_nprotos;
struct procinfo *ti_procs;
struct procinfo *ti_proto[NPROTO];
int ti_nproto[NPROTO];
int *ti_lenvsizes[NPROTO]; /* server environment sizes */
void *ti_lenvs[NPROTO]; /* server environments */
int *ti_renvsizes[NPROTO]; /* client environment sizes */
void *ti_renvs[NPROTO]; /* client environments */
int ti_myenvsize[NPROTO];
void *ti_myenv[NPROTO];
MPI_Info ti_info;
};
struct mpip_req {
MPI_Comm r_comm; /* communicator */
MPI_Datatype r_type; /* message datatype */
int r_tag; /* message tag */
int r_peer; /* peer process rank in comm */
int r_flags; /* request flags */
int r_state; /* request state */
int r_fhandle; /* Fortran handle */
size_t r_count; /* message count */
union vf r_bufquery; /* message buffer or queryfn */
int64_t r_length; /* data length received/sent */
reqaddr_t r_peereq; /* peer request */
#define r_cancelreq r_peereq /* request to be cancelled */
union vf r_datafree; /* PM specific data or freefn */
struct mpip_req *r_next; /* for linking in queues */
struct process *r_proc; /* peer process */
int r_error; /* error class */
int r_lock; /* request spin lock */
struct mpip_req *r_bsend; /* background bsend request */
int r_pertag; /* persistent tag */
uint r_cookie; /* error checking cookie */
void_fn *r_cancelfn; /* cancelfn */
#ifndef __sparcv9
char r_pad[32];
#endif
};
struct mpip_comm
{
int c_flags; /* state flags */
int c_fhandle; /* Fortran handle */
int c_sndctxt; /* send context */
int c_rcvctxt; /* receive context */
int c_rank; /* caches rank in local group */
int c_size; /* caches size of local group */
#define MPIP_COMM_INTERCOMM 0x100 /* intercommunicator */
#define MPIP_COMM_CARTESIAN 0x200 /* cartesian topology */
#define MPIP_COMM_GRAPH 0x400 /* graph topology */
#define MPIP_COMM_MARKED 0x800
#define MPIP_COMM_TRANSPORTERR 0x1000
#define MPIP_COMM_ROBUST 0x2000
#define MPIP_COMM_SUPER_WIN 0x4000
struct mpip_group *c_group; /* local group */
struct mpip_group *c_rgroup; /* remote group */
MPI_Request c_posthead; /* posted queue head */
MPI_Request c_posttail; /* posted queue tail */
struct unexmsg *c_unexhead; /* unexpected queue head */
struct unexmsg *c_unextail; /* unexpected queue tail */
struct mpip_comm *c_intra; /* local group intracomm */
int c_refcnt; /* reference count */
int c_inuse; /* in use count */
uint c_colloffset; /* general buffer pool offset */
long long c_localbval; /* local barrier value */
void *c_gbp; /* general buffer pool */
cluster_t c_cluster; /* collective cluster info */
int c_latency; /* network latency (usec) */
int c_bandwidth; /* network bw (Mb/sec) */
int c_tcp; /* TCP in use */
#ifdef THREADSAFE
mutex_t c_queuelock; /* posted/unexpected Q lock */
mutex_t c_collock; /* collective operation lock */
#endif
int c_proto[NPROTO]; /* protocols in use */
int c_nprotos; /* # of protocols in use */
int c_depth; /* binary tree depth */
struct attributes c_attrs; /* attributes */
int c_nedges; /* # of graph edges */
#define c_ndims c_nedges /* # of cartesian dimensions */
int *c_index; /* graph indices */
#define c_dim c_index /* cartesian dimensions */
int *c_period; /* cartesian periods */
int *c_edge; /* graph edge */
#define c_coord c_edge /* cartesian coordinates */
struct mpip_errhdlr *c_errhdlr; /* error handler */
void *c_super; /* object the comm underlies */
#ifdef THREADSAFE
mutex_t c_statelock; /* lock for updating state */
#endif
char c_name[MPI_MAX_OBJECT_NAME];
uint c_cookie; /* error checking cookie */
};
struct unexmsg {
struct envelope um_env; /* message envelope */
char *um_buf; /* message buffer */
struct unexmsg *um_next; /* next in queue */
void *um_data; /* PM specific data */
};
struct protocol {
uint pt_nactv; /* # active requests */
uint pt_nprocs; /* # of processes */
uint pt_mtu;
int pt_loaded; /* 0=PM not loaded, 1=PM loaded */
uint pt_flags;
int pt_flags_revs[32];
#ifdef THREADSAFE
mutex_t pt_lock;
#endif
int (*pt_initialize)(void);
int (*pt_initconns)(taskinfo_t *, void **);
int (*pt_makeconns)(taskinfo_t *, void *);
int (*pt_addconns)(taskinfo_t *, void *);
int (*pt_completeconns)(taskinfo_t *);
int (*pt_remconns)(process_t **, int, void **);
int (*pt_closeconns)(process_t **, int, void *);
int (*pt_undoconns)(taskinfo_t *, void *);
int (*pt_fsend)(process_t *, char *, size_t, MPI_Datatype,
int, MPI_Comm, int);
int (*pt_frecv)(process_t *, char *, int, MPI_Datatype,
int, MPI_Comm, MPI_Status *);
int (*pt_urecv)(process_t *, unexmsg_t *, char *, int,
MPI_Datatype, MPI_Comm, MPI_Status *);
int (*pt_uirecv)(process_t *, unexmsg_t *, MPI_Request);
int (*pt_progress_blk)(MPI_Request, int);
int (*pt_progress_nblk)(int);
void (*pt_postedreceive)(process_t *);
void (*pt_unpostedreceive)(process_t *);
void (*pt_queuedsend)(process_t *);
void (*pt_dequeuedsend)(process_t *);
void (*pt_postedrcvany)(MPI_Comm);
void (*pt_unpostedrcvany)(MPI_Comm);
void (*pt_matchedrcvany)(MPI_Comm, process_t *);
int (*pt_getevents)(void **, uint *, uint *, uint *);
int (*pt_getpolls)(struct pollfd **, int *, int *);
void (*pt_error_request)(MPI_Request);
int (*pt_finalize)(void);
int (*pt_ping)(MPI_Comm, int);
};
Component Design
• Algorithmic description
1. Allocate Memory for data structures.
2. Initialize data structures
• Restrictions/limitations : None.
• Algorithmic description
Once new connections are made, add them to the internal structures.
• Restrictions/limitations: None.
• Algorithmic description
• Initialize device specific structures.
• Call do_setup function for VIA initialization
• Restrictions/limitations : None.
• Algorithmic description
1) Open the network interface to get the NIC handle.
2) Create the completion queue.
3) Create the protection tag.
4) Register memory of the specified size.
5) Create the VIs.
6) Prepost descriptors onto the receive queue.
7) Generate a unique discriminator.
8) Request a connection.
9) Wait for the connect request.
10) Accept the connect request and establish the connection
between the VIs.
• Algorithmic description
1. Check the connection count.
2. If the connection count is zero, remove the process information from the
table, i.e. logically disconnect the processes.
• Restrictions/limitations : None.
• Algorithmic description
1. Check the internal table for logically disconnected processes.
2. Call do_stop to physically disconnect the connections.
• Restrictions/limitations: None.
1. Disconnect VIs.
2. Destroy VIs.
3. Destroy the protection tag.
4. Deregister memory.
5. Destroy the completion queue.
6. Close the NIC.
• Restrictions/limitations : None.
• Global data definitions : viadev
• Local data structures : VIP_DESCRIPTOR
• Description in pseudo code: None.
• Reusable Modules/Code (if any): None.
• Algorithmic description
• Select the protocol
• Call for eager send or long send
• Restrictions/limitations : None.
17. Name of the component/module : CMPI/Send.
• Reference to requirement number : CMPI-REQ-3
• Function name. : MPID_VIA_long_send
• Description of the functionality : This function is used to send messages
of long length. Before sending the actual
data, the sender sends a handshaking
message to the receiver. Once the sender
gets the acknowledgement, it starts
sending data packets to the destination
process (sketched below).
• Algorithmic description
1. Send handshaking message to receiver.
2. Wait for acknowledgement.
3. Divide the long message as packets
4. If receiver is ready post descriptors on send queue.
• Restrictions/limitations: None.
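A hedged sketch of this handshake, reusing MAX_DATA_SIZE and the packet types from the data design section; send_packet, long_reply_received, and send_data_chunk are illustrative stand-ins for send_req_long_data and its companions described earlier:

void long_send_sketch(mpip_req_t *rq, viadev_connection_t *c)
{
    /* 1. handshake: a LONG_REQUEST packet carries only the envelope */
    send_packet(c, VIADEV_PACKET_LONG_REQUEST, rq);   /* illustrative helper */

    /* 2. wait for the receiver's VIADEV_PACKET_LONG_REPLY (the ACK) */
    while (!long_reply_received(rq))                   /* illustrative test */
        mpip_via_progress_nblk(1);

    /* 3.-4. receiver is ready: split the payload into MAX_DATA_SIZE
     * chunks and post a send descriptor for each */
    for (int64_t off = 0; off < rq->r_length; off += MAX_DATA_SIZE) {
        int64_t n = rq->r_length - off;
        if (n > MAX_DATA_SIZE)
            n = MAX_DATA_SIZE;
        send_data_chunk(c, rq, off, n);                /* illustrative helper */
    }
}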
Data is to be transferred in a single packet.
• Procedure for component/module: None.
• Component interface description : int MPID_VIA_eager_send(void *buf, int
len, int src_lrank, int tag, int context_id, int
dest_grank, int mode, reqaddr_t rq, reqaddr_t
rrq)
Parameters : buf – message buffer
len – message length
src_lrank – sender local rank
tag – message tag
context_id – sender context
dest_grank – destination process global rank
mode – synchronous, ready, standard
rq – sender request
rrq – receiver request
Return : VIP_SUCCESS or Error class
• Algorithmic description
1. Get the buffer.
2. Copy to internal buffer
3. Post the descriptor (see the sketch below).
• Restrictions/limitations: None.
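Pulling the three steps together, a hedged sketch that reuses get_vbuf, vbuf_init_sendrecv, and viadev_post_send from the earlier sections; the offset arithmetic over the vbuf layout is an assumption:

#include <string.h>

int eager_send_sketch(viadev_connection_t *c, const void *buf, int len)
{
    vbuf *v = get_vbuf();                        /* 1. buffer from the free pool */

    /* header + envelope sit at the front of the vbuf buffer */
    viadev_packet_eager_start *pkt = (viadev_packet_eager_start *)v->buffer;
    pkt->header.type = VIADEV_PACKET_EAGER_START;
    pkt->bytes_in_this_packet = len;

    memcpy(v->buffer + sizeof(*pkt), buf, len);  /* 2. copy into internal buffer */

    vbuf_init_sendrecv(v, sizeof(*pkt) + len);   /* build the descriptor */
    viadev_post_send(c, v);                      /* 3. post on the send queue */
    return MPI_SUCCESS;
}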
• Algorithmic description
1. Poll on CQ
2. Get buffer.
3. Check for acceptability.
4. If accepted select protocol else put it in unexpected queue.
• Restrictions/limitations: None.
• Algorithmic description
1. Search the unexpected queue; if found, call mpip_via_urecv().
2. Send ACK to the sender.
3. Receive packets.
4. Get buffer.
5. Copy data from the internal buffer to the user buffer.
6. Prepost descriptors.
• Restrictions/limitations: None.
21. Name of the component/module : CMPI/Recv.
• Reference to requirement number : CMPI-REQ-4
• Function name. : viadev_incoming_eager_start
• Description of the functionality : This function is used to receive small
messages.
Parameters : v – VIA buffer structure
c– VIA connection structure.
header – Packet header
Return : VIP_SUCCESS or Error class
• Procedure for component/module: None.
• Component interface description : int viadev_incoming_eager_start(vbuf *v
,viadev_connection_t *c,
viadev_packet_eager_start * header)
• Algorithmic description
1. Search Unexpected queue.
2. If expected call function mpip_via_urecv().
3. If not found get data from descriptor.
4. Copy data from internal buffer to user buffer.
5. Prepost descriptors.
• Restrictions/limitations : None.
• Algorithmic description
1. If long data send ACK to sender.
2. Get the data from receive queue.
3. Get buffer
4. Copy data from the internal buffer to the user buffer.
5. Prepost descriptor on receive queue
• Restrictions/limitations: None.
• Algorithmic description
1. If long data send ACK to sender.
2. Get the data from receive queue.
3. Get buffer
4. Copy data from the internal buffer to the user buffer.
5. Prepost descriptor on receive queue.
• Restrictions/limitations: None.
24. Name of the component/module : CMPI/Send/Recv
• Reference to requirement number : CMPI-REQ-4,CMPI-REQ-5
• Function name. : mpip_via_progress_blk
• Description of the functionality :
This is a blocking function which keeps
calling mpip_via_progress_nblk till the
request is completed, as sketched below.
• Procedure for component/module: None.
• Component interface description : int mpip_via_progress_blk(MPI_Request
rq, int pollall)
• Restrictions/limitations : None.
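A minimal sketch of that relationship, assuming the request's completion is visible through the MPIP_REQ_DONE flag mentioned earlier:

int mpip_via_progress_blk(MPI_Request rq, int pollall)
{
    int rc = MPI_SUCCESS;

    /* spin the non-blocking progress engine until this request completes */
    while (rc == MPI_SUCCESS && !(rq->r_flags & MPIP_REQ_DONE))
        rc = mpip_via_progress_nblk(pollall);
    return rc;
}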
• Local data structures : process_t, viaproc_t
• Description in pseudo code : None.
• Reusable Modules/Code (if any) : None.