
Asynchronous Interconnection Network and Communication


  1. Asynchronous Interconnection Network and Communication Chapter 3 of Casanova et al.

  2. Interconnection Network Topologies • The processors in a distributed memory parallel system are connected by an interconnection network. • The nodes have specialized coprocessors that route messages and place data in local memories. • A node consists of a (computing) processor, a memory, and a communication coprocessor. • Nodes are often simply called processors when this is not ambiguous.

  3. Network Topology Types • Static Topologies • A fixed network that cannot be changed • Nodes connected directly to each other by point-to-point communications links • Dynamic Topologies • Topology can change at runtime • One or more nodes can request direct communication be established between them. • Done using switches

  4. Some Static Topologies • Fully connected network (or clique) • Ring • Two-Dimensional grid • Torus • Hypercube • Fat tree

  5. Examples of Interconnection Topologies

  6. Static Topology Features • Fixed number of nodes • Degree: • Number of edges incident to a node • Distance between nodes: • Length of the shortest path between two nodes • Diameter: • Largest distance between any two nodes • Number of links: • Total number of edges • Bisection width: • Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size.

  7. Classical Interconnection Network Features • Clique (or fully connected network) • All pairs of processors are directly connected • p(p-1)/2 edges • Ring • Very simple and very useful topology • 2D Grid • Degree of interior processors is 4 • Not symmetric, as edge processors have different properties • Very useful when computations are local and communications are between neighbors • Was heavily used in earlier machines

  8. Classical Networks (cont) • 2D Torus • Easily formed from the 2D mesh by connecting matching end points. • Hypercube • Has been used extensively • Using its recursive definition, one can design simple but very efficient algorithms • Has a small diameter that is logarithmic in the number of nodes • Degree and total number of edges grow too quickly to be useful for massively parallel machines. • (Commonly cited metric values for these topologies are tabulated in the sketch below.)
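
The features defined on slide 6 can be made concrete for these classical topologies. The following is a minimal sketch (not from the text) that tabulates commonly cited values, assuming p is even, a power of two for the hypercube, and a perfect square with an even side for the torus:

```c
#include <math.h>
#include <stdio.h>

/* Commonly cited metrics for classical static topologies on p nodes.
   Assumes p is even, a power of two for the hypercube, and a perfect
   square with an even side for the torus (illustrative sketch only). */
static void print_metrics(int p) {
    int d = (int)round(log2(p));      /* hypercube dimension */
    int q = (int)round(sqrt(p));      /* torus side length   */

    printf("Topology     Degree  Diameter  Links      Bisection\n");
    printf("Clique       %-7d %-9d %-10d %d\n",
           p - 1, 1, p * (p - 1) / 2, (p / 2) * ((p + 1) / 2));
    printf("Ring         %-7d %-9d %-10d %d\n", 2, p / 2, p, 2);
    printf("2D torus     %-7d %-9d %-10d %d\n", 4, 2 * (q / 2), 2 * p, 2 * q);
    printf("Hypercube    %-7d %-9d %-10d %d\n", d, d, p * d / 2, p / 2);
}

int main(void) { print_metrics(64); return 0; }
```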

  9. Dynamic Topologies • The fat tree is different from the other networks covered • Compute nodes are only at the leaves • Nodes at higher levels do not perform computation • The topology is a binary tree – both in the 2D front view and in the side view • Provides extra bandwidth near the root • Used by Thinking Machines Corp. in the CM-5 • Crossbar switch • Has p² switches, which is very expensive for large p • Can connect p processors to p processors in any one-to-one pattern • Cost rises with the number of switches, which is quadratic in the number of processors.

  10. Dynamic Topologies (cont) • Benes and Omega networks • Use smaller crossbars arranged in stages • Only crossbars in adjacent stages are connected together • Called multi-stage networks; cheaper to build than a full crossbar • Configuring a multi-stage network is more difficult than configuring a crossbar • Dynamic networks are now the most commonly used topologies.

  11. A Simple Communication Performance Model • Assume a processor Pi sends a message of length m to Pj. • The cost to transfer a message along one network link is roughly linear in the message length. • Consequently, the cost to transfer the message along a particular route is also roughly linear in m. • Let ci,j(m) denote the time to transfer this message.

  12. Hockney Performance Model for Communications • The time ci,j(m) to transfer the message can be modeled as ci,j(m) = Li,j + m/Bi,j = Li,j + m·bi,j where • m is the size of the message • Li,j is the startup time, also called the latency • Bi,j is the bandwidth, in bytes per second • bi,j = 1/Bi,j is the inverse of the bandwidth • Proposed by Hockney in 1994 to evaluate the performance of the Intel Paragon. • Probably the most commonly used model (see the sketch below).
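
A minimal sketch of the Hockney cost function between a fixed pair of processors; the latency and bandwidth values are illustrative, not taken from the text:

```c
#include <stdio.h>

/* Hockney model: time to send m bytes from Pi to Pj is
   c(m) = L + m / B  =  L + m * b, with b = 1/B. */
static double hockney_cost(double m, double L, double B) {
    return L + m / B;
}

int main(void) {
    /* Illustrative values: 10 microseconds latency, 1 GB/s bandwidth. */
    double L = 10e-6, B = 1e9;
    printf("1 KB message: %g s\n", hockney_cost(1e3, L, B)); /* ~11 us  */
    printf("1 MB message: %g s\n", hockney_cost(1e6, L, B)); /* ~1 ms   */
    return 0;
}
```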

  13. Hockney Performance Model (cont.) • Factors that Li,j and Bi,j depend on • Length of route • Communication protocol used • Communications software overhead • Ability to use links in parallel • Whether links are half or full duplex • Etc.

  14. Store and Forward Protocol • Store and forward (SF) is a point-to-point protocol. • Each intermediate node receives and stores the entire message before retransmitting it. • It was implemented in the earliest parallel machines, in which nodes did not have communication coprocessors. • Intermediate nodes are interrupted to handle messages and route them towards their destination.

  15. Store and Forward Protocol (cont) • If d(i,j) is the number of links between Pi and Pj, the formula for ci,j(m) can be rewritten as ci,j(m) = d(i,j)(L + m/B) = d(i,j)L + d(i,j)m·b where • L is the latency and b is the reciprocal of the bandwidth of one link. • This protocol yields poor latency and poor bandwidth. • The communication cost can be reduced using pipelining.

  16. Store and Forward Protocol with Pipelining • The message is split into r packets of size m/r. • The packets are sent one after another from Pi to Pj. • The first packet reaches Pj after ci,j(m/r) time units. • The remaining r-1 packets arrive in (r-1)(L + m·b/r) additional time units. • Simplifying, the total communication time is [d(i,j) - 1 + r][L + m·b/r]. • Casanova et al. derive the optimal value of r (see the sketch below).
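
Setting the derivative of [d(i,j) - 1 + r][L + m·b/r] with respect to r to zero gives r_opt = sqrt((d(i,j) - 1)·m·b / L), which is the kind of closed form the text derives. A sketch with illustrative parameters (in practice r would be rounded to an integer number of packets):

```c
#include <math.h>
#include <stdio.h>

/* Store-and-forward cost over d links: d * (L + m*b).
   With pipelining into r packets:      (d - 1 + r) * (L + m*b/r).
   Setting the derivative with respect to r to zero gives
   r_opt = sqrt((d - 1) * m * b / L).  Illustrative parameters only. */
static double sf_cost(int d, double m, double L, double b) {
    return d * (L + m * b);
}

static double sf_pipelined_cost(int d, double m, double L, double b, double r) {
    return (d - 1 + r) * (L + m * b / r);
}

int main(void) {
    int d = 8;                      /* links between Pi and Pj       */
    double m = 1e6;                 /* message size in bytes         */
    double L = 10e-6, b = 1e-9;     /* latency and inverse bandwidth */
    double r = sqrt((d - 1) * m * b / L);

    printf("plain SF          : %g s\n", sf_cost(d, m, L, b));
    printf("pipelined (r=%.0f): %g s\n", r, sf_pipelined_cost(d, m, L, b, r));
    return 0;
}
```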

  17. Two Cut-Through Protocols • Common performance model: ci,j(m) = L + d(i,j)·δ + m/B where • L is the one-time cost of creating the message • δ is the routing-management overhead • Generally δ << L, as routing management is performed by hardware while L involves software overhead • m/B is the time required to transmit the message through the entire route • (Compared with store-and-forward in the sketch below.)
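
To see why cut-through protocols "hide" the distance between nodes (as noted on slide 20), compare the two cost models on the same route. A sketch with illustrative parameters; δ is written as delta:

```c
#include <stdio.h>

/* Store-and-forward vs. cut-through for the same route of d links.
   SF:          d * (L + m/B)
   Cut-through: L + d*delta + m/B    (delta << L, handled by hardware)
   Parameters below are illustrative, not from the text. */
int main(void) {
    int d = 8;
    double m = 1e6, L = 10e-6, B = 1e9, delta = 0.1e-6;

    double sf = d * (L + m / B);
    double ct = L + d * delta + m / B;
    printf("store-and-forward: %g s\n", sf);  /* distance multiplies m/B      */
    printf("cut-through      : %g s\n", ct);  /* distance only adds d*delta   */
    return 0;
}
```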

  18. Circuit-Switching Protocol • The first cut-through protocol • The route is created before the first message is sent • The message is then sent directly to the destination along this route • The nodes and links along the route cannot be used for any other communication while the transmission is in progress

  19. Wormhole (WH) Protocol • A second cut-through protocol • The destination address is stored in the header of the message • Routing is performed dynamically at each node • The message is split into small packets called flits • If two flits arrive at the same time, they are stored in the intermediate nodes’ internal registers

  20. Point-to-Point Communication Comparisons • Store and forward is not used in physical networks, only at the application level • Cut-through protocols are more efficient • They hide the distance between nodes • They avoid large buffer requirements at intermediate nodes • Almost no messages are lost • For small networks, no flow-control mechanism is needed • Wormhole is generally preferred to circuit switching • Its latency is normally much lower

  21. LogP Model • Models based on the LogP model are more precise than the Hockney model • They involve the three components of a communication – the sender, the network, and the receiver • At times, some of these components may be busy while others are not • Some parameters of LogP: • m is the message size (in bytes) • w is the size of the packets the message is split into • L is an upper bound on the latency • o is the overhead, • defined as the time that a node is engaged in the transmission or reception of a packet

  22. LogP Model (cont) • Parameters of LogP (cont): • g, or gap, is the minimal time interval between consecutive packet transmissions or receptions at a node • During this time, a node may not use its communication coprocessor (i.e., network card) • 1/g is the communication bandwidth available per node • P is the number of nodes in the platform • Cost of sending m bytes with packet size w, and processor occupation time on the sender and receiver: see the sketch below
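
The two quantities in the last bullet are usually accounted for as follows; since the slide's own formulas were lost in transcription, the expressions below are the standard LogP accounting and should be read as an assumption: the end-to-end time is about 2o + L + (ceil(m/w) - 1)·max(o, g), and each endpoint is busy for o + (ceil(m/w) - 1)·max(o, g). A sketch:

```c
#include <math.h>
#include <stdio.h>

/* LogP-style cost of sending m bytes split into packets of w bytes.
   These are the commonly used formulas (an assumption here, since the
   slide's own formulas were lost in transcription):
     end-to-end time : 2*o + L + (ceil(m/w) - 1) * max(o, g)
     endpoint busy   :   o     + (ceil(m/w) - 1) * max(o, g)            */
static double logp_send_time(double m, double w, double L, double o, double g) {
    double k = ceil(m / w);               /* number of packets */
    return 2 * o + L + (k - 1) * fmax(o, g);
}

static double logp_endpoint_busy(double m, double w, double o, double g) {
    double k = ceil(m / w);
    return o + (k - 1) * fmax(o, g);
}

int main(void) {
    /* Illustrative parameters: 64 KB message, 4 KB packets,
       5 us latency, 1 us overhead, 2 us gap. */
    double m = 64e3, w = 4e3, L = 5e-6, o = 1e-6, g = 2e-6;
    printf("send time : %g s\n", logp_send_time(m, w, L, o, g));
    printf("busy time : %g s\n", logp_endpoint_busy(m, w, o, g));
    return 0;
}
```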

  23. Other LogP-Related Models • LogP attempts to capture the characteristics of parallel platforms in a few parameters. • Platforms are fine-tuned and may use different protocols for short and long messages. • LogGP is an extension of LogP in which G captures the bandwidth for long messages. • pLogP is an extension of LogP in which L, o, and g depend on the message size m. • It also separates the sender overhead os from the receiver overhead or.

  24. Affine Models • The use of floor functions in LogP-type models makes them nonlinear. • This causes many problems in analytic and theoretical studies. • It has led to the proposal of fully linear models. • The time that Pi is busy sending a message is expressed as an affine function of the message size. • An affine function of m has the form f(m) = a·m + b, where a and b are constants; if b = 0, then f is a linear function. • Similarly, the time Pj is busy receiving the message is expressed as an affine function of the message size. • We postpone further coverage of affine models for now.

  25. Modeling Concurrent Communications • Multi-port model • Assumes that communications are contention-free and do not interfere with each other. • A consequence is that a node may communicate with an unlimited number of nodes without any degradation in performance. • Fully supporting this would require a clique interconnection network. • May simplify proofs that certain problems are hard: • if a problem is hard under ideal communication conditions, then it is hard in general. • The assumption is not realistic – communication resources are always limited. • See the Casanova text for additional information.

  26. Concurrent Communications Models (2/5) • Bounded multi-port model • Proposed by Hong and Prasanna • For applications that use threads (e.g., on multi-core processors), a network link can be shared by several incoming and outgoing communications. • The sum of the bandwidths allotted by the operating system to all communications cannot exceed the bandwidth of the network card. • An unbounded number of communications can take place, provided they share the total available bandwidth. • The model defines the bandwidth allotted to each communication. • Bandwidth sharing controlled by the application is unusual, as it is usually handled by the operating system.

  27. Concurrent Communications Models (3/5) • 1-port (unidirectional or half-duplex) model • Avoids unrealistically optimistic assumptions • Forbids concurrent communication at a node: • a node can either send data or receive it, but not both simultaneously. • This model is very pessimistic, as real-world platforms can achieve some concurrency in communication. • The model is simple, and it is easy to design algorithms that follow it.

  28. Concurrent Communications Models (4/5) • 1-port (bidirectional or full-duplex) model • Currently, most network cards are full-duplex. • Allows one emission and one reception simultaneously at each node. • Introduced by Blat et al. • Current hardware does not easily allow multiple messages to be transmitted simultaneously. • Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network. • Experimental work by Saif and Parashar suggests that asynchronous sends become serialized once message sizes exceed a few megabytes.

  29. Concurrent Communications Models (5/5) • k-ports model • A node may have k>1 network cards • This model allows a node to be involved in a maximum of one emission and one reception on each network card. • This model is used in Chapters 4 & 5.

  30. Bandwidth Sharing • The previous concurrent communication models only consider contention at the nodes. • Other parts of the network can also limit performance. • It may be useful to determine constraints on each network link. • This type of network model is useful for performance evaluation, but too complicated for algorithm design. • The Casanova text evaluates algorithms using two models: • the Hockney model, or even simplified versions of it (e.g., assuming no latency) • the multi-port model (ignoring contention) or the 1-port model.

  31. Case Study: Unidirectional Ring • We first consider a platform of p processors arranged in a unidirectional ring. • The processors are denoted Pk for k = 0, 1, … , p-1. • Each PE can find its logical index by calling My_Num().

  32. Unidirectional Ring Basics • A processor can determine the number of PEs by calling NumProcs(). • Both of the preceding functions have counterparts in MPI, a message-passing library implemented on most asynchronous systems (see the sketch below). • Each processor has its own memory. • All processors execute the same program, which acts on the data in their local memories: • Single Program, Multiple Data, or SPMD. • Processors communicate by message passing, i.e., • by explicitly sending and receiving messages.
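
A minimal SPMD sketch of the MPI counterparts of My_Num() and NumProcs(), namely MPI_Comm_rank and MPI_Comm_size (compile with mpicc and launch with mpirun):

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: every processor runs this same program and learns its own
   logical index (rank) and the total number of processors. */
int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* analogue of My_Num()   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* analogue of NumProcs() */
    printf("I am processor P%d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```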

  33. Unidirectional Ring Basics (cont – 2/5) • A processor sends a message using the function send(addr, m): • addr is the memory address (in the sending processor) of the first data item to be sent. • m is the message length (i.e., the number of items to be sent). • A processor receives a message using the function receive(addr, m): • addr is the address in the receiving processor’s memory where the first data item is to be stored. • If processor Pi executes a receive, then its predecessor P(i-1) mod p must execute a send. • Since each processor has a unique predecessor and a unique successor, they do not have to be specified.

  34. Unidirectional Ring Basics (cont – 3/5) • A restrictive assumption is that both the send and the receive are blocking: • the participating processors cannot continue until the communication is complete. • The blocking assumption is typical of first-generation platforms. • A classical assumption is to keep the receive blocking but to make the send non-blocking: • the processor executing a send can continue while the data transfer takes place. • To implement this, one function initiates the send and another function determines when the communication has finished (see the MPI sketch below).
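
In MPI terms, the classical assumption (blocking receive, non-blocking send) can be sketched as follows: MPI_Isend initiates the transfer, MPI_Recv blocks, and MPI_Wait later reports completion. The ring neighbors are derived from the rank, following this chapter's unidirectional-ring convention; the helper name ring_exchange is ours, not the text's, and it is meant to be called between MPI_Init and MPI_Finalize:

```c
#include <mpi.h>

/* Non-blocking send to the ring successor, blocking receive from the
   ring predecessor.  One function starts the send (MPI_Isend) and a
   second one (MPI_Wait) tells us when the transfer has completed. */
static void ring_exchange(double *out, double *in, int m) {
    int rank, p;
    MPI_Request req;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int succ = (rank + 1) % p;          /* successor on the ring   */
    int pred = (rank - 1 + p) % p;      /* predecessor on the ring */

    MPI_Isend(out, m, MPI_DOUBLE, succ, 0, MPI_COMM_WORLD, &req);
    MPI_Recv(in, m, MPI_DOUBLE, pred, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* communication now finished */
}
```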

  35. Unidirectional Ring Basics (cont – 4/5) • In algorithms, we simply indicate which operations are blocking and which are non-blocking. • A more recently proposed assumption is that a single processor can send data, receive data, and compute simultaneously. • All three can occur together only if no race condition exists. • It is convenient to think of three logical threads of control running on each processor: • one for computing • one for sending data • one for receiving data • We will usually use this less restrictive third assumption.

  36. Unidirectional Ring Basics (cont – 5/5) • Timings for send/receive: • We use a simplified version of the Hockney model. • The time to send or receive a message over one link is c(m) = L + m·b, where • m is the length of the message • L is the startup cost in seconds, due to the physical latency and the software overhead • b is the inverse of the data transfer rate.

  37. The Broadcast Operation • The broadcast operation allows a processor Pk to send the same message of length m to all other processors. • At the beginning of the broadcast, the message is stored at address addr in the memory of the sending processor Pk. • At the end of the broadcast, the message is stored at address addr in the memory of every processor. • All processors must call the function Broadcast(k, addr, m).

  38. Broadcast Algorithm Overview • The message goes around the ring from processor to processor: from Pk to Pk+1 to Pk+2 to … to Pk-1. • Processor indices are taken modulo p, where p is the number of processors; for example, if k = 0 and p = 8, then k-1 = p-1 = 7. • Note that there is no parallelism in this algorithm, since the message advances around the ring by only one processor per step. • The predecessor of Pk (i.e., Pk-1) does not send the message back to Pk. • A sketch of the algorithm follows.
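
A sketch of the ring broadcast described above. MY_NUM(), NUM_PROCS(), send(), and receive() stand for the chapter's primitives as hypothetical C bindings; this is a reconstruction of the scheme, not the text's exact listing (the receive-then-send pair in the last branch is where the Step 10 / Step 11 ordering discussed on the next slide matters):

```c
/* Chapter primitives, assumed provided by the runtime (hypothetical C bindings). */
extern int  MY_NUM(void);
extern int  NUM_PROCS(void);
extern void send(void *addr, int m);
extern void receive(void *addr, int m);

/* Ring broadcast from Pk: the message travels from Pk to Pk+1 to ...
   to Pk-1, one hop per step, as described on this slide. */
void broadcast(int k, void *addr, int m) {
    int q = MY_NUM();
    int p = NUM_PROCS();

    if (q == k) {
        send(addr, m);              /* source: inject the message          */
    } else if (q == (k - 1 + p) % p) {
        receive(addr, m);           /* last processor: receive only        */
    } else {
        receive(addr, m);           /* middle processors: receive first... */
        send(addr, m);              /* ...then forward to the successor    */
    }
}
```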

  39. Analysis of the Broadcast Algorithm • For the algorithm to be correct, the receive in Step 10 must execute before Step 11. • Running time: • Since there is a sequence of p-1 communications, the time to broadcast a message of length m is (p-1)(L + m·b). • MPI does not typically use the ring topology for its communication primitives: • it instead uses various tree topologies that are more efficient on modern parallel platforms. • However, these primitives are simpler to present on a ring. • This prepares readers to implement their own primitives when doing so is more efficient than using the MPI primitives.

  40. Scatter Algorithm • The scatter operation allows Pk to send a different message of length m to each processor. • Initially, Pk holds the message to be sent to Pq at location addr[q]. • To keep the array of addresses uniform, space for a message from Pk to itself is also provided. • At the end of the algorithm, each processor stores its message from Pk at location msg. • The efficient way to implement this operation is to pipeline the messages: • the message to the most distant processor (i.e., Pk-1) is sent first, followed by the message to Pk-2, and so on.

  41. Discussion of the Scatter Algorithm • In Steps 5-6, Pk successively sends messages to the other p-1 processors in decreasing order of their distance from Pk. • In Step 7, Pk stores its own message to itself. • The other processors concurrently forward messages as they arrive (Steps 9-12). • Each processor uses two buffers, with addresses tempS and tempR. • This allows a processor to send one message and receive the next message in parallel in Step 12.

  42. Discussion of the Scatter Algorithm (cont) • In Step 11, tempS ↔ tempR means the two addresses are swapped, so that the value just received can be sent on to the next processor. • When a processor receives its own message from Pk, it stops forwarding (Step 10). • Whatever is in the receive buffer tempR at the end is stored as that processor’s message from Pk (Step 13). • The running time of the scatter algorithm is the same as that of the broadcast, namely (p-1)(L + m·b). • A sketch combining these steps follows.
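
Putting the steps described on slides 40-42 together, a sketch of the pipelined scatter, again using the chapter's primitives as hypothetical C bindings; the send/receive pair inside the loop is meant to execute in parallel, as in the text's Step 12, and the buffer swap corresponds to Step 11:

```c
#include <stdlib.h>
#include <string.h>

/* Chapter primitives, assumed provided by the runtime (hypothetical C bindings). */
extern int  MY_NUM(void);
extern int  NUM_PROCS(void);
extern void send(void *addr, int m);
extern void receive(void *addr, int m);

/* Pipelined scatter from Pk, reconstructed from the discussion above.
   Pk sends the message for the most distant processor first; every other
   processor forwards (k-1-q) mod p messages and keeps the last one. */
void scatter(int k, void *msg, void *addr[], int m) {
    int q = MY_NUM();
    int p = NUM_PROCS();

    if (q == k) {
        for (int i = 1; i <= p - 1; i++)        /* Steps 5-6: most distant first */
            send(addr[(k + p - i) % p], m);
        memcpy(msg, addr[k], m);                /* Step 7: keep own message      */
    } else {
        void *tempS = malloc(m), *tempR = malloc(m);
        receive(tempR, m);
        for (int i = 1; i <= (k - 1 - q + p) % p; i++) {
            void *tmp = tempS; tempS = tempR; tempR = tmp;  /* Step 11: swap     */
            send(tempS, m);         /* Step 12: forward the previous message...  */
            receive(tempR, m);      /*          ...while receiving the next one  */
        }
        memcpy(msg, tempR, m);      /* Step 13: keep the last message received   */
        free(tempS);
        free(tempR);
    }
}
```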

  43. Example for the Scatter Algorithm • Example: In Figure 3.7, let p = 6 and k = 4. • Steps 5-6: for i = 1 to p-1 do send(addr[(k+p-i) mod p], m) • Let PE = (k+p-i) mod p = (10 - i) mod 6: • for i = 1, PE = 9 mod 6 = 3 • for i = 2, PE = 8 mod 6 = 2 • for i = 3, PE = 7 mod 6 = 1 • for i = 4, PE = 6 mod 6 = 0 • for i = 5, PE = 5 mod 6 = 5 • Note that the messages are sent to the processors in the order 3, 2, 1, 0, 5. • That is, messages to the most distant processors are sent first.

  44. Example for the Scatter Algorithm (cont) • Example: In Figure 3.7, let p = 6 and k = 4. • Step 10: for i = 1 to (k-1-q) mod p do • Compute (k-1-q) mod p = (3-q) mod 6 for all q (note q ≠ k = 4): • q = 5 ⇒ i = 1 to 4, since (3-5) mod 6 = 4 • PE 5 forwards values in the loop for i = 1 to 4 • q = 0 ⇒ i = 1 to 3, since (3-0) mod 6 = 3 • PE 0 forwards values for i = 1 to 3 • q = 1 ⇒ i = 1 to 2, since (3-1) mod 6 = 2 • PE 1 forwards values for i = 1 to 2 • q = 2 ⇒ i = 1 to 1, since (3-2) mod 6 = 1 • PE 2 is active in the loop only when i = 1
