Load Balancing and Switch Scheduling

Xiangheng Liu and Andrea Goldsmith
Abstract—Packet switching remains one of the bottlenecks in building fast Internet routers. Load balancing and switch scheduling are two important algorithms in the effort to maximize the throughput and minimize the latency of these packet switches. A load balancing algorithm regulates the traffic to conform to the service rates, while a switch scheduling algorithm allocates the service rates adaptively to the arrival patterns. Many existing load balancing and switch scheduling algorithms are very similar. We show that load balancing and switch scheduling systems are dual systems under a linear queue dynamics approximation. This allows us to cast a load balancing problem as a scheduling problem, and vice versa. We further show an example of designing a new load balancing algorithm from an existing scheduling algorithm based on this duality. The duality perspective also allows us to solve open problems: we find the entropy rate of the randomized bandwidth allocation system with linear queue dynamics based on the known entropy rate of the randomized load balancing system, and for the general case we find both an upper and a lower bound on the entropy rate. The joint use of dual load balancing and switch scheduling algorithms leads to performance gains, as we show using mean field analysis.

Index Terms—Communication Switching, Load Balancing, Switch Scheduling, Duality, Entropy Rate
I. INTRODUCTION

Address lookup, packet buffering and packet switching are three potential bottlenecks in developing high-speed Internet routers. A packet switch forwards a packet from the ingress port to the egress port, and the goal is to minimize the delay caused by such a device while maximizing the throughput. The load balanced switch [1] proposes a two-stage architecture that has a load balancing stage and a switch scheduling stage. The load balanced switch promises a simple architecture that guarantees 100% throughput. In this paper, we study the fundamental relationship between load balancing and switch scheduling algorithms and the benefit of this duality.

Load balancing is a fundamental problem in many practical scenarios. A familiar example is the supermarket model, where a central allocator assigns each arriving customer to one of a collection of servers to minimize the expected delay. The intuitively ideal SQ (join the Shortest Queue) algorithm is optimal, but its implementation for large systems can be costly. Randomized algorithms, such as RAND (join a queue at random) and SQ(d) (pick d queues at random and join the shortest one), have been proposed for simple implementation. The optimal tradeoffs between the complexity and the performance of such algorithms were studied in [2]. The use of memory in randomized load balancing has also proven attractive; in particular, it was shown in [3] that memory gives a multiplicative rather than an additive performance improvement. An example of such an algorithm is RAND(d, m), in which the packet joins the shortest queue among d randomly picked queues and the m shortest queues remembered from the previous time slot (a small sketch of these selection rules appears at the end of this introduction).

Switch scheduling determines which input to connect with which output in every time slot. It is well known that the crossbar constraint makes the switch scheduling problem a matching problem in a bipartite graph [4]. Even though the scheduling problem appears to be solved by completely different techniques from load balancing, we observe that many load balancing algorithms have a counterpart among switch scheduling algorithms, for example, SQ in load balancing versus LQF (Longest Queue First) in switch scheduling. In this paper, we aim to develop a fundamental relationship between load balancing and switch scheduling algorithms. We show that the two problems can be cast as duals of each other. This duality also allows us to come up with new algorithms and solve new problems for one problem based on results for the other. We also observe that dual algorithms usually work well together; this is often due to the mathematical duality that is fundamental to the system. We study the joint system with dual load balancing and switch scheduling algorithms. Using mean field analysis, our results indicate a significant performance gain for the joint algorithm.

The rest of the paper is organized as follows. We first show the duality between the load balancing and switch scheduling algorithms based on a linear queue dynamics approximation for a one-dimensional system in Section II. We extend the argument to an N x N switch in Section III, where we also illustrate how new algorithms arise from this duality. We explore how the duality helps us find the entropy rate of the randomized bandwidth allocation algorithm in Section IV. The performance of jointly using dual load balancing and switch scheduling algorithms in the one-dimensional system is analyzed in Section V. We conclude in Section VI.
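As an illustrative aside (ours, not from the cited papers), the following minimal Python sketch shows the selection rules of SQ(d) and RAND(d, m). The function names and the interpretation of the memory as the m least-loaded candidates seen in the previous slot are our assumptions.

import random

def sq_d(queues, d):
    # SQ(d): sample d queue indices uniformly at random, join the shortest.
    candidates = random.sample(range(len(queues)), d)
    return min(candidates, key=lambda i: queues[i])

def rand_d_m(queues, d, m, memory):
    # RAND(d, m): join the shortest queue among d fresh random samples plus
    # the queues remembered from the previous slot; remember the m least-loaded
    # candidates for the next slot. Returns (chosen index, new memory).
    candidates = set(random.sample(range(len(queues)), d)) | set(memory)
    choice = min(candidates, key=lambda i: queues[i])
    new_memory = sorted(candidates, key=lambda i: queues[i])[:m]
    return choice, new_memory

# Example: 8 queues, route one arrival with each policy.
queues = [3, 0, 5, 2, 7, 1, 4, 2]
print(sq_d(queues, d=2))
print(rand_d_m(queues, d=2, m=1, memory=[]))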
Fig. 1. (a) 1-D load balancing: a Poisson(Nλ) arrival stream, a load balancer, and N queues with Exp(μ1), ..., Exp(μN) servers. (b) 1-D bandwidth allocation: N input queues, a bandwidth allocator, and a single shared Exp(Nμ) server.

II. 1-D SCENARIO: LOAD BALANCING AND BANDWIDTH ALLOCATION
We first consider one-dimensional load balancing and switch scheduling algorithms. In a 1-D load balancing system, a single packet stream arrives at a load balancer, and the load balancer allocates each packet to one of N servers with individual queues. The 1-D switch scheduling problem is also referred to as bandwidth allocation: all N input queues share one server, and the scheduler determines which queue to serve next when the server is idle. Figure 1 shows the 1-D load balancing and scheduling problems. In this section, we show the duality between the two problems when the queue dynamics are linear.

We assume that arrivals occur at the beginning of a time slot, departures occur in the middle of a time slot, and the queue lengths are measured at the end of a time slot. Let $q_i(n)$ denote the $i$th queue length in time slot $n$, and let $A_i(n)$ and $D_i(n)$ denote the number of arrivals and departures in time slot $n$, respectively. Note that $A_i(n)$ and $D_i(n)$ only take binary values. For any queue $i$, we have

$q_i(n+1) = \max\left[\,q_i(n) + A_i(n) - D_i(n),\; 0\,\right].$  (1)
Clearly, the dynamics of the queue are not linear. A linear approximation is often used for simplicity in analysis:

$q_i(n+1) \approx q_i(n) - D_i(n) + A_i(n).$  (2)
This linear approximation can be accurate if the system is overloaded, in other words, if the queues are almost always non-empty. In a load balancing system, the load balancer controls the arrivals $A_i(n)$ to each of the queues, while the departures $D_i(n)$ are determined by the service disciplines. On the other hand, in the bandwidth allocation system, the bandwidth allocator determines the departures $D_i(n)$ from every queue, while the arrivals $A_i(n)$ are not controllable. The linear approximation of the system dynamics, Eqn. (2), has an equivalent representation:

$-q_i(n+1) \approx -q_i(n) + D_i(n) - A_i(n).$  (3)

If we set $\tilde{q}_i(n) = -q_i(n)$, $\tilde{A}_i(n) = D_i(n)$ and $\tilde{D}_i(n) = A_i(n)$, then we have

$\tilde{q}_i(n+1) \approx \tilde{q}_i(n) + \tilde{A}_i(n) - \tilde{D}_i(n).$  (4)
Note that the new system dynamics (4) have exactly the same form as Equation (2). Therefore, if we have a bandwidth allocation problem with the system dynamics of Equation (2) in which we need to determine $D_i(n)$, we can solve an equivalent load balancing problem as in (4), where we solve for $\tilde{A}_i(n)$, the equivalent of $D_i(n)$. The equivalence is achieved by considering a negative dual system. By negative, we mean that the new system has a state variable that is the negative of the original system. By dual, we refer to the swap of the arrival and departure processes. Intuitively, one can think of this as reversing all the arrows in Figure 1(b) and viewing the bandwidth allocator as a token allocator: when a token arrives at queue $i$, one packet leaves that queue. The new state variable $\tilde{q}_i(n)$, which equals the negative of the queue length, can be interpreted as the number of negative tokens in the queue; that is, one packet in the queue is equivalent to one negative token in the queue. Note that the new state variable satisfies $\tilde{q}_i(n) \le 0$ while the original state satisfies $q_i(n) \ge 0$. In order for the linear queue dynamics to hold, the queues must always be non-empty; thus, the sign constraints on the state variables are always satisfied.

Assuming linear queue dynamics, the negative dual transformation allows us to cast the bandwidth allocation problem as a load balancing problem, and vice versa. When the queue dynamics are linear, the load balancing and bandwidth allocation problems are identical except that the state variable of one system is the negative of the other. Therefore, the optimal load balancing algorithm SQ leads to the optimal bandwidth allocation algorithm LQF, and vice versa. Many other load balancing algorithms, such as RAND, SQ(d) and RAND(d, m), all have corresponding counterparts among bandwidth allocation algorithms.

When the system is under-loaded, the queues can be empty from time to time. In this case, the linear queue dynamics are no longer a good approximation. The main barrier to equating the two systems is the different constraints on the state variables, $q_i(n) \ge 0$ and $\tilde{q}_i(n) \le 0$. With the negative dual transformation, we arrive at

$\tilde{q}_i(n+1) = \min\left[\,\tilde{q}_i(n) + \tilde{A}_i(n) - \tilde{D}_i(n),\; 0\,\right].$  (5)
Note that this differs from Equation (1), where a maximum is taken. We are not able to establish a duality result when the system is under-loaded.

Remark: The linear queue dynamics $q_i(n+1) = q_i(n) - D_i(n) + A_i(n)$ are also equivalent to $q_i(n) = q_i(n+1) + D_i(n) - A_i(n)$. This raises the interesting question of whether load balancing in forward time is equivalent to bandwidth allocation in reverse time. However, in order for this claim to be true, we would need to define the new system state as $\hat{q}_i(n) = q_i(n+1)$ and $\hat{q}_i(n+1) = q_i(n)$. It is easy to see that these two definitions contradict each other, since $\hat{q}_i(n+1) = q_i(n+2)$ from the first definition and $\hat{q}_i(n+1) = q_i(n)$ from the second one. Also, if the claim were true, then the SQ policy in load balancing should remain optimal for the bandwidth allocation problem; but this is not true, since LQF is optimal for bandwidth allocation.
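To make the negative dual transformation concrete, the following sketch (ours; all parameters and names are illustrative) simulates a 1-D bandwidth allocation system under LQF together with its negative dual run as a load balancing system under SQ, and checks that the dual state remains the exact negative of the original state as long as every queue stays non-empty, i.e., as long as the linear dynamics (2)-(4) are exact.

import random

def simulate_duality(N=6, T=200, p_arrival=0.4, q0=50, seed=1):
    # Run a bandwidth-allocation system under LQF and its negative dual
    # under SQ, and check that the dual state stays the exact negative of
    # the original state while every queue remains non-empty.
    rng = random.Random(seed)
    q = [q0] * N            # bandwidth-allocation queues
    qd = [-q0] * N          # dual (load-balancing) state, qd_i = -q_i
    for _ in range(T):
        A = [1 if rng.random() < p_arrival else 0 for _ in range(N)]
        # LQF on q: serve the longest queue, measured right after arrivals;
        # ties broken toward the smallest index.
        after_arr = [q[i] + A[i] for i in range(N)]
        served = max(range(N), key=lambda i: (after_arr[i], -i))
        D = [1 if i == served else 0 for i in range(N)]
        q = [q[i] + A[i] - D[i] for i in range(N)]
        # Dual system: departures are the original arrivals A, and the dual
        # arrival (token) is placed by SQ on qd, measured right after the
        # dual departures, with the same tie-breaking rule.
        after_dep = [qd[i] - A[i] for i in range(N)]
        joined = min(range(N), key=lambda i: (after_dep[i], i))
        qd = [after_dep[i] + (1 if i == joined else 0) for i in range(N)]
        assert all(x > 0 for x in q), "a queue emptied; linear dynamics broken"
        assert qd == [-x for x in q], "duality violated"
    return q, qd

print(simulate_duality())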
Fig. 2. Load Balancing and Switch Scheduling as the two stages in a Load Balanced Switch.
III. 2-D SCENARIO: LOAD BALANCING AND SWITCH SCHEDULING IN A CROSSBAR SWITCH

The duality relationship between the load balancing and switch scheduling algorithms in Section II can easily be extended to an N x N switch, because the queue dynamics are exactly the same except that we now have $N^2$ queues and two indices: the ingress port number and the egress port number.

Figure 2 shows the architecture of a load balanced switch. The load balanced switch as proposed in [1] has two stages: the load balancing stage and the switch scheduling stage. Note that there is only a single stage of buffering, located between the two stages. In [1], both the load balancing and switch scheduling algorithms are specified. In this paper, the load balanced switch refers to the general switch architecture that includes both stages sharing a single stage of buffering; the load balancing and switch scheduling algorithms can be arbitrary.

As shown in Figure 2, both the load balancing and switch scheduling problems are essentially bipartite graph matching problems. The switch scheduling problem has long been solved using bipartite graph matching techniques. Here we explain why the load balancing part is also a bipartite graph matching problem. A packet that arrives at input $i$ has a certain destination $j$. Load balancing refers to the allocation of a packet at input $i$ with destination $j$ to one of the VOQs (Virtual Output Queues) $q_{kj}$, where $k = 1, 2, \ldots, N$. Since we have $N$ arrival processes, we have $N$ load balancers that are coordinated by a single load balancing algorithm.
In any given time slot, at most one packet arrives at a given input port. We assume zero queuing at the input side; hence, all the packets that arrive in time slot $n$ must be transported to one of the VOQs in the same time slot. We also assume a maximum of one READ operation at each output port of the load balancing stage. Therefore, each of the input ports needs to be connected to a different output port, and the packet goes to the VOQ with the proper destination port. This ensures that the load balancing problem is a bipartite graph matching problem.

With the negative dual transformation discussed in Section II, we can cast a load balancing problem as a scheduling problem and vice versa. In a crossbar switch, the scheduling problem is much better studied than the load balancing problem. As an example of making use of the duality between the two algorithms, we can transform the Maximum Weight Matching (MWM) algorithm in switch scheduling into a Minimum Weight Matching (MinWM) algorithm for load balancing. The reason we go from maximizing to minimizing is that the state variables in the equivalent systems are the negatives of each other. There is one technicality: in the bipartite graph of the load balancing stage, we only need to connect the input ports that have packet arrivals. For example, in a given time slot, if only input $i$ has a packet arrival and all other inputs have no arrivals, then the problem is essentially a 1-D load balancing problem and MinWM becomes the SQ algorithm, which chooses the shortest queue to join among $q_{kj}$ for $k = 1, \ldots, N$. When only two inputs $i$ and $i'$ have packet arrivals, with destinations $j$ and $j'$ respectively, MinWM chooses the matching that minimizes $q_{kj} + q_{k'j'}$ with $k \neq k'$; note that $j$ and $j'$ can be equal, but $k \neq k'$. The matching connects input $i$ with output $k$ and input $i'$ with output $k'$. The algorithm extends easily to the case where an arbitrary number of input ports have packet arrivals (a sketch is given below). Similarly, we can use the duality relationship to derive other new load balancing algorithms from known switch scheduling algorithms.
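The MinWM step just described can be written as a standard assignment problem. The sketch below is our illustration, not the paper's implementation, and it assumes SciPy is available: for the inputs with an arrival in the current slot, it builds the cost matrix of the VOQ lengths each packet would join and computes the minimum-weight matching of those inputs to the ports k of the buffering stage.

import numpy as np
from scipy.optimize import linear_sum_assignment

def minwm_load_balance(Q, arrivals):
    # MinWM load balancing for one time slot.
    # Q        : N x N array, Q[k, j] = length of the VOQ at port k holding
    #            packets destined to output j.
    # arrivals : dict {input port i: destination j} for inputs with an arrival.
    # Returns  : dict {input port i: port k}, minimizing the total weight
    #            sum of Q[k, j_i] with each k used at most once.
    inputs = list(arrivals.keys())
    N = Q.shape[0]
    # cost[r, k] = length of the VOQ the r-th active input would join at port k
    cost = np.array([[Q[k, arrivals[i]] for k in range(N)] for i in inputs])
    rows, cols = linear_sum_assignment(cost)   # minimum-weight matching
    return {inputs[r]: int(k) for r, k in zip(rows, cols)}

# Example: 4 x 4 switch, packets arrive at inputs 0 and 2, both destined to output 1.
Q = np.array([[2, 5, 1, 0],
              [0, 1, 3, 2],
              [4, 0, 2, 1],
              [1, 3, 0, 2]])
print(minwm_load_balance(Q, {0: 1, 2: 1}))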
IV. ENTROPY RATE OF RANDOMIZED LOAD BALANCING AND BANDWIDTH ALLOCATION
In Sections II and III, we have shown that load balancing and switch scheduling are dual systems when the queue dynamics are linear. Because of this duality, we would expect properties of one system to hold for the other. The entropy rate of randomized load balancing was studied in [5], and a closed form was found for a class of randomized load balancing algorithms. In this section, we study the entropy rate of a queuing system with randomized bandwidth allocation with the help of the duality.

First we review some important results from [5]. The 1-D load balancing system shown in Figure 1(a) is studied in [5]. In particular, it is assumed that all servers have identical service rates $\mu_i = 1$ for $i = 1, \ldots, N$. The class of randomized load balancing algorithms that can be described by a coin toss model is considered. The coin toss model is defined as follows. Let $\sigma_k$ be the permutation of $1, 2, \ldots, N$ that arranges the queues in increasing order in time slot $k$, right after the departures $D_i(k)$. Let $p = (p_1, \ldots, p_N)$ be a probability vector representing the probabilities of the outcomes of a toss of an $N$-sided coin, with $p_1 \ge p_2 \ge \cdots \ge p_N$. If a packet arrives in time slot $k$, we toss an $N$-sided coin distributed according to $p$; if the outcome of the coin toss is $C$, $1 \le C \le N$, then the packet joins queue $\sigma_k(C)$. The randomized algorithm SQ(d), in which the packet joins the shortest queue among $d$ random choices, corresponds to

$p_i = \dfrac{\binom{N-i+1}{d} - \binom{N-i}{d}}{\binom{N}{d}}.$  (6)
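As a quick numerical check of (6) (our sketch, not from [5]), the code below evaluates the coin-toss vector and its entropy H(C) for several values of d; it reproduces the special cases discussed next and shows H(C) decreasing in d.

from math import comb, log2

def coin_probs(N, d):
    # p_i from Eq. (6): probability that the i-th shortest queue is chosen
    # when the packet joins the shortest of d uniformly sampled queues.
    return [(comb(N - i + 1, d) - comb(N - i, d)) / comb(N, d)
            for i in range(1, N + 1)]

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

N = 8
for d in (1, 2, 4, 8):
    p = coin_probs(N, d)
    print(d, [round(x, 3) for x in p], round(entropy(p), 3))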
The algorithms RAND and SQ are special cases with $d = 1$ and $d = N$, respectively: RAND corresponds to the uniform vector $(\frac{1}{N}, \ldots, \frac{1}{N})$, and SQ corresponds to the probability vector $(1, 0, \ldots, 0)$.

Theorem 1 (Nair et al. [5]): Suppose the arrival process to the $N$-queue system is stationary, ergodic and renewal. Let the service distribution be independent of the arrival process and i.i.d. Under mild technical conditions [5], the entropy rate of the queue-size process of any algorithm that belongs to the coin toss model is equal to $\lambda N \left( H_{ER}(A) + H(S) + H(C) \right)$.

Here $\lambda$ is the arrival rate per queue, and $A$, $S$, $C$ are random variables representing the inter-arrival time, the service time and the result of the coin toss, respectively. The arrival process is Poisson with rate $\lambda N$, so $A$ is exponentially distributed with mean $\frac{1}{\lambda N}$. The service time $S$ of every packet is identically distributed with mean 1. The probability distribution of $C$ is determined by the algorithm. For the class SQ(d), the entropy $H(C)$ decreases as $d$ increases; for example, $H(C) = \log_2(N)$ when $d = 1$ and $H(C) = 0$ when $d = N$. With identical arrival and service processes, the entropy rate of the load balancing system therefore decreases as $d$ increases.
As we know, for the class of algorithms specified by the coin toss model, the system performance improves as $d$ increases. Therefore, the entropy rate can be seen as a performance index: the lower the entropy rate, the better the system performs.

A class of randomized bandwidth allocation algorithms can be described by a similar coin toss model. Let $\pi_k$ be the permutation of $1, \ldots, N$ that arranges the queue sizes in decreasing order in time slot $k$, right after the arrivals $A_i(k)$. The coin toss probability vector for LQF(d) is the same as that of SQ(d); in LQF(d), we randomly choose $d$ queues in each time slot and serve the longest among them, and if all $d$ sampled queues are empty, no packet is served in that time slot. We toss a coin to determine which queue to serve: when the coin toss result is $C$, we serve queue $\pi_k(C)$. In the bandwidth allocation problem, as long as the queue chosen by the algorithm is not empty, the linear queue dynamics are exact. This certainly includes the scenario in which all the queues are always non-empty. Moreover, if LQF is used, the system has linear dynamics as long as there is at least one packet in the system at all times.

Theorem 2: Suppose the arrival processes are stationary, ergodic, renewal and independent. Let the service distribution be independent of the arrival processes and i.i.d. If all the queues in the system conform to the linear dynamics, then under the same mild conditions as in Theorem 1, the entropy rate of the queue-size process of any randomized bandwidth allocation algorithm that can be specified by the coin toss model is equal to $\lambda N \left( H_{ER}(A) + H(S) + H(C) \right)$, where $\lambda$ is the rate of the Poisson arrivals to each queue, $A$ is the inter-arrival time of each queue, $S$ is the service time, and $C$ is the result of the coin toss.

Proof: When the linear dynamics hold, we use the negative dual transform of Section II and cast the bandwidth allocation problem as a load balancing problem. Let $\tilde{q}_i(k) = -q_i(k)$, $\tilde{A}_i(k) = D_i(k)$ and $\tilde{D}_i(k) = A_i(k)$ for $i = 1, \ldots, N$; then the randomized bandwidth allocation problem is precisely a load balancing problem described as follows. Let $\tilde{\sigma}_k$ be the permutation of $1, \ldots, N$ that arranges $\tilde{q}_1(k), \tilde{q}_2(k), \ldots, \tilde{q}_N(k)$ in increasing order in time slot $k$, right after the departures $\tilde{D}_i(k)$; an $N$-sided coin toss determines which queue the packet joins. Note that the inter-arrival time $\tilde{A}$ of the dual system is the actual inter-departure time of the bandwidth allocation system; since the queues are always non-empty, the server is always busy, the inter-departure time equals the service time, and so $H(\tilde{A}) = H(S)$. Also note that $\tilde{S}$ is the inter-arrival time of the original system, so $H(\tilde{S})$ equals $H(A)$, where $A$ is an exponential random variable with mean $\frac{1}{\lambda}$. Thus the entropy rate is equal to $\lambda N \left( H(\tilde{A}) + H(\tilde{S}) + H(C) \right) = \lambda N \left( H(S) + H(A) + H(C) \right)$.

When the linear queue dynamics hold, the entropy rate of the bandwidth allocation system therefore shares the same closed form as that of the load balancing system.
However, Theorem 2 does not hold when the queues can be empty, since the queue dynamics are then no longer linear and the bijection that proves Theorem 1 no longer holds. In the next theorem, we give two injections that yield a lower bound and an upper bound on the entropy rate in the general case.

Theorem 3: Suppose the arrival processes are stationary, ergodic, renewal and independent. Let the service distribution be independent of the arrival processes and i.i.d. Under the same mild conditions as in Theorem 1, the entropy rate of the queue-size process of any randomized bandwidth allocation algorithm specified by the coin toss model is upper bounded by $\lambda N \left( H_{ER}(A) + H(S) + H(I) + H(C) \right)$ and lower bounded by $\lambda N \left( H_{ER}(A) + H(S+I) + H(C) \right)$, where $\lambda$ is the arrival rate to each of the queues, $C$ is the coin toss result, and $I$ is the server idle time between services.

Proof: We follow the notation of [5]. Consider the queue-size process restricted to $(0, K)$ for $K > 0$: $\{Q_0, \ldots, Q_K\}$. Let $N(K)$ denote the number of arrivals in $(0, K)$. Let $V_{K+}$ be the future service times of the packets in the queue at time $K+$, and let $VI_{K+}$ be the future inter-service idle times at time $K+$; we assume $VI_{K+}$ has finite entropy for every $K$. The key observation is that the entropy rate decreases as we go down the injections. We have the injection

$(Q_0, V_{0+}, VI_{0+}, a_1, A^{N(K)-1}, S^{N(K)}, I^{N(K)}, C^{N(K)}) \rightarrow (Q_0, \ldots, Q_K),$

which gives the upper bound, since

$H_{ER}(Q) = \lim_{K \to \infty} \dfrac{H(Q_0, \ldots, Q_K)}{K} \le \lim_{K \to \infty} \dfrac{H(A^{N(K)-1}, S^{N(K)}, I^{N(K)}, C^{N(K)})}{K} = \lambda N \left( H_{ER}(A) + H(S) + H(I) + H(C) \right).$
Fig. 3. Joint Load Balancing and Bandwidth Allocation in a 1-D system: a Poisson(Nλ) arrival stream, a load balancer, N queues, a bandwidth allocator, and an Exp(Nμ) server.
The detailed argument establishing the last step can be found, along similar lines, in [7]. Note that $\lim_{K \to \infty} \frac{N(K)}{K} = \lambda N$, and that the entropies of $Q_0$, $V_{0+}$ and $VI_{0+}$ are all finite and therefore disappear from the entropy-rate expression as $K \to \infty$. This gives the upper bound on the entropy rate of the queue-size process. We also have the injection

$(Q_0, \ldots, Q_K; V_{K+}) \rightarrow (a_1, A^{N(K)-1}, (S+I)^{N(K)}, C^{N(K)}).$

Recall that $H(V_{K+})$ is finite by assumption. This injection gives a lower bound on the entropy rate of the form $\lambda N \left( H_{ER}(A) + H(S+I) + H(C) \right)$. Note that when $I = 0$ with probability 1, the two bounds meet, since $H(I) = 0$ and $S + I = S$ so that $H(S+I) = H(S)$; however, the tightness of these bounds in general is not known. When the queues are always non-empty, the inter-service idle time $I = 0$ with probability 1, and we recover the expression of Theorem 2.
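To illustrate when the two bounds of Theorem 3 coincide, the following slotted simulation (ours; the slotted model and parameters are simplifying assumptions rather than the continuous-time model above) estimates how often the server under LQF(d) serves nothing. As the load approaches one, the idle fraction shrinks toward zero, so H(I) tends to 0 and H(S+I) tends to H(S), and the bounds approach the expression of Theorem 2.

import random

def idle_fraction(N=8, d=2, load=0.6, T=100_000, seed=0):
    # Slotted simulation of N queues sharing one server under LQF(d).
    # Each slot, every queue independently receives a packet w.p. load/N
    # (total arrival rate = load packets per slot, service rate = 1 per slot).
    # Returns the fraction of slots in which the server serves nothing,
    # either because the system is empty or because all d sampled queues are.
    rng = random.Random(seed)
    q = [0] * N
    idle = 0
    for _ in range(T):
        for i in range(N):
            if rng.random() < load / N:
                q[i] += 1
        sample = rng.sample(range(N), d)
        longest = max(sample, key=lambda i: q[i])
        if q[longest] > 0:
            q[longest] -= 1
        else:
            idle += 1
    return idle / T

for load in (0.5, 0.9, 0.99):
    print(load, round(idle_fraction(load=load), 3))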
V. MEAN FIELD ANALYSIS FOR JOINT LOAD BALANCING AND BANDWIDTH ALLOCATION

One of the original motivations of this paper is to extend the load balanced switch to deploy dual load balancing and switch scheduling algorithms in its two stages. The motivation for using dual algorithms comes from the observation that the joint use of dual algorithms often achieves optimal performance in linear systems; for example, a Kalman filter followed by a state feedback controller is optimal for LQG (Linear Quadratic Gaussian) control. We conjecture that such a joint system gives better performance than systems in which only a load balancing or only a switch scheduling algorithm is adopted. We study the one-dimensional system shown in Figure 3 and evaluate its performance using mean field analysis.

We consider a system where the arrival process is Poisson($N\lambda$) and a load balancer allocates the packets to a bank of $N$ queues. We assume that these $N$ queues share the same server, which operates at rate $N$, and a bandwidth allocator at the server side determines which queue to serve when the server is free. In this section, we consider the system where SQ(d) is used for load balancing and LQF(d) is used for bandwidth allocation. The mean field analysis of SQ(d) alone is discussed in [6], where it is shown that the equilibrium tail distribution of the queue length is $P(Q_1 \ge i) = \lambda^{\frac{d^i - 1}{d - 1}}$.

Let $s_i(t)$ denote the fraction of queues with load at least $i$ at time $t$. Then the $s_i(t)$ satisfy the following set of differential equations for the joint system with SQ(d) and LQF(d):

$\dfrac{ds_i(t)}{dt} = \lambda \left( s_{i-1}^d(t) - s_i^d(t) \right) - \left( (1 - s_{i+1}(t))^d - (1 - s_i(t))^d \right).$

In equilibrium, for all $i$,

$\dfrac{ds_i(t)}{dt} = 0, \quad \text{i.e.,} \quad \lambda \left( s_{i-1}^d - s_i^d \right) = (1 - s_{i+1})^d - (1 - s_i)^d.$  (7)
Fig. 4. Performance of Joint Load Balancing and Bandwidth Allocation, (a) λ = 0.5, (b) λ = 0.95 (horizontal axis: load i).
By the law of large numbers, $P(Q_1 \ge i) = s_i(t)$ in the large-$N$ limit. Since $s_0(t) = 1$ for all $t$, we can find the distribution of the queue length using the recursion implied by (7). Figure 4 plots the queue-length distribution for two arrival rates, $\lambda = 0.5$ and $\lambda = 0.95$. In general, the lower a curve lies in the plot, the better the system performs. Both plots show the performance gain obtained by jointly using the SQ(2) and LQF(2) algorithms, and the gain is more significant in the more heavily loaded system. This performance gain holds for other values of $d$ as well.
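The equilibrium tails can also be obtained, approximately, by integrating the mean field equations numerically. The sketch below is ours: it uses a simple Euler scheme with a finite truncation of the queue length, takes the differential equation and the boundary condition s_0 = 1 as reconstructed above, and compares the joint SQ(d)/LQF(d) tail with the pure SQ(d) tail λ^((d^i - 1)/(d - 1)); the truncation level, step size and run length are illustrative choices.

def joint_tail(lam, d, levels=30, dt=0.05, steps=50_000):
    # Euler integration of ds_i/dt = lam*(s_{i-1}^d - s_i^d)
    #                              - ((1 - s_{i+1})^d - (1 - s_i)^d),
    # with s_0 = 1 and s_i = 0 for i > levels, run until (near) equilibrium.
    s = [1.0] + [0.0] * levels
    for _ in range(steps):
        new = s[:]
        for i in range(1, levels):
            arr = lam * (s[i - 1] ** d - s[i] ** d)
            dep = (1 - s[i + 1]) ** d - (1 - s[i]) ** d
            new[i] = min(1.0, max(0.0, s[i] + dt * (arr - dep)))
        s = new
    return s

def sq_only_tail(lam, d, levels=30):
    # Known SQ(d) equilibrium tail: P(Q >= i) = lam ** ((d**i - 1)/(d - 1)).
    return [lam ** ((d ** i - 1) / (d - 1)) for i in range(levels + 1)]

lam, d = 0.95, 2
joint = joint_tail(lam, d)
sq = sq_only_tail(lam, d)
for i in range(1, 6):
    print(i, round(joint[i], 4), round(sq[i], 4))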
VI. CONCLUSIONS

We have studied a fundamental relationship between load balancing and switch scheduling algorithms. We showed that the two problems are equivalent under a negative dual transformation when the linear queue dynamics approximation holds. This duality can help us derive new load balancing algorithms from existing scheduling algorithms, and vice versa. The duality also directly yields the entropy rate of the bandwidth allocation system when the linear queue dynamics are exact (all the queues in the system are always non-empty), since the entropy rate of the load balancing system is already known. When the queues do not have linear dynamics, we are not able to find the exact entropy rate; instead, we find an upper bound and a lower bound. The joint use of dual load balancing and switch scheduling algorithms is shown to improve performance in the 1-D system, and we conjecture that similar performance gains can be obtained for the load balanced switch as well.
ACKNOWLEDGMENTS

The authors would like to thank Devavrat Shah, Isaac Keslassy, Prof. Balaji Prabhakar and Prof. Nick McKeown for their helpful discussions and feedback.

REFERENCES

[1] C.-S. Chang, D.-S. Lee and Y.-S. Jou, "Load Balanced Birkhoff-von Neumann Switches," IEEE Workshop on High Performance Switching and Routing, 2001.
[2] A. Czumaj and V. Stemann, "Randomized Allocation Processes," Proceedings of the 38th IEEE Symposium on Foundations of Computer Science (FOCS), 1997.
[3] D. Shah and B. Prabhakar, "The Use of Memory in Randomized Load Balancing," IEEE ISIT, 2002.
[4] P. Giaccone, B. Prabhakar and D. Shah, "Randomized Scheduling Algorithms for High Aggregate Bandwidth Switches," IEEE INFOCOM, 2002.
[5] C. Nair, B. Prabhakar and D. Shah, "The Randomness in Randomized Load Balancing," Proceedings of the 39th Annual Allerton Conference on Communication, Control and Computing, pp. 912-921, October 2001.
[6] N. McKeown and B. Prabhakar, EE384Y Lecture Notes, Stanford University, 2003.
[7] B. Prabhakar and R. Gallager, "Entropy and the Timing Capacity of Discrete Queues," IEEE Transactions on Information Theory, pp. 357-370, February 2003.