========================
========================

This document describes some of the internals of the ovs-vswitchd
process. It is not complete. It tends to be updated on demand, so if
you have questions about the vswitchd implementation, ask them and
perhaps we'll add some appropriate documentation here.

Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so
code references below should be assumed to refer to that file except
as otherwise specified.

Bonding allows two or more interfaces (the "slaves") to share network
traffic. From a high-level point of view, bonded interfaces act like
a single port, but they have the bandwidth of multiple network
devices, e.g. two 1 Gbps physical interfaces act like a single 2 Gbps
interface. Bonds also increase robustness: the bonded port does not
go down as long as at least one of its slaves is up.

In vswitchd, a bond always has at least two slaves (and may have
more). If a configuration error or similar problem would leave a bond
with only one slave, the port becomes an ordinary port, not a bonded
port, and none of the special features of bonded ports described in
this section apply.

There are many forms of bonding, but ovs-vswitchd currently implements
only a single kind, called "source load balancing" or SLB bonding.
SLB bonding divides traffic among the slaves based on the Ethernet
source address. This is useful only if the traffic over the bond has
multiple Ethernet source addresses, for example if network traffic
from multiple VMs is multiplexed over the bond.

Enabling and Disabling Slaves
-----------------------------

When a bond is created, a slave is initially enabled or disabled based
on whether carrier is detected on the NIC (see iface_create()). After
that, a slave is disabled if its carrier goes down for a period of
time longer than the downdelay, and it is enabled if carrier comes up
for longer than the updelay (see bond_link_status_update()). There is
one exception where the updelay is skipped: if no slaves at all are
currently enabled, then the first slave on which carrier comes up is
enabled immediately.

The updelay should be set to a time longer than the STP forwarding
delay of the physical switch to which the bond port is connected (if
STP is enabled on that switch). Otherwise, the slave will be enabled,
and load may be shifted to it, before the physical switch starts
forwarding packets on that port, which can cause some data to be
"blackholed" for a time. The exception for a single enabled slave
does not cause any problem in this regard because when no slaves are
enabled all output packets are blackholed anyway.
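
The delay rules above can be sketched as a small state machine. This
is only an illustration: struct slave, its fields, and both function
names are hypothetical and are not taken from vswitchd/bridge.c, and
times are assumed to be in milliseconds.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-slave state (not the bridge.c layout). */
struct slave {
    bool enabled;         /* May this slave carry bond traffic? */
    bool carrier;         /* Most recent carrier state. */
    long long change_at;  /* Time (ms) at which carrier last changed. */
};

/* Records a carrier change at time 'now'.  If carrier came up while no
 * slave in the bond is enabled, the slave is enabled at once and the
 * updelay is skipped. */
void
slave_carrier_changed(struct slave *s, bool carrier, long long now,
                      bool any_slave_enabled)
{
    if (s->carrier == carrier) {
        return;
    }
    s->carrier = carrier;
    s->change_at = now;
    if (carrier && !any_slave_enabled) {
        s->enabled = true;
    }
}

/* Periodic check: flips 'enabled' once carrier has held its new state
 * for longer than the updelay (coming up) or downdelay (going down). */
void
slave_run(struct slave *s, long long now, int updelay, int downdelay)
{
    if (s->carrier != s->enabled) {
        int delay = s->carrier ? updelay : downdelay;
        if (now - s->change_at > delay) {
            s->enabled = s->carrier;
        }
    }
}
```

With an updelay of 1000 ms, a slave whose carrier comes up at t=1000
is still disabled at t=1500 and becomes enabled on the first check
after t=2000, unless it is the only candidate, in which case it is
enabled immediately.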

When a slave becomes disabled, the vswitch immediately chooses a new
output port for traffic that was destined for that slave (see
bond_enable_slave()). It also sends a "gratuitous learning packet" on
the bond port (on the newly chosen slave) for each MAC address that
the vswitch has learned on a port other than the bond (see
bond_send_learning_packets()), to teach the physical switch that the
new slave should be used in place of the one that is now disabled.
(This behavior probably makes sense only for a vswitch that has only
one port (the bond) connected to a physical switch; vswitchd should
probably provide a way to disable or configure it in other scenarios.)

Bonding accepts unicast packets on any bond slave. This can
occasionally cause packet duplication for the first few packets sent
to a given MAC, if the physical switch attached to the bond is
flooding packets to that MAC because it has not yet learned the
correct slave for that MAC.

Bonding accepts multicast (and broadcast) packets only on a single
bond slave (the "active slave") at any given time. Multicast packets
received on other slaves are dropped. Otherwise, every multicast
packet would be duplicated, once for every bond slave, because the
physical switch attached to the bond will flood those packets.

Bonding also drops received packets when the vswitch has learned that
the packet's source MAC is on a port other than the bond port itself.
This is because it is likely that the vswitch itself sent the packet
out the bond port on a different slave and is now receiving the packet
back. This occurs when the packet is multicast or the physical switch
has not yet learned the MAC and is flooding it. However, the vswitch
makes an exception to this rule for broadcast ARP replies, which
indicate that the MAC has moved to another switch, probably due to VM
migration. (ARP replies are normally unicast, so this exception does
not match normal ARP replies. It will match the learning packets sent
on bond

The active slave is simply the first slave to be enabled after the
bond is created (see bond_choose_active_iface()). If the active slave
is disabled, then a new active slave is chosen among the slaves that
remain enabled. Currently, due to the way that configuration works,
this tends to be the remaining slave whose interface name is first
alphabetically, but this is by no means guaranteed.
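
The three input rules above can be condensed into one predicate. The
following is a sketch, not the bridge.c code: struct rx_packet and
bond_admit() are hypothetical names, and the order in which the checks
run is an assumption (the text does not say whether a broadcast ARP
reply on a non-active slave is accepted; here it is dropped by the
multicast rule).

```c
#include <stdbool.h>

enum verdict { ACCEPT, DROP };

/* Hypothetical summary of the facts the rules depend on. */
struct rx_packet {
    bool dst_is_multicast;        /* Multicast or broadcast destination. */
    bool is_broadcast_arp_reply;  /* ARP reply sent to broadcast. */
    bool on_active_slave;         /* Received on the active slave? */
    bool src_learned_elsewhere;   /* Source MAC learned on a non-bond port. */
};

/* Decides whether a packet received on a bond slave is admitted. */
enum verdict
bond_admit(const struct rx_packet *p)
{
    /* Multicast and broadcast are accepted on the active slave only. */
    if (p->dst_is_multicast && !p->on_active_slave) {
        return DROP;
    }
    /* Probably our own packet reflected back by the physical switch,
     * unless it is a broadcast ARP reply announcing a moved MAC. */
    if (p->src_learned_elsewhere && !p->is_broadcast_arp_reply) {
        return DROP;
    }
    return ACCEPT;  /* Unicast is accepted on any slave. */
}
```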

When a packet is sent out a bond port, the bond slave actually used is
selected based on the packet's source MAC and VLAN tag (see
choose_output_iface()). In particular, the source MAC and VLAN tag
are hashed into one of 256 values, and that value is looked up in a
hash table (the "bond hash") kept in the "bond_hash" member of struct
port. The hash table entry identifies a bond slave. If no bond slave
has yet been chosen for that hash table entry, vswitchd chooses one
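
A minimal sketch of the lookup described above follows. The hash
function, struct bond, and the round-robin choice for an unassigned
entry are all assumptions made for illustration; the hash and the
selection policy that vswitchd really uses are not shown here.

```c
#include <stddef.h>
#include <stdint.h>

#define BOND_HASH_SIZE 256
#define NO_SLAVE (-1)

/* Hypothetical stand-in for the "bond_hash" table in struct port. */
struct bond {
    int hash_to_slave[BOND_HASH_SIZE];  /* NO_SLAVE until assigned. */
    int n_slaves;
    int next;  /* Next slave to hand out (illustrative policy only). */
};

/* Illustrative hash of source MAC and VLAN into 0..255. */
static unsigned int
hash_mac_vlan(const uint8_t mac[6], uint16_t vlan)
{
    uint32_t h = vlan;
    for (size_t i = 0; i < 6; i++) {
        h = h * 31 + mac[i];
    }
    return h % BOND_HASH_SIZE;
}

void
bond_init(struct bond *b, int n_slaves)
{
    for (size_t i = 0; i < BOND_HASH_SIZE; i++) {
        b->hash_to_slave[i] = NO_SLAVE;
    }
    b->n_slaves = n_slaves;
    b->next = 0;
}

/* Returns the slave index for a packet, assigning one on first use. */
int
bond_choose_slave(struct bond *b, const uint8_t mac[6], uint16_t vlan)
{
    unsigned int idx = hash_mac_vlan(mac, vlan);
    if (b->hash_to_slave[idx] == NO_SLAVE) {
        b->hash_to_slave[idx] = b->next;  /* Assumed: round-robin. */
        b->next = (b->next + 1) % b->n_slaves;
    }
    return b->hash_to_slave[idx];
}
```

Because the table is indexed by hash rather than by MAC, all sources
that fall into the same entry share a slave, and moving one entry
moves all of them together.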

Every 10 seconds, vswitchd rebalances the bond slaves (see
bond_rebalance_port()). To rebalance, vswitchd examines the
statistics for the number of bytes transmitted by each slave over
approximately the past minute, with data sent more recently weighted
more heavily than data sent less recently. It considers each of the
slaves in order from most-loaded to least-loaded. If highly loaded
slave H is significantly more heavily loaded than the least-loaded
slave L, and slave H carries at least two hashes, then vswitchd shifts
one of H's hashes to L. However, vswitchd will only shift a hash from
H to L if it will decrease the ratio of the load between H and L by at

Currently, "significantly more loaded" means that H must carry at
least 1 Mbps more traffic, and that traffic must be at least 3%
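
The bookkeeping behind a rebalancing pass might look like the sketch
below. Everything in it is hypothetical: halving the old total on each
run is just one way to weight recent traffic more heavily, and the
"significantly more loaded" test is reduced to a single caller-supplied
threshold (min_excess) rather than the exact criteria used by
bond_rebalance_port().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-slave statistics. */
struct slave_load {
    int id;
    uint64_t load;  /* Decayed count of transmitted bytes. */
    int n_hashes;   /* Bond hash entries currently assigned. */
};

/* Folds in bytes sent since the last run.  Halving the old total makes
 * recent traffic count more (assumed decay factor). */
void
load_update(struct slave_load *s, uint64_t new_bytes)
{
    s->load = s->load / 2 + new_bytes;
}

/* qsort() comparator: most-loaded slave first. */
static int
cmp_load_desc(const void *a_, const void *b_)
{
    const struct slave_load *a = a_;
    const struct slave_load *b = b_;
    return (a->load < b->load) - (a->load > b->load);
}

void
sort_by_load(struct slave_load *slaves, size_t n)
{
    qsort(slaves, n, sizeof *slaves, cmp_load_desc);
}

/* True if H may give one of its hashes to L: H must carry at least two
 * hashes and exceed L's load by at least 'min_excess'. */
bool
may_shift(const struct slave_load *h, const struct slave_load *l,
          uint64_t min_excess)
{
    return h->n_hashes >= 2
           && h->load > l->load
           && h->load - l->load >= min_excess;
}
```

Requiring at least two hashes on H matters because moving H's only
hash would simply relocate the whole load to L instead of spreading it.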

Each bond balancing mode has different considerations, described
below.

LACP bonding requires the remote switch to implement LACP, but it is
otherwise very simple in that, after LACP negotiation is complete,
there is no need for special handling of received packets.

SLB bonding allows a limited form of load balancing without the remote
switch's knowledge or cooperation. The basics of SLB are simple. SLB
assigns each source MAC+VLAN pair to a link and transmits all packets
from that MAC+VLAN through that link. Learning in the remote switch
causes it to send packets to that MAC+VLAN through the same link.

SLB bonding has the following complications:

0. When the remote switch has not learned the MAC for the
   destination of a unicast packet and hence floods the packet to
   all of the links on the SLB bond, Open vSwitch will forward
   duplicate packets, one per link, to each other switch port.

   Open vSwitch does not solve this problem.

1. When the remote switch receives a multicast or broadcast packet
   from a port not on the SLB bond, it will forward it to all of
   the links in the SLB bond. This would cause packet duplication
   if not handled specially.

   Open vSwitch avoids packet duplication by accepting multicast
   and broadcast packets on only the active slave, and dropping
   multicast and broadcast packets on all other slaves.

2. When Open vSwitch forwards a multicast or broadcast packet to a
   link in the SLB bond other than the active slave, the remote
   switch will forward it to all of the other links in the SLB
   bond, including the active slave. Without special handling,
   this would mean that Open vSwitch would forward a second copy of
   the packet to each switch port (other than the bond), including
   the port that originated the packet.

   Open vSwitch deals with this case by dropping packets received
   on any SLB bonded link that have a source MAC+VLAN that has been
   learned on any other port. (This means that SLB as implemented
   in Open vSwitch relies critically on MAC learning. Notably, SLB
   is incompatible with the "flood_vlans" feature.)

3. Suppose that a MAC+VLAN moves to an SLB bond from another port
   (e.g. when a VM is migrated from this hypervisor to a different
   one). Without additional special handling, Open vSwitch will
   not notice until the MAC learning entry expires, up to 60
   seconds later, as a consequence of rule #2.

   Open vSwitch avoids a 60-second delay by listening for
   gratuitous ARPs, which VMs commonly emit upon migration. As an
   exception to rule #2, a gratuitous ARP received on an SLB bond
   is not dropped and updates the MAC learning table in the usual
   way. (If a move does not trigger a gratuitous ARP, or if the
   gratuitous ARP is lost in the network, then a 60-second delay
   still occurs.)

4. Suppose that a MAC+VLAN moves from an SLB bond to another port
   (e.g. when a VM is migrated from a different hypervisor to this
   one), that the MAC+VLAN emits a gratuitous ARP, and that Open
   vSwitch forwards that gratuitous ARP to a link in the SLB bond
   other than the active slave. The remote switch will forward the
   gratuitous ARP to all of the other links in the SLB bond,
   including the active slave. Without additional special
   handling, this would mean that Open vSwitch would learn that the
   MAC+VLAN was located on the SLB bond, as a consequence of rule

   Open vSwitch avoids this problem by "locking" the MAC learning
   table entry for a MAC+VLAN from which a gratuitous ARP was
   received on a port that is not an SLB bond. For 5 seconds, a
   locked MAC learning table entry will not be updated based on a
   gratuitous ARP received on an SLB bond.
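
Rules 1 through 4 interact closely, so a compact sketch may help. The
types and functions below are hypothetical and track a single MAC+VLAN
entry only. One detail is assumed: a gratuitous ARP that hits a locked
entry is dropped here, whereas the text says only that the entry is
not updated.

```c
#include <stdbool.h>

#define GRAT_ARP_LOCK_MS 5000  /* Lock duration from rule 4. */

/* Hypothetical MAC learning entry for one MAC+VLAN. */
struct mac_entry {
    bool valid;               /* Has this MAC+VLAN been learned? */
    bool on_bond;             /* Learned on the SLB bond? */
    long long grat_arp_lock;  /* Entry is locked until this time (ms). */
};

enum slb_verdict { SLB_ACCEPT, SLB_DROP };

/* Handles a packet received on an SLB bond link at time 'now'. */
enum slb_verdict
slb_bond_rx(struct mac_entry *e, bool is_mcast, bool on_active_slave,
            bool is_grat_arp, long long now)
{
    if (is_mcast && !on_active_slave) {
        return SLB_DROP;            /* Rule 1. */
    }
    if (e->valid && !e->on_bond) {
        if (!is_grat_arp) {
            return SLB_DROP;        /* Rule 2: reflected packet. */
        }
        if (now < e->grat_arp_lock) {
            return SLB_DROP;        /* Rule 4: entry is locked. */
        }
        /* Rule 3: an unlocked gratuitous ARP falls through to learning. */
    }
    e->valid = true;
    e->on_bond = true;
    return SLB_ACCEPT;
}

/* Handles a packet received on a port that is not an SLB bond: learn
 * the MAC+VLAN there and lock the entry if it was a gratuitous ARP. */
void
other_port_rx(struct mac_entry *e, bool is_grat_arp, long long now)
{
    e->valid = true;
    e->on_bond = false;
    if (is_grat_arp) {
        e->grat_arp_lock = now + GRAT_ARP_LOCK_MS;
    }
}
```

For example, if a VM arrives locally and emits a gratuitous ARP at
t=1000, the copy reflected back over the bond at t=2000 is ignored,
while a gratuitous ARP arriving on the bond at t=7000, after the lock
has expired, moves the entry to the bond.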