Definitions for Open Flow 1.2

[openvswitch] / DESIGN
diff --git a/DESIGN b/DESIGN

index 56e2605321ec215e1dea519f4a42d3307312e15d..ec29e3955767d16526e078dd81c7da87d3de7512 100644 (file)
--- a/DESIGN
+++ b/DESIGN
@@ -9,13 +9,172 @@ successful deployment.  The end of this document contains contact
  information that can be used to let us know how we can make Open vSwitch
  more generally useful.
  
+Asynchronous Messages
+=====================
+
+Over time, Open vSwitch has added many knobs that control whether a
+given controller receives OpenFlow asynchronous messages.  This
+section describes how all of these features interact.
+
+First, a service controller never receives any asynchronous messages
+unless it explicitly configures a miss_send_len greater than zero with
+an OFPT_SET_CONFIG message.
+
+Second, OFPT_FLOW_REMOVED and NXT_FLOW_REMOVED messages are generated
+only if the flow that was removed had the OFPFF_SEND_FLOW_REM flag
+set.
+
+Third, OFPT_PACKET_IN and NXT_PACKET_IN messages are sent only to
+OpenFlow controller connections that have the correct connection ID
+(see "struct nx_controller_id" and "struct nx_action_controller"):
+
+    - For packet-in messages generated by a NXAST_CONTROLLER action,
+      the controller ID specified in the action.
+
+    - For other packet-in messages, controller ID zero.  (This is the
+      default ID when an OpenFlow controller does not configure one.)
+
+Finally, Open vSwitch consults a per-connection table indexed by the
+message type, reason code, and current role.  The following table
+shows how this table is initialized by default when an OpenFlow
+connection is made.  An entry labeled "yes" means that the message is
+sent, an entry labeled "---" means that the message is suppressed.
+
+                                             master/
+  message and reason code                     other     slave
+  ----------------------------------------   -------    -----
+  OFPT_PACKET_IN / NXT_PACKET_IN
+    OFPR_NO_MATCH                              yes       ---
+    OFPR_ACTION                                yes       ---
+    OFPR_INVALID_TTL                           ---       ---
+
+  OFPT_FLOW_REMOVED / NXT_FLOW_REMOVED
+    OFPRR_IDLE_TIMEOUT                         yes       ---
+    OFPRR_HARD_TIMEOUT                         yes       ---
+    OFPRR_DELETE                               yes       ---
+
+  OFPT_PORT_STATUS
+    OFPPR_ADD                                  yes       yes
+    OFPPR_DELETE                               yes       yes
+    OFPPR_MODIFY                               yes       yes
+
+The NXT_SET_ASYNC_CONFIG message directly sets all of the values in
+this table for the current connection.  The
+OFPC_INVALID_TTL_TO_CONTROLLER bit in the OFPT_SET_CONFIG message
+controls the setting for OFPR_INVALID_TTL for the "master" role.
+
+
+OFPAT_ENQUEUE
+=============
+
+The OpenFlow 1.0 specification requires the output port of the OFPAT_ENQUEUE
+action to "refer to a valid physical port (i.e. < OFPP_MAX) or OFPP_IN_PORT".
+Although OFPP_LOCAL is not less than OFPP_MAX, it is an 'internal' port which
+can have QoS applied to it in Linux.  Since we allow the OFPAT_ENQUEUE to apply
+to 'internal' ports whose port numbers are less than OFPP_MAX, we interpret
+OFPP_LOCAL as a physical port and support OFPAT_ENQUEUE on it as well.
+
+
+OFPT_FLOW_MOD
+=============
+
+The OpenFlow 1.0 specification for the behavior of OFPT_FLOW_MOD is
+confusing.  The following table summarizes the Open vSwitch
+implementation of its behavior in the following categories:
+
+    - "match on priority": Whether the flow_mod acts only on flows
+      whose priority matches that included in the flow_mod message.
+
+    - "match on out_port": Whether the flow_mod acts only on flows
+      that output to the out_port included in the flow_mod message (if
+      out_port is not OFPP_NONE).
+
+    - "updates flow_cookie": Whether the flow_mod changes the
+      flow_cookie of the flow or flows that it matches to the
+      flow_cookie included in the flow_mod message.
+
+    - "updates OFPFF_ flags": Whether the flow_mod changes the
+      OFPFF_SEND_FLOW_REM flag of the flow or flows that it matches to
+      the setting included in the flags of the flow_mod message.
+
+    - "honors OFPFF_CHECK_OVERLAP": Whether the OFPFF_CHECK_OVERLAP
+      flag in the flow_mod is significant.
+
+    - "updates idle_timeout" and "updates hard_timeout": Whether the
+      idle_timeout and hard_timeout in the flow_mod, respectively,
+      have an effect on the flow or flows matched by the flow_mod.
+
+    - "updates idle timer": Whether the flow_mod resets the per-flow
+      timer that measures how long a flow has been idle.
+
+    - "updates hard timer": Whether the flow_mod resets the per-flow
+      timer that measures how long it has been since a flow was
+      modified.
+
+    - "zeros counters": Whether the flow_mod resets per-flow packet
+      and byte counters to zero.
+
+    - "sends flow_removed message": Whether the flow_mod generates a
+      flow_removed message for the flow or flows that it affects.
+
+An entry labeled "yes" means that the flow mod type does have the
+indicated behavior, "---" means that it does not, an empty cell means
+that the property is not applicable, and other values are explained
+below the table.
+
+                                          MODIFY          DELETE
+                             ADD  MODIFY  STRICT  DELETE  STRICT
+                             ===  ======  ======  ======  ======
+match on priority            ---    ---     yes     ---     yes
+match on out_port            ---    ---     ---     yes     yes
+updates flow_cookie          yes    yes     yes
+updates OFPFF_SEND_FLOW_REM  yes     +       +
+honors OFPFF_CHECK_OVERLAP   yes     +       +
+updates idle_timeout         yes     +       +
+updates hard_timeout         yes     +       +
+resets idle timer            yes     +       +
+resets hard timer            yes    yes     yes
+zeros counters               yes     +       +
+sends flow_removed message   ---    ---     ---      %       %
+
+(+) "modify" and "modify-strict" only take these actions when they
+    create a new flow, not when they update an existing flow.
+
+(%) "delete" and "delete_strict" generates a flow_removed message if
+    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
+    (Each controller can separately control whether it wants to
+    receive the generated messages.)
+
+
+Multiple Table Support
+======================
+
+OpenFlow 1.0 has only rudimentary support for multiple flow tables.
+Notably, OpenFlow 1.0 does not allow the controller to specify the
+flow table to which a flow is to be added.  Open vSwitch adds an
+extension for this purpose, which is enabled on a per-OpenFlow
+connection basis using the NXT_FLOW_MOD_TABLE_ID message.  When the
+extension is enabled, the upper 8 bits of the 'command' member in an
+OFPT_FLOW_MOD or NXT_FLOW_MOD message designates the table to which a
+flow is to be added.
+
+The Open vSwitch software switch implementation offers 255 flow
+tables.  On packet ingress, only the first flow table (table 0) is
+searched, and the contents of the remaining tables are not considered
+in any way.  Tables other than table 0 only come into play when an
+NXAST_RESUBMIT_TABLE action specifies another table to search.
+
+Tables 128 and above are reserved for use by the switch itself.
+Controllers should use only tables 0 through 127.
+
  
  IPv6
  ====
  
  Open vSwitch supports stateless handling of IPv6 packets.  Flows can be
  written to support matching TCP, UDP, and ICMPv6 headers within an IPv6
-packet.
+packet.  Deeper matching of some Neighbor Discovery messages is also
+supported.
  
  IPv6 was not designed to interact well with middle-boxes.  This,
  combined with Open vSwitch's stateless nature, have affected the
@@ -70,6 +229,169 @@ nodes that do not connect to link with such large MTUs.  Currently, Open
  vSwitch doesn't process jumbograms.
  
  
+In-Band Control
+===============
+
+In-band control allows a single network to be used for OpenFlow traffic and
+other data traffic.  See ovs-vswitchd.conf.db(5) for a description of
+configuring in-band control.
+
+This comment is an attempt to describe how in-band control works at a
+wire- and implementation-level.  Correctly implementing in-band
+control has proven difficult due to its many subtleties, and has thus
+gone through many iterations.  Please read through and understand the
+reasoning behind the chosen rules before making modifications.
+
+In Open vSwitch, in-band control is implemented as "hidden" flows (in that
+they are not visible through OpenFlow) and at a higher priority than
+wildcarded flows can be set up by through OpenFlow.  This is done so that
+the OpenFlow controller cannot interfere with them and possibly break
+connectivity with its switches.  It is possible to see all flows, including
+in-band ones, with the ovs-appctl "bridge/dump-flows" command.
+
+The Open vSwitch implementation of in-band control can hide traffic to
+arbitrary "remotes", where each remote is one TCP port on one IP address.
+Currently the remotes are automatically configured as the in-band OpenFlow
+controllers plus the OVSDB managers, if any.  (The latter is a requirement
+because OVSDB managers are responsible for configuring OpenFlow controllers,
+so if the manager cannot be reached then OpenFlow cannot be reconfigured.)
+
+The following rules (with the OFPP_NORMAL action) are set up on any bridge
+that has any remotes:
+
+   (a) DHCP requests sent from the local port.
+   (b) ARP replies to the local port's MAC address.
+   (c) ARP requests from the local port's MAC address.
+
+In-band also sets up the following rules for each unique next-hop MAC
+address for the remotes' IPs (the "next hop" is either the remote
+itself, if it is on a local subnet, or the gateway to reach the remote):
+
+   (d) ARP replies to the next hop's MAC address.
+   (e) ARP requests from the next hop's MAC address.
+
+In-band also sets up the following rules for each unique remote IP address:
+
+   (f) ARP replies containing the remote's IP address as a target.
+   (g) ARP requests containing the remote's IP address as a source.
+
+In-band also sets up the following rules for each unique remote (IP,port)
+pair:
+
+   (h) TCP traffic to the remote's IP and port.
+   (i) TCP traffic from the remote's IP and port.
+
+The goal of these rules is to be as narrow as possible to allow a
+switch to join a network and be able to communicate with the
+remotes.  As mentioned earlier, these rules have higher priority
+than the controller's rules, so if they are too broad, they may
+prevent the controller from implementing its policy.  As such,
+in-band actively monitors some aspects of flow and packet processing
+so that the rules can be made more precise.
+
+In-band control monitors attempts to add flows into the datapath that
+could interfere with its duties.  The datapath only allows exact
+match entries, so in-band control is able to be very precise about
+the flows it prevents.  Flows that miss in the datapath are sent to
+userspace to be processed, so preventing these flows from being
+cached in the "fast path" does not affect correctness.  The only type
+of flow that is currently prevented is one that would prevent DHCP
+replies from being seen by the local port.  For example, a rule that
+forwarded all DHCP traffic to the controller would not be allowed,
+but one that forwarded to all ports (including the local port) would.
+
+As mentioned earlier, packets that miss in the datapath are sent to
+the userspace for processing.  The userspace has its own flow table,
+the "classifier", so in-band checks whether any special processing
+is needed before the classifier is consulted.  If a packet is a DHCP
+response to a request from the local port, the packet is forwarded to
+the local port, regardless of the flow table.  Note that this requires
+L7 processing of DHCP replies to determine whether the 'chaddr' field
+matches the MAC address of the local port.
+
+It is interesting to note that for an L3-based in-band control
+mechanism, the majority of rules are devoted to ARP traffic.  At first
+glance, some of these rules appear redundant.  However, each serves an
+important role.  First, in order to determine the MAC address of the
+remote side (controller or gateway) for other ARP rules, we must allow
+ARP traffic for our local port with rules (b) and (c).  If we are
+between a switch and its connection to the remote, we have to
+allow the other switch's ARP traffic to through.  This is done with
+rules (d) and (e), since we do not know the addresses of the other
+switches a priori, but do know the remote's or gateway's.  Finally,
+if the remote is running in a local guest VM that is not reached
+through the local port, the switch that is connected to the VM must
+allow ARP traffic based on the remote's IP address, since it will
+not know the MAC address of the local port that is sending the traffic
+or the MAC address of the remote in the guest VM.
+
+With a few notable exceptions below, in-band should work in most
+network setups.  The following are considered "supported' in the
+current implementation:
+
+   - Locally Connected.  The switch and remote are on the same
+     subnet.  This uses rules (a), (b), (c), (h), and (i).
+
+   - Reached through Gateway.  The switch and remote are on
+     different subnets and must go through a gateway.  This uses
+     rules (a), (b), (c), (h), and (i).
+
+   - Between Switch and Remote.  This switch is between another
+     switch and the remote, and we want to allow the other
+     switch's traffic through.  This uses rules (d), (e), (h), and
+     (i).  It uses (b) and (c) indirectly in order to know the MAC
+     address for rules (d) and (e).  Note that DHCP for the other
+     switch will not work unless an OpenFlow controller explicitly lets this
+     switch pass the traffic.
+
+   - Between Switch and Gateway.  This switch is between another
+     switch and the gateway, and we want to allow the other switch's
+     traffic through.  This uses the same rules and logic as the
+     "Between Switch and Remote" configuration described earlier.
+
+   - Remote on Local VM.  The remote is a guest VM on the
+     system running in-band control.  This uses rules (a), (b), (c),
+     (h), and (i).
+
+   - Remote on Local VM with Different Networks.  The remote
+     is a guest VM on the system running in-band control, but the
+     local port is not used to connect to the remote.  For
+     example, an IP address is configured on eth0 of the switch.  The
+     remote's VM is connected through eth1 of the switch, but an
+     IP address has not been configured for that port on the switch.
+     As such, the switch will use eth0 to connect to the remote,
+     and eth1's rules about the local port will not work.  In the
+     example, the switch attached to eth0 would use rules (a), (b),
+     (c), (h), and (i) on eth0.  The switch attached to eth1 would use
+     rules (f), (g), (h), and (i).
+
+The following are explicitly *not* supported by in-band control:
+
+   - Specify Remote by Name.  Currently, the remote must be
+     identified by IP address.  A naive approach would be to permit
+     all DNS traffic.  Unfortunately, this would prevent the
+     controller from defining any policy over DNS.  Since switches
+     that are located behind us need to connect to the remote,
+     in-band cannot simply add a rule that allows DNS traffic from
+     the local port.  The "correct" way to support this is to parse
+     DNS requests to allow all traffic related to a request for the
+     remote's name through.  Due to the potential security
+     problems and amount of processing, we decided to hold off for
+     the time-being.
+
+   - Differing Remotes for Switches.  All switches must know
+     the L3 addresses for all the remotes that other switches
+     may use, since rules need to be set up to allow traffic related
+     to those remotes through.  See rules (f), (g), (h), and (i).
+
+   - Differing Routes for Switches.  In order for the switch to
+     allow other switches to connect to a remote through a
+     gateway, it allows the gateway's traffic through with rules (d)
+     and (e).  If the routes to the remote differ for the two
+     switches, we will not know the MAC address of the alternate
+     gateway.
+
+
  Suggestions
  ===========