E810 Eswitch Switchdev Mode TechConfigGuide - Rev1.0
August 2021
Revision 1.0
645272-001
E810 eSwitch switchdev Mode
Technology and Configuration Guide
Revision History
2.0 Introduction
The Intel® Ethernet 800 Series is the next generation of Intel Ethernet Controllers and Network
Adapters. The 800 Series is designed with an enhanced programmable pipeline, allowing deeper and
more diverse protocol header processing. It has capabilities like intelligent offloads to enable high
performance, I/O virtualization for max performance in a virtualized server, and Intel® Ethernet
Adaptive Virtual Function (Intel® Ethernet AVF) to ease SR-IOV migration to future Intel Ethernet
products.
This document describes a new feature introduced in 800 Series to support a new switchdev mode for
the controller's embedded switch (eSwitch). This feature allows the adapters to support hardware assist
for virtual switch (OVS, Linux Bridge) filters.
This document describes:
• The theory of the eSwitch in both legacy and switchdev modes.
• The advantages of switchdev mode.
• Some limitations with switchdev mode.
• How to configure switchdev mode on 800 Series Network Adapters, including OVS filter
configuration.
Note: For configuration details, refer to Appendix A, “Sample Scripts”.
4.1 Overview
Modern network controllers have a complex embedded switch, referred to as an eSwitch or a Virtual
Ethernet Bridge (VEB). This eSwitch is configured and controlled by the Physical Function driver (PF
driver or LAN driver) per-port on the Ethernet controller.
Starting with the Linux ice PF driver 1.5.8 and NVM 2.50 on 800 Series Ethernet Network Controllers,
the eSwitch can be configured per-PF into one of two states via a devlink interface:
• eSwitch Legacy mode (default)
• eSwitch switchdev mode
In the default state (referred to as Legacy mode), the PF driver has limited control of the VFs attached
to it. In Legacy mode, SR-IOV-based VMs or containers bypass the hypervisor and the software virtual
switch (OVS or Linux Bridge) completely. By default, the PF driver adds only MAC and VLAN filters to
the VFs at the hardware level. All other configuration is added through software (either through the
Linux Bridge or OVS) and does not take advantage of the hardware switching capabilities of the
Ethernet controller.
In switchdev mode, the PF driver supports a standard Linux kernel abstraction layer (switchdev) to
expose the control plane of the Ethernet controller's eSwitch to the software vSwitch (OVS/Linux
Bridge). Each port attached to the eSwitch is represented as a Port Representor (PR) netdev that
provides a hook to configure and control VM guest or container VF network interfaces. This enables
limited vSwitch functionality offload to the Ethernet controller to allow for OVS/Linux Bridge hardware
assist support.
Section 4.1.2 explains more about the eSwitch in switchdev mode on an 800 Series Ethernet Network
Controller: how to switch between eSwitch modes, how traffic is routed, and which filters can be
offloaded (limited OVS acceleration support).
To offload a data forwarding path (software vSwitch flows) from the kernel to Ethernet controller
hardware (eSwitch), the bridge Forwarding Database (FDB) entries are mirrored down to the Ethernet
controller. By default, the hardware FDB has entries for {MAC, VLAN, PORT} tuples.
With switchdev mode, the 800 Series Ethernet Network Controller PF driver also supports programming
switch rules on L3/L4 tuples, so FDB entries with L3/L4 tuples can be offloaded. Thus, L2, L3, and L4
header fields can all be used for hardware switching.
For example, when a route is added in OVS, a function calls the Intel switchdev driver, which then
determines whether this route needs to be offloaded to the hardware. Routes that do not involve the
eSwitch are typically not offloaded.
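As an illustration of a rule that combines L2, L3, and L4 header fields, the following sketch composes a tc flower command and prints it rather than running it. The function name build_flower_rule, the interface names, and the addresses are assumptions chosen for illustration. Whether a particular combination of match fields is accepted for offload depends on the driver and firmware in use.

```shell
#!/bin/bash
# Sketch: compose a tc flower rule that matches on L2, L3, and L4 fields
# and redirects matching traffic to a VF Port Representor.
# All names and addresses here are illustrative assumptions.
# The ingress qdisc must already exist: tc qdisc add dev <netdev> ingress

build_flower_rule() {
    # $1 = ingress netdev, $2 = destination MAC, $3 = destination IP,
    # $4 = TCP destination port, $5 = netdev to redirect to
    echo "tc filter add dev $1 ingress protocol ip flower skip_sw" \
         "dst_mac $2 dst_ip $3 ip_proto tcp dst_port $4" \
         "action mirred egress redirect dev $5"
}

# Print the command for review; paste it into a root shell to apply it.
build_flower_rule ens4f0 52:54:00:00:16:01 192.168.66.161 1234 eth0
```

Printing the command first makes it easy to review the match fields before anything is programmed into the eSwitch.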
The following is a summary of Port Representor capabilities:
• Netdevs for all switching ports created on an eSwitch in switchdev mode:
— Uplink ports (PF_PR)
— Virtual ports (VF_PR)
• PR reflects the originating item's link state.
• PR should not be used to control Receive Side Scaling (RSS) input set or Intel® Ethernet Flow
Director (Intel® Ethernet FD) settings for the VF.
• Supports default/exception paths.
• When packets are sent from the PR netdev, they are source routed directly to the port.
• Scope is limited to single Uplink port in one switching domain/switchdev instance.
Figure 2 shows the configuration of a single port in switchdev mode.
VF Port Representor:
The admin can rename the VF Port Representors (for example, to VF_PR) so that each one is identifiable
as the Port Representor of a specific VF. Each VF must have a unique Port Representor. It is
recommended to use the corresponding VF_PR for any VF-related configuration.
For example, to limit a VF's interrupt rate for Rx and Tx (“Bounding interrupt rates using
rx-usecs-high”), the admin applies the command using the VF_PR, as shown here:
ethtool -C $<VF_PR> rx-usecs-high $<interrupt rate cap>
5.1.1 Overview
Consider the following important notes:
• Intel® Ethernet 800 Series Network Adapters support configuration of switchdev mode
independently per physical port. This means that some ports can be in eSwitch Legacy mode, while
others can be configured in eSwitch switchdev mode.
• The eSwitch mode (switchdev or legacy) can be changed without a reboot. Mode change commands
take effect without a reboot or driver reload.
• While in switchdev mode, the following configurations are not supported:
— ADQ on the PF or VF
— Link Aggregation on the PF
— L2 forwarding on the PF
— Trusted VFs (and “ip link” commands to configure a VF as a trusted VF)
• RDMA and SR-IOV are both supported in switchdev mode, but RDMA must be enabled with a
specific sequence:
1. Change eSwitch mode to switchdev.
2. Create SR-IOV VFs.
3. Enable RDMA (load iRDMA driver).
Similarly, to remove the interface from the bridge or MAC/VLAN while RDMA is active, you
must follow this exact sequence of steps:
1. Remove RDMA if it is active (rmmod iRDMA driver).
2. Destroy SR-IOV VFs if they exist.
3. Remove the interface from the bridge or MAC/VLAN.
4. Reactivate RDMA and recreate SR-IOV VFs as needed.
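The two sequences above can be sketched as a dry-run script that prints each command in the required order. This is a minimal sketch under assumptions: the interface name, PCI address, VF count, and the irdma module name are example values, and the run helper and both function names are hypothetical. Set DRY_RUN=0 only after adapting the values to the local system.

```shell
#!/bin/bash
# Sketch: RDMA + SR-IOV ordering in switchdev mode.
# Assumed example values; adjust for the local system.
PF1=${PF1:-ens4f0}
PF1_PCI=${PF1_PCI:-pci/0000:af:00.0}
NUM_VFS=${NUM_VFS:-2}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = also execute them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

enable_rdma_in_switchdev() {
    # 1. Change eSwitch mode to switchdev.
    run devlink dev eswitch set "$PF1_PCI" mode switchdev
    # 2. Create SR-IOV VFs.
    run sh -c "echo $NUM_VFS > /sys/class/net/$PF1/device/sriov_numvfs"
    # 3. Enable RDMA (module name assumed to be irdma).
    run modprobe irdma
}

remove_from_bridge_with_rdma() {
    # 1. Remove RDMA if it is active.
    run rmmod irdma
    # 2. Destroy SR-IOV VFs if they exist.
    run sh -c "echo 0 > /sys/class/net/$PF1/device/sriov_numvfs"
    # 3. Remove the interface from the bridge.
    run ip link set "$PF1" nomaster
    # 4. Reactivate RDMA and recreate VFs afterwards, as needed.
}

enable_rdma_in_switchdev
```

Keeping each sequence in one function makes it harder to run the steps out of order by hand.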
$PF1                  The physical interface (LAN port). The eSwitch mode of this port is set to
                      switchdev. This is also referred to as the Uplink Port.
$VF1, $VF2            Two SR-IOV VFs associated with $PF1.
$VF1_PCI, $VF2_PCI    PCI addresses for $VF1 and $VF2 associated with $PF1.
$VF1_MAC, $VF2_MAC    MAC addresses assigned to $VF1 and $VF2.
$VF1_IP, $VF2_IP      IP addresses assigned to $VF1 and $VF2.
$VF1_PR, $VF2_PR      Port Representors for $VF1 and $VF2.
$MASK                 Subnet mask for the IP addresses of $PF1, $VF1_PR, and $VF2_PR.
Note: When the PF driver is already in switchdev mode, each VF that is attached to the PF
has a corresponding VF_PR netdev. When the VF is removed, the corresponding
VF_PR netdev is automatically removed as well.
3. Create the software vSwitch (Linux Bridge or OVS) and add the PF interface as an uplink to the
vSwitch.
Note: This must be done before the PF is set to switchdev mode.
Bridge example:
ip link add $<BR> type bridge
ip link set $<PF1> master $<BR>
OVS example:
ovs-vsctl add-br $<BR>
ovs-vsctl add-port $<BR> $<PF1>
ovs-vsctl show
4. Use the Linux devlink API to change the eSwitch mode of the PF PCI device to switchdev or legacy
mode.
Here, PF1_PCI is pci/0000:xx:xx.x which is the address of PCI device $PF1. It can be found by:
lspci -D | grep $<PF1>
or
ethtool -i $<PF1>
Note: By default, the 800 Series driver always starts with all interfaces in Legacy mode. eSwitch
switchdev settings and related filters persist with PF interface resets, but not driver reloads or
system reboots. To persist switchdev mode settings between reboots, create a script to apply
the changes at boot time.
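A minimal sketch of the devlink step follows, assuming the standard iproute2 devlink tool. It uses the eswitch set and eswitch show subcommands and derives the pci/ handle from the bus-info line of ethtool -i. The PCI address is an example value, and the helper names (run, pci_handle_from_businfo, set_eswitch_mode, show_eswitch_mode) are hypothetical. By default the script only prints the commands.

```shell
#!/bin/bash
# Sketch: change a PF's eSwitch mode with devlink and read it back.
# PF1_PCI is an assumed example address; DRY_RUN=1 only prints the commands.
PF1_PCI=${PF1_PCI:-pci/0000:af:00.0}
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

pci_handle_from_businfo() {
    # Reads "ethtool -i <iface>" output on stdin and prints the devlink handle.
    awk '$1 == "bus-info:" { print "pci/" $2 }'
}

set_eswitch_mode() {
    # $1 = devlink device handle, $2 = switchdev | legacy
    case "$2" in
        switchdev|legacy) ;;
        *) echo "unsupported eSwitch mode: $2" >&2; return 1 ;;
    esac
    run devlink dev eswitch set "$1" mode "$2"
}

show_eswitch_mode() {
    # $1 = devlink device handle
    run devlink dev eswitch show "$1"
}

set_eswitch_mode "$PF1_PCI" switchdev
show_eswitch_mode "$PF1_PCI"
```

On a live system the handle could be obtained with, for example, ethtool -i ens4f0 | pci_handle_from_businfo. A boot-time script that persists the mode could call set_eswitch_mode after the ice driver has loaded.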
• The default OVS max-idle (aging time) is 10 seconds. This means that flows that have been idle for
10 seconds are deleted from hardware, after which they must be relearned through the slow path and
offloaded to hardware again. As a workaround, it is recommended to set max-idle to a larger value
(such as 10 hours; the value is given in milliseconds) as follows:
ovs-vsctl set Open_vSwitch . other_config:max-idle=36000000
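The value 36000000 is 10 hours expressed in milliseconds. The following sketch shows the conversion for other aging times. The helper name hours_to_ms is hypothetical, and the script only prints the resulting ovs-vsctl command.

```shell
#!/bin/bash
# Sketch: convert an aging time in hours to the milliseconds used by max-idle.

hours_to_ms() {
    # $1 = whole number of hours
    echo $(( $1 * 60 * 60 * 1000 ))
}

MAX_IDLE_MS=$(hours_to_ms 10)
# Print the command instead of running it, so the value can be reviewed first.
echo "ovs-vsctl set Open_vSwitch . other_config:max-idle=$MAX_IDLE_MS"
```

The configured value can be read back afterwards with ovs-vsctl get Open_vSwitch . other_config:max-idle.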
• After configuring SR-IOV VFs in switchdev mode, to confirm that all interfaces (PF and VFs) are
connected to the same switch instance, read the unique phys_switch_id as follows:
cat /sys/class/net/$<PF1>/phys_switch_id
or
cat /sys/class/net/$<VF_PR>/phys_switch_id
The phys_switch_id entry must show the same value for all interfaces belonging to the same
switchdev instance.
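The comparison can be automated with a small helper. This is a sketch under assumptions: the function name same_switch_id is hypothetical, and the SYSFS_NET variable exists only so that the check can be exercised against a test directory instead of the live /sys/class/net.

```shell
#!/bin/bash
# Sketch: confirm that a set of netdevs report the same phys_switch_id.
# SYSFS_NET can be overridden to test the function without hardware.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

same_switch_id() {
    # Usage: same_switch_id IFACE [IFACE...]
    # Returns 0 only if every interface reports the same non-empty ID.
    local first="" id iface
    for iface in "$@"; do
        id=$(cat "$SYSFS_NET/$iface/phys_switch_id" 2>/dev/null) || id=""
        if [ -z "$id" ]; then
            echo "$iface: no phys_switch_id (not in switchdev mode?)" >&2
            return 1
        fi
        if [ -z "$first" ]; then
            first=$id
        elif [ "$id" != "$first" ]; then
            echo "$iface: phys_switch_id $id differs from $first" >&2
            return 1
        fi
    done
    [ -n "$first" ]
}
```

For example, same_switch_id ens4f0 eth0 eth1 && echo "same switchdev instance" checks the PF and two VF_PRs, using the example interface names from Appendix A.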
• When the eSwitch mode is set to switchdev, the host admin can read the port name of an interface
within the Ethernet controller by using the PF or VF_PR, as follows:
cat /sys/class/net/$<PF1>/phys_port_name
or
cat /sys/class/net/$<VF_PR>/phys_port_name
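To survey every netdev at once rather than reading one file at a time, a loop over sysfs can be used. This is a minimal sketch: the function name list_port_names is hypothetical, and SYSFS_NET is overridable only for testing. Netdevs that do not expose phys_port_name are skipped.

```shell
#!/bin/bash
# Sketch: print "<netdev> <phys_port_name>" for every netdev that exposes one.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

list_port_names() {
    local dir name
    for dir in "$SYSFS_NET"/*/; do
        [ -d "$dir" ] || continue
        # Reading fails on netdevs without a port name; skip those.
        name=$(cat "${dir}phys_port_name" 2>/dev/null) || continue
        [ -n "$name" ] || continue
        echo "$(basename "$dir") $name"
    done
}

list_port_names
```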
• The default behavior is to try to add the rule to both hardware and software. No error is
returned if the flow cannot be added to hardware, but an error is reported if the flow cannot be
added to software. skip_hw does not use the fast path, so performance is limited.
The following example uses the skip_sw parameter to tc to add the rule to hardware but not
software.
# tc qdisc add dev eth0 ingress
# tc filter add dev eth0 ingress \
protocol ip \
flower skip_sw \
ip_proto tcp dst_port 1234 \
action drop
In the output, if the flags show offloaded:yes, dp:tc, the flows are in hardware.
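For rules added directly with tc, as in the example above, the tc filter show output marks each flower filter as in_hw or not_in_hw. The following sketch counts both kinds from that text. The function name count_offloaded is hypothetical, and the parsing assumes the flags appear as separate words in the output.

```shell
#!/bin/bash
# Sketch: count hardware-offloaded and software-only flower filters
# from the text printed by "tc filter show".

count_offloaded() {
    # Reads tc filter output on stdin; prints "in_hw=<n> not_in_hw=<m>".
    local text hw sw
    text=$(cat)
    # -w matches whole words only, so in_hw does not match not_in_hw
    # or in_hw_count (underscore counts as a word character).
    hw=$(printf '%s\n' "$text" | grep -cw 'in_hw' || true)
    sw=$(printf '%s\n' "$text" | grep -cw 'not_in_hw' || true)
    echo "in_hw=$hw not_in_hw=$sw"
}

# Typical use on a host where the filter is installed (needs root and tc):
#   tc filter show dev eth0 ingress | count_offloaded
```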
dst_port, src_port    FLOW_DISSECTOR_KEY_PORTS    L4 source or destination port    For TCP/UDP port.
#!/bin/bash
set -x
#set -e
DEVLINK=devlink
TC=tc
BR=br0
PF1=ens4f0 # (PF whose eSwitch will be configured in switchdev mode. Change accordingly.)
PF1_PCI="pci/0000:af:00.0"
PF1_IP=192.168.66.16
VF1=ens4f0v0
VF2=ens4f0v1
VF1_PCI=0000:af:01.0
VF2_PCI=0000:af:01.1
VF1_MAC=52:54:00:00:16:01
VF2_MAC=52:54:00:00:16:02
VF1_IP=192.168.66.161
VF2_IP=192.168.66.162
VF1_PR=eth0
VF2_PR=eth1
PEER_IP=192.168.66.10
MASK=24
PEER_MAC=68:05:ca:a3:7b:10
rmmod ice
modprobe ice
sleep 2
#7. Add VF Port Representors to the bridge and bring all of them up
ip link set $VF1_PR master $BR
ip link set $VF2_PR master $BR
ip link set $VF1_PR up
ip link set $VF2_PR up
ip link set $PF1 up
ip link set $BR up
# Verify settings:
ethtool -k $PF1 | grep "hw-tc"
ethtool -k $VF1_PR | grep "hw-tc"
ethtool -k $VF2_PR | grep "hw-tc"
sleep 2
===============================================================================================
#!/bin/bash
set -x
#set -e
DEVLINK=devlink
TC=tc
BR=br1
PF1=ens4f0
PF1_PCI="pci/0000:af:00.0"
PF1_IP=192.168.66.16
VF1=ens4f0v0
VF2=ens4f0v1
VF1_PCI=0000:af:01.0
VF2_PCI=0000:af:01.1
VF1_MAC=52:54:00:00:16:01
VF2_MAC=52:54:00:00:16:02
VF1_IP=192.168.66.161
VF2_IP=192.168.66.162
VF1_PR=eth0
VF2_PR=eth1
PEER_IP=192.168.66.10
MASK=24
PEER_MAC=68:05:ca:a3:7b:10
#2. Install the Open vSwitch package & start the service
#2.1. Install the OVS package
zypper install openvswitch
#15. Check connections and Watch rules being added via tc tool
#ping 2nd VF from 1st VF
ip netns exec ns1 ping -c3 $VF2_IP
# In the output, check the offload and datapath fields: offloaded:yes, dp:tc means the flows are
# offloaded to HW.
===============================================================================================
eSwitch Ethernet controller’s embedded switch. Sometimes referred to as Virtual Ethernet Bridge (VEB).
PF Physical Function
PR Port Representor
VF Virtual Function
LEGAL
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
This document (and any related software) is Intel copyrighted material, and your use is governed by the express license under which
it is provided to you. Unless the license provides otherwise, you may not use, modify, copy, publish, distribute, disclose or transmit
this document (and related materials) without Intel's prior written permission. This document (and related materials) is provided as
is, with no express or implied warranties, other than those that are expressly stated in the license.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a
particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in
trade.
This document contains information on products, services and/or processes in development. All information provided here is subject
to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors which may cause deviations from published specifications.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725
or by visiting www.intel.com/design/literature.htm.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
Other names and brands may be claimed as the property of others.
© 2021 Intel Corporation.