Data Center
15.a
Student Guide
Volume 1 of 2
Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.
YEAR 2000 NOTICE
Juniper Networks hardware and software products do not suffer from Year 2000 problems and hence are Year 2000 compliant. The Junos operating system has no known
time-related limitations through the year 2038. However, the NTP application is known to have some difficulty in the year 2036.
SOFTWARE LICENSE
The terms and conditions for using Juniper Networks software are described in the software license provided with the software, or to the extent applicable, in an agreement
executed between you and Juniper Networks, or Juniper Networks agent. By using Juniper Networks software, you indicate that you understand and agree to be bound by its
license terms and conditions. Generally speaking, the software license restricts the manner in which you are permitted to use the Juniper Networks software, may contain
prohibitions against certain uses, and may state conditions under which the license is automatically terminated. You should consult the software license for further details.
Contents
This five-day course is designed to cover best practices, theory, and design principles for data center design including
data center architectures, data center interconnects, security considerations, virtualization, and data center operations.
Objectives
After successfully completing this course, you should be able to:
• State high-level concepts about the different data center architectures.
• Identify features used to interconnect data centers.
• Identify key high-level considerations about securing and monitoring a data center deployment.
• Outline key high-level concepts when implementing different data center approaches.
• Recommend data center cooling designs and considerations.
• Explain device placement and cabling requirements.
• Outline different data center use cases with basic architectures.
• Describe a traditional multitier data center architecture.
• Explain link aggregation and redundant trunk groups.
• Explain multichassis link aggregation.
• Summarize and discuss key concepts and components of a Virtual Chassis.
• Summarize and discuss key concepts and components of a VCF.
• Summarize and discuss key concepts and components of a QFabric System.
• Summarize and discuss key concepts and components of Junos Fusion.
• List the reasons for the shift to IP fabrics.
• Summarize how to scale an IP fabric.
• State the design considerations of a VXLAN overlay.
• Define the term Data Center Interconnect.
• List differences between the different Layer 2 and Layer 3 DCIs.
• Summarize and discuss the benefits and use cases for EVPN.
• Discuss the security requirements and design principles of the data center.
• Identify the security elements of the data center.
• Explain how to simplify security in the data center.
• Discuss the security enforcement layers in the data center.
• Summarize and discuss the purpose of SDN.
• Explain the function of Contrail.
• Summarize and discuss the purpose of NFV.
• Discuss the purpose and function of vSRX and vMX.
• Discuss the importance of understanding the baseline behaviors in your data center.
• List the characteristics of the Junos Space Network Management Platform and describe its deployment
options.
• Describe the importance of analytics.
• Discuss automation in the data center.
• Discuss the benefits of QoS and CoS.
• State the benefits of a converged network.
Course Level
JND-DC is an intermediate-level course.
Prerequisites
The prerequisites for this course are as follows:
• Knowledge of routing and switching architectures and protocols.
• Knowledge of Juniper Networks products and solutions.
• Understanding of infrastructure security principles.
• Basic knowledge of hypervisors and load balancers.
• Completion of the Juniper Networks Design Fundamentals (JNDF) course.
Day 1
Chapter 1: Course Introduction
Chapter 2: Overview of Data Center Design
Chapter 3: Initial Design Considerations
Chapter 4: Traditional Data Center Architecture
Lab: Designing a Multitier Architecture
Day 2
Chapter 5: Ethernet Fabric Architectures
Lab: Ethernet Fabric Architecture
Day 3
Chapter 6: IP Fabric Architecture
Lab: IP Fabric Architecture
Chapter 7: Data Center Interconnect
Lab: Interconnecting Data Centers
Day 4
Chapter 8: Securing the Data Center
Lab: Securing the Data Center
Chapter 9: SDN and Virtualization in the Data Center
Lab: SDN and Virtualization
Chapter 10: Data Center Operation
Lab: Data Center Operations
Day 5
Chapter 11: Traffic Prioritization for Converged Networks
Lab: Prioritizing Data in the Data Center
Chapter 12: Migration Strategies
Lab: Data Center Migration
Chapter 13: High Availability
• Franklin Gothic: Normal text. Example: Most of what you read in the Lab Guide and Student Guide.
• CLI Input: Text that you must enter. Example: lab@San_Jose> show route
• GUI Input: Text that you must enter. Example: Select File > Save, and type config.ini in the Filename field.
• CLI Undefined: Text where the variable's value is the user's discretion, or where the variable's value as shown in the lab guide might differ from the value the user must input according to the lab topology. Examples: Type set policy policy-name. ping 10.0.x.y
• GUI Undefined: Text where the variable's value is the user's discretion, or where the value shown in the lab guide might differ from the value the user must input according to the lab topology. Example: Select File > Save, and type filename in the Filename field.
We Will Discuss:
• Objectives and course content information;
• Additional Juniper Networks, Inc. courses; and
• The Juniper Networks Certification Program.
Introductions
The slide asks several questions for you to answer during class introductions.
Prerequisites
The slide lists the prerequisites for this course.
Additional Resources
The slide provides links to additional resources available to assist you in the installation, configuration, and operation of
Juniper Networks products.
Satisfaction Feedback
Juniper Networks uses an electronic survey system to collect and analyze your comments and feedback. Depending on the
class you are taking, please complete the survey at the end of the class, or be sure to look for an e-mail about two weeks
from class completion that directs you to complete an online survey form. (Be sure to provide us with your current e-mail
address.)
Submitting your feedback entitles you to a certificate of class completion. We thank you in advance for taking the time to
help us improve our educational offerings.
Courses
Juniper Networks courses are available in the following formats:
• Classroom-based instructor-led technical courses
• Online instructor-led technical courses
• Hardware installation eLearning courses as well as technical eLearning courses
• Learning bytes: Short, topic-specific, video-based lessons covering Juniper products and technologies
Find the latest Education Services offerings covering a wide range of platforms at
http://www.juniper.net/training/technical_education/.
Junos Genius
The Junos Genius application takes certification exam preparation to a new level. With Junos Genius you can practice for
your exam with flashcards, simulate a live exam in a timed challenge, and even build a virtual network with device
achievements earned by challenging Juniper instructors. Download the app now and Unlock your Genius today!
Find Us Online
The slide lists some online resources to learn and share information about Juniper Networks.
Any Questions?
If you have any questions or concerns about the class you are attending, we suggest that you voice them now so that your
instructor can best address your needs during class.
We Will Discuss:
• High-level concepts about the different data center architectures;
• Features used to interconnect data centers;
• Key high-level considerations about securing and monitoring a data center deployment; and
• Key high-level concepts for implementing and enhancing performance in a data center.
Initial Considerations
The slide lists the topics we will discuss. We discuss the highlighted topic first.
QFabric
The QFabric System is composed of multiple components working together as a single switch to provide high-performance,
any-to-any connectivity and management simplicity in the data center. The QFabric System flattens the entire data center
network to a single tier where all access points are equal, eliminating the effects of network locality and making it the ideal
network foundation for cloud-ready, virtualized data centers. QFabric is a highly scalable system that improves application
performance with low latency and converged services in a non-blocking, lossless architecture that supports Layer 2, Layer 3,
and Fibre Channel over Ethernet (FCoE) capabilities. The reason you can consider the QFabric system as a single system is
that the Director software running on the Director group allows the main QFabric system administrator to access and
configure every device and port in the QFabric system from a single location. Although you configure the system as a single
entity, the fabric contains four major hardware components. The hardware components can be chassis-based, group-based,
or a hybrid of the two.
Junos Fusion
Junos Fusion is a Juniper Networks Ethernet fabric architecture designed to provide a bridge from legacy networks to
software-defined cloud networks. With Junos Fusion, service providers and enterprises can reduce network complexity and
operational costs by collapsing underlying network elements into a single, logical point of management. The Junos Fusion
architecture consists of two major components: aggregation devices and satellite devices. With this structure it can also be
classified as a spine and leaf architecture. These components work together as a single switching system, flattening the
network to a single tier without compromising resiliency. Data center operators can build individual Junos Fusion pods
composed of a pair of aggregation devices and a set of satellite devices. Each pod is a collection of aggregation and satellite
devices that are managed as a single device. Pods can be small—for example, a pair of aggregation devices and a handful of
satellites—or large with up to 64 satellite devices based on the needs of the data center operator.
IP Fabric
An IP Fabric is one of the most flexible and scalable data center solutions available. Because an IP Fabric operates strictly
at Layer 3, no proprietary features or protocols are used, so this solution works very well for data centers
that must accommodate multiple vendors. One of the most complicated tasks in building an IP Fabric is assigning all of the
implementation details, such as IP addresses, BGP AS numbers, routing policy, and loopback address assignments.
Layer 2 Options
Three classifications exist for Layer 2 DCIs:
1. No MAC learning by the Provider Edge (PE) device: This type of Layer 2 DCI does not require that the PE devices
learn MAC addresses.
2. Data plane MAC learning by the PE device: This type of DCI requires that the PE device learns the MAC
addresses of both the local data center as well as the remote data centers.
3. Control plane MAC learning: This type of DCI requires that a local PE learn the local MAC addresses using the
control plane and then distribute these learned MAC addresses to the remote PEs.
Layer 3 Options
A Layer 3 DCI uses routing to interconnect data centers. Each data center must maintain a unique IP address space. A
Layer 3 DCI can be established using just about any IP-capable link. Another important consideration for DCIs is
incorporating some level of redundancy by using link aggregation groups (LAGs), IGPs using equal-cost multipath, and BGP or
MP-BGP using the multipath or multihop features.
Traffic Patterns
Understanding how traffic flows through and within your data center is key to knowing how and where your security devices
need to be placed. For instance, if you know that there will be a large amount of VM to VM or even server to server traffic
(also referred to as east-west traffic) that must be secured, you can consider adding a virtual firewall (vSRX) to inspect this
traffic and apply security policies as needed. If you are strictly concerned with north-south traffic then using a physical
SRX Series device in the network, using one of the previously outlined methods, is ideal.
Another key aspect of your security design is creating logical separation between areas in the data center by defining and
implementing security zones. Then policies can be used to control traffic that is allowed into one zone but not into another.
Basically, you take a trusted versus untrusted approach to traffic flowing through your network. Another method of
segmentation is using virtual routing instances to create logical separation in your network. This approach allows you to
continue to apply security policies while completely separating traffic using the firewall instead of other devices in the
network.
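To make this concrete, the following is a minimal configuration sketch of zone-based segmentation on an SRX Series device. The zone names, interface assignments, and address-book entry are examples only and are not taken from the lab topology.

security {
    address-book {
        global {
            address web-servers 10.10.20.0/24;    # example address entry
        }
    }
    zones {
        security-zone dc-trust {                  # example zone names
            interfaces {
                ge-0/0/1.0;
            }
        }
        security-zone dc-untrust {
            interfaces {
                ge-0/0/2.0;
            }
        }
    }
    policies {
        from-zone dc-untrust to-zone dc-trust {
            policy allow-web {
                match {
                    source-address any;
                    destination-address web-servers;
                    application junos-https;      # predefined application
                }
                then {
                    permit;
                }
            }
        }
    }
}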
Junos Space
Data center operation can be greatly simplified by using Junos Space. Junos Space is a comprehensive network
management solution that simplifies and automates management of Juniper Networks switching, routing, and security
devices. With all of its components working together, Junos Space offers a unified network management and orchestration
solution to help you more efficiently manage the data center.
Automation
Junos automation is part of the standard Junos OS available on all switches, routers, and security devices running Junos OS.
Junos automation can be used to automate operational and configuration tasks on a network’s Junos devices. The slide
highlights both on-box and off-box automation capabilities, including support for multiple scripting languages.
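As one hedged example of the on-box capabilities, an event policy can react to a link-down event by collecting troubleshooting output automatically. The policy, event, and destination names below are illustrative only.

event-options {
    policy capture-link-flap {                    # example policy name
        events snmp_trap_link_down;
        then {
            execute-commands {
                commands {
                    "show interfaces extensive";
                    "show log messages | last 50";
                }
                output-filename link-flap;        # example file name prefix
                destination flap-archive;
                output-format text;
            }
        }
    }
    destinations {
        flap-archive {
            archive-sites {
                "/var/tmp";
            }
        }
    }
}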
Implementation Considerations
The slide highlights the topic we discuss next.
Implementation Plan
Implementation planning is more than just bringing the new data center into production. It requires a lot of planning and should
be very comprehensive. As part of your implementation planning process you should also consider the traffic prioritization,
FCoE requirements, and high availability needs of the customer.
Traffic Prioritization
Each application in the data center can have different requirements. Certain applications do not allow for the loss of packets
or delay in packet delivery. This makes it important to classify this traffic with a higher priority than other applications that
have mechanisms to accommodate some level of loss or delay. If all traffic was treated the same, you would experience
unnecessary problems with these higher priority applications. By classifying and prioritizing the different applications in your
data center you ensure that traffic is handled in the most efficient manner possible. Most importantly, the same traffic
prioritization rules must be applied throughout the data center to ensure proper handling.
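For illustration only, the following trimmed-down class-of-service sketch places latency-sensitive traffic into its own forwarding class and gives it a dedicated scheduler. The class, classifier, scheduler, and interface names are assumptions, not values from this course.

class-of-service {
    forwarding-classes {
        class critical-app queue-num 5;           # example class for loss-sensitive traffic
        class best-effort queue-num 0;
    }
    classifiers {
        dscp dc-classifier {                      # example classifier name
            forwarding-class critical-app {
                loss-priority low code-points ef;
            }
        }
    }
    schedulers {
        critical-sched {
            transmit-rate percent 40;             # example bandwidth guarantee
            priority high;
        }
        be-sched {
            transmit-rate remainder;
        }
    }
    scheduler-maps {
        dc-map {
            forwarding-class critical-app scheduler critical-sched;
            forwarding-class best-effort scheduler be-sched;
        }
    }
    interfaces {
        xe-0/0/1 {                                # example interface
            scheduler-map dc-map;
            unit 0 {
                classifiers {
                    dscp dc-classifier;
                }
            }
        }
    }
}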
High Availability
High availability should be a major consideration within the data center. It is important to maintain reachability to all services
in a data center regardless of individual failure. Failures will happen in a data center, but you must ensure this does not
affect the user experience. The slide illustrates some of the high availability features that should be considered while
designing a data center.
We Discussed:
• High-level concepts about the different data center architectures;
• Features used to interconnect data centers;
• Key high-level considerations about securing and monitoring a data center deployment; and
• Key high-level concepts for implementing and enhancing performance in a data center.
Review Questions
1.
2.
We Will Discuss:
• Data center cooling design and considerations;
• Device placement and cabling requirements; and
• Different data center use cases including architectural choices.
Physical Layout
One of the first steps in data center design is planning the physical layout of the data center. Multiple physical divisions exist
within the data center that are usually referred to as segments, zones, cells, or pods. Each segment consists of multiple rows
of racks containing equipment that provides computing resources, data storage, networking, and other services.
Physical considerations for the data center include placement of equipment, cabling requirements and restrictions, and
power and cooling requirements. Once you determine the appropriate physical layout, you can replicate the design across all
segments within the data center or in multiple data centers. Using a modular design approach improves the scalability of the
deployment while reducing complexity and easing data center operations.
The physical layout of networking devices in the data center must balance the need for efficiency in equipment deployment
with restrictions associated with cable lengths and other physical considerations. Pros and cons must be considered
between deployments in which network devices are consolidated in a single rack versus deployments in which devices are
distributed across multiple racks. Adopting an efficient solution at the rack and row levels ensures efficiency of the overall
design because racks and rows are replicated throughout the data center.
This section discusses the following data center layout options:
• Top-of-rack (ToR) or bottom-of-rack (BoR);
• Middle-of-row (MoR); and
• End-of-row (EoR).
End of Row
If the physical cable layout does not support ToR or BoR deployment, or if the customer prefers a large chassis-based
solution, the other options would be an EoR or MoR deployment, where network switches are deployed in a dedicated rack
in the row.
In the EoR configuration, which is common in existing data centers with existing cabling, high-density switches are placed at
the end of a row of servers, providing a consolidated location for the networking equipment to support all of the servers in
the row. EoR configurations can support larger form factor devices than ToR and BoR rack configurations, so you end up with
a single access tier switch to manage an entire row of servers. EoR layouts also require fewer uplinks and simplify the
network topology—inter-rack traffic is switched locally. Because EoR deployments require cabling over longer distances than
ToR and BoR configurations, they are best for deployments that involve 1 Gigabit Ethernet connections and relatively few
servers.
Disadvantages of the EoR layout include longer cable runs which can exceed the length limits for 10 Gigabit Ethernet and
40 Gigabit Ethernet connections, so careful planning is required to accommodate high-speed network connectivity. Device
port utilization is not always optimal with traditional chassis-based devices, and most chassis-based devices consume a
great deal of power and cooling, even when not fully configured or utilized. In addition, these large chassis-based devices
can take up a great deal of valuable data center rack space.
Middle of Row
An MoR deployment is similar to an EoR deployment, except that the devices are deployed in the middle of the row instead of
at the end. The MoR configuration provides some advantages over an EoR deployment, such as the ability to reduce cable
lengths to support 10 Gigabit Ethernet and 40 Gigabit Ethernet server connections. High-density, large form-factor devices
are supported, fewer uplinks are required in comparison with ToR and BoR deployments, and a simplified network topology
can be adopted.
You can configure an MoR layout so that devices with cabling limitations are installed in the racks that are closest to the
network device rack. While the MoR layout is not as flexible as a ToR or BoR deployment, the MoR layout supports
greater scalability and agility than the EoR deployment.
Although it minimizes the cable-length disadvantage associated with EoR deployments, the MoR deployment still has the
same port utilization, power, cooling, and rack space concerns associated with an EoR deployment.
Environmental Conditions
The slide highlights the topic we discuss next.
Power Considerations
As old data center facilities are upgraded and new data centers are built, an important consideration is to ensure that the
data center network infrastructure is designed for maximum energy and space efficiency as well as having a minimal
environmental impact. Power, space, and cooling requirements of all network components must be accounted for and
compared with different architectures and systems so that the environmental and cost impacts across the entire data center
as a whole can be taken into consideration—even down to the lighting. Many times, the most efficient approach is to
implement high-end, highly scalable systems that can replace a large number of smaller components, thereby delivering
energy and space efficiency. Green initiatives that track resource usage, carbon emissions, and efficient utilization of
resources such as power and cooling are to be considered when designing a data center.
Fewer devices require less power, which in turn reduces cooling requirements, thus adding up to substantial power savings.
Energy Efficiency
Constraints on the amount of power available to support the data center infrastructure, combined with the increase in
application demand that organizations are experiencing, make it imperative to design data centers that consume a minimum
amount of power per unit of work performed. Every piece of the data center infrastructure matters in energy
consumption.
Juniper Networks has started building energy-efficiency techniques into its entire product line. The Energy
Consumption Rating (ECR) Initiative, formed by multiple organizations in the networking industry to create a common metric
to compare energy efficiency, provides a compelling perspective. This group has defined the energy efficiency ratio (EER) as
a key metric. EER measures the number of gigabits per second of traffic that can be moved through a network device per
kilowatt of electricity consumed—the higher the number, the greater the efficiency. Different designs achieve different
ratings based on this metric.
Cabling Options
The slide highlights the topic we discuss next.
Enterprise IT
Enterprise customers represent a large portion of the Fortune 500 companies. This business model operates on project
lifecycles that require a wide variety of technology to be delivered quickly. While they look to transition to private clouds,
many legacy workloads and applications exist that require a simple network infrastructure with familiar operations and a
single point of management.
We Discussed:
• Data center cooling design and considerations;
• Device placement and cabling requirements; and
• Different data center use cases including architectural choices.
Review Questions
1.
2.
3.
We Will Discuss:
• Traditional multitier data center architectures;
• Using link aggregation and redundant trunk groups; and
• Using multichassis link aggregation.
Multiple Tiers
Legacy data centers are often hierarchical and consist of multiple layers. The diagram on the slide illustrates the typical
layers, which include access, distribution (sometimes referred to as aggregation), and core. Each of these layers performs
unique responsibilities. We cover the functions of each layer on a subsequent slide in this section.
Hierarchical networks are designed in a modular fashion. This inherent modularity facilitates change and makes this design
option quite scalable. When working with a hierarchical network, the individual elements can be replicated as the network
grows. The cost and complexity of network changes is generally confined to a specific portion (or layer) of the network rather
than to the entire network.
Because functions are mapped to individual layers, faults relating to a specific function can be isolated to that function’s
corresponding layer. The ability to isolate faults to a specific layer can greatly simplify troubleshooting efforts.
Functions of Layers
When designing a hierarchical data center network, individual layers are defined and represent specific functions found
within a network. It is often mistakenly thought that the access, distribution (or aggregation), and core layers must exist in
clear and distinct physical devices, but this is not a requirement, nor does it make sense in some cases. The layers are
defined to aid successful network design and to represent functionality that exists in many networks.
The slide highlights the access, aggregation, and core layers and provides a brief description of the functions commonly
implemented in those layers. If CoS is used in a network, it should be incorporated consistently in all three layers.
Consolidating Tiers
Clearly, more devices within a network architecture means more points of failure. Furthermore, a larger number of devices
introduces a higher network latency, which many applications today simply cannot tolerate. Most legacy chassis switches
add latencies on the order of 20 to 50 microseconds.
Today, we have extremely high-density 10GbE Layer 2 and Layer 3 switches in the data center core. Many of these switches
have 100 Gbps or more of capacity per slot and over 100 line-rate 10GbE ports in a chassis. This enhanced performance
capability in data center switches has resulted in a trend of simplification throughout the entire data center. With this design
simplification, made possible by higher performing and more capable switches, you can eliminate the distribution tier of a
legacy three-tier data center network altogether.
Using a simplified architecture reduces latency through a combination of collapsed switching tiers, Virtual Chassis for direct
path from server to server, and advanced application-specific integrated circuit (ASIC) technologies. By simplifying the data
center architecture, you reduce the overall number of devices, which means the physical interconnect architecture is
simplified. By simplifying the overall design and structure at the physical layer, your deployment, management, and future
troubleshooting efforts are made easier. We examine these points in greater detail on subsequent slides.
Resource Utilization
In the multitier topology displayed on the slide, you can see that almost half the links are not utilized. In this example, you
would also need to run some type of spanning tree protocol (STP) to avoid loops, which would delay network convergence
and introduce significant STP control traffic that takes up valuable bandwidth.
This topology is relatively simple but allows us to visualize the lack of resource utilization. Imagine a data center with a
hundred racks of servers with a hundred top of rack access switches. The access switches all aggregate up to the
core/distribution switches, including redundant connections. In this much larger and more complicated network you would have
thousands of physical cable connections that are not being utilized. Now imagine these connections are fiber: in addition to the
unused cables, you would also have two unused transceivers per connection. Because of this inefficient use
of physical components, a significant amount of usable bandwidth sits idle. Later in this chapter we will look
at different technologies that can be used to alleviate some of these challenges including redundant trunk groups (RTGs)
and multichassis link aggregation (MC-LAG).
Fully Converged
Once the role and state for all switch ports is determined, the tree is considered fully converged. The convergence delay can
take up to 50 seconds when the default forwarding delay (15 seconds) and max age timer (20 seconds) values are in effect.
The formula used to calculate the convergence delay for STP is 2 x the forwarding delay + the maximum age.
What Is LAG?
A link aggregation group (LAG) combines multiple physical links into one logical link.
What Is LACP?
The protocol that handles link aggregation is Link Aggregation Control Protocol (LACP). LACP is a standards-based Institute of
Electrical and Electronics Engineers (IEEE) 802.3ad protocol.
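The following is a minimal sketch of a two-member LAG running LACP on a Junos switch; the interface numbers and VLAN membership are placeholders.

chassis {
    aggregated-devices {
        ethernet {
            device-count 2;              # number of aeX interfaces to create
        }
    }
}
interfaces {
    xe-0/0/10 {                          # example member links
        ether-options {
            802.3ad ae0;
        }
    }
    xe-0/0/11 {
        ether-options {
            802.3ad ae0;
        }
    }
    ae0 {
        aggregated-ether-options {
            lacp {
                active;                  # actively send LACP PDUs
            }
        }
        unit 0 {
            family ethernet-switching {
                interface-mode trunk;
                vlan {
                    members all;         # example VLAN membership
                }
            }
        }
    }
}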
What Is RTG?
Redundant trunk group (RTG) is a mechanism that allows a sub-second failover on access switches that are dual-homed to
the upper layer switches.
RTG Example
The slide illustrates a typical topology example in which the RTG feature might be used. In this example, Switch C is
functioning as an access tier switch and has two multihomed trunks connecting to Switch A and Switch B, which are
operating as core-aggregation tier switches. Switch C has an RTG configured and has the trunk connecting to Switch A set as
active, whereas the trunk connecting to Switch B is set as nonactive (or secondary). Because the RTG is configured, Switch C
cannot run STP or RSTP on the network ports that are participating in the RTG. All other ports can, if needed, participate in
STP or RSTP.
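A configuration for Switch C might look roughly like the following sketch. Note that the exact hierarchy varies by platform and release (for example, ethernet-switching-options on older EX Series software versus switching-options on ELS-based platforms), and the group and interface names are examples.

ethernet-switching-options {
    redundant-trunk-group {
        group rtg-uplinks {              # example group name
            interface xe-0/0/0.0 {       # uplink to Switch A (active)
                primary;
            }
            interface xe-0/0/1.0;        # uplink to Switch B (secondary)
        }
    }
}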
RTG Considerations
The following are RTG design considerations:
• RTG and STP are mutually exclusive. The two uplinks—the primary active and the secondary standby uplinks
configured for RTG—do not run STP. However, we recommend that you still enable STP on the rest of the ports.
• From an architectural perspective, access tier switches typically connect to the upper layer—aggregation or core
switches. If, however, access switches interconnect to each other and RTG is used, then this design will cause a
loop because both access switches have RTG ports that are not running STP. Currently, Juniper Networks EX
Series and QFX Series platforms support RTG on the physical and aggregated Ethernet interfaces.
A Potential Problem
While operational continuity is a top priority, it is not guaranteed simply by adding multiple, bundled connections between
the compute resources (servers) and their attached access switch. This design, while improved over a design with a single
link, still includes potential single points of failure including the access switch and the compute device.
While the survivability of compute resources can be handled through the duplication of the impacted resources on some
other physical device in the network (typically done through virtualization technologies), the access switch in this deployment
model remains a single point of failure whose loss prohibits the utilization of the attached resources.
A Solution
To eliminate the access switch as being a single point of failure in the data center environment, you can use multichassis
link aggregation. Multichassis link aggregation builds on the standard LAG concept defined in 802.3ad and allows a LAG
from one device, in our example a server, to be spread between two upstream devices, in our example two access switches
to which the server connects. Using multichassis link aggregation avoids the single point of failure scenario related to the
access switches described previously and allows operational continuity for traffic and services, even when one of the two
switches supporting the server fails.
MC-LAG Overview
An MC-LAG allows two similarly configured devices, known as MC-LAG peers, to emulate a logical LAG interface which
connects to a separate device at the remote end of the LAG. The remote LAG endpoint may be a server, as shown in the
example on the slide, or a switch or router depending on the deployment scenario. The two MC-LAG peers appear to the
remote endpoint connecting to the LAG as a single device.
As previously mentioned, MC-LAGs build on the standard LAG concept defined in 802.3ad and provide node-level
redundancy as well as multihoming support for mission critical deployments. Using MC-LAGs avoids the single point of failure
scenario related to the access switches described previously and allows for operational continuity for traffic and services,
even when one of the two MC-LAG peers supporting the server fails.
MC-LAGs make use of the Interchassis Control Protocol (ICCP), which is used to exchange control information between the
participating MC-LAG peers. We discuss ICCP further on the next slides.
MC-LAG Modes
There are two modes in which an MC-LAG can operate: Active/Standby and Active/Active. Each mode has its own set of
benefits and drawbacks.
Active/Standby mode allows only one MC-LAG peer to be active at a time. Using LACP, the active MC-LAG peer signals to the
attached device (the server in our illustrated example) that its links are available to forward traffic. As you might guess, a
drawback to this method is that only half of the links in the server’s LAG are used at any given time. However, this method is
usually easier to troubleshoot than Active/Active because traffic is not hashed across all links and no shared MAC learning
needs to take place between the MC-LAG peers.
Using the Active/Active mode, all links between the attached device (the server in our illustrated example) and the MC-LAG
peers are active and available for forwarding traffic. Because all links are active, traffic might need to pass between
the MC-LAG peers. The ICL-PL can be used to accommodate the traffic that must pass between the MC-LAG peers. We
demonstrate this on the next slide. Currently, the QFX5100 Series switches only support the Active/Active mode and this
mode is the preferred deployment mode for most data center deployments.
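To give a sense of the pieces involved, the following heavily abbreviated sketch shows an Active/Active MC-LAG configuration for one peer. The ICCP addresses, LACP system ID, and MC-AE values are illustrative only, and a complete deployment also requires the ICL and a matching configuration on the second peer.

protocols {
    iccp {
        local-ip-addr 10.0.0.1;                  # ICCP addresses are examples
        peer 10.0.0.2 {
            session-establishment-hold-time 50;
            liveness-detection {
                minimum-interval 1000;
            }
        }
    }
}
interfaces {
    ae0 {
        aggregated-ether-options {
            lacp {
                active;
                system-id 00:01:02:03:04:05;     # must match on both peers (example value)
                admin-key 1;
            }
            mc-ae {
                mc-ae-id 1;                      # same on both peers
                chassis-id 0;                    # 0 on this peer, 1 on the other
                mode active-active;
                status-control active;           # standby on the other peer
            }
        }
    }
}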
Layer 3 Routing
Layer 3 inter-VLAN routing can be provided through MC-LAG peers using integrated routing and bridging (IRB) and VRRP. This
allows compute devices to communicate with other devices on different Layer 3 subnets using gateway access through their
first-hop infrastructure device (their directly attached access switch), which can expedite the required communication
process.
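A simplified gateway sketch along these lines pairs an IRB interface with VRRP on each MC-LAG peer. The VLAN, addresses, and priority are placeholders; the second peer would use a different physical address and priority.

vlans {
    v100 {                                   # example VLAN
        vlan-id 100;
        l3-interface irb.100;
    }
}
interfaces {
    irb {
        unit 100 {
            family inet {
                address 10.1.100.2/24 {      # example physical address for this peer
                    vrrp-group 100 {
                        virtual-address 10.1.100.1;    # shared gateway address
                        priority 200;
                        accept-data;
                    }
                }
            }
        }
    }
}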
We Discussed:
• Traditional multitier data center architectures;
• Using link aggregation and redundant trunk groups; and
• Using multichassis link aggregation.
Review Questions
1.
2.
3.
We Will Discuss:
• Key concepts and components of a Virtual Chassis;
• Key concepts and components of a Virtual Chassis Fabric (VCF);
• Key concepts and components of a QFabric system; and
• Key concepts and components of Junos Fusion.
Virtual Chassis
The slide highlights the topic we discuss next.
A mixed Virtual Chassis is a Virtual Chassis consisting of a mix of different switch types. Only certain mixtures of switches are
supported. It can consist either of EX4200, EX4500, or EX4550 Series switches or it can consist of EX4300, QFX3500,
QFX3600, or QFX5100 Series switches.
{master:0}
user@Switch-1>
Switch-1 (ttyu0)
login: user
Password:
{master:5}
user@Switch-1> request session member 1
This slide provides an opportunity to discuss and think about how interfaces are named within a Virtual Chassis.
Software Upgrades
You perform software upgrades within a Virtual Chassis system on the master switch. When you issue the request system
software add command, all member switches are automatically upgraded. Alternatively, you can add the member option
with the desired member ID, as shown on the slide, to upgrade a single member switch.
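For reference, the command forms look roughly like the following; the package file name is a placeholder, and the reboot option is shown only as an example.

{master:0}
user@Switch-1> request system software add /var/tmp/jinstall-package.tgz reboot

{master:0}
user@Switch-1> request system software add /var/tmp/jinstall-package.tgz member 2 reboot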
Software Features
In general, you can expect the Junos OS to support the lowest common denominator of software features in a mixed Virtual
Chassis. The slide shows the uniform resource locator (URL) where you can view the behavior of many of the software
features when they are used in a mixed scenario.
What Is a VCF?
The Juniper Networks VCF provides a low-latency, high-performance fabric architecture that can be managed as a single
device. VCF is an evolution of the Virtual Chassis feature, which enables you to interconnect multiple devices into a single
logical device, inside of a fabric architecture. The VCF architecture is optimized to support small and medium-sized data
centers that contain a mix of 1-Gbps, 10-Gbps, and 40-Gbps Ethernet interfaces.
A VCF is constructed using a spine-and-leaf architecture. In the spine-and-leaf architecture, each spine device is
interconnected to each leaf device. A VCF supports up to twenty total devices, and up to four devices can be configured as
spine devices. QFX5100 Series switches can be placed in either the Spine or Leaf location while QFX3500, QFX3600, and
EX4300 Series switches should only be wired as Leaf devices in a mixed scenario.
You should notice that the goal of the design is to provide connectivity from one ingress crossbar switch to an egress
crossbar switch. Notice that there is no need for connectivity between crossbar switches that belong to the same stage.
VCF Benefits
The slide shows some of the benefits (similar to Virtual Chassis) of VCF when compared to managing 32 individual switches.
VCF Components
You can interconnect up to 20 QFX5100 Series switches to form a VCF. A VCF can consist of any combination of model
numbers within the QFX5100 family of switches. QFX3500, QFX3600, and EX4300 Series switches are also supported in the
line card role.
Each switch has a Packet Forwarding Engine (PFE). All PFEs are interconnected by Virtual Chassis ports (VCPs). Collectively,
the PFEs and their VCP connections constitute the VCF.
You can use the built-in 40GbE QSFP+ ports or SFP+ uplink ports, converted to VCPs, to interconnect the member switches’
PFEs. To use an uplink port as a VCP, explicit configuration is required.
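For example, converting an uplink port into a VCP is done with an operational command similar to the following; the PIC slot, port, and member values are placeholders that depend on your hardware.

user@Switch-1> request virtual-chassis vc-port set pic-slot 0 port 48 member 0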
Spine Nodes
To be able to support the maximum throughput, QFX5100 Series switches should be placed in the Spine positions. It is
further recommended to use the QFX5100-24Q switch in the Spine position. Although any QFX5100 Series switch will work
in the Spine position, the QFX5100-24Q switch supports 32 40GbE QSFP+ ports, which allows for the maximum
expansion possibility (remember that 16 Leaf nodes would take up 16 QSFP+ ports on each Spine). Spines are typically
configured in the RE role (discussed later).
Leaf Nodes
Although not a requirement, it is recommended to use QFX5100 Series devices in the Leaf position. Using a non-QFX5100
Series switch (even just one different switch) requires that the entire VCF be placed into "mixed" mode. When a VCF is placed
into mixed mode, the hardware scaling numbers for the VCF as a whole (MAC table size, routing table size, and many more)
are scaled down to the lowest common denominator among the potential member switches. It is recommended that each
Leaf node has a VCP connection to every Spine node.
Member Roles
The slide shows the different Juniper switches that can participate in a VCF along with their recommended node type (Spine
or Leaf node) as well as their capability to become an RE or line card. It is always recommended to use QFX5100 Series
switches in the Spine position. All other supported switch types should be placed in the Leaf position. In a VCF, only a
QFX5100 Series device can assume the RE role (even if you try to make another switch type an RE). Any supported switch
type can be assigned the linecard role.
Master RE
A VCF has two devices operating in the Routing Engine (RE) role—a master Routing Engine and a backup Routing Engine. All
Spine nodes should be configured for the RE role. However, based on the RE election process only two REs will be elected.
Any QFX5100 Series that is configured as an RE but is not elected to the master or backup RE role will take on the linecard
role.
A QFX5100 Series configured for the RE role but operating in the linecard role can complete all leaf or spine related
functions with no limitations within a VCF.
The device that functions as the master Routing Engine:
• Should be a spine device (a “must” for Juniper support).
• Manages the member devices.
• Runs the chassis management processes and control protocols.
• Represents all the member devices interconnected within the VCF configuration. (The hostname and other
parameters that you assign to this device during setup apply to all members of the VCF.)
Backup RE
The device that functions as the backup Routing Engine:
• Should be a spine device (a “must” for Juniper support).
• Maintains a state of readiness to take over the master role if the master fails.
• Synchronizes with the master in terms of protocol states, forwarding tables, and so forth, so that it preserves
routing information and maintains network connectivity without disruption when the master is unavailable.
Linecard Role
The slide describes the functions of the linecard in a VCF.
If more than two devices are configured for the RE role, not every one of those devices will actually take on the RE role.
Instead, two REs (master and backup) will be elected and any other device will be placed into the line card role. The slide
describes the behavior of a Spine node that has been configured for the RE role but has actually taken on the linecard role.
Smart Trunks
There are several types of trunks that you will find in a VCF.
1. Automatic Fabric Trunks - When there are two VCPs between members (2x40G between member 4 and 0) they
are automatically aggregated together to form a single logical connection using Link Aggregation Groups (LAGs).
2. Next-Hop Trunks (NH-Trunks) - These are directly attached VCPs between the local member and any other
member. In the slide, NHT1, NHT2, NHT3, and NHT4 are the NH-trunks for member 4.
3. Remote Destination Trunks (RD-Trunks) - These are the multiple, calculated paths between one member and a
remote member. These are discussed on the next slide.
RD-Trunks: Part 1
The slide shows how member 4 is able to determine (using what it learns in the VCCP LSAs) multiple paths to a remote
member (member 15 in the example). In this example, each link between Leaf and Spine is 40G. Because this VCF was
designed using best practices (similar links between all Leaf and Spine nodes), traffic from member 4 to member 15 can be
evenly distributed over the four equal-cost RD-trunks. The following slide shows what happens when best practices are not
followed.
RD-Trunks: Part 2
The slide shows how member 4 is able to determine (using what it learns in the VCCP LSAs) multiple paths to a remote
member (member 15 in the example). The paths do not need to be equal cost paths. All links between members are
40 Gbps except for the link between member 4 and 0 (80 Gbps) and the link between member 3 and 15 (10 Gbps). Based
on the minimum bandwidth of the path, member 4 will assign a weight to each path. This is shown in the next slide.
Fabric Header
A HiGig fabric header is used to pass frames over VCPs. In the case of Layer 2 switching, when an Ethernet frame arrives, the
inbound member will perform an Ethernet switching table lookup (based on MAC address) to determine the destination
member and port. After that, the inbound member encapsulates the incoming frame in the fabric header. The fabric header
specifies the destination member and port (among other things). All members along the path will forward the encapsulated
frame by performing lookups on the fabric header only. Once the frame reaches the destination member, the fabric header is
removed and the Ethernet frame is sent out of the destination port without a second MAC table lookup.
QFabric
The slide highlights the topic we discuss next.
System Components
The QFabric system comprises four distinct components. These components are illustrated on the slide and briefly described
as follows:
• Node devices: The linecard component of a QFabric system, Node devices act as the entry and exit point into
and from the fabric.
• Interconnect devices: The fabric component of a QFabric system, Interconnect devices interconnect and provide
high-speed transport for the attached Node devices.
• Director devices: The primary Routing Engine component of a QFabric system, Director devices provide control
and management services for the system and deliver the primary user interface that allows you to manage all
components as a single device.
• EX Series switches: The control plane link of a QFabric system, EX Series switches provide the required
connectivity between all other system components and facilitate the required control and management
communications within the system.
Node Devices
Node devices connect endpoints (such as servers or storage devices) or external networks to the QFabric system. Node
devices have redundant connections to the system's fabric through Interconnect devices. Node devices are often
implemented in a manner similar to how top-of-rack switches are implemented in legacy multitier data center environments.
By default, Node devices connect to servers or storage devices. However, you can use Node devices to connect to external
networks by adding them to the network Node group.
The QFX3500 and QFX3600 switches can be used as Node devices within a QFabric system. We provide system details for
these devices on subsequent slides.
By default, the QFX3500 and QFX3600 switches function as standalone switches or Node devices, depending on how the
device is ordered. However, through explicit configuration, you can change the operation mode from standalone to fabric.
We provide the conversion process used to change the operation mode from standalone to fabric on a subsequent slide in
this content.
Interconnect Devices
Interconnect devices serve as the fabric between all Node devices within a QFabric system. Two or more Interconnect
devices are used in QFabric systems to provide redundant connections for all Node devices. Each Node device has at least
one fabric connection to each Interconnect device in the system. Data traffic sent through the system and between remote
Node devices must traverse the Interconnect devices, thus making this component a critical part of the data plane network.
We discuss the data plane connectivity details on a subsequent slide in this content.
The two Interconnect devices available are the QFX3008-I and the QFX3600-I Interconnect devices. The model deployed will
depend on the size and goals of the implementation. We provide system details for these devices and some deployment
examples on subsequent slides in this content.
Director Devices
Together, two Director devices form a Director group. The Director group is the management platform that establishes,
monitors, and maintains all components in the QFabric system. The Director devices run the Junos operating system (Junos
OS) on top of a CentOS foundation.
These devices are internally assigned the names DG0 and DG1. The assigned name is determined by the order in which the
device is deployed. DG0 is assigned to the first Director device brought up and DG1 is assigned to the second Director device
brought up. The Director group handles tasks such as network topology discovery, Node and Interconnect device
configuration and startup, and system provisioning services.
A Design Option
In some data center environments the QFX3000-M QFabric System may be too small while the QFX3000-G QFabric System
may be too large. In such environments, you could, as shown on the slide, interconnect multiple QFX3000-M QFabric
Systems using a pair of core switches such as the EX9200 Series switches. For detailed information on this design refer to
the associated guide found at: http://www.juniper.net/us/en/local/pdf/reference-architectures/8030012-en.pdf.
Node Groups
The slide provides a brief explanation of the Node group software abstraction along with some other key details that relate to
Node groups including the types of Node groups and the default Node group association for Node devices. We expand on
these points throughout this section.
Note
Node groups are also referred to as independent network
elements (INEs). The INE reference might show up in
some debug and log outputs.
Note
A Server Node group is sometimes referred to as a top-of-rack (ToR).
Note
Redundant server Node groups are also referred
to as PTORs or pair of TORs in some cases.
Junos Fusion
The slide highlights the topic we discuss next.
Junos Fusion
Juniper Networks® Junos® Fusion addresses the challenges posed by traditional network architectures and provides
customers with a bridge from legacy networks to software-defined cloud networks. This innovative architecture is based on
three design principles: simplicity at scale, smart, and flexible.
A highly scalable fabric, Junos Fusion collapses multitier architectures into a single tier, reducing the number of devices in
the data center network and cutting CapEx. Junos Fusion is a centrally managed fabric that features plug-and-play
provisioning and auto-configuration capabilities, which greatly simplifies operations at scale and reduces OpEx while
accelerating the deployment of new applications and services.
Scale
Currently, Junos Fusion allows for a single AD to control up to 64 SDs. If each SD is a QFX5100-96S, the Junos Fusion could
have a total of 6144 Extended Ports.
Supported Devices
The slide shows the currently supported ADs and SDs.
Modes of Operation
There are two proposed modes of operation for a Junos Fusion:
1. Extended Mode: The basic premise of the extended operating mode is "simple edge, smart core". The AD
contains a highly scalable L2/L3 networking stack and an advanced forwarding plane based on custom silicon.
Each SD appears as a logical line card in the AD; its physical port configuration and control/forwarding plane
state reside entirely on the aggregation device. In this mode, the SD is auto-discovered and managed from the
AD. An SD forwards incoming traffic on an extended port to the AD, inserting the .1BR-defined encapsulation
that contains the EPID associated with the extended port. Upon receiving the packet with the .1BR
encapsulation, the AD extracts the EPID and uses that tag to perform the MAC lookup for the incoming frame to
determine the outbound extended port. To forward a frame destined to an extended port, the AD inserts the
.1BR encapsulation, setting the EPID field to the destined extended port. The SD forwards the incoming traffic
on the upstream port to the destined extended port by extracting the EPID field from the .1BR header and
looking up the "EPID-to-extended-port" mapping table built by the satellite management protocol. In this mode
of operation, local switching on the SD is not possible. All incoming traffic must be sent to the AD for the lookup.
2. Program Mode (not currently supported): In program mode, the SD provides a well-defined programmatic
extension (a JSON API) that allows an external application, running on the aggregation device or some remote
device, to program the forwarding plane of the satellite device. This programming involves identifying specific
traffic flows in the data path based on the fields in the L2/L3 header and specifying a forwarding action that
overrides the default .1BR forwarding decision. In program mode, both the .1BR interface and the programmatic
extensions coexist and complement each other—.1BR provides simplicity, whereas the programmatic interface
provides flexibility by exploiting the forwarding chip capabilities of the SD.
Extended Mode
As described on the previous slide, Extended mode provides for standard IEEE 802.1BR forwarding behavior. No matter what
interface an incoming frame is destined to, it is always passed to the AD (using IEEE 802.1BR encapsulation) for the MAC
table lookup.
Program Mode
Although not currently supported, program mode will allow the switching behavior of an SD to be modified. This
reprogramming is made possible through the JSON-RPC API. This API will allow third-party applications to be developed
that might allow an SD to perform local switching, packet filtering, and so on. Program mode will work in conjunction with
IEEE 802.1BR forwarding (see the following slides).
Naming Convention
The slide shows the Extended Port naming convention for a Junos Fusion. Each SD will be represented by an FPC slot
number that must be greater than or equal to 100. Other than that, the Extended ports follow the standard interface naming
convention of prefix-fpc/pic/port.
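As a hedged illustration, assigning FPC slot 101 to a satellite reached through a given cascade port might look like the following sketch; the slot number and cascade port are examples. The satellite's ports would then appear on the AD using the normal convention, for example xe-101/0/0.

chassis {
    satellite-management {
        fpc 101 {                      # example satellite FPC slot (must be 100 or higher)
            cascade-ports xe-0/0/1;    # example AD port facing the satellite
        }
    }
}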
Auto LAG
You can add more than one link between the AD and SD. When you do, there is no need to configure a LAG; instead, the links
are automatically placed into a LAG bundle.
Use Cases
The slide shows the three use cases for Junos Fusion: Provider Edge, Data Center, and Campus.
We Discussed:
• Key concepts and components of a Virtual Chassis;
• Key concepts and components of a VCF;
• Key concepts and components of a QFabric System; and
• Key concepts and components of Junos Fusion.
Review Questions
1.
2.
3.
We Will Discuss:
• The reasons for the shift to IP fabrics;
• The design considerations for an IP fabric;
• How to scale an IP fabric; and
• The design considerations of VXLAN.
Overlay Networking
Overlay networking can help solve many of the requirements and problems discussed in the previous slides. This slide shows
the addition of an overlay network that includes the use of VXLAN. The overlay network consists of the virtual switches and
the VXLAN tunnel endpoints (VTEPs). A VTEP will encapsulate the Ethernet frames that it receives from the virtual switch into
IP and forward the resulting IP packet to the remote VTEP. The underlay network simply needs to forward IP packets between
VTEPs. The receiving VTEP will de-encapsulate the VXLAN IP packets and then forward the resulting Ethernet Frame to the
appropriate VM. Adding and removing VMs from the data center has no effect on the underlay network. The underlay
network simply needs to provide IP connectivity between the VTEPs.
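On a Junos-based VTEP, the overlay mapping is conceptually as simple as the following sketch: the loopback is used as the VTEP source interface and a VLAN is bound to a VXLAN network identifier (VNI). The names and VNI value are examples, and flood handling (a multicast group or an EVPN control plane) must be added on top of this.

switch-options {
    vtep-source-interface lo0.0;    # loopback used as the VTEP source
}
vlans {
    v100 {                          # example VLAN name
        vlan-id 100;
        vxlan {
            vni 5100;               # example VNI
        }
    }
}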
When designing the underlay network in this scenario, you have a few choices. You can use an Ethernet fabric like Virtual
Chassis (VC), Virtual Chassis Fabric (VCF), or QFabric. All of these are valid solutions. Because all of the traffic crossing the
underlay network is IP, the option for an IP fabric becomes available. The choice of underlay network comes down to scale
and future growth. An IP fabric is considered to be the most scalable underlay solution for a few reasons as discussed later
in the chapter.
The diagram shows an IP Clos Fabric using Juniper Networks switches. In an IP Fabric the Ingress and Egress stage crossbar
switches are called Leaf nodes. The middle stage crossbar switches are called Spine nodes. Most diagrams of an IP Fabric
do not present the topology with three distinct stages as shown on this slide. Most diagrams show an IP Fabric with the Ingress
and Egress stages combined as a single stage. It would be like taking the top of the diagram and folding it over onto itself with
all Spine nodes on top and all Leaf nodes on the bottom of the diagram (see the next slide).
Layer 3 Connectivity
Remember that your IP Fabric will be forwarding IP data only. Each node will be an IP router. In order to forward IP packets
between routers, they need to exchange IP routes. So, you have to make a choice between routing protocols. You want to
ensure that your choice of routing protocol is scalable and future proof. As you can see by the chart, BGP is the natural
choice for a routing protocol.
IBGP: Part 1
IBGP is a valid choice as the routing protocol for your design. IBGP peers almost always peer to loopback addresses as
opposed to physical interface addresses. In order to establish a BGP session (over a TCP session), a router must have a route
to the loopback address of its neighbor. To learn the route to a neighbor, an Interior Gateway Protocol (IGP) like OSPF must be
enabled in the network. One purpose of enabling an IGP is simply to ensure that every router knows how to get to the loopback
address of all other routers. Another problem that OSPF solves is determining all of the equal-cost paths to remote
destinations. For example, router A will determine from OSPF that there are two equal-cost paths to reach router B. Now
router A can load balance traffic destined for router B’s loopback address (IBGP learned routes, see next few slides) across
the two links towards router B.
IBGP: Part 2
There is a requirement in an IBGP network that if one IBGP router needs to advertise an IBGP route, then every other IBGP
router must receive a copy of that route (to prevent black holes). One way to ensure this happens is to have every IBGP router
peer with every other IBGP router (a full mesh). This works fine but it does not scale (i.e., add a new router to your IP fabric
and you will have to configure every router in your IP fabric with a new peer). There are two ways to help scale the full mesh
issue: route reflection or confederations. Most often, route reflection is chosen (it is easy to implement). It is
possible to have redundant route reflectors as well (shown on the slide). It is best practice to configure the Spine nodes as
route reflectors.
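To tie this to configuration, a minimal route-reflector peering sketch for a Spine node follows; the group name, cluster ID, and loopback addresses are examples only.

protocols {
    bgp {
        group IBGP-RR-CLIENTS {           # example group name
            type internal;
            local-address 192.168.100.1;  # this Spine's loopback (example)
            family inet {
                unicast;
            }
            cluster 192.168.100.1;        # makes the Leaf peers route reflector clients
            multipath;
            neighbor 192.168.100.11;      # Leaf loopbacks (examples)
            neighbor 192.168.100.12;
        }
    }
}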
IBGP: Part 3
You must design your IP Fabric such that all routers load balance traffic over equal cost paths (when they exist) towards
remote networks. Each router should be configured for BGP multipath so that they will load balance when multiple BGP
routes exist. The slide shows that router A and B advertise the 10.1.1/24 network to RR-A. RR-A will use both route for
forwarding (multipath) but will chose only one of those routes (the one from router B because it B has the lowest router ID) to
send to router C (a Leaf node) and router D (a Spine node). Router C and router D will receive the route for 10.1.1/24. Both
copies will have a BGP next hop of router B’s loopback address. This is the default behavior of route advertisement and
selection in the IBGP with route reflection scenario.
Did you notice the load balancing problem (hint: the problem is not on router C)? Since router C has two equal-cost paths to
get to router B (learned from OSPF), router C will load balance traffic to 10.1.1/24 over the two uplinks towards the Spine
routers. The load balancing problem lies on router D. Since router D received a single route that has a BGP next hop of router
B’s loopback, it forwards all traffic destined to 10.1.1/24 towards router B. The path through router A (which is an equal-cost path
to 10.1.1/24) will never be used in this case. The next slide discusses the solution to this problem.
IBGP: Part 4
The problem on RR-A is that it sees the routes received from routers A and B for 10.1.1/24 as a single route that has been
received twice. If an IBGP router receives different versions of the same route, it is supposed to choose between them
and then advertise the single, chosen route to its appropriate neighbors. One solution to this problem is to make every Spine
node a route reflector. This would be fine in a small fabric but probably would not make sense when there are tens of Spine
nodes. Another option is to make each of the advertisements from routers A and B look like unique routes. How can we
make the multiple advertisements of 10.1.1/24 from routers A and B appear to be unique routes? There is a draft RFC
(draft-ietf-idr-add-paths) that defines the ADD-PATH capability, which does just that: it makes the advertisements look unique. All
Spine routers in the IP Fabric should support this capability for it to work. Once enabled, routers advertise and evaluate
routes based on a tuple of the network and its path ID. In the example, routers A and B advertise the 10.1.1/24 route.
However, this time RR-A and router D support the ADD-PATH capability. RR-A attaches a unique path ID to each route and is
able to advertise both routes to router D. When the routes arrive at router D, router D installs both routes in its routing
table (allowing it to load balance towards routers A and B).
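A minimal sketch of how ADD-PATH might be enabled in Junos on the route reflector and its clients is shown below (the group name and path count are hypothetical); the send and receive directions are configured per address family.
protocols {
    bgp {
        group IBGP-RR-CLIENTS {
            family inet {
                unicast {
                    add-path {
                        receive;
                        send {
                            /* advertise up to four paths per prefix */
                            path-count 4;
                        }
                    }
                }
            }
        }
    }
}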
EBGP: Part 1
EBGP is also a valid design to use in your IP Fabric. You will notice that the load balancing problem is much easier to fix in the
EBGP scenario. For example, there will be no need for the routers to support any draft RFCs! The first requirement in the
design of your IP Fabric is that each router should be in its own unique AS. You can use AS numbers from the private or
public range or, if you will need thousands of AS numbers, you can use 32-bit AS numbers.
EBGP: Part 2
In an EBGP design, there is no need for route reflectors or an IGP. The BGP peering sessions parallel the physical wiring. For
example, every Leaf node has a BGP peering session with every Spine node. There are no leaf-to-leaf or spine-to-spine BGP
sessions, just as there is no leaf-to-leaf or spine-to-spine physical connectivity. EBGP peering is done using the physical
interface IP addresses (not loopback interfaces). To enable proper load balancing, all routers need to be configured for
multipath multiple-as as well as a load-balancing policy that is applied to the forwarding table, as shown.
policy-options {
    policy-statement PFE-LB {
        then {
            load-balance per-packet;
        }
    }
}
...
routing-options {
    forwarding-table {
        export PFE-LB;
    }
}
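For completeness, a minimal sketch of the corresponding EBGP group on a Leaf node might look like the following (the group name, AS numbers, and neighbor addresses are hypothetical); multipath multiple-as is what allows equal-cost BGP routes learned from different neighbor AS numbers to be installed together.
routing-options {
    autonomous-system 64514;
}
protocols {
    bgp {
        group SPINE-PEERS {
            type external;
            multipath {
                multiple-as;
            }
            /* hypothetical point-to-point addresses of the Spine peers */
            neighbor 172.16.1.0 {
                peer-as 64512;
            }
            neighbor 172.16.1.2 {
                peer-as 64513;
            }
        }
    }
}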
EBGP: Part 3
The slide shows that the routers in AS64516 and AS64517 are advertising 10.1.1/24 to their two EBGP peers. Because
multipath multiple-as is configured on all routers, the receiving routers in AS64512 and AS64513 will install both
routes in their routing tables and load balance traffic destined to 10.1.1/24.
EBGP: Part 4
The slide shows that the routers in AS64512 and AS64513 are advertising 10.1.1/24 to all of their EBGP peers (all Leaf
nodes). Since multipath multiple-as is configured on all routers, the receiving router in the slide (the router in
AS64514) will install both routes in its routing table and load balance traffic destined to 10.1.1/24.
Edge Design
The slide shows one example of an Edge design that includes multiple interconnected data centers. You should notice that both
IP Fabrics are using the same AS numbers. Normally, you would expect this to cause a routing problem. For example,
if the AS64513 router in DC1 receives a route that had passed through the AS64513 router in DC2, the route would be
dropped by the router in DC1. It would drop the route because it will have detected an AS PATH loop. To make it possible to
advertise BGP routes between data centers and ensure that those routes are not dropped, you can configure the Edge routers to
perform AS override when advertising routes to the remote DC. This feature causes the Edge routers to replace every AS
number in the AS PATH of the routes to be advertised with their own AS. That way, when the routes arrive at the remote data
center, they appear to have come from one AS (although that AS number will appear several times in the AS PATH attribute).
IP Fabric Scaling
The slide highlights the topic we discuss next.
Scaling
To increase the overall throughput of an IP Fabric, you simply need to increase the number of Spine devices (and the
appropriate uplinks from the Leaf nodes to those Spine nodes). If you add one more Spine node to the fabric, you will also
have to add one more uplink to each Leaf node. Assuming that each uplink is 40GbE, each Leaf node can now forward an
extra 40Gbps over the fabric.
Adding and removing both server-facing ports (downlinks from the Leaf nodes) and Spine nodes will affect the
oversubscription (OS) ratio of a fabric. When designing the IP fabric, you must understand the OS requirements of your
customer. For example, does your customer need line-rate forwarding over the fabric? Line-rate forwarding would equate to
1-to-1 (1:1) OS. That means the aggregate server-facing bandwidth is equal to the aggregate uplink bandwidth. Or, maybe
your customer would be perfectly happy with a 3:1 OS of the fabric. That is, the aggregate server-facing bandwidth is three times
the aggregate uplink bandwidth. Most customers will probably not require or desire to design around a 1:1 OS.
Instead, they will need to decide, based on their normal bandwidth usage, what OS ratio makes the most sense. The next few
slides discuss how to calculate OS ratios of various IP fabric designs.
3:1 Topology
The slide shows a basic 3:1 OS design. All Spine nodes, four in total, are QFX5100-24Q routers that each have (32) 40GbE
interfaces. All Leaf nodes, 32 in total, are QFX5100-48S routers that have (6) 40GbE uplink interfaces and (48) 10GbE
server-facing interfaces. Each of the (48) 10GbE ports on all 32 Leaf nodes will be fully utilized (i.e., attached to
downstream servers). That means that the total server-facing bandwidth is 48 x 32 x 10Gbps, which equals 15360 Gbps.
Each of the 32 Leaf nodes has (4) 40GbE Spine-facing interfaces in use (one to each of the four Spine nodes). That means that the total uplink bandwidth is 4 x 32 x
40Gbps, which equals 5120 Gbps. The OS ratio for this fabric is 15360:5120, or 3:1.
An interesting thing to note is that if you remove any number of Leaf nodes, the OS ratio does not change. For example, what
would happen to the OS ratio if there were only 31 Leaf nodes? The server-facing bandwidth would be 48 x 31 x 10Gbps, which
equals 14880 Gbps. The total uplink bandwidth would be 4 x 31 x 40Gbps, which equals 4960 Gbps. The OS ratio for this fabric is
14880:4960, or 3:1. This fact actually makes your design calculations very simple. Once you decide on an OS ratio and
determine the number of Spine nodes that will allow that ratio, you can simply add and remove Leaf nodes from the topology
without affecting the original OS ratio of the fabric.
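Because the Leaf count cancels out of the calculation, the OS ratio can be checked with per-Leaf arithmetic alone:
    Per-Leaf server-facing bandwidth = 48 x 10Gbps = 480 Gbps
    Per-Leaf uplink bandwidth = 4 x 40Gbps = 160 Gbps
    OS ratio = 480:160 = 3:1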
2:1 Topology
The slide shows a basic 2:1 OS design in which two Spine nodes were added to the topology from the last slide. All Spine
nodes, six in total, are QFX5100-24Q routers that each have (32) 40GbE interfaces. All Leaf nodes, 32 in total, are
QFX5100-48S routers that have (6) 40GbE uplink interfaces and (48) 10GbE server-facing interfaces. Each of the (48) 10GbE
ports on all 32 Leaf nodes will be fully utilized (i.e., attached to downstream servers). That means that the total
server-facing bandwidth is still 48 x 32 x 10Gbps, which equals 15360 Gbps. Each of the 32 Leaf nodes has (6) 40GbE
Spine-facing interfaces. That means that the total uplink bandwidth is 6 x 32 x 40Gbps, which equals 7680 Gbps. The OS
ratio for this fabric is 15360:7680, or 2:1.
1:1 Topology
The slide shows a basic 1:1 OS design. All Spine nodes, six in total, are QFX5100-24Q routers that each have (32) 40GbE
interfaces. All Leaf nodes, 32 in total, are QFX5100-48S routers that have (6) 40GbE uplink interfaces and (48) 10GbE
server-facing interfaces. There are many ways that a 1:1 OS ratio can be attained. In this case, although the Leaf nodes
each have (48) 10GbE server-facing interfaces, we are only going to allow 24 servers to be attached at any given moment.
That means that the total server-facing bandwidth is 24 x 32 x 10Gbps, which equals 7680 Gbps. Each of the 32 Leaf
nodes has (6) 40GbE Spine-facing interfaces. That means that the total uplink bandwidth is 6 x 32 x 40Gbps, which equals
7680 Gbps. The OS ratio for this fabric is 7680:7680, or 1:1.
Best Practices
When designing an IP fabric, you should follow some best practices. Remember, two of the main goals of an IP fabric design
(or a Clos design) are to provide a non-blocking architecture and to provide predictable load-balancing behavior.
Some of the best practices that should be followed include the following:
• All Spine nodes should be the exact same type of router. They should be the same model and they should also
have the same line cards installed. This helps the fabric to have a predictable load balancing behavior.
• All Leaf nodes should be the exact same type of router. Leaf nodes do not have to be the same router as the
Spine nodes. Each Leaf node should be the same model and they should also have the same line cards
installed. This helps the fabric to have a predictable load balancing behavior.
• Every Leaf node should have an uplink to every Spine node. This helps the fabric to have a predictable load
balancing behavior.
• All uplinks from Leaf node to Spine node should be the exact same speed. This helps the fabric to have
predictable load balancing behavior and also helps with the non-blocking nature of the fabric. For example, let
us assume that a Leaf has one 40GbE uplink and one 10GbE uplink to the Spine. When using the combination
of OSPF (for loopback interface advertisement and BGP next hop resolution) and IBGP, the bandwidth of the
links is taken into consideration when calculating the shortest path to the BGP next hop. OSPF will almost
always choose the 40GbE interface for forwarding towards remote BGP next hops, which essentially blocks
the 10GbE interface from ever being used. In the EBGP scenario, the bandwidth is not taken into
consideration, so traffic will be equally load balanced over the two different-speed interfaces. Imagine trying to
equally load balance 60 Gbps of data over the two links; how will the 10GbE interface handle 30Gbps of traffic?
The answer is...it won’t.
VXLAN
The slide highlights the topic we discuss next.
VXLAN
VXLAN is defined in RFC 7348. It describes a scheme to tunnel (overlay) Layer 2 networks over a Layer 3 network. Each
overlay network is termed a VXLAN segment and is identified by a 24-bit segment ID called the VXLAN Network Identifier
(VNI). Usually, a tenant’s 802.1Q VLAN is mapped to a single VNI. The 24-bit segment ID allows for ~16M VXLAN segments to
coexist within the same administrative domain.
VXLAN Benefits
The slide lists some of the benefits of using a VXLAN overlay scheme.
VXLAN Gateway
Based on what we have learned, the VTEP allows VMs on remote servers (that also have VTEPs) to communicate with each
other. What happens when a VM wants to communicate with the Internet or with a bare metal server (BMS)? In the case of
the Internet, how will the VXLAN packets from the sending VTEP get decapsulated before they leave the data center and are
passed on to the Internet? You must have a VXLAN Gateway to perform this function. A VXLAN Gateway is a networking device
that has the ability to perform the VTEP function. Some of the scenarios that a VXLAN Gateway can help with include the
following:
• VM to Internet connectivity;
• VM to BMS connectivity;
• VM on one VNI to communicate with a VM on another VNI (using IRB interfaces); and
• DC to DC connectivity (data center interconnection).
BUM Traffic
The slide discusses the handling of BUM traffic by VTEPs according to the VXLAN standard model. In this model, you should
note that the underlay network must support a multicast routing protocol, preferably some form of Protocol Independent
Multicast Sparse Mode (PIM-SM). Also, the VTEPs must support the Internet Group Management Protocol (IGMP) so that they can
inform the underlay network that they are members of the multicast group associated with a VNI.
For every VNI used in the data center, there must also be a multicast group assigned. Remember that there are 2^24 (~16M)
possible VNIs, so your customer could need up to 2^24 group addresses. Luckily, 239/8 is a reserved set of organizationally scoped
multicast group addresses (2^24 group addresses in total) that can be used freely within your customer’s data center.
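On a Juniper switch acting as a VTEP, the VNI-to-group mapping is typically expressed per VLAN. A minimal sketch follows (the VLAN name, VLAN ID, and VNI are hypothetical; the group address matches the 239/8 range discussed above):
vlans {
    /* hypothetical tenant VLAN mapped to a VNI */
    TENANT-A {
        vlan-id 100;
        vxlan {
            vni 5100;
            multicast-group 239.1.1.1;
        }
    }
}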
Multicast Forwarding
When VTEP B receives a broadcast packet from a local VM, VTEP B encapsulates the Ethernet frame into the appropriate
VXLAN/UDP/IP headers. However, it sets the destination IP address of the outer IP header to the VNI’s group address
(239.1.1.1 on the slide). Upon receiving the multicast packet, VTEP B’s DR (the PIM router closest to VTEP B) encapsulates
the multicast packet into unicast PIM register messages that are destined to the IP address of the RP. Upon receiving the
register messages, the RP de-encapsulates the register messages and forwards the resulting multicast packets down the
(*,G) tree. Upon receiving the multicast VXLAN packet, VTEP A does the following:
1. Strips the VXLAN/UDP/IP headers;
2. Forwards the broadcast packet towards the VMs using the virtual switch;
3. If VTEP B was unknown, VTEP A learns the IP address of VTEP B; and
4. Learns the remote MAC address of the sending VM and maps it to VTEP B’s IP address.
When designing an IP Fabric-based data center, you must ensure that the appropriate devices support PIM-SM, IGMP, and the
PIM DR and RP functions.
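As a rough illustration only, PIM-SM with a static RP might be enabled on the underlay routers with something like the following (the RP address is hypothetical):
protocols {
    pim {
        rp {
            static {
                /* hypothetical RP loopback address */
                address 192.168.255.1;
            }
        }
        interface all {
            mode sparse;
        }
    }
}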
We Discussed:
• The reasons for the shift to IP fabrics;
• The design considerations for an IP fabric;
• How to scale an IP fabric; and
• The design considerations of VXLAN.
Review Questions
1.
2.
3.
We Will Discuss:
• The definition of the term Data Center Interconnect (DCI);
• The differences between the different Layer 2 and Layer 3 DCIs; and
• The benefits and use cases for EVPN.
DCI Overview
The slide lists the topics we will discuss. We discuss the highlighted topic first.
DCI Types
A DCI provides connectivity between remote data center sites. A Layer 3 DCI uses IP routing between data centers. It is
assumed that each DC uses a unique set of IP addresses. A Layer 2 DCI stretches the Layer 2 network (VLANs) from one data
center to another.
Transport
This slide lists some of the commonly used Layer 2 and Layer 3 DCI technologies. Several technologies on the list use a
Multiprotocol Label Switching (MPLS) network to provide the connectivity between data centers, including VPLS, EVPN, and
Layer 3 VPNs.
MPLS Advantages
Many of the DCI technologies listed on the previous slide depend on an MPLS network to transport frames between data
centers. Although in most cases an MPLS network can be substituted with an IP network (i.e., by encapsulating MPLS in
GRE), there are several advantages to using an MPLS network:
1. Fast failover between MPLS nodes: Fast reroute and Node/Link protection are two features of an MPLS network
that allow for 50ms or better recovery time in the event of a link failure or node failure along the path of an
MPLS label switched path (LSP).
2. Scalable VPNs: VPLS, EVPN, and Layer 3 MPLS VPNs are DCI technologies that use MPLS to transport frames between
data centers. These same technologies allow for the interconnection of many sites (potentially hundreds)
without the need for the manual setup of a full mesh of tunnels between those sites. In most cases, adding a
new site only requires the administrator to configure the devices at the new site. The remote sites do not need to be
touched.
3. Traffic engineering: MPLS allows the administrator to decide the path traffic takes over the MPLS network. You no
longer have to take the same path calculated by the IGP (i.e., all data takes the same path between sites). You
can literally direct different traffic types to take different paths over the MPLS network.
4. Any-to-any connectivity: When using an MPLS backbone to provide the DCI, you have the flexibility to
provide any type of MPLS-based Layer 2 DCI, Layer 3 DCI, or any combination of the two that you choose. An MPLS
backbone is a network that can generally support most types of MPLS or IP-based connectivity at the same
time.
Things to Remember
The following are some of the key things to remember about working with MPLS labels:
• MPLS labels can be either assigned manually or set up by a signaling protocol running in each LSR along the path
of the LSP. Once the LSP is set up, the ingress router and all subsequent routers in the LSP do not examine the IP
routing information in the labeled packet—they use the label to look up information in their label forwarding tables.
Changing Labels by Segment
• Much as with ATM VCIs, MPLS label values change at each segment of the LSP. A single router can be part of
multiple LSPs. It can be the ingress or egress router for one or more LSPs, and it also can be a transit router of one
or more LSPs. The functions that each router supports depend on the network design.
• The LSRs replace the old label with a new label in a swap operation and then forward the packet to the next router
in the path. When the packet reaches the LSP’s egress point, it is forwarded again based on longest-match IP
forwarding.
• There is nothing unique or special about most of the label values used in MPLS. We say that labels have local
significance, meaning that a label value of 10254, for example, identifies one LSP on one router, and the same
value can identify a different LSP on another router.
Label-Switching Routers
An LSR understands and forwards MPLS packets, which flow on, and are part of, an LSP. In addition, an LSR participates in
constructing LSPs for the portion of each LSP entering and leaving the LSR. For a particular destination, an LSR can be at
the start of an LSP, the end of an LSP, or in the middle of an LSP. An individual router can perform one, two, or all of these
roles as required for various LSPs. However, a single router cannot be both entrance and exit points for any individual LSP.
Label-Switched Path
An LSP is a one-way (unidirectional) flow of traffic, carrying packets from beginning to end. Packets must enter the LSP at the
beginning (ingress) of the path, and can only exit the LSP at the end (egress). Packets cannot be injected into an LSP at an
intermediate hop.
Generally, an LSP remains within a single MPLS domain. That is, the entrance and exit of the LSP, and all routers in between,
are ultimately in control of the same administrative authority. This ensures that MPLS LSP traffic engineering is not done
haphazardly or at cross purposes but is implemented in a coordinated fashion.
RSVP
The Junos OS uses RSVP as the label distribution protocol for traffic engineered LSPs.
• RSVP was designed to be the resource reservation protocol of the Internet and “provide a general facility for
creating and maintaining distributed reservation state across a set of multicast or unicast delivery paths”
(RFC 2205). Reservations are an important part of traffic engineering, so it made sense to continue to use
RSVP for this purpose rather than reinventing the wheel.
• RSVP was explicitly designed to support extensibility mechanisms by allowing it to carry what are called opaque
objects. Opaque objects make no real sense to RSVP itself but are carried with the understanding that some
adjunct protocol (such as MPLS) might find the information in these objects useful. This encourages RSVP
extensions that create and maintain distributed state for information other than pure resource reservation. The
designers believed that extensions could be developed easily to add support for explicit routes and label
distribution.
LDP
LDP associates a set of destinations (prefixes) with each LSP it establishes. This set of destinations is called the forwarding
equivalence class (FEC). These destinations all share a common LSP egress and a common unicast routing path. LDP supports topology-driven
MPLS networks in best-effort, hop-by-hop implementations. The LDP signaling protocol always establishes LSPs that follow
the contours of the IGP’s shortest path. Traffic engineering is not possible with LDP.
Protects Interfaces
Link protection is the Junos OS nomenclature for the facility backup feature defined in RFC 4090. Link protection is just one
of several methods to protect traffic as it traverses the MPLS network. The link protection feature is interface based, rather
than LSP based. The slide shows how the R2 node is protecting its interface and link to R3 through a bypass LSP.
Node Protection
Node protection is the Junos OS nomenclature for the facility backup feature defined in RFC 4090. Node protection is
another one of several methods to protect traffic as it traverses the MPLS network. Node protection uses the same
messaging as link protection. The slide shows that R2 is protecting against the complete failure of R3 through a bypass LSP.
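A minimal sketch of requesting these protections in Junos follows (the interface name, LSP name, and egress address are hypothetical); link protection is enabled per RSVP interface on the point of local repair, while an LSP can request node-link protection end to end.
protocols {
    rsvp {
        interface ge-0/0/1.0 {
            /* build a bypass LSP around this link */
            link-protection;
        }
    }
    mpls {
        label-switched-path TO-R4 {
            to 192.168.4.1;
            /* request protection against both link and node failures */
            node-link-protection;
        }
    }
}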
Provider Routers
Provider (P) routers are located in the IP/MPLS core. These routers do not carry VPN data center routes, nor do they participate
in the VPN control and signaling planes. This is a key aspect of the RFC 4364 scalability model; only PE devices are aware of
VPN routes, and no single PE router must hold all VPN state information.
P routers are involved in the VPN forwarding plane where they act as label-switching routers (LSRs) performing label
swapping (and popping) operations.
VPN Site
A VPN site is a collection of devices that can communicate with each other without the need to transit the IP/MPLS
backbone (i.e., a single data center). A site can range from a single location with one switch or router to a network consisting
of many geographically diverse devices.
Layer 2 DCI
The slide highlights the topic we discuss next.
CCC
Circuit Cross Connect (CCC) is the “static routing” of DCIs. For each CCC connection, an administrator must map a DC-facing
interface (or subinterface) to both an outbound MPLS LSP (for sending data to remote DCs) and an inbound MPLS LSP (for
receiving data from remote DCs). This was the original method of creating Layer 2 VPNs using Juniper Networks devices. This
method works fine, but it does have one glaring issue. CCC does not use MPLS label stacking (two or more MPLS labels stacked
on top of the Ethernet frame) to multiplex different types of data over an LSP. Instead, only one MPLS label is stacked on top
of the original Ethernet frame. Therefore, the LSPs used for a particular CCC are dedicated for that purpose and cannot be
reused for any other purpose. For every CCC, you must create two MPLS LSPs. If you wanted to create a CCC for two VLANs
between the same two sites, you would have to instantiate four MPLS LSPs between the same two PE devices.
Notice that the PE devices do not have to learn MAC addresses to forward data. They simply use the mapping of the CE-facing
interface to the inbound and outbound MPLS LSPs to forward data.
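A minimal sketch of such a mapping in Junos is shown below (the connection name, subinterface, and LSP names are hypothetical); the remote-interface-switch ties the DC-facing subinterface to one transmit LSP and one receive LSP.
protocols {
    connections {
        remote-interface-switch DC1-TO-DC2-VLAN100 {
            /* hypothetical DC-facing subinterface */
            interface ge-0/0/0.100;
            transmit-lsp TO-DC2;
            receive-lsp FROM-DC2;
        }
    }
}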
VPLS
To the DC, a VPLS DCI appears to be a single LAN segment. In fact, it appears to act similarly to a learning bridge. That is,
when the destination media access control (MAC) address is not known, an Ethernet frame is sent to all remote sites. If the
destination MAC address is known, it is sent directly to the site that owns it. The Junos OS supports two variations of VPLS,
BGP signaled VPLS and LDP signaled VPLS.
In VPLS, PE devices learn MAC addresses from the frames that they receive. They use the source and destination addresses to
dynamically create a forwarding table (vpn-name.vpls) for Ethernet frames. Based on this table, frames are forwarded out of
directly connected interfaces or over an MPLS LSP across the provider core. This behavior means that an administrator does not
have to manually map Layer 2 circuits to remote sites.
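A minimal sketch of a BGP-signaled VPLS routing instance on a PE follows (the instance name, interface, route distinguisher, target, and site values are hypothetical):
routing-instances {
    VPLS-DC {
        instance-type vpls;
        /* hypothetical DC-facing interface */
        interface ge-0/0/0.100;
        route-distinguisher 192.168.1.1:100;
        vrf-target target:65000:100;
        protocols {
            vpls {
                site DC1 {
                    site-identifier 1;
                }
            }
        }
    }
}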
EVPN
Similar to VPLS, to CE devices the MPLS network appears to function as a single broadcast domain per VLAN. The network
acts similarly to a learning bridge. PE devices learn MAC addresses from Ethernet frames received from the locally attached
DC. Once a local MAC address is learned, it is advertised to remote PE devices using EVPN MP-BGP NLRI. The remote PEs, in
turn, map the BGP-learned MAC addresses to outbound MPLS LSPs. This synchronization of learned MAC addresses minimizes
the flooding of BUM traffic over the MPLS network.
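A minimal sketch of a VLAN-based EVPN routing instance on an MX PE might look like the following (the instance name, VLAN ID, interface, route distinguisher, and target are hypothetical, and the MP-BGP sessions would additionally need the EVPN signaling family):
routing-instances {
    EVPN-100 {
        instance-type evpn;
        vlan-id 100;
        /* hypothetical DC-facing interface */
        interface ge-0/0/0.100;
        route-distinguisher 192.168.1.1:100;
        vrf-target target:65000:100;
        protocols {
            evpn;
        }
    }
}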
Why EVPN?
Service providers are interested in EVPN because it improves their service offerings by allowing them to replace VPLS with a
newer, more efficient technology. Data center builders are interested in using EVPN because of its proactive approach to
MAC address learning.
OTT of L3VPN
The slide shows an example of the signaling and data plane when using EVPN/VXLAN over a Layer 3 VPN. The two MX Series
devices represent the PE devices for the Layer 3 VPN. The Layer 3 VPN can run over a private MPLS network or can be a
purchased service provider service. From the QFXs’ perspective, they are separated by an IP network. The QFXs simply
forward VXLAN packets between each other based on the MAC addresses learned through EVPN signaling. The MX devices
have an MPLS Layer 3 VPN between each other (bidirectional MPLS LSPs, an IGP, L3 VPN MP-BGP routing, and so on). The MXs
advertise the local QFX’s loopback address to the other MX.
When forwarding data from West to East, QFX1 takes a locally received Ethernet frame and encapsulates it in a VXLAN
packet destined to QFX2’s loopback address. MX1 performs a lookup for the received packet on the VRF table associated
with the VPN interface (the incoming interface) and encapsulates the VXLAN packet into two MPLS headers (outer for MPLS
LSP, inner for MX2 VRF mapping). Upon receiving the MPLS encapsulated packet, MX2 uses the inner MPLS header to
determine the VRF table so that it can route the remaining VXLAN packet to QFX2. QFX2 strips the VXLAN encapsulation and
forwards the original Ethernet frame to the destination host.
EVPN over IP
The slide shows an example of the signaling and data plane when using EVPN over an IP network. EVPN MP-BGP is used to
synchronize MAC tables.
When forwarding data from West to East, QFX1 takes a locally received Ethernet frame and encapsulates it in a VXLAN
packet destined to the remote VTEP’s loopback address. QFX2 strips the VXLAN encapsulation and forwards the remaining Ethernet
frame to the destination host.
Layer 3 DCI
The slide highlights the topic we discuss next.
Layer 3 DCI
A Layer 3 DCI uses routing to interconnect DCs. Each data center must maintain a unique IP address space such that there is
no overlap. A DCI can be established using just about any IP-capable link, including a point-to-point IP link, a Layer 3 MPLS VPN,
an IPsec VPN, or a GRE tunnel. Also, using standard routing, multiple redundant connections are possible between DCs.
Layer 3 VPNs
The Junos OS supports Layer 3 provider-provisioned VPNs based on RFC 4364. In this model, the provider edge (PE) routers
maintain VPN-specific routing tables called VPN routing and forwarding (VRF) tables for each of their directly connected VPNs.
To populate these forwarding tables, the CE routers advertise routes to the PE routers using conventional routing protocols
such as RIP, OSPF, and EBGP.
The PE routers then advertise these routes to other PE routers with Multiprotocol Border Gateway Protocol (MP-BGP) using
extended communities to differentiate traffic from different VPN sites. Traffic forwarded from one VPN site to another is
tunneled across the network using MPLS.
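A minimal sketch of the VRF configuration on a PE is shown below (the instance name, interface, route distinguisher, and target are hypothetical); the matching vrf-target community is what allows the remote PE to import these routes.
routing-instances {
    DC1-L3VPN {
        instance-type vrf;
        /* hypothetical DC-facing interface */
        interface ge-0/0/1.0;
        route-distinguisher 192.168.1.1:200;
        vrf-target target:65000:200;
        vrf-table-label;
    }
}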
We Discussed:
• The definition of the term DCI;
• The differences between the different Layer 2 and Layer 3 DCIs; and
• The benefits and use cases for EVPN.
Review Questions:
1.
2.
3.