The best-path selection time is proportional to the table size, as well as to the time required for update batching. Let's look at a slightly different scenario to demonstrate how BGP multi-path may potentially improve convergence. One might expect to see two paths to every route from the neighboring AS on every router in the local AS, but this is not always possible in topologies other than a full iBGP mesh inside the AS.
Per distance-vector behavior, the route reflectors will only re-advertise the best path to the external prefixes, and since both RRs elect paths consistently, they will advertise the same path to R3, R4, and R2. Both R3 and R4 will therefore receive only that single path. R2 will receive the best path via R1 as well, but will prefer using its own eBGP connection.
By contrast, if R1, R2, R3, and R4 were connected in a full mesh, every router would see the exits via both R1 and R2 and would be able to use BGP multi-path if configured. If BGP speakers can utilize multiple paths at the same time, the severity of a network failure can be alleviated.
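As an illustration, eBGP multipath could be enabled on an IOS-style platform with something like the following (the AS number is hypothetical):

    router bgp 64500
     maximum-paths 2

This instructs the router to install up to two equal eBGP paths into the RIB instead of a single best path.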
Even better, it is theoretically possible to do "fast" re-route in the case where multiple equal-cost (equivalent and thus loop-free) paths are available in BGP.
Such a switchover could be performed in the forwarding engine as soon as the failure is signaled. However, a re-route mechanism of this type has two major problems.
In the following sections we are going to review various technologies developed to accelerate BGP convergence, enabling far better reaction times compared to purely BGP-based failure detection and repair.
If you take the full Internet routing table, which contains several hundred thousand prefixes, then simply transporting the prefixes alone will consume over 10 Megabytes, not counting the path attributes and other information. There are several knobs for tuning the TCP transport performance. Very detailed information on tuning BGP transport can be found in [10], Chapter 3; we therefore skip an in-depth discussion of this topic here.
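As a minimal sketch, the following IOS-style settings (the neighbor address and window size are illustrative assumptions) enable path MTU discovery, so the BGP session can use larger TCP segments, and raise the TCP window size:

    ip tcp path-mtu-discovery
    ip tcp window-size 65535
    !
    router bgp 64500
     neighbor 192.0.2.1 transport path-mtu-discovery

Larger segments mean fewer round trips when transferring a full table, which directly reduces initial convergence time.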
It could be noted that TCP keepalives could be used for the same purpose, but since BGP already has similar mechanics of its own, they are not of much help. Such instability is dangerous, since there is no built-in dampening mechanism in the session establishment process.
The last option is on by default for eBGP sessions and tracks the outgoing interface associated with the BGP session. The command to disable fast peering session deactivation is no bgp fast-external-fallover. Notice that this feature is off by default for iBGP sessions, as those are supposed to be re-routed and restored using the underlying IGP mechanics.
Using BFD is the best option on multipoint interfaces, such as Ethernet, that do not support fast link-down detection (e.g., when connectivity traverses an intermediate switch, so a remote failure does not bring the local link down). BFD is especially attractive on platforms that implement it in hardware. The command to activate BFD fallover is neighbor fall-over bfd.
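A minimal configuration sketch (the interface, timers, and neighbor address are illustrative assumptions): BFD is enabled on the interface, and the BGP session is then tied to it.

    interface GigabitEthernet0/0
     bfd interval 50 min_rx 50 multiplier 3
    !
    router bgp 64500
     neighbor 192.0.2.1 fall-over bfd

With 50 ms intervals and a multiplier of 3, a peer failure is detected in roughly 150 ms, after which the BGP session is torn down immediately rather than waiting for the hold timer to expire.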
In the following sections, we'll discuss the use of the IGP for fast reporting of link failures. BGP prefixes typically rely on recursive next-hop resolution; that is, the next-hops associated with prefixes are normally not directly connected, but rather resolved via the IGP. The BGP Scanner process runs periodically and, among other work, performs a full BGP table walk and validates the BGP next-hop values. The validation consists of resolving each next-hop recursively through the router's RIB and possibly changing the forwarding information in response to IGP events.
However, the IGP will probably converge faster and report R1's address as unreachable. The default BGP Scanner run-time is 60 seconds and can be changed using the command bgp scan-time. Notice that setting this value too low may place an extra burden on the router's CPU if you have large BGP tables, since the scanner process has to perform a full table walk every time it executes.
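For example (the value is only illustrative), the scanner interval could be lowered like this:

    router bgp 64500
     bgp scan-time 30

This halves the worst-case detection delay of the periodic scanner, at the cost of walking the table twice as often.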
IGP protocols can be tuned to react to a network change within hundreds of milliseconds (see [6]), and it would be desirable to make BGP aware of such changes just as quickly. The idea is to make the BGP process register the next-hop values with the RIB "watcher" process and request a "call-back" every time information about the prefix corresponding to a next-hop changes. Two kinds of events may be reported: a next-hop becoming unreachable and a next-hop metric change. The first event is more important and is reported faster than a metric change.
Overall, BGP NHT delays the processing of an event for the duration of the bgp nexthop trigger delay XX interval, which is 5 seconds by default. This allows more consecutive events to be received from the IGP and effectively implements event aggregation.
This delay is helpful in various "fate sharing" scenarios where a facility failure affects multiple links in the network, and BGP needs to ensure that all IGP nodes have reported the failure and the IGP has fully converged. Normally, you should set the NHT delay slightly above the time it takes the IGP to fully converge upon a change in the network. In a fast-tuned IGP network, you can set this delay as low as 0 seconds, so that every IGP event is reported immediately, though this requires careful underlying IGP tuning to avoid oscillations.
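A sketch of the corresponding IOS-style configuration (the delay value is an assumption matching a fast-tuned IGP):

    router bgp 64500
     bgp nexthop trigger enable
     bgp nexthop trigger delay 0

Note that trigger enable is the default on modern releases; it is shown here only for completeness.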
See [6] for more information on tuning the IGP protocol settings; in short, you need to tune the SPF delay value in the IGP to be conservative enough to capture all changes that could be caused by a single failure in the network.
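For instance, an OSPF SPF throttle could be configured as follows (the timer values are illustrative, not a recommendation):

    router ospf 1
     timers throttle spf 50 200 5000

Here the first SPF runs 50 ms after a topology change, and subsequent runs back off from 200 ms up to a 5-second maximum, which bounds how quickly consecutive events are batched before NHT is notified.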
The NHT-triggered walk will affect every prefix whose next-hop changed as a result of the IGP event, and it can take a significant amount of time, depending on the number of prefixes associated with the next-hop.
For example, if an AS has two connections to the Internet and receives full BGP tables over both connections, then a single exit failure will force a walk over several hundred thousand prefixes.
The last, less visible contributor to faster convergence is the hierarchical FIB. Look at the figure below: it shows how the FIB can be organized as either "flat" or "hierarchical". In the "flat" case, BGP prefixes have their forwarding information directly associated with them (e.g., the outgoing interface and layer-2 rewrite information). In such a case, any change to a BGP next-hop may require updating a lot of prefixes sharing the same next-hop, which is a time-consuming process. Even if the next-hop value remains the same and only the output interface changes, the FIB update process still needs to walk all BGP prefixes and reprogram their forwarding information. In the hierarchical case, prefixes instead point to a shared next-hop structure, so only that single entry needs to be rewritten when the path to the next-hop changes.
The use of the hierarchical FIB is automatic and does not require any special commands. All major networking equipment vendors support this feature.

Evaluation of the test data must be done with an understanding of generally accepted testing practices regarding repeatability, variance, and statistical significance of a small number of trials.
For any repeated tests that are averaged to remove variance, all parameters MUST remain the same. The processing of the authentication hash, particularly in devices with a large number of BGP peers and a large amount of update traffic, can have an impact on the control plane of the device. If authentication is enabled, it MUST be documented correctly in the reporting format.
Convergence Events

Convergence events or triggers are defined as abnormal occurrences in the network which initiate route flapping and hence force the reconvergence of a steady-state network.
In a real network, a series of convergence events may cause the convergence latency that operators desire to test. These convergence events must be defined in terms of the sequences defined in the referenced RFC. This basic document begins all tests with an initial router setup; additional documents will define BGP data-plane convergence based on peer initialization. The convergence events may or may not be tied to an actual failure. For cases where the redundancy cannot be disabled, the results are no longer comparable, and the level of impact on the measurements is out of scope of this document.
Test Cases

All tests defined under this section assume the following: a. BGP peers are in the Established state. b. BGP state should be cleared from Established to Idle prior to each test. This is recommended to ensure that all tests start with the BGP peers forced back to the Idle state and their databases flushed.
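On IOS-style platforms this reset could be performed with a hard clear, for example:

    clear ip bgp *

This tears all sessions down to Idle and flushes the learned routes, so each trial starts from the same state.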
Furthermore, traffic generation and routing should be verified in the topology to ensure that no packet loss is observed on any advertised routes. The arrival timestamp of advertised routes can be measured by installing an inline monitoring device between the emulator and the DUT, or by using the span port of the DUT connected to an external analyzer.
The time base of such an inline monitor or external analyzer needs to be synchronized with the protocol and traffic emulator. Some modern emulators may have the capability to capture and timestamp every NLRI packet leaving and arriving at the emulator ports.

Basic Convergence Tests

These test cases measure characteristics of a BGP implementation in non-failure scenarios.
All variables affecting convergence should be set to a basic test state as defined in Section 4. To ensure adjacency establishment, wait for three keepalives to be received from the DUT, or for a configurable delay, before proceeding with the rest of the test. Start the traffic from the emulator towards the DUT, targeted at a route specified in the route mixture (e.g., routeA). Record the time when the traffic targeted at routeA is received by the emulator on the appropriate traffic egress interface. Full convergence for the route update is the measurement between the first route (Rt-A) and the last route (Rt-last).
Note: It is recommended that a single test with the same route mixture be repeated several times. A report should provide the standard deviation and the average of all tests. Running tests with a varying number of routes and route mixtures is important to get a full characterization of a single peer. Reference Test Setup: This test uses the setup as shown in Figure 2.
Procedure: A. Set all variables affecting convergence; these values MAY be the basic test set or a unique set completely described in the test setup. B. Start the traffic from the emulator towards the Helper Node, targeted at a specific route (e.g., routeA).
C. Advertise routeA from the emulator to the DUT and note the time. D. Record when routeA is received by the DUT. E. Record the time when the traffic targeted at routeA is received on the Route Egress Interface. Reference Test Setup: This test uses the setup as shown in Figure 3. Loopback interfaces are configured on the DUT and Helper Node, and connectivity is established between them using any configuration options available on the DUT.
To ensure adjacency establishment, wait for three keepalives to be received from the DUT, or for a configurable delay, before proceeding with the rest of the test. Start the traffic from the emulator towards the DUT, targeted at a specific route (e.g., routeA).
Record the time when the route is received by the DUT. Record the time when the traffic targeted at routeA is received from the egress interface of the DUT on the emulator. The test should be repeated multiple times with each route mixture.
The results should record the mean and standard deviation. Reference Test Setup: This test uses the setup as shown in Figure 1. The shutdown event is defined as an administrative shutdown event on the DUT. All variables affecting convergence, like authentication, policies, and timers, should be set to the basic-test policy. Establish two BGP adjacencies from the DUT to the emulator, one over the peer interface and the other using a second peer interface.
Advertise the same route, routeA, over both adjacencies, with preferences set so that the Best Egress Interface for the preferred next hop is the Emp1 interface. Initially, traffic would be observed on the best egress route, Emp1, instead of Emp2.
Shut down the best egress interface on the DUT and note the time; this time is called the Shutdown time. Measure the convergence time for the event to be detected and traffic to be forwarded to the Next-Best Egress Interface, Dp2. Stop the offered load and wait for the queues to drain.
Restart the data flow. Measure the convergence time taken for the traffic to be rerouted from Dp2 to the Best Egress Interface, Dp1.
It is recommended that the test be repeated with a varying number of routes and route mixtures, or with a number of routes and route mixtures close to what is deployed in operational networks. The shutdown event is defined as a logical shutdown of the local interface of the Tester. The procedure is the same as above; however, policy directs the routes to be sent over only one of the paths. Reference Test Setup: This test uses the setup as shown in Figure 1 and the procedure described earlier in Section 5.
All variables affecting convergence, like authentication, policies, and timers, should be set to the basic-test policy. Initially, traffic would be observed on the Best Egress Interface.
Measure the convergence time for the event to be detected and traffic to be forwarded to the Next-Best Egress Interface. This time is Tr-rr2, also called TR2-traffic-on. Stop the offered load, wait for the queues to drain, and restart the data flow.
Measure the convergence time taken for the traffic to be rerouted to the Best Egress Interface. This time is Tr-rr1, also called TR1-traffic-on.

When sending updates over an iBGP peering, the next-hop is normally not modified unless next-hop-self is used.
It is very common to use next-hop-self to avoid having to carry the external next-hops in your IGP. Next-hop-self works well in general, but it does have some drawbacks. If a router sets the next-hop to itself, normally a loopback, all traffic is attracted to that next-hop. Assume that this next-hop loses the eBGP peering over which it is learning all the external prefixes.
Because we are setting the next-hop to ourselves, the IGP will not be aware of this event. Convergence will then depend on BGP: this router must send withdraw messages, the other routers have to calculate a new best path, and so on.
If we had instead sent the prefixes with an unmodified next-hop, convergence would have depended on the IGP, since the exit link would be part of the IGP. These are the kinds of decisions that are always involved in a design; there are always tradeoffs. Keep your IGP as small as possible, or achieve faster convergence? If you do put exit links into your IGP, you should either make the links passive or redistribute the network into your IGP.
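A sketch of the two options for an IOS-style PE (the interface names, addresses, and process IDs are hypothetical):

    router bgp 64500
     neighbor 10.0.0.2 next-hop-self
    !
    router ospf 1
     passive-interface GigabitEthernet0/1
     network 198.51.100.0 0.0.0.255 area 0

With next-hop-self the IGP stays small; with the passive exit link in the IGP, the link's failure is flooded by the IGP, and BGP paths through it are invalidated at IGP speed.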
So which one is better? External prefixes have a wider flooding scope than internal prefixes, but they do not generally require a full SPF run. If IS-IS is in use, a partial route calculation (PRC) would be run rather than a full run, because an external prefix is considered a leaf of the SPF topology: only reachability has changed, not the topology. Almost any network of scale will use RRs in some form. This, however, has some drawbacks: less path diversity, suboptimal exit paths, and a slower-converging network.
How can we work around these issues? An RR will only reflect one best path, which means that a lot of the paths in the network will never get used. In an MPLS VPN, a unique RD per PE can be used; this means that an RR will see the same network from different PEs as two different prefixes. Achieving diversity in a plain IP network takes some more work. This can be done in various ways, such as a shadow RR, a shadow session, or add-path.
Using a shadow RR or shadow session only requires support on the RR, while add-path requires support on the PEs as well. Advertising the extra paths is only the first part; the second part is to have the router actually install the backup path. With a backup path in place, we can achieve fast convergence.
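On IOS-style platforms, installing a backup path could look like this (the AS number is hypothetical, and exact syntax varies by release):

    router bgp 64500
     address-family ipv4
      bgp additional-paths install

The backup path is pre-programmed in the FIB, so the switchover does not have to wait for a new best-path calculation.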
If we want to do load sharing, that would require multipath, with either equal-cost routes or a relaxed BGP best-path algorithm that allows installing multiple paths.
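A load-sharing sketch (values hypothetical): enable iBGP multipath, and optionally relax the requirement that multipath candidates carry an identical AS path.

    router bgp 64500
     address-family ipv4
      maximum-paths ibgp 2
      bgp bestpath as-path multipath-relax

With multipath-relax, paths only need AS paths of equal length, rather than identical ones, to be eligible for multipath.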
In some situations, we will not achieve diversity because a PE will not advertise its path due to policy, which means that the diverse path never gets advertised in the network. To overcome this, the best-external feature can be used; the PE will then advertise its best external path alongside the actual best path.
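A minimal sketch of enabling it on an IOS-style PE (the AS number is hypothetical):

    router bgp 64500
     address-family ipv4
      bgp advertise-best-external

This makes the PE advertise its best external path even when an iBGP-learned path is the overall best, restoring the path diversity that policy would otherwise hide.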
Fun fact: setting a different MRAI per neighbor will break up the update-groups. So, if you want to keep the same update-group membership, adjust the MRAI for all of its members accordingly.
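The MRAI is set per neighbor with the advertisement-interval knob (the neighbor address and value are hypothetical):

    router bgp 64500
     neighbor 192.0.2.1 advertisement-interval 5

Because update-group membership is derived from a neighbor's outbound configuration, changing this timer for one neighbor splits it into its own update-group.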