Internet is out of memory?

It means poorly maintained networks will have more and more problems as the routing tables get even bigger. For everyone else it'll mean a bug fix with router reboots, which should result in zero downtime assuming the ISP uses a resilient architecture (i.e. simple failover). A lot of companies that use BGP peering for their ARIN-assigned (or insert your IP authority here) address space will similarly need to do a round of OS updates with router reboots. It isn't as big a deal as most people think.
 
Isn't it just a config option they forgot to change?
 
Can't they just download more memory from private networks?

The internet is not really well designed though; it really needs to work more like a mesh than a centrally managed network.
 
It means someone is not doing his job. It's been known forever that the internet BGP routing table grows over time and that routers have finite memory to hold it, so memory utilization has to be monitored via SNMP, with a plan in place to upgrade either the memory or the route processor ahead of impending depletion. I wouldn't think ISPs would let this slip by, so it's probably the larger companies with an assigned Autonomous System number for eBGP peering that aren't doing their job.

If you want to see what a BGP route looks like:

- Browse to http://bgp.he.net to get your IP address and your ISP's AS number

- Telnet to a public route server such as route-server.ip.att.net and issue the command "show route <your ip address>" without the double quotes and <>

- Look for the "AS path:" line. The first AS number in the path is the one the route server is directly peered with; if there is more than one, the sequence is the path taken to the destination

Use the mentioned web site to convert AS # to ISP name.
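
If you want to pull the AS numbers out of that line with a script, here is a minimal Python sketch. The function name and the sample line are made up for illustration (the AS numbers are from the range reserved for documentation), and it assumes the line contains the text "AS path:" followed by space-separated AS numbers. Any non-numeric token after the numbers is simply ignored, which is an assumption about the output rather than a documented format.

```python
def parse_as_path(line):
    """Return the AS numbers found after 'AS path:' in a single output line.

    Assumption: AS numbers are plain decimal tokens separated by spaces.
    Any token that is not purely digits is skipped.
    """
    label = "AS path:"
    if label not in line:
        return []
    tokens = line.split(label, 1)[1].split()
    return [int(token) for token in tokens if token.isdigit()]


if __name__ == "__main__":
    # Hypothetical sample line, not copied from a real route server.
    sample = "    AS path: 64496 64497 64498 I"
    path = parse_as_path(sample)
    print("directly peered AS:", path[0])
    print("full path:", path)
```

The first element of the returned list is the directly peered AS, and the rest is the remainder of the path towards the destination.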
 
Isn't it just a config option they forgot to change?

Yes and no.

In a router you can divide the information into several layers of storage:

1) The routing process itself, such as BGP (when we talk about the Internet), but also other protocols such as ISIS, OSPF, EIGRP etc.

2) The routing process(es) fill an internal database called the RIB (Routing Information Base).

3) Out of this RIB the routes are compiled into a FIB (Forwarding Information Base), which is then pushed down to the chips handling the physical interfaces.

The routing process itself uses the onboard system CPU and RAM in the router.

The RIB often uses the system CPU and RAM as well, but it has limits on how many entries it can maintain, basically because the hashing structures that speed up lookups use some RAM themselves. Common limits here range from 1M entries up to whatever the vendor has set for a particular model. This means that if you import a full table of, say, 570k or so entries, it will take 570k entries in the RIB. If you have a second peer and take a full table from there as well, it will take another 570k, making 1140k entries in total in the RIB.
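
The arithmetic above can be written down as a quick Python sketch. The 570k table size and the 1M limit are the rough figures quoted in this post; the function names are made up for illustration, and it assumes every peer sends a complete full table.

```python
FULL_TABLE_ENTRIES = 570_000   # approximate full table size quoted above
RIB_LIMIT_ENTRIES = 1_000_000  # low end of the "common limits" quoted above


def rib_entries(full_table_peers, table_size=FULL_TABLE_ENTRIES):
    """Entries held in the RIB when each peer sends a full table."""
    return full_table_peers * table_size


def fits_in_rib(full_table_peers, limit=RIB_LIMIT_ENTRIES):
    """True if the RIB can hold a full table from every peer."""
    return rib_entries(full_table_peers) <= limit


if __name__ == "__main__":
    for peers in (1, 2):
        print(peers, "peer(s):", rib_entries(peers), "entries,",
              "fits" if fits_in_rib(peers) else "exceeds a 1M limit")
```

With these figures one full-table peer fits under a 1M limit, while two peers (1140k entries) do not.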

The FIB is where the magic happens. When a packet enters the physical interface it's handled by an onboard set of FPGAs/ASICs (often called an NPU, Network Processing Unit). This way packets are (basically) never sent ("punted") to the system CPU, which is much slower at handling packets. The FIB lives in the memory of these chips, so the NPU knows which physical interface a packet should be sent out on without having to ask the system CPU.

It's also the FIB that is the main limit here, because the memory used for hardware lookups in the FIB is often TCAM, which is somewhat expensive (other methods exist as well).

Now, even if the full table is about 570k entries (or so), when you compile it the compiler can aggregate multiple routes within the same subnet and with the same destination into a single entry. As a result, the size needed for the FIB has so far stayed below 512k, even with the full table at 570k or so entries.
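
To give a feel for that aggregation step, here is a rough Python sketch. It is not how any particular vendor's FIB compiler works. It assumes "same destination" means the same next hop, and it only merges two sibling prefixes (the two halves of one larger prefix) when both point to the same next hop, which keeps longest-prefix-match results unchanged. The function name and the sample routes are made up for illustration.

```python
import ipaddress


def aggregate_fib(routes):
    """Merge sibling prefixes that share a next hop into their parent prefix.

    routes: dict mapping prefix string -> next hop label.
    Returns a new dict with the same forwarding behaviour and fewer entries.
    """
    table = {ipaddress.ip_network(p): nh for p, nh in routes.items()}
    changed = True
    while changed:
        changed = False
        # Visit the longest prefixes first; merged parents are picked up
        # on the next pass so merges can cascade upward.
        for net in sorted(table, key=lambda n: n.prefixlen, reverse=True):
            if net not in table or net.prefixlen == 0:
                continue
            parent = net.supernet()
            left, right = parent.subnets()
            sibling = right if net == left else left
            if sibling in table and table[sibling] == table[net]:
                next_hop = table.pop(net)
                del table[sibling]
                # Any old entry for the parent was fully covered by the
                # two halves, so overwriting it changes nothing.
                table[parent] = next_hop
                changed = True
    return {str(net): nh for net, nh in table.items()}


if __name__ == "__main__":
    rib = {
        "10.0.0.0/24": "A",
        "10.0.1.0/24": "A",
        "10.0.2.0/24": "B",
        "10.0.3.0/24": "A",
    }
    print(aggregate_fib(rib))
```

In the sample, the first two /24s collapse into one /23 towards A, while the other two stay separate because their next hops differ, so four RIB routes become three FIB entries.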

What happened a few days ago was that the aggregated result out of the RIB went past the 512k mark, which causes all sorts of problems for devices that can only store 512k entries in their FIB.

Some devices will crash. Some will use 512k entries and throw away whatever didn't fit (meaning that unless you have a default route, packets towards destinations that cannot be found in the FIB will be dropped). And some will start to "punt" packets to the much slower system CPU (which has the full RIB) and do the lookups there, making multi-10Gbit links drop to a throughput of about 200Mbit or so, which also means packets being dropped.
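
The second behaviour (fill the FIB, discard the rest) can be illustrated with a toy model in Python. This is only a sketch: the class name, the tiny capacity and the sample prefixes are made up, and real hardware does not do a linear scan. It just shows why a default route decides whether overflowed destinations are still reachable.

```python
import ipaddress


class LimitedFib:
    """Toy FIB that silently refuses new prefixes once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}
        self.rejected = 0

    def install(self, prefix, next_hop):
        """Install a route; return False if the table is already full."""
        net = ipaddress.ip_network(prefix)
        if net not in self.entries and len(self.entries) >= self.capacity:
            self.rejected += 1
            return False
        self.entries[net] = next_hop
        return True

    def lookup(self, address):
        """Longest-prefix match; None means the packet would be dropped."""
        addr = ipaddress.ip_address(address)
        best = None
        for net, next_hop in self.entries.items():
            if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, next_hop)
        return best[1] if best else None


if __name__ == "__main__":
    fib = LimitedFib(capacity=2)
    fib.install("0.0.0.0/0", "transit")       # default route
    fib.install("192.0.2.0/24", "peer1")
    fib.install("198.51.100.0/24", "peer2")   # does not fit, discarded
    print(fib.lookup("192.0.2.1"))     # matches the installed /24
    print(fib.lookup("198.51.100.7"))  # falls back to the default route
```

Without the default route installed first, the lookup for the discarded prefix would return None, which is the "packets will be dropped" case.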

Depending on which router you have, if you are lucky you can just reconfigure the TCAM and reboot the device. The TCAM stores not only routing entries (well, forwarding entries, to be technical) but also ACLs, QoS settings etc. Reconfiguring it shouldn't be rocket science, but since the TCAM is shared with that other stuff it might not be an option for you (you would just move the problem, so that QoS stops working or whatever). Also, reconfiguring the TCAM will in most cases force you to reboot the device, which with BGP means some downtime or a hiccup in your network while it converges around the router that just went offline for the reboot.

If you are unlucky you will need to buy yourself a new router and/or be more aggressive about what you accept from your peers.

A workaround is to be more aggressive in your filtering on which entries you accept from your peers. If you only have one uplink there is no need for you to maintain a full table. You can still use BGP to announce your own routes but from the peer just import a single default route.

Another filtering option is the maximum prefix length you allow. Usually /24 is the limit for the global table; if you filter at /23 or /22 you will shrink the full table. The downside is that packets for the routes you filter out will be sent to your default route(s) (transit) instead.
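
As a rough illustration of prefix-length filtering, here is a small Python sketch. The function name and the sample prefixes are made up; a real router does this with its own prefix-list or policy configuration, not with a script.

```python
import ipaddress


def filter_by_length(prefixes, max_len):
    """Split prefixes into (kept, filtered) based on maximum prefix length.

    Prefixes longer (more specific) than max_len are filtered out; traffic
    for them would then follow the default route instead.
    """
    kept, filtered = [], []
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        if net.prefixlen <= max_len:
            kept.append(prefix)
        else:
            filtered.append(prefix)
    return kept, filtered


if __name__ == "__main__":
    received = [
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.0.2.0/24",
        "198.18.0.0/15",
        "203.0.113.0/24",
    ]
    kept, filtered = filter_by_length(received, max_len=22)
    print("installed:", kept)
    print("sent via default route:", filtered)
```

With a /22 limit the two /24s in the sample are filtered out and only the three shorter prefixes take up table space.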

So next time you have a requirement to maintain a full table (without filtering), make sure that you select a device that can do at least 1M entries in the FIB. Also, when setting up devices, don't forget to check whether you can reconfigure the TCAM to optimize it for future use (this way you won't need to reboot and won't have downtime or other problems the next time a default limit is passed).
 
As I understood it, most require TCAM memory reconfiguration at the expense of IPv6, which shouldn't be a problem in the short term as its adoption is practically nonexistent.
 