Help with very sporadic network blips, Cisco gear

Sage2k

[H]ard|Gawd
Joined
Mar 25, 2002
Messages
1,551
I have a 3750 acting as the DSW (distribution switch), with a few 2960s (access switches) attached to it, going to different buildings.

One building has been complaining of random outages lasting 15 seconds to 1 minute. It's rare, though: it happens anywhere from once to maybe four times per week, at random times of day.

I've had them ping the gateway when it happens, and the ping fails for a few seconds then starts to work.
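Since the outages are short and rare, the ping check above could be left running on a host in the affected building so that every drop gets a timestamp. A minimal Python sketch, assuming a Linux-style `ping` (the `-c` and `-W` flags differ on Windows and macOS); the function names, the log file name, and the example address are all made up for illustration:

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host succeeds.

    Assumes a Linux-style ping ("-c" count, "-W" timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    start is the first failed sample; end is the first good sample after it,
    or the last sample seen if the run ended while still failing.
    """
    outages = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            outages.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        outages.append((start, last_ts))
    return outages


def monitor(host, interval_s=1.0, log_path="gateway_ping.log"):
    """Ping host forever and append a timestamped line on every state change."""
    was_ok = True
    while True:
        ok = ping_once(host)
        if ok != was_ok:
            state = "UP" if ok else "DOWN"
            with open(log_path, "a") as log:
                log.write(f"{datetime.now().isoformat()} {host} {state}\n")
            was_ok = ok
        time.sleep(interval_s)

# Example with a hypothetical gateway address: monitor("192.0.2.1")
```

The resulting DOWN/UP timestamps can then be lined up against the switch logs and any packet capture, instead of relying on when the email or phone call arrived.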

One of them ran a continuous packet sniff, and sent it to me after it happened. Nothing weird going on in the sniff, except for the obvious failure of getting traffic. When it happens I see a crap ton of arp requests to the gateway, which makes sense I suppose since the hosts can't get past the gateway anymore.

They've replaced the fiber to the LIU, the SFP, and then even the 2960. It's updated to the latest code.

I'm guessing the problem is at the 3750. We haven't touched it yet. CPU and memory look fine (when checking the history). No input/output errors on any interface. Logs don't show anything, not even a flapping interface. But we are always looking at this after the fact, since by the time they email or call us, it has already passed...

Other buildings that are fed from this 3750 haven't complained though, so I'm not sure if they are experiencing problems or not.

Any other ideas on what I can look at?
 
Now, I am NOT a Cisco bunny in any way, and I despise relying on the CLI (I have a mouse and I am prepared to use it), but I have come across issues like the ones you describe too many times.

Look at the port speed negotiation and also for flapping. In many cases a Cisco port will simply flap; other times it will get touchy and try to renegotiate the speed. Lock the port down to a manual speed and see what happens. If it is still flaky, back it down to the next speed down. These issues are a lot more common on copper circuits that are at distance limits, or that were installed carelessly and too close to electrical noise.
Some of the ISPs I have worked with found that the uber-dollar Ciscos were best put on eBay and replaced with cheaper options that were not so fussy.
 
I'll check out port speed negotiation, but I haven't seen any flapping. We have it set to log when a port goes up or down, and I haven't seen any. These are all fiber uplinks, too. The fiber and SFP have already been replaced just in case (on the customer side; I guess we can try ours next). Zero input/output errors.
 
May also pay to grab a pair of external media convertors and convert fibre to copper and then feed into switch, mini-SFP's are nothing special at driving strong signalling in any case.
 
What? SFPs are just fine at driving strong signalling. I've seen SFPs do 80km no problem at all.

For this problem I would see if you are seeing any kind of interface flapping in the logs. Should be easy enough to detect that.

You might also want to check the light levels, and follow the fiber to look for any pinches, or for anything that might pinch it when it is activated, for example an air conditioner.
 
Console into the switch when it happens. If you can't get to the switch fast enough, set up a syslog server that will tell you exactly what errors are coming across and when; that will help you figure out what is going on. When this happened to me, it was a loop at the other end of the fiber. It would freak the 2960 out, which would then stop the storm from going across the fiber. But again, the best way to figure this out is to look at the logs.
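If there is no syslog server handy, even a throwaway UDP listener will do for catching the switch messages with timestamps. A rough Python sketch using only the standard library; the log file name and the helper names are made up, and it assumes the switch sends classic RFC 3164 style messages over UDP with a leading `<PRI>` value:

```python
import re
import socketserver
from datetime import datetime

SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(message):
    """Split the leading <PRI> of a syslog message into (facility, severity, rest).

    PRI = facility * 8 + severity. Returns (None, None, message) when the
    message has no PRI prefix.
    """
    match = PRI_RE.match(message)
    if not match:
        return None, None, message
    pri = int(match.group(1))
    return pri // 8, SEVERITIES[pri % 8], message[match.end():]


class SyslogHandler(socketserver.BaseRequestHandler):
    """Append every received datagram to a local file with a receive timestamp."""

    def handle(self):
        data, _sock = self.request  # UDP servers pass (bytes, socket)
        text = data.decode("utf-8", errors="replace").strip()
        _facility, severity, rest = decode_pri(text)
        line = f"{datetime.now().isoformat()} {self.client_address[0]} {severity} {rest}"
        with open("switch_syslog.log", "a") as log:  # hypothetical file name
            log.write(line + "\n")


def serve(bind_addr="0.0.0.0", port=514):
    """Listen for syslog datagrams until interrupted."""
    with socketserver.UDPServer((bind_addr, port), SyslogHandler) as server:
        server.serve_forever()

# Example: serve()  (binding to port 514 usually needs elevated privileges)
```

The switch would then be pointed at the listener's address with its `logging host` configuration. The receive timestamp is added on the server side, so it does not depend on the switch clock being right.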
 
  1. What? SFPs are just fine at driving strong signalling. I've seen SFPs do 80km no problem at all.
  2. For this problem I would see if you are seeing any kind of interface flapping in the logs. Should be easy enough to detect that.
  3. You might also want to check the light levels, and follow the fiber to look for any pinches, or for anything that might pinch it when it is activated, for example an air conditioner.

1# I have seen SFP's that couldn't reach the next room in a DC. To each, their own. Just because you haven't seen it, doesn't mean it doesn't happen.

2# OP already declared NO flapping.

3# Now coming back to low signal issues possible...?

I am not going to argue; the OP needs to look at ALL possibilities. Swapping SFPs, using external MCs, or checking cables is easy enough. Let him see what he finds.
 
As someone who dispatches field ops all day, every day (on a 15,000+ site cell-tower backhaul network) to replace SFPs, I can tell you they can go bad at any time. Do your SFPs have any kind of diagnostics? Sorry, I work mostly in WAN transport on Alcatel and Cisco 15xxx, and don't have much experience with Cisco campus fiber.

And I second the kinked jumper. If it's in a location where other people are running new cabling, and it's damaged or macro-bent, anything moving over it could cause a loss of light. The same goes for a connector that isn't fully seated.
 
Check for Spanning-Tree problems on the network. It sounds like Spanning-Tree is reconverging, causing the ports to go through the whole process (Blocking, Listening, Learning, Forwarding). Configure root guard on the ports that do not point to the root bridge, to prevent superior BPDUs from causing Spanning-Tree to reconverge.
 
As someone who dispatches field ops all day, every day (on a 15,000+ site cell-tower backhaul network) to replace SFPs, I can tell you they can go bad at any time. Do your SFPs have any kind of diagnostics? Sorry, I work mostly in WAN transport on Alcatel and Cisco 15xxx, and don't have much experience with Cisco campus fiber.

Sorry bud, not sure if you were aiming a question at me or concurring with my statement that SFP's are no more reliable than other forms of media?

I am the poor bastard who comes to sites to troubleshoot issues that guys sitting in an air-conditioned office somewhere else can't figure out. Swap the SFPs or tails and the issue goes away. Other times it is a matter of proving that a network jock missed something simple like a loopback, or that shitty cabling is putting touchy switches like Ciscos into a fizz.
 
Just a quick update.

I moved some of the users to a different vlan on the same devices.

VLAN 101 is having the problem. I moved some users to VLAN 100. The people on VLAN 100 are NOT having the problem. So it's isolated to the one VLAN.

I also noticed a device on the bad VLAN that was sending a constant 1.2mb stream of traffic. Tracked it down to some really old security camera that hasn't been used in many, many years. It doesn't have a valid IP address, but is somehow sending a crap load of traffic? I just removed it. Hopefully it helps; we'll find out in a few days.
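For anyone chasing a similar chatty device, one way to spot it from a capture taken on the affected VLAN is to total the bytes per source MAC. A minimal Python sketch, assuming the capture was saved in the classic libpcap format (not pcapng) with Ethernet framing; the function name is made up, and only the standard library is used:

```python
import struct
from collections import Counter

# Magic numbers of the classic libpcap file format, mapped to struct byte order.
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def bytes_per_source_mac(pcap_bytes):
    """Return a Counter of on-the-wire bytes keyed by source MAC address."""
    endian = _MAGICS.get(pcap_bytes[:4])
    if endian is None:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    # The link type is the last 4-byte field of the 24-byte global header.
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures (link type 1) are handled")

    totals = Counter()
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16]
        )
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        if len(frame) >= 12:
            # Ethernet: bytes 0-5 are the destination MAC, 6-11 the source MAC.
            totals[frame[6:12].hex(":")] += orig_len
        offset += 16 + incl_len
    return totals

# Example with a hypothetical file name:
# with open("vlan101.pcap", "rb") as f:
#     print(bytes_per_source_mac(f.read()).most_common(5))
```

Dividing a MAC's byte total by the capture duration gives a rough rate, and the MAC can then be looked up in the switch's MAC address table to find the port.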
 
Now, I am NOT a Cisco bunny in any way, and I despise relying on the CLI (I have a mouse and I am prepared to use it), but I have come across issues like the ones you describe too many times.

Look at the port speed negotiation and also for flapping. In many cases a Cisco port will simply flap; other times it will get touchy and try to renegotiate the speed. Lock the port down to a manual speed and see what happens. If it is still flaky, back it down to the next speed down. These issues are a lot more common on copper circuits that are at distance limits, or that were installed carelessly and too close to electrical noise.
Some of the ISPs I have worked with found that the uber-dollar Ciscos were best put on eBay and replaced with cheaper options that were not so fussy.


Hrmmm ....... sounds to me like you are a typical:

[image: ForumTroll_zps68e4557c.jpg]
 
Hey OP, did you check the interfaces on the 3750 for CRC errors?

What is the cable distance between the 3750 and the 2960 that is having problems?

What kind of fiber are you running and what transceiver tech are you using, i.e. 50/125 MM, 62.5/125 MM, singlemode, etc., and what are you using to shoot the laser, i.e. short reach, long reach, etc.? These questions will help rule out the layer 1 stuff.

Also an important question to ask: is that Cisco 3750 a true Cisco, or did you guys get it from eBay and are possibly running a "Chisco" (a Chinese knock-off with Linksys etc. parts on the inside)? I have seen this once or twice, and boy were the businesses angry when I cracked that chassis open and showed them.

Let me also ask whether you have investigated a possible loop situation, which a properly configured 3750 would correct through spanning tree.

Also, if you are allowed to, you are welcome to share your running-config from the 3750 here so we can dig through it and look for something as simple as a function that needs to be toggled.
 
0 errors. Trust me, we checked multiple times. We have even replaced much of the hardware.

Not sure on distance, but it's not too far... maybe 1/8 mile or less between buildings?
It's single mode, not sure what size, with a 1000BaseLX SFP.
The switch only has a single uplink. No real chance for there to be a loop situation, but we did double check.

It's a true 3750 AFAIK. We are a large organization and purchase millions of dollars' worth of equipment per year from a reputable local company.

Checked spanning tree; it doesn't seem to be the problem (confirmed with Cisco TAC).

I don't think it's a layer 1 issue at this point, because the problem only happens on one VLAN and not the others. All VLANs are trunked to all downlinked switches, so the fact that one is fine and the other isn't leads me to believe it's not layer 1.


At this point I feel its some sort of corrupt/weird/compromised traffic that is only on this specific vlan.

Earlier today I did find a rogue device (an old security camera, I think) that was sending out a good bit of traffic into the affected VLAN. I shut the port it was connected to, so I'm hoping this helps!



Due to security reasons I don't think I can share the config :(
 
Understood. It's really hard to troubleshoot when everything is up and running fine whenever you show up, and then the problem happens sporadically.

Forgive my typos, I'm really tired.

Check your MTUs: show system mtu
Check your jumbo frame setting and make sure it's off unless you need it: no system mtu jumbo 9000, or whatever frame size you use.

Also, on fiber modules you CAN'T change the duplex/speed like that troll said. If he didn't hate Cisco so much, apparently he would have added some value to this conversation. They are fixed at 1000Base rates, full duplex.

Could you have a compromised PC/server with a virus/trojan that someone is systematically using for something sinister? An internal security threat, i.e. an employee? An external one, i.e. a trojan with someone connecting in somehow?

Is there a firewall/ASA between the two connections?

Is there a VPN between the buildings? Could be tunnel timeout issues. I see this ALL THE TIME.

Are you using Quality of Service and have something adjusted wrong?

Is the fiber leased from say ATT etc... or do you own the fiber and everything between both endpoints?

Are you passing so much burst traffic that you might be getting overrun buffers on that 3750? A 3750 is NOT a distro switch by any stretch of the imagination; however, in limited load situations, as long as the buffers aren't beat to shit, it will perform admirably. A real distro switch starts around the 4900 series of fixed-config or chassis switches and the even newer Nexus line. You may want to evaluate whether your 3750 can handle the load of being distro. What I mean by this is that Ethernet is very bursty. You may think that 1 Gbps is not a lot, but on a distro switch, when multiple 1 Gbps links of aggregated bandwidth are slammed in at once, with 24 to 48 hosts' worth of broadcast traffic per port from each access switch, it can definitely slam the 3750's shared buffers into a pile of shit fast, causing all kinds of delays and backups.

Are you using the 2960 to perform any inter-VLAN routing? Yes, they can do it: on the 12.x images, up to 8 SVIs can be routed. The 2960 doesn't have a lot of CPU power, so DHCP and other services will bog it down during heavy traffic scenarios.

Forgive me if you have tried all of this but I am only offering mental ammo here, stuff to consider.


edit*** And lastly I am not kidding, could be the NSA using that prism bullshit breaking the law spying on your crap and all. Check the news if you are unaware of what I am talking about hahaha.. Okay probably not but you never know these days.
 
Appreciate the thoughts so far. Let me answer what I can.


1. I believe I saw the MTU was set at 1500, but I will confirm when I go back to work.

2. I am leaning toward the idea that it is a compromised device. This is what I'm currently investigating.

3. No firewall/ASA.

4. No VPN.

5. QoS *was* enabled globally, but with only default settings. I disabled it during troubleshooting and saw no change.

6. We own the fiber and everything in between.

7. I have looked at QoS and burst traffic. I have not seen any queued or dropped packets. There is really not much traffic going on in this location. Average traffic is 2-5mb/sec.

8. The 2960 is doing no routing.
 