Solaris 11.1 upgrade - vlan stopped working

PigLover · Nov 1, 2012

Sorry for the long post.

I upgraded a test server to Solaris 11.1 from 11/11 a few days ago. After I did the upgrade my one VLAN-based network connection stopped working. I've been messing with it for days and am a bit baffled.

Note that everything was up and working just before the upgrade. This is not a switch configuration problem or a cables issue.

All of the other networking survived the upgrade just fine...

The link that stopped working is a VLAN running on top of a two-link LAG. The untagged link running over the same LAG works just fine.

So here's the strange part. As I was trying to get underneath it all today I fired up wireshark to see if I could figure it out. As soon as I put wireshark up on the interface (in its default promiscuous mode) the link started working. All the packets in the trace looked normal. All was good. As soon as I stopped the trace the link was dead again. Start a trace - link in promiscuous mode - and all is good again. Stop and it stops...

If I bring up Wireshark on the link without promiscuous mode, the link does NOT start working. If I initiate a ping that should go out on the broken VLAN, I see a series of ARP requests but no answers. Running Wireshark on the machine being pinged, I see all of the ARP requests come in and the answers go out, but the Solaris machine never sees the answers.
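
For anyone wanting to repeat the promiscuous vs. non-promiscuous comparison from the command line, here is a rough sketch using snoop, the capture tool bundled with Solaris. It is a stand-in for the Wireshark captures described above, not the commands actually used in this thread. The function names are made up for illustration, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: capture ARP on a link with and without promiscuous mode.
# Assumption: snoop stands in for the Wireshark captures described above.
# Prints the command by default; set DRY_RUN=0 (as root) to really run it.
DRY_RUN=${DRY_RUN:-1}

# Build the snoop command line. mode is "promisc" or "nonpromisc".
snoop_arp_cmd() {
    link=$1
    mode=$2
    if [ "$mode" = "nonpromisc" ]; then
        # -P asks snoop to capture in non-promiscuous mode
        echo "snoop -P -d $link arp"
    else
        echo "snoop -d $link arp"
    fi
}

# Print the command (dry run) or execute it.
capture_arp() {
    cmd=$(snoop_arp_cmd "$1" "$2")
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Example (start a ping across the VLAN from another terminal first):
#   capture_arp aggr2vlan5 nonpromisc
#   capture_arp aggr2vlan5 promisc
```

If the ARP replies only show up in the promiscuous capture, that would point at receive-side filtering on the Solaris host rather than at the switch, which matches what the Wireshark traces suggested.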

So - did Oracle manage to break VLANs in 11.1? Any ideas how to get it working again?

A few bits of info from the machine. The only thing that looks odd/wrong I've highlighted below.

Phil@TEST:~$ dladm show-link
LINK        CLASS  MTU   STATE  OVER
e1000g1     phys   1500  up     --
e1000g0     phys   1500  up     --
ixgbe0      phys   9000  up     --
ixgbe1      phys   9000  up     --
aggr2       aggr   9000  up     ixgbe0 ixgbe1
aggr2vlan5  vlan   9000  up     aggr2

Phil@TEST:~$ dladm show-vlan
LINK        VID  OVER   FLAGS
aggr2vlan5  5    aggr2  -----

Phil@TEST:~$ ipadm show-addr aggr2vlan5
ADDROBJ        TYPE  STATE  ADDR
aggr2vlan5/v4  dhcp  ok     192.168.5.101/24

Phil@TEST:~$ dladm show-linkprop aggr2vlan5
LINK        PROPERTY             PERM  VALUE             DEFAULT           POSSIBLE
aggr2vlan5  autopush             rw    --                --                --
aggr2vlan5  zone                 rw    --                --                --
aggr2vlan5  state                r-    unknown           up                up,down
aggr2vlan5  mtu                  rw    9000              1500              1500-9000
aggr2vlan5  maxbw                rw    --                --                --
aggr2vlan5  cpus                 rw    --                --                --
aggr2vlan5  cpus-effective       r-    0-7               --                --
aggr2vlan5  rxfanout             rw    --                8                 --
aggr2vlan5  rxfanout-effective   r-    16                --                --
aggr2vlan5  pool                 rw    --                --                --
aggr2vlan5  pool-effective       r-    --                --                --
aggr2vlan5  priority             rw    high              high              low,medium,high
aggr2vlan5  forward              rw    1                 1                 1,0
aggr2vlan5  protection           rw    --                --                mac-nospoof,
                                                                           restricted,
                                                                           ip-nospoof,
                                                                           dhcp-nospoof
aggr2vlan5  mac-address          r-    0:1b:21:6b:23:98  0:1b:21:6b:23:98  --
aggr2vlan5  allowed-ips          rw    --                --                --
aggr2vlan5  allowed-dhcp-cids    rw    --                --                --
aggr2vlan5  rxrings              r-    --                --                --
aggr2vlan5  rxrings-effective    r-    --                --                --
aggr2vlan5  txrings              r-    --                --                --
aggr2vlan5  txrings-effective    r-    --                --                --
aggr2vlan5  txrings-available    r-    0                 --                --
aggr2vlan5  rxrings-available    r-    0                 --                --
aggr2vlan5  rxhwclnt-available   r-    0                 --                --
aggr2vlan5  txhwclnt-available   r-    0                 --                --
aggr2vlan5  vsi-mgrid            rw    --                --                --
aggr2vlan5  etsbw-lcl            rw    --                0                 --
aggr2vlan5  etsbw-lcl-effective  r-    --                --                --
aggr2vlan5  etsbw-rmt-effective  r-    --                --                --
aggr2vlan5  etsbw-lcl-advice     r-    --                --                --
aggr2vlan5  cos                  rw    --                0                 --
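
With a table this long it is easy to miss the rows that matter. A small sketch for picking out properties whose VALUE differs from DEFAULT follows. The function name is made up, and it assumes every property row has exactly six whitespace-separated columns, as in the listing above.

```shell
#!/bin/sh
# Sketch: list link properties whose VALUE differs from DEFAULT.
# Reads the human-readable `dladm show-linkprop` table on stdin.
# Assumption: property rows have exactly six whitespace-separated
# columns (LINK PROPERTY PERM VALUE DEFAULT POSSIBLE).
nondefault_props() {
    awk '
        NF != 6                  { next }  # wrapped POSSIBLE lines, blanks
        $1 == "LINK"             { next }  # header rows
        $4 == "--" || $5 == "--" { next }  # nothing to compare
        $4 != $5                 { print $2, $4, $5 }
    '
}

# Example (on the Solaris host):
#   dladm show-linkprop aggr2vlan5 | nondefault_props
```

On the listing above this prints only the `state` and `mtu` rows. The MTU is presumably the intended jumbo-frame setting, which leaves `state` reading `unknown` instead of `up` as the row that stands out.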

paret0 · Nov 1, 2012

I don't see a "tagmode", or anything to do with tags in your aggr2vlan5 properties. Have you tried to rebuild the aggregation? I would.

Does the switch have anything pertinent to say about things?

PigLover · Nov 1, 2012

I did completely de-configure/re-configure the 'aggr2vlan5' link. Thought there might be some configuration corruption due to the upgrade. But it did not clear.

See the 'dladm show-vlan' output above. This is the only way Solaris 11 will show the tag mode, and it clearly shows vlan tag 5.

Also note that this is a completely functional config under Solaris 11 11/11. If I reboot and select the prior boot environment in Grub it's all happy...

It is very, very confusing.

paret0 · Nov 1, 2012

It's pretty quick to reconfigure a 2-nic aggregation up from ipadm create-ip.

No VLANS on this aggr, but I seem to show a few more props? [mac spoofed for public consumption]

$ dladm show-linkprop aggr0
LINK   PROPERTY             PERM  VALUE             DEFAULT           POSSIBLE
aggr0  autopush             rw    --                --                --
aggr0  zone                 rw    --                --                --
aggr0  state                r-    up                up                up,down
aggr0  mtu                  rw    9000              1500              1500-9000
aggr0  maxbw                rw    --                --                --
aggr0  cpus                 rw    --                --                --
aggr0  cpus-effective       r-    0-3               --                --
aggr0  rxfanout             rw    --                1                 --
aggr0  rxfanout-effective   r-    2                 --                --
aggr0  pool                 rw    --                --                --
aggr0  pool-effective       r-    --                --                --
aggr0  priority             rw    high              high              low,medium,high
aggr0  tagmode              rw    vlanonly          vlanonly          normal,vlanonly
aggr0  forward              rw    1                 1                 1,0
aggr0  default_tag          rw    1                 1                 --
aggr0  vlan-announce        rw    off               off               off,gvrp
aggr0  gvrp-timeout         rw    250               250               100-100000
aggr0  learn_limit          rw    1000              1000              --
aggr0  learn_decay          rw    200               200               --
aggr0  stp                  rw    1                 1                 1,0
aggr0  stp_priority         rw    128               128               --
aggr0  stp_cost             rw    auto              auto              --
aggr0  stp_edge             rw    1                 1                 1,0
aggr0  stp_p2p              rw    auto              auto              true,false,auto
aggr0  stp_mcheck           rw    0                 0                 1,0
aggr0  protection           rw    --                --                mac-nospoof,
                                                                      restricted,
                                                                      ip-nospoof,
                                                                      dhcp-nospoof
aggr0  mac-address          rw    0:14:4f:9e:4c:5d  0:14:4f:9e:4c:5d  --
aggr0  allowed-ips          rw    --                --                --
aggr0  allowed-dhcp-cids    rw    --                --                --
aggr0  rxrings              rw    --                --                --
aggr0  rxrings-effective    r-    --                --                --
aggr0  txrings              rw    --                --                sw,hw
aggr0  txrings-effective    r-    --                --                --
aggr0  txrings-available    r-    0                 --                --
aggr0  rxrings-available    r-    0                 --                --
aggr0  rxhwclnt-available   r-    0                 --                --
aggr0  txhwclnt-available   r-    1                 --                --
aggr0  vsi-mgrid            rw    --                ::                --
aggr0  etsbw-lcl            rw    --                0                 --
aggr0  etsbw-lcl-effective  r-    --                --                --
aggr0  etsbw-rmt-effective  r-    --                --                --
aggr0  etsbw-lcl-advice     r-    --                --                --
aggr0  cos                  rw    --                0                 --

PigLover · Nov 1, 2012

paret0 said:
It's pretty quick to reconfigure a 2-nic aggregation up from ipadm create-ip.

No VLANS on this aggr, but I seem to have more props?

Yes. There are different properties listed on the LAG itself and on the "virtual" VLAN link. I think this makes sense - but I will go through it just to be sure. Thanks for the suggestion. @work now, but I'll double check the linkprops on the underlying aggregate itself tonight to ensure that they look right. This is something I haven't looked at in great detail.

From your show-linkprop output, however, most of the differences seem to make sense. "tagmode" makes sense for an Ethernet link (or a LAG of Ethernet links) because you need to specify whether or not to use 802.1q framing on the link, but it makes little sense to specify "tagmode" on the VLAN itself, which is already tagged and can't contain nested VLANs inside of it. Similarly for "default_tag", which is the VLAN, if any, that will be presented untagged on the link.

paret0 · Nov 1, 2012

And I totally understand about "confusing". If you have an aggregation of 2 10G nics, you're probably pissed off too...

Just poking here,
Is there a diff -bi of /kernel/drv/ixgbe.conf between snapshots under the different BEs?
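
A rough sketch of how that comparison could be done by mounting the older boot environment next to the running one. The BE name and mount point are placeholders (check `beadm list` for the real name), the function names are made up, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: diff a driver .conf between an older BE and the running BE.
# Assumptions: the BE name and mount point are placeholders; beadm mount
# wants an existing, empty directory. Prints commands unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}

# Print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

diff_driver_conf() {
    old_be=$1                           # name of the pre-upgrade BE
    conf=${2:-/kernel/drv/ixgbe.conf}   # file to compare
    mnt=${3:-/mnt/oldbe}                # hypothetical mount point

    run mkdir -p "$mnt"
    run beadm mount "$old_be" "$mnt"
    # diff exits 1 when the files differ; that is not a failure here
    run diff -bi "$mnt$conf" "$conf" || true
    run beadm unmount "$old_be"
}

# Example (as root, with the real BE name from `beadm list`):
#   DRY_RUN=0 diff_driver_conf solaris-old
```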

PigLover · Nov 1, 2012

Not really all that pissed off, though I'll bet not that many people have 10Gbe networking in their garage! It's really a lab config for trying things out. Besides, the 10Gbe LAG still actually works: it has a live address and passes traffic just fine on the "untagged" link, which is where most of the traffic is anyway. The "tagged" link is part of a 10Gbe-only VLAN used to access a SAN. On the switch (a Juniper EX2500) this VLAN is set up so that it never forwards to any 1Gbe client or to a switch with 1Gbe connections, guaranteeing the SAN access can never be impaired by any head-of-line blocking caused by access from slow clients on the 1Gbe part of the network. It was a trial setup to demonstrate exactly this kind of problem in mixed-speed networks, and is really not necessary any longer. Sorta like having a set of servers with a "back door" interface at 10Gbe to a shared SAN and a separate "front door" interface to clients that might be connected at 10Gbe or 1Gbe each, except the split interfaces are actually simulated with the VLAN. I could just as well tear down the extra VLAN and be happy.

Except that I am a curious sort of guy and really do want to know why it stopped working...

Good idea on doing the driver conf diffs from the old boot environment. Hadn't thought of that one either. I'll give it a look.

On writing this and thinking about it some more I may try unplumbing the IP links on the underlying aggregate to see if that makes a difference. It shouldn't, but who knows.

PigLover · Nov 1, 2012

Fixed. Don't know exactly how/why, but fixed.

When I came home this afternoon I deleted all of the IPs associated with the VLAN and the LAG. I deleted the VLAN itself and deleted the LAG. Basically tore down all of the datalink and IP layers, leaving only the raw interface cards. Rebuilt the LAG, rebuilt the VLAN and reinstalled the IPs. And like magic, the whole thing is happy.
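
The exact commands were not posted, so the following is only a sketch of what such a teardown and rebuild might look like. Link names, VLAN ID, MTU and the DHCP address object are taken from the dladm/ipadm output earlier in the thread. The untagged address on the LAG is left out because it was never shown, and `ipadm delete-ip` on the LAG is assumed to take its addresses with it. The function names are made up, and nothing runs unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: tear a VLAN-over-LAG stack down to the bare NICs and rebuild it.
# Assumptions: names and values come from the output earlier in the thread;
# the untagged address on the LAG must be re-added by hand afterwards.
# Prints the commands unless DRY_RUN=0 (run as root).
DRY_RUN=${DRY_RUN:-1}

# Print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_vlan_over_lag() {
    aggr=$1
    vid=$2
    shift 2                        # remaining arguments: member NICs
    vlan="${aggr}vlan${vid}"

    # Tear down from the top of the stack to the bottom.
    run ipadm delete-addr "$vlan/v4"
    run ipadm delete-ip "$vlan"
    run ipadm delete-ip "$aggr"    # assumed to remove its addresses too
    run dladm delete-vlan "$vlan"
    run dladm delete-aggr "$aggr"

    # Rebuild from the bottom up.
    links=""
    for nic in "$@"; do
        links="$links -l $nic"
    done
    run dladm create-aggr $links "$aggr"
    run dladm set-linkprop -p mtu=9000 "$aggr"
    run dladm create-vlan -l "$aggr" -v "$vid" "$vlan"
    run ipadm create-ip "$vlan"
    run ipadm create-addr -T dhcp "$vlan/v4"
}

# Example:
#   rebuild_vlan_over_lag aggr2 5 ixgbe0 ixgbe1
```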

Something in the configs of the LAG or the VLAN must have been corrupted during the upgrade. But now it's all fat, dumb and happy again.

Very odd.

paret0 · Nov 1, 2012

Sounds like some install script set something to default, and didn't bother setting it back?
I'm pretty stoked about how well the upgrade to Solaris 11.1 went here (so far).

Cheers!

nstuy · Dec 10, 2012

If you reboot your system, does the VLAN over the aggregation still work? I ran into the same issue with a brand new installation of Solaris 11.1. I set up the networking and everything was fine until I rebooted. Then it stopped. I tore down everything to the physical adapters and rebuilt, and then it worked (until I rebooted again).

Any thoughts?

PigLover · Dec 11, 2012

Interesting question. Unfortunately, it's also one I can't answer anymore because the config I was using has been torn down.

paret0 · Dec 11, 2012

Just back from an elective reboot, and can say that aggr0 is up LACP on fast on 4 ports and switch agrees, vlans are up (6 hosts on 2 vlans) and switch agrees, IPMP is also up in all zones (24 vnics, 12 hosts), VBox NAT net is up on the 10.10.x.x band(6 hosts), DHCP and static IPv4, IPv6 Assisted RA client and DHCPv6 client, MTU 9014 across the board.

Looks like a green light for throttle-up...

nstuy · Dec 11, 2012

Guess I'll have to dig in to find out what's ailing my network configuration. One of my machines (an older Sun x86 Workstation) lost its network connectivity when I upgraded it from Solaris 11 to 11.1. The other machine was a Dell PowerEdge 1950 III that got a fresh Solaris 11.1 install.

For the clean install, here are the steps I performed as root:
netadm enable -p ncp DefaultFixed
dladm show-phys
LINK  MEDIA     STATE  SPEED  DUPLEX   DEVICE
net0  Ethernet  up     1000   full     e1000g0
net1  Ethernet  down   0      unknown  igb0
net2  Ethernet  down   0      unknown  igb1
dladm create-aggr -l net0 -l net1 -l net2 lag0
dladm set-linkprop -p mtu=9000 lag0
dladm modify-aggr -P L2 lag0
dladm modify-aggr -L passive -T short lag0
dladm create-vlan -l lag0 -v 102 lag0v102
ipadm create-ip lag0v102
ipadm create-addr -T static -a 10.10.12.12 lag0v102/v4
route -fp add default 10.10.12.1

then setup my name servers:
nano /etc/resolv.conf
domain mylittlesecret.com
nameserver 209.166.161.120
nameserver 209.166.161.121
nscfg import -f dns/client
svcadm enable dns/client

then NTP:
nano /etc/inet/ntp.conf
server 0.vmware.pool.ntp.org
server 1.vmware.pool.ntp.org
server 2.vmware.pool.ntp.org
svcadm enable network/ntp

My switch (Brocade FastIron SuperX) is set to active for the link aggregation for the three ports to which the server is connected corresponding to net0, net1, and net2.

Any ideas of where I might have gone wrong?

paret0 · Dec 11, 2012

Are your links properly configured under lag0?
I see no mention of ipadm or properties of the underlying links in your notes. You probably just didn't include them, but a Solaris aggr can only inherit the props of the underlying links, so if, for instance, they are not configured for jumbo frames or IPv6, the resulting aggregation won't be capable of high MTUs or static/dhcp v6 addys.

nsswitch? What does cat /etc/nsswitch.conf show? Should say "files dns" at minimum, maybe even "multicast" too. If it's just "files", refresh name-service/switch:default with svcadm.

A C/P and Google should get you right there if svcadm syntax is unfamiliar territory.
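
A small sketch of the nsswitch check, for what it's worth. The function name is made up, and it assumes the usual "hosts: files dns" layout of the file.

```shell
#!/bin/sh
# Sketch: succeed if the hosts line of an nsswitch.conf lists dns.
# Assumption: the file uses the usual "hosts: files dns" layout;
# commented-out lines (starting with #) are ignored.
hosts_uses_dns() {
    file=${1:-/etc/nsswitch.conf}
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++)
                if ($i == "dns") found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# Example:
#   hosts_uses_dns || echo "hosts lookups will not consult DNS"
```

As far as I know, Solaris 11 regenerates this file from the name-service/switch SMF service, so a fix made only in the file may not stick until it is imported into SMF and the service is refreshed.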

(Why don't you do your own DNS w VMWare or pfSense, or Bind or Unbound on Solaris?)

nstuy · Dec 12, 2012

Thanks for responding. Looks like I didn't do the following items:

nano /etc/nsswitch.conf
hosts: files dns
ipnodes: files dns
nscfg import -f name-service/switch
svcadm refresh name-service/switch

What's weird is that I thought Solaris 11.1 added dns to those as a new default. Could have sworn I checked it before but I guess not, so thanks for that reminder.

It looks like when I set the MTU on the aggregation:
dladm set-linkprop -p mtu=9000 lag0
that all the underlying links got their MTU set to 9000 as well. That may be something new in 11.1:
dladm show-linkprop net0
LINK   PROPERTY             PERM  VALUE             DEFAULT           POSSIBLE
net0   speed                r-    1000              1000              --
net0   autopush             rw    --                --                --
net0   zone                 rw    --                --                --
net0   duplex               r-    full              full              half,full
net0   state                r-    up                up                up,down
net0   adv_autoneg_cap      rw    1                 1                 1,0
net0   mtu                  rw    9000              1500              60-9000
net0   flowctrl             rw    bi                bi                no,tx,rx,bi,pfc,
                                                                      auto
net0   flowctrl-effective   r-    --                --                --
net0   adv_10gfdx_cap       r-    --                0                 1,0
net0   en_10gfdx_cap        --    --                0                 1,0
net0   adv_1000fdx_cap      r-    1                 0                 1,0
net0   en_1000fdx_cap       rw    1                 1                 1,0
net0   adv_1000hdx_cap      r-    1                 0                 1,0
net0   en_1000hdx_cap       rw    1                 1                 1,0
net0   adv_100fdx_cap       r-    1                 0                 1,0
net0   en_100fdx_cap        rw    1                 1                 1,0
net0   adv_100hdx_cap       r-    1                 0                 1,0
net0   en_100hdx_cap        rw    1                 1                 1,0
net0   adv_10fdx_cap        r-    1                 0                 1,0
net0   en_10fdx_cap         rw    1                 1                 1,0
net0   adv_10hdx_cap        r-    1                 0                 1,0
net0   en_10hdx_cap         rw    1                 1                 1,0
net0   maxbw                rw    --                --                --
net0   cpus                 rw    --                --                --
net0   cpus-effective       r-    --                --                --
net0   rxfanout             rw    --                1                 --
net0   rxfanout-effective   r-    0                 --                --
net0   pool                 rw    --                --                --
net0   pool-effective       r-    --                --                --
net0   priority             rw    high              high              low,medium,high
net0   tagmode              rw    vlanonly          vlanonly          normal,vlanonly
net0   forward              rw    1                 1                 1,0
net0   default_tag          rw    1                 1                 --
net0   vlan-announce        rw    off               off               off,gvrp
net0   gvrp-timeout         rw    250               250               100-100000
net0   learn_limit          rw    1000              1000              --
net0   learn_decay          rw    200               200               --
net0   stp                  rw    1                 1                 1,0
net0   stp_priority         rw    128               128               --
net0   stp_cost             rw    auto              auto              --
net0   stp_edge             rw    1                 1                 1,0
net0   stp_p2p              rw    auto              auto              true,false,auto
net0   stp_mcheck           rw    0                 0                 1,0
net0   protection           rw    --                --                mac-nospoof,
                                                                      restricted,
                                                                      ip-nospoof,
                                                                      dhcp-nospoof
net0   mac-address          rw    a0:36:9f:c:8:c8   0:1e:c9:fe:3:ac   --
net0   allow-autoconf       rw    1                 1                 1,0
net0   allowed-ips          rw    --                --                --
net0   allowed-dhcp-cids    rw    --                --                --
net0   rxrings              rw    --                --                --
net0   rxrings-effective    r-    --                --                --
net0   txrings              rw    --                --                --
net0   txrings-effective    r-    --                --                --
net0   txrings-available    r-    0                 --                --
net0   rxrings-available    r-    0                 --                --
net0   rxhwclnt-available   r-    0                 --                --
net0   txhwclnt-available   r-    0                 --                --
net0   pfcmap               rw    --                11111111          00000000-11111111
net0   pfcmap-lcl-effective r-    --                --                --
net0   pfcmap-rmt-effective r-    --                --                --
net0   ntcs                 r-    0                 0                 --
net0   vsi-mgrid            rw    --                ::                --
net0   vsi-mgrid-enc        rw    --                oracle_v1         none,oracle_v1
net0   lro                  rw    off               auto              on,off,auto
net0   lro-effective        r-    off               off               on,off
net0   etsbw-lcl            rw    --                0                 --
net0   etsbw-lcl-effective  r-    --                --                --
net0   etsbw-rmt-effective  r-    --                --                --
net0   etsbw-lcl-advice     r-    --                --                --
net0   cos                  rw    --                0                 --

Unfortunately, rebooting still kills the networking (I can't ping it from another machine like I could after I initially set up the networking). Any other thoughts?

nstuy · Dec 12, 2012

Here's some additional info:
dladm show-link net0
LINK  CLASS  MTU   STATE  OVER
net0  phys   9000  up     --
dladm show-link net1
LINK  CLASS  MTU   STATE  OVER
net1  phys   9000  up     --
dladm show-link net2
LINK  CLASS  MTU   STATE  OVER
net2  phys   9000  up     --
dladm show-link lag0
LINK  CLASS  MTU   STATE  OVER
lag0  aggr   9000  up     net0 net1 net2
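
A quick way to confirm that the aggregation and its member links agree on MTU, rather than eyeballing each listing. This is just a sketch: the function name is made up, and it assumes the `dladm show-link` column order shown above (LINK CLASS MTU STATE OVER).

```shell
#!/bin/sh
# Sketch: report links whose MTU differs from the expected value.
# Reads `dladm show-link` output on stdin.
# Assumption: the third column is the MTU, as in the listings above.
mtu_mismatches() {
    expected=$1
    awk -v want="$expected" '
        $1 == "LINK"          { next }           # header rows
        NF >= 4 && $3 != want { print $1, $3 }   # link name and its MTU
    '
}

# Example (on the Solaris host):
#   dladm show-link | mtu_mismatches 9000
```

No output means every listed link reports the expected MTU. Links that are not part of the aggregation will also be listed if they use a different MTU, so the result still needs a human look.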

PigLover · Dec 12, 2012

Do you have NWAM disabled?

nstuy · Dec 12, 2012

I think I disabled NWAM. I executed this as root:
netadm enable -p ncp DefaultFixed

PigLover · Dec 13, 2012

Yup - that should have disabled NWAM.

paret0 · Dec 13, 2012

nstuy - Can you make that vlan work on un-aggregated interface?

Like - If you only have 3, remove one of your 3 nics from the lag, and plumb it.
Then set up the vlan on it, to see if that works.
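
Roughly, that test might look like the sketch below. The link names and VLAN ID are the ones from nstuy's steps. The test address in the example is a placeholder, the function names are made up, and the matching switch port would presumably have to come out of the LAG on the switch side as well. Nothing runs unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: pull one NIC out of the LAG and put the VLAN directly on it.
# Assumptions: names follow nstuy's steps; the address is a placeholder.
# Prints the commands unless DRY_RUN=0 (run as root).
DRY_RUN=${DRY_RUN:-1}

# Print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

test_vlan_on_single_nic() {
    aggr=$1
    nic=$2
    vid=$3
    addr=$4                        # spare address on the VLAN's subnet
    vlan="${nic}v${vid}"

    run dladm remove-aggr -l "$nic" "$aggr"
    run dladm create-vlan -l "$nic" -v "$vid" "$vlan"
    run ipadm create-ip "$vlan"
    run ipadm create-addr -T static -a "$addr" "$vlan/v4"
}

# Example (placeholder address):
#   test_vlan_on_single_nic lag0 net2 102 10.10.12.13
```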

Shot in the dark here - MAC addresses may have 'reoriented', and the new lag may show something different to your router or firewall than previously, which may fuck up something??

nstuy · Dec 13, 2012

So I just tore everything down and rebuilt it...

ipadm delete-addr lag0v102/v4
ipadm delete-ip lag0v102
dladm delete-vlan lag0v102
dladm delete-aggr lag0
dladm create-aggr -l net0 -l net1 -l net2 -l net3 lag0
dladm set-linkprop -p mtu=9000 lag0
dladm modify-aggr -P L2 lag0
dladm modify-aggr -L passive -T short lag0
dladm create-vlan -l lag0 -v 102 lag0v102
ipadm create-ip lag0v102
ipadm create-addr -T static -a 10.10.12.12 lag0v102/v4
route -fp add default 10.10.12.1
reboot

And now the networking comes up ok after the reboot. I did this before and it didn't work. Wondering if I'm crazy here.

PigLover · Dec 13, 2012

That is similar to the experience I had with the upgrade to 11.1. A really simple LAG group of two NICs, with three VLANs defined on it. Didn't work after the upgrade. I could get it to work occasionally by playing with it, but it always broke again on reboot. I tested and tried to reasonably diagnose it (including running WireShark to try and see what was happening - but magically every time WireShark put the link into promiscuous mode to sniff packets it all magically started working - and as soon as WireShark exited it all stopped again).

Out of desperation I finally deleted everything about the connection and then built it all back up. Never had a problem with it again...until the day I had to tear it all down to re-purpose the 10Gbe switch.

Something about the Solaris 11 to 11.1 upgrade process leaves something related to LAG groups corrupted. No idea what it is. I am not a support customer so Oracle doesn't care. But deleting it all and building it again makes it all nice.

paret0 · Dec 14, 2012

Lends a little credence to the "mac reorientation" theory...
IOW, the lag reconfigures, and presents your router or switch with a different mac addy (or some other parameter which is different from previous).

nstuy · Dec 14, 2012

In my case, I did a bare metal installation of Solaris 11.1 and had the problem. I think it may be a subtle bug in 11.1.

PigLover · Dec 14, 2012

paret0 said:
Lends a little credence to the "mac reorientation" theory...
IOW, the lag reconfigures, and presents your router or switch with a different mac addy (or some other parameter which is different from previous).

Definitely NOT a MAC problem. During my troubleshooting I reset the LAG several times on the switch (Juniper EX4500) and did a mac-table clear. The LAG was virgin-pure on the switch end and it still didn't work.

It's a config bug in 11.1.
