I ran into an “exciting” bug yesterday on a 4500-X VSS pair running 3.7.0 code. After a switchover, where the secondary switch becomes active, there’s a risk that state is not properly synced between the switches. In our case the VSS pair was “eating” multicast packets, essentially black-holing them: any multicast that came into the pair was not forwarded, even though the Outgoing Interface List (OIL) had been properly built. Everything else looked normal: PIM neighbors were active, OILs were good (except no S,G), routing was there, the RPF check was passing, and so on.
It turns out that this is bug CSCus13479, which requires a CCO login to view. The bug is triggered when port channels are used and PIM runs over them. To see if an interface is misbehaving, use the following command:
hrn3-4500x-vss-01#sh platfo hardware rxvlan-map-table vl 200 <<< Ingress port
Executing the command on VSS member switch role = VSS Active, id = 1
Vlan 200:
  l2LookupId: 200
  srcMissIgnored: 0
  ipv4UnicastEn: 1
  ipv4MulticastEn: 1 <<<<<
  ipv6UnicastEn: 0
  ipv6MulticastEn: 0
  mplsUnicastEn: 0
  mplsMulticastEn: 0
  privateVlanMode: Normal
  ipv4UcastRpfMode: None
  ipv6UcastRpfMode: None
  routingTableId: 1
  rpSet: 0
  flcIpLookupKeyType: IpForUcastAndMcast
  flcOtherL3LookupKeyTypeIndex: 0
  vlanFlcKeyCtrlTableIndex: 0
  vlanFlcCtrl: 0

Executing the command on VSS member switch role = VSS Standby, id = 2
Vlan 200:
  l2LookupId: 200
  srcMissIgnored: 0
  ipv4UnicastEn: 1
  ipv4MulticastEn: 0 <<<<<
  ipv6UnicastEn: 0
  ipv6MulticastEn: 0
  mplsUnicastEn: 0
  mplsMulticastEn: 0
  privateVlanMode: Normal
  ipv4UcastRpfMode: None
  ipv6UcastRpfMode: None
  routingTableId: 1
  rpSet: 0
  flcIpLookupKeyType: IpForUcastAndMcast
  flcOtherL3LookupKeyTypeIndex: 0
  vlanFlcKeyCtrlTableIndex: 0
  vlanFlcCtrl: 0
From the output you can see that "ipv4MulticastEn" is set to 1 on one switch and 0 on the other. The state has either not been properly synced or has been misprogrammed, and that is what black-holes the multicast. It was not an easy one to catch, so I hope this post helps someone.
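If you have many VLANs to check, comparing the two members by eye gets tedious. Below is a minimal sketch in Python that takes the text of the command output above (one field per line) and reports every field whose value differs between the VSS members. The function names `parse_rxvlan_output` and `mismatched_fields` and the shortened `SAMPLE` text are my own for illustration, not anything provided by Cisco, and the parsing assumes the output looks like the capture in this post.

```python
import re

# Matches the per-member header line, e.g. "... role = VSS Active, id = 1"
MEMBER_RE = re.compile(r"role = VSS (\w+), id = (\d+)")
# Matches "name: value" pairs; the name must start with a letter so that
# the "Vlan 200:" heading is not mistaken for a field.
FIELD_RE = re.compile(r"([A-Za-z]\w*):\s*(\S+)")


def parse_rxvlan_output(text):
    """Return {role: {field: value}} for each VSS member in the output."""
    members = {}
    current = None
    for line in text.splitlines():
        header = MEMBER_RE.search(line)
        if header:
            current = header.group(1)  # "Active" or "Standby"
            members[current] = {}
            continue
        if current is None:
            continue  # skip the prompt line before the first member block
        for key, value in FIELD_RE.findall(line):
            members[current][key] = value
    return members


def mismatched_fields(members):
    """Return {field: {role: value}} for fields that differ between members."""
    all_fields = set()
    for values in members.values():
        all_fields.update(values)
    diffs = {}
    for field in sorted(all_fields):
        seen = {role: values.get(field) for role, values in members.items()}
        if len(set(seen.values())) > 1:
            diffs[field] = seen
    return diffs


# Shortened, hypothetical sample based on the capture in this post.
SAMPLE = """\
Executing the command on VSS member switch role = VSS Active, id = 1
Vlan 200:
  l2LookupId: 200
  ipv4UnicastEn: 1
  ipv4MulticastEn: 1 <<<<<
Executing the command on VSS member switch role = VSS Standby, id = 2
Vlan 200:
  l2LookupId: 200
  ipv4UnicastEn: 1
  ipv4MulticastEn: 0 <<<<<
"""

if __name__ == "__main__":
    for field, values in mismatched_fields(parse_rxvlan_output(SAMPLE)).items():
        print(field, values)
```

Any field reported here is worth a closer look. For this bug, the one that matters is "ipv4MulticastEn" differing between Active and Standby.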
This also shows that clustering always has drawbacks, so weigh the risk of running systems in a cluster, where a bug can affect both devices, against running them standalone. There's always a tradeoff between complexity and topology, and your choice affects how the network can be designed.
Interesting, I already ran into this problem without knowing that it was a bug. What did you do to enable multicast on the second switch without upgrading and/or reloading the box?
Thanks !
Silly me for forgetting to add the fix. Remove PIM from the interface and then put it back again. I’ll add this to the post.
Heh, thanks for this super fast answer ! :-p