ACI Fabric Discovery - Not working!

Problem:

I work with ACI - but in test environments. I am frequently in a position where the fabric needs totally resetting, or a new fabric is being built from switches & components which have been used in other fabrics.

As a result of the above churn, all sorts of weirdness goes on. Often configuration still exists on the boxes from previous activities, software versions are mismatched, and sometimes the boxes have been used for standalone NX-OS activities so don't even have ACI images on them!

These issues manifest themselves in a few different ways, but the most frustrating one is as follows.

A typical fabric discovery exercise assumes that all ACI spine<->leaf connections are in place and the APIC appliances are connected to the leaf switches. All devices should be reset and awaiting fabric discovery, and the APIC is sitting at the initial setup prompt asking for a fabric name.

So, you check the leaf switches are awaiting discovery. Software versions are mismatched between the APIC and the leafs, but hopefully this is no big deal because you can update everything once discovered. Great! Moving on...

Now we configure the APIC. We put in the TEP pool, infra VLAN, GIPo range, MGMT IPs etc. The APIC does its thing and comes up, we can get to the web GUI, and everything is hunky dory.

Now...the problem begins. Our APIC is dual-connected to two leafs, but the APIC only uses a single link since it's an active-standby configuration until one link fails. This means our discovery process should happen as follows. We register the first leaf, then the spines, then any other leafs (including the leaf connected to the standby link of the APIC).
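
The registration order above can be sketched as a simple sort. This is only an illustration: the node names, the `role` field and the `apic_active_link` flag are all made up for the example, not anything the APIC exposes under those names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str                       # hypothetical node name
    role: str                       # "leaf" or "spine"
    apic_active_link: bool = False  # True only for the leaf on the APIC's active link


def registration_order(nodes):
    """Return nodes in the order described above:
    first leaf (APIC active link), then spines, then the remaining leafs."""
    def rank(node):
        if node.role == "leaf" and node.apic_active_link:
            return 0
        if node.role == "spine":
            return 1
        return 2
    # sorted() is stable, so nodes within each group keep their input order
    return sorted(nodes, key=rank)


# Sample fabric: leaf-101 holds the APIC's active link, leaf-102 the standby link
fabric = [
    Node("leaf-102", "leaf"),
    Node("spine-201", "spine"),
    Node("leaf-101", "leaf", apic_active_link=True),
    Node("spine-202", "spine"),
]

for node in registration_order(fabric):
    print(node.name)
```

With this sample data the first leaf comes out on top, followed by both spines and then the leaf on the standby link.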

This is all well and good. With the above information in mind, we navigate to the fabric membership screen in the APIC, expecting a single entry for the first leaf. BUT IT'S NOT THERE! AUGH!

So we console to the leaf, looks all good...it's in discovery, link towards the APIC is up, LLDP confirms that the leaf can see the APIC. Why doesn't it work?!

tl;dr - Everything is connected correctly, links are up, APIC cannot see the first switch...

Solution:

After a rather lengthy problem description, I'll shorten the solution.

  1. Software versioning matters - but it's not the first thing I would check.
  2. Time also matters - but this would be the third thing I would check.

The first and foremost thing to verify is that ALL fabric members, i.e. ALL SPINES & ALL LEAFS, have been cleaned with the setup-clean-config.sh script and then reloaded.

The issue I was having is that one of my spines was not clean, and it was pushing the incorrect infra VLAN towards the leafs. This doesn't show up on the leaf's console when looking at it - the leaf still appears to be in fabric discovery. But once this has happened, it 'pollutes' the leaf switch. The only way to really resolve this is to clean the leaf again and reload.
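
To make the failure mode concrete, here is a minimal sketch that compares the infra VLAN configured on the APIC with the infra VLAN each neighbour is advertising. How you collect those values is up to you. The dictionary below is made-up sample data, and the function name and VLAN numbers are hypothetical.

```python
def find_stale_neighbours(expected_infra_vlan, advertised):
    """Return the names of neighbours advertising a different infra VLAN.

    advertised maps neighbour name -> infra VLAN it advertises
    (sample data here, gathered however you like in practice).
    """
    return sorted(
        name for name, vlan in advertised.items()
        if vlan != expected_infra_vlan
    )


# Hypothetical values: the APIC was set up with infra VLAN 3967,
# but one spine still carries VLAN 4093 from a previous fabric.
advertised = {"apic-1": 3967, "spine-201": 3967, "spine-202": 4093}

print(find_stale_neighbours(3967, advertised))
```

Anything this returns is a candidate for another clean and reload before you retry discovery.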

So, in order of things I would check if your fabric is not discovering properly:

  1. ALL switches are clean.
  2. Software versions are at least close.
  3. The time difference between nodes is close (within an hour or so).
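
The checklist above could be sketched roughly like this. It is not an APIC API: the dictionaries, the `clean` flag and the version strings are hypothetical inputs you would have to gather yourself, and "close" for versions is crudely taken to mean the same major.minor release, assuming both versions are already written in the same numbering scheme.

```python
from datetime import datetime, timedelta

MAX_SKEW = timedelta(hours=1)  # "within an hour or so", per the list above


def major_minor(version):
    """'5.2(4d)' -> '5.2'. A crude stand-in for 'at least close'."""
    return version.split("(")[0]


def discovery_checks(apic, switches):
    """Return a list of problems, in the order they should be checked."""
    problems = []
    # 1. Every switch must have been cleaned and reloaded
    for sw in switches:
        if not sw["clean"]:
            problems.append(f"{sw['name']}: not cleaned")
    # 2. Software versions should be at least close to the APIC's
    for sw in switches:
        if major_minor(sw["version"]) != major_minor(apic["version"]):
            problems.append(
                f"{sw['name']}: version {sw['version']} vs APIC {apic['version']}"
            )
    # 3. Clocks should be within MAX_SKEW of the APIC
    for sw in switches:
        skew = abs(sw["clock"] - apic["clock"])
        if skew > MAX_SKEW:
            problems.append(f"{sw['name']}: clock differs from APIC by {skew}")
    return problems


# Made-up sample data for illustration only
now = datetime(2024, 1, 1, 12, 0, 0)
apic = {"name": "apic-1", "version": "5.2(4d)", "clock": now}
switches = [
    {"name": "leaf-101", "clean": True, "version": "5.2(4d)",
     "clock": now + timedelta(minutes=2)},
    {"name": "spine-201", "clean": False, "version": "4.2(7f)",
     "clock": now - timedelta(hours=3)},
]

for problem in discovery_checks(apic, switches):
    print(problem)
```

The problems come back grouped in the same order as the list, so the "not cleaned" entries are always the first thing you see.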