I recently finished migrating a customer's ISE cluster configuration to new hardware. It was about time, because the old hardware has been EOL since this summer. That would suggest I'm quite late to this, but there might still be people out there caught by surprise, so I would like to share my experience.
Summary
Like a scientific paper, which puts a summary at the top so you don't have to read through the entire study unless you are peer reviewing, I thought I'd put the outcome first, and then you can read how I got there.
What went well
Thorough planning is crucial. Think through all the steps necessary for a smooth migration. Thanks to that, most of the procedure went well, with as little instability as possible.
Challenges
Unfortunately, even with the latest firmware and ISE 3.2 with the latest patch, there were a few bugs that weren't discovered in time.
Lessons learned
- Do all the preparations ahead of time. You might discover bugs or hardware defects, so it is good to have a few days to fix those before the change window.
- Have a long change window. I expected to be finished in one day, but due to some unexpected challenges it took all weekend. Fortunately, I was prepared for that.
- If you suspect a hardware defect, always perform a power cycle by pulling the power cable for a minute and then reconnecting it. Cisco TAC will ask if you have done that anyway.
Planning and Design phase
Pre-sale
First we had to choose which new hardware should replace the old. Performance has come a long way since the old SNS-3595.
- With the 8-node SNS-3595 cluster, we had a capacity of 160,000 concurrent sessions. That was with 4 PSN nodes, each able to handle 40,000 sessions.
- With the new 6-node SNS-3755 cluster, we have a capacity of 200,000 concurrent sessions. That is with 2 PSN nodes handling 100,000 sessions each.
The customer's number of active sessions hasn't increased dramatically since we installed the last ISE appliances, and even if it were to suddenly approach the 200,000 limit, it's easy to scale out at a later time.
We considered the SNS-3795, but the only benefit was more disk and RAM. Given the huge price difference, we decided it was not worth it.
The new cluster looks like this:
Approach
I reached out to Cisco for a recommended approach, and they pretty much suggested the backup/restore procedure, which is tested, proven, and well documented. The only caveat is that you need to plan for a weekend job if you want to avoid instability in the network.
Note: This is not meant to be a complete, detailed guide on how to perform the backup/restore procedure. The details can be studied here:
- Cisco's guide: https://www.cisco.com/c/en/us/td/docs/security/ise/3-2/upgrade_guide/Upgrade_Journey/PDF/b_ise_upgrade_guide_3_2_pdf.pdf
- Unofficial Guide: https://www.wiresandwi.fi/blog/cisco-ise-general-steps-for-upgrades-using-backup-and-restore-method
Planning the installation
This should be gathered before any installation begins:
- A table including:
  - Hostnames
  - CIMC IP addresses
  - Temporary MGMT IPs for patching and installing certificates
- Physical design, including:
  - Rack number and position
  - Physical cabling
- Documentation of pxGrid and other external ISE integrations that need to be checked after the migration. These can include, but are not limited to:
  - External identity sources like AD or Entra ID
  - pxGrid DNAC integration
- Make sure you are able to change the applicable DNS records if hostnames change.
- Other important data that is needed:
  - DNS servers
  - NTP servers
  - Domain name
  - AD join credentials, if applicable
  - Cisco credentials to migrate the smart license
Table of necessary data
Note: Hostnames and IPs have been changed to protect the customer's identity.
Explanations:
- As you can see, the CIMC hostname can differ from the ISE hostname.
- Both MGMT IPs are in the 10.10.200.0/23 subnet.
- It should not be a problem to reuse your old hostnames.
- You will need to change the MGMT IP during the migration.
Physical Design
Below is an example of what a physical design could look like:
Explanations:
- The stencils used can be downloaded from Cisco. Just search for "cisco stencils".
- The cabling scheme is designed to use backup interfaces (a CLI configuration sketch follows this list).
- Blue lines are the MGMT VLAN
- Orange lines are for guest flow traffic.
- Black lines are the CIMC network.
- Rack number and positions are documented elsewhere.
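If you plan to use backup interfaces, the bonding itself is configured from the ISE CLI. Here is a minimal sketch, assuming Gi0/Gi1 carry MGMT and Gi2/Gi3 carry the guest traffic; the prompts are generic, and you should verify the exact pairing and behavior against the ISE admin guide for your version:

```
ise/admin# configure terminal
! Pair GigabitEthernet 0 (primary) with GigabitEthernet 1 (backup) as bond 0
ise/admin(config)# interface GigabitEthernet 0
ise/admin(config-GigabitEthernet)# backup interface GigabitEthernet 1
ise/admin(config-GigabitEthernet)# exit
! Pair GigabitEthernet 2 (primary) with GigabitEthernet 3 (backup) as bond 1
ise/admin(config)# interface GigabitEthernet 2
ise/admin(config-GigabitEthernet)# backup interface GigabitEthernet 3
ise/admin(config-GigabitEthernet)# exit
ise/admin(config)# exit
! Verify that the bonds show up
ise/admin# show interface
```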
Installing the appliances
Installation includes:
- Mounting the appliances into racks
- Patching CIMC and MGMT ports
- Configuring CIMC with IP information
- Upgrading the CIMC firmware
The first thing I discovered after mounting the appliances in the racks and patching all the cables was that you cannot use the SFP ports for MGMT. You have to use the RJ45 ports (8 and 9 in the picture below).
Another funny thing is that the RJ45 port IDs are counted from left to right, but the SFP ports (1) are flipped upside down, so they are counted from right to left.
CIMC configuration
Assuming no DHCP is configured on the network, you have to connect via the console and set the CIMC IP manually. To enter the CIMC configuration utility, press F8 at startup.
Other things you can configure while you are there (a rough CIMC CLI alternative is sketched after this list):
- Hostname
- NTP server
- DNS server
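The F8 utility is what you use for the first-time IP assignment, but once the CIMC is reachable you can also adjust these settings over SSH. On a UCS C-series CIMC it looks roughly like this; treat it as a sketch, since scope and property names can differ between firmware releases, and the hostname and addresses below are made up:

```
server# scope cimc
server /cimc # scope network
server /cimc/network # show detail
server /cimc/network # set hostname sns3755-psn01-cimc
server /cimc/network # set dhcp-enabled no
server /cimc/network # set v4-addr 10.10.199.11
server /cimc/network # set v4-netmask 255.255.255.0
server /cimc/network # set v4-gateway 10.10.199.1
server /cimc/network # commit
```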
Upgrading SNS appliance firmware
After everything was installed and patched, I upgraded the CIMC firmware using this guide: https://www.cisco.com/c/en/us/td/docs/security/ise/sns3700hig/sns-37xx-firmware-4-x-xx_upgrade_guide.html
Notes:
- Check the release notes to see if your upgrade path is supported. You may have to upgrade to an intermediate version.
- I performed the firmware installation from the same MGMT VRF as the ISE appliances, to avoid any issues with firewalls or a slow connection.
- Use a supported browser for the KVM interface. I use Firefox.
Preparations
Preparations include:
- Configure the initial setup with temporary IPs
- Configure a repository
- Install latest patch
- Install new certificates
All of this should be easy enough if you have set up an ISE node before. If you haven't, don't worry; the initial setup is pretty straightforward. Just make sure you have this info at hand (a quick CLI sanity check is sketched after the list):
- Hostname
- Temporary MGMT IP details
- DNS servers
- NTP servers
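Once the wizard finishes and the application has started, a quick sanity check from the CLI never hurts. These are standard ISE commands; the prompt is generic and the comments are just for the reader:

```
! Hostname, IP, name servers and NTP servers end up in the running config
ise/admin# show running-config
! Check NTP synchronization
ise/admin# show ntp
! All services should eventually report "running"
ise/admin# show application status ise
! Confirms the ISE version and installed patches
ise/admin# show version
```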
You can find information on how to set up a repository in my other blog post about ISE upgrades. You should also be able to find information there about how to apply the latest patch.
Note: Your current ISE deployment should also have the latest patch installed. If the major versions differ, for instance if the current cluster is on 3.2 but the new cluster is on 3.3, it will probably still work, but please check the release notes. The safest procedure is to have software parity.
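If you'd rather handle the repository and patching from the CLI than the GUI, it looks roughly like this. The repository name, SFTP server, credentials and the patch file name are placeholders:

```
ise/admin# configure terminal
ise/admin(config)# repository MIGRATION
ise/admin(config-Repository)# url sftp://10.10.200.50/ise-files
ise/admin(config-Repository)# user backupadmin password plain ********
ise/admin(config-Repository)# exit
ise/admin(config)# exit
! SFTP repositories require the server's host key to be trusted first
ise/admin# crypto host_key add host 10.10.200.50
! Install the patch bundle, then verify the patch level
ise/admin# patch install <ise-patchbundle-file>.tar.gz MIGRATION
ise/admin# show version
```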
If hostnames haven’t changed, you should be able to export and import the same certificates. If hostnames have changed, you need to generate new certificates.
Migration phase
I won't go into detail, as I have already linked to guides that do. Basically, you export your configuration, without the ADE-OS config, from the Primary PAN node on the old cluster and import it to the Secondary PAN node on the new cluster.
When you have exported the configuration and the optional logs, you can start by deregistering the secondary PAN node. Once it's deregistered, you can log into the CIMC of that appliance and power it off.
After that you can start restoring the configuration to the Secondary PAN node on the new cluster. In the meantime, you can update the DNS record for the secondary PAN node.
When the restore is finished, you just have to change the temporary MGMT IP to the original IP.
Note: Every time you change the IP or otherwise modify the interface, the application restarts.
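For reference, the CLI versions of those steps look roughly like this. The backup name, repository, encryption key, file name and IP address are placeholders, and the prompts just indicate which node you are on:

```
! On the OLD primary PAN: export the configuration (no ADE-OS settings included)
old-pan/admin# backup MIG-CFG repository MIGRATION ise-config encryption-key plain MySecretKey123

! On the NEW secondary PAN appliance: restore that backup
new-pan/admin# restore MIG-CFG-<timestamp>.tar.gpg repository MIGRATION encryption-key plain MySecretKey123

! Then move the node from its temporary MGMT IP to the original one (the application restarts)
new-pan/admin# configure terminal
new-pan/admin(config)# interface GigabitEthernet 0
new-pan/admin(config-GigabitEthernet)# ip address 10.10.200.11 255.255.254.0
new-pan/admin(config-GigabitEthernet)# exit
new-pan/admin(config)# exit
```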
You can then:
- Deregister the old primary MnT node
- Power it off from the CIMC interface
- Change the temporary IP on the primary MnT node on the new cluster to the original IP
- Update the DNS records
- Register the node into the new cluster
Repeat for the rest of the cluster. I’m using the same recommended order as always:
- Secondary PAN
- Primary MnT
- PSN nodes
- Secondary MnT
- Primary PAN
All of this should work fine, unless…
Caveats discovered during the migration
… some unexpected bugs hit you. During this migration I opened four TAC cases. One of them turned out to be unnecessary, but the others were really annoying bugs that consumed a lot of time. The new cluster was running on Monday morning, but some unresolved caveats took a few more days before everything was stable.
TAC #1: Faulty RAID controller
One of the appliances reported a faulty RAID controller from the CIMC interface:
I have seen this before on an APIC-M4 server, and that resulted in an RMA. Naturally, I opened a TAC case and reported a DOA (Dead On Arrival). The engineer insisted on a power cycle, pulling the power cable for a few minutes, before proceeding with the RMA. I did that last time and it had no effect, but I got it done anyway.
To my surprise, it actually worked this time! No more faults on the system; lesson learned.
TAC #2: Initial setup could not start
This happened on the same appliance. Because of the faulty RAID controller incident, I did not have time to complete the initial setup before the change window. The good thing was that it was still Friday, so I had no problem reaching an engineer.
I sent a screen recording of what happened. I could not start the initial setup wizard:
The TAC engineer said I had to reimage the appliance. I was afraid that would be the answer.
I performed the reimaging on Saturday, over the network via the KVM interface. It took 8 hours to complete. Maybe you are asking why I didn't try a USB memory stick instead? I have tried that in the past but was unable to get it to work. Had I known in advance that it would take 8 hours, I would probably have given it another shot.
Note: If you need to perform a reimage over the network, it's recommended to increase the session timeout to the maximum. You can do that under Admin > Communication Services.
The highest value is 3 hours, so you still have to click around once in a while so the installation won’t stop. If it does, you have to start all over again.
TAC #3: Weird behavior on one of the interfaces
This one was indeed weird.
On one of the PSN nodes, I had problems configuring an IPv4 address on the GigabitEthernet 2 interface.
I had no such problem on the GigabitEthernet 0 or 4 interfaces, and no problem on the other PSN node either.
Here is some of the output (with fake IPs and MACs):
Note that bond1 only has an IPv6 link-local address.
Also, the bond1 interface was sometimes listed in the output of "show interface" and sometimes it wasn't, but it never had an IPv4 address. Another thing I noticed was that when I configured the IPv4 address, the application did not restart like it usually does.
The solution was to remove the IPv4 and IPv6 configuration from the interface and then re-configure the IPv4 address. This time the application restarted as expected and the IPv4 address was pingable. I also re-applied the IPv6 configuration afterwards without any problem. I still wonder how this could be a problem only on this specific interface on this specific appliance, as IPv6 is activated on all interfaces by default.
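The fix boiled down to something like the sketch below. The prompt and address are made up to match the rest of the post, and the IPv6 line assumes the default autoconfiguration, so adjust it to whatever your interface actually had:

```
psn02/admin# configure terminal
psn02/admin(config)# interface GigabitEthernet 2
! Strip both address families from the misbehaving interface
psn02/admin(config-GigabitEthernet)# no ip address
psn02/admin(config-GigabitEthernet)# no ipv6 address autoconfig
psn02/admin(config-GigabitEthernet)# exit
! Re-apply the IPv4 address; this time the application restarts as expected
psn02/admin(config)# interface GigabitEthernet 2
psn02/admin(config-GigabitEthernet)# ip address 10.10.210.12 255.255.254.0
psn02/admin(config-GigabitEthernet)# ipv6 address autoconfig
psn02/admin(config-GigabitEthernet)# exit
psn02/admin(config)# exit
psn02/admin# show interface GigabitEthernet 2
```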
The problem with troubleshooting ISE interface configuration is that every change takes around 15 minutes while the application restarts. The time just flies away.
TAC #4: Could not abort the restore process
I accidentally started the restore process on the wrong PAN node and quickly hit CTRL+C, which, apparently, I shouldn't have done:
Warning: Do not use CTRL+C or close this terminal window until the restore is completed.
OK my bad, but what is the punishment? When it was time to deploy the last appliance, i.e. this one, I could not deploy it because it had a stuck restore process:
I tried everything:
- Reboot
- Power cycle by pulling the power cable
- reset-config and application reset-config, which are supposed to return everything to factory defaults
- Entering application configure ise and choosing Force Backup Cancellation, but that didn't work on a restore operation
I eventually managed to solve the problem on my own by starting a new bogus restore operation with the same name, but with the file removed from the repository and a deliberately wrong password. That restore failed, and the stuck process finally died.
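For the record, the workaround was nothing more fancy than kicking off a restore that was guaranteed to fail, along these lines (the file had already been deleted from the repository and the key is deliberately wrong; names are placeholders):

```
! This restore fails (missing file, wrong key) and takes the stuck process with it
ise/admin# restore MIG-CFG-<timestamp>.tar.gpg repository MIGRATION encryption-key plain WrongOnPurpose
```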
Afterwards, I felt this was not an intuitive solution and wondered why there isn't an easier way to cancel a restore process. The TAC engineer agreed and said he would file an enhancement request on my behalf.
Conclusion
The last time I performed an ISE-related upgrade it was a disaster, but I admit I was mostly to blame. This time I did my due diligence and was well prepared, but I still ran into many issues because of bugs that I could not have expected.
The only thing I would do differently next time is to do all the preparation well in advance. Some of the caveats I would have been able to discover and fix before the change window, but I had a bit of a time constraint, so I ended up fixing them during the change window. It wouldn't have saved me any time or decreased the number of TAC cases, but at least I wouldn't have had to work all weekend.