Here’s a summary of the lessons learned with Cisco UCS from the Cisco LIVE 2014 session titled, “BRKCOM-3010 – UCS: 4+ Years of Lessons Learned by the TAC”.
#1 – Read the Release Notes
It’s a good practice to read the release notes before any UCS update, specifically the Mixed Cisco UCS Release Support Matrix. If you are going to run mixed firmware versions, also check the “Minimum B/C Bundle…Features” section to ensure you have the right versions for any new features you are adding; otherwise you may get error messages.
#2 – Plan UCS Firmware Upgrades like an Elective Surgery
Before you begin any firmware upgrades, take the time to prepare. Consider doing a proactive TAC update: let them know you are doing a firmware update so they can point out any reminders. As mentioned above, consult the release notes. Also, back up your system and check the compatibility matrices. If you have any critical or major faults, contact the TAC and get the issues addressed before moving forward with any updates. There are video guides on how to do upgrades, so consider reviewing them before upgrading. Finally, check Cisco’s online community and support forums to see how other people are doing with upgrade paths.
According to Cisco, the steps most often overlooked in firmware upgrades are: not updating the OS drivers to meet the compatibility matrix, forgetting to back up the system prior to the upgrade, and not upgrading the blade BIOS and Board Controller. It’s important to carefully consider these recommended planning steps because if you run into issues down the road and Cisco finds out that a driver or firmware is out of the support matrix, they won’t be able to help you move forward until you are in compliance. Cisco’s recommendation is to use the UCS HW and SW Interoperability Matrix as the reference for what is supported.
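The preparation steps above can be sketched as a simple go/no-go check. This is only an illustration: the `UpgradePlan` fields and the `blockers` function are hypothetical names made up for this example, not part of any Cisco tool, and the checks simply mirror the advice in this section.

```python
from dataclasses import dataclass, field

# Fault severities that, per the advice above, should go to TAC before upgrading.
BLOCKING_SEVERITIES = {"critical", "major"}


@dataclass
class UpgradePlan:
    """Hypothetical record of the pre-upgrade preparation steps."""
    release_notes_read: bool = False
    backup_taken: bool = False
    interop_matrix_checked: bool = False  # OS drivers vs. the interoperability matrix
    bios_and_board_controller_included: bool = False
    open_fault_severities: list = field(default_factory=list)


def blockers(plan):
    """Return the reasons the upgrade should not start yet (empty list = go)."""
    reasons = []
    if not plan.release_notes_read:
        reasons.append("release notes not reviewed")
    if not plan.backup_taken:
        reasons.append("no backup taken")
    if not plan.interop_matrix_checked:
        reasons.append("OS drivers not checked against the interoperability matrix")
    if not plan.bios_and_board_controller_included:
        reasons.append("blade BIOS and Board Controller not included in the upgrade")
    blocking = sorted(
        {s.lower() for s in plan.open_fault_severities} & BLOCKING_SEVERITIES
    )
    if blocking:
        reasons.append("open faults need TAC attention: " + ", ".join(blocking))
    return reasons
```

An empty list from `blockers(plan)` means every item has been ticked off; anything else is a list of what still needs doing before the maintenance window.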
#3 – Use Maintenance Windows for UCS Upgrades
Although you could feasibly do upgrades during the day, it’s not worth the risk. Cisco TAC advises that all upgrades be done in a maintenance window, especially when making changes to the Fabric Interconnects. Updating a single blade is fine, but since everything goes through the Fabric Interconnects, wait until you can get a maintenance window. Better to be safe than sorry.
#4 – Back Up UCSM
Although you have two Fabric Interconnects and redundancy, you still need to back up UCSM. You have four different options: Full State, System Configuration, Logical Configuration and All Configuration. It’s recommended to do a Full State backup (encrypted, and intended for disaster recovery). The System Configuration option is XML-based (not encrypted) but can be imported into other Fabric Interconnects as needed. Logical Configuration is similar to System Configuration but contains details on service profiles, VLANs, VSANs, pools and policies.
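As a quick reference, the four backup types can be written down as a small lookup table. This is just a sketch of the descriptions above; the names `BACKUP_TYPES` and `types_for` are made up for illustration, and the "all configuration" entry is my assumption (system and logical combined), since the session summary does not describe it.

```python
# Backup types as summarized above. The purpose text for "all configuration"
# is an assumption; the summary names the option but does not describe it.
BACKUP_TYPES = {
    "full state": {
        "purpose": "disaster recovery",
    },
    "system configuration": {
        "purpose": "import into other Fabric Interconnects",
    },
    "logical configuration": {
        "purpose": "service profiles, VLANs, VSANs, pools and policies",
    },
    "all configuration": {
        "purpose": "system and logical configuration combined (assumption)",
    },
}


def types_for(keyword):
    """Return the backup types whose stated purpose mentions the keyword."""
    keyword = keyword.lower()
    return [
        name
        for name, info in BACKUP_TYPES.items()
        if keyword in info["purpose"].lower()
    ]
```

For example, `types_for("disaster recovery")` points at the Full State backup, which is the one recommended above.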
#5 – Use Fibre Channel Port Channels with Fibre Channel Storage
Individual Fibre Channel uplinks can have high-latency issues. Since the HBAs are given FCIDs round-robin, in the order they come across, they are spread evenly across the uplinks by count, but there is no way of distributing them by load. This becomes a problem when some HBAs access the storage heavily, or if you lose a link. To resolve it, you have to balance the HBAs manually. With Fibre Channel port channels, all individual links are seen as one logical link, so heavy workloads are distributed evenly and the loss of one link is kept from impacting performance.
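To see why round-robin placement is even by count but not by load, here is a toy model. Everything in it is an assumption for illustration: the HBA names, the uplink names, the load figures, and above all the port channel function, which idealizes a port channel as sharing the total load perfectly evenly across its member links.

```python
from itertools import cycle


def pin_round_robin(hba_loads, uplinks):
    """Pin HBAs to uplinks in login order and return the load per uplink."""
    per_link = {uplink: 0 for uplink in uplinks}
    for (_hba, load), uplink in zip(hba_loads, cycle(uplinks)):
        per_link[uplink] += load
    return per_link


def port_channel(hba_loads, uplinks):
    """Idealized port channel: total load shared evenly by all member links."""
    total = sum(load for _hba, load in hba_loads)
    return {uplink: total / len(uplinks) for uplink in uplinks}


# Hypothetical workloads in MB/s, listed in the order the HBAs log in.
example_hbas = [("hba0", 400), ("hba1", 50), ("hba2", 350), ("hba3", 20)]
example_uplinks = ["fc1/1", "fc1/2"]
```

With these made-up numbers, `pin_round_robin(example_hbas, example_uplinks)` puts the two busy HBAs on the same uplink (750 MB/s versus 70 MB/s), while the idealized `port_channel` spreads the same 820 MB/s as 410 MB/s per link.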
#6 – Ensure Your A-Side and B-Side Fibre Channel Switches Remain Separated
Many people want to put an ISL between their Fibre Channel switches; however, the zoning then goes to both sides, and if a mistake is made on one side, it’ll take out the other. Also, don’t connect a Fabric Interconnect to both Fibre Channel switches. Keep FI #1 attached to FC Switch #1 and FI #2 attached to FC Switch #2.
#7 – Don’t Use 3rd Party Transceivers
Pay the premium for Cisco transceivers and avoid unnecessary issues or faults.
#8 – Degraded DIMM Faults May Not Be Accurate
Cisco TAC admitted that Cisco had conservative thresholds for ECC errors on UCS, which caused more alarms than necessary. These false alarms were fixed in firmware versions 2.2(1b) and 2.1(3c). If you are experiencing these issues and are outside your maintenance window, you can safely ignore the ‘degraded DIMM’ faults until you upgrade or RMA the degraded DIMM. In 2.2(1b), you can turn on DIMM blacklisting to mark DIMMs with uncorrectable errors as bad.
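If you want to check whether a given firmware version already carries the DIMM threshold fix, a small version comparison does the job. This is a sketch under stated assumptions: the function names are mine, it assumes version strings always look like `2.2(1b)`, and it assumes that release trains newer than 2.2 include the fix while trains older than 2.1 do not (the session only named 2.1(3c) and 2.2(1b)).

```python
import re

# Matches UCS-style versions such as "2.1(3c)": major.minor(maintenance + letter).
_VERSION = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")

# First fixed release per train, as quoted in the session summary.
FIXED_IN = {(2, 1): "2.1(3c)", (2, 2): "2.2(1b)"}


def parse_ucs_version(text):
    """Turn '2.2(1b)' into a comparable tuple like (2, 2, 1, 'b')."""
    match = _VERSION.match(text.strip())
    if not match:
        raise ValueError(f"unrecognized UCS version string: {text!r}")
    major, minor, maintenance, letter = match.groups()
    return (int(major), int(minor), int(maintenance), letter)


def has_dimm_threshold_fix(version):
    """Return True if this version is at or beyond the fixed release of its train."""
    parsed = parse_ucs_version(version)
    train = parsed[:2]
    if train in FIXED_IN:
        return parsed >= parse_ucs_version(FIXED_IN[train])
    # Assumption: trains after 2.2 carry the fix, trains before 2.1 do not.
    return train > (2, 2)
```

For instance, `has_dimm_threshold_fix("2.1(3b)")` is `False`, because 2.1(3b) sits one letter short of the fixed 2.1(3c) release.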
Kevin Houston is the founder and Editor-in-Chief of BladesMadeSimple.com. He has over 17 years of experience in the x86 server marketplace. Since 1997 Kevin has worked at several resellers in the Atlanta area, and has a vast array of competitive x86 server knowledge and certifications as well as an in-depth understanding of VMware and Citrix virtualization. Kevin works for Dell as a Server Sales Engineer covering the Global Enterprise market.
Disclaimer: The views presented in this blog are personal views and may or may not reflect any of the contributors’ employer’s positions. Furthermore, the content is not reviewed, approved or published by any employer.
I wholeheartedly agree with everything said here. I’ve been burned by TwinAx and some other bugs pretty badly. Do not mess around with this! Also, I’d say choose your versions very carefully. A good example is that UCS 2.2 is not really fully primetime ready, but you’re forced to run it if you want the v2 processors for a B420M3! We’re sitting stable on 2.1(3b) right now and will not be updating until the 2.2 platform is more mature or is eclipsed by a 3.0/3.1 release that does the same thing. It’s a shame we’re going to be forced to go to a release that I’m not going to have a lot of faith in just because of the B420.