UC on UCS B-Series Troubleshooting Guide
Section Links
UCS Tools for Troubleshooting
Blade/Server Troubleshooting
IOM (FEX) Troubleshooting
Fabric Interconnect Troubleshooting
SAN Troubleshooting
Cisco UCS Manager
Embedded in Fabric Interconnect
UCS 6140XP 40-Port Fabric Interconnect
Cisco UCS 2100 Series Fabric Extenders
Logically part of Fabric Switch
Inserts into Blade Enclosure
Cisco UCS 5100 Series Blade Chassis
Flexible bay configurations
Logically part of Fabric Interconnect
Chassis Management Controller (CMC): Operations, Chassis Discovery, Physical Connections to Fabric Interconnect (FI) and Logical Connections to Adaptor Cards
Baseboard Management Controller (BMC) of Compute nodes: All Compute node Components (memory, proc, mezz cards, disk)
[Figure: management network. Each fabric interconnect has its own management IP (IP #A at Switch-A#, IP #B at Switch-B#). UCSM is a redundant management service with multiple protocol support, covering switch elements, chassis elements, and server elements.]
UCSM access
Client logs for debugging UCSM access and client KVM access are found at this location
on the client system:
C:\Documents and Settings\userid\Application Data\Sun\Java\Deployment\log\.ucsm
Cisco Confidential
Statistics breakdown
Live/now
History
[Figure: UCSM cluster architecture. Fabric Interconnect A runs UCSM-A and Fabric Interconnect B runs UCSM-B; one instance is active and the other is standby. Each instance has an Interface Layer, DME, FSM, Replicator, HA Controller, and a Persistifier backed by flash. Chassis 1, 2, 3, ... each hold CMCs with EPROM.]
DME
chassis
CMC
DME just manages the state of the object and workflow, and then
instructs the AG to perform the activity.
AGs do the real work.
FSM names usually have the following notation
FSM <Object><Workflow><Operation><Where-is-it-executed>
Object: Blade/Chassis
Workflow: Discover/Association
Operation: Pnuos-Config
Where: generally blank, or A or B or Local or Peer
If Where is not specified, it is executed on the managing node
PNUOS (Processing Node Utility OS): Linux-based pre-boot execution environment that can boot on a
processing node to run diagnostics, report inventory, or configure the
firmware state of the Blade
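The four-part notation can be modelled as a small record. This is only an illustrative sketch: the class name, field names, and the label format are hypothetical and are not UCSM identifiers. The one rule it encodes comes from the slide: a stage with no Where part runs on the managing node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FsmStage:
    """One FSM stage described by the four parts of the notation (hypothetical model)."""
    obj: str          # e.g. "Blade" or "Chassis"
    workflow: str     # e.g. "Discover" or "Association"
    operation: str    # e.g. "Pnuos-Config"
    where: str = ""   # blank, "A", "B", "Local", or "Peer"

    def executed_on(self) -> str:
        # Per the slide: if Where is not specified, the managing node runs the stage.
        return self.where if self.where else "managing node"

    def label(self) -> str:
        # Assumed display format for illustration; real UCSM stage names may differ.
        parts = (self.obj, self.workflow, self.operation, self.where)
        return " / ".join(p for p in parts if p)


stage = FsmStage("Blade", "Discover", "Pnuos-Config")
print(stage.label(), "->", stage.executed_on())
```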
FSM
Almost every action done by the UCSM has an FSM to verify operation and status
View and monitor each action for ongoing feedback and progress state of an action
Logs kept for review and troubleshooting
OBFL
Processor Unit
Memory Mirroring, Sparing, SMI Link errors
Motherboard
PCIe, QPI uncorrectable errors, Legacy PCI errors
All these errors are modeled as stats properties. Those with no threshold defined are reported as statistics only
Server
CLI navigation
SSH or Telnet to the Cluster IP when possible
You will connect to the Primary FI in the cluster automatically
Cisco UCS 6100 Series Fabric Interconnect
Using keyboard-interactive authentication.
The copyrights to certain works contained herein are owned by
other third parties and are used and distributed under license.
Some parts of this software may be covered under the GNU Public
License or the GNU Lesser General Public License. A copy of
each such license is available at
http://www.gnu.org/licenses/gpl.html and
http://www.gnu.org/licenses/lgpl.html
FarNorth-B#
FarNorth-B# show
  chassis              Chassis
  cli                  CLI commands
  clock                Display current Date
  cluster              Cluster mode
  configuration        Show information about configuration sessions
  eth-uplink           Ethernet Uplink
  event                Event Manager commands
  fabric-interconnect  Show Fabric Interconnect
  fault                Fault
  identity             Identity
  iom                  IO Module
  license              Show the contents of all the license files
  org                  Organizations
  security             Security mode
  sel                  System Event Log
  server               Server
  service-profile      Service Profile
  system               System-related show commands
  timezone             Set timezone
  version              System version
  vif                  Virtual Interfaces
Configuration tools
FarNorth-A# show configuration | ?
  cut      Print selected parts of lines.
  egrep    Egrep - print lines matching a pattern
  grep     Grep - print lines matching a pattern
  head     Display first lines
  last     Display last lines
  less     Filter for paging
  no-more  Turn-off pagination for command output
  sort     Stream Sorter
  tr       Translate, squeeze, and/or delete characters
  uniq     Discard all but one of successive identical lines
  vsh      The shell that understands cli command
  wc       Count words, lines, characters
  begin    Begin with the line that matches
  count    Count number of lines
  end      End with the line that matches
  exclude  Exclude lines that match
  include  Include lines that match
Scope
Scoping: movement to different UCS configuration components
Details on hardware components are obtained with the connect command
Mezzanine Adapter
Chassis
Ethernet Server Domain
Ethernet Uplink
Fabric Interconnect
FC Uplink
Firmware
Host Ethernet Interface
Host FC Interface
Monitor the system
Organizations
Securitymode
Server
Service Profile
Systems
VHBA
Connect NXOS
Connecting from the XML to the Fabric Interconnect
(FI) standard NXOS component.
Used to assist in troubleshooting; very familiar to IOS
and Nexus users, with all the show commands
Used to run advised debugs
Show switch running config (non-server config)
Enable and run ethanalyzer
Clear interface counters found on the FI
Cannot be used to configure UCS (read only)
Connect
Hardware Troubleshooting
Mezzanine Adapter
Baseboard Management Controller (CIMC)
Connect to DMTF CLP
IO Module
Connect to Local Management CLI
Connect to NXOS CLI
Most dangerous: erase configuration, reboot
FarNorth-A(local-mgmt)# ?
  cd                Change current directory
  clear             Reset functions
  cluster           Cluster mode
  connect           Connect to Another CLI
  copy              Copy a file
  cp                Copy a file
  delete            Delete managed objects
  dir               Show content of dir
  enable            Enable
  end               Go to exec mode
  erase             Erase
  erase-log-config  Erase the mgmt logging config file
  exit              Exit from command interpreter
  install-license   Install a license
  ls                Show content of dir
  mkdir             Create a directory
  move              Move a file
  mv                Move a file
  ping              Test network reachability
  pwd               Print current directory
  reboot            Reboots Fabric Interconnect
  rm                Remove a file
  rmdir             Remove a directory
  run-script        Run a script
  show              Show running system information
  ssh               SSH to another system
  tail-mgmt-log     Tail mgmt log file
  telnet            Telnet to another system
  terminal          Set terminal line parameters
  top               Go to the top mode
  traceroute        Traceroute to destination
Connect to NXOS
FarNorth-A# connect nxos <CR>
  a  Fabric A
  b  Fabric B
FarNorth-A(nxos)# ?
  clear         Reset functions   <---only place you can clear counters today
  cli           CLI commands
  debug         Debugging functions
  debug-filter  Enable filtering for debugging functions
  end           Go to exec mode
  ethanalyzer   Configure cisco fabric analyzer
  exit          Exit from command interpreter
  no            Negate a command or set its defaults
  ntp           Execute NTP commands
  pop           Pop mode from stack or restore from name
  push          Push current mode to stack or save it under name
  show          Show running system information
  system        System management commands
  terminal      Set terminal line parameters
  test          Test command
  undebug       Disable Debugging functions (See also debug)
  where         Shows the cli context you are in
KVM
Tool to snapshot the screen for support
A WebEx recording works best
Fabric Monitoring
Power
Thermal Sensors: DIMMs, CPUs, Adapter; sensor values available via IPMI
CMC: per blade totals, per chassis totals, PSU redundancy state
Vifs: interface stats, states
Adaptor: interface stats, aggregate stats, states
FEX: interface stats, states
Switch: interface stats, Vifs stats, states
FarNorth-A(local-mgmt)# dir
        16 Oct 30 09:31:03 2009 cores
        31 Nov 20 13:14:20 2009 diagnostics
      1024 Oct 30 09:29:05 2009 lost+found/
      1024 May 17 12:59:47 2010 techsupport/
FarNorth-A(local-mgmt)# cd ///techsupport
FarNorth-A(local-mgmt)# ls
   2140160 May 17 12:52:58 2010 20100517124544_FarNorth_BC001_all.tar
  12871680 May 17 12:59:47 2010 20100517125801_FarNorth_UCSM.tar
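When several tech-support bundles pile up, it can help to sort them by the time they were started. The sketch below assumes, based only on the two file names in the listing above, that each name begins with a YYYYMMDDhhmmss stamp followed by an underscore; the function name is hypothetical.

```python
from datetime import datetime


def techsupport_timestamp(filename: str) -> datetime:
    """Read the leading timestamp of a tech-support bundle name.

    Assumption: the name starts with YYYYMMDDhhmmss followed by "_",
    as in the two files shown in the listing.
    """
    stamp = filename.split("_", 1)[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


bundles = [
    "20100517125801_FarNorth_UCSM.tar",
    "20100517124544_FarNorth_BC001_all.tar",
]
# Oldest bundle first.
for name in sorted(bundles, key=techsupport_timestamp):
    print(techsupport_timestamp(name).isoformat(), name)
```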
Core Dumps
Blade Troubleshooting
Troubleshooting Flow
For the rest of the session we will work from the blade servers up toward the LAN and
SAN network
Start: Blades -> IOM Modules -> Fabric Interconnects -> LAN-SAN: End
Blades
Bad Hardware
Bad DIMM(s): reseat or replace
CPU or other component: check logs
Adaptor issues
Connect to mezz cards to diagnose issues
BMC Troubleshooting
Command
Description
mctool
network
obfl
Live obfl
messages
alarms
sensors
power
Connect CIMC
Debug Utility
__________________________________________
Debug Firmware Utility
__________________________________________
Command List
__________________________________________
alarms
cores
exit
help [COMMAND]
images
mctools
memory
messages
network
obfl
post
power
sensors
sel
fru
mezz1fru
mezz2fru
tasks
top
update
users
version
__________________________________________
Notes:
"enter Key" will execute last command
"COMMAND ?" will execute help for that command
__________________________________________
VIC M81KR (Palo)
M81KR (Palo Adaptor)
Details of Vif
Adapter Debug CLI (logs)
Memory errors
Check Server Event Log/Faults
sh sel 2/1
5ed | 03/29/2010 02:20:50 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 0, DIMMSocket: 1, Channel: C, Socket: 0 | Asserted
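A SEL line like the one above is pipe-delimited, so the failing DIMM location can be pulled out with a few lines of code. This is a minimal sketch that assumes every entry has the same six fields as this one sample; the function and dictionary key names are hypothetical.

```python
def parse_sel_entry(line: str) -> dict:
    """Split one pipe-delimited SEL line into named fields.

    Assumption: six fields, in the order seen in the sample entry.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    record_id, timestamp, sensor, description, detail, state = fields

    # The fifth field is a comma-separated list of "Key: value" pairs.
    location = {}
    for item in detail.split(","):
        key, _, value = item.partition(":")
        location[key.strip()] = value.strip()

    return {
        "id": record_id,
        "timestamp": timestamp,
        "sensor": sensor,
        "description": description,
        "location": location,
        "state": state,
    }


sample = ("5ed| 03/29/2010 02:20:50 | Memory 0x02| Uncorrectable ECC/other "
          "uncorrectable memory error | Rank: 0, DIMMSocket: 1, Channel: C, "
          "Socket: 0 | Asserted")
entry = parse_sel_entry(sample)
print(entry["location"])
```

The location fields (channel and DIMM socket) point at the DIMM to reseat or replace.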
Reboots
Need to find out the reason for the hardware reboot
BMC (CIMC) issue in hardware/firmware on the server
UCS Service Profile: caused by a profile change/issue
Other hardware on the blade: CPU, memory
User-induced: reset button
Blade Reboots
This is a signature of a HW failure: power off followed by power on within 4-5 seconds (an Intel feature that reacts to HW failure):
0:2009 Nov 25 11:44:55:BMC:kernel::<0>LPCReset ISR-> ResetState: 1 <---this indicates Reset occurred
4:2009 Nov 25 11:44:55:BMC:kernel:-:<4>/nuova/builds1/ca-ventura_1-build/091027-100438-rev34618-FCSd/bmc/drivers/vdd_pwr_good/gooding/vdd_pwr_good_cb.c:19:Platform is Gooding: Deasserted
5:2009 Nov 25 11:44:55:BMC:kernel:-:<5>USB FS: VDDPower WAKEUP-Power Good = OFF
5:2009 Nov 25 11:44:55:BMC:kernel:-:<5>USB HS: VDDPower WAKEUP-Power Good = OFF
1:2009 Nov 25 11:44:55:BMC:kernel::<1>/nuova/builds1/ca-ventura_1-build/091027-100438-rev34618-FCSd/bmc/drivers/block_transfer/block_transfer.c:564:block_transfer_deallocate_entire_list--> Dumped: 0x0000 files.
5:2009 Nov 25 11:44:55:BMC:kernel:-:<5>handle_exception: Handling MSD_STATE_DISCONNECT for interface[0]
5:2009 Nov 25 11:44:55:BMC:kernel:-:<5>handle_exception: Handling MSD_STATE_DISCONNECT for interface[1]
4:2009 Nov 25 11:44:55:BMC:kernel:-:<4>kbdmouse_write: mouse write aborted for device reset.
5:2009 Nov 25 11:44:55:BMC:IPMI:472: Pilot2SrvPower.c:369:BladePower Changed To: [ OFF ]
5:2009 Nov 25 11:44:55:BMC:IPMI:500: VirtualSEL.c:26:SELEvt[22 02]< 22 02 02 B718 0D4B20 00 04 25 52 08 00 FF FF>
3:2009 Nov 25 11:45:16:BMC:doctor-bmc:584: doctor-bmc.c:1143:Tcp-> Connection between remote ip 0xFE00037F at port 0x86A4 and local ip 0x200037F at port 0xFAA is in TCP_TIME_WAIT state for at least 2 min 30 seconds.
3:2009 Nov 25 11:45:16:BMC:doctor-bmc:584: doctor-bmc.c:1155:Tcp-> Total Errors Found: 1
5:2009 Nov 25 11:45:21:BMC:kernel:-:<5>/nuova/builds1/ca-ventura_1-build/091027-100438-rev34618-FCSd/bmc/drivers/pilot2_power/pilot2_power.c:266:do_power_on
remote ip 0xFE00037F = 254 0 3 127, read in reverse as 127.3.0.254 (the CMC0 interface to the blades); local ip 0x200037F = 2 0 3 127, read in reverse as 127.3.0.2
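The hex addresses in the doctor-bmc line are stored with the octets in reverse order, so decoding means splitting the value into four bytes and reading them back to front. A small sketch of that arithmetic follows (the function name is hypothetical). Applied to 0xFE00037F it yields 127.3.0.254, and applied to 0x200037F it yields 127.3.0.2.

```python
def decode_bmc_ip(value: int) -> str:
    """Convert a hex address from the BMC log into dotted-decimal form."""
    octets = value.to_bytes(4, "big")      # 0xFE00037F -> FE 00 03 7F
    # The log stores the octets in reverse order, so read them back to front.
    return ".".join(str(o) for o in reversed(octets))


print(decode_bmc_ip(0xFE00037F))   # remote address
print(decode_bmc_ip(0x200037F))    # local address
print(int("86A4", 16), int("FAA", 16))   # the two port numbers in decimal
```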
Blade Reboots
This is an IPMI request, coming from UCSM as an authorized reboot or as a result of having the desired power state set to OFF.
5:2009 Dec 23 18:16:58:BMC:mctool@127.5.254.1:1275: mcserver_ipmi_extensions.c:212:[mcserver_set_vdd_power] "Power Off" <---indicator that an IPMI initiated reset has occurred.
5:2009 Dec 23 18:16:58:BMC:kernel:-:<5>/nuova/builds1/ca-ventura_1-build/091027-100438-rev34618-FCSd/bmc/drivers/pilot2_power/pilot2_power.c:232:do_power_off
0:2009 Dec 23 18:17:03:BMC:kernel::<0>LPCReset ISR-> ResetState: 1 <---this indicates you've entered Reset for whatever reason
4:2009 Dec 23 18:17:03:BMC:kernel:-:<4>/nuova/builds1/ca-ventura_1-build/091027-100438-rev34618-FCSd/bmc/drivers/vdd_pwr_good/gooding/vdd_pwr_good_cb.c:19:Platform is Gooding: Deasserted
5:2009 Dec 23 18:17:03:BMC:kernel:-:<5>USB FS: VDDPower WAKEUP-Power Good = OFF
5:2009 Dec 23 18:17:03:BMC:kernel:-:<5>USB HS: VDDPower WAKEUP-Power Good = OFF
1:2009 Dec 23 18:17:03:BMC:kernel::<1>/nuova/builds1/ca-ventura_1-build/091027-100438-rev34618-FCSd/bmc/drivers/block_transfer/block_transfer.c:564:block_transfer_deallocate_entire_list--> Dumped: 0x0000 files.
5:2009 Dec 23 18:17:03:BMC:kernel:-:<5>handle_exception: Handling MSD_STATE_DISCONNECT for interface[0]
5:2009 Dec 23 18:17:03:BMC:kernel:-:<5>handle_exception: Handling MSD_STATE_DISCONNECT for interface[1]
For all resets, the DME logs should also be viewed for more information.
The DME log (svc_sam_dme.log) is found in /var/sysmgr/sam_logs/ inside the .tar file
produced by show tech-support ucsm detail
A# connect local-mgmt
A(local-mgmt)# show tech-support ucsm detail
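The two blade reboot signatures differ in what comes before the LPC reset line: an IPMI-requested power-off logs mcserver_set_vdd_power first, while the hardware-failure signature enters reset with no such request. A rough sketch of that check follows. It assumes the two marker strings from the sample logs are representative, and the function name and return labels are hypothetical.

```python
def classify_reset(log_lines):
    """Classify the first reset found in a sequence of BMC log lines.

    Assumption: the marker substrings are taken from the two sample logs
    and may not cover every firmware version.
    """
    ipmi_power_off_seen = False
    for line in log_lines:
        if "mcserver_set_vdd_power" in line:
            # UCSM asked the BMC over IPMI to change the power state.
            ipmi_power_off_seen = True
        elif "ResetState: 1" in line:
            # The blade entered reset; decide based on what preceded it.
            return "ipmi-requested" if ipmi_power_off_seen else "hardware-signature"
    return "no-reset-found"


hw_log = [
    "0:2009 Nov 25 11:44:55:BMC:kernel::<0>LPCReset ISR-> ResetState: 1",
    "5:2009 Nov 25 11:44:55:BMC:IPMI:472: Pilot2SrvPower.c:369:BladePower Changed To: [ OFF ]",
]
ipmi_log = [
    '5:2009 Dec 23 18:16:58:BMC:mctool@127.5.254.1:1275: mcserver_ipmi_extensions.c:212:[mcserver_set_vdd_power] "Power Off"',
    "0:2009 Dec 23 18:17:03:BMC:kernel::<0>LPCReset ISR-> ResetState: 1",
]
print(classify_reset(hw_log), classify_reset(ipmi_log))
```

A result of "hardware-signature" is only a hint; the DME log should still be checked before replacing parts.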
(SoL)
http://ipmitool.sourceforge.net/
Management Network
[Figure: IPMI user accessing the BMC interface over IPMI]
IPMI does not run on the OS installed on the blade
Totally independent of the installed OS; runs even if the OS is down
DMIDECODE
http://www.nongnu.org/dmidecode/
bios
system
baseboard
chassis
processor
memory
cache
connector
slot
Troubleshooting Flow
We will work from Blade servers up toward LAN and SAN network
Start: Blades -> IOM Modules -> Fabric Interconnects -> LAN-SAN: End
IOM connections: chassis backplane view
[Figure: chassis backplane. Labels: Chassis, Blade 1 through Blade 7, IOM1, IOM2, Path A, Path B]
Half-width servers: 1 mezz card (one A and one B path)
Full-width servers: 2 mezz cards (two A & B paths)
IOM connections
Each IOM (aka Fabric Extender) provides
8+1 internal IO channels (8 slots + 1 internal mgmt network)
4 external ports (10 Gbps each; no EtherChannel in the 1st release)
These interfaces are backplane traces
EthX/Y/Z where
X = chassis number
Y = mezz card number (always 1 with half-width blades)
Z = IOM port number (slot where the blade server resides)
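The EthX/Y/Z convention can be unpacked mechanically when reading interface names out of command output. The sketch below is illustrative: the type and function names are hypothetical, and accepting both the "Eth" and "ethernet" spellings is an assumption.

```python
import re
from typing import NamedTuple


class BackplaneInterface(NamedTuple):
    chassis: int     # X
    mezz_card: int   # Y (always 1 with half-width blades)
    iom_port: int    # Z (slot where the blade server resides)


# Accept "Eth1/1/3" as well as "ethernet 1/1/3" (assumed spellings).
_PATTERN = re.compile(r"^eth(?:ernet)?\s*(\d+)/(\d+)/(\d+)$", re.IGNORECASE)


def parse_backplane_interface(name: str) -> BackplaneInterface:
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not an EthX/Y/Z interface name: {name!r}")
    return BackplaneInterface(*(int(g) for g in match.groups()))


intf = parse_backplane_interface("Eth2/1/5")
print(f"chassis {intf.chassis}, mezz card {intf.mezz_card}, blade slot {intf.iom_port}")
```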
[Figure: IOM slots 1 through 8 pinned to the IOM-to-FI links toward the switch, shown for 1, 2, and 4 links]
1 link
How to read this: with one IOM-to-FI link, all servers use that link
2 links
How to read this: with two IOM-to-FI links, servers in slots 1, 3, 5, 7 use link
number 1 while other slots use link number 2
4 links
How to read this: with four IOM-to-FI links, servers in slots 1 and 5 use link 1,
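The cases spelled out above all fit a simple round-robin rule: the slot number, counted from zero, modulo the number of links. The sketch below encodes that rule. For four links the slide text only states the pinning of slots 1 and 5, so extending the pattern to the remaining slots is an assumption, and the function name is hypothetical.

```python
def pinned_link(slot: int, links: int) -> int:
    """Return the IOM-to-FI link (1-based) that a blade slot is pinned to.

    Assumption: round-robin pinning, which matches every case stated in
    the text (1 link, 2 links, and slots 1 and 5 with 4 links).
    """
    if links not in (1, 2, 4):
        raise ValueError("expected 1, 2, or 4 IOM-to-FI links")
    if not 1 <= slot <= 8:
        raise ValueError("blade slots are numbered 1 through 8")
    return (slot - 1) % links + 1


for links in (1, 2, 4):
    mapping = {slot: pinned_link(slot, links) for slot in range(1, 9)}
    print(f"{links} link(s): {mapping}")
```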
[Figure: OS interfaces and Vifs. Labels: OS, veth0, veth1, vhba0, vhba1, IOM 1, IOM 2, IOM-to-FI link, Vif 1 and Vif 2 (Fabric A), Vif 3 and Vif 4 (Fabric B)]
Attaching to FEX
FarNorth-A# connect iom ?
  <1-255>  Chassis ID
FarNorth-A# connect iom 1
Attaching to FEX 1 ...
To exit type 'exit', to abort type '$.'
Bad terminal type: "xterm". Will assume vt100.
VIFs
Ethernet and FC are muxed on the same physical links
Concept of virtual interfaces (vifs) to split Eth and FC
Two types of VIFs: veth and vfc
veth for Ethernet; vfc for FC traffic
Will show pause frames and drops if looking for performance concerns
RMON
Stats
# IOM
connect local-mgmt <fabric>
connect iom <chassis_id>
terminal length 0
show platform software redwood sts
show platform software redwood oper
show platform software redwood log
show platform software redwood elog
show platform software redwood ilog
show platform software redwood ints
#Global Info
show clock
show platform fwm event-history errors
show platform fwm event-history msgs
show platform fwm errors
show system internal ethpm event-history errors
show system internal ethpm info trace
show system internal ethpm event-history msgs
show platform software sifmgr event-history errors
show platform software sifmgr event-history lock
show platform software sifmgr info trace
show platform software sifmgr event-history msgs
Troubleshooting Flow
We will work from Blade servers up toward LAN and SAN network
Start: Blades -> IOM Modules -> Fabric Interconnects -> LAN-SAN: End
Troubleshooting 10GbE: Transition States
DCBX Troubleshooting
Checking for DCBX negotiation results
In the dump of show platform software dcbx internal info interface ethernet 1/1/1, look
for every feature negotiation result as shown below
feature type 3 sub_type 0
feature state variables: oper_version 0 error 0 oper_mode 1 feature_seq_no 0 remote_feature_tlv_present 1 remote_tlv_not_present_notification_sent 0 remote_tlv_aged_out 0
feature register params max_version 0, enable 1, willing 0 advertise 1, disruptive_error 0 mts_addr_node 0x101 mts_addr_sap 0x1e5
Desired config cfg length: 1 data bytes:08
Operating config cfg length: 1 data bytes:08
Error
1) Indicates negotiation error.
2) Never expected to happen when connected to a CNA adaptor
3) Can happen when two N5Ks are connected back-to-back
4) If PFC is enabled on different CoS values, a negotiation error can happen
Operating Config
Indicates negotiation result
Absence of operating config indicates that the peer does not support this DCBX TLV, or a negotiation error
remote_feature_tlv_present indicates whether the remote peer supports this feature TLV or not
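The checks described above can be applied to a captured "feature state variables" line with a short script. This is a sketch only: it assumes the line is a run of space-separated name/value pairs after the colon, as in the sample dump, and the function names and finding messages are hypothetical.

```python
def parse_state_variables(line: str) -> dict:
    """Turn 'feature state variables: name value name value ...' into a dict.

    Assumption: names and integer values alternate, separated by spaces.
    """
    _, _, body = line.partition(":")
    tokens = body.split()
    return {name: int(value) for name, value in zip(tokens[::2], tokens[1::2])}


def diagnose(state: dict, has_operating_config: bool) -> list:
    """Apply the three checks described in the notes above."""
    findings = []
    if state.get("error", 0) != 0:
        findings.append("negotiation error")
    if state.get("remote_feature_tlv_present", 0) == 0:
        findings.append("remote peer does not present this feature TLV")
    if not has_operating_config:
        findings.append("no operating config: peer lacks this DCBX TLV or negotiation failed")
    return findings


sample = ("feature state variables: oper_version 0 error 0 oper_mode 1 "
          "feature_seq_no 0 remote_feature_tlv_present 1 "
          "remote_tlv_not_present_notification_sent 0 remote_tlv_aged_out 0")
state = parse_state_variables(sample)
print(diagnose(state, has_operating_config=True) or "negotiation looks healthy")
```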
SAN Troubleshooting