PSOD
Welcome to my new page on my blog. It’s called PSOD, which stands for “Purple Screen of Death”. This is what you see when the ESX VMkernel experiences a critical failure.
They are quite rare; I’ve only personally experienced three since I started working with VMware in January 2004.
Anyway, as a light-hearted view of the product, I thought it would be fun and amusing to collect a “rogues’ gallery” of PSODs, starting with my very own, including the build, cause and resolution.
Note:
Some of the images are small and difficult to see. To see more detail, you can right-click and save the images to your desktop for larger, offline viewing.
Submitted by: Christian Heigl
Date: 29 May 2009
Hardware: FSC RX600 S4, 4 Quad-Cores, 128GB Memory
Build: ESX 3.5, Update 2
Cause: Defective CPU 0
Resolution: Replace CPU
Submitted by: Anon
Hardware: Dell PowerEdge 6850
Build: ESX 3.0.2, build 52542
Cause: As yet unknown (currently under investigation), but this looks like an HBA or switch failure, as the host cannot find paths to the LUN
Resolution: Reboot, and hope it doesn’t happen again! It could be something to do with the wrong multipathing policy being used

Submitted by: Patrick van Rantwijk, CDG Europe B.V.
Date: 11 Apr 2008
Hardware: HP DL385
Build: ESX 3.5.0, build 64607
Cause: Memory chip failure?
Resolution: Replace faulty memory?

Submitted by: Mike Laverick
Date: 29 Oct 2007
Build: ESX 3.0.2 build 52797
Hardware: ProLiant DL385 G1
Cause: Unknown. I had a VM with a very full disk (stored on my SAN). I was searching for big or malicious files because the disk was unusually full for the VM’s purpose (it’s a domain controller). Perhaps this increased the I/O in the VM to such a degree that I got disk timeouts on the SAN. My SAN is Jurassic and only runs at 1Gb…
Resolution: Shut down the ESX host and power it on again

Submitted by: Rene Helweg
Date: 2 Oct 2007
Build: ESX 3.0.2 build 52797
Hardware: ProLiant DL380 G3
Cause: Unknown. Memory?
Resolution: After a reboot the machine is still running…

Submitted by: Wouter Wolkers
Date: 19 Sept, 2007
Build: 3.0.2 fully patched as of today
Hardware: Dell PowerEdge 1950
Cause: unknown
Resolution: Well… it’s crashing at boot, 100% reproducible… VMware is currently blaming the disk configuration, so I’m running a hardware test. If that gives no errors, I’ll go back to contacting VMware to get this machine back up. Might have to reinstall ESX though.

Submitted by: Wouter Wolkers
Date: 19 Sept, 2007
Build: 3.0.2 fully patched as of today
Hardware: Dell PowerEdge 1950
Cause: Hardware failure
Resolution: Replace CPUs and motherboard

Submitted by: Wouter Wolkers
Date: 19 Sept, 2007
Build: ESX 3.0.0 build 27701 unpatched
Hardware: Dell PowerEdge 2950
Cause: unknown?
Resolution: Rebooted. Appears to be caused by the VMs that were still running on local storage?

Submitted by: Sandy Bryce
Date: 15th Jan, 2007
Hardware: Dell
Build: 3.0.1
Cause: The ESX server PSOD’d during a rescan of an HBA connected to a new SAN setup. The server was a fresh build with a QLogic 2432 HBA dual-connected to the SAN fabric. Great PSOD, with a second “COS Error: Oops” in a matter of weeks
Resolution: Unknown

Submitted by: Patrick van der Veen
Date: 8th Jan, 2007
Hardware: HP ProLiant DL385
Build: 3.0.0
Cause: Probably memory. See the notes on ESX patch 2066306:
- Virtual machines experiencing high CPU load during a VMotion migration can hang after the migration is complete.
- Virtual machines can crash during a NUMA migration due to memory allocation failures.
- Kernel memory can become corrupted, resulting in a kernel crash when using 64-bit guest operating systems in virtual machines on ESX Server hosts with AMD processors.
- http://www.vmware.com/support/vi3/doc/esx-2066306-patch.html
Resolution: Reboot. During a maintenance window I relocated the memory, and it seems to be solved…

Submitted by: Mike Laverick
Date: 13 December, 2006
Hardware: HP DL 385
Build: 3.0.0
Cause: iSCSI PSOD caused by a faulty NIC used with the Software Initiator. If you look closely, it has a great message: “Console Oops!”
Resolution: Changed the NIC behind the VMkernel storage port group


Submitted by: Patrick van der Veen
Date: 15 November, 2006
Hardware: DL360G4p
Build: 3.0.0
Cause: PSOD happened during an upgrade from VMFSv2 to VMFSv3
Resolution: Reboot. No time to try the auxiliary file system driver

Submitted by: Andrew Hancock
Date: 13 November, 2006
Hardware: HP ProLiant 5500R
Build: 2.5.4 Build 32233
Cause: ESX server deliberately crashed to generate a PSOD and core dump so VMware Engineering could debug a faulty e1000.o driver in 2.5.4 Build 32233 and later.
Resolution: The intention was to generate a PSOD. Don’t deliberately PSOD an ESX server. (This was not a production system!)

Submitted by: Andrew Hancock
Date: 17 May, 2006
Hardware: HP ML530G2
Build: 2.5.3 Build 22981
Cause: Removing an iSCSI LUN from an ESX 2.5.3 host server whilst a guest VM was running.
Resolution: Don’t use unsupported drivers in ESX!

Submitted by: Andrew Wafaa
Date: Oct, 2006
Hardware: HP DL585 G1, 4 x single-core AMD 2.2GHz, 20GB RAM, QLogic HBAs, Intel PRO/1000 NICs
Build: Upgrade from ESX 2.5.3 to 2.5.4
Cause: Caused by powering on too many VMs simultaneously?
Resolution: Use auto-start/stop with a gap between each VM power-on?

Submitted by: Serge Berserik
Date: Sept, 2006
Hardware: HP DL360, 2GB RAM
Build: ESX 3.0
Cause: Failure of RAM
Resolution: Remove and replace bad RAM

Submitted by: Stephan Gehring
Date: May, 2006
Hardware: Several IBM xseries servers, 2x Intel Xeon CPU, 4GB RAM, QLogic HBA, Intel NICs
Build: ESX 2.5.x
Cause: Having a bunch of students playing around with dd on the Service Console.
Resolution: Don’t give your students too much time to play

Submitted by: Sven Jenautzke
Date: August, 2006
Hardware: HP DL385 G1, 2 x dual-core Opterons, 16GB RAM
Build: ESX 3.0.0
Cause: The HA agent has a problem with FQDNs longer than 29 characters and ends up eating all the Service Console memory
Resolution: Add more memory for the Service Console, expand swap, or change the hostname (your choice… *g*)
VMware: View Document
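
If you want to check your own hostnames against that limit, here is a minimal Python sketch. The function name and the sample hostnames are hypothetical, made up purely for illustration. The only thing taken from the entry above is the 29-character threshold, which should be treated as the submitter's report rather than a documented limit for every build.

```python
# Threshold as reported in the entry above: FQDNs longer than 29 characters.
# This is an assumption taken from the submission, not a documented constant.
MAX_FQDN_LENGTH = 29


def fqdns_over_limit(fqdns, limit=MAX_FQDN_LENGTH):
    """Return the FQDNs (whitespace-stripped) that are longer than `limit`."""
    cleaned = (name.strip() for name in fqdns)
    return [name for name in cleaned if len(name) > limit]


if __name__ == "__main__":
    # Hypothetical hostnames, purely for illustration.
    hosts = [
        "esx01.example.com",
        "esx-cluster-node-01.datacenter.example.com",
    ]
    for name in fqdns_over_limit(hosts):
        print(f"{name} is {len(name)} characters long")
```

Anything the script prints is a candidate for the "changing hostname" option in the resolution above.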

Submitted by: Mike Laverick
Date: Feb, 2006
Hardware: Dell 1650 PIII, circa the “Jurassic Era” of 1998
Build: ESX 3.0 Release Candidate 1
Cause: Setting up a shared SCSI bus that was not correctly terminated and had non-unique SCSI IDs
Resolution: None, except remove cables – couldn’t change the SCSI ID on Adaptec Controller?

Submitted by: Mike Laverick
Date: Dec, 2005
Hardware: Not sure, definitely an IBM server
Build: ESX 2.5.0
Cause: SAN cables removed – no redundant fabric
Resolution: Plug cable back in, reboot, resolve not to be so silly in future!








