Anyone run a whitebox ESXi Server for Homelab? Purple Screen with Zen2

mda

Hi All,

I'm getting purple screens when doing basic IO on my whitebox ESXi server.

The machine is for testing with the following specs:

3700X
X470 CH7 WiFi
4x16GB DDR4 3200 @ XMP
GT710 Passively cooled
NVME WD Blue SN550 1TB (Windows Install)
SATA Devices all running off the MB SATA ports.
1x WD Blue 2TB HDD 2.5"
1x Samsung PM861 SSD 960GB
1x Samsung 860 EVO 2TB
1x Samsung 860 EVO 500GB
1x WD Black 2TB HDD
1x Seagate Barracuda 3TB HDD

Latest ASUS Bios, 4603

ESXI 6.7 Update 3, Free
Patched to the November 2021 Patch

I'm encountering purple screens / kernel panics when:

Copying VM folders over data stores (1 500GB VM, 1 700GB VM, 1 30GB VM).

These VMs are verified to be working (these are copies of our production server VMs at my work) and I scp-ed them into this box.

I was copying these to a different drive with the intention of playing with the new copies and leaving the original VM intact just in case I screw up whatever I'm doing.

I will get purple screens when this happens:
#PF Exception 14 in world xxxxx

The machine is Windows stable, memtest stable and testmem5 (windows) stable. Not sure what's really going on here. Anyone run a similar setup and is doing ok?

I'll try downgrading the BIOS later this evening. Really unsure what is the problem...

Thanks!

Edit: Added a picture just in case someone can decipher this. Sorry for the quality and the bad angle, this picture was never meant for public consumption
 

Attachments

  • esxi.jpg
It's an ESXi box, but what's this "NVME WD Blue SN550 1TB (Windows Install)"?
If you make another test VM, from scratch, then try to move it, does the same thing happen?
 
It's an ESXi box, but what's this "NVME WD Blue SN550 1TB (Windows Install)"?
If you make another test VM, from scratch, then try to move it, does the same thing happen?
Just realized last night after another PSOD that the error came around when I was doing IO on the datastores. Not much info past that yet.

The NVME drive is just my Windows 10 boot drive / install, since the machine used to be a gaming PC. I sold the GTX 1070 Ti that was in it and put in a GT 710 in the meantime, since it's mainly going to be doing ESXi work :)

Thanks for the reply, will try your suggestion with a bigger VM just to simulate. Also working tonight on extracting and reading the log as per here
 
Try changing the storage controller mode, but that may require reinstalling. I'm seeing VSCSIPoll near the top, and in my experience that's normally where it lists the failing driver.
 
These VMs are verified to be working (these are copies of our production server VMs at my work) and I scp-ed them into this box.
You copied live, work VMs to your home, unsecured system? Yikes.
 
You copied live, work VMs to your home, unsecured system? Yikes.
One baby step at a time! I doubt anyone would really want a 300GB database though, over our crappy internet connection. ;)

Try changing the storage controller mode, but that may require reinstalling. I'm seeing VSCSIPoll near the top, and in my experience that's normally where it lists the failing driver.

Thanks. I have little experience in really troubleshooting ESXi apart from running it. Will give it a look.

--

Just to update, a BIOS rollback didn't do much. Will really need to work on dumping those logs.

EDIT: Apparently ESXi doesn't dump logs to a file by default. I have to configure this? :/
 
Does it only happen when copying the VM folders to another drive?
Why not just put all the VMs onto one large NVMe drive?
Or use snapshots instead of copying the whole folder?
 
Apparently, I get PSODs while idle too. So at this point I'm not super sure.

Downgrading to ESXi 6.5 didn't seem to help (also with the latest patch)

I'm doing a little digging in the vmkernel log and, based on my limited understanding of reading it, ESXi may not like either my WD NVME or my WD SATA 2.5" HDD. I've pulled those in the meantime just to check, so I'll be trying again today/tomorrow with another extended copy session.
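
For anyone wanting to do the same kind of digging, here is a minimal Python sketch for skimming a vmkernel log that has been copied off the host. It only assumes the log is plain text, one entry per line. The keyword list and the function names are hypothetical, picked for illustration; swap in whatever devices or drivers you suspect.

```python
import sys
from collections import Counter

# Hypothetical keywords, chosen for illustration only; adjust to your hardware.
SUSPECT_KEYWORDS = ("nvme", "ahci", "vscsi")


def tally_suspects(lines, keywords=SUSPECT_KEYWORDS):
    """Count how many log lines mention each keyword (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw] += 1
    return counts


def lines_mentioning(lines, needle):
    """Return the lines that contain `needle`, ignoring case."""
    needle = needle.lower()
    return [ln.rstrip("\n") for ln in lines if needle in ln.lower()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path is supplied by the user, e.g. a log file copied off the host.
    with open(sys.argv[1], errors="replace") as fh:
        log_lines = fh.readlines()
    for kw, n in tally_suspects(log_lines).most_common():
        print(f"{kw}: {n} lines")
```

A high count for one keyword doesn't prove that device is at fault; it just tells you where to start reading. `lines_mentioning(log_lines, "exception")` would then pull out the lines around a crash for a closer look.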

EDIT after a day:

Still no go, purple screen persists.

Moved all the HDD/SSDs into a 5800X/X570 with the same memory and GPU just to rule out ESXI just not liking my 3700X/CH7 board, and the experiment continues!

If it purple-screens again, the next thing to be tested will be a separate set of memory.

EDIT2:

So far, it seems that the 5800X / X570 + rest of the hard drives combo has survived overnight without purple screening. Not sure how it can be the 3700X+X470, but apparently that does seem to be the case :/
 
So after 1.5 days of uninterrupted uptime, it looks like ESXi just doesn't like either my Crosshair VII board or my 3700X. I find this very odd. Just putting this out there just in case anyone else finds it useful.

Thanks for the ideas guys!
 
One baby step at a time! I doubt anyone would really want a 300GB database though, over our crappy internet connection. ;)
Just the usernames, passwords, names and addresses, oh and credit card numbers. They don't want the rest of your silly database 😁
 