Hello! I'm having a really annoying problem with my new server: the ESXi host freezes
at random, after 14 days, 1 day, 2 days, 5 days, etc.
It becomes unresponsive to keystrokes, both at the server console directly
and via the IPMI Java interface. All VMs die, and the only thing I can do is
restart it via IPMI, after which it runs for a couple of days again.
I was never able to reproduce the crash by stressing the box in any way. It just seems to die
while idling.
My old server used the same parts except for the CPU, RAM and motherboard (an E3-1245 v3 + ASRock Z87 Extreme4 and 16 GB Kingston HyperX Black memory), and it worked
flawlessly for 1½ years with the same PSU, IBM M1015 and chassis.
Now it seems I've reached a dead end trying to find the problem, and I badly need your help.
Hardware:
CPU: Xeon E5 2.2 GHz (2.5 GHz Turbo), 10-core, socket 2011-3 (engineering sample)
RAM: KINGSTON KVR21R15S4K4/32 KIT (4x8GB)32GB 2133MHz DDR4 ECC Reg CL15 DIMM SR x4 w/TS
MB: Supermicro X10SRH-CLN4F
Drives connected to the mainboard: Kingston V300 120 GB, Samsung 830 256 GB
PSU: Tagan 2-Force II Series 600W, ATX12V
PCI-E: 3x IBM M1015 in IT mode with 15x 3 TB drives (I have not tested the built-in SAS controller yet)
Chassis: NORCO RPC-4224 4U
Booting from a 16 GB USB stick.
Here is my "what I have tested so far" log:
* 28 h of memtest86, no problems detected (ECC memory probably doesn't work that well with memtest).
* 6 h of CPU + memory stress with an x264 encode, works like a charm. I have run it at full load
at random times and the box never crashed under full load.
* Disabled interrupt remapping: # esxcli system settings kernel set --setting=iovDisableIR -v TRUE
* Upgraded to 5.5 U2.
* Disabled the CoD (Cluster-on-Die) CPU setting and switched to Home Snoop in the BIOS settings.
* Changed all VM NICs that were set to e1000 to vmxnet3.
* Ran ESXi 5.5 (standard version) and the VMs from an SSD carrying my old setup from the E3-1245 v3, instead of from the Kingston 16 GB G4 USB stick.
* Installed the latest patches:
VMware_bootbank_esx-base_5.5.0-2.62.2718055.vib
VMware_bootbank_net-ixgbe_3.7.13.7.14iov-12vmw.550.2.62.2718055.vib
VMware_bootbank_misc-drivers_5.5.0-2.62.2718055.vib
VMware_locker_tools-light_5.5.0-2.62.2718055.vib
-- Ran for 13 days, then crashed.
18:36 2015-06-09 Removed 2x 8 GB RAM.
Changed from a shielded to an unshielded network cable (desperate).
Disconnected the USB reader (desperate).
Disconnected the monitor's DVI cable (desperate).
* Ran for 5 days until it crashed.
13:07 2015-06-14 Removed the RAM and switched to the other 2x 8 GB DDR4 sticks.
Moved one of the three IBM M1015 cards from PCIe slot 8 to slot 2.
* Ran for 12 days until it crashed.
18:31 2015-06-26 Just saw that a new BIOS firmware had finally been released, so I updated to BIOS File Name:
X10SRH-CLN4F X10SRH5_518.zi BIOS Revision: R 1.0b
Note: I just noticed that my KINGSTON KVR21R15S4K4/32 KIT (4x8GB)32GB 2133MHz DDR4 ECC Reg CL15 DIMM SR x4 w/TS
modules are recognized as SAMSUNG in the BIOS. Could this be a compatibility issue?
* Also re-inserted all 4 RAM sticks.
19:12 2015-06-29 Ran for 3 days, then crashed. Any more suggestions?
I'm running out of ideas. Could anyone please help me out?
Which log files do I need to attach?
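For anyone who wants to double-check the run times between the dated entries in the log above, here is a minimal Python sketch. It assumes each dated entry was written shortly after a crash and restart, so the gap between two entries roughly equals one uptime period. The names `events` and `uptimes` are made up for this illustration.

```python
from datetime import datetime

# Timestamps copied from the dated log entries (format: HH:MM YYYY-MM-DD).
events = [
    "18:36 2015-06-09",
    "13:07 2015-06-14",
    "18:31 2015-06-26",
    "19:12 2015-06-29",
]

def uptimes(stamps):
    """Return the timedelta between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(s, "%H:%M %Y-%m-%d") for s in stamps]
    return [later - earlier for earlier, later in zip(parsed, parsed[1:])]

# Assumption: each entry marks a restart, so each gap approximates one uptime.
for start, delta in zip(events, uptimes(events)):
    print(f"from {start}: {delta.total_seconds() / 86400:.1f} days")
```

The gaps come out close to the "5 days", "12 days" and "3 days" noted by hand, which suggests the crashes do not follow a fixed interval.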