Hi guys! How is it going? Are you going to try the 32-core Threadripper 2? Hehe!
Our machine has been really good at work, and people really appreciate the performance upgrade. It's been running steadily 24/7 since April.
Last Friday two VMs crashed: one crashed only once, the other several times. The day before, I had tested an imported VM that caused memory overcommitment, so I stopped it. But the person who had the multiple crashes then reported another crash. So I did what I could inside their VM: I checked the Windows power settings and tested the RAM and hard drives, but found no problem. I also checked with BlueScreenView, and there was nothing.
I then moved their VM to another SSD. I'm currently waiting to see whether more crashes occur.
Do you think the overcommitment caused those crashes? The user reported that Windows was lagging before the crash.
If the crash occurs again, our next steps will be testing the host memory with Memtest86 and checking the S.M.A.R.T. data on the SSD and hard drives. ESXi reports a SMART temperature of around 70 °C for the SSD, which I don't think is correct.
2018-09-04T14:56:27Z smartd: [warn] t10.ATA_____KINGSTON_SKC400S37512G__________________50026B768200FD6C____: above TEMPERATURE threshold (72 > 30)
2018-09-04T14:56:27Z smartd: smartmgt: plugin /usr/lib/vmware/smart_plugins/libsmartnvme.so is already loaded
2018-09-04T14:56:27Z smartd: smartmgt: plugin /usr/lib/vmware/smart_plugins/libsmartmicron.so is already loaded
2018-09-04T14:56:27Z smartd: libsmartsata: is_ata_smart_device:5 buf[82]:1 rc:0
2018-09-04T14:56:27Z smartd: libsmartsata: is_ata_smart_enabled ata fd:5 val:1
2018-09-04T14:56:27Z smartd: libsmartsata: ATA SMART device vid:ATA KINGSTON SKC400SA pid:KINGSTON SKC400SA
2018-09-04T14:56:27Z smartd: libsmartsata: closing fd:5
2018-09-04T14:56:27Z smartd: [warn] t10.ATA_____KINGSTON_SKC400S37512G__________________50026B768200FB54____: above TEMPERATURE threshold (72 > 30)
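
In case it helps anyone comparing readings across drives, here is a minimal Python sketch that pulls the temperature warnings out of smartd log lines like the ones above. The regex is only inferred from the two `[warn]` lines in this post, and the function name and sample list are made up for illustration. It only extracts the reported value and threshold; it says nothing about whether the reading itself is accurate.

```python
import re

# Pattern inferred from the [warn] lines in this post (an assumption, not an
# official smartd log format):
# "<timestamp> smartd: [warn] <device>: above TEMPERATURE threshold (<value> > <limit>)"
WARN_RE = re.compile(
    r"^(?P<ts>\S+) smartd: \[warn\] (?P<device>\S+): "
    r"above TEMPERATURE threshold \((?P<value>\d+) > (?P<limit>\d+)\)"
)


def parse_temperature_warnings(lines):
    """Return (timestamp, device, value, limit) for each temperature warning."""
    results = []
    for line in lines:
        m = WARN_RE.match(line.strip())
        if m:
            results.append(
                (m["ts"], m["device"], int(m["value"]), int(m["limit"]))
            )
    return results


# Sample input: three lines copied from the log above.
SAMPLE = [
    "2018-09-04T14:56:27Z smartd: [warn] t10.ATA_____KINGSTON_SKC400S37512G__________________50026B768200FD6C____: above TEMPERATURE threshold (72 > 30)",
    "2018-09-04T14:56:27Z smartd: libsmartsata: closing fd:5",
    "2018-09-04T14:56:27Z smartd: [warn] t10.ATA_____KINGSTON_SKC400S37512G__________________50026B768200FB54____: above TEMPERATURE threshold (72 > 30)",
]

hits = parse_temperature_warnings(SAMPLE)
for ts, device, value, limit in hits:
    print(f"{ts} {device}: reported {value}, threshold {limit}")
```

Run against the full log, this makes it easy to see that the warning comes from two different device IDs (ending in FD6C and FB54), both reporting the same value.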