In order of concern:
Power supply is glitching?
Thermals are a serious concern for your kit under load. You can run the CPU steady near 100C with no hazard, but its heat combined with heat from the board FETs that drive it can lead to RAM and GPU problems at full load. It can take a while, a half-hour or more under load, for heat to build to a critical level for surrounding components, e.g. the memory controller and RAM. Something that gets overlooked is proper case airflow beyond the CPU/GPU cooling. If memtest is passing but it's croaking under load, power and thermals are indicated. Just because the CPU temps are nominal doesn't mean other components aren't being stressed.
Do you have temperature monitoring for surrounding components?
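If you can get readings for anything near the CPU (board, VRM, DIMM sensors), the slow heat build-up described above can be checked with a few lines of Python. This is only a rough sketch: it assumes you have already exported (minutes, degrees C) samples from whatever monitoring tool you use, and the function names, the 10-minute window, and the 0.1 C/min threshold are all my own placeholders, not values from any spec.

```python
from statistics import mean

def drift_per_minute(samples):
    """Least-squares slope of temperature (deg C) against time (minutes)."""
    times = [t for t, _ in samples]
    temps = [c for _, c in samples]
    t_bar, c_bar = mean(times), mean(temps)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        raise ValueError("need samples at two or more distinct times")
    return sum((t - t_bar) * (c - c_bar) for t, c in samples) / denom

def still_heating(samples, window_min=10.0, threshold=0.1):
    """True if the most recent window still trends upward faster than threshold.

    window_min and threshold are illustrative assumptions; tune them to taste.
    """
    last_t = samples[-1][0]
    recent = [(t, c) for t, c in samples if t >= last_t - window_min]
    return drift_per_minute(recent) > threshold
```

The point is to look at the trend late in a long load run, not the peak value: a component that is still climbing after 30 minutes has not reached equilibrium yet, even if its current reading looks acceptable.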
Resetting BIOS and any OC config is worthwhile. Retrain if using Asus AI.
Backing off RAM from XMP rating by 200MHz is worth a try.
Re-seating all components and connectors, including the CPU in its socket, is worthwhile. Check that everything is clean.
Something to keep in mind: Are you in the northern hemisphere? Do you make adjustments to your kit and handle the parts without static-safe measures? Winter with very dry air can create static hazards that wreck components, where other seasons and conditions present no problems. The designs tolerate a lot but aren't perfect about this.
It could also be bad luck re any component.
Re GPU reset in log, that seems like a significant clue, but not conclusive due to interdependencies.
Did you update macOS recently? I mention this because we are at a point where any update from 12 on is as likely to wreck a config as to improve it. There's no reason to expect that updates will fix anything anymore, because the range of Intel hardware that macOS is qualified on keeps narrowing relative to the wide spectrum of one-off hackintosh configs. The generation gap is growing with every update.
Best regards
So my PSU is a Seasonic Platinum 1300W. I picked it up last year because at full load I was drawing 730W from the wall and my Corsair 850W Gold wasn’t handling it well.
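For a rough sense of headroom, the wall figure can be converted to a share of each unit's rating. A PSU's wattage rating refers to its DC output, so the wall reading has to be scaled by efficiency first. The sketch below assumes 90% efficiency for both units, which is a ballpark guess and not a measured value for either PSU; the function names are made up for illustration.

```python
def dc_load_watts(wall_watts, efficiency):
    """Estimate DC output power from AC draw measured at the wall."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return wall_watts * efficiency

def load_fraction(wall_watts, rated_watts, efficiency=0.90):
    """Fraction of the PSU's rated output in use (0.90 is an assumed efficiency)."""
    return dc_load_watts(wall_watts, efficiency) / rated_watts

if __name__ == "__main__":
    for rated in (850, 1300):
        print(f"{rated} W unit: {load_fraction(730, rated):.0%} of rating")
```

Under that assumption, 730W at the wall works out to roughly 77% of the 850W unit's rating and roughly 51% of the 1300W unit's. This is a steady-state estimate only; it says nothing about short transient spikes, which a wall meter won't show.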
BIOS had been reset recently and I never use the ASUS AI (my BIOS mimics all of the @kgp settings from the Mojave X299 guide, except the Thunderbolt settings, which have been taken from @djlild7hina via the GitHub page).
The reason I’m suspecting the GPU is that before this RAM slot died, I was having issues getting the machine to POST with the GPU installed. I had to do this insane workaround where I was flashing the BIOS with the card inserted, then resetting the CMOS with the card removed, just to get it to POST.
It was spitting code 62 with the VGA LED lit. But I finally got the computer up and running by resetting the CMOS repeatedly, which is super annoying.
But then on Friday, it froze while doing nothing intensive and wouldn’t come back. I noticed code bd, and that’s a RAM issue. Downclocking got it to boot, but I didn’t like the error, so over the weekend I painstakingly tested all 8 sticks of RAM, one at a time.
They all passed, but when I went to put them all back in, bank DIMM_B2 wouldn’t even register a stick being there, no matter which stick it was. So I’m using 112 GB at the moment and it’s fine, even faster and more responsive than it was previously, which raises the question: how long has that slot been on its way out?
In terms of thermals, I have a Corsair H150i with 3 Noctua NF-12s, plus 2 front case fans and a rear case fan, all of them clean. All the filters on this Fractal Design case are clean too. It idles around 31 Celsius with everything closed, and at load it doesn’t throttle terribly. So thermals and power draw are fine. The GPU even sits around 60 Celsius under load.
I don’t have any RAM temp monitoring.
In terms of handling equipment, I’m one of the few people who uses an anti-static strap because as annoying as it is, I don’t want to mess anything up. Need this machine to work lol
Given the GPURestart log and the issues I had getting the machine to POST with it, I think I’m answering my own question, but it’s still worth asking and I appreciate the time taken to respond. I’m almost done with this memtest and so far no errors have come up. I started the test after that crash, following a soft reboot.