
<< Solved >> Hardware issues after KP. Corrupted BIOS or faulty PCIe slot?

Joined: Apr 27, 2020
Messages: 224
Motherboard: ASUS ProArt z790 Creator
CPU: i9-13900KF
Graphics: 4x RX 6900 XT
Mac: MacBook Pro
Classic Mac: iMac
Mobile Phone: iOS

So, I was ready to post photos and brag about my completed water-cooled Z790 ProArt Creator build with quad GPUs… but I’m having a strange recurring issue with one of the GPUs that I can’t seem to figure out.

One of the PCIe slots is bugging out after a kernel panic and that particular GPU now doesn’t work in either macOS or Windows. This happened a few weeks ago with another GPU, and I replaced it, thinking it was a hardware failure. But I don’t think it’s a hardware issue because the old GPU is working just fine in another system.

After the replacement everything was fine with the new GPU. However, today it happened again on the same slot, PCIEX16(G4) (x4). The machine KP’d when falling asleep while connected to an Apple Studio Display over TB3 (not sure if TB3 is relevant, as TB3 is working fine).

Neither macOS nor Windows would boot after the KP, but after unplugging the GPU and tinkering with BIOS settings I was able to get both OSes up and running again with the GPU connected. Both OSes can see the GPU, but something is wrong, and it’s not usable in either macOS or Windows.

In macOS it doesn’t show up in Activity Monitor or Geekbench. It DOES, however, show up in IOReg (although it does not load the full AMD driver tree), and it’s also visible in Hackintool and System Profiler. Even Radeon Gadget is able to pull temps from it.

In Windows I get error Code 12 (not enough resources/lanes available for the device) or Code 43 (device deactivated due to unspecified hardware issue), depending on what I tweak in the BIOS: PCIe bifurcation settings, per-slot PCIe Gen selection, PCIe Power Management and ASPM settings.

When Code 43 is thrown, I can force Windows to activate the GPU by using a third-party program called Driver Booster, but the system starts lagging, and I can see that every other CPU core is getting maxed out.

I’ve cleared the CMOS, reset NVRAM, re-installed the BIOS, and even flashed the VBIOS on the GPU itself, but nothing seems to be working.

I have tested without the other GPUs to see if it’s a PCIe lane allocation issue, but I don’t think that’s the case, as everything had been working fine before, and the chipset & CPU should have enough bandwidth.

I’m truly at a loss here. What. In. The. World…

Does anyone have any suggestions?
I think it’s an issue with the PCIe slot itself, or with the CPU…
I managed to get one of the KP reports. The error reads: “GDDR6 Long Training Failed !!!” See below:
Code:
panic(cpu 0 caller 0xffffff7f8d394c8a): "GDDR6 Long Training Failed !!!
" @AmdRadeonController.cpp:2079
Panicked task 0xffffff9a6075a478: 279 threads: pid 0: kernel_task
Backtrace (CPU 0), panicked thread: 0xffffff9a5fc37598, Frame : Return Address
0xffffffa1163878f0 : 0xffffff800046e4fd
0xffffffa116387940 : 0xffffff80005c30a4
0xffffffa116387980 : 0xffffff80005b2b29
0xffffffa1163879e0 : 0xffffff800040e951
0xffffffa116387a00 : 0xffffff800046e7dd
0xffffffa116387af0 : 0xffffff800046de87
0xffffffa116387b50 : 0xffffff8000bdc34b
0xffffffa116387c40 : 0xffffff7f8d394c8a
0xffffffa116387d50 : 0xffffff7f8d34a730
0xffffffa116387d70 : 0xffffff7f8d34a5c1
0xffffffa116387da0 : 0xffffff8000aeb7a1
0xffffffa116387e00 : 0xffffff8000aeb31a
0xffffffa116387ec0 : 0xffffff8000aea3cf
0xffffffa116387f20 : 0xffffff8000aed55a
0xffffffa116387fa0 : 0xffffff800040e19e
      Kernel Extensions in backtrace:
         com.apple.kext.AMDRadeonX6000Framebuffer(4.1.2)[C2C59945-AFF0-33C6-BB68-6FF2576066AE]@0xffffff7f8d346000->0xffffff7f8d5cffff
            dependency: com.apple.AppleGraphicsDeviceControl(7.1.18)[B22B74AE-08E9-3D23-8F7A-EAD3C39EE7AD]@0xffffff7f95103000->0xffffff7f95106fff
            dependency: com.apple.iokit.IOACPIFamily(1.4)[D342E754-A422-3F44-BFFB-DEE93F6723BC]@0xffffff8002a21000->0xffffff8002a22fff
            dependency: com.apple.iokit.IOGraphicsFamily(597)[718E01CF-8B05-3042-88F4-DE3441395D00]@0xffffff7f95c71000->0xffffff7f95c9ffff
            dependency: com.apple.iokit.IOPCIFamily(2.9)[83895531-9463-398B-B769-64D4E50936C3]@0xffffff8002e91000->0xffffff8002ec2fff

Process name corresponding to current thread (0xffffff9a5fc37598): kernel_task
Boot args: -x -v debug=0x100

Mac OS version:
22F82

Kernel version:
Darwin Kernel Version 22.5.0: Thu Jun  8 22:22:22 PDT 2023; root:xnu-8796.121.3~7/RELEASE_X86_64
Kernel UUID: B82210B0-6371-3C15-8D2B-47C6E1FB7879
roots installed: 0
KernelCache slide: 0x0000000000000000
KernelCache base:  0xffffff8000200000
Kernel slide:      0x00000000000dc000
Kernel text base:  0xffffff80002dc000
__HIB  text base: 0xffffff8000100000
System model name: iMacPro1,1 (Mac-7BA5B2D9E42DDD94)
System shutdown begun: NO
Panic diags file available: YES (0x0)
Hibernation exit count: 0

System uptime in nanoseconds: 40225468033
Last Sleep:           absolute           base_tsc          base_nano
  Uptime  : 0x00000009615d2d7d
  Sleep   : 0x0000000000000000 0x0000000000000000 0x0000000000000000
  Wake    : 0x0000000000000000 0x0000000f3619508b 0x0000000000000000
Compressor Info: 0% of compressed pages limit (OK) and 0% of segments limit (OK) with 0 swapfiles and OK swap space
Zone info:
  Zone map: 0xffffff80c2147000 - 0xffffffa0c2147000
  . PGZ   : 0xffffff80c2147000 - 0xffffff80da148000
  . VM    : 0xffffff80da148000 - 0xffffff85a347b000
  . RO    : 0xffffff85a347b000 - 0xffffff873bae1000
  . GEN0  : 0xffffff873bae1000 - 0xffffff8c04e14000
  . GEN1  : 0xffffff8c04e14000 - 0xffffff90ce147000
  . GEN2  : 0xffffff90ce147000 - 0xffffff959747a000
  . GEN3  : 0xffffff959747a000 - 0xffffff9a607ad000
  . DATA  : 0xffffff9a607ad000 - 0xffffffa0c2147000
  Metadata: 0xffffffa0c6157000 - 0xffffffa0e6157000
  Bitmaps : 0xffffffa0e6157000 - 0xffffffa0f6157000
  Extra   : 0 - 0
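
For anyone skimming the report above, here is a minimal Python sketch that pulls out the three lines I found most useful: the panic message, the source location, and the uptime. The excerpt is pasted from the log; the function name `summarize_panic` and the regexes are just my own illustration, not part of any Apple tool.

```python
import re

# Excerpt copied from the panic report above.
PANIC_EXCERPT = '''panic(cpu 0 caller 0xffffff7f8d394c8a): "GDDR6 Long Training Failed !!!
" @AmdRadeonController.cpp:2079
Boot args: -x -v debug=0x100
System uptime in nanoseconds: 40225468033
'''

def summarize_panic(text):
    """Return the panic message, source location and uptime (seconds) from a panic report.

    Hypothetical helper: any field that cannot be found is returned as None.
    """
    # The quoted message can span a line break, so let '.' match newlines.
    msg = re.search(r'panic\(cpu \d+ caller 0x[0-9a-f]+\): "(.*?)"', text, re.S)
    # Source location looks like "@File.cpp:1234".
    src = re.search(r'@(\S+):(\d+)', text)
    up = re.search(r'System uptime in nanoseconds: (\d+)', text)
    return {
        "message": msg.group(1).strip() if msg else None,
        "source": f"{src.group(1)}:{src.group(2)}" if src else None,
        "uptime_s": int(up.group(1)) / 1e9 if up else None,
    }

if __name__ == "__main__":
    print(summarize_panic(PANIC_EXCERPT))
```

On this report it gives the message "GDDR6 Long Training Failed !!!", the source AmdRadeonController.cpp:2079, and an uptime of roughly 40 seconds.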
 
SOLVED: I think the KP somehow corrupted the VBIOS in the GPU. I managed to flash the GPU again with a different tool and it's working again. The .rom that I flashed into the GPU was the same as the old one (same md5 checksum), so idk how or why it worked. Maybe it just "jolted" the motherboard into cooperating once again, who knows.
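
If anyone wants to double-check that two .rom dumps really are identical, here is a minimal Python sketch of the md5 comparison. The function names and the file names in the usage line are placeholders I made up; this only compares files on disk and says nothing about how the flashing tool writes them to the card.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_rom(path_a, path_b):
    """True if both files have the same md5 checksum."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Usage would be something like `same_rom("old.rom", "new.rom")`, with the two paths replaced by wherever the dumps are saved.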
 