
Gigabyte X299X - Catalina Support

Status
Not open for further replies.
After spending almost all of my very limited free time over the last six months working on debugging this motherboard's notorious hard reset → boot failure issue, I am happy to be able to finally announce that I have identified the apparent root cause of this issue, reported it to Gigabyte, and developed a prototype modified version of OpenCore with a workaround for this issue.




The root cause of this issue is a rather nasty bug in the motherboard's UEFI firmware. Long story short, any attempt to connect drivers to a specific child handle created by the AMI Generic LPC Super I/O Driver (at PciRoot(0x0)/Pci(0x1F,0x0)/Acpi(PNP0303,0x0)) while booted normally will (re)enable the ICC/OC Watchdog Timer (WDT) with an 8 second timeout.
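For anyone unfamiliar with the text form of UEFI device paths, here is a rough sketch of how the path above breaks down into nodes. The parser and its names are purely illustrative (this is not firmware or OpenCore code), and the reading of the nodes in the comments is my own: device 0x1F function 0x0 is where Intel PCHs normally expose the LPC bridge, and PNP0303 is the standard ACPI ID for a PS/2 keyboard controller.

```python
import re

# Matches one text device path node such as "Pci(0x1F,0x0)".
NODE_RE = re.compile(r"^(\w+)\(([^()]*)\)$")


def parse_device_path(text):
    """Split a UEFI text device path into (node_type, [args]) tuples."""
    nodes = []
    for part in text.split("/"):
        match = NODE_RE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognised device path node: {part!r}")
        name, raw_args = match.groups()
        args = [a.strip() for a in raw_args.split(",")] if raw_args else []
        nodes.append((name, args))
    return nodes


PROBLEM_PATH = "PciRoot(0x0)/Pci(0x1F,0x0)/Acpi(PNP0303,0x0)"
nodes = parse_device_path(PROBLEM_PATH)

# Pci(Device,Function): device 0x1F, function 0x0 (the LPC bridge, by my reading).
pci_device, pci_function = (int(a, 16) for a in nodes[1][1])

# Acpi(HID,UID): PNP0303 is the PS/2 keyboard controller ID.
acpi_hid = nodes[2][1][0]
```

So the problematic handle is the child that the Super I/O driver creates for the PS/2 keyboard controller behind the LPC bridge.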

Rather oddly, attempting to manually reload or disable the WDT via direct register writes (or via calling gBS->SetWatchdogTimer (0, 0, 0, NULL);) will not work at this point. I suspect that there is probably some other mechanism at play here in addition to the WDT that is responsible for the timed forced reset.

Anyways, this ICC/OC Watchdog Timer (WDT) is a watchdog timer provided by the Intel X299 PCH that is primarily used to recover from an unstable overclock and other similar situations. There are a great many technical details regarding the different ways of using this timer, how it works, and how it interacts with various parts of the UEFI firmware, but for our purposes, all you really need to know is this: if the WDT is enabled and the timeout expires, the computer will immediately shut down, then start back up with a number of overclocking-related settings reset/overridden to alternate settings (these are basically safe defaults).

Note that this does not actually modify the setup settings, which is why you will see a discrepancy between what is shown in the UEFI Setup and what is actually active for various overclocking-related options until you re-apply the settings via "save & exit" and return to a normal boot.

Anyways, at this point, you will be shown the "Boot failure detected" warning dialog, at least on F3b and below (F3c is a whole different dumpster fire that I won't go into). If you enter setup and boot via boot override, you will be operating with the alternate settings; this is why you will e.g. lose your XMP profiles or your CPU overclocks when booting this way.

Until you re-apply settings via "save & exit", you will remain in this "boot failure" mode (I believe the generic AMI term for this may be "safe mode"), even after rebooting. This is why you will continue to encounter the "Boot failure detected" warning dialog on every boot if you keep booting by entering setup and booting via boot override.




Now here's the incredibly weird thing. When you're in this "boot failure" mode, attempts to connect drivers to that specific problematic child handle created by the AMI Generic LPC Super I/O Driver do not cause any issues. I don't have any good explanation for this difference in behavior. My best guess is that it's due to some sort of difference in the firmware initialization between the two cases interacting with poorly written UEFI driver and/or application code, but I can't really make an informed guess, and honestly, even then I am reaching.
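To summarize the behavior described so far, here is a toy state model. It only encodes what I have observed from the outside; the class and method names are made up for illustration, and the assumption that a plain reboot clears a pending watchdog is mine. The real firmware logic is unknown (that is the whole problem).

```python
from dataclasses import dataclass

PROBLEM_PATH = "PciRoot(0x0)/Pci(0x1F,0x0)/Acpi(PNP0303,0x0)"


@dataclass
class BoardModel:
    """Observed behavior only; the 8 second timeout itself is not simulated."""

    safe_mode: bool = False   # the "boot failure" mode
    wdt_armed: bool = False   # ICC/OC watchdog running

    def connect_drivers(self, device_path):
        # Connecting drivers to the problematic handle arms the watchdog,
        # but only on a normal boot. In safe mode nothing happens.
        if device_path == PROBLEM_PATH and not self.safe_mode:
            self.wdt_armed = True

    def wdt_expires(self):
        # Timeout expiry forces a reset; the board comes back up with the
        # alternate (safe default) settings active.
        if self.wdt_armed:
            self.wdt_armed = False
            self.safe_mode = True

    def reboot(self):
        # Assumption: a reboot clears a pending watchdog. It does NOT
        # leave safe mode.
        self.wdt_armed = False

    def save_and_exit(self):
        # Re-applying settings via "save & exit" returns to a normal boot.
        self.safe_mode = False
        self.wdt_armed = False


board = BoardModel()
board.connect_drivers(PROBLEM_PATH)   # normal boot: watchdog gets armed
board.wdt_expires()                   # forced reset into "boot failure" mode
board.reboot()                        # still in safe mode after a reboot
board.connect_drivers(PROBLEM_PATH)   # harmless while in safe mode
```

After this sequence `board.safe_mode` is still set and the watchdog is not armed, which matches what you see when you keep booting via boot override.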

At any rate, further speculation on this bizarre behavior is not worth it anymore by this point — only Gigabyte or AMI can fix the problem, as they're the only ones with access to the actual source code (as well as the tools required to properly debug this issue at the necessary level).




I submitted a ticket via Gigabyte's eSupport site on December 1st describing the issue in detail. I am still waiting for a response, but honestly, I'm not sure if or when we'll see a fix for this issue. Gigabyte hasn't released even a single UEFI firmware update for any of their X299X motherboards in almost a full year now, and that's a rather bad sign.




With regards to OpenCore, the primary reason it causes these boot failures is due to the use of recursive gBS->ConnectController calls combined with child path rollups.
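A simplified illustration of why the recursive flag matters. This is a toy model in Python, not the EDK2 API: the real call is gBS->ConnectController (ControllerHandle, DriverImageHandle, RemainingDevicePath, Recursive), and the three-level handle tree below is a hypothetical stand-in for the real one on this board.

```python
from dataclasses import dataclass, field


@dataclass
class Handle:
    device_path: str
    children: list = field(default_factory=list)


def connect_controller(handle, recursive, visited=None):
    """Toy stand-in for ConnectController: returns the paths it touched."""
    if visited is None:
        visited = []
    visited.append(handle.device_path)
    if recursive:
        # A recursive connect walks every child handle that gets created.
        for child in handle.children:
            connect_controller(child, recursive, visited)
    return visited


# Hypothetical handle tree leading down to the problematic child handle.
keyboard = Handle("PciRoot(0x0)/Pci(0x1F,0x0)/Acpi(PNP0303,0x0)")
lpc = Handle("PciRoot(0x0)/Pci(0x1F,0x0)", [keyboard])
root = Handle("PciRoot(0x0)", [lpc])

shallow = connect_controller(root, recursive=False)
deep = connect_controller(root, recursive=True)
```

Here `shallow` only contains the root path, while `deep` also reaches the PNP0303 child. A connect-everything pass with the recursive flag set will always end up touching that handle sooner or later.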

The workaround I came up with and implemented is really not that great for a lot of reasons, and I'm not sure that it'd be suitable for inclusion into upstream (at the absolute least it would need to become configuration-based to make it into upstream, and it almost certainly needs to be moved into a separate lib module or at least separate functions within the existing lib, etc.).

I have not tested it very extensively yet (it's been less than 24 hours since I created the first fully working prototype/proof-of-concept version). It has worked fairly well in the limited testing I've done so far, with one exception: a slew of very perplexing errors I ran into when I finally disabled or removed most of the overly-verbose debug printing left over from development. These turned out to be caused by what looks like some stupid timing-related issues that were easy to work around temporarily, although a more robust solution is probably going to be a headache. Still, initial appearances can be deceiving.

Also, I'm not totally confident that this hardcoded approach will even work universally on this board across all configurations, settings, firmware revisions, etc. It probably will, but that'll need to be tested.

I plan to spend some time on testing this out in greater depth and to further clean up & improve the code before I push this to a public repo.

Anyways, I thought I might as well share what I've discovered and done in here. This has been a very long and painful journey, but the end is finally in sight (at least via the workaround if nothing else), and I've learned many interesting things along the way (although if I hadn't, there's no way I would have kept hacking away at this for so long, lol).
 
Wow @JTR thanks so much for these details and all your work.

I've been investigating the same issue for weeks as well, though I lack your knowledge in exploring it from a BIOS or bootloader code perspective. I have just been testing every permutation of option I could think of, and delving into NVRAM variables, ACPI dumps and device path trees to see if I could spot any clues as to what might be different between a successful and unsuccessful boot, at the least to provide information to Acidanthera.

Only just last night I started writing out a detailed bug ticket to submit to Acidanthera with a slew of debug information collected from OpenShell, which I planned to submit to them soon in the hope that they might be able to work out what the issue was, and put some workaround in OpenCore, given that the issue occurs only with OpenCore, and not with Windows, Linux, FreeBSD, or a direct boot of OpenShell.efi.

Obviously I won't submit that now. Once your public repo is ready, will you discuss with Acidanthera? As you say, I guess there'd need to be a new Quirk for enabling the fix - though perhaps they'd be willing to do that final plumbing, given you'd be presenting them with a confirmed issue and the specific code to fix it. Judging by recent commits and past Issues, they seem fairly willing to fix system-specific issues.

I've found that - at least on F3C BIOS - the issue also occurs with any boot from USB. A normal boot to OpenCore on USB - without using BIOS Boot Override/F12 boot menu/Exit-without-saving - will fail to safe mode in exactly the same way, even when the USB stick is the only connected device.

You mention F3C is a 'dumpster fire' - could you elaborate? Should we all be using F3B? What differences are there between F3B and F3C? I've been using F3C simply because it has the latest microcode, and because I'd not noticed any practical differences between F3C and F3B (which I ran for a few weeks, a month or so ago, when I first got the board). Though I had thought to go back to F3B to test USB, given the fact that it always fails on USB on F3C didn't seem to match what I thought I'd read here of the experience of others.

You mentioned you'd just submitted a ticket to Gigabyte, which is excellent. Perhaps we should all do that now. I'd not bothered before, given I knew dolgarrenan had done so months ago and was told "We only support Windows", and that - as you say - they'd not released a BIOS in a year now, so I figured they weren't going to give a damn. But now you have hard evidence of a specific problem, perhaps if a few of us submit it, it might at least increase the chances of someone reviewing it. I won't be holding out any hope, of course.

Thanks again for your fantastic findings and all the work you've done. Can't wait to try any possible workarounds, and learn more about your method!
 
I've found that - at least on F3C BIOS - the issue also occurs with any boot from USB. A normal boot to OpenCore on USB - without using BIOS Boot Override/F12 boot menu/Exit-without-saving - will fail to safe mode in exactly the same way, even when the USB stick is the only connected device.
I can not confirm this. I'm currently running my EFI just from my USB drive and it works perfectly, without the reset bug while on F3C.

If we're really lucky, Gigabyte may be incentivized to update a lot of their UEFI firmwares to add Smart Access Memory, so they have an additional reason to take another look at this, I hope.
 
Only just last night I started writing out a detailed bug ticket to submit to Acidanthera with a slew of debug information collected from OpenShell, which I planned to submit to them soon in the hope that they might be able to work out what the issue was, and put some workaround in OpenCore, given that the issue occurs only with OpenCore, and not with Windows, Linux, FreeBSD, or a direct boot of OpenShell.efi.

Actually, the issue seems to occur with Windows as well if you have BitLocker enabled and attempt to boot Windows via boot override (or chainloading) with Windows not set as the first OS to boot. I am not 100% confident that this stems from the same root cause, but I think it's very likely that it does, and I included this detail in my bug report to Gigabyte.

Obviously I won't submit that now. Once your public repo is ready, will you discuss with Acidanthera? As you say, I guess there'd need to be a new Quirk for enabling the fix - though perhaps they'd be willing to do that final plumbing, given you'd be presenting them with a confirmed issue and the specific code to fix it. Judging by recent commits and past Issues, they seem fairly willing to fix system-specific issues.

Honestly I don't know that I am willing to put in the effort to take it all the way to the point where it is just a PR, but if I decide against it, I am definitely willing to open an issue asking them to look at integrating it once it's been cleaned up. In theory this should be fairly easy to integrate in a generic way, you'd just need a new section in the config plist with device paths to veto and some plumbing to iterate and veto the paths in the config plist if they're not blank, although then you'd need board-specific usage examples. It'd be much easier to just integrate it with the hard-coded paths and use a quirk flag specific to our motherboard to enable it, but I doubt that'd be something they'd want to integrate.
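To make the config-based idea a bit more concrete, here is a rough sketch of what the plumbing could look like, written in Python purely for illustration (the real thing would be C inside OpenCore). The VetoDevicePaths key is hypothetical and does not exist in OpenCore, and the function names are made up as well.

```python
import plistlib

# Hypothetical config fragment: "VetoDevicePaths" is NOT a real OpenCore key.
CONFIG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>UEFI</key>
    <dict>
        <key>VetoDevicePaths</key>
        <array>
            <string>PciRoot(0x0)/Pci(0x1F,0x0)/Acpi(PNP0303,0x0)</string>
            <string></string>
        </array>
    </dict>
</dict>
</plist>
"""


def load_veto_paths(config_bytes):
    """Return the non-blank device paths listed in the veto section."""
    config = plistlib.loads(config_bytes)
    entries = config.get("UEFI", {}).get("VetoDevicePaths", [])
    return [path.strip() for path in entries if path.strip()]


def should_connect(device_path, veto_paths):
    """Skip connecting drivers to any handle whose path is vetoed."""
    return device_path not in veto_paths


veto_paths = load_veto_paths(CONFIG_XML)
```

With this, `should_connect` returns False for the PNP0303 path and True for everything else, and blank entries are ignored. The comparison is an exact string match here; a real implementation would presumably compare the binary device path nodes instead.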

I've found that - at least on F3C BIOS - the issue also occurs with any boot from USB. A normal boot to OpenCore on USB - without using BIOS Boot Override/F12 boot menu/Exit-without-saving - will fail to safe mode in exactly the same way, even when the USB stick is the only connected device.

The issue definitely occurs on all non-safe-mode boots regardless of connected devices (I've tested it with no USB devices and OC on a SATA SSD, I've tested it with only the OC USB and zero other devices, etc; I spent a LOT of time testing people's claims in this thread and discovering one by one that they're almost all totally wrong). It might not occur if you can call ExitBootServices fast enough (which could potentially be possible with logging disabled), but I haven't personally tested that assumption for OpenCore, and I have indications from monitoring checkpoint codes with the Windows BitLocker bug that this probably does not work.

You mention F3C is a 'dumpster fire' - could you elaborate? Should we all be using F3B? What differences are there between F3B and F3C? I've been using F3C simply because it has the latest microcode, and because I'd not noticed any practical differences between F3C and F3B (which I ran for a few weeks, a month or so ago, when I first got the board). Though I had thought to go back to F3B to test USB, given the fact that it always fails on USB on F3C didn't seem to match what I thought I'd read here of the experience of others.

F3c fixes some bugs (most notably the extremely annoying UI lag issue in the UEFI firmware setup GUI), but it's markedly more unstable and seems to partially mask the existence of boot failures while additionally resetting other settings in the process (e.g. setting the setup GUI back to simple mode and the language to French). It was just way too much of a headache to deal with its extra problems and weirdness for the endless permutations I was running through. It's been quite some time since I last tested it though, I'll have to give it a whirl with my modified version and see if it's still as troublesome without any boot failures being triggered by OC.

There's also some universal and very nasty bugs with memory training voltage on this motherboard that took me a while to discover the existence of (and the fix for), and even after successfully implementing the fix for that, I still went back and intentionally kept the memory reverted to stock clocks for the rest of my testing in order to avoid this potentially polluting my testing results in any way.

You mentioned you'd just submitted a ticket to Gigabyte, which is excellent. Perhaps we should all do that now. I'd not bothered before, given I knew dolgarrenan had done so months ago and was told "We only support Windows", and that - as you say - they'd not released a BIOS in a year now, so I figured they weren't going to give a damn. But now you have hard evidence of a specific problem, perhaps if a few of us submit it, it might at least increase the chances of someone reviewing it. I won't be holding out any hope, of course.

I kept the "only support Windows" bit in mind, but the fact that this seems to impact Windows under specific circumstances should hopefully short-circuit that idiotic policy. However, their technical support is notoriously atrocious (and the site itself is no better, given that it mysteriously lost ~1200 characters of my ticket in addition to stripping all formatting & line breaks from it — I actually had to re-submit it again with the ticket text in an attachment because of this problem combined with the inability for users to actually respond to any ticket on there), so we'll see if anything comes of this.

I can't say I have all that much hope for this option though. However, if I get a negative response from their support, I have at least the following fallback options:

  • If you look very carefully through the entire system DSDT/SSDTs, you'll eventually find that the full name and both work/personal email addresses for a firmware engineer at Gigabyte have been embedded into a SSDT under a custom Name (with his first name as the key!) under a Windows-specific section of the SSDT (when I stumbled across this accidentally for the first time I found it rather hilarious). I plan to contact him directly if support does not follow through; I am not sure if he can help directly, but he should at least be able to forward the issue directly to the appropriate internal team or contacts.
  • While AMI directs users to contact their OEM for support, they still offer a technical support contact form for end-users on their web site. As it's unclear if the bug lies in their driver implementation or a Gigabyte-specific modification/extension/usage of it, it may be worth attempting to contact them directly via this form to notify them of it and request a fix. If nothing else, they should be able to get directly in touch with firmware engineers at Gigabyte and bring the issue to their attention.

I can not confirm this. I'm currently running my EFI just from my USB drive and it works perfectly, without the reset bug while on F3C.

See my comments above regarding F3c, ExitBootServices, etc.

If we're really lucky, Gigabyte may be incentivized to update a lot of their UEFI firmwares to add Smart Access Memory, so they have an additional reason to take another look at this, I hope.

It's possible, but keep in mind that there are multiple major Intel security issues that they have not patched on any X299X motherboard (but which vendors like Asus have repeatedly issued updated firmwares for), so I'm not sure if a new feature is enough of an incentive for them (or if they'd even bother to patch anything else beyond the new feature).

On the bright side, at least I can manually modify the firmware to add some of the security updates myself, although I've postponed any attempt to do so until after I finalize the OC patch (it made no sense to waste time on this while still trying to debug & fix the issue).
 
excellent job and info JTR
i contacted gigabyte a few weeks ago, requesting a new bios version
or asking them if they had any beta available because their latest bios are a bit old
they didn't offer any help at all and it seems like they didn't understand what i was trying to say
i had to keep repeating myself until i got tired and said forget it

i have aorus master x299x and that board has a slower boot time compared to my previous
x299 gaming 7 pro

i never had any problem running mac os on my gaming 7 pro once i figured out the settings
gigabyte offered a beta version to a guy on newegg web site
that is where i bought my x299x motherboard

i updated to the beta bios version f3d
but that bios has a very bad problem with memory
things get corrupted
i lost data from using that bios
once i found out that the problem was the beta bios
i reverted back to the latest official bios, f3c

it seems like gigabyte doesn't care that much about those boards
even if they are "new", they are technically a refresh and it seems like they are being treated like abandonware

i know there are some pages where you can get or request modified version of official bios
i have flashed many boards in the past that didn't have raid support enabled
or didn't have AHCI enabled in the bios

but i'm not sure if i can post any link , i don't want any trouble
but you can search for bios mod on any search engine

i remember when some board manufacturer didn't release any updated bios to patch intel vulnerabilities and the users and moderators came up with alternatives and solutions

maybe if you go to any of those pages and explain the problem
maybe someone can mod the bios and enable what is disabled

the good thing about these boards is that they have or come with dual bios

i regret having sold my gaming 7 pro, these x299x should be better
maybe they are, but it also has some issues that the other board didn't have

i know eventually the bug will be fixed

regards
 
Actually, the issue seems to occur with Windows as well if you have BitLocker enabled and attempt to boot Windows via boot override (or chainloading) with Windows not set as the first OS to boot. I am not 100% confident that this stems from the same root cause, but I think it's very likely that it does, and I included this detail in my bug report to Gigabyte.
Ah, that's very good to know. I'll try and re-create that and submit to them as well.
Honestly I don't know that I am willing to put in the effort to take it all the way to the point where it is just a PR, but if I decide against it, I am definitely willing to open an issue asking them to look at integrating it once it's been cleaned up. In theory this should be fairly easy to integrate in a generic way, you'd just need a new section in the config plist with device paths to veto and some plumbing to iterate and veto the paths in the config plist if they're not blank, although then you'd need board-specific usage examples. It'd be much easier to just integrate it with the hard-coded paths and use a quirk flag specific to our motherboard to enable it, but I doubt that'd be something they'd want to integrate.
That's quite understandable. Just the issue with the correct technical details will be a great help I'm sure. My impression from reading their Github issues is that they do seem fairly receptive to implementing Quirks that are quite specific, though I don't recall seeing any that are quite as specific as ours.

But as you'll be able to pinpoint the exact issue with example code I'd feel quite confident they'll at least investigate - it's the debug and testing aspect, on a board they likely won't have access to, that I always felt would be the major hurdle to them being interested to take it seriously. The plumbing I'd imagine they could implement very quickly.

Even if they're not interested, having your code out in the wild and the ability to attempt patching it for ourselves will be an enormous step forward.
The issue definitely occurs on all non-safe-mode boots regardless of connected devices (I've tested it with no USB devices and OC on a SATA SSD, I've tested it with only the OC USB and zero other devices, etc — I spent a LOT of time testing people's claims in this thread and discovering one by one that they're almost all totally wrong).
OK yes, I get it now. I did a bit more testing on USB and see now that timing is key.

I have never had a failure when booting directly (auto boot to first priority) on SSD or NVMe, and had assumed this meant the issue could not occur in that scenario. And then when the issue did happen on USB, I thought that it being a USB boot was an alternative factor.

Now I see that, as you say, it's the timing of debug logging that was causing it to fail on USB. Lately I've had all my EFIs running DEBUG builds with Target=65. On USB this is extremely slow - many times slower than the same logging on SSD/NVMe. And this must be what was causing my direct USB boots to fail every time, where SSD/NVMe never did.

When I switched a USB to Target = 0, I was able to direct boot from USB without triggering the issue. This explains why @byteminer was saying he had no problems with USB.
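For reference, Target is a bitmask, and decoding it helps explain the difference. The bit meanings below are as I understand them from the OpenCore configuration documentation for Misc -> Debug -> Target; please double-check against the Configuration.pdf for your OpenCore version. The helper function is just an illustration.

```python
# Bit meanings as I understand them from the OpenCore docs
# (Misc -> Debug -> Target); verify against your version's Configuration.pdf.
TARGET_BITS = {
    0x01: "logging enabled",
    0x02: "on-screen console",
    0x04: "Data Hub",
    0x08: "serial port",
    0x10: "UEFI variable",
    0x20: "non-volatile UEFI variable",
    0x40: "file",
}


def decode_target(value):
    """List the log destinations selected by a Target bitmask."""
    return [name for bit, name in TARGET_BITS.items() if value & bit]


debug_targets = decode_target(65)   # 65 == 0x41
quiet_targets = decode_target(0)
```

So Target=65 (0x41) means logging is enabled and written to a file on the boot volume, which on a slow USB stick adds a lot of time before ExitBootServices. Target=0 selects nothing, so the log is discarded.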

So does this mean that the reason that it's guaranteed to happen from Boot Override / F12 / Exit Without Saving is because it alters the timing - the 'clock' so to speak starts from the moment booting starts, and going into the BIOS or bringing up the F12 menu delays reaching ExitBootServices by a considerable amount of time, hence the problem always occurs?

It might not occur if you can call ExitBootServices fast enough (which could potentially be possible with logging disabled), but I haven't personally tested that assumption for OpenCore, and I have indications from monitoring checkpoint codes with the Windows BitLocker bug that this probably does not work.
It's certainly the case that it doesn't happen on a boot that's fast enough, as I've literally never had it happen when direct/auto booting either SSD or NVMe, including OpenCore DEBUG with Target=65 logging. Nor, I now realise, with USB when debug logging is not enabled.
F3c fixes some bugs (most notably the extremely annoying UI lag issue in the UEFI firmware setup GUI), but it's markedly more unstable and seems to partially mask the existence of boot failures while additionally resetting other settings in the process (e.g. setting the setup GUI back to simple mode and the language to French). It was just way too much of a headache to deal with its extra problems and weirdness for the endless permutations I was running through. It's been quite some time since I last tested it though, I'll have to give it a whirl with my modified version and see if it's still as troublesome without any boot failures being triggered by OC.
Interestingly I've not had that French language reset in a long time. I can't now remember what triggered it specifically, though I seem to recall it happened once when I was applying a saved BIOS profile in F3C. But it's not happened to me during day-to-day usage over the last month or more, which has included regular occurrences of the safe-mode reboot.

So in general F3C has been OK for me thus far. But I have been meaning to do some more detailed comparisons between it and F3B once I'm back to trying heavy overclocking, which I will be once my new water cooling hardware arrives. There were suggestions early in this thread that F3B had more stable performance than F3C, but I don't know if that's accurate or not.
There's also some universal and very nasty bugs with memory training voltage on this motherboard that took me a while to discover the existence of (and the fix for), and even after successfully implementing the fix for that, I still went back and intentionally kept the memory reverted to stock clocks for the rest of my testing in order to avoid this potentially polluting my testing results in any way.
Ah, excellent. I've had a complete failure to overclock RAM timings on this board and I didn't know if it was my lack of experience doing that, or my RAM, or what. I'll watch that video.

EDIT: Yes! That is exactly the issue I had. Thanks for the link! Another GB BIOS bug throwing people off, and another bug report they've ignored for 12+ months. At least this one has a simple fix.
  • If you look very carefully through the entire system DSDT/SSDTs, you'll eventually find that the full name and both work/personal email addresses for a firmware engineer at Gigabyte have been embedded into a SSDT under a custom Name (with his first name as the key!) under a Windows-specific section of the SSDT (when I stumbled across this accidentally for the first time I found it rather hilarious). I plan to contact him directly if support does not follow through; I am not sure if he can help directly, but he should at least be able to forward the issue directly to the appropriate internal team or contacts.
Haha, that's hilarious! And a great find. I wonder if he/she will respond.
On the bright side, at least I can manually modify the firmware to add some of the security updates myself, although I've postponed any attempt to do so until after I finalize the OC patch (it made no sense to waste time on this while still trying to debug & fix the issue).
That'd be excellent.


Thanks again for all your work and for describing it in detail. Can't wait to see more!
 
I'm falling asleep at the moment, so I'll defer replies to the unanswered new posts to later, but I just wanted to quickly mention that I finally figured out a fix for the quirky/unreliable behavior in my initial prototype — it turns out that I was doing several somewhat stupid things, and it was only working at all because cases with debug prints were re-using some just-freed memory, which happened to invalidate the right data purely by coincidence. In my defense though, this is the first time I've ever worked with EDK2, and also the first time I've used plain C to any significant extent.

After several false starts I finally identified and implemented what appears to be the most optimal approach from both a reliability and performance standpoint. I briefly tested this revised version with minimal debug logging (which would reliably crash with the prior broken approach), and it worked perfectly fine. I significantly cleaned up the source as well, although I haven't gotten rid of all of the commented debug code just yet. This is all based on OpenCore 0.6.3, and seeing as 0.6.4 just came out, my next step will be to try to update this to 0.6.4 (might be a simple merge/rebase, this depends on if there are conflicting changes), test it on 0.6.4 (at least after updating my config for 0.6.4), then I'll push the code to GitHub and maybe publish some pre-compiled binaries as well.

At that point I'll need to ask as many of you as are willing to give it a try and report back with observations and on any issues discovered. Once I can get some feedback on how well it works for other people with the same board, have integrated any necessary changes that this process reveals the need for, and am confident that the fix is reasonably robust for more than just myself, then I'll move onwards to inquiring with vit9696 regarding merging this into mainline OpenCore in some form.
 
Wonderful to hear, @JTR - can't wait to try it out!
 
Hi @TheBloke,

I have a very minor issue/question regarding SSDT for TB3. I flashed my TB3 firmware and if I use your EFI from post #619, everything is fantastic except in the 'Thunderbolt' section, I see the speed is 20Gb x2 rather than 40Gb x1. Do you also have 20x2 instead of 40x1? Btw, Hotplug works fine.

And if I use the SSDT from HackinDROM method in @CaseySJ 's thread, I can't even boot.
 

Attachments

  • SSDT-TB3-HackinDROM.aml (2.1 KB)
Hi @TheBloke,

I have a very minor issue/question regarding SSDT for TB3. I flashed my TB3 firmware and if I use your EFI from post #619, everything is fantastic except in the 'Thunderbolt' section, I see the speed is 20Gb x2 rather than 40Gb x1. Do you also have 20x2 instead of 40x1? Btw, Hotplug works fine.

And if I use the SSDT from HackinDROM method in @CaseySJ 's thread, I can't even boot.
Hi @oreoapple - I'm afraid I'm not a good person to help with anything related to TB3. I still don't actually have any TB3 devices, and for that reason I haven't yet tried the firmware flashing. I've tested that the TB3 ports work as USB3 ports, and that's it. And because I haven't yet done the firmware flashing, my About This Mac -> Thunderbolt section is empty.

The SSDT in the EFI I uploaded is identical to the one provided by @dolgarrenan in his OP. I can see from the screenshot in the first post that his Thunderbolt section does say "40Gb/s x1", so yes I suppose seeing 20Gb x 2 is different and perhaps wrong. But I'm afraid I couldn't tell you why.

The only thing that does occur to me is to check your Thunderbolt BIOS settings. Do you have "SL0-No Security" set? And what values do you have for the other parameters? I believe "No security" is definitely important, but I don't know about the other options - but when I applied dolgarrenan's BIOS profiles I did notice he had some non-default values, eg I seem to recall he had Reserved I/O set to 8 (default is 0).

I have no idea if they could affect this, but you could try:
  1. Saving your current BIOS settings to a profile.
    1. I normally save two: one to internal memory, one to USB; the internal memory profile is more convenient, but it gets lost if you swap to the secondary BIOS or re-flash the BIOS, whereas the USB copy can be retrieved any time.
  2. Loading dolgarrenan's BIOS profile - he's uploaded two in this thread, one for BIOS F3B, one for F3C.
  3. Taking a note of his Thunderbolt settings.
    1. Or capture a screenshot by hitting F12 (requires a FAT16/FAT32 formatted USB stick.)
  4. Reload your profile, and adjust your Thunderbolt settings to match his.
As and when I get a TB3 device I'll try the firmware flash and report back, but that's unlikely to be until the new year so hopefully you'll have it fixed long before then!
 