Uh no! Gnome Failed

Ever since Gnome 3 (gnome-shell) felt kinda stable, I've been using it as my main windowing environment on my Gentoo laptop, for what feels like a decade now. Something I haven't seen in a while, though I probably should have seen it way more often, is a notorious screen that comes up when something goes wrong somewhere between the hardware, drivers, X, mutter, any number of applications and clients, gtk, maybe wayland, all the way up to the gnome base, and the failure gets trapped. It's the Gnome equivalent of the Windows BSOD (Blue Screen Of Death).

The variation of this you'd typically get includes a "Logout" button.

When this happens, you're almost certainly toast and need to log out or restart. Either way, you lose all the windows and applications you had running and, if you're anything like me, you then spend 10-15 minutes starting everything back up, getting each window placed in just the right spot on just the right workspace, and getting each rxvt terminal connected to the right remote host.

Some Background

I think this used to happen to me pretty regularly, but then it went away. It would always be triggered by Chromium, or Google-Chrome, sometimes Discord, but usually it was Chrome trying to play an animation or embedded video: it pukes, the screen freezes for a moment, everything goes black, gnome-shell dumps some satisfying core, and restarts itself. Fortunately, when that happens, all my applications are still running, since X didn't restart. This would go on for a while, and it was ok, since I never lost any work and the worst thing that happened was that my rxvt terminals would sometimes get shifted slightly and shrunk a little, depending on where they were on any given workspace (more on this later).

I always thought that was kind of weird. I usually run 2 terminals per workspace, about 32 terminals in all across 16 workspaces, and since you can sort of do that "split" in a workspace, the rxvt terminals simply took up half of the real estate each. It looked nice because I don't use title bars on any of my windows, just a solid black screen and 2 shell prompts. But once gnome (or something under it) lost its will to live, these perfectly placed black terminals became floating windows exposing bits of the background around the edges. Weird, but I never really thought much about it, and sometimes I'd even do the [Super]+[Left] or [Right] (in Gnome settings, the description says: "View split on left" and "View split on right") to make them take up the full screen again.

Fast forward to a few days ago, when I must have updated something along the way: I upgraded things like ruby, gcc, binutils, and some other stuff to the latest slot so I could clean out the older versions. Mind you, I didn't upgrade gnome itself, but I did have to rebuild a lot of related stuff since my USE flags changed. Now suddenly, when something gave up on life, I'd see the very dreaded Uh no! screen. I couldn't get around it and I couldn't close it, and clicking logout would obviously kill all my windows and applications, so I didn't want to do that. One thing that I did notice was that if I hit [Super] to get to the overview, I'd see all my applications and all my workspaces, and they looked like they were fine (just like before). But this Uh no! thing was in the way and I couldn't get rid of it.

Turns out, the screen is spun up by something called gnome-session-failed. On my Gentoo system it resides in /usr/libexec, and when it's invoked with --allow-logout, the logout button is also displayed. If I kill this process, the window goes away, but then gdm takes over and restarts my session, taking all my applications and clients with it; very bad. Why does it do that when the session is just fine? I can even interact with my terminals while this Oh no screen is in my way, I just can't see what I'm doing. The reason, of course, is that this is the generic BSOD for Gnome: anything catastrophic could have caused it, and it just gets trapped and says "You lose, play again" without telling you anything about what actually went wrong or how to go about fixing it. I've spent a good amount of time reading bug reports, blogs, Stack Exchange sites, etc. about what causes people to see this BSOD, and the causes are all pretty different. Essentially, like I mentioned before, anything from a hardware failure all the way up to something misconfigured in gnome-shell can jack itself up and make you see this BSOD.

The Workaround

So now we get to how I make this go away (without fixing the underlying problem). So we know that:
  1. gnome-session-failed gets invoked when shit hits the fan.
  2. When gnome-session-failed stops running, whatever invoked it goes on to restart gdm, and everything's gone
  3. Something that I didn't mention before: if it can't invoke gnome-session-failed, it simply restarts gdm anyway, just without the courtesy of the BSOD
So what's the workaround? Obviously, replace /usr/libexec/gnome-session-failed with:

#!/bin/sh
sleep infinity

So we replace it with a shell script that simply never exits, which lets us get back to the weird state where my split terminals are no longer perfectly split and show background around their edges.

jonlin 235464 0.0 0.0 7468 3584 ? Ss Nov11 0:00 /bin/sh /usr/libexec/gnome-session-failed --allow-logout

yep, that thing's gonna be running for the next month or two if I can help it.
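For anyone wanting to try the same swap, here is a minimal sketch of how it could be done while keeping a copy of the original binary around. The helper name install_stub and the .orig suffix are my own inventions for illustration, not anything Gnome or Gentoo provides, and a rebuild of the gnome-session package will presumably put the real binary back.

```shell
#!/bin/sh
# Sketch: replace a binary with a stub script that never exits,
# keeping a backup so the change can be undone later.
# install_stub is a hypothetical helper name, not a Gnome/Gentoo tool.
install_stub() {
    target=$1
    # Only back up once, so a second run doesn't overwrite the
    # real binary's backup with the stub itself.
    [ -e "$target.orig" ] || cp -p "$target" "$target.orig"
    # The stub: a two-line shell script that sleeps forever.
    printf '#!/bin/sh\nexec sleep infinity\n' > "$target"
    chmod 755 "$target"
}

# Usage (as root), with the path from this post:
#   install_stub /usr/libexec/gnome-session-failed
# To undo:
#   mv /usr/libexec/gnome-session-failed.orig /usr/libexec/gnome-session-failed
```

The stub ignores its arguments, so --allow-logout is simply swallowed, just like in the process listing above.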

So I have my life back the way it was before I ran all these updates, where Chrome occasionally causes my laptop to black out and come back with my terminals slightly smaller than before. What next?

A small update here. I've found where you can change the invocation: in the Gnome shell X11 service unit. Gnome-shell is started by systemd, and OnFailure= is one of systemd's [Unit] section options. It's possible you can just remove it, or change OnFailureJobMode= (See: Job Modes), to make this all go away without having to run a shell script. Since I'm not in a position to play around with this right now, I'll update this page later when I do end up having to restart my laptop (which could be months). If you're playing around with this, make sure that the Restart= option is set to always.

Additionally, in the gnome-session-failed.service file, you can remove the line:

ExecStopPost=-/usr/libexec/gnome-session-ctl --shutdown

That line is actually what restarts the session and logs you out. None of this is ideal, as legitimate errors could then keep you from logging out, and it prevents this BSOD from doing what it's supposed to do. This is, again, just a workaround.
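Before editing anything, it helps to see what the units actually say on your system. This is a small sketch that filters unit-file text down to the settings mentioned above. The helper name show_failure_settings is made up, and the unit name in the usage comment is an assumption on my part; check what yours is called first.

```shell
#!/bin/sh
# Sketch: print only the failure-handling settings from unit-file
# text arriving on stdin.
# show_failure_settings is a hypothetical helper name.
show_failure_settings() {
    grep -E '^(OnFailure|OnFailureJobMode|Restart|ExecStopPost)='
}

# Usage. The unit names are assumptions; list the real ones with:
#   systemctl --user list-units 'org.gnome.Shell*'
# then:
#   systemctl --user cat 'org.gnome.Shell@x11.service' | show_failure_settings
#   systemctl --user cat gnome-session-failed.service | show_failure_settings
```

One caveat worth knowing: the systemd documentation says dependency-type settings can only be added to in drop-in files, not reset to empty, so removing an OnFailure= line most likely means overriding the whole unit file rather than adding a small drop-in.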


So I tried turning off all my gnome extensions to make sure that none of them was the culprit (they weren't). Then I tried setting MUTTER_SYNC, and any number of other things that link X errors to gnome-shell core dumps. One thing's for sure: something along the way causes gnome-shell to core dump, as I see this before the BSOD every time. MUTTER_SYNC doesn't tell me much more, and though I do see other things dumping core (Discord, Chrome, Microsoft Teams), I don't think they are the cause of gnome-shell dumping core. Most of the debugging led me to starting applications under strace, but nothing meaningful ever came of it.

gnome-shell[41144]: Received an X Window System error.
        This probably reflects a bug in the program.
        The error was 'BadMatch (invalid parameter attributes)'.
        Details: serial 368000 error_code 8 request_code 146 (unknown) minor_code 6)
        Note to programmers: normally, X errors are reported asynchronously;
        that is, you will receive the error a while after causing it.
        To debug your program, run it with the MUTTER_SYNC environment
        variable to change this behavior. You can then get a meaningful
        backtrace from your debugger if you break on the meta_x_error() function.)

This is then followed by Chrome, Discord, and/or Teams all dumping core if they were still running.
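The "(unknown)" next to request_code in that error is the major opcode of an X extension, and the server can tell you which extension owns which opcode. Here's a rough sketch that looks one up in the output of xdpyinfo -queryExtensions. The helper name ext_for_opcode is made up, and the parsing assumes the extension lines look like "NAME  (opcode: N, ...)", which is how xdpyinfo prints them as far as I know. Opcodes are assigned per server, so the answer only means something on the machine that logged the error.

```shell
#!/bin/sh
# Sketch: given a major opcode (the request_code from an X error),
# print the name of the X extension that owns it.
# Reads `xdpyinfo -queryExtensions` output on stdin.
# ext_for_opcode is a hypothetical helper name; the input format
# "NAME  (opcode: N, ...)" is an assumption about xdpyinfo's output.
ext_for_opcode() {
    awk -v op="$1" '
        match($0, /\(opcode: [0-9]+/) {
            # "(opcode: " is 9 characters; the digits follow it.
            code = substr($0, RSTART + 9, RLENGTH - 9)
            if (code == op) {
                name = substr($0, 1, RSTART - 1)
                gsub(/^[ \t]+|[ \t]+$/, "", name)
                print name
            }
        }'
}

# Usage, with the request_code from the error above:
#   xdpyinfo -queryExtensions | ext_for_opcode 146
```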

That's when I noticed that these nicely split, all black, sans titlebar rxvt terminals that I had on 10 of my 16 workspaces were the only windows affected when gnome-shell restarts, and there were some xrandr related errors in the Xorg logs. I thought that maybe it had something to do with that? I don't know how that's related to Chrome or Discord or Teams, but I'm 100% sure these 3 applications are what trigger the BSOD. Even before I was seeing the BSOD again, it was most definitely me doing something in those 3 applications that caused the laptop to freeze momentarily (while gnome-shell takes a most pleasurable core dump) before everything sort of arrived back where I left off. But perhaps it's caused by some state of the terminals in the "split" mode, for lack of a better (or correct) term. I will mention that I've seen posts online where this was caused by playing video in mpv or something similar, and they tracked it to something xrandr related.

I mention this because if I never go back and [Super]+[Left] or [Right] those terminals, and just leave them sort of hanging out without any stickiness to the left or right edge of the screen, I never see the BSOD again. This was something I mildly suspected. While I was trying anything and everything to get my laptop back into a state where I wouldn't see the BSOD anymore, knowing Chrome/Discord/Teams was the culprit, or at least the catalyst, I'd start up a new session and launch Chrome, Discord, and Teams, but never touch any of the rxvt terminals that autostarted when I logged into my gnome session. I'd happily use the session for hours, even a whole day, without seeing the Oh no! screen. And every time I'd think "Ok, maybe whatever it was I last did fixed it, I'll go ahead and put all my terminals where they belong", and once they were all in the right places, soon enough one of the 3 catalyst applications would cause some core dumps and I'd be greeted with the BSOD again.

I don't know what else to do at this point, as anything further would be well outside my knowledge domain and comfort zone, and because of the workaround I feel less inclined to dive into a possibly very deep rabbit hole. But I'm putting this post up in case someone else runs into the same problem and can at the very least use the workaround, and maybe, possibly, someone who knows what they're doing will find the bit that's broken (not likely). Or... it could have already been fixed in the new mutter version 45, along with the newer version of gnome-shell.

Here's some other snippets of errors that occur around the core dump:

google-chrome.desktop[63815]: [63808:63808:1111/095607.382014:ERROR:feature_processor_state.cc(40)] Processing error occured: model WebAppInstallationPromo failed with UkmEngineDisabled, message:

(There was no message)

/usr/libexec/gdm-x-session[62885]: (EE) event9 - SynPS/2 Synaptics TouchPad: kernel bug: Touch jump detected and discarded.

kernel: [drm:lspcon_init] *ERROR* Failed to probe lspcon
kernel: [drm:intel_dp_detect] *ERROR* LSPCON init failed on port D

Definitely don't think that's related.

gnome-shell[41144]: Can't update stage views actor [:0x557d771f14c0] is on because it needs an allocation.
gnome-shell[41144]: Can't update stage views actor [:0x557d771f18a0] is on because it needs an allocation.
gnome-shell[41144]: Can't update stage views actor [:0x557d771f14c0] is on because it needs an allocation.
gnome-shell[41144]: Can't update stage views actor [:0x557d771f18a0] is on because it needs an allocation.

kernel: gnome-shell[59055]: segfault at 28 ip 00007f883e170f64 sp 00007fff488e84a0 error 4 in libmutter-clutter-12.so.0.0.0[7f883e105000+8c000]
kernel: Code: 62 bf 03 00 48 8d 3d 81 48 02 00 e8 96 72 f9 ff 31 c0 eb 9c 66 90 41 56 41 55 49 89 cd 41 54 49 89 f4 55 48 89 d5 53 48 89 fb <4c> 8b 77 28 e8 83 74 f9 ff 48 89 c6 48 8b 03 48 85 c0 74 05 48 39
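The kernel segfault line carries enough numbers to work out where inside the library the crash happened: subtract the mapping base (the number in the square brackets) from the instruction pointer. A tiny sketch of that arithmetic follows; the helper name segv_offset is made up. Note that the result is an offset into the mapped segment, so turning it into a function name would still need the segment's file offset and debug symbols for libmutter-clutter.

```shell
#!/bin/sh
# Sketch: offset of the faulting instruction inside the mapped
# library, computed from a kernel "segfault at ..." line.
# segv_offset is a hypothetical helper name.
#   $1 = ip value, $2 = mapping base (both hex, without 0x prefix)
segv_offset() {
    printf '%x\n' $(( 0x$1 - 0x$2 ))
}

# Values from the log line above:
segv_offset 00007f883e170f64 7f883e105000   # prints 6bf64
```

That 0x6bf64 sits inside the 0x8c000 byte mapping, as it should. The "segfault at 28" with "error 4" part means a user-mode read of an unmapped address, and an address as small as 0x28 looks like a NULL pointer plus a field offset, though that's a guess from the log line alone.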

In Summary

  • Something happens in gnome-shell when rxvt terminals are in the "View split" mode while Chrome (or Chromium), Discord, or Teams is running, but mostly Chrome. This causes gnome-shell to dump some sweet core
  • gnome-session-failed is launched and prevents any interaction with the system
  • Killing the process causes gdm to restart and logs you out
  • Overwriting the /usr/libexec/gnome-session-failed command with a shell script that never exits keeps you logged in
