A big part of my work on Fedora/RHEL is to troubleshoot and do root cause analysis across the software stack. Because many of these projects are decades old, this usually feels like being stuck somewhere between being an archaeologist and a detective.
Many bugs are boring but some are interesting, either because the investigation made me learn something new or due to the amount of effort that was sunk into figuring out the problem. So I thought that it would be a nice experiment to share a little about the ones that are worth mentioning. This is the first of such posts, I may write more in the future if have time and remember to do it:
It was a dark and stormy night when Peter Robinson mentioned a crime to me, a Rockpro64 board was found to not boot when the
CONFIG_PCIE_ROCKCHIP_HOST option was enabled. He also already had found the criminal, it was the
I have to admit that I only knew
CONFIG_DEBUG_SHIRQ by name and that it was a debug option for shared interrupts, but didn’t even know what this option was about. So the first step was to read the help text of the Kconfig symbol to learn more on this option.
Enable this to generate a spurious interrupt just before a shared interrupt handler is deregistered (generating one when registering is currently disabled). Drivers need to handle this correctly. Some don’t and need to be caught.
This was the smoking gun: a spurious interrupt!
We now knew what was the weapon used but we still had questions, why would triggering an interrupt lead to the board being hung? The next step then was to figure out where exactly this was happening, it certainly would have to be somewhere in the driver’s IRQ handler code path.
By looking at the pcie-rockchip-host driver code, we see two IRQ handlers registered:
rockchip_pcie_subsys_irq_handler() for the
"pcie-sys" IRQ and
rockchip_pcie_client_irq_handler() for the
Adding some debug printouts to both would show us that the issue was happening in the latter, when calling to
rockchip_pcie_read(). This function just hangs indefinitely and never returns.
With CONFIG_DEBUG_SHIRQ enabled, the irq tear down routine
would still access the irq handler register as a shared irq.
Per the comment within the function of __free_irq, it says
"It’s a shared IRQ — the driver ought to be prepared for
an IRQ event to happen even now it’s being freed". However
when failing to probe the driver, it may disable the clock
for accessing the register and the following check for shared
irq state would call the irq handler which accesses the register
w/o the clk enabled. That will hang the system forever.
The proposed solution was to check in the
rockchip_pcie_read() function if a
rockchip->hclk_pcie clock was enabled before trying to access the PCIe registers’ address space. But that wasn’t accepted because it was solving the symptom and not the cause.
But it did confirm our findings, that the problem was an IRQ handler being called before it was expected and that the PCIe register access hangs due to a clock not being enabled.
With all of that information and reading once more the pcie-rockchip-host driver code, we could finally reconstruct the crime scene:
"pcie-sys"IRQ is requested and its handler registered.
"pcie-client"IRQ is requested and its handler registered.
- probe later fails due to
readl_poll_timeout()returning a timeout.
"pcie-sys"IRQ handler is unregistered.
CONFIG_DEBUG_SHIRQtriggers a spurious interrupt.
"pcie-client"IRQ handler is called for this spurious interrupt.
- IRQ handler tries to read
PCIE_CLIENT_INT_STATUSwith clocks gated.
- the machine hangs because
rockchip_pcie_read()call never returns.
The root cause of the problem then is that the IRQ handlers are registered too early, before all the required resources have been properly set up.
Our proposed solution then is to move all the IRQ initialization into a later stage of the probe function. That makes it safe for the IRQ handlers to be called as soon as they are registered.
Until the next mystery!