A lethal spurious interrupt

A big part of my work on Fedora/RHEL is to troubleshoot and do root cause analysis across the software stack. Because many of these projects are decades old, this usually feels like being stuck somewhere between being an archaeologist and a detective.

Many bugs are boring but some are interesting, either because the investigation made me learn something new or due to the amount of effort that was sunk into figuring out the problem. So I thought that it would be a nice experiment to share a little about the ones that are worth mentioning. This is the first of such posts, I may write more in the future if have time and remember to do it:

It was a dark and stormy night when Peter Robinson mentioned a crime to me, a Rockpro64 board was found to not boot when the CONFIG_PCIE_ROCKCHIP_HOST option was enabled. He also already had found the criminal, it was the CONFIG_DEBUG_SHIRQ option.

I have to admit that I only knew CONFIG_DEBUG_SHIRQ by name and that it was a debug option for shared interrupts, but didn’t even know what this option was about. So the first step was to read the help text of the Kconfig symbol to learn more on this option.

Enable this to generate a spurious interrupt just before a shared interrupt handler is deregistered (generating one when registering is currently disabled). Drivers need to handle this correctly. Some don’t and need to be caught.

This was the smoking gun: a spurious interrupt!

We now knew what was the weapon used but we still had questions, why would triggering an interrupt lead to the board being hung? The next step then was to figure out where exactly this was happening, it certainly would have to be somewhere in the driver’s IRQ handler code path.

By looking at the pcie-rockchip-host driver code, we see two IRQ handlers registered: rockchip_pcie_subsys_irq_handler() for the "pcie-sys" IRQ and rockchip_pcie_client_irq_handler() for the "pcie-client" IRQ.

Adding some debug printouts to both would show us that the issue was happening in the latter, when calling to rockchip_pcie_read(). This function just hangs indefinitely and never returns.

Peter wrote in the filled bug that this issue was reported before and there was even an RFC patch posted by a Rockchip engineer, who mentioned the assessment of the problem:

With CONFIG_DEBUG_SHIRQ enabled, the irq tear down routine
would still access the irq handler register as a shared irq.
Per the comment within the function of __free_irq, it says
"It’s a shared IRQ — the driver ought to be prepared for
an IRQ event to happen even now it’s being freed". However
when failing to probe the driver, it may disable the clock
for accessing the register and the following check for shared
irq state would call the irq handler which accesses the register
w/o the clk enabled. That will hang the system forever.

The proposed solution was to check in the rockchip_pcie_read() function if a rockchip->hclk_pcie clock was enabled before trying to access the PCIe registers’ address space. But that wasn’t accepted because it was solving the symptom and not the cause.

But it did confirm our findings, that the problem was an IRQ handler being called before it was expected and that the PCIe register access hangs due to a clock not being enabled.

With all of that information and reading once more the pcie-rockchip-host driver code, we could finally reconstruct the crime scene:

  1. "pcie-sys" IRQ is requested and its handler registered.
  2. "pcie-client" IRQ is requested and its handler registered.
  3. probe later fails due to readl_poll_timeout() returning a timeout.
  4. the "pcie-sys" IRQ handler is unregistered.
  5. CONFIG_DEBUG_SHIRQ triggers a spurious interrupt.
  6. "pcie-client" IRQ handler is called for this spurious interrupt.
  7. IRQ handler tries to read PCIE_CLIENT_INT_STATUS with clocks gated.
  8. the machine hangs because rockchip_pcie_read() call never returns.

The root cause of the problem then is that the IRQ handlers are registered too early, before all the required resources have been properly set up.

Our proposed solution then is to move all the IRQ initialization into a later stage of the probe function. That makes it safe for the IRQ handlers to be called as soon as they are registered.

Until the next mystery!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s