A framebuffer hidden in plain sight

Soon after I set up my Rockpro64 board, Peter Robinson told me about an annoying bug that happened on machines with a Rockchip SoC.

The problem was that the framebuffer console just went away after GRUB booted the Linux kernel. We started looking at this and Peter mentioned the following data points:

  • Enabling early console output on the framebuffer registered by the efifb driver (earlycon=efifb efi=debug) would get some output but at some point everything would just go blank.
  • The display worked when passing fbcon=map:1 and people were using that as a workaround.
  • Preventing the efifb driver from being loaded (modprobe.blacklist=efifb) would also make things work.

So the issue seemed to be related to the efifb driver somehow, but it wasn’t clear what was happening.

What this driver does is register a framebuffer device that relies on the video output configured by the firmware/bootloader (using the EFI Graphics Output Protocol) until a real driver takes over and re-initializes the display controller and other IP blocks needed for video output.

I read The Framebuffer device and The Framebuffer Console sections in the Linux documentation to get more familiar with how this is supposed to work.

What happens is that the framebuffer console is bound by default to the first framebuffer registered, which is the one registered by the efifb driver.

Later, the rockchipdrm driver is probed and a second framebuffer is registered by the DRM fbdev emulation layer, but the framebuffer console is still bound to the first framebuffer. That one relies on the display setup done by the firmware through the EFI GOP, which gets destroyed when the kernel re-initializes the display controller and related IP blocks (IOMMU, clocks, power domains, etc).

So why are users left with a blank framebuffer? It’s because the rockchipdrm framebuffer is registered but the console is not attached to it.

Once the problem was understood, it was easy to solve. The DRM subsystem provides a drm_aperture_remove_framebuffers() helper function to remove any existing drivers that may own the framebuffer memory, but the rockchipdrm driver was not using this helper.

The proposed fix (that landed in v5.14-rc1) then is for the rockchipdrm driver to call the helper to detach any existing early framebuffer before registering its own.

After doing that, the early framebuffer is unbound from the framebuffer console and the one registered by the rockchipdrm driver takes over:

[   40.752420] fb0: switching to rockchip-drm-fb from EFI VGA
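The binding behavior can be modeled in a few lines of Python. This is only a toy sketch and not kernel code: the FbRegistry class, its methods and the boot() function are names I made up for illustration, and remove_conflicting() is a simplified stand-in for the effect of drm_aperture_remove_framebuffers().

```python
class FbRegistry:
    """Toy model of framebuffer registration and fbcon binding (hypothetical names)."""

    def __init__(self):
        self.fbs = []        # registered framebuffers, in registration order
        self.console = None  # framebuffer the console is currently bound to

    def register(self, name):
        self.fbs.append(name)
        # The console binds to the first framebuffer registered and stays there.
        if self.console is None:
            self.console = name

    def remove_conflicting(self):
        # Simplified stand-in for the helper: drop the early firmware
        # framebuffer, which also unbinds the console from it.
        self.fbs.clear()
        self.console = None


def boot(use_fix):
    reg = FbRegistry()
    reg.register("EFI VGA")          # efifb, set up by the firmware
    if use_fix:
        reg.remove_conflicting()     # rockchipdrm detaches the early fb first
    reg.register("rockchip-drm-fb")  # DRM fbdev emulation
    return reg.console
```

Without the fix, boot(False) leaves the console on "EFI VGA", whose scanout is about to be torn down. With the fix, boot(True) ends up on "rockchip-drm-fb".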

The curious case of the ghostly modalias

I was finishing my morning coffee at the Fedora ARM mystery department when a user report came to my attention: the tpm_tis_spi driver was not working on a board that had a TPM device connected through SPI.

There was no /dev/tpm0 character device present in the system, even though the driver was built as a module and the Device Tree (DT) passed to the kernel had a node with an "infineon,slb9670" compatible string.

Peter Robinson chimed in and mentioned that he had briefly looked at this case before. The problem, he explained, is that the module isn’t auto-loaded but that manually loading it makes things work.

At the beginning he thought that this was just a common issue of a driver not having module alias information. This would lead to kmod not knowing that the module has to be loaded, when the kernel reported a MODALIAS uevent as a consequence of the SPI device being registered.

But when checking the module to confirm that theory, he found that there were alias entries:

$ modinfo drivers/char/tpm/tpm_tis_spi.ko | grep alias
alias:          of:N*T*Cgoogle,cr50C*
alias:          of:N*T*Cgoogle,cr50
alias:          of:N*T*Ctcg,tpm_tis-spiC*
alias:          of:N*T*Ctcg,tpm_tis-spi
alias:          of:N*T*Cinfineon,slb9670C*
alias:          of:N*T*Cinfineon,slb9670
alias:          of:N*T*Cst,st33htpm-spiC*
alias:          of:N*T*Cst,st33htpm-spi
alias:          spi:cr50
alias:          spi:tpm_tis_spi
alias:          acpi*:SMO0768:*

Since the board uses DT to describe the hardware topology, the TPM device should have been registered by the Open Firmware (OF) subsystem. That should cause the kernel to report a "MODALIAS=of:NspiTCinfineon,slb9670", which should have matched the "of:N*T*Cinfineon,slb9670" module alias entry.

But when digging deeper into this issue, things got stranger. Looking at the uevent sysfs entry for this SPI device, he found that the kernel was not reporting an OF modalias but a legacy SPI modalias instead: "MODALIAS=spi:slb9670".
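To see why that matters, here is a small sketch of the matching step. Module aliases are glob-style patterns, so I’m approximating the lookup with Python’s fnmatchcase(); the real kmod lookup is implemented differently, and module_matches() is a name made up for this example. The alias list is the one printed by modinfo above.

```python
from fnmatch import fnmatchcase

# Aliases exported by the module before the fix, as printed by modinfo.
ALIASES = [
    "of:N*T*Cgoogle,cr50C*",
    "of:N*T*Cgoogle,cr50",
    "of:N*T*Ctcg,tpm_tis-spiC*",
    "of:N*T*Ctcg,tpm_tis-spi",
    "of:N*T*Cinfineon,slb9670C*",
    "of:N*T*Cinfineon,slb9670",
    "of:N*T*Cst,st33htpm-spiC*",
    "of:N*T*Cst,st33htpm-spi",
    "spi:cr50",
    "spi:tpm_tis_spi",
    "acpi*:SMO0768:*",
]


def module_matches(modalias, aliases=ALIASES):
    """Return True if any alias pattern matches the reported MODALIAS."""
    return any(fnmatchcase(modalias, pattern) for pattern in aliases)
```

The OF-style string the kernel was expected to report matches, while the legacy "spi:slb9670" string it actually reported matches nothing, so the module is never auto-loaded.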

But how come? a user asked, the device is registered using DT, not platform code! Where is this modalias coming from? Is this legacy SPI device a ghost?

Peter said that he didn’t believe in paranormal events and that there should be a reasonable explanation. So armed with grep, he wanted to get to the bottom of this but got preempted by more urgent things to do.

Coincidentally, I had chased down that same ghost before many moons ago. And it’s indeed not a spirit from the board files dimension but only an incorrect behavior in the uevent logic of the SPI subsystem.

The reason is that the SPI uevent handler always reports a MODALIAS of the form "spi:foobar" even for devices that are registered through DT. This leads to the situation described above and it’s better explained by looking at the SPI subsystem code:

static int spi_uevent(struct device *dev, struct kobj_uevent_env *env)
{
	const struct spi_device		*spi = to_spi_device(dev);
	int rc;

	rc = acpi_device_uevent_modalias(dev, env);
	if (rc != -ENODEV)
		return rc;

	return add_uevent_var(env, "MODALIAS=%s%s", SPI_MODULE_PREFIX, spi->modalias);
}

Conversely, this is what the platform subsystem uevent handler does (which properly reports OF module aliases):

static int platform_uevent(struct device *dev, struct kobj_uevent_env *env)
{
	struct platform_device	*pdev = to_platform_device(dev);
	int rc;

	/* Some devices have extra OF data and an OF-style MODALIAS */
	rc = of_device_uevent_modalias(dev, env);
	if (rc != -ENODEV)
		return rc;

	rc = acpi_device_uevent_modalias(dev, env);
	if (rc != -ENODEV)
		return rc;

	add_uevent_var(env, "MODALIAS=%s%s", PLATFORM_MODULE_PREFIX,
			pdev->name);
	return 0;
}
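The difference between the two handlers comes down to which source of modalias is tried first. The Python model below only mirrors that ordering. The device is a plain dict with field names I invented, and the OF and ACPI string formats are simplified, so take it as a sketch of the control flow rather than of the exact strings the kernel builds.

```python
def of_modalias(dev):
    # Simplified OF modalias: node name, empty type, first compatible string.
    if "of_compatible" not in dev:
        return None
    return "of:N%sTC%s" % (dev.get("of_name", ""), dev["of_compatible"])


def acpi_modalias(dev):
    # Simplified ACPI modalias built from a single hardware ID.
    if "acpi_id" not in dev:
        return None
    return "acpi:%s:" % dev["acpi_id"]


def spi_style_uevent(dev):
    # Same order as spi_uevent(): ACPI, then always the legacy "spi:" form.
    return "MODALIAS=" + (acpi_modalias(dev) or "spi:" + dev["modalias"])


def platform_style_uevent(dev):
    # Same order as platform_uevent(): OF, then ACPI, then the legacy form.
    return "MODALIAS=" + (
        of_modalias(dev) or acpi_modalias(dev) or "platform:" + dev["name"]
    )
```

For a DT-registered device such as {"of_name": "spi", "of_compatible": "infineon,slb9670", "modalias": "slb9670", "name": "slb9670"}, the first function yields the legacy string and the second the OF one.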

Fixing the SPI core would be trivial, but the problem is that there are just too many drivers and Device Tree descriptions that are relying on the current behavior.

It should be possible to change the core, but first all these drivers and DTs have to be fixed. For example, the I2C subsystem had the same issue, but it has already been resolved there.

A workaround in the meantime could be to add to the legacy SPI device ID table all the entries that are found in the OF device ID table. That way, a platform using for example a DT node with compatible "infineon,slb9670" will match against an alias "spi:slb9670", which will be present in the module.

And that’s exactly what the proposed fix for the tpm_tis_spi driver does.

$ modinfo drivers/char/tpm/tpm_tis_spi.ko | grep alias
alias:          of:N*T*Cgoogle,cr50C*
alias:          of:N*T*Cgoogle,cr50
alias:          of:N*T*Ctcg,tpm_tis-spiC*
alias:          of:N*T*Ctcg,tpm_tis-spi
alias:          of:N*T*Cinfineon,slb9670C*
alias:          of:N*T*Cinfineon,slb9670
alias:          of:N*T*Cst,st33htpm-spiC*
alias:          of:N*T*Cst,st33htpm-spi
alias:          spi:cr50
alias:          spi:tpm_tis_spi
alias:          spi:slb9670
alias:          spi:st33htpm-spi
alias:          acpi*:SMO0768:*
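The relation between a compatible string and the legacy alias can be sketched as follows. My understanding is that for DT-registered SPI devices the modalias is taken from the first compatible string with the vendor prefix dropped; the helper name below is made up, and this is a model of that rule rather than the kernel code.

```python
def spi_alias_from_compatible(compatible):
    """Derive the legacy "spi:" alias from a DT compatible string.

    The vendor prefix is everything up to the first comma; a string
    without a comma is used as-is.
    """
    _vendor, sep, name = compatible.partition(",")
    return "spi:" + (name if sep else compatible)


# Compatible strings missing a legacy alias in the first modinfo output.
missing = ["infineon,slb9670", "st,st33htpm-spi"]
new_aliases = [spi_alias_from_compatible(c) for c in missing]
```

Here new_aliases ends up as ["spi:slb9670", "spi:st33htpm-spi"], which are the two entries that show up in the second modinfo output.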

Until the next mystery!

A lethal spurious interrupt

A big part of my work on Fedora/RHEL is to troubleshoot and do root cause analysis across the software stack. Because many of these projects are decades old, this usually feels like being stuck somewhere between being an archaeologist and a detective.

Many bugs are boring but some are interesting, either because the investigation made me learn something new or due to the amount of effort that was sunk into figuring out the problem. So I thought that it would be a nice experiment to share a little about the ones that are worth mentioning. This is the first of such posts; I may write more in the future if I have time and remember to do it.

It was a dark and stormy night when Peter Robinson mentioned a crime to me: a Rockpro64 board was found to not boot when the CONFIG_PCIE_ROCKCHIP_HOST option was enabled. He had also already found the criminal: the CONFIG_DEBUG_SHIRQ option.

I have to admit that I only knew CONFIG_DEBUG_SHIRQ by name and that it was a debug option for shared interrupts, but I didn’t know what it actually did. So the first step was to read the help text of the Kconfig symbol to learn more about this option.

Enable this to generate a spurious interrupt just before a shared interrupt handler is deregistered (generating one when registering is currently disabled). Drivers need to handle this correctly. Some don’t and need to be caught.

This was the smoking gun: a spurious interrupt!

We now knew what weapon was used but we still had questions: why would triggering an interrupt lead to the board hanging? The next step then was to figure out where exactly this was happening; it certainly had to be somewhere in the driver’s IRQ handler code path.

By looking at the pcie-rockchip-host driver code, we see two IRQ handlers registered: rockchip_pcie_subsys_irq_handler() for the "pcie-sys" IRQ and rockchip_pcie_client_irq_handler() for the "pcie-client" IRQ.

Adding some debug printouts to both showed us that the issue was happening in the latter, when calling rockchip_pcie_read(). This function just hangs indefinitely and never returns.

Peter wrote in the filed bug that this issue had been reported before and that there was even an RFC patch posted by a Rockchip engineer, who gave the following assessment of the problem:

With CONFIG_DEBUG_SHIRQ enabled, the irq tear down routine
would still access the irq handler register as a shared irq.
Per the comment within the function of __free_irq, it says
"It’s a shared IRQ — the driver ought to be prepared for
an IRQ event to happen even now it’s being freed". However
when failing to probe the driver, it may disable the clock
for accessing the register and the following check for shared
irq state would call the irq handler which accesses the register
w/o the clk enabled. That will hang the system forever.

The proposed solution was to check in the rockchip_pcie_read() function whether the rockchip->hclk_pcie clock was enabled before trying to access the PCIe registers’ address space. But that wasn’t accepted because it was addressing the symptom and not the cause.

But it did confirm our findings, that the problem was an IRQ handler being called before it was expected and that the PCIe register access hangs due to a clock not being enabled.

With all of that information and reading once more the pcie-rockchip-host driver code, we could finally reconstruct the crime scene:

  1. "pcie-sys" IRQ is requested and its handler registered.
  2. "pcie-client" IRQ is requested and its handler registered.
  3. probe later fails due to readl_poll_timeout() returning a timeout.
  4. the "pcie-sys" IRQ handler is unregistered.
  5. CONFIG_DEBUG_SHIRQ triggers a spurious interrupt.
  6. "pcie-client" IRQ handler is called for this spurious interrupt.
  7. IRQ handler tries to read PCIE_CLIENT_INT_STATUS with clocks gated.
  8. the machine hangs because rockchip_pcie_read() call never returns.

The root cause of the problem then is that the IRQ handlers are registered too early, before all the required resources have been properly set up.

Our proposed solution then is to move all the IRQ initialization into a later stage of the probe function. That makes it safe for the IRQ handlers to be called as soon as they are registered.
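The crime scene and the fix can be replayed with a small simulation. Everything here is a toy model with names of my own choosing (PcieModel, probe(), hangs()). A Hang exception stands in for the register read that never returns, and the spurious interrupt is modeled as a call to the handlers that are still registered while the IRQs are being released.

```python
class Hang(Exception):
    """Stands in for a register read that never returns."""


class PcieModel:
    """Toy model of the probe ordering; all names here are hypothetical."""

    def __init__(self, debug_shirq=True):
        self.debug_shirq = debug_shirq
        self.clocks_on = False
        self.handlers = {}

    def read_status(self):
        # Reading a PCIe register with the clocks gated hangs the machine.
        if not self.clocks_on:
            raise Hang("register read with clocks gated")
        return 0

    def request_irqs(self):
        self.handlers["pcie-sys"] = self.read_status
        self.handlers["pcie-client"] = self.read_status

    def free_irqs(self):
        for name in list(self.handlers):
            del self.handlers[name]
            if self.debug_shirq:
                # Spurious interrupt: handlers still registered get called.
                for handler in list(self.handlers.values()):
                    handler()


def probe(irqs_first, link_trains, debug_shirq=True):
    dev = PcieModel(debug_shirq)
    dev.clocks_on = True
    if irqs_first:
        dev.request_irqs()      # old order: IRQs requested early
    if not link_trains:         # e.g. readl_poll_timeout() timing out
        dev.clocks_on = False   # the error path gates the clocks...
        dev.free_irqs()         # ...and then the IRQs are released
        return "probe failed"
    if not irqs_first:
        dev.request_irqs()      # new order: IRQs requested last
    return "probe ok"


def hangs(**kwargs):
    """Return True if the given probe scenario ends in a hang."""
    try:
        probe(**kwargs)
    except Hang:
        return True
    return False
```

With the old ordering, hangs(irqs_first=True, link_trains=False) is True. Requesting the IRQs last turns the same failure into a plain "probe failed".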

Until the next mystery!

Automatic LUKS volumes unlocking using a TPM2 chip

I joined Red Hat a few months ago, and have been working on improving the Trusted Platform Module 2.0 (TPM2) tooling, towards having better TPM2 support for Fedora on UEFI systems.

For brevity I won’t explain in this post what TPMs are and their features, but assume that readers are already familiar with trusted computing in general. Instead, I’ll explain what we have been working on, the approach used and what you might expect on Fedora soon.

For an introduction to TPM, I recommend Matthew Garrett’s excellent posts about the topic, Philip Tricca’s presentation about TPM2 and the official Trusted Computing Group (TCG) specifications. I also found the book “A Practical Guide to TPM 2.0” to be much easier to digest than the official TCG documentation. The book is open access, which means that it’s freely available.

LUKS volumes unlocking using a TPM2 device

Encryption of data at rest is a key component of security. LUKS provides the ability to encrypt Linux volumes, including both data volumes and the root volume containing the OS. The OS can provide the crypto keys for data volumes, but something has to provide the key for the root volume to allow the system to boot.

The most common way to provide the crypto key to unlock a LUKS volume is to have a user type in a LUKS pass-phrase during boot. This works well for laptop and desktop systems, but is not well suited for servers or virtual machines since it is an obstacle for automation.

So the first TPM feature we want to add to Fedora (and likely one of the most common use cases for a TPM) is the ability to bind a LUKS volume master key to a TPM2. That way the volume can be automatically unlocked (without typing a pass-phrase) by using the TPM2 to obtain the master key.

A key point here is that the actual LUKS master key is not present in plain text form on the system; it is protected by TPM encryption.

Also, by sealing the LUKS master key with a specific set of Platform Configuration Registers (PCR), one can make sure that the volume will only be unlocked if the system has not been tampered with. For example (as explained in this post), PCR7 is used to measure the UEFI Secure Boot policy and keys. So the LUKS master key can be sealed against this PCR, to avoid unsealing it if Secure Boot was disabled or the used keys were replaced.
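The idea of sealing against a PCR can be illustrated with a toy model. A PCR can only be extended, that is, replaced by the hash of its old value concatenated with a new digest, so its final value depends on every measurement and on their order. The seal() and unseal() functions below are made-up stand-ins: a real TPM keeps the secret encrypted and evaluates a policy digest instead of comparing values in the clear, and the measurement strings are invented.

```python
import hashlib


def extend(pcr, measurement):
    # A PCR cannot be written directly, only extended:
    # new value = SHA-256(old value || SHA-256(measurement)).
    digest = hashlib.sha256(measurement).digest()
    return hashlib.sha256(pcr + digest).digest()


def pcr_after(measurements):
    pcr = bytes(32)  # the register starts as all zeros after a reset
    for m in measurements:
        pcr = extend(pcr, m)
    return pcr


def seal(secret, measurements):
    # Toy "sealed blob": the secret plus the PCR value it is bound to.
    return {"secret": secret, "pcr7": pcr_after(measurements)}


def unseal(blob, measurements):
    # Only release the secret if this boot reproduced the same PCR value.
    if pcr_after(measurements) != blob["pcr7"]:
        return None
    return blob["secret"]
```

Sealing with [b"secure-boot=on", b"keys=vendor"] and unsealing with the same list returns the secret. Changing or reordering any measurement makes unseal() return None.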

Implementation details: Clevis

Clevis is a pluggable framework for automated decryption that has a number of “pins”, where each pin implements {en,de}cryption support using a different backend. It also has a command line interface to {en,de}crypt data using these pins, create complex security policies and bind a pin to a LUKS volume to later unlock it.

Clevis relies on the José project, which is a C implementation of the Javascript Object Signing and Encryption (JOSE) standard. It also uses the LUKSMeta project to store Clevis pin metadata in a LUKS volume header.

On encryption, a Clevis pin takes some data to encrypt and a JSON configuration to produce a JSON Web Encryption (JWE) content. This JWE has the data encrypted using a JSON Web Key (JWK) and information on how to obtain the JWK for decryption.

On decryption, the Clevis pin obtains a JWK using the information provided by a JWE and decrypts the ciphertext also stored in the JWE using that key.

Each Clevis pin defines its own JSON configuration format, how the JWK is created, where it is stored and how to retrieve it.
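The shape of that contract can be sketched with a toy pin. None of this is Clevis code or the real JWE format. ToyPin, its envelope fields and the hash-based XOR cipher are all made up for illustration, and the cipher is not secure. The point is only that the output carries both the ciphertext and the information needed to get the key back from a backend.

```python
import base64
import hashlib
import json
import secrets


def _xor_keystream(key, data):
    # Toy cipher for illustration only: XOR with a SHA-256 based keystream.
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


class ToyPin:
    """Hypothetical pin: the key lives in a backend, not in the envelope."""

    name = "toy"

    def __init__(self):
        self._backend = {}  # stands in for a TPM, a server, ...

    def encrypt(self, data, config):
        key = secrets.token_bytes(32)
        key_id = secrets.token_hex(8)
        self._backend[key_id] = key
        envelope = {
            "pin": self.name,
            "key_id": key_id,   # how to obtain the key again
            "config": config,
            "ciphertext": base64.b64encode(_xor_keystream(key, data)).decode(),
        }
        return json.dumps(envelope)

    def decrypt(self, envelope_json):
        envelope = json.loads(envelope_json)
        key = self._backend[envelope["key_id"]]
        return _xor_keystream(key, base64.b64decode(envelope["ciphertext"]))
```

A round trip such as pin.decrypt(pin.encrypt(b"foo\n", {})) gives back the original bytes, but only through the same backend that holds the key.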

As mentioned, Clevis has support to bind a pin to a LUKS volume. This means that a LUKS master key is encrypted using a pin and the resulting JWE is stored in a LUKS volume meta header. That way Clevis is able to later decrypt the master key and unlock the LUKS volume. Clevis has dracut and udisks2 support to do this automatically and the next version of Clevis will also include a command line tool to unlock non-root (data) volumes.

Clevis TPM2 pin

Clevis provides a mechanism to automatically supply the LUKS master key for the root volume. The initial implementation of Clevis has support to obtain the LUKS master key from a network service, but we have extended Clevis to take advantage of a TPM2 chip, which is available on most servers, desktops and laptops.

By using a TPM, the disk can only be unlocked on a specific system – the disk will neither boot nor be accessed on another machine.

This implementation also works with UEFI Secure Boot, which will prevent the system from being booted if the firmware or system configuration has been modified or tampered with.

To make use of all the Clevis infrastructure and also be able to use the TPM2 as a part of more complex security policies, the TPM2 support was implemented as a clevis tpm2 pin.

On encryption the tpm2 pin generates a JWK, creates an object in the TPM2 with the JWK as sensitive data and binds the object (or seals if a PCR set is defined in the JSON configuration) to the TPM2.

The generated JWE contains both the public and wrapped sensitive portions of the created object, as well as information on how to unseal it from the TPM2 (hashing and key encryption algorithms used to recalculate the primary key, PCR policy for authentication, etc).

On decryption the tpm2 pin takes the JWE that contains both the sealed object and information on how to unseal it, loads the object into the TPM2 by using the public and wrapped sensitive portions and unseals the JWK to decrypt the ciphertext stored in the JWE.

The changes haven’t been merged yet, since the pin is using features from tpm2-tools master so we have to wait for the next release of the tools. And also there are still discussions on the pull request about some details, but it should be ready to land soon.

Usage

The Clevis command line tools can be used to encrypt and decrypt data using a TPM2 chip. The tpm2 pin has reasonable defaults but one can configure most of its parameters using the pin JSON configuration (refer to the Clevis tpm2 pin documentation for these), e.g.:

$ echo foo | clevis encrypt tpm2 '{}' > secret.jwe

And then the data can later be decrypted with:

$ clevis decrypt < secret.jwe
foo

To seal data against a set of PCRs:

$ echo foo | clevis encrypt tpm2 '{"pcr_ids":"8,9"}' > secret.jwe

And to bind a tpm2 pin to a LUKS volume:

$ clevis luks bind -d /dev/sda3 tpm2 '{"pcr_ids":"7"}'

The LUKS master key is not stored in raw format, but instead is wrapped with a JWK that has the same entropy as the LUKS master key. It’s this JWK that is sealed with the TPM2.

Since Clevis has both dracut and udisks2 hooks, the command above is enough to have the LUKS volume be automatically unlocked using the TPM2.

The next version of Clevis also has a clevis-luks-unlock command line tool, so a LUKS volume could be manually unlocked with:

$ clevis luks unlock -d /dev/sda3

Using the TPM2 as a part of more complex security policies

One of the pins supported by Clevis is the SSS pin, based on Shamir’s Secret Sharing, which allows encrypting a secret using a JWK that is then split into different parts. Each part is then encrypted using another pin and a threshold is chosen to decide how many parts are needed to reconstruct the encryption key, so the secret can be decrypted.

This allows, for example, splitting the JWK used to wrap the LUKS master key in two parts. One part of the JWK could be sealed with the TPM2 and the other part stored on a remote server. Because the sealed JWK is only one part of the key needed to decrypt the LUKS master key, an attacker obtaining the data sealed in the TPM won’t be able to unlock the LUKS volume.
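For readers who haven’t seen the scheme before, here is a minimal sketch of threshold secret sharing over a prime field. It illustrates the general technique only. The arithmetic used inside Clevis may differ, and the function names and the choice of prime are mine.

```python
import secrets

# A Mersenne prime; secrets must be smaller than this in the toy model.
PRIME = 2**127 - 1


def split(secret, n, t):
    """Split secret into n shares, any t of which can rebuild it."""
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def combine(shares):
    """Rebuild the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse via Fermat's little theorem (PRIME is prime).
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

With shares = split(123456789, n=3, t=2), any two of the three shares passed to combine() give back 123456789, while a single share reveals nothing useful about it.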

The Clevis command for this particular example would be:

$ clevis luks bind -d /dev/sda3 sss '{"t": 2, "pins":
  {"http":{"url":"http://server.local/key"}, "tpm2":
  {"pcr_ids":"7"}}}'
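Nested JSON is easy to get wrong when quoting it by hand in a shell, so it can help to generate the configuration. The sketch below builds the same policy with Python’s json module and returns the argument list for the command shown above. The sss_config() and bind_command() helpers are hypothetical names, and nothing is executed.

```python
import json


def sss_config(threshold, pins):
    """Build an SSS pin configuration like the one used above."""
    if not 1 <= threshold <= len(pins):
        raise ValueError("threshold must be between 1 and the number of pins")
    return json.dumps({"t": threshold, "pins": pins}, separators=(",", ":"))


def bind_command(device, config):
    # Argument list for the bind command; pass it to subprocess.run()
    # to actually run it.
    return ["clevis", "luks", "bind", "-d", device, "sss", config]


cfg = sss_config(
    2,
    {
        "http": {"url": "http://server.local/key"},
        "tpm2": {"pcr_ids": "7"},
    },
)
argv = bind_command("/dev/sda3", cfg)
```

Passing the configuration as a single list element avoids any shell quoting, since no shell is involved when the list is handed to subprocess.run().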

Limitations of this approach

One problem with the current implementation is that Clevis is a user-space tool and so it can’t be used to unlock a LUKS volume that has an encrypted /boot directory. The boot partition still needs to remain unencrypted so the bootloader is able to load a Linux kernel and an initramfs that contains Clevis, to unlock the encrypted LUKS volume for the root partition.

Since the initramfs is not signed on a Secure Boot setup, an attacker could replace the initramfs and unlock the LUKS volume. So the threat model this protects against is an attacker that can get access to the encrypted volume but not to the trusted machine.

There are different approaches to solve this limitation. The previously mentioned post from Matthew Garrett suggests having a small initramfs that’s built into the signed Linux kernel. The only task for this built-in initramfs would be to unseal the LUKS master key, store it into the kernel keyring and extend PCR7 so the key can’t be unsealed again. Later the usual initramfs can unlock the LUKS volume by using the key already stored in the Linux kernel.
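The “extend PCR7 so the key can’t be unsealed again” step is worth a tiny illustration. In this toy model the sealed blob is a dict, the policy check is a plain comparison, and the function names and the extended value are invented. It only shows why a second unseal attempt in the same boot fails.

```python
import hashlib


def extend(pcr, data):
    # new PCR value = SHA-256(old value || SHA-256(data))
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def early_unseal(sealed, pcr7):
    """Model of the built-in initramfs step.

    Returns (key or None, new PCR7 value). The PCR is extended whether or
    not the unseal worked, so later code cannot repeat the operation.
    """
    key = sealed["key"] if sealed["pcr7"] == pcr7 else None
    return key, extend(pcr7, b"luks-key-unsealed")


boot_pcr7 = bytes(32)  # pretend this is PCR7 after a good boot
sealed = {"key": b"master-key", "pcr7": boot_pcr7}

first_key, pcr7 = early_unseal(sealed, boot_pcr7)   # early boot: succeeds
second_key, pcr7 = early_unseal(sealed, pcr7)       # later attempt: fails
```

After the first call the register no longer holds the value the key was sealed against, and since a PCR can only be extended there is no way to put it back without a reboot.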

Another approach is to also have the /boot directory in an encrypted LUKS volume and provide support for the bootloader to unseal the master key with the TPM2, for example by supporting the same JWE format in the LUKS meta header used by Clevis. That way only a signed bootloader would be able to unlock the LUKS volume that contains /boot, so an attacker won’t be able to tamper with the system by replacing the initramfs since it will be in an encrypted partition.

But there is work to be done for both approaches, so it will take some time until we have protection for this threat model.

Still, having an encrypted root partition that is only automatically unlocked on a trusted machine has many use cases. To list a few examples:

  • Stolen physical disks or virtual machines images can’t be mounted on a different machine.
  • An external storage media can be bound to a set of machines, so it can be automatically unlocked only on trusted machines.
  • A TPM2 chip can be reset before sending a laptop to repair, that way the LUKS volume can’t be automatically unlocked anymore.
  • An encrypted volume can be bound to a TPM2 when there is no risk of someone having physical access to the machine and unbound again when there is. So the machine can be unlocked automatically in safe places but require a pass-phrase in unsafe ones.

Acknowledgements

I would like to thank Nathaniel McCallum and Russell Doty for their feedback and suggestions for this article.