A framebuffer hidden in plain sight

Soon after I set up my Rockpro64 board, Peter Robinson told me about an annoying bug that happened on machines with a Rockchip SoC.

The problem was that the framebuffer console just went away after GRUB booted the Linux kernel. We started looking at this and Peter mentioned the following data points:

  • Enabling early console output on the framebuffer registered by the efifb driver (earlycon=efifb efi=debug) would get some output but at some point everything would just go blank.
  • The display worked when passing fbcon=map:1 and people were using that as a workaround.
  • Preventing the efifb driver to be loaded (modprobe.blacklist=efifb) would also make things to work.

So the issue seemed to be related to the efifb driver somehow but wasn’t clear what was happening.

What this driver does is to register a framebuffer device that relies on the video output configured by the firmware/bootloader (using the EFI Graphics Output Protocol) until a real driver takes over an re-initializes the display controller and other IP blocks needed for video output.

I read The Framebuffer device and The Framebuffer Console sections in the Linux documentation to get more familiar about how this is supposed to work.

What happens is that the framebuffer console is bound by default to the first framebuffer registered, which is the one registered by the efifb driver.

Later, the rockchipdrm driver is probed and a second framebuffer registered by the DRM fbdev emulation layer but the frame buffer console is still bound to the first frame buffer, that’s using the EFI GOP but this gets destroyed when the kernel re-initializes the display controller and related IP blocks (IOMMU, clocks, power domains, etc).

So why are users left with a blank framebuffer? It’s because the framebuffer is registered but it’s not attached to the console.

Once the problem was understood, it was easy to solve it. The DRM subsystem provides a drm_aperture_remove_framebuffers() helper function to remove any existing drivers that may own the framebuffer memory, but the rockchipdrm driver was not using this helper.

The proposed fix (that landed in v5.14-rc1) then is for the rockchipdrm driver to call the helper to detach any existing early framebuffer before registering its own.

After doing that, the early framebuffer is unbound from the framebuffer console and the one registered by the rockchipdrm driver takes over:

[   40.752420] fb0: switching to rockchip-drm-fb from EFI VGA

The curious case of the ghostly modalias

I was finishing my morning coffee at the Fedora ARM mystery department when a user report came into my attention: the tpm_tis_spi driver was not working on a board that had a TPM device connected through SPI.

There was no /dev/tpm0 character device present in the system, even when the driver was built as a module and the Device Tree (DT) passed to the kernel had a node with a "infineon,slb9670" compatible string.

Peter Robinson chimed in and mentioned that he had briefly looked at this case before. The problem, he explained, is that the module isn’t auto-loaded but that manually loading it make things to work.

At the beginning he thought that this was just a common issue of a driver not having module alias information. This would lead to kmod not knowing that the module has to be loaded, when the kernel reported a MODALIAS uevent as a consequence of the SPI device being registered.

But when checking the module to confirm that theory, he found that there were alias entries:

$ modinfo drivers/char/tpm/tpm_tis_spi.ko | grep alias
alias:          of:N*T*Cgoogle,cr50C*
alias:          of:N*T*Cgoogle,cr50
alias:          of:N*T*Ctcg,tpm_tis-spiC*
alias:          of:N*T*Ctcg,tpm_tis-spi
alias:          of:N*T*Cinfineon,slb9670C*
alias:          of:N*T*Cinfineon,slb9670
alias:          of:N*T*Cst,st33htpm-spiC*
alias:          of:N*T*Cst,st33htpm-spi
alias:          spi:cr50
alias:          spi:tpm_tis_spi
alias:          acpi*:SMO0768:*

Since the board uses DT to describe the hardware topology, the TPM device should had been registered by the Open Firmware (OF) subsystem. And should cause the kernel to report a "MODALIAS=of:NspiTCinfineon,slb9670", which should had matched the "of:N*T*Cinfineon,slb9670" module alias entry.

But when digging more on this issue, things started to get more strange. Looking at the uevent sysfs entry for this SPI device, he found that the kernel was not reporting an OF modalias but instead a legacy SPI modalias: "MODALIAS=spi:slb9670".

But how come? a user asked, the device is registered using DT, not platform code! Where is this modalias coming from? Is this legacy SPI device a ghost?

Peter said that he didn’t believe in paranormal events and that there should be a reasonable explanation. So armed with grep, he wanted to get to the bottom of this but got preempted by more urgent things to do.

Coincidentally, I had chased down that same ghost before many moons ago. And it’s indeed not a spirit from the board files dimension but only an incorrect behavior in the uevent logic of the SPI subsystem.

The reason is that the SPI uevent handler always reports a MODALIAS of the form "spi:foobar" even for devices that are registered through DT. This leads to the situation described above and it’s better explained by looking at the SPI subsystem code:

static int spi_uevent(struct device *dev, struct kobj_uevent_env *env)
{
	const struct spi_device		*spi = to_spi_device(dev);
	int rc;

	rc = acpi_device_uevent_modalias(dev, env);
	if (rc != -ENODEV)
		return rc;

	return add_uevent_var(env, "MODALIAS=%s%s", SPI_MODULE_PREFIX, spi->modalias);
}

Conversely, this is what the platform subsystem uevent handler does (which properly reports OF module aliases):

static int platform_uevent(struct device *dev, struct kobj_uevent_env *env)
{
	struct platform_device	*pdev = to_platform_device(dev);
	int rc;

	/* Some devices have extra OF data and an OF-style MODALIAS */
	rc = of_device_uevent_modalias(dev, env);
	if (rc != -ENODEV)
		return rc;

	rc = acpi_device_uevent_modalias(dev, env);
	if (rc != -ENODEV)
		return rc;

	add_uevent_var(env, "MODALIAS=%s%s", PLATFORM_MODULE_PREFIX,
			pdev->name);
	return 0;
}

Fixing the SPI core would be trivial, but the problem is that there are just too many drivers and Device Trees descriptions that are relying on the current behavior.

It should be possible to change the core, but first all these drivers and DTs have to be fixed. For example, the I2C subsystem had the same issue but has already been resolved.

A workaround then in the meantime could be to add to the legacy SPI device ID table all the entries that are found in the OF device ID table. That way, a platform using for example a DT node with compatible "infineon,slb9670" will match against an alias "spi:slb9670", that will be present in the module.

And that’s exactly what the proposed fix for the tpm_tis_spi driver does.

$ modinfo drivers/char/tpm/tpm_tis_spi.ko | grep alias
alias:          of:N*T*Cgoogle,cr50C*
alias:          of:N*T*Cgoogle,cr50
alias:          of:N*T*Ctcg,tpm_tis-spiC*
alias:          of:N*T*Ctcg,tpm_tis-spi
alias:          of:N*T*Cinfineon,slb9670C*
alias:          of:N*T*Cinfineon,slb9670
alias:          of:N*T*Cst,st33htpm-spiC*
alias:          of:N*T*Cst,st33htpm-spi
alias:          spi:cr50
alias:          spi:tpm_tis_spi
alias:          spi:slb9670
alias:          spi:st33htpm-spi
alias:          acpi*:SMO0768:*

Until the next mystery!

A lethal spurious interrupt

A big part of my work on Fedora/RHEL is to troubleshoot and do root cause analysis across the software stack. Because many of these projects are decades old, this usually feels like being stuck somewhere between being an archaeologist and a detective.

Many bugs are boring but some are interesting, either because the investigation made me learn something new or due to the amount of effort that was sunk into figuring out the problem. So I thought that it would be a nice experiment to share a little about the ones that are worth mentioning. This is the first of such posts, I may write more in the future if have time and remember to do it:

It was a dark and stormy night when Peter Robinson mentioned a crime to me, a Rockpro64 board was found to not boot when the CONFIG_PCIE_ROCKCHIP_HOST option was enabled. He also already had found the criminal, it was the CONFIG_DEBUG_SHIRQ option.

I have to admit that I only knew CONFIG_DEBUG_SHIRQ by name and that it was a debug option for shared interrupts, but didn’t even know what this option was about. So the first step was to read the help text of the Kconfig symbol to learn more on this option.

Enable this to generate a spurious interrupt just before a shared interrupt handler is deregistered (generating one when registering is currently disabled). Drivers need to handle this correctly. Some don’t and need to be caught.

This was the smoking gun: a spurious interrupt!

We now knew what was the weapon used but we still had questions, why would triggering an interrupt lead to the board being hung? The next step then was to figure out where exactly this was happening, it certainly would have to be somewhere in the driver’s IRQ handler code path.

By looking at the pcie-rockchip-host driver code, we see two IRQ handlers registered: rockchip_pcie_subsys_irq_handler() for the "pcie-sys" IRQ and rockchip_pcie_client_irq_handler() for the "pcie-client" IRQ.

Adding some debug printouts to both would show us that the issue was happening in the latter, when calling to rockchip_pcie_read(). This function just hangs indefinitely and never returns.

Peter wrote in the filled bug that this issue was reported before and there was even an RFC patch posted by a Rockchip engineer, who mentioned the assessment of the problem:

With CONFIG_DEBUG_SHIRQ enabled, the irq tear down routine
would still access the irq handler register as a shared irq.
Per the comment within the function of __free_irq, it says
"It’s a shared IRQ — the driver ought to be prepared for
an IRQ event to happen even now it’s being freed". However
when failing to probe the driver, it may disable the clock
for accessing the register and the following check for shared
irq state would call the irq handler which accesses the register
w/o the clk enabled. That will hang the system forever.

The proposed solution was to check in the rockchip_pcie_read() function if a rockchip->hclk_pcie clock was enabled before trying to access the PCIe registers’ address space. But that wasn’t accepted because it was solving the symptom and not the cause.

But it did confirm our findings, that the problem was an IRQ handler being called before it was expected and that the PCIe register access hangs due to a clock not being enabled.

With all of that information and reading once more the pcie-rockchip-host driver code, we could finally reconstruct the crime scene:

  1. "pcie-sys" IRQ is requested and its handler registered.
  2. "pcie-client" IRQ is requested and its handler registered.
  3. probe later fails due to readl_poll_timeout() returning a timeout.
  4. the "pcie-sys" IRQ handler is unregistered.
  5. CONFIG_DEBUG_SHIRQ triggers a spurious interrupt.
  6. "pcie-client" IRQ handler is called for this spurious interrupt.
  7. IRQ handler tries to read PCIE_CLIENT_INT_STATUS with clocks gated.
  8. the machine hangs because rockchip_pcie_read() call never returns.

The root cause of the problem then is that the IRQ handlers are registered too early, before all the required resources have been properly set up.

Our proposed solution then is to move all the IRQ initialization into a later stage of the probe function. That makes it safe for the IRQ handlers to be called as soon as they are registered.

Until the next mystery!

Collabora contributions to Linux kernel 4.1

Linux 4.1 was released last week and like previous kernel releases, this version again contains contributions made by Collabora engineers as a part of our current projects.

On this release, not only Collabora contributed several patches for different subsystems but also for the first time made it to LWN list of most active employers for a Linux kernel release and Tomeu Vizoso was listed as one of the most active developers by changed lines.

In total 68 patches were contributed to the 4.1 release. These were for:

  • Fixes and improvements to the DRM core atomic support.
  • Fix for Exynos DRM FIMD buffer size calculation.
  • More cleanup and fixes for Exynos DRM in preparation to finish porting the driver to support Atomic Mode Settings for v4.2
  • Add MMC/SDIO power sequencing to make the WiFi chip work on Snow and Peach Pit/Pi Chromebooks.
  • Various fixes and improvement for the ChromeOS Embeded Controller drivers.
  • Add default console serial port configuration for Exynos Chromebooks to avoid having to define a tty in the kernel command line.
  • Enable needed drivers in the Exynos default configuration file.
  • Fix error code propagation in the MMC power sequencing core.
  • Restore clocks needed during suspend for Exynos5 machines to prevent failing to resume.
  • Make the Mwifiex chip on Exynos Chromebooks to keep power during suspend to prevent the driver not allow the system to enter into a suspend state.
  • Fix output disable race on the Samsung PWM driver.
  • Split out touchpad initialisation logic on the Atmel MaxTouch driver.
  • Various enhancements for the Tegra ASoC driver.
  • Add a Device Tree to support the Nyan Blaze Chromebook and factor out common snippets with the Nyan Big Chromebook Device Tree.
  • Add trackpad, WiFi and GPIO restart support for the Nyan Chromebooks.
  • Add support for the Tegra124 Activity Monitor (ACTMON).
  • Fill EMC timings for Tegra Nyan Chromebooks.
  • Many fixes and improvements for the Tegra device frequency scaling driver.
  • Fix the Tegra DRM driver by resetting the SOR to a known state.
  • Enable many drivers in multi_v7 default configuration, that are needed by Tegra Chromebooks.
  • Fix the cros_ec keyboard driver to avoid loosing the key pressed during suspend to resume the system.
  • Fix USB not working on Tegra124 based boards.

Following is the complete list of patches merged in this kernel release:

Collabora contributions to Linux kernel 4.0

Linux 4.0 was released a couple of weeks ago and like previous kernel releases, this version again contains contributions made by Collabora engineers as a part of our current projects.

In total 59 patches were contributed to the 4.0 release. These were for:

  • Fix graphics DP and HDMI display for Exynos DRM.
  • More preparation work to add Atomic Mode Settings support to the Exynos DRM driver.
  • Add support for Power, Lid keys and built-in USB camera to Peach Pi/Pit and Snow Chromebooks.
  • Configure regulators operating modes on suspend for Peach Pi/Pit Chromebooks to reduce power consumption.
  • Add DISP1 power domain and related clocks to have proper display support on Exynos5420 machines.
  • Extend the MMC simple power sequencing provider to support a reference clock and more than one reset GPIO.
  • Fix various regressions in the common clock framework exposed by the per-user clock changes.
  • Fix S3C Real-Time-Clock that was not working on many Exynos SoCs.
  • Fix a bug in the regulator framework that tried to enable regulators that were already enabled.
  • Fix a reboot and poweroff hang on Exynos machines cause by a hang in the samsung serial driver.
  • Various fixes to the Samsung MFC driver.
  • Add support for the Exynos5422 Odroid XU3 board.
  • Enable needed Kconfig symbols on the exynos, omap2plus and multi_v7 defconfigs.
  • Add a devfreq driver for the Tegra Activity Monitor

Following is the complete list of patches merged in this kernel release:

Collabora contributions to Linux kernel 3.19

Linux 3.19 was released last week and this version again contains contributions made by Collabora engineers as a part of our current projects.

In total 60 patches were contributed to the 3.19 release. These were for:

  • Work in the Intel i915 DRM driver to support atomic plane updates.
  • Fixes and removal of unnecessary layers in the Exynos DRM driver as a preparation to add atomic mode settings support.
  • Add support to the regulator framework to define regulators initial and suspend operating modes.
  • Changes to the max77802 regulator driver to support regulator operating modes changes.
  • Fix the Real-Time-Clock in the Snow, Peach Pit and Pi Chromebooks.
  • Enable kernel config options to have display panel working on Exynos boards.
  • Allow the regulator drivers to be enabled and disabled when Exynos system enters and leave suspend states.
  • Many cleanup and fixes to the common clock framework and clock drivers as a preparation for the per-user clock API change.

Following is the complete list of patches merged in this kernel release:

Collabora contributions to the Linux kernel 3.18

Linux 3.18 was released this week and like in previous kernel releases, it contains contributions made by Collabora engineers as a part of the projects involving the Linux kernel.

In total 45 patches were contributed to the 3.18 release. These were for:

  • Cleanups for the Intel i915 DRM driver
  • Various cleanups for the max77686 clock, rtc and regulator drivers.
  • Adding max77802 clock, rtc and regulator drivers.
  • Fixing regmap DT endianess parsing logic.
  • Improving the power model in the Exynos5 Peach Pit and Pi Chromebooks DT.
  • Using the regulator_get_voltage() function to get the mmc OCR mask.
  • Adding max77802 PMIC, ISL29018 sensor and atmel touchpad for Peach boards DT.
  • Setting the correct clock rate for i2c7 in Exynos5 Peach Pit and Pi DT.
  • Enable atmel touchpad, cgroups, sbs battery and atmel touchpad in exynos defconfig.
  • Fixing variable initialization for different regulator drivers.
  • Fixing MFC v5 support in the s5p-mfc driver.
  • Making module autoloading to work for i2c cros-ec-tunnel and cros_ec_keyb drivers.
  • Explicitly configure USB dual role mode as host for Exynos boards.
  • Adding a PM_QOS_MEMORY_BANDWIDTH pm_qos class.
  • Enabling gcov-based kernel profiling for ARM

Following is the complete list of patches merged in this kernel release:

mmap support for Raspberry Pi bcm2835 ALSA driver

Because I work on an awesome company I spent last week improving the Raspberry Pi ALSA driver.

A long standing issue was that the driver did not support memory-mapped I/O mode for audio stream transfers.

ALSA supports two transfers methods for PCM playback: Read/Write transfer where samples are written to the device using standard read and write functions and Direct Read/Write transfers where samples can be written directly to a mapped memory area and the driver is signaled once this has been done.

ALSA provides an API for both cases and each application using ALSA to access the audio device can choose which API to use. But since mmap was not supported by the bcm2835 driver, applications using the mmap API did not work on the Raspberry Pi.

To bypass hardware limitation such as the lack of mmap support, ALSA provides a set of PCM plug-ins that can be used to extend functionality and features of PCM devices.

These plug-ins are used by defining a custom /etc/asound.conf or .asoundrc and one of them is the mmap emulation plug-in (mmap_emul) which emulates mmap access using a set of read/write transfers. This plug-in was needed on the Raspberry Pi to allow applications using ALSA mmap API to work. But of course it introduces some latency and requires a custom configuration to enable it.

So, the best approach was to just add mmap support to the ALSA driver and get rid of the mmap emulation plug-in. Normally this is straightforward since the kernel allows memory areas to be mapped to user-space address space so applications can have direct access to device memory.

This was not the case on the Raspberry Pi since the PCM samples are processed by its VideoCore IVI GPU. The ARM11 CPU communicates with the VideoCore co-processor using a message passing interface (vchiq) which means that the CPU is not able to directly address the audio device hardware buffers and audio samples have to be sent to the device using vchiq messages.

Since hardware buffers can’t be directly mapped to user-space memory, an intermediate buffer is needed. This intermediate buffer is mapped to user-space so applications can store the audio samples there. Once user-space has finished writing the PCM samples they are pushed to VideoCore as vchiq messages.

Fortunately, the kernel ALSA subsystem provides a PCM indirect API which are a set of helper functions and a data structure to manage the intermediate and hardware buffers. It is really helpful so you don’t have to write all the buffer management.

So, after figuring out all of this I wrote a patch to add mmap support to the Raspberry Pi ALSA driver and send a pull request that also contained other improvements to the driver.

Since it has been merged on the Raspberry Pi kernel, this feature will be available on the next raspbian release but if you want to use it now you can do the following:

$ git clone git://github.com/raspberrypi/tools.git
$ git clone git://github.com/raspberrypi/linux.git
$ cd linux
$ git checkoub -b rpi-3.6.y origin/rpi-3.6.y
$ export ARCH=arm
$ export CROSS_COMPILE=../tools/arm-bcm2708/arm-bcm2708-linux-gnueabi/bin/arm-bcm2708-linux-gnueabi-
$ make bcmrpi_defconfig
$ make prepare modules_prepare
$ make M=sound/arm
$ sudo cp sound/arm/snd-bcm2835.ko /media/rootfs/lib/modules/3.6.11+/kernel/sound/arm/
$ sudo rm /media/rootfs/etc/asound.conf

The last step is very important since using both a mmap capable ALSA driver and the PCM mmap emulation plug-in does not work.

I want to thank my employer Collabora for allowing me to work on this:

Device Tree support for IGEPv2 (and IGEP COM Module)

Because I work on an awesome company, I could spend the last few days doing upstream kernel work for the OMAP3 SoC based embedded board I maintain.

Now that SPL/U-Boot and the kernel supports the latest version of the board, the natural next step was to add support for Device Tree (DT).

I didn’t know too much about DT besides it was the new shiny way to describe hardware (instead of the infamous board files) and that it could allow to boot most ARM board using a single kernel image.

So, after spending some time learning about DT, I posted a patch-set that adds DT support for IGEP technology devices (IGEPv2 and IGEP COM Module). Hopefully the patches won’t need too much love before they get merged.

DT is a hot topic on the Linux kernel community right now and it’s been actively developed. Support for it is still premature on some platforms (such as OMAP3). So, these DT don’t have all the hardware support for IGEP that is already implemented using board files.

Currently is working (tested with latest Linux v3.7-rc7 kernel on an IGEPv2 board):
– MMC/SD
– UARTs
– GPIO LEDs
– TWL4030 codec audio
– pinmux/pinconf pinctrl

There are still peripherals not yet supported such as Ethernet, Flash memory and Wifi/BT. Some of these are connected to the OMAP processor through its General Purpose Memory Controller (GPMC) interface and patches to add DT bindings for OMAP GPMC are still flying around. So support for them should be easily added once these patches hit mainline.

It should also work on an IGEP COM Module but since I don’t have this hardware it has been just “compile tested”.

Many thanks to Enric who provided me his initial DT that booted the board, reviewed my patches and gave me lots of feedback and suggestions. Enric’s initial DT and a great explanation on how to boot a kernel with DT can be found on his blog entry.

Also, I’ve updated the oe-core/poky/yocto meta-igep layer to include a mainline Linux kernel recipe. This will make easier to test IGEP support on upstream and to add any remaining hardware enablement bits.

Adding IGEPv2 Rev.C support to mainline u-boot and kernel

I own an IGEPv2 Rev.C board which is a Texas Instrument OMAP3 DM3730 ARM embedded platform very similar to the Beagleboard-MX board but with more peripherals and a 512 NAND flash memory. This board is part of a family of embedded boards known as IGEP-based boards (another popular board from this family is the IGEP COM module).

IGEP’s manufacturer (ISEE) provides a very complete SDK and documentation on how to build a bootable image for the device and use all its features. The SDK is composed of an custom boot-loader (based on TI x-loader and u-boot), a custom kernel (based on v2.6.37 kernel from TI’s arago project), the Linaro ARM cross-toolchain and a root file system with all the necessary software and example code on how to make the best use of the board features. This makes the IGEPv2 a really competitive platform for embedded DIY projects.

Probably ISEE has its reasons to ship a custom kernel and boot-loader instead of their mainline counterparts but I (being a FOSS hacker) wanted to use upstream software on my board. So I started yet another hobby project to do on my (lately very scarce) free time to hack the kernel and u-boot and make them able to boot my IGEPv2 Rev.C board.

Fortunately, ISEE engineers have been pushing support for their IGEP board family to the mainline Linux kernel even though this kernel is not supported by their company. They do this on their free time because they understand the benefits of having their code as close as upstream as possible thus minimizing the size of the patch-set they need to carry on and forward port every time they move to a newer kernel release. But since they do on their spare time, it usually doesn’t support the latest version of their boards (such as the latest IGEPv2 Rev.C I own).

Also, ISEE used to ship u-boot as the IGEP family boot-loader before developing their own boot-loader (igep-x-loader). So it turned out that I only needed to add a few patches to the upstream kernel and u-boot to make them work with my board.

The first issue I had is that previous IGEP boards came with an OneNAND flash memory while newer boards use a NAND flash memory. So the first patch to the kernel was to to give NAND support to the IGEP. With this patch I was able to boot an upstream kernel using ISEE’s custom boot-loader.

The next step was to add the needed support to u-boot. This was a little more complicated since ISEE has not used u-boot for a while and u-boot has added recently a major feature which eliminates the need of a first stage loader (such as TI’s x-loader). Now u-boot has a framework known as SPL (Second Program Loader) that uses the same code base to generate both the first stage and second stage boot-loaders.

So, besides adding NAND support and a compile config option to choose the flash memory type to be used on an IGEP board I added SPL support for IGEP boards and updated the IGEP default kernel command line arguments to use the OMAP-specific serial driver and the EXT4 file system

With these patches my board booted with mainline u-boot and kernel but I found an issue on the IGEP kernel platform code that assumed the boot-loader would set the OMAP mux pin connected to the LAN9221i ethernet chip IRQ line as GPIO input mode (which was true with igep-x-loader but not with u-boot anymore). This should be made on the kernel anyways so the fix was just a trivial one-liner patch to the kernel.

All these patches are upstream now, so to test it just use the latest v3.6-rc4 kernel from Linus and u-boot master branch (git://git.denx.de/u-boot.git).

I had a lot of fun hacking this and my next step is to migrate the IGEP platform code to device trees. So, if anyone else owns an IGEPv2 or an IGEP COM module and is interested on using mainline software instead of ISEE custom one, please contact me so we can share the efforts 🙂