Scientific Computing with QEMU on Raspberry Pi

Here’s the deal. Let’s say you’ve heard of BOINC, a tool for donating CPU time to distributed computing projects, typically for the advancement of the sciences. Big names in distributed computing like Rosetta, MilkyWay, Climate Prediction, and World Community Grid hunger for your CPU cycles in exchange for the promise of advancing our understanding of the world. Serious number crunchers might build machines for the express purpose of maximizing computational throughput, cranking out hundreds of work units per hour. You can even join teams and compete!

Let’s also say that you’re a technology geek, for whom owning or at least toying with some kind of Single Board Computer (SBC) is almost a rite of passage, especially if you’re thinking about getting into the “maker” world. The most widely used SBC is the Raspberry Pi. I usually have a few of these devices lying around idle at any given moment. I’m one of the privileged who are more limited by time than resources.

Why not put the Raspberry Pi to work computing for science?

[Image: BOINC logo. Logo by MindZiper, Wikimedia Commons. CC 3.0 attribution, share-alike license.]

WARNING!

  • Always make backups of your data before making system changes.
  • I’m not responsible if this burns down your house or triggers the apocalypse. This is entirely at your own risk.
  • This pushes your Pi to its thermal limits. Ensure you have an adequate cooling solution, or else your CPU will throttle and your device may get quite hot.
  • I only tried this on the Raspberry Pi 4. YMMV.

The First Hurdle: Requesting Work

These devices are surprisingly powerful yet extremely energy efficient in no small part thanks to the ARM architecture popularized in smartphones. However, if you install boinc-client and begin registering for projects, you’ll quickly realize that almost no projects offer applications that can actually run on ARM. Even when there is an application, I’ve had the RPi sit idle for months waiting for work units that never come. There’s just not a lot of community interest in supporting this hardware. Nevertheless, this is an important precursor to using BOINC, so let’s see how it’s done.

If you want to request a native ARM application, consult the BOINC list of valid platform strings. You probably want to select aarch64-unknown-linux-gnu and enter it into your /etc/boinc/cc_config.xml like so:

<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
  </log_flags>
  <options>
    <alt_platform>aarch64-unknown-linux-gnu</alt_platform>
  </options>
</cc_config>
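If you manage more than one Pi, you may prefer to generate this file rather than hand-edit it. Here is a minimal Python sketch that rebuilds the same structure shown above. The function name build_cc_config is my own, and the script only prints the XML; copying it into place and restarting the client is left to you.

```python
import xml.etree.ElementTree as ET


def build_cc_config(alt_platform: str) -> str:
    """Return cc_config XML with the log flags shown above and one alt_platform."""
    root = ET.Element("cc_config")

    # Same three log flags as in the hand-written config.
    log_flags = ET.SubElement(root, "log_flags")
    for flag in ("task", "file_xfer", "sched_ops"):
        ET.SubElement(log_flags, flag).text = "1"

    # The platform string we want the client to advertise in addition to its own.
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "alt_platform").text = alt_platform

    ET.indent(root, space="  ")  # pretty-printing helper, Python 3.9+
    return ET.tostring(root, encoding="unicode") + "\n"


if __name__ == "__main__":
    print(build_cc_config("aarch64-unknown-linux-gnu"), end="")
```

Redirect the output to a file, review it, and then move it into place yourself.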

In my experience, the platform string often takes on strange values if you rely on auto detection because it’s not optimized for SBCs. You’ll know if it worked by reviewing the BOINC logs for the following log message:

Config: alternate platform: aarch64-unknown-linux-gnu

If you’re running headless, you can read the logs like so:

boinccmd --get_messages
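If you’d rather not scroll through the whole log, a few lines of Python can pick out the relevant message. This is only a sketch: it searches for the "Config: alternate platform:" text quoted above and makes no assumptions about whatever timestamp or project prefix precedes it on the line. The function names are my own.

```python
import shutil
import subprocess

MARKER = "Config: alternate platform:"


def alternate_platforms(log_text: str) -> list[str]:
    """Return every platform string that follows the marker in the log text."""
    found = []
    for line in log_text.splitlines():
        _, sep, rest = line.partition(MARKER)
        if sep:  # marker was present on this line
            found.append(rest.strip())
    return found


def read_boinc_messages() -> str:
    """Run the same command shown above and return its standard output."""
    result = subprocess.run(
        ["boinccmd", "--get_messages"],
        capture_output=True,
        text=True,
        check=False,  # a stopped client should not crash this helper
    )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("boinccmd"):
        print(alternate_platforms(read_boinc_messages()))
    else:
        print("boinccmd not found on PATH")
```

An empty list means the client never logged the alternate platform, which usually points at a config file that wasn’t read.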

I’m running 64-bit Ubuntu Server on my Pi, and it provides 64-bit binaries throughout, along with the runtime libraries needed to support the above platform string. If you’re using Raspbian OS, be aware that it uses a 32-bit userspace on top of a 64-bit kernel. In that case, you would instead choose arm-unknown-linux-gnueabihf (32-bit ARM hardfloat userspace) to receive compatible applications.
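The kernel/userspace mismatch is easy to get wrong, so here is a small Python sketch of the decision. It relies on the fact that the machine name comes from the kernel while the pointer size reflects how the running interpreter was built, so the distribution’s own Python is a reasonable proxy for the userspace. The function pick_platform is my own naming, and the 32-bit branch assumes a hardfloat userspace as described above.

```python
import platform
import struct


def pick_platform(machine: str, pointer_bits: int) -> str:
    """Map kernel architecture plus userspace word size to a BOINC platform string."""
    machine = machine.lower()
    if machine in ("aarch64", "arm64") and pointer_bits == 64:
        return "aarch64-unknown-linux-gnu"
    if machine.startswith(("arm", "aarch64")) and pointer_bits == 32:
        # Assumption: a 32-bit ARM userspace here is hardfloat (gnueabihf).
        return "arm-unknown-linux-gnueabihf"
    raise ValueError(f"no ARM platform string for {machine!r} / {pointer_bits}-bit")


if __name__ == "__main__":
    kernel_machine = platform.machine()         # reported by the kernel
    userspace_bits = struct.calcsize("P") * 8   # pointer size of this interpreter
    try:
        print(pick_platform(kernel_machine, userspace_bits))
    except ValueError as err:
        print(err)
```

On a 64-bit kernel with a 32-bit userspace, the machine name still reads aarch64, which is why the pointer size has to be checked separately.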

This doesn’t do you any good by itself because there’s basically no work available for ARM-based platforms, but we need to understand how to configure BOINC first.

The Second Hurdle: Getting Work

So far we have established the futility of requesting work in the Pi’s native binary format, as well as the configuration needed to request those apps. In principle, however, one could write software that interprets a binary in any format and indirectly performs the exact same operations. This is the same thing that happens when you run a game console emulator on your PC or phone.

The QEMU project is better known for running entire operating systems inside a virtual machine while taking advantage of hardware acceleration facilities, but QEMU also ships tools with the unique capability of emulating any supported CPU on any supported host platform! This means that we can ask QEMU to translate whatever binary formats we need to issue our SBC useful work.

We can make our Pi a contributing member of BOINC society by running:

apt install qemu-user qemu-user-binfmt

The above command is all I needed on Ubuntu Server to install the default suite of user-mode emulators and automatically register them to handle foreign binaries via the binfmt_misc facility. Now, in cc_config I can set the platform to x86_64-pc-linux-gnu and be immediately eligible to do work for practically every single project on BOINC! BOINC downloads the work unit, and the binfmt_misc handler registered with the Linux kernel automatically hands execution of the application to QEMU-user for binary translation. QEMU translates the foreign app into executable snippets of ARM code as the application runs, and it also translates communication to and from the kernel, so the app looks like a regular ARM binary to the rest of the operating system.
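To confirm that the registration actually happened, you can look at the entry the kernel exposes under /proc/sys/fs/binfmt_misc. The Python sketch below parses such an entry. The entry name qemu-x86_64 is an assumption on my part about what the package registers, so list the directory if it isn’t there. The function name is my own.

```python
from pathlib import Path


def parse_binfmt_entry(text: str) -> dict:
    """Parse the text of one binfmt_misc entry into a dictionary."""
    lines = text.splitlines()
    # The first line is either "enabled" or "disabled".
    info = {"enabled": bool(lines) and lines[0].strip() == "enabled"}
    # Remaining lines look like "interpreter <path>", "flags: <letters>", etc.
    for line in lines[1:]:
        key, _, value = line.partition(" ")
        info[key.rstrip(":")] = value.strip()
    return info


if __name__ == "__main__":
    # Assumed entry name; check `ls /proc/sys/fs/binfmt_misc` on your system.
    entry = Path("/proc/sys/fs/binfmt_misc/qemu-x86_64")
    if entry.exists():
        print(parse_binfmt_entry(entry.read_text()))
    else:
        print(f"{entry} not found; is the binfmt package installed?")
```

The interpreter field should point at the QEMU wrapper that the kernel will launch for x86_64 binaries.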

The result looks something like this (output of ps aux) in the terminal:

/usr/libexec/qemu-binfmt/x86_64-binfmt-P ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.82_x86_64-pc-linux-gnu__mt

The app is indeed a 64-bit x86 binary running on my 64-bit ARM CPU.
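If you want to verify that claim yourself rather than trust the file name, the first 20 bytes of any ELF binary record its word size and target CPU. Here is a minimal Python sketch that decodes them. The function name is my own, and the table only covers the handful of machine codes relevant to this post.

```python
import struct

EI_CLASS = {1: "32-bit", 2: "64-bit"}
# e_machine values from the ELF specification (only the ones relevant here).
E_MACHINE = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def describe_elf(header: bytes) -> str:
    """Describe the word size and CPU of an ELF file from its first 20 bytes."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    bits = EI_CLASS.get(header[4], "unknown class")
    # Byte 5 gives the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return f"{bits} {E_MACHINE.get(machine, f'machine {machine}')}"
```

Open the downloaded application in binary mode, pass its first 20 bytes to describe_elf, and compare the result with what the same function reports for a native binary on the Pi.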

The Third Hurdle: Performance

Dynamic binary translation performed by QEMU is additional work that the device performs in order to do the actual useful work for BOINC. Anecdotally, you’re going to see only 20-40% of the performance achievable by a real x86_64 processor. Plus, there’s an additional memory requirement to run QEMU and store translated code. QEMU goes to great lengths to speed up execution, but friction between the binary language of the app and the language of the CPU remains. You can read about the tricks QEMU uses to speed up the process if you’d like to learn more about how this works.

We can do even better. In fact, speeds approaching 100% of native are possible! (Update: see below. We probably don’t want to do this for scientific applications.)

Box64 and FEX aim to do just that. QEMU puts compatibility ahead of performance. TCG (the QEMU binary translation backend) usually emits multiple ARM instructions to perform the equivalent of just one x86_64 instruction. Box64 and FEX go one step further by generating optimized ARM64 code and patching out calls to libraries, replacing them with already optimized native equivalents to reduce the amount of translated code.

I demonstrated QEMU-user instead of Box64 or FEX because QEMU is the only solution that actually works today. Box64 crashes because it’s incomplete and hits a system call that it doesn’t know how to emulate. FEX breaks because it doesn’t offer the same ease of use, where binfmt automatically triggers the emulator the way it does for QEMU. I submitted bug reports to both projects to hopefully increase their suitability for BOINC in the future, but for today these projects really focus on games while QEMU focuses on being complete for any use case.

UPDATE: Use QEMU

Scientific applications rely heavily on floating point operations. Floating point is how computers represent fractional numeric values, such as 2.5 or e, in a flexible way. There is a standard governing this data representation for interoperability, IEEE 754, but as with any nontrivial technical standard, it has edge cases that complicate implementation. Implementations differ in rounding, in the handling of NaN values, in status flags on which a program might depend, and sometimes in the hidden precision used in the internals of real floating point hardware. For some programs, such as games, minor divergence in floating point outputs is tolerable and might not meaningfully impact program behavior, but QEMU cannot assume that all applications will behave the same if the emulated floating point results differ. For example, is the value the width of an atom, or the number of stars in the universe? It’s easy to imagine a program designed to depend on arbitrarily precise behavior.
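To make “hidden precision” concrete, here is a small illustration of my own (it is not taken from any of the emulators discussed here). A fused multiply-add keeps the full product internally and rounds once, whereas a separate multiply and add rounds twice. The Python sketch below imitates the single-rounding case with exact rational arithmetic, and the inputs are chosen so that the two approaches visibly disagree.

```python
from fractions import Fraction


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded to a double before the add."""
    return a * b + c


def single_rounding(a: float, b: float, c: float) -> float:
    """One rounding: compute a*b + c exactly, then round to a double once.

    This imitates what a fused multiply-add instruction produces.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# The exact product is 1 - 2**-60, which is too fine for a double near 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

if __name__ == "__main__":
    print("two roundings:", multiply_then_add(a, b, c))
    print("one rounding: ", single_rounding(a, b, c))
```

With two roundings the tiny term is lost and the result is exactly zero; with one rounding it survives as -2**-60. Neither answer is a bug, but a translator that swaps one strategy for the other changes the program’s output.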

Each floating point instruction (which usually executes in a single cycle on consumer PC processors) is replaced with a call to handler code that checks whether the host floating point hardware can be used to produce the same result. If the conditions are satisfied, we simply execute the corresponding host floating point instruction and return the result, but we often must perform extra adjustments or even perform the entire calculation in software to maintain correctness for the program.

Consider the simplest case imaginable. We jump to a handler, we perform a check using one instruction, we execute the equivalent host instruction, and we return to the TCG block currently executing. This is four instructions vs one instruction, leading to a theoretical performance limit of 25% of native at best. It should be clear that even with optimal code, any divergence from exactly the same floating point hardware invariably leads to slowdown when processing floating point values.
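The arithmetic behind that 25% figure is a simple ratio. This Python sketch spells it out, under the simplifying assumption (implicit in the paragraph above) that every instruction costs the same amount of time. The function name is my own.

```python
def native_fraction_ceiling(host_instructions_per_guest_instruction: int) -> float:
    """Best-case fraction of native speed when one guest instruction
    expands into N equally expensive host instructions."""
    n = host_instructions_per_guest_instruction
    if n < 1:
        raise ValueError("at least one host instruction is needed")
    return 1.0 / n


if __name__ == "__main__":
    # Jump to handler + check + host instruction + return = 4 instructions.
    print(f"{native_fraction_ceiling(4):.0%} of native at best")
```

Real handlers do more than the minimum, so in practice this ratio is an upper bound rather than a prediction.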

Rosetta 2 actually cheats by just not addressing this problem in software. Instead, Apple provides exactly correct floating point matching the target architecture as a non-standard modification to their SoC. Box64 and FEX cheat by compromising on correctness. QEMU is about as fast as can reasonably be expected while maintaining correct behavior.

Conclusion

It’s very possible to get work for BOINC for almost arbitrary platforms, but it comes at a performance hit. Still, some contribution is better than none, and I live in an area with cheap electricity. Thus, I intend to number crunch in emulation for months to come!


