Reproducible Builds in May 2022

View all our monthly reports


Welcome to the May 2022 report from the Reproducible Builds project. In our reports we outline the most important things that we have been up to over the past month. As ever, if you are interested in contributing to the project, please visit our Contribute page on our website.


Repfix paper

Zhilei Ren, Shiwei Sun, Jifeng Xuan, Xiaochen Li, Zhide Zhou and He Jiang have published an academic paper titled Automated Patching for Unreproducible Builds:

[..] fixing unreproducible build issues poses a set of challenges [..], among which we consider the localization granularity and the historical knowledge utilization as the most significant ones. To tackle these challenges, we propose a novel approach [called] RepFix that combines tracing-based fine-grained localization with history-based patch generation mechanisms.

The paper (PDF, 3.5MB) uses the Debian mylvmbackup package as an example to show how RepFix can automatically generate patches to make software build reproducibly. As it happens, Reiner Herrmann submitted a patch for the mylvmbackup package which has remained unapplied by the Debian package maintainer for over seven years, thus this paper inadvertently underscores that achieving reproducible builds will require both technical and social solutions.

Python variables

Johannes Schauer discovered a fascinating bug where simply naming your Python variable _m led to unreproducible .pyc files. In particular, the types module in Python 3.10 requires the following patch to make it reproducible:

--- a/Lib/types.py
+++ b/Lib/types.py
@@ -37,8 +37,8 @@ _ag = _ag()
 AsyncGeneratorType = type(_ag)
 
 class _C:
-    def _m(self): pass
-MethodType = type(_C()._m)
+    def _b(self): pass
+MethodType = type(_C()._b)

Simply renaming the dummy method from _m to _b was enough to workaround the problem. Johannes’ bug report first led to a number of improvements in diffoscope to aid in dissecting .pyc files, but upstream identified this as caused by an issue surrounding interned strings and is being tracked in CPython bug #78274.

New SPDX team to incorporate build metadata in Software Bill of Materials

SPDX, the open standard for Software Bill of Materials (SBOM), is continuously developed by a number of teams and committees. However, SPDX has welcomed a new addition; a team dedicated to enhancing metadata about software builds, complementing reproducible builds in creating a more secure software supply chain. The “SPDX Builds Team” has been working throughout May to define the universal primitives shared by all build systems, including the “who, what, where and how” of builds:

  • Who: the identity of the person or organisation that controls the build infrastructure.

  • What: the inputs and outputs of a given build, combining metadata about the build’s configuration with an SBOM describing source code and dependencies.

  • Where: the software packages making up the build system, from build orchestration tools such as Woodpecker CI and Tekton to language-specific tools.

  • How: the invocation of a build, linking metadata of a build to the identity of the person or automation tool that initiated it.

The SPDX Builds Team expects to have a usable data model by September, ready for inclusion in the SPDX 3.0 standard. The team welcomes new contributors, inviting those interested in joining to introduce themselves on the SPDX-Tech mailing list.

Talks at Debian Reunion Hamburg

Some of the Reproducible Builds team (Holger Levsen, Mattia Rizzolo, Roland Clobus, Philip Rinn, etc.) met in real life at the Debian Reunion Hamburg (official homepage). There were several informal discussions amongst them, as well as two talks related to reproducible builds.

First, Holger Levsen gave a talk on the status of Reproducible Builds for bullseye and bookworm and beyond (WebM, 210MB):

Secondly, Roland Clobus gave a talk called Reproducible builds as applied to non-compiler output (WebM, 115MB):


Supply-chain security attacks

This was another bumper month for supply-chain attacks in package repositories. Early in the month, Lance R. Vick noticed that the maintainer of the NPM foreach package let their personal email domain expire, so they bought it and now “controls foreach on NPM and the 36,826 projects that depend on it”. Shortly afterwards, Drew DeVault published a related blog post titled When will we learn? that offers a brief timeline of major incidents in this area and, not uncontroversially, suggests that the “correct way to ship packages is with your distribution’s package manager”.

Bootstrapping

“Bootstrapping” is a process for building software tools progressively from a primitive compiler tool and source language up to a full Linux development environment with GCC, etc. This is important given the amount of trust we put in existing compiler binaries. This month, a bootstrappable mini-kernel was announced. Called boot2now, it comprises a series of compilers in the form of bootable machine images.

Google’s new Assured Open Source Software service

Google Cloud (the division responsible for the Google Compute Engine) announced a new Assured Open Source Software service. Noting the considerable 650% year-over-year increase in cyberattacks aimed at open source suppliers, the new service claims to enable “enterprise and public sector users of open source software to easily incorporate the same OSS packages that Google uses into their own developer workflows”. The announcement goes on to enumerate that packages curated by the new service would be:

  • Regularly scanned, analyzed, and fuzz-tested for vulnerabilities.

  • Have corresponding enriched metadata incorporating Container/Artifact Analysis data.

  • Are built with Cloud Build including evidence of verifiable SLSA-compliance

  • Are verifiably signed by Google.

  • Are distributed from an Artifact Registry secured and protected by Google.

(Full announcement)

A retrospective on the Rust programming language

Andrew “bunnie” Huang published a long blog post this month promising a “critical retrospective” on the Rust programming language. Amongst many acute observations about the evolution of the language’s syntax (etc.), the post beings to critique the languages’ approach to supply chain security (“Rust Has A Limited View of Supply Chain Security”) and reproducibility (“You Can’t Reproduce Someone Else’s Rust Build”):

There’s some bugs open with the Rust maintainers to address reproducible builds, but with the number of issues they have to deal with in the language, I am not optimistic that this problem will be resolved anytime soon. Assuming the only driver of the unreproducibility is the inclusion of OS paths in the binary, one fix to this would be to re-configure our build system to run in some sort of a chroot environment or a virtual machine that fixes the paths in a way that almost anyone else could reproduce. I say “almost anyone else” because this fix would be OS-dependent, so we’d be able to get reproducible builds under, for example, Linux, but it would not help Windows users where chroot environments are not a thing.

(Full post)

Reproducible Builds IRC meeting

The minutes and logs from our May 2022 IRC meeting have been published. In case you missed this one, our next IRC meeting will take place on Tuesday 28th June at 15:00 UTC on #reproducible-builds on the OFTC network.

A new tool to improve supply-chain security in Arch Linux

kpcyrd published yet another interesting tool related to reproducibility. Writing about the tool in a recent blog post, kpcyrd mentions that although many PKGBUILDs provide authentication in the context of signed Git tags (i.e. the ability to “verify the Git tag was signed by one of the two trusted keys”), they do not support pinning, ie. that “upstream could create a new signed Git tag with an identical name, and arbitrarily change the source code without the [maintainer] noticing”. Conversely, other PKGBUILDs support pinning but not authentication. The new tool, auth-tarball-from-git, fixes both problems, as nearly outlined in kpcyrd’s original blog post.


diffoscope

diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 212, 213 and 214 to Debian unstable.

Chris also made the following changes:

  • New features:

    • Add support for extracting vmlinuz Linux kernel images. []
    • Support both python-argcomplete 1.x and 2.x. []
    • Strip sticky etc. from x.deb: sticky Debian binary package […]. []
    • Integrate test coverage with GitLab’s concept of artifacts. [][][]
  • Bug fixes:

    • Don’t mask differences in .zip or .jar central directory extra fields. []
    • Don’t show a binary comparison of .zip or .jar files if we have observed at least one nested difference. []
  • Codebase improvements:

    • Substantially update comment for our calls to zipinfo and zipinfo -v. []
    • Use assert_diff in test_zip over calling get_data with a separate assert. []
    • Don’t call re.compile and then call .sub on the result; just call re.sub directly. []
    • Clarify the comment around the difference between --usage and --help. []
  • Testsuite improvements:

    • Test --help and --usage. []
    • Test that --help includes the file formats. []

Vagrant Cascadian added an external tool reference xb-tool for GNU Guix  [] as well as updated the diffoscope package in GNU Guix itself [][][].


Distribution work

In Debian, 41 reviews of Debian packages were added, 85 were updated and 13 were removed this month adding to our knowledge about identified issues. A number of issue types have been updated, including adding a new nondeterministic_ordering_in_deprecated_items_collected_by_doxygen toolchain issue [] as well as ones for mono_mastersummary_xml_files_inherit_filesystem_ordering [], extended_attributes_in_jar_file_created_without_manifest [] and apxs_captures_build_path [].

Vagrant Cascadian performed a rough check of the reproducibility of core package sets in GNU Guix, and in openSUSE, Bernhard M. Wiedemann posted his usual monthly reproducible builds status report.


Upstream patches

The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including:


Reproducible builds website

Chris Lamb updated the main Reproducible Builds website and documentation in a number of small ways, but also prepared and published an interview with Jan Nieuwenhuizen about Bootstrappable Builds, GNU Mes and GNU Guix. [][][][]

In addition, Tim Jones added a link to the Talos Linux project [] and billchenchina fixed a dead link [].


Testing framework

The Reproducible Builds project runs a significant testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:

  • Holger Levsen:

    • Add support for detecting running kernels that require attention. []
    • Temporarily configure a host to support performing Debian builds for packages that lack .buildinfo files. []
    • Update generated webpages to clarify wishes for feedback. []
    • Update copyright years on various scripts. []
  • Mattia Rizzolo:

    • Provide a facility so that Debian Live image generation can copy a file remotely. [][][][]
  • Roland Clobus:

    • Add initial support for testing generated images with OpenQA. []

And finally, as usual, node maintenance was also performed by Holger Levsen [][].


Misc news

On our mailing list this month:


Contact

If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. However, you can get in touch with us via:




View all our monthly reports