Reproducible Builds Summit III event documentation

Berlin, Germany. October 31 – November 2, 2017

Event documentation

This is a work in progress: this document currently still contains notes, which will all be moved to separate pages, at which point this URL will vanish and everything will be accessible via the Agenda.

Session Notes

Day 1

Agenda brainstorming

Working sessions I

Working sessions II

Day 2

Working sessions III
Working sessions IV
Day 3
Working sessions V

Working sessions VI

Working sessions VII


Brainstorming the reproducible builds logo design

Session Notes

Day 1

How bootstrapping relates to reproducible builds and how to improve it

If we have an exe at the top and a lib it depends on, do we call the exe reproducible even if the lib is not reproducible? Is trust binary / black & white? No…

overlap:

diff: reproducible builds: done when 100% of packages are building reproducibly

bootstrappable: (C-part) done when one can take any C-compiler and compile the production C-compiler and with that get bit-identical binaries of anything

Requirement for doing bootstrappable builds:

Note: trust is not transitive (unlike a=b=c implying a=c), so “the sister of a friend knows someone who verified this” carries less trust than “I verified this”. Possibly this is because even trusting someone very much translates to a factor of 0.9x, so for every level of indirection you lose some trust.
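
The arithmetic behind that note can be sketched in a few lines. This is only an illustration: the function name `trust_after` is hypothetical, and the 0.9 factor is the example value from the discussion, not a measured quantity.

```python
def trust_after(levels: int, factor: float = 0.9) -> float:
    """Remaining trust after `levels` of indirection.

    Each level of indirection multiplies trust by `factor`
    (0.9 is the illustrative value from the session notes).
    """
    if levels < 0:
        raise ValueError("levels must be non-negative")
    return factor ** levels


if __name__ == "__main__":
    # "I verified this" is zero levels of indirection.
    for levels in range(4):
        print(levels, round(trust_after(levels), 3))
```

With these assumptions, a claim relayed through three people retains noticeably less trust than one verified first-hand, which matches the intuition in the note.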

F-Droid: using Debian binaries as much as possible because they are built from source and thus more trustworthy.

guix: build archive with checksums of everything with 218MB bootstrap binaries

openSUSE: uses Ring-0

Goal: come up with very small set of auditable binaries+sources

https://gitlab.com/janneke/mes is close

https://savannah.nongnu.org/projects/stage0

Goal: need zero trust in the seed set of binaries - cannot be fully reached, but we can get to very small (maybe infinitesimal) values of trust needed.

How to distinguish trusted bootstrap binaries from other binaries?
Identify important next steps:
What documentation is still to be created for developers who are new to reproducible builds?

Newbie docs


Missing: hands-on-guide, good examples

Guide:

Good examples:
what’s reproducible?
new document (wanted): “how to contribute”
Day 2
Working sessions III

Improving reproducible builds in Java

BUT: Gradle since version 3.0 depends on groovy >= 2.0 which depends on gradle.


Javadoc -> produces unreproducible output based on filesystem order

maven 3->groovy

groovy -> gradle

Gradle depends on over 300 previous versions of itself.

Bootstrapping: Mapping the problem space

Transcription of the poster

Graph:

slow simple lisp -> mes -> guile -> nyacc + mescc

This graph may require:

Best practices and open issues in regards to engaging upstreams

solution: ask on IRC or via direct mail on how to contribute

“what can I do to help you merge this” in the case of unmerged or untouched patches.

problem: deprecated branches.

-> adopt in https://github.com/distropatches/

problem: patch sent during deep freeze, so patch is ignored/delayed

solution: need to ping at the right time

idea: keep and share template snippets of well-formulated commit messages or bug reports, including references to documentation, so they can be more verbose and useful to upstreams than when writing from scratch every time.

problem: have to sign CLA.

- e.g. Qt, Google, GNU, Facebook, Python

How can policies help the end user to define what they want in terms of reproducibility?

Policies:

Questions:

Example policies

Example 1:

REBUILDERS BUILD INFO FILES

rebuilder A: pkgX matches digest N.

rebuilder B: pkgX matches digest N.

rebuilder C: pkgX matches digest N.

MACHINE:


policy configured to:

Local admin could pick a set of rebuilders and pick build info files from the 3 and check if they match.

If the three build info files don’t match we could.

This is what we would need to define in the policy.

Rebuilders:

EASY GUI:

ADVANCED GUI:

REPOSITORY X

Trusted rebuilders INCLUDED REQUIRED


Number of rebuilders needed for consensus (m out of n): 2


By default weight is binary (0/1), and in advanced mode we could use non-binary weights.
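
A minimal sketch of such an "m out of n" consensus check, assuming each rebuilder reports one digest per package. The function name `consensus_digest`, its parameters and the data layout are hypothetical and not part of any existing tool; weights default to binary (1 for a listed rebuilder) and can be overridden for the advanced mode.

```python
from collections import defaultdict
from typing import Dict, Optional


def consensus_digest(
    reports: Dict[str, str],
    threshold: float = 2,
    weights: Optional[Dict[str, float]] = None,
) -> Optional[str]:
    """Return the digest that reaches the threshold, or None.

    reports maps a rebuilder name to the digest it obtained.
    weights maps a rebuilder name to its weight; rebuilders missing
    from an explicit weights mapping count as 0 (untrusted).
    """
    totals: Dict[str, float] = defaultdict(float)
    for rebuilder, digest in reports.items():
        weight = 1 if weights is None else weights.get(rebuilder, 0)
        totals[digest] += weight

    winners = [digest for digest, total in totals.items() if total >= threshold]
    # Zero winners: not enough agreement. Several winners: conflicting
    # majorities. Both cases are left to the consensus fail policy.
    return winners[0] if len(winners) == 1 else None
```

For example, `consensus_digest({"A": "N", "B": "N", "C": "M"})` accepts digest `N` with the default threshold of 2, while two rebuilders that disagree yield `None`.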

Consensus fail policy

Log policy

It seems doable! But what’s the cost/benefit ratio of it?

not as part of the installation process

→ Let’s extract the smallest useful subset of this in another session!

Working sessions IV

Mapping out archive formats

| | | tar | zip | cpio | git | casync | ostree | squashfs | ar (text) | iso |
|--|--|--|--|--|--|--|--|--|--|--|
| | has canonical packed form (e.g. no implicit traversal order) | not by default | yes (not sure how stable) | yes | N/A | | | | | |
| | Can be canonicalized (with enough flags) | yes (with arbitrary choices) | | | | | | | no | |
| | seekable | no | yes | | sorta (packs: no, and don't ask) | catar | | | yes | |
| | sparse files (kernel API is very recent) | yes | | | | long story (will store zeros efficiently and…) | | yes | no | |
| abilities: | mmap'able | yes, but not useful due to no seek | | | | catar is | | | | |
| abilities: | unpack related trees can conserve disk space | no | no | | no | hardlink or reflinks | hardlink | | no | |
| abilities: | pack related tree can dedup | no | no | | yes (it's complicated) | rabin blocking (very good) | file-scale (no chunking) | | | |
| abilities: | SPECIAL applications | popular for src releases | | | source | | | | | |
| | dev (maj/min) | yes | | yes | no | yes | | yes | | |
| | fifo | yes | | | no | yes | | | no | |
| | sockets | yes | | | no | yes | | | no | |
| | posix &0111 bits (\|x) | yes | yes | yes | yes | yes | | yes | yes | |
| | posix &0777 (rwx) | yes | yes | yes | no | yes | | yes | yes | |
| | mtime (any? nano? 1sec? 2sec?) mtime ZONEs? | 1s (unix) *gz may add another | 2s (timezones) * per file compression may have another | 1s | no | yes | | 1s | 1s | |
| | xattr | yes | not | no (patches floating) | not | yes | | yes | no | |
| | arbitrarily long filenames | yes, but three or more encoding variations | | | yes | yes (to linux's max…dissent) | | | no | |
| metadata: | symlinks | yes | yes (extension) | | yes | yes | | | no | |
| metadata: | hardlinks | yes (but weird) | | | not | no (maybe someday) | | | no | |
| metadata: | uid/gid (int) | yes (BOTH‽) | yes (extension) | | yes | yes | | yes | yes | |
| metadata: | user/group name | yes (BOTH‽) | yes (extension) | | not | yes | | not | not | |
| metadata: | suid bit | yes | yes (extension) | | not | yes | | | yes | |
| metadata: | expanded file size | | | | | yes | | | | |
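
To illustrate the "can be canonicalized (with arbitrary choices)" entry for tar, here is a hedged sketch using Python's standard `tarfile` module. The function name `canonical_tar` and the specific choices (sorted member order, uid/gid 0, empty owner names, one fixed mtime, GNU format) are assumptions made for this example, not an agreed canonical form.

```python
import io
import os
import tarfile


def _normalize(info: tarfile.TarInfo, mtime: int) -> tarfile.TarInfo:
    """Strip metadata that varies between machines and build times."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = mtime
    return info


def canonical_tar(root: str, mtime: int = 0) -> bytes:
    """Pack the tree under `root` into a tar archive with a fixed member order."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sorting removes the dependency on filesystem traversal order.
        for path in sorted(paths):
            tar.add(
                path,
                arcname=os.path.relpath(path, root),
                recursive=False,
                filter=lambda info: _normalize(info, mtime),
            )
    return buffer.getvalue()
```

File modes are kept as found on disk, so two trees only produce identical archives if their permission bits also match.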

Building a system image from existing binaries

versions of the external packages that are used.

Downloading binaries from remote places

Binaries

Build environment

similar concept to .buildinfo, which describes build-dependencies

“is this reproducible”

FS images

Setting SOURCE_DATE_EPOCH / timestamps

image is

Cache files / other post-installation products:

Other sources of unreproducibility

Use hash information in .buildinfo to reproduce d-installer output (images)

Where we’re coming from

What can law enforcement force you to do?

Side question:

How can we defend ourselves?

Dealing with jurisdictions is very difficult, and many of us live in one country with a passport issued by another, making it even more complex.

Report

Very complex topic, which raised more questions than answers - for now at least. We focused on the situations in which we can individually be pushed to do things detrimental to the integrity of software projects, and on ways to defend ourselves. Some of us are employed by big companies; foundations may manage the biggest projects, but not always. We need to ask them which steps they would and could take for us. With their answers we could, or should, come up with a more general guide on how to defend ourselves.

Marketing: Why is it valuable to support the reproducible builds work and who is our audience?

We identified four clusters that marketing relates to: users, developers, management, and the free software community. There is agreement that marketing is important to become visible, thereby moving the burden of discovering RB from the audience to us. It is also important to increase diversity in the RB community.

what is the message we have for them?
People often follow scientists, so publishing something on RB in journals might help motivate compiler builders as well.

Transcription of posters:

Marketing

Poster 1: WHY?

Management:

Poster 2: WHO?

TOP THREE AUDIENCES TO FOCUS ON:

Day 3

Working sessions V

Defining terminology: reproducible, bootstrappable, reliable

Pad1

Directed graph we came up with in dot:

We had concepts:

We identified goals:

Pad2

Reproducible, deterministic:

Replicability:

Reliable:

Trust

Bootstrappable

SOURCE_DATE_EPOCH specification: Overview and improvements needed

SOURCE_DATE_EPOCH

however generated files that end up in final output

-> clamping: old files stay old, but newer files get set to SOURCE_DATE_EPOCH
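
A minimal sketch of clamping as described here: files older than SOURCE_DATE_EPOCH keep their mtime, newer ones are set to it. The function name `clamp_mtimes` is hypothetical, and skipping symlinks is a simplification made for this example.

```python
import os


def clamp_mtimes(root: str, source_date_epoch: int) -> int:
    """Clamp mtimes below `root` to source_date_epoch.

    Returns the number of entries whose mtime was lowered.
    """
    clamped = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # simplification: leave symlinks untouched
            if os.stat(path).st_mtime > source_date_epoch:
                # Newer than the epoch: set atime and mtime to the epoch.
                os.utime(path, (source_date_epoch, source_date_epoch))
                clamped += 1
    return clamped
```

In a build script the timestamp would typically come from the environment, e.g. `clamp_mtimes("output", int(os.environ["SOURCE_DATE_EPOCH"]))`, where `output` stands for whatever directory holds the build result.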

This is existing practice already, so we should update the spec to reflect this.

“The output of the build should look as if the build had happened instantly at SOURCE_DATE_EPOCH.”

-> Will use that sentence as motivational summary in the introduction (but not as part of the definition)

is a problem here; forbidding clamping would make this a non-problem)

What they do is take current time, but store the value in their .buildinfo file; so the .buildinfo file effectively becomes part of the source file.

Discussion on whether this seems fine (the spec allows a .buildinfo file to be considered source; but the majority of this discussion group considers it preferable to take the time from version control).

to still work

have better caching. OK? Does the spec have to allow it? (Technically, a subprocess does not see a value set to a parent process, nor a newer value).

properly isolated

-> set to the maximum of all dependencies?

Setting up build environments for reproducibility
Complex Build Environments

- which installs build-dependencies at build time (sbuild, schroot, pbuilder). guix/nix have a similar concept by design.

(verified built from source elsewhere).

What is needed to run rebuilders?


Example scenario

Documentation:

Actually we don’t need this: apart from the signing key and the contents of the buildinfo, we don’t leak much.

What does the concept of rebuilding mean?

For example, do we try to reproduce a previous build in a build env. that’s as close as possible to the previous one, or do we “just” rebuild and compare the results? Or do we do the latter and then if it doesn’t match, we retry with a build env. closer to the buildinfo?

How do we rebuild?

How/where do we store results of rebuilds?

(both matching and non-matching)

Prior art:

Do we want a shared database of rebuilds? Standardize? Centralize?
How do we lookup reports?

Marketing

How do we convince orgs to run rebuilders?
Bonus features & ideas

Working sessions VI

Mapping out our short- and long-term goals

(by decreasing order of priority)

One year


Two years

Distant goals

Unknown
How to onboard new contributors

Related: https://pad.riseup.net/p/reproduciblebuildsIII-newcomerdocumentation

-> problem: resources are spread out across ML, git, alioth, wiki

more advertising of toolchain patches

For developers, wanting to get reproducible results from make

describe your build environment

Q: how to make my software reproducible?

Maybe: How to make sure that it remains fixed?

Add “for developers” / “how to get involved” page to r-b.org
People without own software:

What can I do to help?

The news blog lives under Debian.org instead of r-b.org; the news on r-b.org is not updated and can appear orphaned. Move it? Separate marketing updates from developer information?

Identifying next actionable steps for marketing outreach
For each audience we want to identify:
Developers in general

Developers:
Values:
Action items:
Toolchain developers

They have 2 key issues:

Values:
Example of success story:
Action items
Academics

We have some scare stories that prove that non-reproducible software can lead to non-reproducible papers:

Benchmarks are only reproducible if the benchmarking software can itself be built reproducibly.

A version string is not enough to describe what software one shall use to reproduce results: e.g. two R binaries built from the same source, but with different versions of build-dependencies, can produce different results.

ACTION: write “The Unreproducible Paper” (similar to “the unreproducible package”) that shows how so-called reproducible research results can’t actually be reproduced if they rely on non-reproducible software.

Companies

Value of RBs to them

What has not changed?

Software i.e. competitive argument vs. other FOSS vendors (this requires the software to be FOSS in the first place) E.g. RHEL vs. CentOS: why should I trust the RHEL binaries if I can’t reproduce them?

Success stories

ACTION: quotable reference needed

ACTION: find out if they use reproducible builds; if yes, why?

We get a success story and probably new selling points; if not, then it’s a good case for arguing in favor of RBs.

Funding reproducible builds work

Funding for RB - Mapping Session

3 steps: mapping the status quo of resources currently flowing into RB; identifying what is working well and where there is room for improvement; finding solutions to the needs that have to be addressed

Funding consists of 4 major groups:
Solutions:
Money:
Bounty Hunters

Time & Community

delegate
Working sessions VII

Enabling cross-distro reproducibility

repro-builds

DDC: Diverse Double Compilation

Recommendation: gcc 4.7 (which doesn’t require C++ support to compile)


Goal:
Current state:
Goal:
Current state:
Current state:

Current state:

New goal:



This is DDC (Diverse double compilation)

|-------------| |-------------| |-------------|
|tinycc source| == |tinycc source| == |tinycc source|
|-------------| |-------------| |-------------|

↓ ↓ ↓

|-------------| |-------------| |-------------|
|any c compile| != |any c compile| != |any c compile| (any c compiler)
|-------------| |-------------| |-------------|


↓ ↓ ↓


|-------------| |-------------| |-------------|
|tinycc bin 1 | != |tinycc bin 2 | != |tinycc bin 3 |
|-------------| |-------------| |-------------|

|-------------| |-------------| |-------------|
|tinycc source| == |tinycc source| == |tinycc source|
|-------------| |-------------| |-------------|


↓ ↓ ↓


|-------------| |-------------| |-------------|
|tinycc bin 1 | != |tinycc bin 2 | != |tinycc bin 3 |
|-------------| |-------------| |-------------|


↓ ↓ ↓


|-------------| |-------------| |-------------|
| tinycc bin | == | tinycc bin | == | tinycc bin |
|-------------| |-------------| |-------------|




End goal: Create a common bootstrap method and start comparing hashes.
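
The final comparison step could look like the sketch below, assuming each distribution has already produced its own final tinycc binary by the common bootstrap method. The helper names `sha256_of` and `binaries_match` are hypothetical.

```python
import hashlib
from typing import Iterable


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binaries_match(paths: Iterable[str]) -> bool:
    """True if all given files have the same SHA-256 digest."""
    digests = {sha256_of(path) for path in paths}
    return len(digests) == 1
```

A mismatch does not say which build is wrong; it only shows that at least one of the compilers or environments involved behaved differently.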

Exploring reproducibility for Mac and Windows

Cross-building reproducibly

For Windows
Instructions can be found at:

https://git-rw.torproject.org:builders/tor-browser-build.git

(there is a README file)

For Mac OS X and iOS
However, it is possible to:

In the same repository as above, there are also instructions for cross-building for Mac OS X. It downloads and leverages the official SDK for Mac OS X.

It is technically possible to run Mac OS X in a VM, although it is apparently illegal. The Mac OS X installer can be leveraged to generate bootable removable media to install it. VirtualBox supports it explicitly: there is a host profile for Mac OS X on 64-bit Intel.

A container technology is also available for Mac OS X.

General notes

What does code signing mean in terms of reproducibility?
code signing

Prioritizing the minimum viable set of tools needed for end users
Even nicer:
OPEN QUESTIONS:
ACTION ITEMS:
Transcription of poster notes

day 3 PM

User Policy/Implementation

(@left top)

(@middle) (@right top)

Signed .buildinfos (ftp master mirror)

buildinfo.debian.net

(@left corner)

buildinfo query


(@right corner) (@also right corner)

Discussing current status and potential improvements in regards to .buildinfo files for RPM and iso

Task: find someone who does RPM development, or someone from Red Hat, Fedora, or openSUSE

Task: show RPM .buildinfo. Add checksums (sha256) of the inputs to the .buildinfo files.
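
As a rough illustration of the second task, the sketch below computes one checksum line per input file. The line layout (digest, size, file name) is borrowed from the `Checksums-Sha256` field of Debian .buildinfo files; using it for RPM is purely an assumption, and the function name `checksum_lines` is hypothetical.

```python
import hashlib
import os
from typing import Iterable, List


def checksum_lines(paths: Iterable[str]) -> List[str]:
    """Return ' <sha256> <size> <name>' lines for the given input files."""
    lines = []
    # Sort so the output does not depend on the order inputs were listed in.
    for path in sorted(paths):
        with open(path, "rb") as handle:
            data = handle.read()
        digest = hashlib.sha256(data).hexdigest()
        lines.append(f" {digest} {len(data)} {os.path.basename(path)}")
    return lines
```

Reading each file fully into memory keeps the example short; large inputs would be better hashed in chunks.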

iso

The iso consists of:

“Classic” Build Environment