Measuring, Comparing, and Organizing reproducibility issues
why are we here
prioritizing projects to improve
sheer package count is easy to measure, but do all incremental steps matter equally? Probably not!
download counts? can easily be misleading (one company can download their own package a million times via CI, etc.).
package impact by downstream uses? (see the sketch after this list)
some ecosystems have data for this, like Debian's popcon (popularity contest)
different kinds of prioritization:
rank by how difficult to fix?
rank by how impactful to fix?
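a minimal sketch of the "impact by downstream uses" idea, assuming a hypothetical dependency graph (package -> packages it depends on); counting transitive reverse dependencies is one possible proxy for how impactful a fix would be:

```python
from collections import defaultdict

# Hypothetical dependency graph: package -> packages it depends on.
deps = {
    "app-a": {"libfoo", "libbar"},
    "app-b": {"libfoo"},
    "libbar": {"libfoo"},
    "libfoo": set(),
}

def transitive_reverse_deps(deps):
    """For each package, count how many packages depend on it, directly or transitively."""
    direct_rdeps = defaultdict(set)
    for pkg, ds in deps.items():
        for d in ds:
            direct_rdeps[d].add(pkg)

    def reachers(pkg, seen):
        for rd in direct_rdeps[pkg]:
            if rd not in seen:
                seen.add(rd)
                reachers(rd, seen)
        return seen

    return {pkg: len(reachers(pkg, set())) for pkg in deps}

# Rank packages by how many others would benefit from fixing them first.
impact = transitive_reverse_deps(deps)
for pkg, score in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(pkg, score)
```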
explaining
understanding what basis we can have for comparison between ecosystems.
it can be a problem for communicating our successes (and our todos, and the distance still ahead) when definitions of reproducibility don't have a clear rubric.
taxonomy & categorization
finding common problems… so we can find common solutions effectively.
would be great to be able to look at some decision tree for what kind of problem you're having and see a flowchart for what a probable solution is.
comparability
and for competition / incentivizing.
for example, distros comparing themselves to each other … would be good so there is a collective incentive to improve!
a public leaderboard would be cool – right now we can’t realistically compare things that clearly.
discussion
guidance information vs success criteria
we think it is useful to have tools that try to determine the reasons for reproducibility failures.
-> useful for deciding what to work on next
-> useful for guessing what fix to try next
break it down by factors
(diffoscope already does some of this!)
if it's by post-build analysis: this is heuristic – meaning we have to be VERY careful how literally we take it. (i.e., don't take it too literally!)
(diffoscope is in this category.)
if it's by controlled build environment variation, and a single variation shifts the result between bit-identical and failing – that's pretty good knowledge. (but this can be a bit expensive to run, and it definitely requires full automation of the build before trying it; see the sketch below.)
(reprotest is in this category!)
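a minimal sketch of the controlled-variation approach, assuming a hypothetical fully automated build (`./build.sh` producing `out.bin`, both placeholders); tools like reprotest do this for real with many more variation dimensions:

```python
import hashlib
import os
import subprocess

def build_and_hash(env_overrides, build_cmd="./build.sh", artifact="out.bin"):
    """Run one build with the given environment overrides and hash the artifact.

    build_cmd and artifact are hypothetical placeholders for a fully
    automated build and its output.
    """
    env = dict(os.environ, **env_overrides)
    subprocess.run(build_cmd, shell=True, check=True, env=env)
    with open(artifact, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

baseline = build_and_hash({"TZ": "UTC", "LC_ALL": "C"})

# Vary one factor at a time; a flip from identical to differing pins the factor.
for name, overrides in {
    "timezone": {"TZ": "Pacific/Auckland", "LC_ALL": "C"},
    "locale":   {"TZ": "UTC", "LC_ALL": "fr_FR.UTF-8"},
}.items():
    varied = build_and_hash(overrides)
    print(name, "identical" if varied == baseline else "differs")
```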
don’t confuse guidance info with success!
we DON’T want to get to a situation where an organization says something like: “we’re 97% reproducible” but what they really mean is “100% of our packages are not bit-for-bit identical… but it’s just timestamps, we swear! it’s all low priority!”.
(see further notes below about how we feel that “severity” cannot be usefully evaluated!)
about comparability…
is the buildinfo written before the first build, or is it just a log of what happened?
we like it better if it's made first, because it makes clearer that the intention of the original builder has been recorded.
but maybe the log approach is fine… if it includes hashes (see the next point).
does the buildinfo include hashes, or just package names and versions?
-> big impact: without hashes, the "build environment" has to be considered to encompass all the services involved in resolving those names to real content!
-> are ecosystems actually including this extended “build environment” management in their models?
things a hash in the buildinfo saves us from:
availability!
servers can disappear.
some of this has very nearly happened recently…
storage can also be trimmed…
if you have a hash, you can get the content from somewhere else. (see the verification sketch after this list)
security!
a compromise of the name resolution service becomes irrelevant if hashes are distributed with buildinfo.
safety against mere accidents!
and this is very real and practical: how long since this last happened? (a couple of days?) (see the event on the mailing list where a package name accidentally mapped to different content in two different builder staging environments…)
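a minimal sketch of why hashes in the buildinfo decouple us from any particular server, assuming a hypothetical buildinfo-like record and hypothetical mirror URLs; the pinned sha256 (a placeholder value here), not the server, is the source of truth:

```python
import hashlib
import urllib.request

# Hypothetical buildinfo-like record: inputs pinned by name, version, and sha256.
build_inputs = [
    {
        "name": "libfoo",
        "version": "1.2.3",
        "sha256": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    },
]

# Hypothetical mirrors: any of them will do, because the hash is what we trust.
mirrors = [
    "https://mirror-a.example.org/pool/",
    "https://mirror-b.example.org/pool/",
]

def fetch_verified(entry):
    filename = f"{entry['name']}-{entry['version']}.tar.gz"
    for mirror in mirrors:
        try:
            data = urllib.request.urlopen(mirror + filename).read()
        except OSError:
            continue  # server gone, trimmed, or unreachable: try the next one
        if hashlib.sha256(data).hexdigest() == entry["sha256"]:
            return data  # a compromise or accident in name resolution is caught here
    raise RuntimeError(f"no mirror served content matching the pinned hash for {filename}")

for entry in build_inputs:
    fetch_verified(entry)
```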
also good to let distros document which parts of controlled variation they don't care about. (a small sketch of this follows after this list.)
e.g. a distro can say they don’t care about cross-platform artifact convergence.
e.g. some distros have decided they expect builds to converge on reproducible artifacts even if the build path is varied… and other ecosystems have decided they don’t care.
controlled variation in general can be seen as something additional to reproducibility:
if something can be reproduced with a less specific build environment: we love it!
if something can be reproduced only with a fairly specific build environment: as long as that's clearly stated as part of the build environment… the definition of reproducibility we have today says yep, that's repro.
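a minimal sketch of a distro documenting which controlled variations it considers in scope (all names and policy choices here are hypothetical); a mismatch only counts against the distro when the varied factor is one it has declared it cares about:

```python
# Hypothetical per-distro policy: which controlled variations count.
policy = {
    "build_path": False,        # this distro makes no path-independence promise
    "timezone": True,
    "locale": True,
    "cpu_architecture": False,  # no cross-platform convergence promise
}

def counts_as_failure(varied_factor, artifacts_identical):
    """A mismatch only counts against the distro if it varied a factor that is in scope."""
    return policy.get(varied_factor, True) and not artifacts_identical

print(counts_as_failure("build_path", artifacts_identical=False))  # False: out of scope
print(counts_as_failure("timezone", artifacts_identical=False))    # True: in scope
```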
the value of comparability can… vary.
for programming language ecosystems that have only one major build tool:
the value is a little more limited.
cladistic tree stuff can still help them, if they want to improve.
but we can’t manifest competition out of thin air :)
for ecosystems with multiple build systems (python; linux distros vs each other; etc):
comparable metrics lets users choose systems based on how much they respect that system’s emphasis on reproducibility.
comparable metrics lets distros meaningfully compare their successes.
(discussion of wish to revisit the “achieve deterministic builds” page on the website)
there’s a list of issues there – great
it’s not very categorized – could be improved
some things are problem descriptions, some things are solutions – could be improved
possible inputs:
paper by Goswami (study on npm) has some cladistics that might be useful
“SoK: Towards Reproducibility of Software Packages in Scripting Language Ecosystems”
check out Fig 1.
Timo’s research
debian has a bunch of bug categories (unclear how much this is specific to debian's toolchain, but still probably lots of good reference material)
ismypackagereproducibleyet (bernhard effort)
more?
places where comparisons haven't served us so well…
some distros include tons of packages of… ancient origin, from before the reproducibility goal was even engaged with! So this gives them "worse" scores – in a way most would agree is not meaningful.
package prioritization: sometimes an ecosystem has made a pick for us.
e.g. Arch has an overall repro stat… but they have also distributed a container image that is 100% repro, which includes a subset of packages they call a reasonable core set.
some ecosystems have concepts like “build-essential” (by some name or other).
(is this interesting? Maybe.)
(for the purpose of choosing what to prioritize next? limited use.)
(doesn't seem to pop up in language package manager ecosystems (what's "core" is typically an even less clear question there).)
rubric vs tagging…?
the categories of repro failures that we understand will be refined over time…!
so: it may be better to use tags for understood problems with a package… vs using a rubric about what's successful… because the latter won't be revisited as often or as usefully.
would it be useful to have a “CVE-like” system for repro failure reasons?
a coordinated, public, shared resource.
(compare "CWE" – the Common Weakness Enumeration.)
several attributes could be useful (see the sketch after this list):
describing what packages have known issues.
tagging what kinds of known problems they have -> can hint towards remediations an ecosystem could apply to that package.
possibly even known exact remediations could be shared.
(more?)
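a minimal sketch of what one entry in such a CWE-like shared catalogue could carry; the field names and the "RWE" identifier scheme are made up for illustration, not an existing standard (and, per the point below, there is deliberately no severity field):

```python
from dataclasses import dataclass, field

@dataclass
class ReproWeakness:
    """One entry in a hypothetical shared catalogue of reproducibility failure classes."""
    id: str                     # e.g. "RWE-0001" (made-up scheme)
    title: str
    description: str
    remediations: list[str] = field(default_factory=list)

@dataclass
class PackageIssue:
    """A known issue on a concrete package, tagged with catalogue entries."""
    package: str
    ecosystem: str
    weakness_ids: list[str]
    notes: str = ""
    # deliberately no "severity" field: the session agreed it is context-dependent

timestamp_weakness = ReproWeakness(
    id="RWE-0001",
    title="Embedded build timestamp",
    description="The build embeds the current time into the output artifact.",
    remediations=["honour SOURCE_DATE_EPOCH", "strip timestamps in a post-processing step"],
)

issue = PackageIssue(
    package="libfoo",
    ecosystem="example-distro",
    weakness_ids=["RWE-0001"],
    notes="gzip mtime in the data tarball",
)
```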
universal agreement in the session: do not try to invent a score for “severity”.
-> this depends on context of usage, so getting prescriptive about it makes no sense.
(and nobody likes what happened in the CVSS system for this – the numbers are invented and vibe-based.)
it is very, very hard to say even something like "a timestamp variation could never be exploited by an attacker".
-> several examples in the wild where very small info leaks are used to trigger larger more subtle pieces of malicious logic.
there are examples of code changes that are 1 character in source, and 1 bit in binary output… that are the difference between an ssh server giving you remote root, or not. So the size of any diff is clearly inadmissible as a severity heuristic: 1 bit can be everything.
some package ecosystems have phases of their distribution pipeline that aren’t reproducible… even when the contents ultimately are.
we're kind of okay with this – provided that it's clear, and there exists a clearly measurable point in the pipeline.
e.g. signatures tend to cause non-reproducibility in practical ways – but if the pipeline had a phase where there’s a reproducible artifact, that’s still okay.
e.g. distributing stuff with a gzip wrapping… and that wrapping isn't considered in the reproducibility check – probably not a problem, as long as the artifact inside is what was observed.
critical check: the unreproducible part (e.g. the gzip header) must not be visible by the time the package is installed or used. (a small sketch of this check follows.)
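a minimal sketch of that check: the gzip header carries an mtime, so the outer bytes can differ while the decompressed payload is bit-identical; assuming the payload is what actually gets installed, comparing the payload is the measurable point (paths are hypothetical):

```python
import gzip
import hashlib

def payload_digest(path):
    """Hash the decompressed payload, ignoring the gzip header (mtime, filename, etc.)."""
    with gzip.open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def outer_digest(path):
    """Hash the raw file, gzip header included."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

a, b = "rebuild-1/pkg.tar.gz", "rebuild-2/pkg.tar.gz"  # hypothetical paths
print("outer identical:  ", outer_digest(a) == outer_digest(b))
print("payload identical:", payload_digest(a) == payload_digest(b))
# The second check is the one that matters, provided nothing from the gzip
# header survives into the installed or used package.
```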