Measuring, Comparing, and Organizing reproducibility issues
why are we here
prioritizing projects to improve
sheer package count is easy to measure, but do all incremental steps matter equally? Probably not!
download counts? can easily be misleading (one company can download their own package a million times via CI, etc.).
package impact by downstream uses? (see the sketch after this list)
some ecosystems have data for this, like Debian's popcon (popularity contest)
different kinds of prioritization:
rank by how difficult to fix?
rank by how impactful to fix?
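a minimal sketch of the "impact by downstream uses" idea, assuming a hypothetical dependency graph (package -> packages it depends on); counting transitive reverse dependencies is one possible proxy for how impactful a fix would be:

```python
from collections import defaultdict

# Hypothetical dependency graph: package -> packages it depends on.
deps = {
    "app-a": {"libfoo", "libbar"},
    "app-b": {"libfoo"},
    "libbar": {"libfoo"},
    "libfoo": set(),
}

def transitive_reverse_deps(deps):
    """For each package, count how many packages depend on it, directly or transitively."""
    direct_rdeps = defaultdict(set)
    for pkg, ds in deps.items():
        for d in ds:
            direct_rdeps[d].add(pkg)

    def reachers(pkg, seen):
        for rd in direct_rdeps[pkg]:
            if rd not in seen:
                seen.add(rd)
                reachers(rd, seen)
        return seen

    return {pkg: len(reachers(pkg, set())) for pkg in deps}

# Rank packages by how many others would benefit from fixing them first.
impact = transitive_reverse_deps(deps)
for pkg, score in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(pkg, score)
```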
explaining
understanding what basis we can have for comparison between ecosystems.
it can be a problem for communicating our successes (and our todos, and the distance still ahead) when definitions of reproducibility don't have a clear rubric.
taxonomy & categorization
finding common problems… so we can find common solutions effectively.
would be great to be able to look at some decision tree for what kind of problem you're having and see a flowchart for what a probable solution is.
comparability
and for competition / incentivizing.
for example, distros comparing themselves to each other … would be good so there is a collective incentive to improve!
a public leaderboard would be cool – right now we can’t realistically compare things that clearly.
discussion
guidance information vs success criteria
we think it is useful to have tools that try to determine the reasons for reproducibility failures.
-> useful for deciding what to work on next
-> useful for guessing what fix to try next
break it down by factors
(diffoscope already does some of this!)
if it's by post-build analysis: this is heuristic – meaning we have to be VERY careful how literally we take it. (i.e., don't take it too literally!)
(diffoscope is in this category.)
if it's by controlled build environment variation, and a single variation shifts the result between bit-identical and failing – that's pretty good knowledge. (but this can be a bit expensive to run, and it definitely requires full automation of the build before trying it; see the sketch below.)
(reprotest is in this category!)
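a minimal sketch of the controlled-variation approach, assuming a hypothetical fully automated build (`./build.sh` producing `out.bin`, both placeholders); tools like reprotest do this for real with many more variation dimensions:

```python
import hashlib
import os
import subprocess

def build_and_hash(env_overrides, build_cmd="./build.sh", artifact="out.bin"):
    """Run one build with the given environment overrides and hash the artifact.

    build_cmd and artifact are hypothetical placeholders for a fully
    automated build and its output.
    """
    env = dict(os.environ, **env_overrides)
    subprocess.run(build_cmd, shell=True, check=True, env=env)
    with open(artifact, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

baseline = build_and_hash({"TZ": "UTC", "LC_ALL": "C"})

# Vary one factor at a time; a flip from identical to differing pins the factor.
for name, overrides in {
    "timezone": {"TZ": "Pacific/Auckland", "LC_ALL": "C"},
    "locale":   {"TZ": "UTC", "LC_ALL": "fr_FR.UTF-8"},
}.items():
    varied = build_and_hash(overrides)
    print(name, "identical" if varied == baseline else "differs")
```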
don’t confuse guidance info with success!
we DON’T want to get to a situation where an organization says something like: “we’re 97% reproducible” but what they really mean is “100% of our packages are not bit-for-bit identical… but it’s just timestamps, we swear! it’s all low priority!”.
(see further notes below about how we feel that “severity” cannot be usefully evaluated!)
about comparability…
is the buildinfo written before the first build, or is it just a log of what happened?
we like it better if it's made first, because it makes clearer that the intention of the original builder has been recorded.
but maybe the log approach is fine… if it includes hashes (see the next point).
does the buildinfo include hashes, or just package names and versions?
-> big impact: without hashes, the "build environment" has to be considered to encompass all the services involved in resolving those names to real content!
-> are ecosystems actually including this extended “build environment” management in their models?
things a hash in the buildinfo saves us from:
availability!
servers can disappear.
some of this has very nearly happened recently…
storage can also be trimmed…
if you have a hash, you can get the content from somewhere else. (see the verification sketch after this list)
security!
a compromise of the name resolution service becomes irrelevant if hashes are distributed with buildinfo.
safety against mere accidents!
and this is very real and practical: how long since this last happened? (a couple of days?) (see the event on the mailing list where a package name accidentally mapped to different content in two different builder staging environments…)
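a minimal sketch of why hashes in the buildinfo decouple us from any particular server, assuming a hypothetical buildinfo-like record and hypothetical mirror URLs; the pinned sha256 (a placeholder value here), not the server, is the source of truth:

```python
import hashlib
import urllib.request

# Hypothetical buildinfo-like record: inputs pinned by name, version, and sha256.
build_inputs = [
    {
        "name": "libfoo",
        "version": "1.2.3",
        "sha256": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    },
]

# Hypothetical mirrors: any of them will do, because the hash is what we trust.
mirrors = [
    "https://mirror-a.example.org/pool/",
    "https://mirror-b.example.org/pool/",
]

def fetch_verified(entry):
    filename = f"{entry['name']}-{entry['version']}.tar.gz"
    for mirror in mirrors:
        try:
            data = urllib.request.urlopen(mirror + filename).read()
        except OSError:
            continue  # server gone, trimmed, or unreachable: try the next one
        if hashlib.sha256(data).hexdigest() == entry["sha256"]:
            return data  # a compromise or accident in name resolution is caught here
    raise RuntimeError(f"no mirror served content matching the pinned hash for {filename}")

for entry in build_inputs:
    fetch_verified(entry)
```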
also good to let distros document which parts of controlled variation they don't care about. (a small sketch of this follows after this list.)
e.g. a distro can say they don’t care about cross-platform artifact convergence.
e.g. some distros have decided they expect builds to converge on reproducible artifacts even if the build path is varied… and other ecosystems have decided they don’t care.
controlled variation in general can be seen as something additional to reproducibility:
if something can be reproduced with a less specific build environment: we love it!
if something can be reproduced only with a fairly specific build environment: as long as that's clearly stated as part of the build environment… the definition of reproducibility we have today says yep, that's repro.
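a minimal sketch of a distro documenting which controlled variations it considers in scope (all names and policy choices here are hypothetical); a mismatch only counts against the distro when the varied factor is one it has declared it cares about:

```python
# Hypothetical per-distro policy: which controlled variations count.
policy = {
    "build_path": False,        # this distro makes no path-independence promise
    "timezone": True,
    "locale": True,
    "cpu_architecture": False,  # no cross-platform convergence promise
}

def counts_as_failure(varied_factor, artifacts_identical):
    """A mismatch only counts against the distro if it varied a factor that is in scope."""
    return policy.get(varied_factor, True) and not artifacts_identical

print(counts_as_failure("build_path", artifacts_identical=False))  # False: out of scope
print(counts_as_failure("timezone", artifacts_identical=False))    # True: in scope
```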
the value of comparability can… vary.
for programming language ecosystems that have only one major build tool:
the value is a little more limited.
cladistic tree stuff can still help them, if they want to improve.
but we can’t manifest competition out of thin air :)
for ecosystems with multiple build systems (python; linux distros vs each other; etc):
comparable metrics lets users choose systems based on how much they respect that system’s emphasis on reproducibility.
comparable metrics lets distros meaningfully compare their successes.
(discussion of wish to revisit the “achieve deterministic builds” page on the website)
there’s a list of issues there – great
it’s not very categorized – could be improved
some things are problem descriptions, some things are solutions – could be improved
possible inputs:
paper by Goswami (study on npm) has some cladistics that might be useful
“SoK: Towards Reproducibility of Software Packages in Scripting Language Ecosystems”
check out Fig 1.
Timo’s research
debian has a bunch of bug categories (unclear how much this is specific to debian's toolchain, but still probably lots of good reference material)
ismypackagereproducibleyet (bernhard effort)
more?
places where comparisons haven't served us so well…
some distros include tons of packages of… ancient origin, from before the reproducibility goal was even engaged with! So this gives them "worse" scores – in a way most would agree is not meaningful.
package prioritization: sometimes an ecosystem has made a pick for us.
e.g. Arch has an overall repro stat… but they have also distributed a container image that is 100% repro, which includes a subset of packages they call a reasonable core set.
some ecosystems have concepts like “build-essential” (by some name or other).
(is this interesting? Maybe.)
(for the purpose of choosing what to prioritize next? limited use.)
(doesn't seem to pop up in language package manager ecosystems (what's "core" is typically an even less clear question there).)
rubric vs tagging…?
the categories of repro failures that we understand will be refined over time…!
so: it may be better to use tags for understood problems with a package… vs using a rubric about what's successful… because the latter won't be revisited as often or as usefully.
would it be useful to have a “CVE-like” system for repro failure reasons?
a coordinated, public, shared resource.
(compare "CWE" – the Common Weakness Enumeration.)
several attributes could be useful (see the sketch after this list):
describing what packages have known issues.
tagging what kinds of known problems they have -> can hint towards remediations an ecosystem could apply to that package.
possibly even known exact remediations could be shared.
(more?)
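a minimal sketch of what one entry in such a CWE-like shared catalogue could carry; the field names and the "RWE" identifier scheme are made up for illustration, not an existing standard (and, per the point below, there is deliberately no severity field):

```python
from dataclasses import dataclass, field

@dataclass
class ReproWeakness:
    """One entry in a hypothetical shared catalogue of reproducibility failure classes."""
    id: str                     # e.g. "RWE-0001" (made-up scheme)
    title: str
    description: str
    remediations: list[str] = field(default_factory=list)

@dataclass
class PackageIssue:
    """A known issue on a concrete package, tagged with catalogue entries."""
    package: str
    ecosystem: str
    weakness_ids: list[str]
    notes: str = ""
    # deliberately no "severity" field: the session agreed it is context-dependent

timestamp_weakness = ReproWeakness(
    id="RWE-0001",
    title="Embedded build timestamp",
    description="The build embeds the current time into the output artifact.",
    remediations=["honour SOURCE_DATE_EPOCH", "strip timestamps in a post-processing step"],
)

issue = PackageIssue(
    package="libfoo",
    ecosystem="example-distro",
    weakness_ids=["RWE-0001"],
    notes="gzip mtime in the data tarball",
)
```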
universal agreement in the session: do not try to invent a score for “severity”.
-> this depends on context of usage, so getting prescriptive about it makes no sense.
(and nobody likes what happened in the CVSS system for this – the numbers are invented and vibe-based.)
it is very, very hard to say even something like "a timestamp variation could never be exploited by an attacker".
-> several examples in the wild where very small info leaks are used to trigger larger more subtle pieces of malicious logic.
there are examples of code changes that are 1 character in source, and 1 bit in binary output… that are the difference between an ssh server giving you remote root, or not. So the size of any diff is clearly inadmissible as a severity heuristic: 1 bit can be everything.
some package ecosystems have phases of their distribution pipeline that aren’t reproducible… even when the contents ultimately are.
we're kind of okay with this – provided that it's clear, and there exists a clearly measurable point in the pipeline.
e.g. signatures tend to cause non-reproducibility in practical ways – but if the pipeline had a phase where there’s a reproducible artifact, that’s still okay.
e.g. distributing stuff with a gzip wrapping… and that wrapping isn't considered in the reproducibility check – probably not a problem, as long as the artifact inside is what was observed.
critical check: the unreproducible part (e.g. the gzip header) must not be visible by the time the package is installed or used. (a small sketch of this check follows.)
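a minimal sketch of that check: the gzip header carries an mtime, so the outer bytes can differ while the decompressed payload is bit-identical; assuming the payload is what actually gets installed, comparing the payload is the measurable point (paths are hypothetical):

```python
import gzip
import hashlib

def payload_digest(path):
    """Hash the decompressed payload, ignoring the gzip header (mtime, filename, etc.)."""
    with gzip.open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def outer_digest(path):
    """Hash the raw file, gzip header included."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

a, b = "rebuild-1/pkg.tar.gz", "rebuild-2/pkg.tar.gz"  # hypothetical paths
print("outer identical:  ", outer_digest(a) == outer_digest(b))
print("payload identical:", payload_digest(a) == payload_digest(b))
# The second check is the one that matters, provided nothing from the gzip
# header survives into the installed or used package.
```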