Collaborative Working Sessions - Source mirrors

Reproducible Builds Summit 2022

The problem: Building old software and/or building current software far into the future

This isn’t strictly a build reproducibility problem, but more of a practical one. If you have software that builds reproducibly but lose the source material, then you can’t build it.

Current approaches (with disadvantages):

  • Project-specific stores of source material, maintained by each project
    • Some projects tweak or handle source material differently
    • Every project ends up duplicating this effort
  • Generic internet archives (e.g. archive.org)
    • Retention and availability might not be great
  • Software Heritage
    • Lacks some tarballs; Disarchive can help with this (GNU Guix uses Disarchive)

Proposed basic API:

  • Source archives should provide and communicate a way to query by a hash (treated as a simple string)
  • A query should result in either a message that this hash is unknown, or a single stream of data. It must be possible to verify that the stream of data matches the hash used in the query.

This leaves quite a few things undefined, but it is simple enough that archives supporting this interface should be generally useful.
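
As a rough illustration of the interface described above, here is a minimal Python sketch. Nothing in it was agreed in the session: the names InMemoryArchive, UnknownHash and fetch_verified are hypothetical, and the "sha256:<hex digest>" key format is an assumption made only so that the client knows how to verify the stream. The archive itself treats the key as an opaque string.

```python
import hashlib
from typing import Dict, Iterator


class UnknownHash(Exception):
    """Raised when the archive has no content for the queried hash."""


class InMemoryArchive:
    """Hypothetical stand-in for a source archive; keys are opaque strings."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        # Assumed key format: "<algorithm>:<hex digest>".
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def query(self, key: str, chunk_size: int = 65536) -> Iterator[bytes]:
        # Either report the hash as unknown, or return a single stream.
        try:
            data = self._blobs[key]
        except KeyError:
            raise UnknownHash(key) from None
        return (data[i:i + chunk_size] for i in range(0, len(data), chunk_size))


def fetch_verified(archive: InMemoryArchive, key: str) -> bytes:
    """Read the stream for `key` and check that it matches the hash."""
    algorithm, _, expected = key.partition(":")
    hasher = hashlib.new(algorithm)
    chunks = []
    for chunk in archive.query(key):
        hasher.update(chunk)
        chunks.append(chunk)
    if hasher.hexdigest() != expected:
        raise ValueError("stream does not match queried hash: " + key)
    return b"".join(chunks)
```

Verification happens on the client side, so an archive does not need to be trusted: a mirror that returns the wrong bytes is detected the same way as a corrupted download.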

Following on from this, here are some related problems:

  • A standard way of computing hashes over checkouts of version control repositories (e.g. Git) would be useful, and allow
  • How do you know what archives exist?
  • Is there a consistent API that many/all archives support?
  • A transparency log for source code would be useful
  • Being able to track new source code releases would be useful
  • Mirroring an archive would be useful
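
To make the first item above more concrete, here is one possible way to compute a deterministic hash over a checkout, sketched in Python. This is not an existing standard and was not proposed in the session: the function name tree_hash, the length-prefixed encoding, and the decision to skip ".git" are all assumptions. The sketch also ignores file modes, symlink targets and empty directories, which a real scheme would need to define.

```python
import hashlib
import os
from typing import Iterable


def tree_hash(root: str, exclude: Iterable[str] = (".git",)) -> str:
    """Hash relative paths and file contents in a fixed traversal order."""
    excluded = set(exclude)
    hasher = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune and sort in place so os.walk visits directories deterministically.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded)
        for name in sorted(f for f in filenames if f not in excluded):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            rel_bytes = rel.encode("utf-8", "surrogateescape")
            with open(path, "rb") as handle:
                data = handle.read()
            # Length prefixes keep (path, content) pairs unambiguous.
            hasher.update(len(rel_bytes).to_bytes(8, "big"))
            hasher.update(rel_bytes)
            hasher.update(len(data).to_bytes(8, "big"))
            hasher.update(data)
    return "sha256:" + hasher.hexdigest()
```

Two checkouts with the same files then produce the same string regardless of the order in which the files were written or whether version control metadata is present, so the result could be used as a query key for an archive.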

We consider that a well-designed standard for a transparency log with a serialized message format may solve several of these problems at once.

  • a serialized transparency log is a consistent API.
  • following a transparency log’s appends is a way to listen for new source code releases.
  • a transparency log (especially if it refers to a root hash of a Merkle tree containing current state) provides a list of all content tracked, which makes mirroring of the complete set of metadata possible.
  • a transparency log has an additional sociological effect: it should greatly disincentivize people in the community at large from “re-releasing” things with the same names, which is a generally chaotic thing that we would like to discourage!
  • a transparency log means that even if the full-size content bodies for some release are no longer retained, we do at least persistently know that the content existed, which may remove some uncertainty from any future archeology.
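
The points above can be sketched as a small append-only log in Python. Everything here is illustrative and assumed rather than taken from the session: the TransparencyLog class, the JSON entry layout with "name" and "hash" fields, and the simple binary Merkle tree (which is not claimed to match any particular standard).

```python
import hashlib
import json
from typing import Dict, List, Set


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> str:
    """Root of a simple binary Merkle tree over the serialized entries."""
    if not leaves:
        return _h(b"").hex()
    # Distinct prefixes keep leaf hashes and inner-node hashes apart.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
            else:
                nxt.append(level[i])  # odd node is promoted unchanged
        level = nxt
    return level[0].hex()


class TransparencyLog:
    """Hypothetical append-only log; each entry is one canonical JSON line."""

    def __init__(self) -> None:
        self._lines: List[bytes] = []

    def append(self, name: str, content_hash: str) -> int:
        line = json.dumps({"name": name, "hash": content_hash},
                          sort_keys=True, separators=(",", ":"))
        self._lines.append(line.encode("utf-8"))
        return len(self._lines) - 1  # index of the new entry

    def entries_since(self, index: int) -> List[dict]:
        # Followers and mirrors poll from the last index they have seen.
        return [json.loads(line) for line in self._lines[index:]]

    def root(self) -> str:
        return merkle_root(self._lines)

    def rereleased_names(self) -> Set[str]:
        # Names that were logged with more than one content hash.
        seen: Dict[str, Set[str]] = {}
        for entry in self.entries_since(0):
            seen.setdefault(entry["name"], set()).add(entry["hash"])
        return {name for name, hashes in seen.items() if len(hashes) > 1}
```

A mirror that replays entries_since(0) into its own log can compare its root() with the original's to confirm it holds the same metadata, even if it stores none of the content bodies. rereleased_names() shows how a re-release under an existing name becomes publicly visible.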

In a meta-analysis of the conversation we had: we notice that, in general, a great deal of conversation and many interesting topics were generated after we identified the requirement of easily mirroring subsets of the archive. For projects in this space, we would recommend that a central consideration be designing (and clearly communicating) the mechanism for supporting mirroring of the primary index (complete, partial, and incremental).