Collaborative Working Sessions - Language ecosystems
Markdowning by: Timo
Language ecosystems
Takeaways
- How to find the source?
- How to find the build instructions?
- How to make it easy for developers and rebuilders to make the packages reproducible?
Notes
- Taking lessons learned from one ecosystem and applying them to others
- Java rebuilds (Maven, Gradle, and more)
- Python and npm rebuilds
Some issues people have already encountered:
- Finding the right commit
- There might be multiple commits across multiple repositories (multirepo)
- There might be multiple projects in a single repo (monorepo)
- Finding the source material
- Fields intended for linking the source are not always used for the actual source
- Finding the build instructions
- attached signatures
- “poor deployment hygiene”, i.e. publishing irrelevant files like .DS_Store files or .vscode directories
- Split of internal and public repos, where you publish from an internal repo which may have slight differences
- unreproducible automation of build processes
- cache files
- finding the correct build tool in ecosystems that have multiple build tools available
- tooling that does not solve existing reproducibility problems, even if solutions are known, so the developers have to create their own implementations of these solutions
- non-declarative build specifications (e.g. setup.py)
- custom build scripts
- Nondeterministic obfuscation?
- Nondeterministic minification? Are the outputs reproducible if you know the exact build tool versions, or are they generally nondeterministic?
- Perhaps more generally: optimization techniques for size or for performance, and transformations like compilation, transpilation, …
- On-the-fly patching of Java versions during the build process; apparently Python sometimes does this as well.
- Sometimes build processes create local commits, record that commit SHA as the one the artifact was produced from, and then publish the artifact, but never push the respective commit to the public VCS.
- Lack of automation (or “weird” automation) makes publishing processes opaque.
- Missing link between source and artifact.
- How much trust do we put in a successfully rebuilt project? Do we assume the repository is the correct repo for future versions as well?
- Sometimes there is the same code in different locations, so it’s hard to find what is the “real” source.
- Developers often don’t care about reproducibility so they don’t give rebuilders a lot of hints on how to build their package.
- Developers have no incentive to make reproducibility happen because they don’t see any benefit.
- Should reproducibility be enforced?
- Perhaps automation, which may be a good way to facilitate more reproducible tooling, would be an incentive that makes things easier for developers.
-> What would be a good incentive for developers to want reproducible builds?
- Big players, such as the people developing the build tools, might be able to make the process more reproducible so that developers don’t have to care about it.
- Should sources perhaps have a metadata field pointing to the canonical location of the artifact?
- Sometimes the exact build tool version matters for a build, but at other times any version within some range is fine, and the less specific you have to be, the better you can scale up.
- How would you identify which version is acceptable?
- Perhaps the definition needs a bit of work so we can better differentiate between the environment, the source and the build instructions, in order to give developers clear guidelines on how to provide all three.
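
As a rough illustration of the “poor deployment hygiene” and cache file issues listed above, a rebuilder or publisher could scan an archive's file list for entries that usually do not belong in a release. This is only a sketch: the names `STRAY_NAMES`, `STRAY_SUFFIXES`, `find_stray_entries` and `stray_entries_in_zip` are hypothetical, and the pattern lists are assumptions for illustration, not something agreed on in the session.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical, non-exhaustive patterns for editor, OS, and cache residue.
STRAY_NAMES = {".DS_Store", ".vscode", ".idea", "__pycache__", "Thumbs.db"}
STRAY_SUFFIXES = {".pyc", ".pyo"}


def find_stray_entries(names):
    """Return the member names that look like files unrelated to the release."""
    stray = []
    for name in names:
        path = PurePosixPath(name)
        # Flag a member if any path component or its suffix matches a pattern.
        if STRAY_NAMES.intersection(path.parts) or path.suffix in STRAY_SUFFIXES:
            stray.append(name)
    return sorted(stray)


def stray_entries_in_zip(data: bytes):
    """Apply the same check to a zip-based artifact (e.g. a wheel or a jar)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return find_stray_entries(zf.namelist())
```

A check like this could run in the publishing automation before upload, so that developers would not have to think about it themselves.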