BUILD_PATH_PREFIX_MAP specification

Revision History
Revision 1.0-DRAFT-DO-NOT-USE-YET	24 February 2017
Initial version.

Introduction

This specification describes the environment variable BUILD_PATH_PREFIX_MAP for build tools to exchange information about the build-time filesystem layout, to generate reproducible output where all embedded paths are independent of that layout.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

The canonical URI for this document is: https://reproducible-builds.org/specs/build-path-prefix-map/.

Motivation

Software packages are often unreproducible because they embed build-time paths into generated files. At the time of writing, this affects an estimated 15-20% of 25000 Debian packages being unreproducible when the build path is varied.

These paths do not exist at runtime, nor do they exist in the source code. An improvement is to use relative paths that depend only on the source layout. These paths can also be easily used as runtime paths, for example to access source-code packages installed into standard paths, for IDEs and debuggers.

This requires knowledge of where the top-level source directories are, and where (if any) the top-level out-of-tree-build directories are. However, some toolchains were not designed to make this information available to the lower-level tools that generate those build outputs with embedded paths.

This specification therefore defines a distribution-agnostic standard for lower-level tools to consume this information from higher-level tools.

For more discussion on the mechanism of environment variables, see our page on Standard Environment Variables. Other mechanisms were also explored, but they were not completely effective and only fixed some cases of the issue whilst generating new cases. For example, automatically setting GCC's -fdebug-prefix-map option does not fix other sources of irreproducibility such as __FILE__, and in some packages the option itself is saved into other parts of the build output by another tool.

For more information on Reproducible Builds in general, include its motivation and its uses, see https://reproducible-builds.org/.

Specification

Overview

The environment variable BUILD_PATH_PREFIX_MAP represents an ordered map that associates reproducible path prefixes that may appear in the build output, with path prefixes that exist on the build-time filesystem that must not appear in the build output. We use the following terms:

A producer is a program that knows how to determine appropriate values for the map, and can pass this information to lower-level build tools. For example, a distribution's top-level package builder, or a high-level buildsystem.

A consumer is a program that generates output that contains paths, but does not by itself know enough information about the filesystem layout to be able to appropriately strip the build-time specific parts of those paths. Instead, it relies on another tool to pass this information in. For example, a C compiler.

The value of the variable MUST NOT be saved as-is into any output meant to form part of a reproducible binary artefact.

Encoding and decoding the variable

This section describes a data structure encoding, from a list-of-pairs where each pair holds two strings, into a single string.

We use the phrases left- and right end of the list, to refer to the parts of the list that respectively correspond to the left (start) and right (end) ends of the string that it was parsed from, and vice versa.

On POSIX systems these strings are a sequence of 8-bit bytes. On Windows systems these strings are a sequence of 16-bit wchar_t words. On both platforms, these string types are the types of both filesystem paths and environment variables on that platform.

When implementing this data structure encoding, either (a) you MUST directly operate on the system string types described above without also decoding or encoding them using a character encoding such as UTF-8 or UTF-16; or (b) if you must use a character encoding e.g. because your language's standard libraries force you to, then either it is total and injective over the system string type ^[1], or you MUST raise a parse error for inputs where it is undefined or not injective. (See ^[2] for further details and guidance on how to do this.)

The encoding is as follows:

The : character separates encoded list items from each other.

Empty subsequences between : characters, or between a : character and either the left or right end of the value, are valid and are ignored. ^[3]
Each encoded list item contains exactly one = character, that separates encoded pair elements.

If there are zero or more than one = characters, this is a parse error. ^[4]

The encoded pair elements may be empty; this does not need special-casing if the rest of the document is implemented correctly.
Each encoded pair element is encoded with the following mapping:
1. % → %#
2. = → %+
3. : → %.
When decoding, % characters at the end of a string are a parse error, as are %X substrings where X is any character not in #+..

This encoding allows paths containing %, =, : to be mapped; since users may want to run their builds under such paths. However as a producer, if this is not possible for your consumers, for example because you directly restrict the possible build paths, then you may omit this encoding logic.

Our choice of characters means there is flexibility in the order in which these mappings can be applied. The only restriction is that the % → %# mapping for encoding must not be applied on already-encoded %-substrings; and that the %+ → =, %. → : mappings for decoding must not be applied on already-decoded %-substrings. This is meant to ease implementation in a variety of programming languages.

Our recommended approach for a high-level language with string replace:
1. Decoding:
  1. check that elem does not match the regex /%[^#+.]|%$/g, then
  2. elem.replace("%.", ":").replace("%+", "=").replace("%#", "%")
2. Encoding:
  1. elem.replace("%", "%#").replace("=", "%+").replace(":", "%.")
Our recommended approach for a low-level language without string replace:
1. Decoding:
  - one single left-to-right pass with lookahead (e.g. see our C example), or
  - one single left-to-right pass with lookbehind (e.g. see our Rust example)
2. Encoding:
  - We don't anticipate this to be a major use-case

In the event of parse errors, the whole value of the variable should be treated as invalid rather than silently using only the "good" parts. The program should exit with an error code appropriate for the context, or if this is not possible then the parser must communicate the error in some way to the caller.

Setting the encoded value

Producers SHOULD NOT overwrite existing values; instead they should append their new mappings onto the right of any existing value.

Producers that expect reproducible output MUST append at least one distinct mapping for each top-level directory that the build is expected to take place under. Each mapping has the reproducible target path prefix on the left, and the unreproducible source path prefix on the right.

Producers such as top-level distribution package builders, that expect arbitrary consumers to read this variable, MUST NOT expect any additional constraints or special contracts on the output emitted by these consumers ― only that their output be reproducible when the build path changes and the value of this variable is changed to match the new paths.

On the other hand, producers that only support a limited set of consumers, may expect that these consumers apply these mappings in particular ways.

See also the requirements for consumers in the next part for guidance.

Applying the decoded structure

Consumers MUST ensure that, at minimum: for all (target, source) prefix pairs in the parsed list, with rightmost pairs taking priority: strings in the final build output, that represent build-time paths derived from source, instead appear to represent potential run-time paths derived from target.

As a consequence, consumers MUST apply mappings as above, regardless of whether the source prefix ends with a directory separator or not.

We do not define "derived from" more specifically, since this may be different for different consumers (languages, buildsystems, etc), and a more specific definition might conflict with their idea of what that means. Generally, consumers SHOULD implement one of the following algorithms:

For each (target, source) prefix pair in the list-of-pairs, going from right to left: if the subject path starts with the source prefix, then replace this occurence with the target prefix, and return this new path, ignoring any pairs further left in the list.
As in (1) but with "starts with" replaced by "starts with, restricted to whole-path components". So for example,
- /path/to/a/b/c "starts with" /path/to/a
- /path/to/aa/b/c does not "start with" /path/to/a
This has more robust semantics but is slightly more complex to implement.

Consumers MAY for historical reasons internally store the map with the prefix pairs flipped as in (source, target), instead of (target, source) as described above. New code should prefer the latter representation.

See the appendix for implementation notes on case-insensitive filesystems.

Notes and links

^[1] In practice, this means any two byte sequences that are invalid UTF-8, or wchar_t sequences that are invalid UTF-16, must be decoded into distinct application-level character string values. This is not satisfied by most standard Unicode decoding strategies, which is to replace all invalid input sequences with U+FFFD REPLACEMENT CHARACTER. For these cases, the easiest way to be compliant is to check the decode output for this character and raise an "unsupported" exception if it is found.

^[2] Detailed implementation notes and advice are available on our wiki page. Example source code is also available there, as well as in runnable form in our git repository. The test vectors from this document's appendix are also available there.

^[3] This is to make it easier for producers to append values, e.g. as in old_value += ":" + encoded_pair which would be valid even if the value is originally empty.

^[4] This is to "fail early" in case a naive producer does not encode characters like = but the build path or target path does actually contain them.

References

POSIX system strings

Definitions (no HTTPS) -- see "Pathname", "String" and "Byte".
Environment Variables (no HTTPS) for the type of environ.
limits.h - implementation-defined constants (no HTTPS) for the definition of CHAR_BIT.

Windows system strings

Windows strings are commonly advertised as "UTF-16", however for environment variable values and filesystem paths the system APIs do not enforce validity of the 16-bit strings passed to it. In other words, it is UCS-2, but this term is deprecated (no HTTPS).

So in practice, user code should not assume that these system strings are valid UTF-16, and should be able to deal with invalid UTF-16 strings. The easiest way to do this, is to treat them as opaque 16-bit sequences with no encoding.

File Management > About File Management > Creating, Deleting, and Maintaining Files
Visual C++ / Documentation / C Runtime Library / [..] / CRT Alphabetical Function Reference / getenv_s, _wgetenv_s
... > C Language Reference > ... > C Identifiers > Multibyte and Wide Characters Note that what Microsoft calls "Wide Characters" and "Unicode" is actually valid-or-invalid UTF-16 as described above, not decoded Unicode code points.
Unicode and Character Sets > About Unicode and Character Sets > Character Sets This often-cited page is not actually relevant to filesystem paths or environment variable values, and rather instead refers to how Windows applications deal with userland, not system, character data.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contributors

Daniel Kahn Gillmor

Daniel Shahaf

HW42

Ian Jackson

Ximin Luo

Appendix

Case-insensitive filesystems

As per the section Applying the decoded structure above, programs are free to choose their own interpretation of "derived from" to deal with these.

In practice, the following may be an effective approach:

After decoding into a list-of-pairs, canonicalise all source paths in this list.
When applying the mapping, first canonicalise the path to be mapped (the "subject path"), then apply it on the canonicalised map from step (1).

The exact method for canonicalisation depends on the filesystem. It may involve (e.g.) converting to upper case for non-existing files and path components, and converting to the preserved case for existing files and path components, for filesystems that store this information. Hopefully your system already has a utility function that does this.

Test vectors

Here are test vectors for implementations to check their correctness. They are also available as part of an executable test suite in our git repository. They are intended for guidance and not as a substitute to the above specification. In particular, it may be possible to match the behaviour described below exactly but still violate the specification.

In the prescriptions below, statements of the form "E maps I to O" mean that when E is set as the value of BUILD_PATH_PREFIX_MAP, then a compliant consumer implementing either algorithm 1 or 2 from Applying the decoded structure, maps line-separated input paths I to line-separated output paths O. Where E is wrapped by b' and ', it is to be interpreted as a sequence of bytes, represented in the same way as Python byte literals.

Case "0.basic", decode valid: ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy maps:

/a/d
/b/1234
/a/yyy/xxx

to:

lol/d
foo/1234
bar/xxx

Case "0.ordering", decode valid: ERROR=/a/b%+yyy:lol=/a:ERROR=/b/1234:foo=/b:libbar-3-bison++_41%.10.5-3~rc1pre3+dfsg1.1-3nmu1+b4=/a/b%+yyy maps:

/a/d
/b/1234
/a/b=yyy/xxx

to:

lol/d
foo/1234
libbar-3-bison++_41:10.5-3~rc1pre3+dfsg1.1-3nmu1+b4/xxx

Case "pecsplit.0.allbytes-ok", decode valid: b'=\x01\x02\x03\x04\x05\x06\x07\x08 \x0b\x0c \x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$&\'()*+,-./0123456789;<>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe' maps:

(same as the input for Case "0.ordering")

to:

(same as the input for Case "0.ordering")

Case "pecsplit.0.empty-ok", decode valid: :ERROR=/a/b%+yyy:lol=/a::foo=/b:libbar-3-bison++_41%.10.5-3~rc1pre3+dfsg1.1-3nmu1+b4=/a/b%+yyy: maps:

(same as the input for Case "0.ordering")

to:

(same as the output for Case "0.ordering")

Case "pecsplit.0.non-utf8", decode valid: b'result\xf1=/a/b%+yyy:lol%#%#=/a:foo%#%#=/b%#:result\xf1=/a/b%+yyy:sec%.reteh=/a/b%+yyy\xf1' maps:

/a/d
/b/1234
/b%/1234
b'/a/b=yyy\xf1/xxx'
/a/b=yyy/xxx

to:

lol%%/d
/b/1234
foo%%/1234
sec:reteh/xxx
b'result\xf1/xxx'

Case "pecsplit.1.long-pc-1", decode invalid: %%#ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy

Case "pecsplit.1.long-pc-2", decode invalid: ERROR%%#=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy

Case "pecsplit.1.long-pc-3", decode invalid: ERROR=/a/zzz:lol=%%#/a:ERROR=/b/1234:foo=/b:bar=/a/yyy

Case "pecsplit.1.long-pc-4", decode invalid: ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b%%#:bar=/a/yyy

Case "pecsplit.1.long-pc-5", decode invalid: ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy%%#

Case "pecsplit.1.many-=-not-ok", decode invalid: ERROR=/a/b=yyy:lol=/a:foo=/b:bar=/a/byyy

Case "pecsplit.1.plain-pc-1", decode invalid: %sERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy

Case "pecsplit.1.plain-pc-2", decode invalid: ERROR%s=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy

Case "pecsplit.1.plain-pc-3", decode invalid: ERROR=/a/zzz:lol=%s/a:ERROR=/b/1234:foo=/b:bar=/a/yyy

Case "pecsplit.1.plain-pc-4", decode invalid: ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b%s:bar=/a/yyy

Case "pecsplit.1.plain-pc-5", decode invalid: ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy%s

Case "pecsplit.1.short-pc-2", decode invalid: ERROR%=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy

Case "pecsplit.1.short-pc-4", decode invalid: ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b%:bar=/a/yyy

Case "pecsplit.1.short-pc-5", decode invalid: ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy%

Case "pecsplit.1.zero-=-not-ok", decode invalid: /a/byyyERROR:lol=/a:foo=/b:bar=/a/byyy