The fact that queryPathInfo() is synchronous meant that we needed a
thread for every concurrent binary cache lookup, even though they end
up being handled by the same download thread. Requiring hundreds of
threads is not a good idea. So now there is an asynchronous version of
queryPathInfo() that takes a callback function to process the
result. Similarly, enqueueDownload() now takes a callback rather than
returning a future.
Thus, a command like
nix path-info --store https://cache.nixos.org/ -r /nix/store/slljrzwmpygy1daay14kjszsr9xix063-nixos-16.09beta231.dccf8c5
that returns 4941 paths now takes 1.87s using only 2 threads (the main
thread and the downloader thread). (This is with a prewarmed
CloudFront.)
The inner lambda was returning a SQLite-internal char * rather than a
std::string, leading to Hydra errors liks
Caught exception in Hydra::Controller::Root->narinfo "path âø£â is not in the Nix store at /nix/store/6mvvyb8fgwj23miyal5mdr8ik4ixk15w-hydra-0.1.1234.abcdef/libexec/hydra/lib/Hydra/Controller/Root.pm line 352."
This makes it easier to create a diverted store, i.e.
NIX_REMOTE="local?root=/tmp/root"
instead of
NIX_REMOTE="local?real=/tmp/root/nix/store&state=/tmp/root/nix/var/nix" NIX_LOG_DIR=/tmp/root/nix/var/log
This is primarily to subsume the functionality of the
copy-from-other-stores substituter. For example, in the NixOS
installer, we can now do (assuming we're in the target chroot, and the
Nix store of the installation CD is bind-mounted on /tmp/nix):
$ nix-build ... --option substituters 'local?state=/tmp/nix/var&real=/tmp/nix/store'
However, unlike copy-from-other-stores, this also allows write access
to such a store. One application might be fetching substitutes for
/nix/store in a situation where the user doesn't have sufficient
privileges to create /nix, e.g.:
$ NIX_REMOTE="local?state=/home/alice/nix/var&real=/home/alice/nix/store" nix-build ...
This allows commands like "nix verify --all" or "nix path-info --all"
to work on S3 caches.
Unfortunately, this requires some ugly hackery: when querying the
contents of the bucket, we don't want to have to read every .narinfo
file. But the S3 bucket keys only include the hash part of each store
path, not the name part. So as a special exception
queryAllValidPaths() can now return store paths *without* the name
part, and queryPathInfo() accepts such store paths (returning a
ValidPathInfo object containing the full name).
Caching path info is generally useful. For instance, it speeds up "nix
path-info -rS /run/current-system" (i.e. showing the closure sizes of
all paths in the closure of the current system) from 5.6s to 0.15s.
This also eliminates some APIs like Store::queryDeriver() and
Store::queryReferences().
These are content-addressed paths or outputs of locally performed
builds. They are trusted even if they don't have signatures, so "nix
verify-paths" won't complain about them.
This enables an optimisation in hydra-queue-runner, preventing a
download of a NAR it just uploaded to the cache when reading files
like hydra-build-products.
Also, move a few free-standing functions into StoreAPI and Derivation.
Also, introduce a non-nullable smart pointer, ref<T>, which is just a
wrapper around std::shared_ptr ensuring that the pointer is never
null. (For reference-counted values, this is better than passing a
"T&", because the latter doesn't maintain the refcount. Usually, the
caller will have a shared_ptr keeping the value alive, but that's not
always the case, e.g., when passing a reference to a std::thread via
std::bind.)
Previously files in the Nix store were owned by root or by nixbld,
depending on whether they were created by a substituter or by a
builder. This doesn't matter much, but causes spurious diffoscope
differences. So use root everywhere.
I.e., not readable to the nixbld group. This improves purity a bit for
non-chroot builds, because it prevents a builder from enumerating
store paths (i.e. it can only access paths it knows about).
Namely:
nix-store: derivations.cc:242: nix::Hash nix::hashDerivationModulo(nix::StoreAPI&, nix::Derivation): Assertion `store.isValidPath(i->first)' failed.
This happened because of the derivation output correctness check being
applied before the references of a derivation are valid.
There is no risk of getting an inconsistent result here: if the ID
returned by queryValidPathId() is deleted from the database
concurrently, subsequent queries involving that ID will simply fail
(since IDs are never reused).
In the Hydra build farm we fairly regularly get SQLITE_PROTOCOL errors
(e.g., "querying path in database: locking protocol"). The docs for
this error code say that it "is returned if some other process is
messing with file locks and has violated the file locking protocol
that SQLite uses on its rollback journal files." However, the SQLite
source code reveals that this error can also occur under high load:
if( cnt>5 ){
int nDelay = 1; /* Pause time in microseconds */
if( cnt>100 ){
VVA_ONLY( pWal->lockError = 1; )
return SQLITE_PROTOCOL;
}
if( cnt>=10 ) nDelay = (cnt-9)*238; /* Max delay 21ms. Total delay 996ms */
sqlite3OsSleep(pWal->pVfs, nDelay);
}
i.e. if certain locks cannot be not acquired, SQLite will retry a
number of times before giving up and returing SQLITE_PROTOCOL. The
comments say:
Circumstances that cause a RETRY should only last for the briefest
instances of time. No I/O or other system calls are done while the
locks are held, so the locks should not be held for very long. But
if we are unlucky, another process that is holding a lock might get
paged out or take a page-fault that is time-consuming to resolve,
during the few nanoseconds that it is holding the lock. In that case,
it might take longer than normal for the lock to free.
...
The total delay time before giving up is less than 1 second.
On a heavily loaded machine like lucifer (the main Hydra server),
which often has dozens of processes waiting for I/O, it seems to me
that a page fault could easily take more than a second to resolve.
So, let's treat SQLITE_PROTOCOL as SQLITE_BUSY and retry the
transaction.
Issue NixOS/hydra#14.
On a system with multiple CPUs, running Nix operations through the
daemon is significantly slower than "direct" mode:
$ NIX_REMOTE= nix-instantiate '<nixos>' -A system
real 0m0.974s
user 0m0.875s
sys 0m0.088s
$ NIX_REMOTE=daemon nix-instantiate '<nixos>' -A system
real 0m2.118s
user 0m1.463s
sys 0m0.218s
The main reason seems to be that the client and the worker get moved
to a different CPU after every call to the worker. This patch adds a
hack to lock them to the same CPU. With this, the overhead of going
through the daemon is very small:
$ NIX_REMOTE=daemon nix-instantiate '<nixos>' -A system
real 0m1.074s
user 0m0.809s
sys 0m0.098s
For instance, it's pointless to keep copy-from-other-stores running if
there are no other stores, or download-using-manifests if there are no
manifests. This also speeds things up because we don't send queries
to those substituters.
Otherwise subsequent invocations of "--repair" will keep rebuilding
the path. This only happens if the path content differs between
builds (e.g. due to timestamps).