Registry files

Usage

If your project has a large number of data files, it can be tedious to list them in a dictionary. In these cases, it’s better to store the file names and hashes in a file and use pooch.Pooch.load_registry to read them.

import pkg_resources
import pooch

POOCH = pooch.create(
    path=pooch.os_cache("plumbus"),
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    # version is assumed to be your package's version string, defined elsewhere
    version=version,
    version_dev="main",
    # We'll load it from a file later
    registry=None,
)
# Get registry file from package_data
registry_file = pkg_resources.resource_stream("plumbus", "registry.txt")
# Load this registry file
POOCH.load_registry(registry_file)

In this case, the registry.txt file is in the plumbus/ package directory and should be shipped with the package (see below for instructions). We use pkg_resources to access the registry.txt, giving it the name of our Python package.

Registry file format

Registry files are lightweight text files that list one file name and its hash per line, separated by a space. In our example, the contents of registry.txt are:

c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w

To enforce a specific hashing algorithm, prefix a file’s checksum with alg: (the name of the algorithm followed by a colon):

c137.csv sha1:e32b18dab23935bc091c353b308f724f18edcb5e
cronen.csv md5:b53c08d3570b82665784cedde591a8b0
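
As an illustration of how such a prefixed entry can be interpreted, here is a minimal sketch that uses only the standard library. The helper name matches_known_hash is hypothetical and this is not Pooch’s own implementation; it assumes that an entry without a prefix is a SHA256 hash.

```python
import hashlib


def matches_known_hash(path, known_hash, default_alg="sha256"):
    """Return True if the file at path matches a registry-style checksum."""
    # Split an optional "alg:" prefix off the checksum.
    # Assumption: entries without a prefix are SHA256 hashes.
    if ":" in known_hash:
        alg, expected = known_hash.split(":", 1)
    else:
        alg, expected = default_alg, known_hash
    hasher = hashlib.new(alg)
    # Read the file in chunks so large data files don't need to fit in memory
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected.lower()
```

For example, matches_known_hash("data/c137.csv", "sha1:e32b18dab23935bc091c353b308f724f18edcb5e") would hash the file with SHA1 and compare the result with the value after the colon.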

Since Pooch v1.2.0, the registry file can also contain line comments, which start with a #:

# C-137 sample data
c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
# Cronenberg sample data
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w

Attention

When using comments, make sure you require Pooch >=1.2.0 in your setup.py, since earlier versions cannot handle them: install_requires = [..., "pooch>=1.2.0", ...]
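
To make the format concrete, the following sketch parses registry text into a dictionary while skipping comments and blank lines. It is only an illustration of the format described above, not the parser that Pooch uses. The function name parse_registry is hypothetical, and the sketch assumes that file names contain no spaces.

```python
def parse_registry(text):
    """Parse registry text into a {file name: hash} dictionary."""
    registry = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and line comments
        if not line or line.startswith("#"):
            continue
        # The first two fields are the file name and its hash;
        # a line with fewer than two fields raises ValueError
        name, file_hash = line.split()[:2]
        registry[name] = file_hash
    return registry


sample = """\
# C-137 sample data
c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
# Cronenberg sample data
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w
"""
registry = parse_registry(sample)
```

Here registry maps each of the two file names to its hash, and the two comment lines are ignored.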

Packaging registry files

To make sure the registry file is shipped with your package, include the following in your MANIFEST.in file:

include plumbus/registry.txt

And the following entry in the setup function of your setup.py file:

setup(
    ...
    package_data={"plumbus": ["registry.txt"]},
    ...
)

Creating a registry file

If you have many data files, creating the registry and keeping it updated can be a challenge. The pooch.make_registry function creates a registry file that lists all the files in a directory. For example, we can generate the registry file for our fictitious project from the command line:

$ python -c "import pooch; pooch.make_registry('data', 'plumbus/registry.txt')"
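
For intuition about what ends up in the generated file, here is a rough standard-library sketch of the same idea: walk a directory, hash every file, and write one line per file. The function name sketch_make_registry is hypothetical and the code is not Pooch’s implementation; it assumes SHA256 hashes and paths written relative to the data directory with forward slashes.

```python
import hashlib
from pathlib import Path


def sketch_make_registry(directory, output):
    """Write "<relative path> <sha256>" lines for every file under directory."""
    root = Path(directory)
    # Collect the files first, in sorted order, so the output is reproducible
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with open(output, "w", encoding="utf-8") as registry:
        for path in files:
            # Reads the whole file into memory; fine for a sketch
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Forward slashes keep the registry identical on all platforms
            name = path.relative_to(root).as_posix()
            registry.write(f"{name} {digest}\n")
```

In practice, prefer pooch.make_registry as shown above; the sketch only illustrates the shape of the output.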

Creating a registry file from remote files

If you want to create a registry file for a large number of data files that are available for download but you don’t have their hashes or any local copies, you must download them first. Manually downloading each file can be tedious. However, we can automate the process using pooch.retrieve. Below, we’ll explore two different scenarios.

If the data files share the same base url, we can use pooch.retrieve to download them and then use pooch.make_registry to create the registry:

import os
import pooch

# Names of the data files
filenames = ["c137.csv", "cronen.csv", "citadel.csv"]

# Base URL from which the data files can be downloaded
base_url = "https://www.some-data-hosting-site.com/files/"

# Create the directory where all files will be downloaded
# (exist_ok=True avoids an error if it already exists)
directory = "data_files"
os.makedirs(directory, exist_ok=True)

# Download each data file to data_files
for fname in filenames:
    path = pooch.retrieve(
        url=base_url + fname, known_hash=None, fname=fname, path=directory
    )

# Create the registry file from the downloaded data files
pooch.make_registry(directory, "registry.txt")

If each data file has its own url, the registry file can be manually created after downloading each data file through pooch.retrieve:

import os
import pooch

# Names and urls of the data files. The file names are used for naming the
# downloaded files. These are the names that will be included in the registry.
fnames_and_urls = {
    "c137.csv": "https://www.some-data-hosting-site.com/c137/data.csv",
    "cronen.csv": "https://www.some-data-hosting-site.com/cronen/data.csv",
    "citadel.csv": "https://www.some-data-hosting-site.com/citadel/data.csv",
}

# Create the directory where all files will be downloaded
# (exist_ok=True avoids an error if it already exists)
directory = "data_files"
os.makedirs(directory, exist_ok=True)

# Create a new registry file
with open("registry.txt", "w") as registry:
    for fname, url in fnames_and_urls.items():
        # Download each data file to the specified directory
        path = pooch.retrieve(
            url=url, known_hash=None, fname=fname, path=directory
        )
        # Add the name, hash, and url of the file to the new registry file
        registry.write(
            f"{fname} {pooch.file_hash(path)} {url}\n"
        )

Warning

Note that there are no checks for download integrity, since we don’t know the file hashes beforehand. Only do this for trusted data sources and over a secure connection. If you have access to file hashes or checksums, we highly recommend using them to set the known_hash argument.