Training your Pooch

The problem

You develop a Python library called plumbus for analysing data emitted by interdimensional portals. You want to distribute sample data so that your users can easily try out the library by copying and pasting from the docs. You want to have a plumbus.datasets module that defines functions like fetch_c137() that will return the data loaded as a pandas.DataFrame for convenient access.

Assumptions

We’ll set up a Pooch to solve your data distribution needs. In this example, we’ll work with the following assumptions:

  1. Your sample data are in a folder of your GitHub repository.
  2. You use git tags to mark releases of your project in the history.
  3. Your project has a variable that defines the version string.
  4. The version string contains an indicator that the current commit is not a release (like 'v1.2.3+12.d908jdl' or 'v0.1+dev').

Let’s say that this is the layout of your repository on GitHub:

doc/
    ...
data/
    README.md
    c137.csv
    cronen.csv
plumbus/
    __init__.py
    ...
    datasets.py
setup.py
...

The sample data are stored in the data folder of your repository.

Setup

Pooch can download and cache your data files to the user’s computer automatically. This is what the plumbus/datasets.py file would look like:

"""
Load sample data.
"""
import pandas
import pooch

from . import version  # The version string of your project


GOODBOY = pooch.create(
    # Use the default cache folder for the OS
    path=pooch.os_cache("plumbus"),
    # The remote data is on GitHub
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    version=version,
    # If this is a development version, get the data from the master branch
    version_dev="master",
    # The registry specifies the files that can be fetched from this Pooch
    registry={
        "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc",
        "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w",
    },
)


def fetch_c137():
    """
    Load the C-137 sample data as a pandas.DataFrame.
    """
    # The file will be downloaded automatically the first time this is run.
    fname = GOODBOY.fetch("c137.csv")
    data = pandas.read_csv(fname)
    return data


def fetch_cronen():
    """
    Load the Cronenberg sample data as a pandas.DataFrame.
    """
    fname = GOODBOY.fetch("cronen.csv")
    data = pandas.read_csv(fname)
    return data

When the user calls plumbus.datasets.fetch_c137() for the first time, the data file will be downloaded and saved to local storage. In this case, we’re using pooch.os_cache to set the local folder to the default cache location for your OS. You could also provide any other path if you prefer. See the documentation for pooch.create for more options.

Hashes

Pooch uses SHA256 hashes to check if files are up-to-date or possibly corrupted:

  • If a file exists in the local folder, Pooch will check that its hash matches the one in the registry. If it doesn’t, we’ll assume that it needs to be updated.
  • If a file needs to be updated or doesn’t exist, Pooch will download it from the remote source and check the hash. If the hash doesn’t match, an exception is raised to warn of possible file corruption.
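The decision logic above can be sketched in plain Python with hashlib (this is an illustration of the idea only, not Pooch’s internal code; the helper names are hypothetical):

```python
import hashlib
import os


def sha256_of(path):
    """Compute the SHA256 hex digest of a file (hypothetical helper)."""
    hasher = hashlib.sha256()
    with open(path, "rb") as infile:
        # Read in chunks so large files don't need to fit in memory
        for chunk in iter(lambda: infile.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def needs_download(path, known_hash):
    """True if the local file is missing or its hash doesn't match the registry."""
    return not os.path.exists(path) or sha256_of(path) != known_hash
```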

You can generate hashes for your data files using the terminal:

$ openssl sha256 data/c137.csv
SHA256(data/c137.csv)= baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d

Or using the pooch.file_hash function (which is a convenient way of calling Python’s hashlib):

import pooch
print(pooch.file_hash("data/c137.csv"))

Versioning

The files from different versions of your project will be kept in separate folders to make sure they don’t conflict with each other. This way, you can safely update data files while maintaining backward compatibility. For example, if path=".plumbus" and version="v0.1", the data folder will be .plumbus/v0.1.

When your project updates, Pooch will automatically set up a separate folder for the new data files based on the given version string. The remote URL will also be updated. Notice that there is a format specifier {version} in the URL that Pooch substitutes for you.

Versioning is optional and can be ignored by omitting the version and version_dev arguments or setting them to None.
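The rule for mapping a development version to the development branch can be sketched as follows (an illustration assuming the '+' marker convention from the assumptions above; resolve_version is a hypothetical helper, not part of Pooch’s public API):

```python
def resolve_version(version, version_dev="master"):
    """
    Return the version used for folders and URLs: development builds
    (marked by a '+' in the version string) map to the development
    branch, while release tags are used as-is.
    """
    if "+" in version:
        return version_dev
    return version
```

For example, resolve_version("v1.2.3+12.d908jdl") gives "master", while resolve_version("v0.1") gives "v0.1".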

User-defined paths

In the above example, the location of the local storage on the user’s computer is hard-coded. There is no way for them to change it to something else. To avoid being a tyrant, you can allow the user to define the path argument using an environment variable:

GOODBOY = pooch.create(
    # This is still the default in case the environment variable isn't defined
    path=pooch.os_cache("plumbus"),
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    version=version,
    version_dev="master",
    registry={
        "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc",
        "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w",
    },
    # The name of the environment variable that can override the path argument
    env="PLUMBUS_DATA_DIR",
)

In this case, if the user defines the PLUMBUS_DATA_DIR environment variable, we’ll use its value instead of path. Pooch will still append the value of version to the path, so the value of PLUMBUS_DATA_DIR should not include a version number.
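The lookup order can be sketched like this (illustrative only; data_path is a hypothetical helper showing the idea, not Pooch’s API):

```python
import os


def data_path(default, env="PLUMBUS_DATA_DIR", version="v1.0.0"):
    """
    Pick the base folder: the environment variable wins if set, otherwise
    fall back to the default. The version is appended either way, which is
    why the variable's value should not include a version number.
    """
    base = os.environ.get(env) or default
    return os.path.join(base, version)
```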

Subdirectories

You can have data files in subdirectories of the remote data store. These files will be saved to the same subdirectories in the local storage folder. Note, however, that the names of these files in the registry must use Unix-style separators ('/') even on Windows. We will handle the appropriate conversions.
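The separator conversion amounts to something like this sketch (illustrative; Pooch handles this for you):

```python
import os


def local_file_path(storage_dir, registry_name):
    """
    Map a Unix-style registry name like 'subdir/c137.csv' to a local
    path that uses the operating system's separator.
    """
    return os.path.join(storage_dir, *registry_name.split("/"))
```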

Post-processing hooks

Sometimes further post-processing actions need to be taken on downloaded files (unzipping, conversion to a more efficient format, etc.). If these actions are time- or memory-consuming, it is best to perform them only once, when the file is actually downloaded, and not every time pooch.Pooch.fetch is called.

One way to do this is with post-processing hooks. The pooch.Pooch.fetch method takes a processor argument that allows us to specify a function that is executed post-download and before returning the local file path. The processor also lets us override the file name returned by pooch.Pooch.fetch.

See the API Reference for a list of all available post-processing hooks.

For example, let’s say our data file is zipped and we want to store an unzipped copy of it and read that instead. We can do this with a post-processing hook that unzips the file and returns the path to the unzipped file instead of the original zip archive:

import os
from zipfile import ZipFile


def unpack(fname, action, pup):
    """
    Post-processing hook to unzip a file and return the unzipped file name.

    Parameters
    ----------
    fname : str
       Full path of the zipped file in local storage
    action : str
       One of "download" (file doesn't exist and will download),
       "update" (file is outdated and will download), and
       "fetch" (file exists and is updated so no download).
    pup : Pooch
       The instance of Pooch that called the processor function.

    Returns
    -------
    fname : str
       The full path to the unzipped file.
       (Return the same fname if your processor doesn't modify the file.)

    """
    # Create a new name for the unzipped file. Appending something to the name is a
    # relatively safe way of making sure there are no clashes with other files in
    # the registry.
    unzipped = fname + ".unzipped"
    # Don't unzip if file already exists and is not being downloaded
    if action in ("update", "download") or not os.path.exists(unzipped):
        with ZipFile(fname, "r") as zip_file:
            # Extract the data file from within the archive
            with zip_file.open("actual-data-file.txt") as data_file:
                # Save it to our desired file name
                with open(unzipped, "wb") as output:
                    output.write(data_file.read())
    # Return the path of the unzipped file
    return unzipped


def fetch_zipped_file():
    """
    Load a large zipped sample dataset as a pandas.DataFrame.
    """
    # Pass in the processor to unzip the data file
    fname = GOODBOY.fetch("zipped-data-file.zip", processor=unpack)
    # fname is now the path of the unzipped file which can be loaded by pandas
    # directly
    data = pandas.read_csv(fname)
    return data

Fortunately, you don’t have to implement your own unzip processor. Pooch provides the pooch.Unzip processor for exactly this use case. The above example using the Pooch processor would look like:

from pooch import Unzip


def fetch_zipped_file():
    """
    Load a large zipped sample dataset as a pandas.DataFrame.
    """
    # Extract the file "actual-data-file.txt" from the archive
    unpack = Unzip(members=["actual-data-file.txt"])
    # Pass in the processor to unzip the data file
    fnames = GOODBOY.fetch("zipped-data-file.zip", processor=unpack)
    # Returns the paths of all extracted members (in our case, only one)
    fname = fnames[0]
    # fname is now the path of the unzipped file ("actual-data-file.txt") which can
    # be loaded by pandas directly
    data = pandas.read_csv(fname)
    return data

Alternatively, your zip archive could contain multiple files that you want to unpack. In this case, the default behavior of pooch.Unzip is to extract all files into a directory and return a list of file paths instead of a single one:

def fetch_zipped_archive():
    """
    Load all files from a zipped archive.
    """
    # Pass in the processor to unzip the data file
    fnames = GOODBOY.fetch("zipped-archive.zip", processor=Unzip())
    data = [pandas.read_csv(fname) for fname in fnames]
    return data

If you have a compressed file that is not an archive (zip or tar), you can use pooch.Decompress to decompress it after download. For example, large binary files can be compressed with gzip to reduce download times but will need to be decompressed before loading, which can be slow. You can trade storage space for speed by keeping a decompressed copy of the file:

import numpy

from pooch import Decompress

def fetch_compressed_file():
    """
    Load a large binary file that has been gzip compressed.
    """
    # Pass in the processor to decompress the file on download
    fname = GOODBOY.fetch("large-binary-file.npy.gz", processor=Decompress())
    # The file returned is the decompressed version which can be loaded by numpy
    data = numpy.load(fname)
    return data

Custom downloaders and authentication

By default, pooch.Pooch.fetch will download files over HTTP without authentication. Sometimes this is not enough: some servers require logins, some are FTP instead of HTTP. To get around this, you can pass a downloader to fetch.

Pooch provides the pooch.HTTPDownloader class (which is used by default) that can be used to provide login credentials to HTTP servers that require authentication. For example:

from pooch import HTTPDownloader


def fetch_protected_data():
    """
    Fetch a file from a server that requires authentication
    """
    # Let the downloader know the login credentials
    download_auth = HTTPDownloader(auth=("my_username", "my_password"))
    fname = GOODBOY.fetch("some-data.csv", downloader=download_auth)
    data = pandas.read_csv(fname)
    return data

It’s probably not a good idea to hard-code credentials in your code. One way around this is to ask users to set their own credentials through environment variables. The download code could look something like this:

import os


def fetch_protected_data():
    """
    Fetch a file from a server that requires authentication
    """
    # Get the credentials from the user's environment
    username = os.environ.get("SOMESITE_USERNAME")
    password = os.environ.get("SOMESITE_PASSWORD")
    # Let the downloader know the login credentials
    download_auth = HTTPDownloader(auth=(username, password))
    fname = GOODBOY.fetch("some-data.csv", downloader=download_auth)
    data = pandas.read_csv(fname)
    return data

If your use case is not covered by our downloaders, you can implement your own. See pooch.Pooch.fetch for the required format of downloaders. As an example, consider the case in which the login credentials need to be provided to a site that is redirected from the original download URL in the Pooch registry:

import os

import requests

from pooch import HTTPDownloader


def redirect_downloader(url, output_file, pooch):
    """
    Download after following a redirection.
    """
    # Get the credentials from the user's environment
    username = os.environ.get("SOMESITE_USERNAME")
    password = os.environ.get("SOMESITE_PASSWORD")
    # Make a request that will redirect to the login page
    login = requests.get(url)
    # Provide the credentials and download from the new URL
    download = HTTPDownloader(auth=(username, password))
    download(login.url, output_file, pooch)


def fetch_protected_data():
    """
    Fetch a file from a server that requires authentication
    """
    fname = GOODBOY.fetch("some-data.csv", downloader=redirect_downloader)
    data = pandas.read_csv(fname)
    return data

Printing a download progress bar

The HTTPDownloader can use tqdm to print a download progress bar. This is turned off by default but can be enabled using:

import h5py

from pooch import HTTPDownloader


def fetch_large_data():
    """
    Fetch a large file from a server and print a progress bar.
    """
    download = HTTPDownloader(progressbar=True)
    fname = GOODBOY.fetch("large-data-file.h5", downloader=download)
    data = h5py.File(fname, "r")
    return data

The resulting progress bar will be printed to stderr and should look something like this:

100%|█████████████████████████████████████████| 336/336 [...]

Note

tqdm is not installed by default with Pooch. You will have to install it separately in order to use this feature.

So you have 1000 data files

If your project has a large number of data files, it can be tedious to list them in a dictionary. In these cases, it’s better to store the file names and hashes in a file and use pooch.Pooch.load_registry to read them:

import os

GOODBOY = pooch.create(
    # Use the default cache folder for the OS
    path=pooch.os_cache("plumbus"),
    # The remote data is on GitHub
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    version=version,
    # If this is a development version, get the data from the master branch
    version_dev="master",
    # We'll load it from a file later
    registry=None,
)
GOODBOY.load_registry(os.path.join(os.path.dirname(__file__), "registry.txt"))

The registry.txt file in this case is in the same directory as the datasets.py module and should be shipped with the package. Its contents are:

c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w
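Conceptually, loading such a file boils down to parsing one "name hash" pair per line, as in this sketch (an illustration of the format only, not Pooch’s implementation):

```python
def parse_registry(lines):
    """Parse 'file_name sha256_hash' lines into a registry dictionary."""
    registry = {}
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines
            continue
        name, file_hash = line.split()[:2]
        registry[name] = file_hash
    return registry
```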

To make sure the registry file is shipped with your package, include the following in your MANIFEST.in file:

include plumbus/registry.txt

And the following entry in the setup function of your setup.py:

setup(
    ...
    package_data={"plumbus": ["registry.txt"]},
    ...
)

Creating a registry file

If you have many data files, creating the registry and keeping it updated can be a challenge. The pooch.make_registry function will create a registry file with all contents of a directory. For example, we can generate the registry file for our fictitious project from the command line:

$ python -c "import pooch; pooch.make_registry('data', 'plumbus/registry.txt')"
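Under the hood, this kind of tool only needs a directory walk plus hashing; a minimal sketch (illustrative only, with hypothetical names; use pooch.make_registry in practice) could look like:

```python
import hashlib
import os


def make_registry_sketch(directory):
    """
    Return sorted 'relative/name sha256' lines for every file under
    directory, with Unix-style separators as the registry requires.
    """
    lines = []
    for root, _, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            relative = os.path.relpath(full, directory).replace(os.sep, "/")
            with open(full, "rb") as infile:
                digest = hashlib.sha256(infile.read()).hexdigest()
            lines.append(f"{relative} {digest}")
    return sorted(lines)
```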

Multiple URLs

You can set a custom download URL for individual files with the urls argument of pooch.create or pooch.Pooch. It should be a dictionary with the file names as keys and the URLs for downloading the files as values. For example, say we have a citadel.csv file that we want to download from https://www.some-data-hosting-site.com instead:

# The basic setup is the same and we must include citadel.csv in the registry.
GOODBOY = pooch.create(
    path=pooch.os_cache("plumbus"),
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    version=version,
    version_dev="master",
    registry={
        "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc",
        "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w",
        "citadel.csv": "893yprofwjndcwhx9c0ehp3ue9gcwoscjwdfgh923e0hwhcwiyc",
    },
    # Now specify custom URLs for some of the files in the registry.
    urls={
        "citadel.csv": "https://www.some-data-hosting-site.com/files/citadel.csv",
    },
)

Notice that versioning of custom URLs is not supported (since they are assumed to be data files independent of your project) and the file name will not be appended automatically to the URL (in case you want to change the file name in local storage).

Custom URLs can be used alongside base_url, or you can omit base_url entirely by setting it to an empty string (base_url=""). However, doing so requires setting a custom URL for every file in the registry.

You can also include custom URLs in a registry file by adding the URL for a file to the end of the line (separated by a space):

c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w
citadel.csv 893yprofwjndcwhx9c0ehp3ue9gcwoscjwdfgh923e0hwhcwiyc https://www.some-data-hosting-site.com/files/citadel.csv
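Parsing this extended format just means treating an optional third column as the custom URL, as in this sketch (illustrative only; pooch.Pooch.load_registry does this for you):

```python
def parse_registry_with_urls(lines):
    """
    Split each line into (name, hash[, url]) and return two dictionaries:
    the registry of hashes and the custom URLs for files that have one.
    """
    registry, urls = {}, {}
    for line in lines:
        parts = line.split()
        if not parts:
            # Skip blank lines
            continue
        name, file_hash = parts[0], parts[1]
        registry[name] = file_hash
        if len(parts) > 2:
            urls[name] = parts[2]
    return registry, urls
```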

pooch.Pooch.load_registry will automatically populate the urls attribute. This way, custom URLs don’t need to be set in the code. In fact, the module code doesn’t change at all:

# Define the Pooch exactly the same (urls is None by default)
GOODBOY = pooch.create(
    path=pooch.os_cache("plumbus"),
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    version=version,
    version_dev="master",
    registry=None,
)
# If custom URLs are present in the registry file, they will be set automatically
GOODBOY.load_registry(os.path.join(os.path.dirname(__file__), "registry.txt"))