Fetching files from a registry

If you need to manage the download of multiple files from one or more locations, then this section is for you!

Setup

In the following example we’ll assume that:

  1. You have several data files served from the same base URL (for example, "https://www.somewebpage.org/science/data/").

  2. You know the file names and their hashes.

We will use pooch.create to set up our download manager:

import pooch


odie = pooch.create(
    # Use the default cache folder for the operating system
    path=pooch.os_cache("my-project"),
    base_url="https://www.somewebpage.org/science/data/",
    # The registry specifies the files that can be fetched
    registry={
        "temperature.csv": "sha256:19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc",
        "gravity-disturbance.nc": "sha256:1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w",
    },
)

The return value (odie) is an instance of pooch.Pooch. It contains all of the information needed to fetch the data files in our registry and store them in the specified cache folder.

Note

The Pooch registry is a mapping of file names and their associated hashes (and optionally download URLs).

Tip

If you don’t know the hash or are otherwise unable to obtain it, it is possible to bypass the check. Bypassing is not recommended for general use; do it only when it can’t be avoided. See Hashes: Calculating and bypassing.

Attention

You can have data files in subdirectories of the remote data store (URL). These files will be saved to the same subdirectories in the local storage folder.

However, the names of these files in the registry must use Unix-style separators ('/') even on Windows. Pooch will handle the appropriate conversions.

Downloading files

To download one of our data files and load it with xarray:

import xarray as xr


file_path = odie.fetch("gravity-disturbance.nc")
# Standard use of xarray to load a netCDF file (.nc)
data = xr.open_dataset(file_path)

The call to pooch.Pooch.fetch will check if the file already exists in the cache folder.

If it doesn’t:

  1. The file is downloaded and saved to the cache folder.

  2. The hash of the downloaded file is compared against the one stored in the registry to make sure the file isn’t corrupted.

  3. The function returns the absolute path to the file on your computer.

If it does:

  1. Pooch checks whether its hash matches the one in the registry.

  2. If it does, no download happens and the file path is returned.

  3. If it doesn’t, the file is downloaded once more to get an updated version on your computer.
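The decision described in the two lists above can be sketched with the standard library alone. This is an illustration of the logic, not Pooch's actual implementation; the function names sha256_of and decide_action are made up for this example, and the known hash is assumed to be a plain SHA256 hex string without the "sha256:" prefix.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def decide_action(path, known_hash):
    """Decide what a fetch would do for a cached file (illustrative only)."""
    path = Path(path)
    if not path.exists():
        # Not in the cache yet: download it and verify the hash afterwards.
        return "download"
    if sha256_of(path) == known_hash.lower():
        # Cached copy is intact: return its path without downloading.
        return "use cached"
    # Cached copy differs from the registry: download a fresh version.
    return "update"
```

In all three cases the caller ends up with a path to a local file whose hash matches the registry.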

Why use this method?

With pooch.Pooch, you can centralize the information about the URLs, hashes, and files in a single place. Once the instance is created, it can be used to fetch individual files without repeating the URL and hash everywhere.

A good way to use this is to place the call to pooch.create in a Python module (a .py file). Then you can import the module in .py scripts or Jupyter notebooks and use the instance to fetch your data. This way, you don’t need to define the URLs or hashes in multiple scripts/notebooks.

Customizing the download

The pooch.Pooch.fetch method supports all of Pooch’s downloaders and processors. With a bit of configuration, you can download over HTTP, FTP, and SFTP (with authentication if needed), decompress files, unpack archives, show progress bars, and more.