Retrieving a single data file

Basic usage

If you only want to download one or two data files, use the pooch.retrieve function:

import pooch


file_path = pooch.retrieve(
    # URL to one of Pooch's test files
    url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)

The code above will:

  1. Check if the file from this URL already exists in Pooch’s default cache folder (see pooch.os_cache).

  2. If it doesn’t, download the file and save it to the cache folder.

  3. Compare the file’s MD5 hash against the known_hash to make sure the file isn’t corrupted.

  4. Return the absolute path to the file on your computer.

If the file already exists on your machine, Pooch will check if its MD5 hash matches the known_hash:

  • If it does, no download happens and the file path is returned.

  • If it doesn’t, the file is downloaded once more to get an updated version on your computer.
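The decision described above can be sketched with the standard library alone. This is an illustration of the logic, not Pooch's actual implementation: the function names hash_matches and needs_download are hypothetical, and treating a bare hex digest as SHA256 is an assumption of this sketch.

```python
import hashlib
from pathlib import Path


def hash_matches(path, known_hash):
    """Return True if the file at ``path`` matches ``known_hash``.

    ``known_hash`` uses the "algorithm:hexdigest" form, for example
    "md5:70e2afd3...". A bare hex digest is treated as SHA256
    (an assumption of this sketch).
    """
    algorithm, _, expected = known_hash.rpartition(":")
    hasher = hashlib.new(algorithm or "sha256")
    with open(path, "rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected.lower()


def needs_download(path, known_hash):
    """Download if the file is missing or its hash does not match."""
    if not Path(path).exists():
        return True
    return not hash_matches(path, known_hash)
```

A script would call needs_download before fetching and skip the network request whenever it returns False, which is why repeated runs do not trigger repeated downloads.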

Since the download happens only once, you can place this function call at the start of your script or Jupyter notebook without having to worry about repeat downloads. Anyone getting a copy of your code should also get the correct data file the first time they run it.

See also

Pooch can handle multiple download protocols like HTTP, FTP, SFTP, and even download from repositories like figshare and Zenodo by using the DOI instead of a URL. See Download protocols.

See also

You can use different hashes by specifying different algorithm names: sha256:XXXXXX, sha1:XXXXXX, etc. See Hashes: Calculating and bypassing.

Unknown file hash

If you don’t know the hash of the file, you can set known_hash=None to bypass the check. retrieve will print a log message with the SHA256 hash of the downloaded file. It’s highly recommended that you copy and paste this hash into your code and use it as the known_hash.
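If the log message is no longer available, the hash can be recomputed from the downloaded file. The sketch below uses only Python's hashlib and builds a string in the algorithm:hexdigest form that known_hash expects. The helper name known_hash_for is hypothetical and not part of Pooch, which documents its own hash utilities under Hashes: Calculating and bypassing.

```python
import hashlib


def known_hash_for(path, algorithm="sha256"):
    """Build an "algorithm:hexdigest" string for the file at ``path``."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Hash the file in chunks to keep memory use small.
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return f"{algorithm}:{hasher.hexdigest()}"
```

Passing the path returned by retrieve to this helper gives a value that can be pasted into the code as the known_hash.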

Tip

Setting the known_hash guarantees that the next time your code is run (by you or someone else) the exact same file is downloaded. This helps make the results of your code reproducible.

Customizing the download

The pooch.retrieve function supports all of Pooch’s downloaders and processors. You can use HTTP, FTP, and SFTP (even with authentication), decompress files, unpack archives, show progress bars, and more with a bit of configuration.
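As one possible configuration, the sketch below passes a processor and a downloader to pooch.retrieve to decompress a file after download and show a progress bar. The wrapper name retrieve_decompressed is hypothetical, the URL and hash must be supplied by the caller, and the progress bar assumes the optional tqdm package is installed.

```python
def retrieve_decompressed(url, known_hash):
    """Download a compressed file and return the path to its decompressed copy."""
    # Imported here so this sketch can be loaded without Pooch installed.
    import pooch

    return pooch.retrieve(
        url=url,
        known_hash=known_hash,
        # Decompress the file after it is downloaded.
        processor=pooch.Decompress(),
        # Show a progress bar; this relies on the optional tqdm package.
        downloader=pooch.HTTPDownloader(progressbar=True),
    )
```

Because a processor is used, the returned path points to the decompressed file rather than the original download.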

When not to use retrieve

If you need to manage the download and caching of several files from one or more sources, then you should start using the full capabilities of the pooch.Pooch class. It can handle sandboxing data for different package versions, allow users to set the download locations, and more.

The classic example is a Python package that contains several sample datasets for use in testing and documentation.

See Fetching files from a registry and Manage a package’s sample data to get started.