Retrieving a data file

A common task in data analysis workflows is downloading the data from a publicly available source. This could be done manually (which can’t be easily reproduced) or programmatically using urllib or requests (which can require a non-trivial amount of code). Ideally, we should be checking that the downloaded file is not corrupted with a known checksum.

Getting started

Pooch is designed to simplify all of these tasks (and more). If you’re only looking to download one or two data files only, Pooch offers the pooch.retrieve function:

import pooch


# Download the file and save it locally.
fname = pooch.retrieve(
    # URL to one of Pooch's test files
    url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
    # Pooch will check the MD5 checksum of the downloaded file against the
    # given value to make sure it haven't been corrupted. You can use other
    # hashes by specifying different algorithm names (sha256, sha1, etc).
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)

The file is stored locally, by default in a folder called pooch in the default cache location of your operating system (see pooch.os_cache). The function returns the full path to the downloaded data file, which you can then pass to pandas, numpy, xarray, etc, to load into memory.

Running this code a second time will not trigger a download since the file already exists. So you can place this function call at the start of your script or Jupyter notebook without having to worry about repeat downloads. Anyone getting a copy of your code should also get the correct data file the first time they run it.

If the file is updated on the server and known_hash is set to the checksum of the new file, Pooch will automatically detect that the file needs to be updated and download the new version.

Note

The pooch.retrieve function is useful when you have one or two files to download. If you need to manage the download and caching of several files (for example, if you’re developing a Python package or for large data analysis projects), then you should start using the full capabilities of the pooch.Pooch class. It can handle sandboxing data for different package versions, allow users to set the download locations, and more.

See Beginner tricks and Intermediate tricks to get started.

Unknown file hash

If you don’t know the hash of the file, you can set known_hash=None to bypass the check. If this is the case, retrieve will print a log message with the SHA256 hash of the downloaded file. It’s highly recommended that you copy and paste this hash into your code and use it as the known_hash.

That way, the next time your code is run (by you or someone else) you can guarantee that the exact same file is downloaded. This is a way to help make sure the results of your code are reproducible.

Customizing the download

Function pooch.retrieve has support for all of Pooch’s downloaders and processors. You can use HTTP, FTP, and SFTP (with or without authentication), decompress files, unpack archives, show progress bars, and more with a bit of configuration.