Retrieving a single data file¶
If you only want to download one or two data files, use the pooch.retrieve function:
    import pooch

    file_path = pooch.retrieve(
        # URL to one of Pooch's test files
        url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
        known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
    )
The code above will:
Check if the file from this URL already exists in Pooch’s default cache folder (see pooch.os_cache).
If it doesn’t, the file is downloaded and saved to the cache folder.
The MD5 hash is compared against the known_hash to make sure the file isn’t corrupted.
The function returns the absolute path to the file on your computer.
If the file already exists on your machine, Pooch will check whether its MD5 hash matches the known_hash:
If it does, no download happens and the file path is returned.
If it doesn’t, the file is downloaded once more to get an updated version on your computer.
Since the download happens only once, you can place this function call at the start of your script or Jupyter notebook without having to worry about repeat downloads. Anyone getting a copy of your code should also get the correct data file the first time they run it.
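The decision described above can be sketched with the standard library alone. This illustrates the logic only and is not Pooch's actual implementation; `file_matches` and `needs_download` are hypothetical helper names, and the sketch assumes the `algorithm:hexdigest` form of `known_hash` used in the example.

```python
import hashlib
from pathlib import Path


def file_matches(path, known_hash):
    """Return True if the file exists and its hash equals known_hash.

    known_hash is expected in "algorithm:hexdigest" form, e.g. "md5:70e2...".
    """
    path = Path(path)
    if not path.is_file():
        return False
    algorithm, _, expected = known_hash.partition(":")
    digest = hashlib.new(algorithm)
    # Read in chunks so large files do not have to fit in memory.
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()


def needs_download(path, known_hash):
    """A download is needed when the file is missing or its hash differs."""
    return not file_matches(path, known_hash)
```

A file that is present and matches is left alone; a missing or mismatching file triggers a fresh download.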
You can use different hashes by specifying different algorithm names: sha256:XXXXXX, sha1:XXXXXX, etc. See Hashes: Calculating and bypassing.
Unknown file hash¶
If you don’t know the hash of the file, you can set known_hash=None to
bypass the check.
pooch.retrieve will print a log message with the SHA256 hash of the
downloaded file.
It’s highly recommended that you copy and paste this hash into your code
and use it as the known_hash.
Setting the known_hash guarantees that the next time your code is run
(by you or someone else) the exact same file is downloaded. This helps
make the results of your code reproducible.
Customizing the download¶
The pooch.retrieve function supports all of Pooch’s
downloaders and processors.
You can use HTTP, FTP, and SFTP
(even with authentication),
show progress bars, and more with a bit of configuration.
When not to use retrieve¶
If you need to manage the download and caching of several files from one or
more sources, then you should start using the full capabilities of the pooch.Pooch class.
It can handle sandboxing
data for different package versions, allow users to set the download
locations, and more.
The classic example is a Python package that contains several sample datasets for use in testing and documentation.