## The problem

You develop a Python library called plumbus for analysing data emitted by interdimensional portals. You want to distribute sample data so that your users can easily try out the library by copying and pasting from the docs. You want to have a plumbus.datasets module that defines functions like fetch_c137() that will return the data loaded as a pandas.DataFrame for convenient access.

## Assumptions

We’ll set up a Pooch to solve your data distribution needs. In this example, we’ll work with the following assumptions:

1. Your sample data are in a folder of your GitHub repository.
2. You use git tags to mark releases of your project in the history.
3. Your project has a variable that defines the version string.
4. The version string contains an indicator that the current commit is not a release (like 'v1.2.3+12.d908jdl' or 'v0.1+dev').

Let’s say that this is the layout of your repository on GitHub:

doc/
    ...
data/
    c137.csv
    cronen.csv
plumbus/
    __init__.py
    ...
    datasets.py
setup.py
...


The sample data are stored in the data folder of your repository.

## Setup

Pooch can download and cache your data files to the user’s computer automatically. This is what the plumbus/datasets.py file would look like:

"""
"""
import pandas
import pooch

from . import version  # The version string of your project

GOODBOY = pooch.create(
    # Use the default cache folder for the OS
    path=pooch.os_cache("plumbus"),
    # The remote data is on GitHub
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    version=version,
    # If this is a development version, get the data from the master branch
    version_dev="master",
    # The registry specifies the files that can be fetched from the local storage
    registry={
        "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc",
        "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w",
    },
)

def fetch_c137():
    """
    Load the C-137 sample data as a pandas.DataFrame.
    """
    # The file will be downloaded automatically the first time this is run.
    fname = GOODBOY.fetch("c137.csv")
    # Read the cached CSV file into a DataFrame
    data = pandas.read_csv(fname)
    return data

def fetch_cronen():
    """
    Load the Cronenberg sample data as a pandas.DataFrame.
    """
    # The file will be downloaded automatically the first time this is run.
    fname = GOODBOY.fetch("cronen.csv")
    # Read the cached CSV file into a DataFrame
    data = pandas.read_csv(fname)
    return data


When the user calls plumbus.datasets.fetch_c137() for the first time, the data file will be downloaded and stored in the local storage. In this case, we’re using pooch.os_cache to set the local folder to the default cache location for your OS. You could also provide any other path if you prefer. See the documentation for pooch.create for more options.
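To make the role of version and version_dev more concrete, here is a minimal sketch of how the {version} placeholder in base_url could be filled in. It is a simplification and not Pooch's actual implementation: the helper name resolve_base_url is hypothetical, and it assumes that a "+" in the version string is what marks a development version, as in the examples from the assumptions above.

```python
def resolve_base_url(base_url, version, version_dev="master"):
    """Fill the {version} placeholder in base_url (simplified sketch)."""
    # Assumption: a "+" marks a non-release version such as "v0.1+dev".
    # In that case, point to the development branch instead of a tag.
    ref = version_dev if "+" in version else version
    return base_url.format(version=ref)


template = "https://github.com/rick/plumbus/raw/{version}/data/"

# A release version points to the matching git tag
release_url = resolve_base_url(template, "v1.2.3")
# A development version falls back to the master branch
dev_url = resolve_base_url(template, "v1.2.3+12.d908jdl")

print(release_url)
print(dev_url)
```

Under these assumptions, users of a released plumbus get the data as it was at that release tag, while developers working from an unreleased commit get the latest data from master.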

## Hashes

Pooch uses SHA256 hashes to check if files are up-to-date or possibly corrupted:

• If a file exists in the local folder, Pooch will check that its hash matches the one in the registry. If it doesn’t, we’ll assume that it needs to be updated.
• If a file needs to be updated or doesn’t exist, Pooch will download it from the remote source and check the hash. If the hash doesn’t match, an exception is raised to warn of possible file corruption.
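The two rules above can be sketched with Python's standard hashlib module. This only illustrates the logic and is not how Pooch itself is implemented. The helper names sha256_of and needs_download are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files don't need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, known_hash):
    """True if the file is missing or its hash differs from the registry."""
    path = Path(path)
    return not path.exists() or sha256_of(path) != known_hash
```

After a download, the same sha256_of check could be repeated on the new file, raising an exception if the digest still doesn't match the registry entry.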

You can generate hashes for your data files using the terminal: