pooch.create

pooch.create(path, base_url, version=None, version_dev='master', env=None, registry=None, urls=None, retry_if_failed=0)

Create a Pooch with sensible defaults to fetch data files.

If a version string is given, the Pooch will be versioned, meaning that the local storage folder and the base URL depend on the project version. This is necessary if your users have multiple versions of your library installed (for example, in separate virtual environments) and you updated the data files between versions. Otherwise, every switch between environments would trigger a re-download of the data. The version string will be appended to the local storage path (for example, ~/.mypooch/cache/v0.1) and inserted into the base URL (for example, https://github.com/fatiando/pooch/raw/v0.1/data). If the version string contains +XX.XXXXX, it will be interpreted as a development version and version_dev will be used in its place.
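The versioning rule described above can be sketched in plain Python. This is a simplified illustration, not Pooch's actual implementation: the helper names resolve_version and versioned_locations are hypothetical, and the development check is reduced to looking for a "+" in the version string.

```python
from pathlib import Path


def resolve_version(version, version_dev="master"):
    """Return the name used in the storage path and base URL (simplified)."""
    if version is None:
        return None
    # Assumption: any "+XX.XXXXX" suffix marks a development version.
    if "+" in version:
        return version_dev
    return version


def versioned_locations(path, base_url, version=None, version_dev="master"):
    """Mimic how the version ends up in the local path and the base URL."""
    resolved = resolve_version(version, version_dev)
    if resolved is None:
        # Unversioned: the path and the URL are used unchanged.
        return Path(path), base_url
    return Path(path) / resolved, base_url.format(version=resolved)


release = versioned_locations(
    "myproject", "http://some.link.com/{version}/", version="v0.1"
)
dev = versioned_locations(
    "myproject", "http://some.link.com/{version}/", version="v0.1+12.do9iwd"
)
print(release[0].parts, release[1])
print(dev[0].parts, dev[1])
```

The release version appears verbatim in both locations, while the development version is replaced by the version_dev name in both.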

This function does not create the local data storage folder. The folder will only be created the first time a download is attempted with pooch.Pooch.fetch. This makes it safe to use this function at the module level (so it’s executed on import and the resulting Pooch is a global variable).

Parameters
  • path (str, PathLike, list or tuple) – The path to the local data storage folder. If this is a list or tuple, we’ll join the parts with the appropriate separator. The version will be appended to the end of this path. Use pooch.os_cache for a sensible default.

  • base_url (str) – Base URL for the remote data source. All requests will be made relative to this URL. The string should have a {version} formatting mark in it. We will call .format(version=version) on this string. If the URL is a directory path, it must end in a '/' because we will not add one.

  • version (str or None) – The version string for your project. Should be PEP440 compatible. If None is given, will not attempt to format base_url and no subfolder will be appended to path.

  • version_dev (str) – The name used for the development version of a project. If your data is hosted on GitHub (and base_url is a GitHub raw link), then "master" is a good choice (default). Ignored if version is None.

  • env (str or None) – The name of an environment variable that can be used to override path. This allows users to control where they want the data to be stored. We’ll append version to the end of this value as well.

  • registry (dict or None) – A record of the files that are managed by this Pooch. Keys should be the file names and the values should be their hashes. Only files in the registry can be fetched from the local storage. Files in subdirectories of path must use Unix-style separators ('/') even on Windows.

  • urls (dict or None) – Custom URLs for downloading individual files in the registry. A dictionary with the file names as keys and the custom URLs as values. Not all files in registry need an entry in urls. If a file has an entry in urls, the base_url will be ignored when downloading it in favor of urls[fname].

  • retry_if_failed (int) – Retry a file download the specified number of times if it fails because of a bad connection or a hash mismatch. By default, downloads are only attempted once (retry_if_failed=0). Initially, will wait for 1s between retries and then increase the wait time by 1s with each retry until a maximum of 10s.
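The wait schedule described for retry_if_failed can be written out as a small sketch. The function name retry_wait_times is hypothetical and the code only reproduces the schedule stated above (1s, then 1s more per retry, capped at 10s); it is not taken from Pooch's source.

```python
def retry_wait_times(retry_if_failed, max_wait=10):
    """Seconds to wait before each retry: 1, 2, 3, ... capped at max_wait."""
    return [min(attempt, max_wait) for attempt in range(1, retry_if_failed + 1)]


print(retry_wait_times(0))   # no retries, so no waiting
print(retry_wait_times(3))
print(retry_wait_times(12))  # the last waits stay at the 10s cap
```

With the default retry_if_failed=0 the list is empty, which matches a single download attempt.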

Returns

pooch (Pooch) – The Pooch initialized with the given arguments.

Examples

Create a Pooch for a release (v0.1):

>>> from pooch import create
>>> pup = create(path="myproject",
...              base_url="http://some.link.com/{version}/",
...              version="v0.1",
...              registry={"data.txt": "9081wo2eb2gc0u..."})
>>> print(pup.path.parts)  # The path is a pathlib.Path
('myproject', 'v0.1')
>>> # The local folder is only created when a dataset is first downloaded
>>> print(pup.path.exists())
False
>>> print(pup.base_url)
http://some.link.com/v0.1/
>>> print(pup.registry)
{'data.txt': '9081wo2eb2gc0u...'}
>>> print(pup.registry_files)
['data.txt']

If this is a development version (12 commits ahead of v0.1), then the version_dev will be used (defaults to "master"):

>>> pup = create(path="myproject",
...              base_url="http://some.link.com/{version}/",
...              version="v0.1+12.do9iwd")
>>> print(pup.path.parts)
('myproject', 'master')
>>> print(pup.base_url)
http://some.link.com/master/

Versioning is optional (but highly encouraged):

>>> pup = create(path="myproject",
...              base_url="http://some.link.com/",
...              registry={"data.txt": "9081wo2eb2gc0u..."})
>>> print(pup.path.parts)  # The path is a pathlib.Path
('myproject',)
>>> print(pup.base_url)
http://some.link.com/

To place the storage folder at a subdirectory, pass in a list and we’ll join the path for you using the appropriate separator for your operating system:

>>> pup = create(path=["myproject", "cache", "data"],
...              base_url="http://some.link.com/{version}/",
...              version="v0.1")
>>> print(pup.path.parts)
('myproject', 'cache', 'data', 'v0.1')

The user can override the storage path by setting an environment variable:

>>> # The variable is not set so we'll use *path*
>>> pup = create(path=["myproject", "not_from_env"],
...              base_url="http://some.link.com/{version}/",
...              version="v0.1",
...              env="MYPROJECT_DATA_DIR")
>>> print(pup.path.parts)
('myproject', 'not_from_env', 'v0.1')
>>> # Set the environment variable and try again
>>> import os
>>> os.environ["MYPROJECT_DATA_DIR"] = os.path.join("myproject", "env")
>>> pup = create(path=["myproject", "not_env"],
...              base_url="http://some.link.com/{version}/",
...              version="v0.1",
...              env="MYPROJECT_DATA_DIR")
>>> print(pup.path.parts)
('myproject', 'env', 'v0.1')
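
The urls parameter is not exercised in the examples above. As a rough sketch of the behavior described under Parameters, the lookup for a single file might look like the following. The helper resolve_url, the file big.nc, its placeholder hash, and the mirror URL are all hypothetical and only serve the illustration.

```python
def resolve_url(fname, base_url, registry, urls=None):
    """Pick the download URL for one registry file (simplified sketch)."""
    if fname not in registry:
        raise ValueError(f"File '{fname}' is not in the registry.")
    urls = urls or {}
    # A custom entry wins; otherwise the file name is appended to base_url,
    # which is why a directory-style base_url has to end in "/".
    return urls.get(fname, base_url + fname)


# Hypothetical registry: "big.nc" and its hash are placeholders.
registry = {"data.txt": "9081wo2eb2gc0u...", "big.nc": "placeholder-hash"}
urls = {"big.nc": "https://example.org/mirror/big.nc"}
base_url = "http://some.link.com/v0.1/"

print(resolve_url("data.txt", base_url, registry, urls))
print(resolve_url("big.nc", base_url, registry, urls))
```

Only big.nc has an entry in urls, so data.txt is still resolved relative to base_url.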