Rule retrieve_databundle_light

Not all data dependencies are shipped with the git repository, since git is not suited to handling large, changing files. Instead, separate data bundles are provided, which can be obtained with the retrieve_databundle_light rule when the retrieve_databundle flag in the configuration file is enabled. In that case, the retrieve_databundle_light rule is included in the workflow. The common data needed to run the model are loaded according to the settings in config_default.yaml or config_tutorial.yaml, depending on the tutorial flag.

Script Documentation

The script retrieves common datasets such as exclusive economic zone polygons, landcover data, hydrobasins, global electricity demand datasets and regionally tailored cutouts.

This rule downloads the data bundle from zenodo, or from google drive as a backup option, and extracts it into the data, resources and cutouts sub-directories. Temporary bundle files are deleted once they have been downloaded and unzipped.

The :ref:`tutorial_electricity` uses a smaller data bundle, referred to as tutorial, which is used to run tutorials and tests.

The required bundles are downloaded automatically and tailored to the requested region: datasets are selected from the data bundles specified in the bundle configuration file ``bundle_config.yaml``, located in the ``config`` folder.

Each data bundle entry has the following structure:

.. code:: yaml

  bundle_name:  # name of the bundle
    countries: [country code, region code or country list]  # list of countries represented in the databundle
    [tutorial: true/false]  # (optional, default false) whether the bundle is a tutorial or not
    category: common/resources/data/cutouts  # category of data contained in the bundle
    destination: "."  # folder where to unzip the files with respect to the repository root ("" or ".")
    urls:  # list of urls by source, e.g. zenodo or google
      zenodo: {zenodo url}  # key to download data from zenodo
      gdrive: {google url}  # key to download data from google drive
      protectedplanet: {url}  # key to download data from protected planet; the url can contain {month:s} and {year:d} to let the workflow specify the current month and year
      direct: {url}  # key to download data directly from a url; if unzip option is enabled data are unzipped
      post:  # key to download data using an url post request; if unzip option is enabled data are unzipped
        url: {url}
        [post arguments]
    [unzip: true/false]  # (optional, default false) used in direct download technique to automatically unzip files
    output: [...]  # list of outputs of the databundle
    [disable_by_opt:]  # option to disable outputs from the bundle; it contains a dictionary of options,
                       # each one with its output. When "all" is specified, the entire bundle is not executed
      [{option}: [outputs,...,/all]]  # list of options and the outputs to remove, or "all" corresponding to ignore everything

Depending on the requested country list, all needed databundles are downloaded according to the following rules:

  • The databundle shall adhere to the tutorial configuration: when the tutorial configuration is running, only the databundles whose tutorial flag is true are downloaded
  • For every data category, the most suitable bundles are downloaded in order of the number of countries matched: the algorithm sorts all bundles matching the category by the number of requested countries they cover and downloads them starting from those covering the most countries, until all countries are matched or no more bundles are available
  • For every bundle to download, priority is given to the first bundle source listed in the urls option of the bundle configuration; when a source fails, the next source is tried, and so on
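
The selection rule in the second bullet can be illustrated with a minimal sketch. It is not the code of the script: the function name ``select_bundles``, the example bundle names and the decision to skip bundles that add no new country are assumptions made for illustration, and region codes are not handled.

```python
def select_bundles(countries, category, bundles, tutorial=False):
    """Greedy selection of bundles, most-covering first (illustrative sketch only)."""
    requested = set(countries)
    # Keep bundles of the requested category and tutorial flag
    # that cover at least one requested country.
    matches = {}
    for name, cfg in bundles.items():
        if cfg["category"] != category:
            continue
        if cfg.get("tutorial", False) != tutorial:
            continue
        covered = set(cfg["countries"]) & requested
        if covered:
            matches[name] = covered

    selected, remaining = [], set(requested)
    # Visit bundles from the largest to the smallest number of matched countries.
    for name in sorted(matches, key=lambda n: len(matches[n]), reverse=True):
        if not remaining:
            break
        # Assumption: a bundle that adds no new country is skipped.
        if matches[name] & remaining:
            selected.append(name)
            remaining -= matches[name]
    return selected


# Hypothetical bundle configurations, following the entry structure above.
bundles = {
    "cutouts_africa": {"countries": ["NG", "BJ", "GH"], "category": "cutouts"},
    "cutouts_ng": {"countries": ["NG"], "category": "cutouts"},
    "data_world": {"countries": ["NG", "BJ"], "category": "data"},
    "cutouts_tutorial": {"countries": ["NG"], "category": "cutouts", "tutorial": True},
}

print(select_bundles(["NG", "BJ"], "cutouts", bundles))  # ['cutouts_africa']
```

In this example a single bundle covers both requested countries, so the smaller ``cutouts_ng`` bundle is never needed.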

Relevant Settings

.. code:: yaml

  tutorial:  # configuration stating whether the tutorial is needed

.. seealso:: Documentation of the configuration file config.yaml at :ref:`toplevel_cf`

Outputs

  • data: input data unzipped into the data folder
  • resources: input data unzipped into the resources folder
  • cutouts: input data unzipped into the cutouts folder

load_databundle_config(config)

Load databundle configurations from a file path or a dictionary.

Parameters

config : dict or str
    Configuration data for the databundles, either as a dictionary or as a path to a yaml file containing the configuration.

Returns

dict
    A dictionary containing the configuration for each databundle.

download_and_unzip_zenodo(config, rootpath, hot_run=True, disable_progress=False)

Function to download and unzip the data from zenodo

Parameters

config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true, the data are downloaded. When false, the workflow is run without downloading and unzipping.
disable_progress : bool (default False)
    When true, the progress bar for the download is disabled.

Returns

True when download is successful, False otherwise

download_and_unzip_gdrive(config, rootpath, hot_run=True, disable_progress=False)

Function to download and unzip the data from google drive

Parameters

config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true, the data are downloaded. When false, the workflow is run without downloading and unzipping.
disable_progress : bool (default False)
    When true, the progress bar for the download is disabled.

Returns

True when download is successful, False otherwise

download_and_unzip_protectedplanet(config, rootpath, attempts=3, hot_run=True, disable_progress=False)

Function to download and unzip the data by category from protectedplanet

Parameters

config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
attempts : int (default 3)
    Number of attempts to download the data by month. The download is attempted for the current and previous months, according to the number of attempts.
hot_run : bool (default True)
    When true, the data are downloaded. When false, the workflow is run without downloading and unzipping.
disable_progress : bool (default False)
    When true, the progress bar for the download is disabled.

Returns

True when download is successful, False otherwise
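
As a rough illustration of the ``attempts`` parameter, the sketch below builds the list of urls that would be tried for the current and previous months from a template containing ``{month:s}`` and ``{year:d}``. The helper name ``candidate_urls``, the example url and the use of abbreviated English month names are assumptions; the actual month format expected by protected planet may differ.

```python
from datetime import date

# Assumption: the month placeholder is filled with an abbreviated English month name.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def candidate_urls(url_template, attempts=3, today=None):
    """Return the urls to try, starting from the current month and going backwards."""
    today = today or date.today()
    year, month = today.year, today.month
    urls = []
    for _ in range(attempts):
        urls.append(url_template.format(month=MONTHS[month - 1], year=year))
        # Step back one month, rolling over to December of the previous year.
        month -= 1
        if month == 0:
            month, year = 12, year - 1
    return urls


# Hypothetical url template, not a real endpoint.
template = "https://example.org/WDPA_{month:s}{year:d}.zip"
for url in candidate_urls(template, attempts=3, today=date(2024, 2, 15)):
    print(url)
# https://example.org/WDPA_Feb2024.zip
# https://example.org/WDPA_Jan2024.zip
# https://example.org/WDPA_Dec2023.zip
```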

download_and_unpack(url, file_path, resource, destination, headers=None, hot_run=True, unzip=True, disable_progress=False)

A helper function to encapsulate retrieval and unzip

Parameters

hot_run : bool (default True)
    When true, the data are downloaded. When false, the workflow is run without downloading and unzipping.
disable_progress : bool (default False)
    When true, the progress bar for the download is disabled.

Returns

True when download is successful, False otherwise

download_and_unzip_direct(config, rootpath, hot_run=True, disable_progress=False)

Function to download the data by category from a direct url with no processing. If the unzip option is set to true in the configuration file, the downloaded data are unzipped.

Parameters

config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true, the data are downloaded. When false, the workflow is run without downloading and unzipping.
disable_progress : bool (default False)
    When true, the progress bar for the download is disabled.

Returns

True when download is successful, False otherwise

download_and_unzip_hydrobasins(config, rootpath, hot_run=True, disable_progress=False)

Function to download and unzip the hydrobasins data from the HydroBASINS database, available via https://www.hydrosheds.org/products/hydrobasins

We are using data from the HydroSHEDS version 1 database which is © World Wildlife Fund, Inc. (2006-2022) and has been used herein under license. WWF has not evaluated our data pipeline and therefore gives no warranty regarding its accuracy, completeness, currency or suitability for any particular purpose. Portions of the HydroSHEDS v1 database incorporate data which are the intellectual property rights of © USGS (2006-2008), NASA (2000-2005), ESRI (1992-1998), CIAT (2004-2006), UNEP-WCMC (1993), WWF (2004), Commonwealth of Australia (2007), and Her Royal Majesty and the British Crown and are used under license. The HydroSHEDS v1 database and more information are available at https://www.hydrosheds.org.

Parameters

config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true, the data are downloaded. When false, the workflow is run without downloading and unzipping.
disable_progress : bool (default False)
    When true, the progress bar for the download is disabled.

Returns

True when download is successful, False otherwise

download_and_unzip_post(config, rootpath, hot_run=True, disable_progress=False)

Function to download the data by category from a post request.

Parameters

config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true, the data are downloaded. When false, the workflow is run without downloading and unzipping.
disable_progress : bool (default False)
    When true, the progress bar for the download is disabled.

Returns

True when download is successful, False otherwise

get_best_bundles_by_category(country_list, category, config_bundles, tutorial, config_enable)

Function to get the best bundles that download the data for selected countries, given category and tutorial characteristics.

The selected bundles shall adhere to the following criteria:

  • The bundles' tutorial parameter shall match the tutorial argument
  • The bundles' category shall match the category of data to download
  • When multiple bundles are identified for the same category, the bundles matching more countries are selected first, and more bundles are added until all countries are matched or no more bundles are available

Parameters

country_list : list
    List of country codes for the countries to download
category : str
    Category of the data to download
config_bundles : dict
    Dictionary of configurations for all available bundles
tutorial : bool
    Whether data for tutorial shall be downloaded
config_enable : dict
    Dictionary of the enabled/disabled scripts

Returns

list
    List of bundles to download

get_best_bundles(countries, config_bundles, tutorial, config_enable, include_categories=[], exclude_categories=[])

Function to get the best bundles that download the data for selected countries, given tutorial characteristics.

First, the categories of data to download are identified in accordance with the bundles that match the list of countries and the tutorial configuration.

Then, the bundles to be downloaded shall adhere to the following criteria:

  • The bundles' tutorial parameter shall match the tutorial argument
  • The bundles' category shall match the category of data to download
  • When multiple bundles are identified for the same category, the bundles matching more countries are selected first, and more bundles are added until all countries are matched or no more bundles are available

Parameters

countries : list
    List of country codes for the countries to download
config_bundles : dict
    Dictionary of configurations for all available bundles
tutorial : bool
    Whether data for tutorial shall be downloaded
config_enable : dict
    Dictionary of the enabled/disabled scripts
include_categories : list
    (Optional) Lists of config bundle categories to include; when empty
exclude_categories : list
    (Optional) Lists of config bundle categories to exclude; when empty

Returns

list
    List of bundles to download

get_best_bundles_in_snakemake(config, include_categories=[], exclude_categories=[])

Function to get the best bundles to download in snakemake, given the configuration file and the categories to include/exclude.

Parameters

config : dict
    Configuration for the data bundles
include_categories : list
    (Optional) Lists of config bundle categories to include; when empty
exclude_categories : list
    (Optional) Lists of config bundle categories to exclude; when empty

Returns

list
    List of bundles to download

datafiles_retrivedatabundle(config, bundles_to_download)

Function to get the output files from the bundles, given the target countries, tutorial settings, etc.

Parameters

config : dict
    Configuration dictionary for the data bundles
bundles_to_download : list
    List of bundles to download

Returns

list
    List of output files from the bundles to download
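
The interplay between the ``output`` and ``disable_by_opt`` keys of a bundle entry can be sketched as follows. This is an assumption-laden illustration rather than the function's implementation: it assumes that an option listed under ``disable_by_opt`` takes effect when the same key is true in a ``config_enable`` dictionary, and the helper name ``enabled_outputs``, the option ``build_demo_shapes`` and the file names are hypothetical.

```python
def enabled_outputs(bundle_cfg, config_enable):
    """Return the outputs of a bundle that remain after applying disable_by_opt."""
    outputs = list(bundle_cfg.get("output", []))
    for option, disabled in bundle_cfg.get("disable_by_opt", {}).items():
        # Assumption: an option only applies when it is enabled in the configuration.
        if not config_enable.get(option, False):
            continue
        # "all" means the entire bundle is ignored.
        if "all" in disabled:
            return []
        outputs = [path for path in outputs if path not in disabled]
    return outputs


# Hypothetical bundle entry with two outputs.
bundle = {
    "output": ["cutouts/cutout-demo.nc", "data/demo/shapes.gpkg"],
    "disable_by_opt": {
        "build_cutout": ["all"],
        "build_demo_shapes": ["data/demo/shapes.gpkg"],
    },
}

print(enabled_outputs(bundle, {}))
# ['cutouts/cutout-demo.nc', 'data/demo/shapes.gpkg']
print(enabled_outputs(bundle, {"build_demo_shapes": True}))
# ['cutouts/cutout-demo.nc']
print(enabled_outputs(bundle, {"build_cutout": True}))
# []
```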

retrieve_databundle(bundles_to_download, config_bundles, hydrobasins_level, rootpath='.', disable_progress=False)

Retrieve the specified databundles and unzip them. Also provides warning messages in case of download failure and logs the successfully downloaded bundles.

Parameters

bundles_to_download : list
    A list of databundle names to download.
config_bundles : dict
    A dictionary containing the configuration for each databundle.
hydrobasins_level : int
    The level of hydrobasins to retrieve.
rootpath : str
    The root path for the downloaded files.
disable_progress : bool
    Whether to disable the progress bar.

Returns

None
    The function downloads and unzips the databundles and does not return anything.
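
The source-priority rule (first source listed under ``urls`` first, then the next one on failure) can be sketched as below. The function ``download_with_fallback`` and the stub downloaders are hypothetical stand-ins for the ``download_and_unzip_*`` functions documented above; only their True/False return convention is taken from this page.

```python
import logging

logger = logging.getLogger(__name__)


def download_with_fallback(bundle_name, urls, downloaders):
    """Try each source in the order listed in the bundle's urls; return the one that worked."""
    for source, url in urls.items():  # dictionaries keep the order of the configuration
        downloader = downloaders.get(source)
        if downloader is None:
            logger.warning("No downloader for source %s of bundle %s", source, bundle_name)
            continue
        if downloader(url):
            logger.info("Bundle %s downloaded from %s", bundle_name, source)
            return source
        logger.warning("Source %s failed for bundle %s, trying the next one", source, bundle_name)
    return None


# Stub downloaders that only mimic the True/False return value.
def failing_source(url):
    return False


def working_source(url):
    return True


urls = {
    "zenodo": "https://example.org/bundle.zip",   # hypothetical url
    "gdrive": "https://example.org/backup.zip",   # hypothetical url
}
downloaders = {"zenodo": failing_source, "gdrive": working_source}

print(download_with_fallback("bundle_demo", urls, downloaders))  # gdrive
```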

check_retrieved_cutout_match(snakemake)

Validate that the retrieved cutout spatially matches the region.

Parameters

snakemake : snakemake.io.Snakemake
    Snakemake object containing input and output file paths.

Returns

None
    The function checks if the retrieved cutout matches the region and logs the result.
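
One simple way to think about such a spatial check is a bounding-box containment test, sketched below. Whether the function actually compares bounding boxes is not stated on this page, so the approach, the helper name ``bounds_cover`` and the example coordinates are all assumptions.

```python
def bounds_cover(cutout_bounds, region_bounds, tolerance=0.0):
    """Return True when the cutout bounding box contains the region bounding box.

    Both arguments are (xmin, ymin, xmax, ymax) tuples in the same coordinate system.
    """
    cx0, cy0, cx1, cy1 = cutout_bounds
    rx0, ry0, rx1, ry1 = region_bounds
    return (
        cx0 - tolerance <= rx0
        and cy0 - tolerance <= ry0
        and cx1 + tolerance >= rx1
        and cy1 + tolerance >= ry1
    )


# Illustrative bounding boxes in degrees (longitude, latitude).
region = (2.7, 4.2, 14.7, 13.9)
large_cutout = (-20.0, -35.0, 55.0, 38.0)
other_cutout = (-12.0, 33.0, 35.0, 72.0)

print(bounds_cover(large_cutout, region))  # True
print(bounds_cover(other_cutout, region))  # False
```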

debug_using_databundle_cli(snakemake)

Checks whether all Snakemake output files exist (ignoring "data/landcover") and exits if they do. If any outputs are missing, the function reroutes execution to a command-line interface script for debugging.

Parameters

snakemake : snakemake.io.Snakemake
    Snakemake object containing output file paths.

Returns

None
    The function checks for the existence of output files and reroutes to a CLI script if any are missing.
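
The existence check described above can be sketched with the standard library. The helper name ``missing_outputs`` and the substring match used to ignore "data/landcover" are assumptions; the rerouting to the CLI script is not shown.

```python
import os
import tempfile


def missing_outputs(outputs, ignored=("data/landcover",)):
    """Return the output paths that do not exist, skipping the ignored ones."""
    missing = []
    for path in outputs:
        normalized = path.replace(os.sep, "/")
        # Assumption: a path is ignored when it contains one of the ignored fragments.
        if any(fragment in normalized for fragment in ignored):
            continue
        if not os.path.exists(path):
            missing.append(path)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.txt")
    open(present, "w").close()  # create an empty file
    absent = os.path.join(tmp, "absent.txt")
    landcover = os.path.join(tmp, "data/landcover", "tile.tif")  # never created, but ignored
    result = missing_outputs([present, absent, landcover])
    print(result == [absent])  # True
```

When the returned list is empty, all relevant outputs already exist and there is nothing left to retrieve.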