Data Download & Preparation¶
Scripts for downloading and preparing input data.
retrieve_databundle_light
¶
The script retrieves common datasets such as exclusive economic zone polygons, landcover data, hydrobasins, global electricity demand datasets and regionally tailored cutouts.
This rule downloads the data bundle from zenodo, or from google drive as
a backup option, and extracts it into the data, resources and cutouts
sub-directories. Temporary bundle files are deleted once downloaded and unzipped.
The :ref:`tutorial_electricity` uses a smaller data bundle, referred to as
tutorial, that is used to run tutorials and tests.
The required bundles are downloaded automatically and tailored to
the requested region: datasets are selected from the data bundles specified
in the bundle configuration file ``bundle_config.yaml``, located in the ``config``
folder.
Each data bundle entry has the following structure:
.. code:: yaml

    bundle_name:  # name of the bundle
      countries: [country code, region code or country list]  # list of countries represented in the databundle
      [tutorial: true/false]  # (optional, default false) whether the bundle is a tutorial or not
      category: common/resources/data/cutouts  # category of data contained in the bundle
      destination: "."  # folder where to unzip the files with respect to the repository root ("" or ".")
      urls:  # list of urls by source, e.g. zenodo or google
        zenodo: {zenodo url}  # key to download data from zenodo
        gdrive: {google url}  # key to download data from google drive
        protectedplanet: {url}  # key to download data from protected planet; the url can contain {month:s} and {year:d} to let the workflow specify the current month and year
        direct: {url}  # key to download data directly from a url; if unzip option is enabled data are unzipped
        post:  # key to download data using an url post request; if unzip option is enabled data are unzipped
          url: {url}
          [post arguments]
      [unzip: true/false]  # (optional, default false) used in direct download technique to automatically unzip files
      output: [...]  # list of outputs of the databundle
      [disable_by_opt:]  # option to disable outputs from the bundle; it contains a dictionary of options,
                         # each one with its output. When "all" is specified, the entire bundle is not executed
        [{option}: [outputs,...,/all]]  # list of options and the outputs to remove, or "all" corresponding to ignore everything

Depending on the requested country list, all needed databundles are downloaded according to the following rules:
- The databundle shall adhere to the tutorial configuration: when the tutorial configuration is running, only the databundles whose tutorial flag is true are downloaded
- For every data category, the most suitable bundles are downloaded in order of the number of countries matched: among the bundles matching the category, the algorithm sorts the bundles by the number of countries they match and downloads them starting from those matching the most countries, until all countries are matched or no more bundles are available
- For every bundle to download, priority is given to the first bundle source, as listed in the ``urls`` option of each bundle configuration; when a source fails, the next source is used, and so on
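The selection rules above can be sketched in a few lines of Python. This is a minimal illustration and not the script's actual code: the function name ``select_bundles`` and the example bundle names are hypothetical, country entries are assumed to be plain country codes (region codes are not resolved), and the sketch assumes a bundle is skipped when it adds no country that is still unmatched.

```python
def select_bundles(countries, bundles, category, tutorial=False):
    """Pick bundles for one category, preferring those matching more countries."""
    wanted = set(countries)
    # Keep only bundles with the right tutorial flag and category.
    candidates = {
        name: set(cfg["countries"])
        for name, cfg in bundles.items()
        if cfg.get("tutorial", False) == tutorial and cfg["category"] == category
    }
    # Sort once by the number of requested countries each bundle matches.
    ranked = sorted(candidates, key=lambda n: len(candidates[n] & wanted), reverse=True)
    remaining = set(wanted)
    selected = []
    for name in ranked:
        if not remaining:
            break  # every requested country is already covered
        matched = candidates[name] & remaining
        if matched:  # assumption: skip bundles that add no new country
            selected.append(name)
            remaining -= matched
    return selected


# Hypothetical bundle configuration used only for this example.
example_bundles = {
    "bundle_africa": {"countries": ["NG", "ZA", "BJ"], "category": "data"},
    "bundle_ng": {"countries": ["NG"], "category": "data"},
    "bundle_asia": {"countries": ["IN"], "category": "data"},
    "bundle_tutorial": {"countries": ["NG"], "category": "data", "tutorial": True},
    "cutouts_africa": {"countries": ["NG", "ZA"], "category": "cutouts"},
}
selected = select_bundles(["NG", "ZA", "IN"], example_bundles, "data")
print(selected)
```

In this example the larger bundle is picked first, the single-country bundle for NG is skipped because NG is already covered, and the remaining country is taken from the next bundle.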
Relevant Settings
.. code:: yaml
tutorial: # configuration stating whether the tutorial is needed
.. seealso::
Documentation of the configuration file ``config.yaml`` at
:ref:`toplevel_cf`
Outputs
- data: input data unzipped into the data folder
- resources: input data unzipped into the resources folder
- cutouts: input data unzipped into the cutouts folder
load_databundle_config(config)
¶
download_and_unzip_zenodo(config, rootpath, hot_run=True, disable_progress=False)
¶
download_and_unzip_zenodo(config, rootpath, dest_path, hot_run=True, disable_progress=False)
Function to download and unzip the data from zenodo
Parameters¶
config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true the data are downloaded.
    When false, the workflow is run without downloading and unzipping
disable_progress : bool (default False)
    When true the progress bar to download data is disabled
Returns¶
True when download is successful, False otherwise
download_and_unzip_gdrive(config, rootpath, hot_run=True, disable_progress=False)
¶
download_and_unzip_gdrive(config, rootpath, dest_path, hot_run=True, disable_progress=False)
Function to download and unzip the data from google drive
Parameters¶
config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true the data are downloaded.
    When false, the workflow is run without downloading and unzipping
disable_progress : bool (default False)
    When true the progress bar to download data is disabled
Returns¶
True when download is successful, False otherwise
download_and_unzip_protectedplanet(config, rootpath, attempts=3, hot_run=True, disable_progress=False)
¶
download_and_unzip_protectedplanet(config, rootpath, dest_path, hot_run=True, disable_progress=False)
Function to download and unzip the data by category from protectedplanet
Parameters¶
config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
attempts : int (default 3)
    Number of attempts to download the data by month.
    The download is attempted for the current and previous months according to the number of attempts
hot_run : bool (default True)
    When true the data are downloaded.
    When false, the workflow is run without downloading and unzipping
disable_progress : bool (default False)
    When true the progress bar to download data is disabled
Returns¶
True when download is successful, False otherwise
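The month-by-month retry described for ``attempts`` can be illustrated as follows. This is only a sketch under stated assumptions: the helper ``candidate_urls``, the URL template, and the use of three-letter English month abbreviations for ``{month:s}`` are all hypothetical and are not taken from the script.

```python
from datetime import date

# Assumption: months are rendered as three-letter English abbreviations.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def candidate_urls(url_template, attempts=3, today=None):
    """Return one URL per attempt: current month first, then previous months."""
    today = today or date.today()
    year, month = today.year, today.month
    urls = []
    for _ in range(attempts):
        urls.append(url_template.format(month=MONTH_ABBR[month - 1], year=year))
        month -= 1
        if month == 0:  # roll back from January to December of the previous year
            month, year = 12, year - 1
    return urls


# Hypothetical template containing the {month:s} and {year:d} placeholders.
template = "https://example.org/protected_{month:s}{year:d}.zip"
urls = candidate_urls(template, attempts=3, today=date(2024, 2, 15))
print(urls)
```

Each URL would be tried in turn until one download succeeds.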
download_and_unpack(url, file_path, resource, destination, headers=None, hot_run=True, unzip=True, disable_progress=False)
¶
download_and_unpack( url, file_path, resource, destination, headers=None, hot_run=True, unzip=True, disable_progress=False)
A helper function to encapsulate retrieval and unzip
Parameters¶
hot_run : bool (default True)
    When true the data are downloaded.
    When false, the workflow is run without downloading and unzipping
disable_progress : bool (default False)
    When true the progress bar to download data is disabled
Returns¶
True when download is successful, False otherwise
download_and_unzip_direct(config, rootpath, hot_run=True, disable_progress=False)
¶
download_and_unzip_direct(config, rootpath, dest_path, hot_run=True, disable_progress=False)
Function to download the data by category from a direct url with no processing. If in the configuration file the unzip is specified True, then the downloaded data is unzipped.
Parameters¶
config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true the data are downloaded.
    When false, the workflow is run without downloading and unzipping
disable_progress : bool (default False)
    When true the progress bar to download data is disabled
Returns¶
True when download is successful, False otherwise
download_and_unzip_hydrobasins(config, rootpath, hot_run=True, disable_progress=False)
¶
download_and_unzip_basins(config, rootpath, dest_path, hot_run=True, disable_progress=False)
Function to download and unzip the data for hydrobasins from HydroBASINS database available via https://www.hydrosheds.org/products/hydrobasins
We are using data from the HydroSHEDS version 1 database which is © World Wildlife Fund, Inc. (2006-2022) and has been used herein under license. WWF has not evaluated our data pipeline and therefore gives no warranty regarding its accuracy, completeness, currency or suitability for any particular purpose. Portions of the HydroSHEDS v1 database incorporate data which are the intellectual property rights of © USGS (2006-2008), NASA (2000-2005), ESRI (1992-1998), CIAT (2004-2006), UNEP-WCMC (1993), WWF (2004), Commonwealth of Australia (2007), and Her Royal Majesty and the British Crown and are used under license. The HydroSHEDS v1 database and more information are available at https://www.hydrosheds.org.
Parameters¶
config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true the data are downloaded.
    When false, the workflow is run without downloading and unzipping
disable_progress : bool (default False)
    When true the progress bar to download data is disabled
Returns¶
True when download is successful, False otherwise
download_and_unzip_post(config, rootpath, hot_run=True, disable_progress=False)
¶
download_and_unzip_post(config, rootpath, dest_path, hot_run=True, disable_progress=False)
Function to download the data by category from a post request.
Parameters¶
config : dict
    Configuration data for the category to download
rootpath : str
    Absolute path of the repository
hot_run : bool (default True)
    When true the data are downloaded.
    When false, the workflow is run without downloading and unzipping
disable_progress : bool (default False)
    When true the progress bar to download data is disabled
Returns¶
True when download is successful, False otherwise
get_best_bundles_by_category(country_list, category, config_bundles, tutorial, config_enable)
¶
get_best_bundles_by_category(country_list, category, config_bundles, tutorial)
Function to get the best bundles that download the data for selected countries, given category and tutorial characteristics.
The selected bundles shall adhere to the following criteria:
- The bundles' tutorial parameter shall match the tutorial argument
- The bundles' category shall match the category of data to download
- When multiple bundles are identified for the same set of users, the bundles matching more countries are first selected and more bundles are added until all countries are matched or no more bundles are available
Parameters¶
country_list : list
    List of country codes for the countries to download
category : str
    Category of the data to download
config_bundles : dict
    Dictionary of configurations for all available bundles
tutorial : bool
    Whether data for tutorial shall be downloaded
config_enable : dict
    Dictionary of the enabled/disabled scripts
Returns¶
list List of bundles to download
get_best_bundles(countries, config_bundles, tutorial, config_enable, include_categories=[], exclude_categories=[])
¶
get_best_bundles(countries, category, config_bundles, tutorial)
Function to get the best bundles that download the data for selected countries, given tutorial characteristics.
First, the categories of data to download are identified according to the bundles that match the list of countries and the tutorial configuration.
Then, the bundles to be downloaded shall adhere to the following criteria:
- The bundles' tutorial parameter shall match the tutorial argument
- The bundles' category shall match the category of data to download
- When multiple bundles are identified for the same set of users, the bundles matching more countries are first selected and more bundles are added until all countries are matched or no more bundles are available
Parameters¶
countries : list
    List of country codes for the countries to download
config_bundles : dict
    Dictionary of configurations for all available bundles
tutorial : bool
    Whether data for tutorial shall be downloaded
config_enable : dict
    Dictionary of the enabled/disabled scripts
include_categories : list
    (Optional) Lists of config bundle categories to include; when empty
exclude_categories : list
    (Optional) Lists of config bundle categories to exclude; when empty
Returns¶
list List of bundles to download
get_best_bundles_in_snakemake(config, include_categories=[], exclude_categories=[])
¶
Function to get the best bundles to download in snakemake, given the configuration file and the categories to include/exclude.
Parameters¶
config : dict
    Configuration for the data bundles
include_categories : list
    (Optional) Lists of config bundle categories to include; when empty
exclude_categories : list
    (Optional) Lists of config bundle categories to exclude; when empty
Returns¶
list List of bundles to download
datafiles_retrivedatabundle(config, bundles_to_download)
¶
retrieve_databundle(bundles_to_download, config_bundles, hydrobasins_level, rootpath='.', disable_progress=False)
¶
Retrieve the specified databundles and unzip them. Also provides warning messages in case of download failure and logs the successfully downloaded bundles.
Parameters¶
bundles_to_download : list
    A list of databundle names to download.
config_bundles : dict
    A dictionary containing the configuration for each databundle.
hydrobasins_level : int
    The level of hydrobasins to retrieve.
rootpath : str
    The root path for the downloaded files.
disable_progress : bool
    Whether to disable the progress bar.
Returns¶
None The function downloads and unzips the databundles and does not return anything.
check_retrieved_cutout_match(snakemake)
¶
debug_using_databundle_cli(snakemake)
¶
Checks whether all Snakemake output files exist (ignoring "data/landcover") and exits if they do. If any outputs are missing, the function reroutes execution to a command-line interface script for debugging.
Parameters¶
snakemake : snakemake.io.Snakemake Snakemake object containing output file paths.
Returns¶
None The function checks for the existence of output files and reroutes to a CLI script if any are missing.
download_osm_data
¶
Python interface to download OpenStreetMap data. Documented at https://github.com/pypsa-meets-earth/earth-osm
Relevant Settings¶
None # multiprocessing & infrastructure selection can be an option in future
Inputs¶
None
Outputs¶
- data/osm/pbf: Raw OpenStreetMap data as .pbf files per country
- data/osm/power: Filtered power data as .json files per country
- data/osm/out: Prepared power data as .geojson and .csv files per country
- resources/osm/raw: Prepared and per type (e.g. cable/lines) aggregated power data as .geojson and .csv files
country_list_to_geofk(country_list)
¶
Convert the requested country list into geofk norm.
Parameters¶
input : str
    Any two-letter country name or aggregation of countries given in the regions config file.
    Country name duplications won't distort the result.
    Examples are:
    ["NG","ZA"], downloading osm data for Nigeria and South Africa
    ["SNGM"], downloading data for Senegal&Gambia shape
    ["NG","ZA","NG"], won't distort result.
Returns¶
full_codes_list : list Example ["NG","ZA"]
convert_iso_to_geofk(iso_code, iso_coding=True, convert_dict=read_osm_config('iso_to_geofk_dict'))
¶
Function to convert the iso code name of a country into the corresponding geofabrik code. In Geofabrik, some countries are aggregated; thus, if a single country is requested, the whole agglomeration shall be downloaded. For example, Senegal (SN) and Gambia (GM) cannot be found alone in geofabrik, but they can be downloaded as a whole SNGM.
The conversion dictionary, initialized to iso_to_geofk_dict, is used to perform such conversion. When a two-letter country code is found in convert_dict and iso_coding is enabled, that two-letter code is converted into the corresponding value of the dictionary.
Parameters¶
iso_code : str
    Two-letter country code to be converted
iso_coding : bool
    When true, the iso to geofk conversion is performed
convert_dict : dict
    Dictionary used to apply the conversion iso to geofk.
    The keys correspond to the countries iso codes that need a different region to be downloaded
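A minimal sketch of this lookup, combined with the duplicate removal described for ``country_list_to_geofk``, is shown below. The dictionary content and the names ``ISO_TO_GEOFK``, ``to_geofk`` and ``to_geofk_list`` are hypothetical; the real mapping is read from the OSM configuration and may use different values.

```python
# Hypothetical conversion dictionary; the real one comes from the OSM config.
ISO_TO_GEOFK = {"SN": "SNGM", "GM": "SNGM"}


def to_geofk(iso_code, iso_coding=True, convert_dict=ISO_TO_GEOFK):
    """Map an ISO code to its aggregated region code when one is defined."""
    if iso_coding and iso_code in convert_dict:
        return convert_dict[iso_code]
    return iso_code


def to_geofk_list(country_list):
    """Convert every code and drop duplicates while keeping the input order."""
    return list(dict.fromkeys(to_geofk(code) for code in country_list))


codes = to_geofk_list(["NG", "SN", "GM", "NG"])
print(codes)
```

Here Senegal and Gambia collapse into one aggregated entry, and the repeated "NG" does not distort the result.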
clean_osm_data
¶
prepare_substation_df(df_all_buses)
¶
Prepare raw substations dataframe to the structure compatible with PyPSA-Eur.
Parameters¶
df_all_buses : dataframe Raw substations dataframe as downloaded from OpenStreetMap
set_unique_id(df, col)
¶
Create unique ids, where the id is specified by the column "col". The steps below create unique bus ids without losing the original OSM bus_id.
Unique bus_ids are created by simply appending -1, -2, -3 to the original bus_id. Every unique id gets a -1. If a bus_id exists, for example, three times, its occurrences are counted by cumcount as -1, -2, -3, making each id unique.
Parameters¶
df : dataframe
    Dataframe considered for the analysis
col : str
    Column name for the analyses; examples: "bus_id" for substations or "line_id" for lines
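The suffixing scheme can be sketched with pandas as follows. This is an illustration of the idea rather than the function's actual body; the helper name ``add_unique_suffix`` and the sample data are hypothetical.

```python
import pandas as pd


def add_unique_suffix(df, col):
    """Append -1, -2, ... to each value of `col`, counting repeated values."""
    out = df.copy()
    # cumcount numbers the occurrences of each value 0, 1, 2, ...; shift to start at 1.
    counter = out.groupby(col).cumcount() + 1
    out[col] = out[col].astype(str) + "-" + counter.astype(str)
    return out


buses = pd.DataFrame({"bus_id": [101, 102, 101, 101]})
result = add_unique_suffix(buses, "bus_id")
print(result["bus_id"].tolist())
```

The original id stays recoverable by stripping the suffix after the last hyphen.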
split_cells(df, cols=['voltage'])
¶
Split semicolon-separated cells, e.g. '66000;220000', and create one otherwise identical row per value.
Parameters¶
df : dataframe
    Dataframe under analysis
cols : list
    List of target columns over which to perform the analysis
Example¶
Original data:
    row 1: '66000;220000', '50'
After applying split_cells():
    row 1: '66000', '50'
    row 2: '220000', '50'
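The example above can be reproduced with a short pandas sketch. It assumes a split followed by ``DataFrame.explode``, which may differ from how the function is actually written; the name ``split_cells_sketch`` and the column names are hypothetical.

```python
import pandas as pd


def split_cells_sketch(df, cols=("voltage",)):
    """Split semicolon-separated cells and emit one row per value."""
    out = df.copy()
    for col in cols:
        # Turn '66000;220000' into ['66000', '220000'] ...
        out[col] = out[col].astype(str).str.split(";")
        # ... then create one row per list element, copying the other columns.
        out = out.explode(col, ignore_index=True)
    return out


lines = pd.DataFrame({"voltage": ["66000;220000"], "frequency": ["50"]})
result = split_cells_sketch(lines)
print(result)
```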
filter_voltage(df, threshold_voltage=35000)
¶
Filters df to contain only lines with voltage above threshold_voltage.
filter_frequency(df, accepted_values=[50, 60, 0], threshold=0.1)
¶
Filters df to contain only lines with frequency with accepted_values.
filter_circuits(df, min_value_circuit=0.1)
¶
Filters df to contain only lines with circuit value above min_value_circuit.
finalize_substation_types(df_all_buses)
¶
Specify bus_id and voltage columns as integer.
prepare_lines_df(df_lines)
¶
This function prepares the dataframe for lines and cables.
Parameters¶
df_lines : dataframe Raw lines or cables dataframe as downloaded from OpenStreetMap
finalize_lines_type(df_lines)
¶
This function is aimed at finalizing the type of the columns of the dataframe.
clean_frequency(df, default_frequency='50')
¶
Function to clean raw frequency column: manual fixing and fill nan values
clean_voltage(df)
¶
Function to clean the raw voltage column: manual fixing and drop nan values
clean_circuits(df)
¶
Function to clean the raw circuits column: manual fixing and clean nan values
clean_cables(df)
¶
Function to clean the raw cables column: manual fixing and drop undesired values
split_and_match_voltage_frequency_size(df)
¶
Function to match the length of the columns in subset by duplicating the last value in the column.
The function does as follows:
- First, it splits the voltage and frequency columns by semicolon. For example, the following lines
row 1: '50', '220000'
row 2: '50;50;50', '220000;380000'
become:
row 1: ['50'], ['220000']
row 2: ['50','50','50'], ['220000','380000']
- Then, it harmonizes each row to match the length of the lists by filling the missing values with the last element of each list. Following the example above, after the cleaning:
row 1: ['50'], ['220000']
row 2: ['50','50','50'], ['220000','380000','380000']
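The two steps can be sketched in plain Python for a single row. The helper name ``pad_to_same_length`` is hypothetical, and the real function operates on dataframe columns rather than on individual lists.

```python
def pad_to_same_length(*lists):
    """Extend every list to the longest length by repeating its last element."""
    target = max(len(values) for values in lists)
    return [values + [values[-1]] * (target - len(values)) for values in lists]


# Step 1: split the raw strings by semicolon.
raw_frequency, raw_voltage = "50;50;50", "220000;380000"
frequency = raw_frequency.split(";")
voltage = raw_voltage.split(";")

# Step 2: harmonize the list lengths.
frequency, voltage = pad_to_same_length(frequency, voltage)
print(frequency, voltage)
```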
fill_circuits(df)
¶
This function fills the rows of the circuits column so that the size of each list element matches the size of the list in the frequency column.
Multiple procedures are adopted:
- In the rows of circuits where the number of elements matches the number of elements in the frequency column, nothing is done
- Where the number of elements in the cables column matches the number in the frequency column, the values of cables are used.
- Where the number of elements in cables exceeds the number in frequency, the cables elements are downscaled and the last values of cables are summed. Let's assume that cables is [3,3,3] but frequency is [50,50]. With this procedure, cables is treated as [3,6] and used for calculating the circuits
- Where cables has a single value, e.g. ['6'], but frequency does not, e.g. ['50', '50'], the cables are distributed proportionally across the values. Note: the distribution accounts for the frequency type; when the frequency is 50 or 60, a circuit requires 3 cables, whereas when DC (0 frequency) is used, a circuit requires 2 cables.
- Where no information on cables or circuits is available, one circuit is assumed for every frequency entry.
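The downscaling case and the cables-per-circuit rule can be illustrated as below. This sketch covers only those two points; the helper names are hypothetical, and the integer division used to turn cables into circuits is an assumption, not a statement of how the function rounds.

```python
def downscale_cables(cables, n_entries):
    """Shorten `cables` to n_entries elements by summing the trailing values."""
    return cables[: n_entries - 1] + [sum(cables[n_entries - 1:])]


def cables_per_circuit(frequency):
    """AC (50 or 60 Hz) circuits use 3 cables; DC (frequency 0) uses 2."""
    return 2 if frequency == 0 else 3


def circuits_from_cables(cables, frequencies):
    """Estimate circuits per frequency entry (assumption: integer division)."""
    if len(cables) > len(frequencies):
        cables = downscale_cables(cables, len(frequencies))
    return [c // cables_per_circuit(f) for c, f in zip(cables, frequencies)]


merged = downscale_cables([3, 3, 3], 2)
ac_circuits = circuits_from_cables([3, 3, 3], [50, 50])
dc_circuits = circuits_from_cables([4], [0])
print(merged, ac_circuits, dc_circuits)
```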
explode_rows(df, cols)
¶
Function that explodes the rows as specified in cols, including warning alerts for unexpected values.
Example¶
row 1: [50,50], [33000, 110000]
after explode_rows applied on the two columns becomes
row 1: 50, 33000
row 2: 50, 110000
integrate_lines_df(df_all_lines, distance_crs)
¶
Function to add underground, under_construction, frequency and circuits.
prepare_generators_df(df_all_generators)
¶
Prepare the dataframe for generators.
find_first_overlap(geom, country_geoms, default_name)
¶
Return the first index whose shape intersects the geometry.
set_countryname_by_shape(df, ext_country_shapes, exclude_external=True, col_country='country')
¶
Set the country name by the name shape
create_extended_country_shapes(country_shapes, offshore_shapes, tolerance=0.01)
¶
Obtain the extended country shape by merging on- and off-shore shapes.
set_name_by_closestcity(df_all_generators, colname='name')
¶
Function to set the name column equal to the name of the closest city.
load_network_data(network_asset, data_options)
¶
Function to check if OSM or custom data should be considered.
The network_asset should be a string named "lines", "cables" or "substations".
build_osm_network
¶
join_non_null_unique(values, sep='|')
¶
Join unique non-null values as strings.
OSM tag columns may contain None/NaN values depending on the region. Directly calling sep.join(values.unique()) can fail when non-string or missing values are present.
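A plain-Python sketch of the idea is given below. It is not the function's actual body: the name ``join_non_null_unique_sketch`` is hypothetical, and keeping the values in first-seen order is an assumption.

```python
import math


def join_non_null_unique_sketch(values, sep="|"):
    """Join unique values as strings, skipping None and NaN entries."""
    seen = {}
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # missing values would break a plain sep.join
        seen.setdefault(str(value), None)  # dict keeps first-seen order
    return sep.join(seen)


joined = join_non_null_unique_sketch(["a", None, "b", "a", float("nan"), 3])
print(joined)
```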
set_substations_ids(buses, distance_crs, tol=5000)
¶
Assigns station IDs to buses based on their proximity.
Parameters:
- buses: GeoDataFrame object representing the buses data.
- distance_crs: Coordinate reference system (CRS) to convert the geometry to.
- tol: Tolerance distance in chosen CRS to define cluster proximity.

Returns:
- None. Modifies the 'station_id' column in the 'buses' GeoDataFrame.

Example:
set_substations_ids(buses_data, 'EPSG:3857', tol=5000)
set_lines_ids(lines, buses, distance_crs)
¶
Function to set line buses ids to the closest bus in the list.
merge_stations_same_station_id(buses, delta_lon=0.001, delta_lat=0.001, precision=4)
¶
Function to merge buses with the same voltage and station_id. This function iterates over all substation ids and creates a bus_id for every substation and voltage level.
Therefore, a substation with multiple voltage levels is represented by different buses, one per voltage level.
get_ac_frequency(df, fr_col='tag_frequency')
¶
Function to define a default frequency value.¶
Attempts to find the most usual non-zero frequency across the dataframe; 50 Hz is assumed as a back-up value
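The fallback logic can be sketched as follows. The helper name ``most_common_ac_frequency`` is hypothetical and the sketch works on a plain list instead of a dataframe column.

```python
from collections import Counter


def most_common_ac_frequency(frequencies, default=50):
    """Return the most frequent non-zero frequency, or `default` if none exists."""
    counts = Counter(f for f in frequencies if f not in (0, None))
    if not counts:
        return default  # only DC (0) or missing entries: fall back to 50 Hz
    return counts.most_common(1)[0][0]


mixed = most_common_ac_frequency([50, 50, 0, 60, 0, 0])
only_dc = most_common_ac_frequency([0, 0])
print(mixed, only_dc)
```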
get_transformers(buses, lines)
¶
Function to create fake transformer lines that connect buses of the same station_id at different voltage.
get_converters(buses, lines)
¶
Function to create fake converter lines that connect buses of the same station_id of different polarities.
connect_stations_same_station_id(lines, buses)
¶
Function to create fake links between substations with the same substation_id.
set_lv_substations(buses)
¶
Function to set which nodes are lv, thereby setting substation_lv. The current methodology is to set lv nodes to buses where multiple voltage levels are found, that is, where the station_id is duplicated.
merge_stations_lines_by_station_id_and_voltage(lines, buses, geo_crs, distance_crs, tol=2000)
¶
Function to merge close stations and adapt the line datasets to adhere to the merged dataset.
fix_overpassing_lines(lines, buses, distance_crs, tol=1)
¶
Snap buses to lines that are within a certain tolerance. It does this by first buffering the buses by the tolerance distance, and then performing a spatial join to find all lines that intersect with the buffers. For each group of lines that intersect with a buffer, the function identifies the points that overpass the line (i.e., are not snapped to the line), and then snaps those points to the nearest point on the line. The line is then split at each snapped point, resulting in a new set of lines that are snapped to the buses. The function returns a GeoDataFrame containing the snapped lines, and the original GeoDataFrame containing the buses.
Parameters¶
lines : GeoDataFrame
    GeoDataFrame containing the lines
buses : GeoDataFrame
    GeoDataFrame containing the buses
distance_crs : str
    Coordinate reference system to use for distance calculations
tol : float
    Tolerance in meters to snap the buses to the lines
Returns¶
lines : GeoDataFrame
    GeoDataFrame containing the lines
force_ac_lines(df, col='tag_frequency')
¶
Function that forces all PyPSA lines to be AC lines.
A network can contain AC and DC power lines that are modelled as PyPSA "Line" component. When DC lines are available, their power flow can be controlled by their converter. When it is artificially converted into AC, this feature is lost. However, for debugging and preliminary analysis, it can be useful to bypass problems.
add_buses_to_empty_countries(country_list, fp_country_shapes, buses)
¶
Function to add a bus for countries missing substation data.