
Conversation

@bouweandela
Member

@bouweandela bouweandela commented Dec 19, 2022

New features

  • Add the esmvalcore.dataset.Dataset class, which can be used to define datasets from Python and to find and load the data needed by the esmvalcore.preprocessor functions.

  • Add the function esmvalcore.dataset.datasets_to_recipe, which can be used to create or update recipes with Datasets. This can be a convenient alternative to the recipe filler utility.

  • Add a new option to config-user.yml called always_search_esgf. This causes the tool to always search ESGF for files, even if some files are available locally. This feature is useful when you want to 1) create a recipe with all data available on ESGF, or 2) download the latest versions of files from ESGF when an older version is already available locally.

  • Move the files related to loading a recipe (esmvalcore/_recipe.py, esmvalcore/_recipe_checks.py, and esmvalcore/recipe_schema.yml) into a new esmvalcore/_recipe directory. This makes it possible to add functionality without growing the file esmvalcore/_recipe.py even further. -> Should this be done in a separate pull request?

Example Jupyter notebooks that demonstrate how to use the new features are available in the notebooks directory and also shown on readthedocs in a new section Example notebooks.
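
As a quick orientation, here is a minimal sketch of the new API, based on the documented interface linked below and on the usage exercised in the review comments (the facet values are just an example):

from esmvalcore.dataset import Dataset

# Define a dataset by its facets.
tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='GFDL-ESM2G',
    ensemble='r1i1p1',
    exp='historical',
    timerange='1990/2000',
)
tas.augment_facets()  # fill in facets that can be derived from the CMOR tables
tas.find_files()      # find the files locally and/or on ESGF
print(tas.files)
cube = tas.load()     # load the data into an iris cube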

This is part of #1609. The next (and hopefully final) step towards getting #1609 merged will be to use the new Dataset in esmvalcore/_recipe.py to read the datasets in the recipe. This pull request introduces some duplicate code because it already moves some functionality from esmvalcore/_recipe.py into esmvalcore.dataset.Dataset; the original code cannot be removed yet because it is still used by the old dataset reading code.

Link to documentation: https://esmvaltool--1877.org.readthedocs.build/projects/ESMValCore/en/1877/api/esmvalcore.dataset.html



@bouweandela bouweandela added the api Notebook API label Dec 19, 2022
@codecov

codecov bot commented Dec 19, 2022

Codecov Report

Merging #1877 (96c2701) into main (fe2e8cf) will increase coverage by 0.35%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1877      +/-   ##
==========================================
+ Coverage   91.67%   92.02%   +0.35%     
==========================================
  Files         232      234       +2     
  Lines       11561    12030     +469     
==========================================
+ Hits        10598    11071     +473     
+ Misses        963      959       -4     
| Impacted Files | Coverage Δ |
| --- | --- |
| esmvalcore/config/_config_validators.py | 92.75% <ø> (ø) |
| esmvalcore/_main.py | 91.06% <100.00%> (ø) |
| esmvalcore/_recipe/check.py | 90.99% <100.00%> (ø) |
| esmvalcore/_recipe/from_datasets.py | 100.00% <100.00%> (ø) |
| esmvalcore/_recipe/recipe.py | 96.19% <100.00%> (ø) |
| esmvalcore/cmor/table.py | 94.69% <100.00%> (+0.19%) ⬆️ |
| esmvalcore/dataset.py | 100.00% <100.00%> (ø) |
| esmvalcore/experimental/recipe.py | 90.16% <100.00%> (ø) |
| esmvalcore/preprocessor/__init__.py | 88.61% <100.00%> (ø) |
| esmvalcore/preprocessor/_ancillary_vars.py | 98.73% <100.00%> (+0.14%) ⬆️ |
| ... and 1 more | |


@bouweandela
Member Author

The remaining Codacy issues fall into three categories:

  • Prospector does not like the use of private methods because this could cause trouble if someone subclasses the Dataset class and implements a different version of these methods. I'm not sure how to fix this without making too much of the internals of the Dataset class public, so I would suggest we ignore this for now unless anyone has a better suggestion.
  • Codacy doesn't seem to respect from __future__ import annotations, so there are some complaints about wrong types. When running prospector locally, these problems are not shown.
  • There was a bug in pydocstyle (D401 "not in imperative mood" not truly applicable to @property PyCQA/pydocstyle#531) so it complained about the docstrings of properties. It is fixed in the latest release, but Codacy is not using that yet.

@bouweandela bouweandela marked this pull request as ready for review January 3, 2023 12:48
Contributor

@schlunma schlunma left a comment


Thanks for this addition Bouwe!! 🚀 I had a look at the code that changes the current behavior of the tool. Looks really good, I only have very small comments.

I will now play around with the notebooks.

Note: I did NOT look at the code of the new features.

bouweandela and others added 2 commits January 10, 2023 12:20
Co-authored-by: Manuel Schlund <32543114+schlunma@users.noreply.github.com>
@schlunma
Contributor

I just tested the notebooks. This is really amazing stuff, Bouwe! Especially the usage of wildcards in combination with the Dataset -> recipe function is a game changer!! 🚀
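
For readers following along, a rough sketch of that wildcard workflow, assuming the Dataset.from_files and datasets_to_recipe interfaces from this pull request (the facet values and the diagnostic facet handling are illustrative):

from esmvalcore.dataset import Dataset, datasets_to_recipe

tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP6',
    dataset='*',  # wildcard: one dataset per matching set of files
    ensemble='r1i1p1f1',
    exp='historical',
)
datasets = list(tas.from_files())  # expand the wildcard
for ds in datasets:
    ds.facets['diagnostic'] = 'example'  # datasets_to_recipe groups by diagnostic
print(datasets_to_recipe(datasets))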

A couple of comments:

  1. I think it would be nice to add a comment to each of the notebooks on how to load a user's custom configuration file, e.g., CFG.load_from_file('path/to/config.yml').
  2. The link in the Discovering data notebook to configurable duration does not work.
  3. Cell [7] in the Loading, processing, and visualizing data notebook does not download anything if you actually have ESGF data; it needs to be:
# (imports assumed from the notebook context)
from esmvalcore.config import CFG
from esmvalcore.esgf import ESGFFile, download

files = list(tas.files)
for anc in tas.ancillaries:
    files.extend(anc.files)
files = [file for file in files if isinstance(file, ESGFFile)]
download(files, CFG['download_dir'])
tas.find_files()
print(tas.files)
for anc in tas.ancillaries:
    print(anc.files)
  4. Adding ancillaries after calling augment_facets() leads to wrong facets for the ancillaries, e.g.,
tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='GFDL-ESM2G',
    ensemble='r1i1p1',
    exp='historical',
    timerange='1990/2100',
)
tas.augment_facets()
tas.add_ancillary(short_name='areacella', mip='fx', ensemble='r0i0p0')
print(tas.ancillaries)

gives

[Dataset:
 {'dataset': 'GFDL-ESM2G',
  'project': 'CMIP5',
  'mip': 'fx',
  'short_name': 'areacella',
  'ensemble': 'r0i0p0',
  'exp': 'historical',
  'frequency': 'mon',
  'institute': ['NOAA-GFDL'],
  'long_name': 'Near-Surface Air Temperature',
  'modeling_realm': ['atmos'],
  'original_short_name': 'tas',
  'product': ['output1', 'output2'],
  'standard_name': 'air_temperature',
  'timerange': '1990/2100',
  'units': 'K'}
 session: 'session-d9f21999-2ea5-43b7-932d-29a9045fde5a_20230110_130835']

Notice the wrong long_name and units. This leads to problems when trying to download the data (i.e., it does not work).

  5. I could not run all cells of the Loading, processing, and visualizing data notebook due to a concatenation error. For
tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='CanESM2',
    ensemble='r1i1p1',
    exp=['historical', 'rcp85'],
    timerange='1850/2100',
)

the tool found fx files for both the historical and rcp85 simulations. This leads to a concatenation error when calling cube = tas.load(), since the tool tries to concatenate the two fx files in time (which it shouldn't do):

ERROR:esmvalcore.preprocessor:Failed to run preprocessor function 'concatenate' on the data
  6. I noticed that find_files ignores time ranges. I.e., if a time range with no data is selected, load() gives an IndexError: list index out of range; if some years of the time range are not available, the missing years are simply ignored (this should at least raise a warning, in my opinion).

@bouweandela
Member Author

Thanks for reviewing @schlunma!

  1. I think it would be nice to add a comment to each of the notebooks on how to load a user's custom configuration file, e.g., CFG.load_from_file('path/to/config.yml').

I would prefer to put that in a separate notebook in another pull request. The content is already in the esmvalcore.config module documentation, but it would be nicer to move that into an example notebook. Another nice addition would be a notebook that shows how to run a recipe.
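
For reference, loading a custom configuration file from Python is a one-liner, using the CFG object mentioned above:

from esmvalcore.config import CFG

CFG.load_from_file('path/to/config.yml')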

@bouweandela
Member Author

  2. The link in the Discovering data notebook to configurable duration does not work.

It should be fixed now

@bouweandela
Member Author

bouweandela commented Jan 12, 2023

  3. Cell [7] in the Loading, processing, and visualizing data notebook does not download anything if you actually have ESGF data; ...

Thanks, I changed it

  5. I could not run all cells of the Loading, processing, and visualizing data notebook due to a concatenation error.

Good find! I changed the experiment to just historical to keep the notebook simple.

@bouweandela
Member Author

  4. Adding ancillaries after calling augment_facets() leads to wrong facets for the ancillaries

Indeed, it does not work if you do it in that order. I don't think it can be made to work reliably, so I added some documentation to the docstring and notebook to explain more clearly how the ancillaries are added. Do you think it is better now?
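
For reference, a sketch of the order that does work, based on the example above: add the ancillary dataset before calling augment_facets(), so its facets are filled in from its own CMOR table entry instead of being copied from the main variable:

tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='GFDL-ESM2G',
    ensemble='r1i1p1',
    exp='historical',
    timerange='1990/2100',
)
tas.add_ancillary(short_name='areacella', mip='fx', ensemble='r0i0p0')
tas.augment_facets()  # now areacella gets its own long_name and units
print(tas.ancillaries)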

@bouweandela
Member Author

  6. I noticed that find_files ignores time ranges. I.e., if a time range with no data is selected, load() gives an IndexError: list index out of range; if some years of the time range are not available, the missing years are simply ignored (this should at least raise a warning, in my opinion).

I added a check that there are input files, so at least there is no cryptic IndexError. I tried to use esmvalcore._recipe.check.data_availability, but the error message came out not looking very applicable, because it refers to a log file that isn't easily accessible from the API. I think that gaps in the resulting time coordinate should be found by the CMOR check anyway.
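
The shape of such a guard might look like this (a sketch; the actual check added in this pull request may differ):

from esmvalcore.exceptions import InputFilesNotFound

def check_input_files(dataset):
    """Fail early with a clear message instead of an IndexError."""
    if not dataset.files:
        raise InputFilesNotFound(
            f"No input files found for {dataset.summary(shorten=True)}")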

@bouweandela bouweandela added this to the v2.8.0 milestone Jan 12, 2023
@schlunma
Contributor

schlunma commented Jan 12, 2023

Thanks Bouwe for the changes, all the notebooks run fine now!

  5. Thanks for fixing the notebook, but shouldn't this also be fixed in the code? I think this is clearly a bug. In the tool this is solved by using just the first fx file if the variable has frequency=fx (see the sketch below). This might not be the perfect solution though...
  6. Perfect, for missing files this raises a nice error now. You are probably right about data with gaps; this will raise an error in the frequency check. However, I was thinking about missing data at the start/end of the interval, e.g., using timerange: 2000/3000 for the historical experiment. In this case the tool finds data for the years 2000-2005 and happily returns just these five years without even issuing a warning. I find this highly problematic, e.g., if only a small number of years is missing, or if a multi-model mean is calculated and not all input datasets span the same time period.
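
A sketch of the in-tool workaround mentioned in point 5 (the variable names are assumptions):

# fx variables have no time axis, so concatenating one file per
# experiment in time makes no sense; keep only the first file.
if dataset.facets.get('frequency') == 'fx':
    input_files = input_files[:1]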

I am running some recipes at the moment, all looks fine so far 👍

@bouweandela
Member Author

bouweandela commented Jan 13, 2023

  5. Thanks for fixing the notebook, but shouldn't this also be fixed in the code? I think this is clearly a bug. In the tool this is solved by using just the first fx file if the variable has frequency=fx. This might not be the perfect solution though...

I think this should be left to the user of the API. For recipes, there isn't much flexibility for programming things, but for the API there is. You could manually choose which dataset you want to use as an ancillary dataset or if you have many you could compare the facets and choose the best match automatically. E.g. for use from the recipe in #1609 I implemented

def _clean_ancillaries(dataset: Dataset) -> None:
    """Ignore duplicate and not expanded ancillary variables."""

    def match(ancillary_ds: Dataset) -> int:
        """Compute match of ancillary dataset with main dataset."""
        score = 0
        for key, value in dataset.facets.items():
            if key in ancillary_ds.facets:
                if ancillary_ds.facets[key] == value:
                    score += 1
        return score

    ancillaries = []
    for _, duplicates in groupby(dataset.ancillaries,
                                 key=lambda ds: ds['short_name']):
        group = sorted(duplicates, key=match, reverse=True)
        ancillary_ds = group[0]
        if len(group) > 1:
            logger.warning(
                "For dataset %s: only using ancillary dataset %s, "
                "ignoring duplicate ancillary datasets\n%s",
                dataset.summary(shorten=True),
                ancillary_ds.summary(shorten=True),
                "\n".join(ds.summary(shorten=True) for ds in group[1:]),
            )
        if any(_isglob(v) for v in ancillary_ds.facets.values()):
            logger.warning(
                "For dataset %s: ignoring ancillary dataset %s, "
                "unable to expand wildcards.",
                dataset.summary(shorten=True),
                ancillary_ds.summary(shorten=True),
            )
        else:
            ancillaries.append(ancillary_ds)
    dataset.ancillaries = ancillaries

@bouweandela
Member Author

  6. Perfect, for missing files this raises a nice error now. You are probably right about data with gaps; this will raise an error in the frequency check. However, I was thinking about missing data at the start/end of the interval, e.g., using timerange: 2000/3000 for the historical experiment. In this case the tool finds data for the years 2000-2005 and happily returns just these five years without even issuing a warning. I find this highly problematic, e.g., if only a small number of years is missing, or if a multi-model mean is calculated and not all input datasets span the same time period.

Would it be OK with you to postpone implementing such a feature to another pull request? I can see that it would be useful, but it's not trivial to implement. I could compare the bounds of the time coordinate (falling back to points if there are no bounds, and doing something clever to compensate for the fact that time points can be anywhere in the interval) with the requested period in the timerange facet, but I'm not sure how well this would work in practice.
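
A rough sketch of that idea (the function name, and the assumption that start and end are datetime-like objects comparable to the coordinate's dates, are mine):

def covers_timerange(cube, start, end):
    """Check whether the cube's time coordinate spans [start, end]."""
    time = cube.coord('time')
    if time.has_bounds():
        first, last = time.bounds[0, 0], time.bounds[-1, 1]
    else:  # fall back to points
        first, last = time.points[0], time.points[-1]
    first, last = time.units.num2date([first, last])
    return first <= start and end <= last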

@schlunma
Contributor

I think this should be left to the user of the API. For recipes, there isn't much flexibility for programming things, but for the API there is. You could manually choose which dataset you want to use as an ancillary dataset or if you have many you could compare the facets and choose the best match automatically. E.g. for use from the recipe in #1609 I implemented

Not too happy about this to be honest, but since this is such an edge case and it will be handled when used within ESMValTool I guess that's fine.

Would it be OK with you to postpone implementing such a feature to another pull request? I can see that it would be useful, but it's not trivial to implement. I could compare the bounds of the time coordinate (falling back to points if there are no bounds, and doing something clever to compensate for the fact that time points can be anywhere in the interval) with the requested period in the timerange facet, but I'm not sure how well this would work in practice.

Couldn't we use the same code we use in the tool for this? If the time range of the files doesn't match the requested time range, an error like the following is raised:

2023-01-13 08:59:17,614 UTC [3178263] ERROR   Missing data for preprocessor test/tas:
- No input data available for years 2101-2150 in files:
/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/NOAA-GFDL/GFDL-ESM2G/rcp85/mon/atmos/Amon/r1i1p1/v20120412/tas/tas_Amon_GFDL-ESM2G_rcp85_r1i1p1_208601-209012.nc
/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/NOAA-GFDL/GFDL-ESM2G/rcp85/mon/atmos/Amon/r1i1p1/v20120412/tas/tas_Amon_GFDL-ESM2G_rcp85_r1i1p1_209101-209512.nc
/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/NOAA-GFDL/GFDL-ESM2G/rcp85/mon/atmos/Amon/r1i1p1/v20120412/tas/tas_Amon_GFDL-ESM2G_rcp85_r1i1p1_209601-210012.nc

Again, not too happy about this, but if it's too much work and you make sure that this is handled when used within the tool, I am also fine with it.

All the recipes I tested ran fine with this change, so I think this is good to go 👍 Thanks again for this great feature, @bouweandela! 🎉

@bouweandela
Member Author

bouweandela commented Jan 13, 2023

Awesome, thanks a lot for reviewing @schlunma!

Couldn't we use the same code we use in the tool for this?

There are two main issues with the current code used for checking when running the recipe:

  1. It only works if there is a recognizable time range in the filename, or if the file has already been downloaded: in that case it loads the data and checks the time coordinate, so then it also works.
  2. It only checks for entire years (not months, days, or hours):
    start_year = var['start_year']
    end_year = var['end_year']
    required_years = set(range(start_year, end_year + 1, 1))
    available_years = set()
    for filename in input_files:
        start, end = _get_start_end_year(filename)
        available_years.update(range(start, end + 1))
    missing_years = required_years - available_years
    if missing_years:
        missing_txt = _group_years(missing_years)
        raise InputFilesNotFound(
            "No input data available for years {} in files:\n{}".format(
                missing_txt, "\n".join(str(f) for f in input_files)))

so I feel that adding this will give users a false sense of security because it only works sometimes.

make sure that this is handled when this is used within the tool I am also fine with it.

Thanks, I will make sure the behavior of the tool when running a recipe does not change in #1609, and I created an issue describing the feature here: #1892.

@schlunma
Contributor

There are two main issues with the current code used for checking when running the recipe:

1. It only works if there is a recognizable time range in the filename

2. It only checks for entire years (not months, days, or hours)

so I feel that adding this will give users a false sense of security because it only works sometimes.

But since we use the clip_timerange preprocessor while loading the files, I guess we should be fine? This selection would only be a coarse first step; the actual selection is done in clip_timerange.
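
For reference, the preprocessor function mentioned here, applied to an already loaded cube (a minimal sketch):

from esmvalcore.preprocessor import clip_timerange

cube = clip_timerange(cube, timerange='2000/2005')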

Contributor

@valeriupredoi valeriupredoi left a comment


I went through the code too, and I think this is super solid! But I have a few minor suggestions; I am approving this since it's up to @bouweandela to implement my suggestions, nothing make-or-break. Note that I've not looked at the iPython notebooks 🍺

@valeriupredoi valeriupredoi added the enhancement New feature or request label Jan 13, 2023
@valeriupredoi
Contributor

OK I reckon this can be merged now, cheers muchly for the great work @bouweandela and @schlunma - this is awesome!
