
Conversation

@bouweandela
Member

@bouweandela bouweandela commented Dec 19, 2022

New features

  • Add the esmvalcore.dataset.Dataset class, which can be used to define datasets from Python and to find and load the data needed by the esmvalcore.preprocessor functions.

  • Add the function esmvalcore.dataset.datasets_to_recipe, which can be used to create or update recipes with Datasets. This can be a convenient alternative to the recipe filler utility.

  • Add a new option to config-user.yml called always_search_esgf. This causes the tool to always search ESGF for files, even if some files are available locally. This feature is useful when you want to 1) create a recipe with all data available on ESGF, or 2) download the latest versions of files from ESGF when an older version is already available locally.

  • Move the files related to loading a recipe (esmvalcore/_recipe.py, esmvalcore/_recipe_checks.py, and esmvalcore/recipe_schema.yml) into a new esmvalcore/_recipe directory. This makes it possible to add functionality without growing the file esmvalcore/_recipe.py even further. -> Should this be done in a separate pull request?

Example Jupyter notebooks that demonstrate how to use the new features are available in the notebooks directory and also shown on readthedocs in a new section Example notebooks.
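
As a quick orientation, here is a minimal sketch of the new API, based on the documented interface linked below and on the usage exercised in the review comments (the facet values are just an example):

from esmvalcore.dataset import Dataset

# Define a dataset by its facets.
tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='GFDL-ESM2G',
    ensemble='r1i1p1',
    exp='historical',
    timerange='1990/2000',
)
tas.augment_facets()  # fill in facets that can be derived from the CMOR tables
tas.find_files()      # find the files locally and/or on ESGF
print(tas.files)
cube = tas.load()     # load the data into an iris cube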

This is part of #1609. The next (and hopefully final) step towards getting #1609 merged will be to use the new Dataset in esmvalcore/_recipe.py to read the datasets in the recipe. This pull request introduces some duplicate code because it already moves some functionality from esmvalcore/_recipe.py into esmvalcore.dataset.Dataset; the original code cannot be removed yet because it is still used by the old dataset reading code.

Link to documentation: https://esmvaltool--1877.org.readthedocs.build/projects/ESMValCore/en/1877/api/esmvalcore.dataset.html



@bouweandela bouweandela added the api Notebook API label Dec 19, 2022
@codecov

codecov bot commented Dec 19, 2022

Codecov Report

Merging #1877 (96c2701) into main (fe2e8cf) will increase coverage by 0.35%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1877      +/-   ##
==========================================
+ Coverage   91.67%   92.02%   +0.35%     
==========================================
  Files         232      234       +2     
  Lines       11561    12030     +469     
==========================================
+ Hits        10598    11071     +473     
+ Misses        963      959       -4     
| Impacted Files | Coverage Δ |
| --- | --- |
| esmvalcore/config/_config_validators.py | 92.75% <ø> (ø) |
| esmvalcore/_main.py | 91.06% <100.00%> (ø) |
| esmvalcore/_recipe/check.py | 90.99% <100.00%> (ø) |
| esmvalcore/_recipe/from_datasets.py | 100.00% <100.00%> (ø) |
| esmvalcore/_recipe/recipe.py | 96.19% <100.00%> (ø) |
| esmvalcore/cmor/table.py | 94.69% <100.00%> (+0.19%) ⬆️ |
| esmvalcore/dataset.py | 100.00% <100.00%> (ø) |
| esmvalcore/experimental/recipe.py | 90.16% <100.00%> (ø) |
| esmvalcore/preprocessor/__init__.py | 88.61% <100.00%> (ø) |
| esmvalcore/preprocessor/_ancillary_vars.py | 98.73% <100.00%> (+0.14%) ⬆️ |
| ... and 1 more | |


@bouweandela
Member Author

The remaining Codacy issues fall into three categories:

  • Prospector does not like the use of private methods because this could cause trouble if someone subclasses the Dataset class and implements a different version of these methods. I'm not sure how to fix this without making too much of the internals of the Dataset class public, so I would suggest we ignore this for now unless anyone has a better suggestion.
  • Codacy doesn't seem to respect from __future__ import annotations, so there are some complaints about wrong types. When running prospector locally, these problems are not shown.
  • There was a bug in pydocstyle (D401 "not in imperative mood" not truly applicable to @property PyCQA/pydocstyle#531) so it complained about the docstrings of properties. It is fixed in the latest release, but Codacy is not using that yet.

@bouweandela bouweandela marked this pull request as ready for review January 3, 2023 12:48
Contributor

@schlunma schlunma left a comment


Thanks for this addition Bouwe!! 🚀 I had a look at the code that changes the current behavior of the tool. Looks really good, I only have very small comments.

I will now play around with the notebooks.

Note: I did NOT look at the code of the new features.

bouweandela and others added 2 commits January 10, 2023 12:20
Co-authored-by: Manuel Schlund <32543114+schlunma@users.noreply.github.com>
@schlunma
Contributor

I just tested the notebooks. This is really amazing stuff, Bouwe! Especially the usage of wildcards in combination with the Dataset -> recipe function is a game changer!! 🚀
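
For readers following along, a rough sketch of that wildcard workflow, assuming the Dataset.from_files and datasets_to_recipe interfaces from this pull request (the facet values and the diagnostic facet handling are illustrative):

from esmvalcore.dataset import Dataset, datasets_to_recipe

tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP6',
    dataset='*',  # wildcard: one dataset per matching set of files
    ensemble='r1i1p1f1',
    exp='historical',
)
datasets = list(tas.from_files())  # expand the wildcard
for ds in datasets:
    ds.facets['diagnostic'] = 'example'  # datasets_to_recipe groups by diagnostic
print(datasets_to_recipe(datasets))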

A couple of comments:

  1. I think it would be nice to add a comment to each of the notebooks on how to load a user's custom configuration file, e.g., CFG.load_from_file('path/to/config.yml').
  2. The link in the Discovering data notebook to configurable duration does not work.
  3. Cell [7] in the Loading, processing, and visualizing data notebook does not download anything if you actually have ESGF data; it needs to be:
# (imports assumed from the notebook context)
from esmvalcore.config import CFG
from esmvalcore.esgf import ESGFFile, download

files = list(tas.files)
for anc in tas.ancillaries:
    files.extend(anc.files)
files = [file for file in files if isinstance(file, ESGFFile)]
download(files, CFG['download_dir'])
tas.find_files()
print(tas.files)
for anc in tas.ancillaries:
    print(anc.files)
  4. Adding ancillaries after calling augment_facets() leads to wrong facets for the ancillaries, e.g.,
tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='GFDL-ESM2G',
    ensemble='r1i1p1',
    exp='historical',
    timerange='1990/2100',
)
tas.augment_facets()
tas.add_ancillary(short_name='areacella', mip='fx', ensemble='r0i0p0')
print(tas.ancillaries)

gives

[Dataset:
 {'dataset': 'GFDL-ESM2G',
  'project': 'CMIP5',
  'mip': 'fx',
  'short_name': 'areacella',
  'ensemble': 'r0i0p0',
  'exp': 'historical',
  'frequency': 'mon',
  'institute': ['NOAA-GFDL'],
  'long_name': 'Near-Surface Air Temperature',
  'modeling_realm': ['atmos'],
  'original_short_name': 'tas',
  'product': ['output1', 'output2'],
  'standard_name': 'air_temperature',
  'timerange': '1990/2100',
  'units': 'K'}
 session: 'session-d9f21999-2ea5-43b7-932d-29a9045fde5a_20230110_130835']

Notice the wrong long_name and units. This leads to problems when trying to download the data (i.e., it does not work).

  5. I could not run all cells of the Loading, processing, and visualizing data notebook due to a concatenation error. For
tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='CanESM2',
    ensemble='r1i1p1',
    exp=['historical', 'rcp85'],
    timerange='1850/2100',
)

the tool found fx files for both the historical and rcp85 simulations. This leads to a concatenation error when calling cube = tas.load(), since the tool tries to concatenate the two fx files in time (which it shouldn't do):

ERROR:esmvalcore.preprocessor:Failed to run preprocessor function 'concatenate' on the data
  6. I noticed that find_files ignores time ranges. I.e., if a time range with no data is selected, load() gives an IndexError: list index out of range; if some years of the time range are not available, the missing years are simply ignored (this should at least raise a warning, in my opinion).

@bouweandela
Member Author

Thanks for reviewing @schlunma!

  1. I think it would be nice to add a comment to each of the notebooks on how to load a user's custom configuration file, e.g., CFG.load_from_file('path/to/config.yml').

I would prefer to put that in a separate notebook in another pull request. The content is already in the esmvalcore.config module documentation, but it would be nicer to move that into an example notebook. Another nice addition would be a notebook that shows how to run a recipe.
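
For reference, loading a custom configuration file from Python is a one-liner, using the CFG object mentioned above:

from esmvalcore.config import CFG

CFG.load_from_file('path/to/config.yml')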

@bouweandela
Member Author

  2. The link in the Discovering data notebook to configurable duration does not work.

It should be fixed now

@bouweandela
Member Author

bouweandela commented Jan 12, 2023

  3. Cell [7] in the Loading, processing, and visualizing data notebook does not download anything if you actually have ESGF data; ...

Thanks, I changed it

  5. I could not run all cells of the Loading, processing, and visualizing data notebook due to a concatenation error.

Good find! I changed the experiment to just historical to keep the notebook simple.

@bouweandela
Member Author

  4. Adding ancillaries after calling augment_facets() leads to wrong facets for the ancillaries

Indeed, it does not work if you do it in that order. I don't think it can be made to work reliably, so I added some documentation to the docstring and notebook to explain more clearly how the ancillaries are added. Do you think it is better now?
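
For reference, a sketch of the order that does work, based on the example above: add the ancillary dataset before calling augment_facets(), so its facets are filled in from its own CMOR table entry instead of being copied from the main variable:

tas = Dataset(
    short_name='tas',
    mip='Amon',
    project='CMIP5',
    dataset='GFDL-ESM2G',
    ensemble='r1i1p1',
    exp='historical',
    timerange='1990/2100',
)
tas.add_ancillary(short_name='areacella', mip='fx', ensemble='r0i0p0')
tas.augment_facets()  # now areacella gets its own long_name and units
print(tas.ancillaries)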

@bouweandela
Member Author

  6. I noticed that find_files ignores time ranges. I.e., if a time range with no data is selected, load() gives an IndexError: list index out of range; if some years of the time range are not available, the missing years are simply ignored (this should at least raise a warning, in my opinion).

I added a check that there are input files, so at least there is no cryptic IndexError. I tried to use esmvalcore._recipe.check.data_availability, but the error message came out not looking very applicable, because it refers to a log file that isn't easily accessible from the API. I think that gaps in the resulting time coordinate should be found by the CMOR check anyway.
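
The shape of such a guard might look like this (a sketch; the actual check added in this pull request may differ):

from esmvalcore.exceptions import InputFilesNotFound

def check_input_files(dataset):
    """Fail early with a clear message instead of an IndexError."""
    if not dataset.files:
        raise InputFilesNotFound(
            f"No input files found for {dataset.summary(shorten=True)}")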

@bouweandela bouweandela added this to the v2.8.0 milestone Jan 12, 2023
@schlunma
Contributor

schlunma commented Jan 12, 2023

Thanks Bouwe for the changes, all the notebooks run fine now!

  5. Thanks for fixing the notebook, but shouldn't this also be fixed in the code? I think this is clearly a bug. In the tool this is solved by using just the first fx file if the variable has frequency=fx (see the sketch below). This might not be the perfect solution though...
  6. Perfect, for missing files this raises a nice error now. You are probably right about data with gaps; this will raise an error in the frequency check. However, I was thinking about missing data at the start/end of the interval, e.g., using timerange: 2000/3000 for the historical experiment. In this case the tool finds data for the years 2000-2005 and happily returns just these five years without even issuing a warning. I find this highly problematic, e.g., if only a small number of years is missing, or if a multi-model mean is calculated and not all input datasets span the same time period.
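
A sketch of the in-tool workaround mentioned in point 5 (the variable names are assumptions):

# fx variables have no time axis, so concatenating one file per
# experiment in time makes no sense; keep only the first file.
if dataset.facets.get('frequency') == 'fx':
    input_files = input_files[:1]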

I am running some recipes at the moment, all looks fine so far 👍

@bouweandela
Member Author

bouweandela commented Jan 13, 2023

  5. Thanks for fixing the notebook, but shouldn't this also be fixed in the code? I think this is clearly a bug. In the tool this is solved by using just the first fx file if the variable has frequency=fx. This might not be the perfect solution though...

I think this should be left to the user of the API. For recipes, there isn't much flexibility for programming things, but for the API there is. You could manually choose which dataset you want to use as an ancillary dataset or if you have many you could compare the facets and choose the best match automatically. E.g. for use from the recipe in #1609 I implemented

def _clean_ancillaries(dataset: Dataset) -> None:
    """Ignore duplicate and not expanded ancillary variables."""

    def match(ancillary_ds: Dataset) -> int:
        """Compute match of ancillary dataset with main dataset."""
        score = 0
        for key, value in dataset.facets.items():
            if key in ancillary_ds.facets:
                if ancillary_ds.facets[key] == value:
                    score += 1
        return score

    ancillaries = []
    for _, duplicates in groupby(dataset.ancillaries,
                                 key=lambda ds: ds['short_name']):
        group = sorted(duplicates, key=match, reverse=True)
        ancillary_ds = group[0]
        if len(group) > 1:
            logger.warning(
                "For dataset %s: only using ancillary dataset %s, "
                "ignoring duplicate ancillary datasets\n%s",
                dataset.summary(shorten=True),
                ancillary_ds.summary(shorten=True),
                "\n".join(ds.summary(shorten=True) for ds in group[1:]),
            )
        if any(_isglob(v) for v in ancillary_ds.facets.values()):
            logger.warning(
                "For dataset %s: ignoring ancillary dataset %s, "
                "unable to expand wildcards.",
                dataset.summary(shorten=True),
                ancillary_ds.summary(shorten=True),
            )
        else:
            ancillaries.append(ancillary_ds)
    dataset.ancillaries = ancillaries

@bouweandela
Member Author

  6. Perfect, for missing files this raises a nice error now. You are probably right about data with gaps; this will raise an error in the frequency check. However, I was thinking about missing data at the start/end of the interval, e.g., using timerange: 2000/3000 for the historical experiment. In this case the tool finds data for the years 2000-2005 and happily returns just these five years without even issuing a warning. I find this highly problematic, e.g., if only a small number of years is missing, or if a multi-model mean is calculated and not all input datasets span the same time period.

Would it be OK with you to postpone implementing such a feature to another pull request? I can see that it would be useful, but it's not trivial to implement. I could compare the bounds of the time coordinate (falling back to points if there are no bounds, and doing something clever to compensate for the fact that time points can be anywhere in the interval) with the requested period in the timerange facet, but I'm not sure how well this would work in practice.
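
A rough sketch of that idea (the function name, and the assumption that start and end are datetime-like objects comparable to the coordinate's dates, are mine):

def covers_timerange(cube, start, end):
    """Check whether the cube's time coordinate spans [start, end]."""
    time = cube.coord('time')
    if time.has_bounds():
        first, last = time.bounds[0, 0], time.bounds[-1, 1]
    else:  # fall back to points
        first, last = time.points[0], time.points[-1]
    first, last = time.units.num2date([first, last])
    return first <= start and end <= last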

@schlunma
Contributor

I think this should be left to the user of the API. For recipes, there isn't much flexibility for programming things, but for the API there is. You could manually choose which dataset you want to use as an ancillary dataset or if you have many you could compare the facets and choose the best match automatically. E.g. for use from the recipe in #1609 I implemented

Not too happy about this to be honest, but since this is such an edge case and it will be handled when used within ESMValTool I guess that's fine.

Would it be OK with you to postpone implementing such a feature to another pull request? I can see that it would be useful, but it's not trivial to implement. I could compare the bounds of the time coordinate (falling back to points if there are no bounds, and doing something clever to compensate for the fact that time points can be anywhere in the interval) with the requested period in the timerange facet, but I'm not sure how well this would work in practice.

Couldn't we use the same code we use in the tool for this? If the time range of the files doesn't match the requested time range, an error like the following is raised:

2023-01-13 08:59:17,614 UTC [3178263] ERROR   Missing data for preprocessor test/tas:
- No input data available for years 2101-2150 in files:
/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/NOAA-GFDL/GFDL-ESM2G/rcp85/mon/atmos/Amon/r1i1p1/v20120412/tas/tas_Amon_GFDL-ESM2G_rcp85_r1i1p1_208601-209012.nc
/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/NOAA-GFDL/GFDL-ESM2G/rcp85/mon/atmos/Amon/r1i1p1/v20120412/tas/tas_Amon_GFDL-ESM2G_rcp85_r1i1p1_209101-209512.nc
/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/NOAA-GFDL/GFDL-ESM2G/rcp85/mon/atmos/Amon/r1i1p1/v20120412/tas/tas_Amon_GFDL-ESM2G_rcp85_r1i1p1_209601-210012.nc

Again, not too happy about this, but if it's too much work and you make sure that this is handled when used within the tool, I am also fine with it.

All the recipes I tested ran fine with this change, so I think this is good to go 👍 Thanks again for this great feature, @bouweandela! 🎉

@bouweandela
Member Author

bouweandela commented Jan 13, 2023

Awesome, thanks a lot for reviewing @schlunma!

Couldn't we use the same code we use in the tool for this?

There are two main issues with the current code used for checking when running the recipe:

  1. It only works if there is a recognizable time range in the filename, or if the file has already been downloaded: in that case it loads the data and checks the time coordinate, so then it also works.
  2. It only checks for entire years (not months, days, or hours):
    start_year = var['start_year']
    end_year = var['end_year']
    required_years = set(range(start_year, end_year + 1, 1))
    available_years = set()
    for filename in input_files:
        start, end = _get_start_end_year(filename)
        available_years.update(range(start, end + 1))
    missing_years = required_years - available_years
    if missing_years:
        missing_txt = _group_years(missing_years)
        raise InputFilesNotFound(
            "No input data available for years {} in files:\n{}".format(
                missing_txt, "\n".join(str(f) for f in input_files)))

so I feel that adding this will give users a false sense of security because it only works sometimes.

make sure that this is handled when this is used within the tool I am also fine with it.

Thanks, I will make sure the behavior of the tool when running a recipe does not change in #1609, and I created an issue describing the feature here: #1892.

@schlunma
Contributor

There are two main issues with the current code used for checking when running the recipe:

1. It only works if there is a recognizable time range in the filename

2. It only checks for entire years (not months, days, or hours)

so I feel that adding this will give users a false sense of security because it only works sometimes.

But since we use the clip_timerange preprocessor while loading the files, I guess we should be fine? This selection would only be a coarse first step; the actual selection is done in clip_timerange.
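
For reference, the preprocessor function mentioned here, applied to an already loaded cube (a minimal sketch):

from esmvalcore.preprocessor import clip_timerange

cube = clip_timerange(cube, timerange='2000/2005')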

Contributor

@valeriupredoi valeriupredoi left a comment


I went through the code too, and I think this is super solid! But I have a few minor suggestions; I am approving this since it's up to @bouweandela to implement my suggestions, nothing make-or-break. Note that I've not looked at the iPython notebooks 🍺

@valeriupredoi valeriupredoi added the enhancement New feature or request label Jan 13, 2023
@valeriupredoi
Contributor

OK I reckon this can be merged now, cheers muchly for the great work @bouweandela and @schlunma - this is awesome!
