Skip to content

Add the GMTSampleData class to simplify the load_sample_data and list_sample_data functions (#2342)#2342

Merged
seisman merged 20 commits intomainfrom
simplify-load-samples
Feb 14, 2023
Merged

Add the GMTSampleData class to simplify the load_sample_data and list_sample_data functions (#2342)#2342
seisman merged 20 commits intomainfrom
simplify-load-samples

Conversation

@seisman
Copy link
Member

@seisman seisman commented Feb 1, 2023

Description of proposed changes

It's easier to review this PR by looking at the changes in each commit:

  1. 22a8e48: Move load_sample_data and list_sample_data to the end. Required by the changes in commit 4641649
  2. 4641649: Add a GMTSampleData class, similar to the GMTRemoteData used in load_remote_dataset.py. This class contains two attributes, the function to load the dataset and the description of the dataset. With the GMTSampleData class, the load_sample_data and list_sample_data functions are greatly simplified and we no longer need to maintain two dictionaries which have the same keys.
  3. ea9132c: Add a fmt_dataset_list decorator to automatically inserts the list of available datasets to the docstrings of the load_sample_data function, to address Improve the documentation of load_sample_data and list_sample_data #1774.

Preview: https://pygmt-dev--2342.org.readthedocs.build/en/2342/api/generated/pygmt.datasets.load_sample_data.html

image

Fixes #1774 and supersedes #1814.

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.
  • Use underscores (not hyphens) in names of Python files and directories.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash commands are:

  • /format: automatically format and lint the code
  • /test-gmt-dev: run full tests on the latest GMT development version

@seisman seisman added the maintenance Boring but important stuff for the core devs label Feb 1, 2023
@seisman seisman added this to the 0.9.0 milestone Feb 1, 2023
Comment on lines +302 to +308
texts = "\n ".join(
f'- ``"{name}"``: {dataset.description}.' for name, dataset in datasets.items()
)

docstrings = module_func.__doc__.format(dataset_list=texts)
module_func.__doc__ = textwrap.dedent(docstrings)
return module_func
Copy link
Member Author

@seisman seisman Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This decorator inserts the list of available datasets to the load_sample_data function by substituting the {dataset_list} placeholder with the following text strings:

- ``"bathymetry"``: Table of ship bathymetric observations off Baja California
- ``"earth_relief_holes"``: Regional 20 arc-minutes Earth relief grid with holes
- ...

However, because the placeholder {dataset_list} is indented by 8 whitespaces (see Line 341), substituting the placeholder with the text strings would give:

        - ``"bathymetry"``: Table of ship bathymetric observations off Baja California
- ``"earth_relief_holes"``: Regional 20 arc-minutes Earth relief grid with holes
- ...

So at line 302, I have to use "\n " (newline with 8 whitespaces) to join the list, which is not ideal but I don't have a better solution for this.

@seisman seisman requested a review from maxrjones February 1, 2023 01:00
@seisman seisman changed the title Simplify load_sample_data and list_sample_data functions Add the GMTSampleData class to simplify the load_sample_data and list_sample_data functions Feb 1, 2023
@seisman seisman added the needs review This PR has higher priority and needs review. label Feb 1, 2023
--------
load_sample_data : Load an example dataset from the GMT server.
"""
return {name: dataset.description for name, dataset in datasets.items()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a test_list_sample_data that has a dictionary of all of the functions/descriptions to test line 324?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting a test to check if the returned value is equal to:

{'bathymetry': 'Table of ship bathymetric observations off Baja California',
 'earth_relief_holes': 'Regional 20 arc-minutes Earth relief grid with holes',
 'fractures': 'Table of hypothetical fracture lengths and azimuths',
 'hotspots': 'Table of locations, names, and symbol sizes of hotpots from  Mueller et al., 1993',
 'japan_quakes': 'Table of earthquakes around Japan from NOAA NGDC database',
 'mars_shape': 'Table of topographic signature of the hemispheric dichotomy of  Mars from Smith and Zuber (1996)',
 'maunaloa_co2': 'Table of CO2 readings from Mauna Loa',
 'notre_dame_topography': 'Table 5.11 in Davis: Statistics and Data Analysis in Geology',
 'ocean_ridge_points': 'Table of ocean ridge points for the entire world',
 'rock_compositions': 'Table of rock sample compositions',
 'usgs_quakes': 'Table of global earthquakes from the USGS'}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 518900d, I added a test to check if the returned value is in dict type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine; I was just going off of the codecov alert saying that like 324 was untested. My thought had been to use something like a list of the dataset names being equal to dict.keys() to avoid a very lengthy comparison, but I think your way works better.

Co-authored-by: Will Schlitzer <schlitzer90@gmail.com>
@willschlitzer willschlitzer added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Feb 2, 2023
@seisman
Copy link
Member Author

seisman commented Feb 3, 2023

@GenericMappingTools/pygmt-maintainers It would be good if we can have one more reviewer since this PR brings some big changes.

Copy link
Member

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, it'll be a big improvement!

My preference would be to document of the list of datasets manually as the output of a docstring example for list_sample_data (in which case the doctest should catch whether it's out of date). I would prefer avoiding the decorator/docstring inject trick when possible because it makes the docstrings less interpretable for IDEs like VSCode.

Co-authored-by: Yvonne Fröhlich <94163266+yvonnefroehlich@users.noreply.github.com>
@seisman
Copy link
Member Author

seisman commented Feb 6, 2023

I would prefer avoiding the decorator/docstring inject trick when possible because it makes the docstrings less interpretable for IDEs like VSCode.

That's a good point.

My preference would be to document of the list of datasets manually as the output of a docstring example for list_sample_data (in which case the doctest should catch whether it's out of date).

I'm OK with adding the the output of list_sample_data as a docstring example, but instead of adding it to the list_sample_data function, I feel it's more useful to add it to the load_sample_data function, i.e., in the load_sample_data function:

>>> # use list_sample_data to see the available datasets
>>> list_sample_data()
{'bathymetry': 'Table of ship bathymetric observations off Baja California',
 'earth_relief_holes': 'Regional 20 arc-minutes Earth relief grid with holes',
 'fractures': 'Table of hypothetical fracture lengths and azimuths',
 'hotspots': 'Table of locations, names, and symbol sizes of hotpots from  Mueller et al., 1993',
 'japan_quakes': 'Table of earthquakes around Japan from NOAA NGDC database',
 'mars_shape': 'Table of topographic signature of the hemispheric dichotomy of  Mars from Smith and Zuber (1996)',
 'maunaloa_co2': 'Table of CO2 readings from Mauna Loa',
 'notre_dame_topography': 'Table 5.11 in Davis: Statistics and Data Analysis in Geology',
 'ocean_ridge_points': 'Table of ocean ridge points for the entire world',
 'rock_compositions': 'Table of rock sample compositions',
 'usgs_quakes': 'Table of global earthquakes from the USGS'}
>>> # load the sample bathymetry dataset
>>> data = load_sample_data("bathymetry")

@maxrjones
Copy link
Member

I would prefer avoiding the decorator/docstring inject trick when possible because it makes the docstrings less interpretable for IDEs like VSCode.

That's a good point.

My preference would be to document of the list of datasets manually as the output of a docstring example for list_sample_data (in which case the doctest should catch whether it's out of date).

I'm OK with adding the the output of list_sample_data as a docstring example, but instead of adding it to the list_sample_data function, I feel it's more useful to add it to the load_sample_data function, i.e., in the load_sample_data function:

>>> # use list_sample_data to see the available datasets
>>> list_sample_data()
{'bathymetry': 'Table of ship bathymetric observations off Baja California',
 'earth_relief_holes': 'Regional 20 arc-minutes Earth relief grid with holes',
 'fractures': 'Table of hypothetical fracture lengths and azimuths',
 'hotspots': 'Table of locations, names, and symbol sizes of hotpots from  Mueller et al., 1993',
 'japan_quakes': 'Table of earthquakes around Japan from NOAA NGDC database',
 'mars_shape': 'Table of topographic signature of the hemispheric dichotomy of  Mars from Smith and Zuber (1996)',
 'maunaloa_co2': 'Table of CO2 readings from Mauna Loa',
 'notre_dame_topography': 'Table 5.11 in Davis: Statistics and Data Analysis in Geology',
 'ocean_ridge_points': 'Table of ocean ridge points for the entire world',
 'rock_compositions': 'Table of rock sample compositions',
 'usgs_quakes': 'Table of global earthquakes from the USGS'}
>>> # load the sample bathymetry dataset
>>> data = load_sample_data("bathymetry")

sounds good to me

@seisman seisman marked this pull request as draft February 11, 2023 03:04
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Feb 11, 2023
@seisman
Copy link
Member Author

seisman commented Feb 11, 2023

I tried to follow @maxrjones's suggestion and added a doctest for list_sample_data(), but things is becoming more complicated.

The default returned value of list_sample_data is a unformatted big dictionary, which is not readable and needs to be in a single line:

{'bathymetry': 'Table of ship bathymetric observations off Baja California', 'earth_relief_holes': 'Regional 20 arc-minutes Earth relief grid with holes', 'fractures': 'Table of hypothetical fracture lengths and azimuths', 'hotspots': 'Table of locations, names, and symbol sizes of hotpots from  Mueller et al., 1993', 'japan_quakes': 'Table of earthquakes around Japan from NOAA NGDC database', 'mars_shape': 'Table of topographic signature of the hemispheric dichotomy of  Mars from Smith and Zuber (1996)', 'maunaloa_co2': 'Table of CO2 readings from Mauna Loa', 'notre_dame_topography': 'Table 5.11 in Davis: Statistics and Data Analysis in Geology', 'ocean_ridge_points': 'Table of ocean ridge points for the entire world', 'rock_compositions': 'Table of rock sample compositions', 'usgs_quakes': 'Table of global earthquakes from the USGS'}

Then I try to use pprint(list_sample_data()) instead. It returns:

pprint(list_sample_data())
{'bathymetry': 'Table of ship bathymetric observations off Baja California',
 'earth_relief_holes': 'Regional 20 arc-minutes Earth relief grid with holes',
 'fractures': 'Table of hypothetical fracture lengths and azimuths',
 'hotspots': 'Table of locations, names, and symbol sizes of hotpots from  '
             'Mueller et al., 1993',
 'japan_quakes': 'Table of earthquakes around Japan from NOAA NGDC database',
 'mars_shape': 'Table of topographic signature of the hemispheric dichotomy '
               'of  Mars from Smith and Zuber (1996)',
 'maunaloa_co2': 'Table of CO2 readings from Mauna Loa',
 'notre_dame_topography': 'Table 5.11 in Davis: Statistics and Data Analysis '
                          'in Geology',
 'ocean_ridge_points': 'Table of ocean ridge points for the entire world',
 'rock_compositions': 'Table of rock sample compositions',
 'usgs_quakes': 'Table of global earthquakes from the USGS'}

Now it looks better but pylint reports that some lines are longer than 79 characters. So I only have two options:

  1. Use pprint(list_sample_data(), indent=75) to make sure that the maximum line length < 80:
{'bathymetry': 'Table of ship bathymetric observations off Baja '
               'California',
 'earth_relief_holes': 'Regional 20 arc-minutes Earth relief grid with '
                       'holes',
 'fractures': 'Table of hypothetical fracture lengths and azimuths',
 'hotspots': 'Table of locations, names, and symbol sizes of hotpots from '
             'Müller et al. (1993)',
 'japan_quakes': 'Table of earthquakes around Japan from the NOAA NGDC '
                 'database',
 'mars_shape': 'Table of topographic signature of the hemispheric '
               'dichotomy of Mars from Smith and Zuber (1996)',
 'maunaloa_co2': 'Table of CO2 readings from Mauna Loa',
 'notre_dame_topography': 'Table 5.11 in Davis: Statistics and Data '
                          'Analysis in Geology',
 'ocean_ridge_points': 'Table of ocean ridge points for the entire world',
 'rock_compositions': 'Table of rock sample compositions',
 'usgs_quakes': 'Table of earthquakes from the USGS'}
  1. Use pprint(list_sample_data(), indent=120) so that each entry is printed in a single line:
{'bathymetry': 'Table of ship bathymetric observations off Baja California',
 'earth_relief_holes': 'Regional 20 arc-minutes Earth relief grid with holes',
 'fractures': 'Table of hypothetical fracture lengths and azimuths',
 'hotspots': 'Table of locations, names, and symbol sizes of hotpots from Müller et al. (1993)',
 'japan_quakes': 'Table of earthquakes around Japan from the NOAA NGDC database',
 'mars_shape': 'Table of topographic signature of the hemispheric dichotomy of Mars from Smith and Zuber (1996)',
 'maunaloa_co2': 'Table of CO2 readings from Mauna Loa',
 'notre_dame_topography': 'Table 5.11 in Davis: Statistics and Data Analysis in Geology',
 'ocean_ridge_points': 'Table of ocean ridge points for the entire world',
 'rock_compositions': 'Table of rock sample compositions',
 'usgs_quakes': 'Table of earthquakes from the USGS'}

I choose option 2 because in my opinion it's more readable, but I have to add # noqa: W505 and # pylint: disable=line-too-long to disable flakeheaven and pylint errors:

@seisman seisman marked this pull request as ready for review February 11, 2023 16:27
@seisman seisman added the needs review This PR has higher priority and needs review. label Feb 11, 2023
Copy link
Member

@michaelgrund michaelgrund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine!

@seisman seisman added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Feb 13, 2023
@seisman seisman changed the title Add the GMTSampleData class to simplify the load_sample_data and list_sample_data functions Add the GMTSampleData class to simplify the load_sample_data and list_sample_data functions (#2342)Co-authored-by: Will Schlitzer <schlitzer90@gmail.com> Co-authored-by: Yvonne Fröhlich <94163266+yvonnefroehlich@users.noreply.github.com> Co-authored-by: Michael Grund <23025878+michaelgrund@users.noreply.github.com> Feb 14, 2023
@seisman seisman merged commit 454380a into main Feb 14, 2023
@seisman seisman deleted the simplify-load-samples branch February 14, 2023 07:24
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Feb 14, 2023
@seisman seisman changed the title Add the GMTSampleData class to simplify the load_sample_data and list_sample_data functions (#2342)Co-authored-by: Will Schlitzer <schlitzer90@gmail.com> Co-authored-by: Yvonne Fröhlich <94163266+yvonnefroehlich@users.noreply.github.com> Co-authored-by: Michael Grund <23025878+michaelgrund@users.noreply.github.com> Add the GMTSampleData class to simplify the load_sample_data and list_sample_data functions (#2342) Feb 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

maintenance Boring but important stuff for the core devs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve the documentation of load_sample_data and list_sample_data

5 participants