HIRING FOR FAIR DATA PROJECT: Job requirements #16
The tasks we need completed fall into three broad categories: back-end, front-end and metadata/data processing.
Back-end
- The Archive's data need some place to live that isn't an open S3 bucket. Most likely, this means setting up a Dataverse instance, hopefully in collaboration with FRDR (SUSTAINABILITY: Data storage for the Canadian COVID-19 Data Archive #3).
- Dataverse has its own built-in data access API, which will be a critical part of making the data accessible (BACK-END: API design #9); see the first sketch after this list.
- We may need a layer on top of the basic Dataverse/Dataverse API setup to handle the data processing pipeline (FAIR DATA: Pipeline for processing archived datasets into a common format #15) from raw datasets to FAIRified datasets, e.g., making it easy to establish dataset provenance (FAIR DATA: Data provenance for FAIR datasets #12); see the second sketch after this list.
- Various other sustainability tasks, such as improving automated data collection for the Archive via further development of the Python-based archivist package (SUSTAINABILITY: Automated data collection for the Canadian COVID-19 Data Archive #2), probably fall under back-end tasks as well.
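To make the API bullet concrete, here is a minimal sketch of querying Dataverse's Search API from Python. The host is a placeholder (whichever instance we stand up would go there), and any access layer we build would sit on top of calls like this:

```python
# Minimal sketch: query the Dataverse Search API for datasets (see #9).
# DATAVERSE_URL is a placeholder; every Dataverse instance exposes /api/search.
import requests

DATAVERSE_URL = "https://demo.dataverse.org"  # placeholder host

def search_datasets(query, per_page=10):
    """Return dataset search hits matching `query`."""
    resp = requests.get(
        f"{DATAVERSE_URL}/api/search",
        params={"q": query, "type": "dataset", "per_page": per_page},
    )
    resp.raise_for_status()
    return resp.json()["data"]["items"]

for hit in search_datasets("covid"):
    print(hit["global_id"], "-", hit["name"])
```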
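For the pipeline/provenance layer, one option is to write a provenance record alongside every processed file. This is purely a sketch: the field names and layout are assumptions for discussion, not a settled schema.

```python
# Hypothetical provenance record linking a FAIRified file to its raw input
# (see #12, #15). All field names are assumptions, not a settled schema.
import hashlib
from datetime import datetime, timezone

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def provenance_record(raw_path, processed_path, source_url, pipeline_version):
    """Describe how a processed file was derived from an archived raw file."""
    return {
        "raw_file": str(raw_path),
        "raw_sha256": sha256(raw_path),
        "processed_file": str(processed_path),
        "processed_sha256": sha256(processed_path),
        "source_url": source_url,              # where the raw data was collected from
        "pipeline_version": pipeline_version,  # e.g., a git tag of the pipeline repo
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
```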
Front-end
- The Archive's raw and FAIRified data need to be discoverable (FRONT-END: Data tool design #8). The easiest way to do this would be to set up an instance of a tool like geodisy, which runs on top of a Dataverse instance and uses GeoServer and GeoBlacklight to facilitate geospatial data discovery. On this note, it may be wise to reach out to the geodisy/UBC library team, as we may be able to give back to the project by developing additional functionality in the form of plugins, etc. An illustrative discovery record appears after this list.
- We may need an additional layer/plugins on top of the basic service in order to best present our FAIRified data (FRONT-END: Data tool design #8, FAIR DATA: Integration of census and other StatCan data #11, FAIR DATA: Data provenance for FAIR datasets #12).
- Any additional data visualization tasks that may be required.
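To make the geodisy option concrete: as I understand it, geodisy generates discovery records that GeoBlacklight indexes. A record along the lines of the public GeoBlacklight 1.0 schema might look like this (field names come from that schema; every value is a placeholder):

```python
# Illustrative GeoBlacklight 1.0-style discovery record for an archived dataset.
# Field names follow the public GeoBlacklight schema; every value is a placeholder.
record = {
    "geoblacklight_version": "1.0",
    "dc_identifier_s": "doi:10.5072/FK2/EXAMPLE",  # placeholder DOI
    "dc_title_s": "COVID-19 cases by health region (example)",
    "dc_rights_s": "Public",
    "dct_provenance_s": "Canadian COVID-19 Data Archive",
    "layer_slug_s": "ccoda-covid-cases-example",
    "solr_geom": "ENVELOPE(-141.0, -52.6, 83.1, 41.7)",  # W, E, N, S: rough bbox of Canada
}
```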
Metadata/data processing
- The existing data in the Archive need a metadata taxonomy (METADATA: Metadata taxonomy for FAIR data #7) and then must be fortified with extensive metadata (METADATA: Adding metadata to list of datasets in the Canadian COVID-19 Data Archive #13, LICENSING: Add licensing metadata to the list of datasets in the Canadian COVID-19 Data Archive #5), both to enhance the findability of the data and to serve as a basis for the dataset processing pipeline for FAIRifying the data (FAIR DATA: Common format for FAIR datasets #10, FAIR DATA: Pipeline for processing archived datasets into a common format #15); see the first sketch after this list.
- We must create a data processing pipeline to FAIRify raw datasets into a common format (FAIR DATA: Common format for FAIR datasets #10, FAIR DATA: Pipeline for processing archived datasets into a common format #15). This may use R, Python, SQL, or a combination of these languages; see the second sketch after this list.
- Other sustainability tasks such as maintaining the list of datasets that are being actively archived (SUSTAINABILITY: Maintaining the list of datasets for the Canadian COVID-19 Data Archive #4).
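As a starting point for discussing what the taxonomy might standardize per dataset, here is a sketch; every field name below is an assumption for discussion, not a proposal of record:

```python
# Hypothetical per-dataset metadata record a taxonomy (#7, #13, #5) might standardize.
# Every field name here is an assumption for discussion.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    uuid: str                      # stable identifier within the Archive
    title: str
    source_url: str                # original publisher page
    jurisdiction: str              # e.g., "ON", "QC", "Canada"
    topic: str                     # e.g., "cases", "vaccination", "mobility"
    file_format: str               # e.g., "csv", "json", "pdf"
    licence: str = "unknown"       # licensing metadata (#5)
    update_frequency: str = "daily"
    keywords: list[str] = field(default_factory=list)
```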
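And a minimal sketch of the pipeline itself, as a registry that maps each dataset to a converter emitting the common format. The registry pattern, the UUID, and the raw column names are all illustrative:

```python
# Minimal registry-based FAIRification pipeline sketch (see #10, #15).
# The UUID and raw column names are illustrative; real datasets vary.
import pandas as pd

CONVERTERS = {}  # dataset UUID -> function producing the common format

def converter(uuid):
    """Register a converter for one raw dataset."""
    def decorator(fn):
        CONVERTERS[uuid] = fn
        return fn
    return decorator

@converter("on-cases-example")  # placeholder UUID
def on_cases(raw_path):
    # Hypothetical raw layout: one row per reporting date.
    df = pd.read_csv(raw_path)
    return (
        df.rename(columns={"Reported Date": "date", "Total Cases": "value"})
          [["date", "value"]]
          .assign(metric="cases", region="ON")
    )

def run_pipeline(uuid, raw_path):
    return CONVERTERS[uuid](raw_path)
```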
One issue with the classification above is that the sub-tasks, and the skills required to complete them, don't divide cleanly into these three categories. For example, setting up Dataverse and setting up geodisy require a similar skillset, and the entire stack must be integrated. Furthermore, someone with deep knowledge of the data and subject area may be the best person to develop a metadata taxonomy and add metadata, but not the best person to actually write the code needed to integrate each dataset into a data processing pipeline. As such, it may be best not to think of the three sections above as three separate jobs of roughly equal size, and instead to develop job descriptions based on the general skillsets required.
Thoughts, @colliand?