nhsx/language-corpus-tools

status: hibernate

NHS Language Corpus Tools - Discovery

Analytics Unit - Discovery Project

Warning to Users

This codebase is a proof of concept and should only be used for demonstration purposes within a controlled environment. The components are not a live product and should not be deployed in a live or production environment.

We further recommend looking for the most recent versions of the individual components in their original repositories.

About the Project

This discovery project investigates possible approaches to building a data set of NHS-focused text sources for training and benchmarking NLP models in the NHS. You can read more about it in the project blog.

The aim was to test the thinking behind, and the feasibility of, such a solution by exploring:

  • infrastructure, scalability and maintenance
  • possible data sources, appropriate metadata, clinical input and required governance
  • possible use cases of the outputs for training, benchmarking, validating and testing

This repository contains aspects of the tooling used during the discovery phase.

Note: No data, public or private, is shared in this repository.

Project Structure

  • appstack folder contains scripts and configuration files to deploy the stack either on AWS Elastic Container Service or on a local system running Docker. Please refer to the folder README for further details.

  • doccano_autolabelling folder contains a script that adds a trial autolabelling approach to the doccano deployment.

  • scrapers folder contains a scraper framework as well as a number of implemented scrapers. Please refer to the folder README for further details.

  • user_stories folder contains a copy of the user stories which were identified as part of this discovery work.

Limitations of Use

This repository is exploratory, pre-alpha code that has been developed for demonstration and evaluation purposes only. It is not to be used as a live service. No testing has been performed apart from ad-hoc trials and tests by its developers. No guarantees are made as to its performance.

Although the repository contains code to deploy the stack as a cloud app, no auto-scaling or redundancy mechanisms have been built. No security reviews have been performed, so no guarantees are made as to the security of this release.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidance.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

To find out more about the Analytics Unit visit our project website or get in touch at england.tdau@nhs.net.
