Makes it dead easy to run SemEHR: just put a set of documents in a folder and then run the SemEHR container over it.
You can either pull the SemEHR Docker image:
docker pull semehr/core
Or build your own copy using the following command. (You can customise your copy by modifying the Dockerfile, downloadable from above.)
docker build -t semehr/core - < Dockerfile
- Select or create a host directory (let's call it datadir) for input full texts and outputs. It should contain 3 subfolders (any that do not exist will be created automatically):
  - input_docs: for full-text documents. Add your own documents; otherwise two sample documents will be put here for demonstration purposes.
  - output_docs: for saving temporary NLP annotations.
  - semehr_results: for saving SemEHR results.
- (optional) Create a SemEHR configuration file (semehr_settings.json) in datadir. If none is provided, a default configuration will be used, i.e. docker/docker_doc_based_settings.json.
- (optional) [Gazetteer settings] A sample gazetteer will be used for NLP annotation. This is a list of entities used for a stroke subtyping study. It is recommended to use a UMLS gazetteer that bio-yodie can use. Due to licensing restrictions, we cannot provide it. If you have a UMLS license, please follow the instructions from here to populate your own. Or, you can get in touch (honghan.wu@gmail.com) for help with the resource generation.
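The folder layout described above can also be prepared ahead of time, for example to drop documents into input_docs before the first run. The following is a minimal Python sketch that assumes only the three subfolder names listed above; the helper name `prepare_datadir` and the temporary example path are illustrative and not part of SemEHR.

```python
from pathlib import Path

# Subfolder names as listed in the setup steps above.
SUBFOLDERS = ("input_docs", "output_docs", "semehr_results")


def prepare_datadir(datadir):
    """Create datadir and its three subfolders; return their paths by name.

    Hypothetical helper for illustration only, not part of SemEHR.
    """
    root = Path(datadir).expanduser().resolve()
    paths = {}
    for name in SUBFOLDERS:
        sub = root / name
        sub.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[name] = sub
    return paths


if __name__ == "__main__":
    import tempfile

    # A throwaway directory stands in for your real data folder.
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in prepare_datadir(Path(tmp) / "datadir").items():
            print(name, "->", path)
```

Since SemEHR creates missing subfolders itself, this step is optional; it only saves a first run against the sample documents.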
Then run the container, mounting datadir and your cloned CogStack-SemEHR repository:
docker run --name=semehr-test \
--mount type=bind,src=FULL PATH OF YOUR DATA FOLDER,dst=/data/ \
--mount type=bind,src=FULL PATH OF YOUR CLONED CogStack-SemEHR repo,dst=/opt/semehr/CogStack-SemEHR \
semehr/core
If you have populated a UMLS resource that bio-yodie can use, you can mount it as follows.
docker run --name=semehr-test \
--mount type=bind,src=FULL PATH OF YOUR DATA FOLDER,dst=/data/ \
--mount type=bind,src=FULL PATH OF YOUR CLONED CogStack-SemEHR repo,dst=/opt/semehr/CogStack-SemEHR \
--mount type=bind,src=FULL PATH OF YOUR UMLS RESOURCE FOR BIO-YODIE,dst=/opt/gcp/bio-yodie-1-2-1/bio-yodie-resources \
semehr/core
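Because the mount sources are full host paths that may contain spaces, it can be convenient to assemble the command as an argument list rather than typing it into a shell. The sketch below builds the same `docker run` arguments as the two commands above, using only the flags and container paths that appear in this document. The function name `build_run_command` and the example host paths are assumptions for illustration. It only builds the list; nothing is executed.

```python
import shlex


def build_run_command(datadir, repo_dir, umls_dir=None, name="semehr-test"):
    """Return the docker run argument list for the commands shown above.

    Hypothetical helper, not part of SemEHR. umls_dir is optional and adds
    the bio-yodie resource mount from the second command.
    """
    def mount(src, dst):
        return ["--mount", f"type=bind,src={src},dst={dst}"]

    cmd = ["docker", "run", f"--name={name}"]
    cmd += mount(datadir, "/data/")
    cmd += mount(repo_dir, "/opt/semehr/CogStack-SemEHR")
    if umls_dir is not None:
        cmd += mount(umls_dir, "/opt/gcp/bio-yodie-1-2-1/bio-yodie-resources")
    cmd.append("semehr/core")
    return cmd


if __name__ == "__main__":
    # Example host paths are placeholders; substitute your own.
    cmd = build_run_command("/home/me/my data", "/home/me/CogStack-SemEHR")
    print(shlex.join(cmd))  # shell-quoted form, safe to copy and paste
```

Passing the returned list to `subprocess.run(cmd, check=True)` would start the container; each list element reaches Docker as a single argument, so spaces in a path need no extra quoting.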
Each file in the semehr_results folder contains the annotations generated for one full-text file. It has three attributes at its top level, as follows.
{
"sentences": [],
"annotations": [],
"phenotypes": []
}
- sentences array contains the start/end offsets of the sentences.
- annotations array contains the UMLS concept mentions (details as follows).
- phenotypes array contains the mentions of entities in the customised gazetteer.
{
"ruled_by": [
"negation_filters.json"
],
"end": 1734,
"pref": "Bleeding",
"negation": "Negated",
"sty": "Pathologic Function",
"start": 1726,
"study_concepts": [
"Bleeding"
],
"experiencer": "Patient",
"str": "bleeding",
"temporality": "Recent",
"id": "cui-47",
"cui": "C0019080"
}
- ruled_by gives the rule set that the annotation matched. Generally, there are several types of rules: negation, hypothetical, not a mention, and other experiencer (see here for the full list). These rules were developed for clinical studies conducted on SLaM CRIS data. You can also create your own rules using the same syntax (regular expressions). Consider this an extra improvement step on top of the NLP model embedded in SemEHR (i.e. bio-yodie for now).
- study_concepts gives the type of the annotation as specified in the study configuration (e.g., cancer can be mapped to many UMLS concepts). So, essentially, a study concept denotes a list of UMLS CUIs, which is specified by the study designer. The studies folder of the SemEHR repo contains configurations of several studies conducted on SLaM CRIS.
- All other attributes are general attributes of SemEHR as specified in the wiki.
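To make the annotation attributes concrete, here is a small Python sketch that reads one annotation and checks it against the source text. It assumes that start and end are character offsets into the full text with end exclusive, which is consistent with the sample (end minus start equals the length of str) but is not stated explicitly in this document. The function names and the padded example text are made up for illustration.

```python
import json

# The sample annotation shown above, reduced to the fields used here.
SAMPLE = json.loads("""
{
  "ruled_by": ["negation_filters.json"],
  "start": 1726,
  "end": 1734,
  "str": "bleeding",
  "pref": "Bleeding",
  "negation": "Negated",
  "study_concepts": ["Bleeding"],
  "cui": "C0019080"
}
""")


def mention_text(full_text, ann):
    """Slice the mention out of the document (assumes end-exclusive offsets)."""
    return full_text[ann["start"]:ann["end"]]


def matched_rules(ann):
    """Return the rule files the annotation matched, or an empty list."""
    return list(ann.get("ruled_by") or [])


if __name__ == "__main__":
    # Fake document: padding so that "bleeding" starts at offset 1726.
    text = " " * 1726 + "bleeding was not observed."
    print(mention_text(text, SAMPLE) == SAMPLE["str"])
    print(matched_rules(SAMPLE))
```

Comparing the sliced text with str is a cheap sanity check that a result file is being read against the right source document.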
NB: when a study configuration is provided, SemEHR will only apply rules to annotations that are relevant to the study and skip the others. This means that only annotations whose study_concepts is not empty will be checked against the rules.
A log file, semehr.log, will be generated in the mounted data folder (datadir).
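The note above suggests a simple way to see which annotations in a result file were eligible for rule checking under a study configuration. The following is a minimal sketch, assuming the result has the top-level layout shown earlier and that annotations without a study mapping carry an empty (or missing) study_concepts list. The function name, the second example annotation, and its placeholder CUI are invented for illustration.

```python
import json
from collections import defaultdict


def group_by_study_concept(result):
    """Map each study concept to the annotations tagged with it.

    Annotations with an empty or missing study_concepts list are returned
    separately; when a study configuration is provided, these are the ones
    the rules skip.
    """
    by_concept = defaultdict(list)
    unmapped = []
    for ann in result.get("annotations", []):
        concepts = ann.get("study_concepts") or []
        if not concepts:
            unmapped.append(ann)
        for concept in concepts:
            by_concept[concept].append(ann)
    return dict(by_concept), unmapped


if __name__ == "__main__":
    # Illustrative result; "C0000000" is a placeholder, not a real CUI.
    result = json.loads("""
    {
      "sentences": [],
      "annotations": [
        {"cui": "C0019080", "str": "bleeding", "study_concepts": ["Bleeding"]},
        {"cui": "C0000000", "str": "example", "study_concepts": []}
      ],
      "phenotypes": []
    }
    """)
    mapped, skipped = group_by_study_concept(result)
    for concept, anns in mapped.items():
        print(concept, [a["str"] for a in anns])
    print("not mapped to a study concept:", [a["str"] for a in skipped])
```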