[ML] Natural Language Processing tasks and models #73523

Merged
davidkyle merged 21 commits into elastic:feature/pytorch-inference from davidkyle:bert-tokenizer
Jun 2, 2021
@davidkyle davidkyle commented May 28, 2021

Following on from #72218, which defined how large PyTorch models can be stored, this PR introduces the concept of Natural Language Processing tasks and defines a way to evaluate BERT models.

Mask Fill and Named Entity Recognition tasks are implemented here, but others could easily be added now that the framework is in place. In particular, this PR implements tokenisation of input text for BERT models and defines a structure for post-graph processing.
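
To illustrate what tokenisation for a BERT model involves, here is a minimal Python sketch of greedy longest-match-first WordPiece splitting. It is not the code from this PR: the function names, the toy vocabulary, and the whitespace-only pre-splitting are assumptions made for the example, and a real BERT tokenizer would also handle lower-casing, punctuation, and a full vocabulary file.

```python
# Hypothetical toy vocabulary; a real BERT vocab has roughly 30k entries.
TOY_VOCAB = {
    "[CLS]", "[SEP]", "[UNK]", "[MASK]",
    "paris", "is", "the", "of", "france",
    "token", "##isa", "##tion",
}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into sub-tokens using greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


def bert_tokenize(text, vocab):
    """Whitespace-split the text, WordPiece each word, add [CLS] and [SEP]."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens
```

With this toy vocabulary, `wordpiece_tokenize("tokenisation", TOY_VOCAB)` splits the word into `token`, `##isa`, and `##tion`, while a word with no matching pieces becomes `[UNK]`.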

Once the PyTorch model is uploaded, a trained model config referencing it must be PUT:

PUT _ml/trained_models/bert-model-for-maskfill
{
    "description": "Mask fill model",
    "model_type": "pytorch",
    "inference_config": {
        "classification": {
            "num_top_classes": 1
        }
    },
    "input": {
        "field_names": ["text_field"]
    },
    "location": {
        "index": {
            "model_id": "bert-model-for-maskfill",
            "name": "big_model"
        }
    }
}

And the model deployed:

POST _ml/trained_models/deployment/bert-model-for-maskfill/_start
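
For anyone scripting these two steps, the following Python sketch builds the equivalent HTTP requests with the standard library. The base URL, the `build_request` helper, and the lack of authentication are assumptions for illustration; the paths assume the `_ml/` prefix used by the `_start` and `_infer` requests, and they reflect this feature branch, so they may differ in released versions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
MODEL_ID = "bert-model-for-maskfill"


def build_request(method, path, body=None):
    """Build (but do not send) a JSON request against the cluster."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{BASE_URL}/{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Same body as the trained model config shown in the PR description.
config = {
    "description": "Mask fill model",
    "model_type": "pytorch",
    "inference_config": {"classification": {"num_top_classes": 1}},
    "input": {"field_names": ["text_field"]},
    "location": {"index": {"model_id": MODEL_ID, "name": "big_model"}},
}

requests_in_order = [
    build_request("PUT", f"_ml/trained_models/{MODEL_ID}", config),
    build_request("POST", f"_ml/trained_models/deployment/{MODEL_ID}/_start"),
]
```

Each request could then be sent with `urllib.request.urlopen(request)` against a cluster built from the feature branch.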

Mask Fill Example

POST _ml/trained_models/deployment/bert-model-for-maskfill/_infer
{
  "input": "Paris is the [MASK] of France."
}

Returns:

[
  {
    "token" : "capital",
    "score" : 0.9861745037766138,
    "sequence" : "Paris is the capital of France."
  },
  {
    "token" : "center",
    "score" : 0.00372138405614492,
    "sequence" : "Paris is the center of France."
  },
  {
    "token" : "Capital",
    "score" : 0.003259749401778711,
    "sequence" : "Paris is the Capital of France."
  },
  {
    "token" : "centre",
    "score" : 0.002157122475609145,
    "sequence" : "Paris is the centre of France."
  },
  {
    "token" : "city",
    "score" : 9.026127599384262E-4,
    "sequence" : "Paris is the city of France."
  }
]
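
A response of this shape can be produced by a softmax over the model's output at the `[MASK]` position followed by a top-k selection. The Python sketch below shows that idea under stated assumptions: the function names, the four-word vocabulary, and the logits are made up for the example, and this is not necessarily how the PR's post-graph processing is written.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fill_mask(text, vocab, mask_logits, top_k=5, mask_token="[MASK]"):
    """Rank vocabulary entries for the masked position and fill them in."""
    probs = softmax(mask_logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [
        {
            "token": token,
            "score": score,
            "sequence": text.replace(mask_token, token, 1),
        }
        for token, score in ranked[:top_k]
    ]


# Hypothetical vocabulary and logits for the masked position.
toy_vocab = ["capital", "center", "city", "river"]
toy_logits = [6.0, 1.0, 0.5, -2.0]
results = fill_mask("Paris is the [MASK] of France.", toy_vocab, toy_logits, top_k=2)
```

Here `results[0]` carries the token `capital` and the sequence `Paris is the capital of France.`, mirroring the `token`, `score`, and `sequence` fields in the response above.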

NER Example

POST _ml/trained_models/deployment/bert-model-fine-tuned-for-ner/_infer
{
  "input": "Today's GAH is live from Amsterdam, BC, London, Munich and Texas"
}

Returns:

[
  {
    "label" : "organisation",
    "score" : 0.940775243737086,
    "word" : "GAH"
  },
  {
    "label" : "location",
    "score" : 0.9987588832004948,
    "word" : "Amsterdam"
  },
  {
    "label" : "location",
    "score" : 0.9958452874139202,
    "word" : "BC"
  },
  {
    "label" : "location",
    "score" : 0.9981461858828271,
    "word" : "London"
  },
  {
    "label" : "location",
    "score" : 0.9991212183928049,
    "word" : "Munich"
  },
  {
    "label" : "location",
    "score" : 0.9994121461792658,
    "word" : "Texas"
  }
]
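
One way to get from per-token model output to a list like this is to merge WordPiece continuations back into words, pick the most probable tag for each word, and drop words tagged as outside any entity. The Python sketch below assumes an IOB-style tag set, takes each word's score from its first sub-token, and uses made-up probabilities; the tag names and helper functions are hypothetical and are not taken from the PR.

```python
def merge_wordpieces(tokens, tag_probs):
    """Join ## continuation pieces onto the previous word.

    Each merged word keeps the tag probabilities of its first sub-token.
    """
    words = []
    for token, probs in zip(tokens, tag_probs):
        if token.startswith("##") and words:
            previous_word, previous_probs = words[-1]
            words[-1] = (previous_word + token[2:], previous_probs)
        else:
            words.append((token, probs))
    return words


def extract_entities(tokens, tag_probs, tags, outside_tag="O"):
    """Return label, score, and word for every word not tagged as outside."""
    entities = []
    for word, probs in merge_wordpieces(tokens, tag_probs):
        best = max(range(len(tags)), key=lambda i: probs[i])
        if tags[best] == outside_tag:
            continue
        label = tags[best].split("-", 1)[-1]  # strip the B-/I- prefix
        entities.append({"label": label, "score": probs[best], "word": word})
    return entities


# Hypothetical tag set and per-token probabilities (one row per token).
TAGS = ["O", "B-location", "B-organisation"]
tokens = ["live", "from", "Amster", "##dam"]
tag_probs = [
    [0.98, 0.01, 0.01],
    [0.97, 0.02, 0.01],
    [0.02, 0.95, 0.03],
    [0.10, 0.85, 0.05],
]
entities = extract_entities(tokens, tag_probs, TAGS)
```

In this toy run only `Amsterdam` survives, labelled `location` with the score of its first sub-token. This sketch does not merge multi-word entities, which a fuller implementation would need to consider.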

Feature branch PR

Co-authored-by: Dimitris Athanasiou <dimitris@elastic.co>

@davidkyle davidkyle added >feature :ml Machine learning labels May 28, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label May 28, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent
Member

run elasticsearch-ci/part-1

@mark-vieira
Contributor

jenkins test this please

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Looks good. Just a couple of test related comments.

Contributor

This one is left empty. Should we add some tests here?

Contributor

We should add some tests for this one

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM. Just a question about the name of the fill mask results field. Good to merge, though, even if you decide to change that.

Contributor

Should this also be predictions?

Contributor

nice!

@davidkyle davidkyle merged commit 8e51034 into elastic:feature/pytorch-inference Jun 2, 2021
@davidkyle davidkyle deleted the bert-tokenizer branch June 2, 2021 10:13
davidkyle added a commit that referenced this pull request Jun 3, 2021
The feature branch contains changes to configure PyTorch models with a 
TrainedModelConfig and defines a format to store the binary models. 
The _start and _stop deployment actions control the model lifecycle 
and the model can be directly evaluated with the _infer endpoint. 
Two types of NLP tasks are supported: Named Entity Recognition and Fill Mask.

The feature branch consists of these PRs: #73523, #72218, #71679,
#71323, #71035, #71177, #70713