<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/feed.rss.xml" type="text/xsl" media="screen"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Matthew Honnibal</title>
    <description>Matthew Honnibal is the co-founder of &lt;a href="https://explosion.ai" target="_blank"&gt;Explosion&lt;/a&gt; and a leading expert in AI technology, known for his research, software and writings. He completed his PhD in 2009, and spent a further 5 years publishing research on state-of-the-art natural language understanding systems. Anticipating the AI boom, he left academia in 2014 to develop &lt;a href="https://spacy.io" target="_blank"&gt;spaCy&lt;/a&gt;, a popular open-source library for industrial-strength Natural Language Processing.</description>
    <link>https://speakerdeck.com/honnibal</link>
    <atom:link rel="self" type="application/rss+xml" href="https://speakerdeck.com/honnibal.rss"/>
    <lastBuildDate>Sun, 08 Jul 2018 17:36:05 -0400</lastBuildDate>
    <item>
      <title>Practical Tips for Bootstrapping Information Extraction Pipelines</title>
      <description>In this presentation, I will build on &lt;a href="https://speakerdeck.com/inesmontani" target="_blank"&gt;Ines Montani&lt;/a&gt;'s keynote, "Applied NLP in the Age of Generative AI", by demonstrating how to create an information extraction pipeline. The talk will focus on using the &lt;a href="https://spacy.io" target="_blank"&gt;spaCy&lt;/a&gt; NLP library and the &lt;a href="https://prodi.gy" target="_blank"&gt;Prodigy&lt;/a&gt; annotation tool, although the principles discussed will also apply to other frameworks.</description>
      <media:content url="https://files.speakerdeck.com/presentations/54fc3834d25b459ba88d3753cc949901/preview_slide_0.jpg?31304006" type="image/jpeg" medium="image"/>
      <content:encoded>In this presentation, I will build on &lt;a href="https://speakerdeck.com/inesmontani" target="_blank"&gt;Ines Montani&lt;/a&gt;'s keynote, "Applied NLP in the Age of Generative AI", by demonstrating how to create an information extraction pipeline. The talk will focus on using the &lt;a href="https://spacy.io" target="_blank"&gt;spaCy&lt;/a&gt; NLP library and the &lt;a href="https://prodi.gy" target="_blank"&gt;Prodigy&lt;/a&gt; annotation tool, although the principles discussed will also apply to other frameworks.</content:encoded>
      <pubDate>Fri, 09 Aug 2024 00:00:00 -0400</pubDate>
      <link>https://speakerdeck.com/honnibal/practical-tips-for-bootstrapping-information-extraction-pipelines</link>
      <guid>https://speakerdeck.com/honnibal/practical-tips-for-bootstrapping-information-extraction-pipelines</guid>
    </item>
    <item>
      <title>Designing for tomorrow's programming workflows</title>
      <description>&lt;strong&gt;Video:&lt;/strong&gt; https://www.youtube.com/watch?v=6t80gIb-HBI

New tools are changing how people program, and even who programs. Type hints, modern editor support and, more recently, AI-powered tools like GitHub Copilot and ChatGPT are truly transforming our workflows and improving developer productivity. But what does this mean for how we should be writing and designing our APIs and libraries?

In this talk, I'll share what I've learned from developing open-source tools used by thousands of developers, strategies for how to design future-proof developer APIs and why, contrary to what you might think, making tools programmable is becoming even more important, not less.

If how people program is changing, how should we adjust how we’re designing our APIs, whether they’re Python, REST or some other technology? In this talk, I’ll suggest three implications.

First, programmatic interfaces are becoming more accessible. More professional tools can be optionally programmable, and more users will be able to take advantage of that feature to make their tasks more reliable and less repetitive.

Second, libraries can lean more towards composable building blocks, instead of offering a large flat surface of entry point functions. Guiding users through multi-step workflows is hard, so a "horizontal" design with lots of all-in-one functions can be easier to adopt from documentation. But the horizontal design is also harder for users to extend and debug. Generative AI tools can help address the learning curve problem, and give users more control.

Third, backwards compatibility will be more important than ever. Evolving versions have already made it much harder to find relevant usage examples on sites such as StackOverflow, and version churn causes an even bigger problem for current generative AI technologies. This is another point in favour of composable building blocks, as it's much harder to maintain backwards compatibility for a horizontal API style.</description>
      <media:content url="https://files.speakerdeck.com/presentations/52c5bcdfdec24ae3a205548a29ceb0d1/preview_slide_0.jpg?29604663" type="image/jpeg" medium="image"/>
      <content:encoded>&lt;strong&gt;Video:&lt;/strong&gt; https://www.youtube.com/watch?v=6t80gIb-HBI

New tools are changing how people program, and even who programs. Type hints, modern editor support and, more recently, AI-powered tools like GitHub Copilot and ChatGPT are truly transforming our workflows and improving developer productivity. But what does this mean for how we should be writing and designing our APIs and libraries?

In this talk, I'll share what I've learned from developing open-source tools used by thousands of developers, strategies for how to design future-proof developer APIs and why, contrary to what you might think, making tools programmable is becoming even more important, not less.

If how people program is changing, how should we adjust how we’re designing our APIs, whether they’re Python, REST or some other technology? In this talk, I’ll suggest three implications.

First, programmatic interfaces are becoming more accessible. More professional tools can be optionally programmable, and more users will be able to take advantage of that feature to make their tasks more reliable and less repetitive.

Second, libraries can lean more towards composable building blocks, instead of offering a large flat surface of entry point functions. Guiding users through multi-step workflows is hard, so a "horizontal" design with lots of all-in-one functions can be easier to adopt from documentation. But the horizontal design is also harder for users to extend and debug. Generative AI tools can help address the learning curve problem, and give users more control.

Third, backwards compatibility will be more important than ever. Evolving versions have already made it much harder to find relevant usage examples on sites such as StackOverflow, and version churn causes an even bigger problem for current generative AI technologies. This is another point in favour of composable building blocks, as it's much harder to maintain backwards compatibility for a horizontal API style.</content:encoded>
      <pubDate>Thu, 04 Apr 2024 00:00:00 -0400</pubDate>
      <link>https://speakerdeck.com/honnibal/designing-for-tomorrows-programming-workflows</link>
      <guid>https://speakerdeck.com/honnibal/designing-for-tomorrows-programming-workflows</guid>
    </item>
    <item>
      <title>How many Labelled Examples do you need for a BERT-sized Model to Beat GPT-4 on Predictive Tasks?</title>
      <description>&lt;strong&gt;Video:&lt;/strong&gt; https://www.youtube.com/watch?v=3iaxLTKJROc

Large Language Models (LLMs) offer a new machine learning interaction paradigm: in-context learning. This approach is clearly superior to methods that rely on explicit labelled data for a wide variety of generative tasks (e.g. summarisation, question answering, paraphrasing). In-context learning can also be applied to predictive tasks such as text categorization and entity recognition, with few or no labelled exemplars.

But how does in-context learning actually compare to supervised approaches on those tasks? The key advantage is that you need less data, but how many labelled examples do you need on different problems before a BERT-sized model can beat GPT-4 in accuracy?

The answer might surprise you: models with fewer than 1B parameters are actually very good at classic predictive NLP, while in-context learning struggles on many problem shapes — especially tasks with many labels or that require structured prediction. Methods of improving in-context learning accuracy trade away ever more speed for accuracy, suggesting that distillation and LLM-guided annotation will be the most practical approaches.

Implementation of this approach is discussed with reference to the &lt;a href="https://spacy.io" target="_blank"&gt;spaCy&lt;/a&gt; open-source library and the &lt;a href="https://prodi.gy" target="_blank"&gt;Prodigy&lt;/a&gt; annotation tool.</description>
      <media:content url="https://files.speakerdeck.com/presentations/29dbc12e80ac4b3b926a382c4ff26a63/preview_slide_0.jpg?27549396" type="image/jpeg" medium="image"/>
      <content:encoded>&lt;strong&gt;Video:&lt;/strong&gt; https://www.youtube.com/watch?v=3iaxLTKJROc

Large Language Models (LLMs) offer a new machine learning interaction paradigm: in-context learning. This approach is clearly superior to methods that rely on explicit labelled data for a wide variety of generative tasks (e.g. summarisation, question answering, paraphrasing). In-context learning can also be applied to predictive tasks such as text categorization and entity recognition, with few or no labelled exemplars.

But how does in-context learning actually compare to supervised approaches on those tasks? The key advantage is that you need less data, but how many labelled examples do you need on different problems before a BERT-sized model can beat GPT-4 in accuracy?

The answer might surprise you: models with fewer than 1B parameters are actually very good at classic predictive NLP, while in-context learning struggles on many problem shapes — especially tasks with many labels or that require structured prediction. Methods of improving in-context learning accuracy trade away ever more speed for accuracy, suggesting that distillation and LLM-guided annotation will be the most practical approaches.

Implementation of this approach is discussed with reference to the &lt;a href="https://spacy.io" target="_blank"&gt;spaCy&lt;/a&gt; open-source library and the &lt;a href="https://prodi.gy" target="_blank"&gt;Prodigy&lt;/a&gt; annotation tool.</content:encoded>
      <pubDate>Wed, 25 Oct 2023 00:00:00 -0400</pubDate>
      <link>https://speakerdeck.com/honnibal/how-many-labelled-examples-do-you-need-for-a-bert-sized-model-to-beat-gpt-4-on-predictive-tasks</link>
      <guid>https://speakerdeck.com/honnibal/how-many-labelled-examples-do-you-need-for-a-bert-sized-model-to-beat-gpt-4-on-predictive-tasks</guid>
    </item>
    <item>
      <title>spaCy meets Transformers</title>
      <description>Huge transformer models like BERT, GPT-2 and XLNet have set a new standard for accuracy on almost every Natural Language Processing leaderboard. However, these models are very new, and most of the software ecosystem surrounding them is oriented towards the many opportunities for further research that they provide. In this talk, I’ll describe how you can now use these models in spaCy, a popular library for putting Natural Language Processing to work on real problems. I’ll also discuss the many opportunities that new transfer learning technologies can offer production NLP, regardless of which specific software packages you choose to get the job done.</description>
      <media:content url="https://files.speakerdeck.com/presentations/529734e8b61647e984fccac98d2a4b7f/preview_slide_0.jpg?13857866" type="image/jpeg" medium="image"/>
      <content:encoded>Huge transformer models like BERT, GPT-2 and XLNet have set a new standard for accuracy on almost every Natural Language Processing leaderboard. However, these models are very new, and most of the software ecosystem surrounding them is oriented towards the many opportunities for further research that they provide. In this talk, I’ll describe how you can now use these models in spaCy, a popular library for putting Natural Language Processing to work on real problems. I’ll also discuss the many opportunities that new transfer learning technologies can offer production NLP, regardless of which specific software packages you choose to get the job done.</content:encoded>
      <pubDate>Sat, 12 Oct 2019 00:00:00 -0400</pubDate>
      <link>https://speakerdeck.com/honnibal/spacy-meets-transformers</link>
      <guid>https://speakerdeck.com/honnibal/spacy-meets-transformers</guid>
    </item>
    <item>
      <title>Building new NLP solutions with spaCy and Prodigy</title>
      <description>Commercial machine learning projects are currently like start-ups: many projects fail, but some are extremely successful, justifying the total investment. While some people will tell you to "embrace failure", I say failure sucks — so what can we do to fight it? In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures. I will also discuss a few tips for figuring out what's likely to work, along with a few common mistakes. To keep the advice well-grounded, I will refer specifically to our open-source library spaCy, and our commercial annotation tool Prodigy.</description>
      <media:content url="https://files.speakerdeck.com/presentations/fd68c02c13a04ae88853e68b68254262/preview_slide_0.jpg?10363293" type="image/jpeg" medium="image"/>
      <content:encoded>Commercial machine learning projects are currently like start-ups: many projects fail, but some are extremely successful, justifying the total investment. While some people will tell you to "embrace failure", I say failure sucks — so what can we do to fight it? In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures. I will also discuss a few tips for figuring out what's likely to work, along with a few common mistakes. To keep the advice well-grounded, I will refer specifically to our open-source library spaCy, and our commercial annotation tool Prodigy.</content:encoded>
      <pubDate>Sat, 07 Jul 2018 00:00:00 -0400</pubDate>
      <link>https://speakerdeck.com/honnibal/building-new-nlp-solutions-with-spacy-and-prodigy</link>
      <guid>https://speakerdeck.com/honnibal/building-new-nlp-solutions-with-spacy-and-prodigy</guid>
    </item>
    <item>
      <title>Multi-lingual natural language understanding with spaCy</title>
      <description>spaCy is a popular open-source Natural Language Processing library designed for practical usage. In this talk, I'll outline the new parsing model we've been developing to improve spaCy's support for more languages and text types. Like other transition-based parsers, the model predicts a sequence of actions that push tokens to and from a stack and build arcs between them. However, we extend the arc-eager system with actions that can also repair previous parse errors, introduce sentence boundaries, and split or merge the pre-segmented tokens. The joint approach improves parse accuracy on many types of text, especially for non-whitespace writing systems. We have also found significant practical advantages to short pipelines: they are easier to reason about, and they increase runtime flexibility by reducing the risk of train/test skew.</description>
      <media:content url="https://files.speakerdeck.com/presentations/fec2db47a6c3452b9210e43b88d09ec7/preview_slide_0.jpg?10363257" type="image/jpeg" medium="image"/>
      <content:encoded>spaCy is a popular open-source Natural Language Processing library designed for practical usage. In this talk, I'll outline the new parsing model we've been developing to improve spaCy's support for more languages and text types. Like other transition-based parsers, the model predicts a sequence of actions that push tokens to and from a stack and build arcs between them. However, we extend the arc-eager system with actions that can also repair previous parse errors, introduce sentence boundaries, and split or merge the pre-segmented tokens. The joint approach improves parse accuracy on many types of text, especially for non-whitespace writing systems. We have also found significant practical advantages to short pipelines: they are easier to reason about, and they increase runtime flexibility by reducing the risk of train/test skew.</content:encoded>
      <pubDate>Sun, 15 Apr 2018 00:00:00 -0400</pubDate>
      <link>https://speakerdeck.com/honnibal/multi-lingual-natural-language-understanding-with-spacy</link>
      <guid>https://speakerdeck.com/honnibal/multi-lingual-natural-language-understanding-with-spacy</guid>
    </item>
    <item>
      <title>Embed, encode, attend, predict: A four-step framework for understanding neural network approaches to Natural Language Understanding problems</title>
      <description>While there is a large literature on developing neural networks for natural language understanding, the networks all share the same general architecture, determined by basic facts about the nature of linguistic input. In this talk I name and explain the four components (embed, encode, attend, predict), give a brief history of approaches to each subproblem, and explain two sophisticated networks in terms of this framework -- one for text classification, and another for textual entailment. The talk assumes a general knowledge of neural networks and machine learning, and should be especially suitable for people who have been working on computer vision or other non-NLP problems.

Just as computer vision models are designed around the fact that images are two or three-dimensional arrays of continuous values, NLP models are designed around the fact that text is a linear sequence of discrete symbols that form a hierarchical structure: letters are grouped into words, which are grouped into larger syntactic units (phrases, clauses, etc), which are grouped into larger discursive structures (utterances, paragraphs, sections, etc).

Because the input symbols are discrete (letters, words, etc), the first step is "embed": map the discrete symbols into continuous vector representations. Because the input is a sequence, the second step is "encode": update the vector representation for each symbol given the surrounding context. You can't understand a sentence by looking up each word in the dictionary --- context matters. Because the input is hierarchical, sentences mean more than the sum of their parts. This motivates step three, "attend": learn a further mapping from a variable-length matrix to a fixed-width vector. The final step, "predict", then uses that vector to output some specific information about the meaning of the text.</description>
      <media:content url="https://files.speakerdeck.com/presentations/e82075001c2e42049ef53814912fb669/preview_slide_0.jpg?10363234" type="image/jpeg" medium="image"/>
      <content:encoded>While there is a large literature on developing neural networks for natural language understanding, the networks all share the same general architecture, determined by basic facts about the nature of linguistic input. In this talk I name and explain the four components (embed, encode, attend, predict), give a brief history of approaches to each subproblem, and explain two sophisticated networks in terms of this framework -- one for text classification, and another for textual entailment. The talk assumes a general knowledge of neural networks and machine learning, and should be especially suitable for people who have been working on computer vision or other non-NLP problems.

Just as computer vision models are designed around the fact that images are two or three-dimensional arrays of continuous values, NLP models are designed around the fact that text is a linear sequence of discrete symbols that form a hierarchical structure: letters are grouped into words, which are grouped into larger syntactic units (phrases, clauses, etc), which are grouped into larger discursive structures (utterances, paragraphs, sections, etc).

Because the input symbols are discrete (letters, words, etc), the first step is "embed": map the discrete symbols into continuous vector representations. Because the input is a sequence, the second step is "encode": update the vector representation for each symbol given the surrounding context. You can't understand a sentence by looking up each word in the dictionary --- context matters. Because the input is hierarchical, sentences mean more than the sum of their parts. This motivates step three, "attend": learn a further mapping from a variable-length matrix to a fixed-width vector. The final step, "predict", then uses that vector to output some specific information about the meaning of the text.</content:encoded>
      <pubDate>Thu, 12 Apr 2018 00:00:00 -0400</pubDate>
      <link>https://speakerdeck.com/honnibal/embed-encode-attend-predict-a-four-step-framework-for-understanding-neural-network-approaches-to-natural-language-understanding-problems</link>
      <guid>https://speakerdeck.com/honnibal/embed-encode-attend-predict-a-four-step-framework-for-understanding-neural-network-approaches-to-natural-language-understanding-problems</guid>
    </item>
  </channel>
</rss>
