<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>christina.kim</title>
    <description></description>
    <link>https://christina.kim/</link>
    <atom:link href="https://christina.kim/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 23 Jul 2024 06:19:29 +0000</pubDate>
    <lastBuildDate>Tue, 23 Jul 2024 06:19:29 +0000</lastBuildDate>
    <generator>Jekyll v3.9.5</generator>

    
      <item>
        <title>Scaling Laws for Language Transfer Learning</title>
        <description>&lt;h5 id=&quot;building-upon-openais-recent-work-on-scaling-laws-my-project-explores-how-much-pre-training-on-english-helps-when-transferring-across-different-languages&quot;&gt;Building upon OpenAI’s recent work on scaling laws, my project explores how much pre-training on English helps when transferring across different languages.&lt;/h5&gt;

&lt;h5 id=&quot;here-i-will-discuss-scaling-laws-discovered-while-fine-tuning-across-different-languages-with-pre-trained-english-language-models-specifically-i-found-that-a-pre-trained-english-models-help-most-when-learning-german-then-spanish-and-finally-chinese-and-b-transfer-from-english-to-chinese-german-and-spanish-scales-predictably-in-terms-of-parameters-data-and-compute&quot;&gt;Here, I will discuss scaling laws discovered while fine-tuning across different languages with pre-trained English language models. Specifically, I found that a) pre-trained English models help most when learning German, then Spanish, and finally Chinese, and b) transfer from English to Chinese, German, and Spanish scales predictably in terms of parameters, data, and compute.&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/christinakim/scaling-laws-for-language-transfer&quot;&gt;code&lt;/a&gt; | &lt;a href=&quot;https://huggingface.co/christina&quot;&gt;models&lt;/a&gt; | &lt;a href=&quot;https://youtu.be/lpe5Gwuqa-k&quot;&gt;presentation&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p class=&quot;imgContainer1&quot;&gt;Historically, advances in deep learning capabilities have centered on three levers: improved algorithms, faster and cheaper compute, and larger, higher-quality datasets. Given machine learning’s promise to significantly impact society, deepening our general understanding of machine learning, and of how these levers improve models, is critical for making better predictions about which capabilities will develop next, and when. Recently, researchers have increasingly explored the scaling relationships between these three levers.
&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/scaling-autoreg.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;image-caption&quot;&gt;&lt;em&gt;Fig 1. Figure from Henighan et al. 2020.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My project’s framework for experiments is inspired by the work on scaling laws published by OpenAI in the past year. Scaling laws (Kaplan et al. 2020) can predict machine learning performance as a function of model size, dataset size, and the amount of compute used for training. Henighan et al. (2020) found that this relationship holds over several orders of magnitude across different modalities, as seen in the figure above. Earlier this year, scaling relationships were found for transfer learning from pre-trained English text models to Python (Hernandez et al. 2021). These results show that compute, dataset size, and model size are different limiting factors that scale with each other in surprisingly predictable trends when we set up our experiments to measure them.&lt;/p&gt;

&lt;p class=&quot;imgContainer1&quot;&gt;&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/meme.jpg&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In my project, I continue the study of transfer between distributions and look at scaling between three other languages and English. Scaling laws for transfer are important because the scaling relationships explain how to work in limited data regimes. In an ideal world, one would have an infinite amount of data for a model to learn from. However, getting a large quantity of high-quality data is a nontrivial, if not impossible, task, and as a result, most problems exist in the low data regime. Before the Scholars program, I was a machine learning engineer and saw firsthand how costly it was, in both time and money, to get good-quality human labels for our tasks. Exploring the relationships between different languages can provide more insight into how to tackle low-resource languages and how to best leverage pre-trained language models. Given the real-world limitations of data, the tradeoff between budgeting compute for larger models and budgeting for more fine-tuning data is an important practical relationship to understand.&lt;/p&gt;

&lt;h2 id=&quot;experiment-methodology&quot;&gt;Experiment Methodology&lt;/h2&gt;
&lt;p&gt;Building upon work from Scaling Laws for Transfer (Hernandez et al. 2021), my experiments focused on exploring the relationships between fine-tuning on non-English languages. My experiments try to answer the question: &lt;strong&gt;How much does pre-training on English help when transferring across different languages as we vary the dataset size and model size?&lt;/strong&gt;&lt;/p&gt;

&lt;h3 id=&quot;pre-training&quot;&gt;Pre-training&lt;/h3&gt;
&lt;p&gt;I first trained English language models in a setup similar to Scaling Laws for Neural Language Models. I pre-trained decoder-only transformers of 124M, 70M, 51M, 39M, 16M, and 3.3M non-embedding parameters, with the same hyperparameters, on &lt;a href=&quot;https://openwebtext2.readthedocs.io/en/latest/&quot;&gt;OpenWebtext2&lt;/a&gt; (65.86GB), an open-source version of WebText created by &lt;a href=&quot;https://www.eleuther.ai/&quot;&gt;Eleuther AI&lt;/a&gt;. All models used Adam, a batch size of 512, a context length of 1024 tokens, and a learning rate schedule with a 500-step linear warm-up followed by a cosine decay to 10% of the maximum learning rate. The text was encoded with a &lt;a href=&quot;https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer&quot;&gt;GPT2 tokenizer&lt;/a&gt;, a byte-level Byte-Pair Encoding tokenizer with a 50K vocab size. All models were trained for a total of 26 billion tokens with no repeats. The code to reproduce these pre-trained models is available on &lt;a href=&quot;https://github.com/christinakim/scaling-laws-for-language-transfer&quot;&gt;GitHub&lt;/a&gt;, including model weights. As seen in the figure below comparing loss and model size, the models exhibit scaling laws as model size increases. However, the relationship isn’t exactly linear, suggesting the models may be under-trained or the hyperparameters not thoroughly tuned.&lt;/p&gt;
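&lt;p&gt;This kind of warm-up-plus-cosine schedule is easy to get subtly wrong, so here is a minimal sketch of it in Python. The function name, the default arguments, and the 50,000-step horizon are illustrative assumptions; only the 500-step linear warm-up and the decay to 10% of the maximum learning rate come from the setup described above.&lt;/p&gt;

```python
import math

def lr_at_step(step, max_lr, warmup=500, total_steps=50_000, min_frac=0.10):
    # Illustrative sketch, not the exact training code. Linear warm-up for
    # `warmup` steps, then cosine decay to min_frac * max_lr. `total_steps`
    # is an assumed horizon, not a value from the post.
    warm = min(step, warmup) / warmup  # ramps 0 to 1 over the warm-up
    progress = min(max(step - warmup, 0) / (total_steps - warmup), 1.0)
    decay = min_frac + (1.0 - min_frac) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * warm * decay
```

&lt;p&gt;At step 0 the rate is 0, at step 500 it peaks at the maximum learning rate, and by the final step it has decayed to 10% of the maximum.&lt;/p&gt;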

&lt;p class=&quot;imgContainer1&quot;&gt;&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/openwebtext2.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;image-caption&quot;&gt;&lt;em&gt;Fig 2. Training curves for transformers on OpenWebtext2. The larger models overlap with each other early in training, which suggests the hyperparameters are not tuned thoroughly.&lt;/em&gt;&lt;/p&gt;

&lt;p class=&quot;imgContainer1&quot;&gt;&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/openwebtext2_compute.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;image-caption&quot;&gt;&lt;em&gt;Fig 3. Language modeling performance improves as the number of parameters increases, but the relationship is not linear.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;fine-tuning&quot;&gt;Fine-Tuning&lt;/h3&gt;
&lt;p&gt;For the fine-tuning experiments, dataset size spanned six orders of magnitude and model size spanned two orders of magnitude, across three different languages: Spanish, Chinese, and German. Models trained or fine-tuned on Chinese leveraged a 1.2-billion-character dataset, Community QA. Community QA (webtext2019zh) is similar to the WebText corpus. Models trained or fine-tuned on Spanish and German used text from &lt;a href=&quot;https://oscar-corpus.com/&quot;&gt;Oscar&lt;/a&gt;, a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus.&lt;/p&gt;

&lt;p&gt;The models fine-tuned on non-English languages starting from the pre-trained English model weights, and the models trained on non-English languages from scratch, were all trained until they reached their optimal loss (stopping if the model started over-fitting) or for up to 15M tokens, whichever came first. Models were trained with a learning rate schedule of a 300-step warm-up with a cosine decay to 10% of the maximum learning rate. The code to replicate these experiments will also be available on GitHub.&lt;/p&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
&lt;h4 id=&quot;effective-data-transfer&quot;&gt;Effective Data Transfer&lt;/h4&gt;
&lt;p class=&quot;imgContainer1&quot;&gt;&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/effective-data-transfer-explanation.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;image-caption&quot;&gt;&lt;em&gt;Fig 4. The performance of a 16M parameter transformer model on Chinese, both trained from scratch on Chinese and pre-trained on English then fine-tuned on Chinese.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In my experiments, I wanted to find the effective data transferred from models trained on English text to Chinese, Spanish, and German text. The effective data transferred is defined in Scaling Laws for Transfer as the amount of additional fine-tuning data that a model of the same size, trained on only that fine-tuning dataset, would have needed to achieve the same loss as a pre-trained model. In the figure above, each point is a 16M transformer trained to convergence on a dataset of X tokens. The total amount of data required by the model trained from scratch can be represented as $D_e = D_f + D_t$, where $D_e$ is the total amount of effective data, $D_f$ is the amount of fine-tuning data the pre-trained model actually used, and $D_t$ is the additional data a from-scratch model would have needed to match it. $D_t$ is the amount of data transferred from pre-training on English.&lt;/p&gt;
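&lt;p&gt;As a concrete numerical sketch of these definitions: the function name and the token counts below are entirely hypothetical, chosen for illustration, and are not measurements from my experiments.&lt;/p&gt;

```python
def effective_data_transferred(d_e, d_f):
    # D_t = D_e - D_f: the extra tokens a from-scratch model would need,
    # beyond the fine-tuning data, to match the pre-trained model's loss.
    return d_e - d_f

# Hypothetical numbers, for illustration only:
d_e = 4_000_000  # tokens the from-scratch model needs to reach loss L
d_f = 1_000_000  # fine-tuning tokens the pre-trained model needs for the same L

d_t = effective_data_transferred(d_e, d_f)  # 3_000_000 tokens transferred
fraction_from_fine_tuning = d_f / d_e       # 0.25; smaller means pre-training helped more
```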

&lt;style type=&quot;text/css&quot;&gt;
  .imgContainer img {
    max-width: 175% !important;
    margin-left: 50%;
    transform: translateX(-50%);
  }

  .imgContainer1 img {
    margin-left: 50%;
    transform: translateX(-50%);
  }

  .image-caption {
    text-align: center;
    font-size: .8rem;
  }

&lt;/style&gt;

&lt;p class=&quot;imgContainer&quot;&gt;&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/effective_data_transfer_16M_transformer.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;image-caption&quot;&gt;&lt;em&gt;Fig 5. Comparing performance of 16M parameter transformers trained from scratch and fine-tuned on Chinese, Spanish, and German. For the dataset size of 8000 tokens, $D_t$, the amount of data transferred, is largest for German. The dashed lines on the graphs represent $D_t$. As the number of tokens in the dataset increases, $D_t$ becomes smaller across all languages.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As seen in the figures above, English to Chinese had a smaller amount of data transferred than English to Spanish for the same model size, and English to German had the greatest amount of data transferred. Pre-trained English text models help most when learning German, followed by Spanish, and finally Chinese. I believe these results reflect the degree of linguistic similarity between English and the non-English languages. English and German are both derived from Proto-Germanic and are linguistically most similar. Although the Spanish alphabet shares almost all of its symbols with the English alphabet, Spanish is a Romance language, and Chinese does not share an alphabet with English at all. Each language has a distinctive shape and distance between fine-tuning and training from scratch. For instance, the effective data transfer is not much greater for Spanish than for Chinese at the smallest dataset size, 8000 tokens. However, as the dataset size increases, pre-training continues to help Spanish for another order of magnitude, up to the 100M-token dataset size, whereas the Chinese curves converge at the 10M-token dataset size.&lt;/p&gt;

&lt;h4 id=&quot;comparing-the-fraction-of-effective-data-from-fine-tuning-d_f-d_e-&quot;&gt;Comparing the Fraction of Effective Data from Fine-Tuning, $D_f /D_e $&lt;/h4&gt;

&lt;p&gt;$D_f /D_e $ measures the fraction of effective data from fine-tuning. A smaller fraction means pre-training helps more. I found that as model size increases, $D_f /D_e$ decreases across all languages, and pre-training becomes more effective. However, as dataset size increases, $D_f /D_e$ increases across model sizes, and pre-training becomes less effective. In the figure below, German has the steepest curves, seeing the fraction of effective data from fine-tuning decrease the most, which shows pre-training on English helps German the most, while Chinese has the flattest curves, showing pre-training helps Chinese the least.&lt;/p&gt;

&lt;p class=&quot;imgContainer&quot;&gt;&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/comparing_1-effective.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;image-caption&quot;&gt;&lt;em&gt;Fig 6. Comparing the fraction of effective data from fine-tuning of a 16M parameter transformer model on Chinese, Spanish, and German.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Between English and these languages, I find many of the same trends and relationships that Scaling Laws for Transfer found between text and code.&lt;/p&gt;

&lt;p class=&quot;imgContainer1&quot;&gt;&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/data-limited.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the low data regime, pre-training is helpful across model sizes, but especially at large model sizes. When using pre-trained models, model performance is &lt;em&gt;parameter limited&lt;/em&gt;. Model performance is considered parameter limited when the loss continues to decrease as we increase the model size; it is data limited when increasing the number of parameters does not impact the loss. This is evident in the figure above, which shows that as model size increased with a fixed Chinese fine-tuning dataset, models trained from scratch on Chinese did not improve much, while models pre-trained on English continued to improve and achieved better performance. The flat lines indicate the performance is data limited, while the sloped lines indicate the performance is limited more by the number of parameters.&lt;/p&gt;

&lt;p class=&quot;imgContainer1&quot;&gt;Lastly, pre-trained models are more compute efficient than models trained from scratch across dataset sizes. This is without accounting for the compute cost of pre-training itself.
&lt;img src=&quot;/images/posts/scaling-laws-for-language-transfer/loss_v_compute_60M_transformer_500M_chinese.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;image-caption&quot;&gt;&lt;em&gt;Fig 7. Comparing the amount of compute needed for a 60M transformer trained from scratch vs. fine-tuned on 500M tokens of Chinese.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;
&lt;p&gt;There are important limitations to my work. First, I used the same BBPE GPT-2 tokenizer for all languages. This is extremely inefficient for the non-English languages, since the tokenization requires more tokens to represent the same text. For example, Chinese has upwards of 50K characters, while the BBPE tokenizer uses a 50K vocab size. Second, my largest models could have been pre-trained for longer. For scaling law experiments, it’s important to make sure models are in the correct regime, either the convergence frontier or the compute frontier. Additionally, I was only able to do a limited sweep of hyperparameters and learning rate schedules due to compute and time constraints. I suspect the results are very dependent on the learning rate schedule. Finally, my fine-tuning datasets came from different sources. This is important to note since my results may be more specific to the distribution found in each dataset than to the underlying language.&lt;/p&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;
&lt;p&gt;Some potential future directions include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Compare the effective data transfer using models pre-trained on Chinese, German, or Spanish, then fine-tuned on English&lt;/li&gt;
  &lt;li&gt;Fine-tune with low resource languages or other tasks/distributions and find the effective data transfer&lt;/li&gt;
  &lt;li&gt;Predict the ideal ratio of pre-training vs. fine-tuning for any given problem, given some budget&lt;/li&gt;
  &lt;li&gt;Study the “forgetting” problem in transfer learning in terms of effective data transfer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;Thanks to my mentor Jerry Tworek, the Scholars cohort, Danielle Ensign and Kudzo Ahegbebu for sharing compute, everyone that gave me feedback (especially Danny Hernandez and Mohammad Bavarian), Jeromy Johnson for helping me find the Community QA dataset, Scholar program coordinators, Muraya Maigua and Christina Hendrickson, and OpenAI and Azure for making this all possible.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;
&lt;p&gt;If you find this work useful, please cite it as:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@article{kim2021scalinglanguagetransfer,
  title   = &quot;Scaling Laws for Language Transfer Learning&quot;,
  author  = &quot;Kim, Christina&quot;,
  journal = &quot;christina.kim&quot;,
  month   = 4,
  year    = &quot;2021&quot;,
  url     = &quot;https://christina.kim/2021/04/11/scaling-laws-for-language-transfer-learning/&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;p&gt;Jörg Bornschein et al. Sep 26, 2020. Small Data, Big Decisions: Model Selection in the Small-Data Regime&lt;/p&gt;

&lt;p&gt;Frédéric Branchaud-Charron et al. May 20, 2019. Spectral Metric for Dataset Complexity Assessment&lt;/p&gt;

&lt;p&gt;Tom Henighan et al. Oct 28, 2020. Scaling Laws for Autoregressive Generative Modeling&lt;/p&gt;

&lt;p&gt;Henning Fernau. Jan 24, 2019. Algorithms for Learning Regular Expressions from Positive Data&lt;/p&gt;

&lt;p&gt;Pierre Guillou. Jul 3, 2020. Byte-level BPE, an Universal Tokenizer but…&lt;/p&gt;

&lt;p&gt;Danny Hernandez et al. Feb 2, 2021. Scaling Laws for Transfer&lt;/p&gt;

&lt;p&gt;Joel Hestness et al. Dec 1, 2017. Deep Learning Scaling is Predictable, Empirically&lt;/p&gt;

&lt;p&gt;Jared Kaplan et al. Jan 23, 2020. Scaling Laws for Neural Language Models&lt;/p&gt;

&lt;p&gt;Ameet A. Rahane et al. Apr 16, 2020.  Measures of Complexity for Large Scale Image Datasets&lt;/p&gt;

&lt;p&gt;Jonathan S. Rosenfeld et al. Sep 25, 2019. A Constructive Prediction of the Generalization Error Across Scales&lt;/p&gt;

&lt;p&gt;Céline Van den Rul. Nov 9, 2019. [NLP] Basics: Measuring The Linguistic Complexity of Text&lt;/p&gt;

</description>
        <pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2021/04/11/scaling-laws-for-language-transfer-learning/</link>
        <guid isPermaLink="true">https://christina.kim/2021/04/11/scaling-laws-for-language-transfer-learning/</guid>
      </item>
    
      <item>
        <title>Experimenting in an Infinite Data Regime</title>
        <description>&lt;p&gt;Most machine learning tutorials gear toward defined datasets that can fit in the memory of most machines. These datasets are great for benchmarking new algorithms and for learning. However, newer SOTA models have many more parameters, and they train in an infinite data regime.&lt;/p&gt;

&lt;p&gt;I ran into quite a few bugs while setting up an experiment with OpenWebText2, a clone of WebText which contains over 40GB of data. In this blog post, I want to share some differences to consider when working in an infinite data regime and how to prevent common bugs.&lt;/p&gt;

&lt;p&gt;Working in an infinite data regime means you won’t have overfitting issues. You won’t need to worry about having enough samples for training while saving enough for testing and evaluation. Instead of setting a max number of epochs, in an infinite data regime you’ll set a max number of steps, since you shouldn’t need to see every sample (i.e., an entire epoch) to reach the lowest possible loss.&lt;/p&gt;

&lt;p&gt;In an infinite data regime, it makes sense to prepare, tokenize, and batch on the fly. In contrast, an indexable finite dataset can be transformed up front and fits easily in a GPU’s memory. Understanding how to work in an infinite data regime will only become more critical for machine learning researchers and practitioners.&lt;/p&gt;

&lt;p&gt;Below is how I use PyTorch’s IterableDataset to stream from multiple files to create batches. You can use this dataset class with PyTorch’s DataLoader class. It’s important to remember that all shuffling and batching should be handled within your IterableDataset (set batch_size on your DataLoader to None to let the DataLoader know that your dataset handles batching).&lt;/p&gt;
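&lt;p&gt;Since the DataLoader is handed already-batched items, shuffling has to happen inside the stream itself. A common trick for iterable data is an approximate shuffle buffer. The sketch below is my own illustration in plain Python generators, not code from this post’s repository, and the buffer and batch sizes are arbitrary.&lt;/p&gt;

```python
import random

def shuffle_buffer(stream, buffer_size=1000, seed=0):
    # Approximate shuffling for a stream too large to hold in memory:
    # keep a fixed-size buffer and emit a random element as items arrive.
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)  # drain whatever is left at the end
    yield from buf

def batch(stream, batch_size):
    # Group a flat stream into fixed-size batches, dropping the remainder.
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
```

&lt;p&gt;A larger buffer_size gives a shuffle closer to uniform at the cost of memory; a truly uniform shuffle of an unbounded stream isn’t possible.&lt;/p&gt;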

&lt;p&gt;To handle the multiple transformations from the raw text, to a batched output, I use generators for each transformation in the process.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;random&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;itertools&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chain&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;itertools&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cycle&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch.utils.data.dataset&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IterableDataset&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;transformers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GPT2Tokenizer&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;FileIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;r&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;utf-8&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__iter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TokenizerIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FileIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tokenize_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;truncation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# adding end of sequence to the beginning and end of the document
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eos_token_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eos_token_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__iter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenize_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BatchIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drop_last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;drop_last&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drop_last&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;tokenizer_iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TokenizerIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer_iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;shuffled_file_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;//&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chain&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_iterable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;process_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cycle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_streams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shuffled_file_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__iter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_streams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StreamingIterableDataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IterableDataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drop_last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GPT2Tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;gpt2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_iter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BatchIterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;drop_last&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;drop_last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataset_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__iter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collate_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;StopIteration&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;collate_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;data_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;label_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len_list&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_seq&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;data_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;label_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;seq_len_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# permute so the sequence dimension comes first: (batch, seq_len) becomes (seq_len, batch)
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongTensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;permute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongTensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;permute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongTensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

</description>
        <pubDate>Fri, 12 Feb 2021 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2021/02/12/experimenting-in-an-infinite-data-regime/</link>
        <guid isPermaLink="true">https://christina.kim/2021/02/12/experimenting-in-an-infinite-data-regime/</guid>
      </item>
    
      <item>
        <title>Scaling Laws</title>
<description>&lt;p&gt;My research direction for the OpenAI Scholars program is heavily influenced by OpenAI’s scaling laws papers on &lt;a href=&quot;https://arxiv.org/abs/2001.08361&quot;&gt;language models&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2010.14701&quot;&gt;autoregressive models&lt;/a&gt;, published last year. Scaling laws exist for cross-entropy loss in five domains: language modeling, generative image modeling, video modeling, multimodal image-to-text modeling, and mathematical problem solving (Henighan et al., 2020). They have also been discovered for specific problems within those domains. The fact that scaling laws hold across such a wide range of scales, with surprisingly exact curves, suggests they are an important piece in understanding neural networks and their performance.&lt;/p&gt;

&lt;p&gt;Scaling laws exist throughout nature and it’s particularly interesting to find them in deep learning as well.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://arxiv.org/abs/2001.08361&quot;&gt;Scaling Laws for Neural Language Models&lt;/a&gt;, the authors found that language model loss scales as a power law with model size, dataset size, and the amount of compute used for training, across seven orders of magnitude.&lt;/p&gt;
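The model-size law from that paper has the form L(N) = (N_c / N)^α_N, which is a straight line in log-log space, so the exponent can be recovered with a simple linear fit. A sketch with made-up loss values (not the paper's actual data) chosen to lie exactly on a power law:

```python
import math

# Hypothetical (model_size, loss) pairs lying on L(N) = (N_c / N) ** alpha:
# the loss drops by a constant factor (0.7) for every 100x increase in N.
points = [(1e6, 6.0), (1e8, 4.2), (1e10, 2.94)]

# In log-log space, log L = alpha * (log N_c - log N), so the exponent is
# minus the slope of a least-squares line through (log N, log L).
xs = [math.log(n) for n, _ in points]
ys = [math.log(l) for _, l in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
alpha = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
# alpha comes out near 0.077, close to the paper's reported alpha_N of about 0.076.
```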

&lt;p&gt;Scaling-laws-flavored AI research can deepen our understanding of model improvements and move the field past the benchmark-optimization phase. Research can instead focus on optimizing scaling laws and understanding their tradeoffs. Experiments designed with scaling laws in mind can also be smaller and less compute-heavy, as long as their findings extrapolate into trends.&lt;/p&gt;
</description>
        <pubDate>Fri, 29 Jan 2021 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2021/01/29/scaling-laws/</link>
        <guid isPermaLink="true">https://christina.kim/2021/01/29/scaling-laws/</guid>
      </item>
    
      <item>
        <title>Keeping Things Regular</title>
        <description>&lt;p&gt;&lt;em&gt;In this post, I will introduce the direction of my OpenAI scholars’ project.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Inspired by the Scaling Laws papers by OpenAI on &lt;a href=&quot;https://arxiv.org/abs/2001.08361&quot;&gt;language models&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2010.14701&quot;&gt;autoregressive models&lt;/a&gt;, published last year, I’m interested in exploring potential scaling laws for dataset complexity. The questions I’m most interested in studying with my project are: Are there universal scaling laws for dataset complexity, or is dataset difficulty more of a pre-factor to performance? Put another way, can we build a map of scaling laws with respect to dataset complexity for regular languages? And if such scaling relationships exist for regular languages, can we show that the same trends apply to other natural datasets?&lt;/p&gt;

&lt;p&gt;Finding scaling laws for dataset complexity will contribute to a more precise understanding of transformer models and data. There exists substantial literature on the impact of algorithms and compute for machine learning, but less on the impact of data on machine learning. Previous scaling laws work has shown the relationship between dataset size and model performance. However, other potential relationships that may exist with data have not yet been explored. Understanding datasets may provide insights on how to improve performance on downstream tasks, in addition to how to transfer unsupervised approaches with transformers to other domains of ML research.&lt;/p&gt;

&lt;p&gt;Yet while neural networks have made incredible progress, it is not always clear why. A better understanding of which factors, such as dataset complexity, cause models to work and improve will let us better predict which capabilities will develop, and when.&lt;/p&gt;

&lt;h2 id=&quot;what-is-a-regular-language&quot;&gt;What is a Regular Language?&lt;/h2&gt;
&lt;p&gt;For my experiments, I will train transformers to predict the next character in strings from a regular language. A &lt;a href=&quot;https://en.wikipedia.org/wiki/Regular_language&quot;&gt;regular language&lt;/a&gt; is a type of formal language defined by a regular expression. A &lt;a href=&quot;https://en.wikipedia.org/wiki/Formal_language&quot;&gt;formal language&lt;/a&gt; is a set of words over a given finite alphabet. &lt;a href=&quot;http://www.cs.cornell.edu/courses/cs2800/2017sp/lectures/lec27-kleene.html#:~:text=of%20Kleene's%20theorem-,Kleene's%20theorem%3A%20The%20set%20of%20regular%20languages%2C%20the%20set%20of,languages%20are%20all%20the%20same.&quot;&gt;Kleene’s theorem&lt;/a&gt; shows that finite automata and regular expressions are equivalent in their expressiveness for denoting languages.&lt;/p&gt;

&lt;p&gt;For example, the regular expression &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a*b&lt;/code&gt; can be converted into a non-deterministic finite automaton (NFA), which can be converted into a deterministic finite automaton (DFA), which can in turn be reduced to a minimum deterministic finite automaton (MDFA). This MDFA has two states, one of which is an accepting (final) state. The symbols of the regular language &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a*b&lt;/code&gt; are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt;, and the automaton has two transitions. A string is a valid word in a given language if it ends in an accepting state of the finite automaton representing that language.
&lt;img src=&quot;/images/posts/a*b_mdfa.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
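The a*b automaton is small enough to simulate directly. A minimal sketch (the state names and transition-table encoding are my own illustration):

```python
# DFA for the regular language a*b: loop on 'a' in the start state,
# move to the accepting state on 'b', and reject anything after that.
DFA = {
    ("start", "a"): "start",
    ("start", "b"): "accept",
}
ACCEPTING = {"accept"}

def accepts(word):
    state = "start"
    for symbol in word:
        state = DFA.get((state, symbol))
        if state is None:  # no transition defined: the word is rejected
            return False
    return state in ACCEPTING

# "b", "ab", and "aaab" are accepted; "a", "ba", and "abb" are not.
```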

&lt;h2 id=&quot;measuring-complexity&quot;&gt;Measuring Complexity&lt;/h2&gt;
&lt;p&gt;To measure dataset complexity, we will compare the number of states, transitions, and accepting states of the deterministic finite automata across different regular languages. The number of states is the primary axis of comparison; the thesis is that more states result in a more complicated language. However, it is also essential to consider secondary factors, such as how many states are accepting and how many transitions exist between states. For example, a finite automaton with many states, all of them accepting, will be easy to solve. Using synthetic languages lets us build a formal measure of complexity, which in turn lets us increase the difficulty and control our data; for example, we can generate as many samples as needed and will never be data-limited. There is existing research in computer science on measuring the complexity of regular languages, such as &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.2143&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;state complexity&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/pdf/1702.05024.pdf&quot;&gt;quotient complexity&lt;/a&gt;.&lt;/p&gt;
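Given a DFA in table form, the measures discussed here are straightforward counts. A sketch, assuming the same illustrative transition-table encoding (a dict mapping (state, symbol) pairs to next states):

```python
def complexity(transitions, accepting):
    """Count the proposed complexity measures for a DFA given as a
    {(state, symbol): next_state} table plus a set of accepting states."""
    # Collect every state that appears as a source, target, or accepting state.
    states = {s for (s, _) in transitions} | set(transitions.values()) | accepting
    return {
        "states": len(states),
        "transitions": len(transitions),
        "accepting": len(accepting),
    }

# The a*b automaton: two states, two transitions, one accepting state.
stats = complexity({("q0", "a"): "q0", ("q0", "b"): "q1"}, {"q1"})
```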

&lt;h2 id=&quot;progress-so-far&quot;&gt;Progress So Far&lt;/h2&gt;
&lt;p&gt;Currently, I have been testing &lt;a href=&quot;https://openai.com/blog/language-unsupervised/&quot;&gt;GPT&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/1901.02860&quot;&gt;TransformerXL&lt;/a&gt; architectures on generated datasets from various regular languages. It appears that the number of transitions in the finite automaton affects how easy a language is to learn. I started off generating datasets from languages with 50+ states and 100+ transitions, but found that the models could not learn anything. I had initially worried that very simple regular languages would not be interesting to study. I’ve since been working with very simple (but still infinite) regular languages and plan to slowly increase the number of states and transitions.&lt;/p&gt;
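Generating such a dataset can be done with a random walk on the automaton: start in the initial state, follow random outgoing transitions, and emit the word once you stop in an accepting state. A toy sketch (the encoding and stopping policy are my own illustration, not the project's actual generator):

```python
import random

def sample_word(transitions, accepting, start, max_len=20):
    """Random walk on a DFA given as {(state, symbol): next_state}; returns an
    accepted word, or None if the walk hits max_len outside an accepting state."""
    state, word = start, []
    for _ in range(max_len):
        # In an accepting state, stop with probability 1/2 so word lengths vary.
        if state in accepting and random.getrandbits(1):
            return "".join(word)
        options = [(sym, nxt) for (s, sym), nxt in transitions.items() if s == state]
        if not options:  # dead end: accept only if we ended in an accepting state
            return "".join(word) if state in accepting else None
        sym, state = random.choice(options)
        word.append(sym)
    return "".join(word) if state in accepting else None

# Sample strings from a*b: each draw is some number of a's followed by one b.
ab_star = {("q0", "a"): "q0", ("q0", "b"): "q1"}
words = [sample_word(ab_star, {"q1"}, "q0") for _ in range(5)]
```

Because the walk can loop, the language stays infinite while every emitted word is valid by construction.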

&lt;p&gt;Thanks to my mentor Jerry Tworek for helping me design this project!&lt;/p&gt;
</description>
        <pubDate>Fri, 15 Jan 2021 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2021/01/15/keeping-things-regular/</link>
        <guid isPermaLink="true">https://christina.kim/2021/01/15/keeping-things-regular/</guid>
      </item>
    
      <item>
        <title>How to Navigate Conferences</title>
<description>&lt;p&gt;The state of machine learning research moves incredibly fast. Dozens of new papers are published on arXiv every day, and it’s overwhelming trying to keep up. Conferences are a great way to get a lot of signal in a condensed amount of time. Conference papers and sessions are vetted, which saves you from having to judge each new publication that comes out. As someone newer to the field of AI research, I think it’s hugely beneficial to attend conferences, even just as a participant. I was lucky to go to NeurIPS last year with my previous startup.&lt;/p&gt;

&lt;p&gt;One thing I like to do before conferences, though this gets harder as conferences grow and more papers are accepted, is skim all the abstracts of papers I find interesting. I’ll do an initial run-through based on titles, keywords, and author affiliations to decide whether I want to read the abstracts; I’m pretty generous at this stage. Next, I’ll read the abstracts of all the papers I’ve selected. From there, I’ll skim the papers whose abstracts I found interesting. This process gives me a shortlist of papers and authors to seek out during poster sessions. Poster sessions are a great way to meet researchers and ask questions; last year, people were very excited to share their work and answer questions big and small.&lt;/p&gt;

&lt;p&gt;For me, the most fun and educational parts of conferences are the 1:1 or small-group discussions with researchers. I’ll look for interest-oriented events happening during the meeting, or events where researchers I admire are speaking. For example, last year the AI Safety Unconference was an unofficial event with fascinating discussions. Conference workshops are also a great way to find experts on a specific subject.&lt;/p&gt;

&lt;p&gt;Conferences, whether or not they are virtual, are intimidatingly packed with events. NeurIPS this year has events running 24/7 throughout the week. I like to spend some time figuring out which events to prioritize, optimizing for discussion-oriented events first. Most talks are recorded, so you can always catch those later (at 2x speed) on your own time.&lt;/p&gt;

&lt;p&gt;I’ll update this post with tips for navigating online conferences after next week’s virtual NeurIPS.&lt;/p&gt;
</description>
        <pubDate>Fri, 04 Dec 2020 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2020/12/04/how-to-navigate-conferences/</link>
        <guid isPermaLink="true">https://christina.kim/2020/12/04/how-to-navigate-conferences/</guid>
      </item>
    
      <item>
        <title>Research Tools</title>
<description>&lt;p&gt;One of the hardest things about switching from software engineering to research has been the open-ended nature of research and of measuring progress. For me, software engineering usually has more explicit objectives, and it’s easy to make a plan to get from 0 to 100. This post shares some tools I’ve been using to help me “do research” and explore potential research project ideas.&lt;/p&gt;

&lt;p&gt;I’ve been trying to write down as many of my thoughts as possible. As a software engineer, it was rare to have a day where I wasn’t writing code. As a researcher, it’s common to have days where you’re consuming papers and thinking, and it’s easy to feel like you haven’t done much all day when there’s no visible output of your work. I’ve found that writing helps me solidify what I’ve learned while clarifying ideas and questions. Since my notes are just for me, I write down any and all thoughts I have while brainstorming and learning. I’ve been using &lt;a href=&quot;https://obsidian.md/&quot;&gt;Obsidian&lt;/a&gt;, a note-taking editor similar to Roam. What I like about Obsidian is that it’s all local and that I can write notes on my &lt;a href=&quot;https://forum.obsidian.md/t/how-do-i-work-with-obsidian-on-mobile/471&quot;&gt;other devices&lt;/a&gt; as well. It’s effortless to find mentions of an idea by using backlinks. Obsidian has also helped add structure to my day: I’ve created a &lt;a href=&quot;https://forum.obsidian.md/t/how-i-use-daily-notes/3057&quot;&gt;daily note template&lt;/a&gt; with a set of morning questions to focus my day and a set of evening questions to reflect on what I’ve worked on. It’s easy to lose momentum in research, so I’ve been trying to have an exact next to-do item for each day so that I always have something to work on.&lt;/p&gt;

&lt;p&gt;I’ve been trying to read as many papers as I can to get exposed to new ideas. As someone new to research, it’s easier to find inspiration from other work. Many papers discuss potential future directions from their results, and that’s been an excellent source for my own project brainstorming. Another great reason to read papers is that it’ll give you a better idea of how researchers think about things. Being able to (quickly) evaluate work critically seems to be an essential skill as a researcher. Two features have been essential for me regarding papers: searching through papers and taking notes. I used &lt;a href=&quot;https://www.zotero.org/&quot;&gt;Zotero&lt;/a&gt; to keep track of papers I’ve read or skimmed, but since switching to Obsidian, I’ve tracked my papers there instead. I’ll copy the title, abstract, and link to the paper, so it’s easy to search. I’ve been using &lt;a href=&quot;https://www.marginnote.com/&quot;&gt;MarginNote&lt;/a&gt; for taking notes on papers. Some excellent features it has are that you can import websites to mark up and create flashcards. It has many other interesting features, such as mindmaps, which I haven’t used as much.&lt;/p&gt;

&lt;p&gt;As I’ve been learning new concepts, I’ve been heavily relying on &lt;a href=&quot;https://apps.ankiweb.net/&quot;&gt;Anki&lt;/a&gt;, which uses &lt;a href=&quot;http://augmentingcognition.com/ltm.html&quot;&gt;spaced repetition&lt;/a&gt; to augment your memory and learning. It’s easy to export cards directly into Anki from MarginNote. I try to spend at least 15 minutes every day going through my notecards. There are many different ways to organize your Anki cards. I don’t have any strong suggestions for how you should use Anki, but I recommend making your own cards instead of downloading other people’s decks. Writing learnings in my own words helps me solidify new facts. With spaced repetition, it’s become easier and less intimidating to learn a high volume of important key concepts.&lt;/p&gt;

&lt;p&gt;Using all of these tools has been useful in validating potential research ideas. It’s easy to track new ideas and expand upon them. During project exploration, I’ve consumed a lot of information. Finding tools to process and efficiently learn the new information has made this exploration phase less hectic and added more structure.&lt;/p&gt;

&lt;p&gt;I’ve also found these reads useful for how to think about your research:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://joschu.net/blog/opinionated-guide-ml-research.html&quot;&gt;An Opinionated Guide to ML Research&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://web.mit.edu/tslvr/www/lessons_two_years.html&quot;&gt;Lessons from My First Two Years of AI Research&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me know if there are any tools you’d recommend for research!&lt;/p&gt;
</description>
        <pubDate>Fri, 20 Nov 2020 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2020/11/20/research-tools/</link>
        <guid isPermaLink="true">https://christina.kim/2020/11/20/research-tools/</guid>
      </item>
    
      <item>
        <title>Transformers, Roll Out!</title>
<description>&lt;p&gt;I’ve spent a good majority of my time (when not constantly refreshing for election results) thinking about transformers. There are already many articles describing transformers and implementing them, so I won’t go into too much detail about that here. Instead, I want to share some questions I’ve had while playing around with transformers and LSTMs on small datasets this past week. I tried to get roughly &lt;a href=&quot;https://gist.github.com/christinakim/26c5ae3b22eb599b4fb5575918ff7a3b&quot;&gt;models of the same size&lt;/a&gt; for these experiments. In my small-dataset experiments, when tested on context lengths that differed from those seen in training, LSTMs performed better than transformers. The intuition here is that LSTMs generalize better across context lengths due to their recurrence.&lt;/p&gt;

&lt;p&gt;There are transformer architectures that try to add recurrence to the models, like the Universal Transformer. It’s interesting to note which tasks the Universal Transformer did and didn’t perform well on. For example, on machine translation, it performed a bit worse. The Universal Transformer is also computationally more expensive than the traditional transformer architecture. In the case of machine translation, my mentor suggested that the model might essentially just need to do a lookup into the weights for the correct word (aka memorization). However, the added recurrence did help on other tasks. So, what is the actual trade-off between compute and performance?&lt;/p&gt;

&lt;p&gt;Besides adding some notion of recurrence, as in the Universal Transformer, many other transformer architectures try to improve on the original implementation and performance. I’m curious about evaluating and understanding the architecture trade-offs across the many transformer papers. I hypothesize that autoregressive transformer models can learn some positional information, so they might benefit more from other types of positional encodings than non-autoregressive transformer models. It’d also be interesting to compare different attention mechanisms, such as the Reformer’s vs. standard self-attention. I’m interested in learning the answers to these questions since I think they’ll hopefully elucidate what “skills” are essential and to what degree they amount to natural language understanding.&lt;/p&gt;

&lt;p&gt;To better understand the performance differences between the different papers and implementations, I focused on learning more about the metrics used for language models. The terms I frequently came across were perplexity, bits per word (or character), and cross-entropy. Perplexity can be thought of as a measure of the uncertainty your model has about its predictions, so the lower the perplexity, the higher the confidence your model has in its predictions. Bits per word, or character, can be thought of as the entropy of the language: it measures the average number of bits required to encode each word. Given a language’s true distribution P and our model’s learned distribution Q, cross-entropy measures the average number of bits needed to encode events drawn from P using a code based on Q. These terms helped me evaluate papers and think about how to evaluate my toy datasets.&lt;/p&gt;
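&lt;p&gt;To make these relationships concrete for myself, here’s a tiny sketch in plain Python (my own helper names; cross-entropy computed in nats and then converted):&lt;/p&gt;

```python
import math

def cross_entropy(p_true, q_model):
    # average number of nats needed to encode events drawn from the
    # true distribution P using a code based on the model's Q
    return -sum(p * math.log(q) for p, q in zip(p_true, q_model) if p)

def perplexity(ce_nats):
    # perplexity is just exponentiated cross-entropy
    return math.exp(ce_nats)

def bits_per_token(ce_nats):
    # convert nats to bits: this is "bits per word" / "bits per character"
    return ce_nats / math.log(2)
```

&lt;p&gt;For a uniform model over four tokens, cross-entropy is ln 4, so perplexity is 4 and bits per token is 2, matching the intuition that the model is “choosing” among four equally likely options.&lt;/p&gt;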

&lt;p&gt;Below is my self-attention function that uses &lt;a href=&quot;https://pytorch.org/docs/stable/generated/torch.einsum.html&quot;&gt;einsum&lt;/a&gt;, which has been really handy!&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np
import torch

def self_attention(key, query, value):
    # scores = dot product of key and query
    # (b = batch; l = embedding dim; k, q = positions: pixels, words, etc.)
    scores = torch.einsum('bkl,bql-&amp;gt;bqk', key, query)

    # scale by the square root of the key embedding dimension
    # (this helps to create a more stable gradient; you could
    # probably normalize via other constants)
    d_k = key.size(-1)
    scores = scores / np.sqrt(d_k)

    # turn scores into probabilities from 0 to 1 over the key positions
    attention = torch.softmax(scores, dim=-1)

    # weighted sum of the value vectors using the attention weights
    attended_values = torch.einsum('bdl,bad-&amp;gt;bal', value, attention)

    return attended_values
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and here is an obligatory photo of my favorite transformer
&lt;img src=&quot;https://hips.hearstapps.com/digitalspyuk.cdnds.net/17/25/1498134404-transformers-dark-of-the-moon-bumblebee-poster.jpg?resize=980:*&quot; alt=&quot;bumblebee&quot; /&gt;&lt;/p&gt;
</description>
        <pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2020/11/06/transformers-roll-out/</link>
        <guid isPermaLink="true">https://christina.kim/2020/11/06/transformers-roll-out/</guid>
      </item>
    
      <item>
        <title>Hello from OpenAI</title>
        <description>&lt;p&gt;I’m excited to be joining the Fall 2020 cohort of OpenAI’s Scholars program. I’ll be writing semi-regularly as a log of what I’m learning and thinking about.&lt;/p&gt;

&lt;p&gt;I’m excited to be part of the scholars’ program since I find learning in a group motivating and useful, especially now when everyone is more isolated. It’s also been beneficial to ask questions and learn from people who have already been thinking about my research interests.&lt;/p&gt;

&lt;p&gt;One of the high-level goals I’d like to work on is developing “taste” or “aesthetic” for deep learning research throughout this experience.&lt;/p&gt;

&lt;p&gt;For the past two weeks, I’ve been reading about generalization and language models. I’ve also been working on reimplementing the smaller transformers from the &lt;a href=&quot;https://arxiv.org/pdf/2001.08361.pdf&quot;&gt;Scaling Laws for Neural Language Models&lt;/a&gt; paper.&lt;/p&gt;

&lt;h3 id=&quot;scaling-laws-for-neural-languages&quot;&gt;Scaling Laws for Neural Language Models&lt;/h3&gt;
&lt;p&gt;The paper uses a decoder-only transformer for most of its experiments, in addition to LSTM models and the Universal Transformer. For now, I’ll focus on reproducing the smaller-scale experiments with the transformer architecture. To better understand the architecture of the decoder-only transformer, I read the &lt;a href=&quot;https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf&quot;&gt;original GPT paper&lt;/a&gt;. It’s surprising to remember that this paper is only ~2 years old. I plan to use datasets available via &lt;a href=&quot;https://huggingface.co/docs/datasets/&quot;&gt;HuggingFace’s Datasets library&lt;/a&gt; for training initially, and look into this &lt;a href=&quot;https://github.com/jcpeterson/openwebtext&quot;&gt;WebText scraper&lt;/a&gt; later.&lt;/p&gt;
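&lt;p&gt;As a note to myself, the “decoder-only” part boils down to causal masking: position i may only attend to positions up through i, which is what makes the model autoregressive. A minimal sketch in plain Python:&lt;/p&gt;

```python
def causal_mask(n):
    # 1 = may attend, 0 = masked out; row i is position i's view,
    # so each position sees only itself and earlier positions
    return [[1 if i >= j else 0 for j in range(n)] for i in range(n)]
```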

&lt;p&gt;I found these resources really useful for understanding and implementing the transformer architecture.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://jalammar.github.io/illustrated-transformer/&quot;&gt;The Illustrated Transformer – Jay Alammar&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0&quot;&gt;Illustrated Guide to Transformers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://nlp.seas.harvard.edu/2018/04/01/attention.html#position-wise-feed-forward-networks&quot;&gt;The Annotated Transformer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coincidentally, Jared Kaplan, one of the paper’s authors, gave a talk on scaling laws this past Wednesday. The slides and the video from the talk can be accessed here on the &lt;a href=&quot;http://www.physicsmeetsml.org/posts/sem_2020_10_21/&quot;&gt;Physics ∩ ML website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Below are papers suggested by my mentor for other relevant language model papers to read:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1706.03762&quot;&gt;Transformer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1901.02860&quot;&gt;Transformer XL&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1807.03819&quot;&gt;Universal Transformer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf&quot;&gt;GPT-1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf&quot;&gt;GPT-2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2005.14165&quot;&gt;GPT-3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2001.04451&quot;&gt;Reformer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2009.01325&quot;&gt;RL from human feedback&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1812.06162v1&quot;&gt;Critical batch size paper&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1910.10683&quot;&gt;T5&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1810.04805&quot;&gt;BERT&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1801.06146&quot;&gt;ULMFiT&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1810.08272v1&quot;&gt;BabyAI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1906.08237&quot;&gt;XLNet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;generalization&quot;&gt;Generalization&lt;/h3&gt;
&lt;p&gt;I’ve also been thinking about model generalization this week. Some questions on my mind: What are the differences between generalization and memorization for some of these larger models with smaller datasets? What is the minimum amount of data required to generalize? What other factors allow models to generalize quickly? Are there similar scaling-law-esque properties for model generalization? What does it look like for a model to generalize well on out-of-distribution data?&lt;/p&gt;

&lt;p&gt;Some papers I’ve been reading about generalization:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/1912.01588.pdf&quot;&gt;Leveraging Procedural Generation to Benchmark Reinforcement Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/2010.04650.pdf&quot;&gt;LSTMs Compose—and Learn—Bottom-Up&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1910.07113&quot;&gt;Solving Rubik’s Cube with a Robot Hand&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://papers.nips.cc/paper/7176-exploring-generalization-in-deep-learning.pdf&quot;&gt;Exploring Generalization in Deep Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/1802.05296.pdf&quot;&gt;Stronger generalization bounds for deep nets via a compression
approach&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Fri, 23 Oct 2020 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2020/10/23/hello-from-openai/</link>
        <guid isPermaLink="true">https://christina.kim/2020/10/23/hello-from-openai/</guid>
      </item>
    
      <item>
        <title>The NLP Papers to Read Before ICLR 2020</title>
        <description>&lt;p&gt;Ahead of next week’s ICLR 2020 virtual conference, I went through the 687 accepted papers (out of 2594 submitted - up 63% since 2019!) and identified 9 papers with the potential to advance the use of deep learning NLP models in everyday use cases.&lt;/p&gt;

&lt;p&gt;Here’s what the papers found and why they matter:&lt;/p&gt;

&lt;h4 id=&quot;electra-pre-training-text-encoders-as-discriminators-rather-than-generators&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/2003.10555.pdf&quot;&gt;ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; A commonly used task for pre-training language models is to mask the input and have the model predict what is masked. This paper introduces a new pre-training task called token detection. In the new task, the authors replace some tokens with alternatives by sampling from a generator. They then trained a discriminator to predict whether the generator replaced each token in an input or not.&lt;/p&gt;
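&lt;p&gt;A toy sketch of the replaced-token-detection data setup (my own helper; a real ELECTRA samples replacements from a small masked-LM generator rather than uniformly from the vocabulary):&lt;/p&gt;

```python
import random

def make_electra_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Corrupt a token sequence; labels[i] == 1 iff token i was replaced.
    The discriminator is then trained to predict these labels."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if replace_prob > rng.random():
            # uniform replacement stands in for the generator here
            corrupted.append(rng.choice([v for v in vocab if v != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels
```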

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; This task is more data efficient, learning potentially from all tokens in a dataset versus the ~15% masked in the usual approach. It shows there’s still room for additional creativity in how to train a language model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/0*ncLPenV1b-Uyd7KI.png&quot; alt=&quot;&quot; /&gt;
&lt;em&gt;An overview of replaced token detection&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;the-curious-case-of-neural-text-degeneration&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/1904.09751.pdf&quot;&gt;The Curious Case of Neural Text Degeneration&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; The authors propose a new decoding strategy called nucleus sampling — which truncates the tail of the probability distribution, sampling from the dynamic nucleus of tokens containing the vast majority of the probability mass. The counter-intuitive empirical observation is that even though the use of likelihood as a training objective leads to high-quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.&lt;/p&gt;
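&lt;p&gt;Nucleus sampling is simple enough to sketch in plain Python; assuming a probability vector over the vocabulary, it truncates to the smallest head of the distribution whose mass reaches p, renormalizes, and samples:&lt;/p&gt;

```python
import random

def nucleus_sample(probs, p=0.9, seed=None):
    # sort token ids by probability, highest first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break  # smallest set whose cumulative mass reaches p
    # renormalize over the nucleus and sample from it
    weights = [probs[i] / total for i in nucleus]
    return random.Random(seed).choices(nucleus, weights=weights, k=1)[0]
```

&lt;p&gt;With probs [0.5, 0.3, 0.1, 0.1] and p=0.5, only the top token survives; with p=0.8, the nucleus grows to the top two tokens.&lt;/p&gt;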

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Text degeneration is an issue even in the latest cutting edge language models. Decoding strategies are important to create more human-like text generation for various tasks. Moving away from greedy algorithms like beam search will help performance on downstream tasks.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/0*-hFlV-V0kISGMevs.png&quot; alt=&quot;&quot; /&gt;
&lt;em&gt;Example of beam search based generation vs human generation&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;what-can-neural-networks-reason-about&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/1905.13211.pdf&quot;&gt;What Can Neural Networks Reason About?&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, Stefanie Jegelka&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; This paper introduces a framework called algorithmic alignment to measure how well neural networks perform on reasoning tasks. Neural networks that “align” with known algorithmic solutions are better able to learn those solutions. The framework roughly states that for a model to learn and successfully generalize on a reasoning task, it needs to be able to easily learn (to approximate) the steps of the reasoning task. The authors show that graph neural networks align well with dynamic programming and can therefore learn to solve dynamic programming problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; This is a dense theoretical paper explaining architectural choices people have been intuitively making and lays the groundwork for future research exploring new architectures to better fit tasks. It creates a new framework to evaluate future algorithms and tasks.&lt;/p&gt;

&lt;h4 id=&quot;sequential-latent-knowledge-selection-for-knowledge-grounded-dialogue&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/2002.07510.pdf&quot;&gt;Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Byeongchang Kim, Jaewoo Ahn, Gunhee Kim&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; This paper proposes a novel approach to selecting knowledge for open-domain dialogue, called the Sequential Latent Model, which represents knowledge history as a latent representation. Keeping track of knowledge history reduces the ambiguity caused by the diversity of knowledge selection in conversation and helps the model make better use of the response utterances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; This work shows that improving knowledge selection can make a big difference in response generation quality. This has implications for building more robust dialogue applications.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/0*aVVR0wVL_Fg0SbRu.png&quot; alt=&quot;&quot; /&gt;
&lt;em&gt;Examples of generated responses by the author’s model and baselines on Wizard of Wikipedia. TMN stands for E2E Transformer MemNet, and A and W for apprentice and wizard.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;a-probabilistic-formulation-of-unsupervised-text-style-transfer&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/2002.03912.pdf&quot;&gt;A Probabilistic Formulation of Unsupervised Text Style Transfer&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Junxian He, Xinyi Wang, Graham Neubig, Taylor Berg-Kirkpatrick&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; The authors propose a probabilistic approach to unsupervised text style transfer. This approach treats non-parallel data from two domains as a partially observed parallel corpus. The proposed model learns to transform sequences from one domain to another by positing, for each observed sequence, a parallel latent sequence that generates it, which allows the model to learn the transformation in an unsupervised way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; The paper had good results for the following tasks: unsupervised sentiment transfer, formality transfer, word decipherment, author imitation, and machine translation. Some of these could be useful features for future writing applications. The approach introduced in the paper does not require paired training data, which makes data collection for style transfer easier.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/1*4nGoftiUFRIIpCLmogXoWA.png&quot; alt=&quot;&quot; /&gt;
&lt;em&gt;Results on the sentiment transfer, author imitation, and formality transfer&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;albert-a-lite-bert-for-self-supervised-learning-of-language-representations&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/1909.11942.pdf&quot;&gt;ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; ALBERT is an extension of BERT that tries to answer the question: are larger models the answer to NLP tasks? ALBERT achieves SOTA results through cross-layer parameter sharing. By sharing parameters, ALBERT can be smaller with similar performance. The best results from ALBERT are with more parameters — but it still trains faster than BERT. And when they train for the same amount of wall-time, ALBERT performs better than BERT.&lt;/p&gt;
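&lt;p&gt;A back-of-the-envelope illustration of why sharing shrinks the model (my rough count of ~12H² weights per layer, assuming a 4H feed-forward inner dimension; these are my numbers, not the paper’s):&lt;/p&gt;

```python
def layer_params(hidden):
    # rough count: 4*H^2 for the attention projections (Q, K, V, output)
    # plus 8*H^2 for a feed-forward block with a 4H inner dimension
    return 12 * hidden * hidden

def total_params(hidden, layers, cross_layer_sharing=False):
    # with ALBERT-style cross-layer sharing, one set of layer weights is
    # reused at every depth, so the count stops growing with depth
    per_layer = layer_params(hidden)
    return per_layer if cross_layer_sharing else per_layer * layers
```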

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; These results are promising, showing that simply building more complex, larger, deeper models is not always the best approach to improving model performance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/0*NuXHm6HdZq37TpLS.png&quot; alt=&quot;&quot; /&gt;
&lt;em&gt;State-of-the-art results on the SQuAD and RACE benchmarks&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;encoding-word-order-in-complex-embeddingshttpsarxivorgpdf191212333pdf&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/1912.12333.pdf&quot;&gt;Encoding Word Order in Complex Embeddings&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, Jakob Grue Simonsen&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; This paper describes a new language model that captures both the position of words and their order relationships. The paper redefines word embeddings (previously thought of as fixed and independent vectors) to be functions of the word’s position. The authors’ Transformer Complex-Order model outperforms the Vanilla Transformer and complex-vanilla Transformer by 1.3 and 1.1 in absolute BLEU score, respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. Their approach allows word representations in different positions to correlate with each other in a continuous function.&lt;/p&gt;
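&lt;p&gt;The general form is easy to sketch: each embedding coordinate becomes a complex-valued function of position, roughly r·e^(i(ω·pos + θ)), so representations at nearby positions rotate smoothly into one another (my parameter names, a simplification of the paper’s formulation):&lt;/p&gt;

```python
import cmath

def complex_embedding_coord(amplitude, freq, phase, pos):
    # one coordinate of a word's embedding as a function of position:
    # the amplitude is position-independent (the "word meaning"), while
    # the frequency and phase control how it varies with position
    return amplitude * cmath.exp(1j * (freq * pos + phase))
```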

&lt;h4 id=&quot;reformer&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/2001.04451.pdf&quot;&gt;Reformer&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; The authors propose a new transformer model with two major improvements to the architecture: a) using reversible layers to avoid storing the activations of all layers for backpropagation, and b) using locality-sensitive hashing to approximate the costly softmax(QK^T) computation in the full dot-product attention.&lt;/p&gt;
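&lt;p&gt;The locality-sensitive hashing idea can be sketched with random hyperplanes: vectors pointing in similar directions tend to land in the same bucket, so attention only needs to compare queries and keys within a bucket (a toy angular-LSH sketch, not the Reformer’s exact scheme):&lt;/p&gt;

```python
def lsh_bucket(vec, planes):
    # hash a vector to a bucket id from the signs of its projections
    # onto random hyperplanes; nearby vectors usually share a bucket
    bucket = 0
    for plane in planes:
        dot = sum(v * w for v, w in zip(vec, plane))
        bucket = bucket * 2 + (1 if dot > 0 else 0)
    return bucket
```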

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; The Reformer performs on par with SOTA Transformer models while being much more memory-efficient and much faster on long sequences. For example, the Vaswani et al. (2017) base model had a BLEU score of 27.3 compared to the Reformer’s 27.6 on newstest2014 for WMT English-German.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/0*hqC0-fCtntP5XLlH.png&quot; alt=&quot;&quot; /&gt;
&lt;em&gt;On the left, Locality-Sensitive Hashing Attention showing the hash-bucketing, sorting, and chunking steps and the resulting causal attentions. On the right, (a-d) Attention matrices for these varieties of attention.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;thieves-on-sesame-street-model-extraction-of-bert-based-apis&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/1910.12366.pdf&quot;&gt;Thieves on Sesame Street! Model Extraction of BERT-based APIs&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Kalpesh Krishna, Gaurav Singh Tomar, Ankur P. Parikh, Nicolas Papernot, Mohit Iyyer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Contribution:&lt;/strong&gt; This paper highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model on SST2, SQuAD, MNLI, and BoolQ. On the SST2 task, the victim model had a 93.1% accuracy compared to their extracted model’s 90.1%. They show that an adversary does not need any real training data to mount the attack successfully. The attacker does not even need to use grammatical or semantically meaningful queries. They used random sequences of words coupled with task-specific heuristics to form useful queries for model extraction on a diverse set of NLP tasks.&lt;/p&gt;
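&lt;p&gt;The query construction is strikingly simple; a sketch of the “random words plus a task-specific heuristic” idea (hypothetical helper, seeded for determinism):&lt;/p&gt;

```python
import random

def extraction_query(vocab, length, heuristic=None, seed=0):
    # nonsensical but useful: sample random words with no grammar, then
    # apply a task-specific heuristic (e.g. append "?" for QA-style APIs)
    rng = random.Random(seed)
    query = " ".join(rng.choice(vocab) for _ in range(length))
    return heuristic(query) if heuristic else query
```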

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Outputs of modern NLP APIs on nonsensical text provide strong signals about model internals, allowing adversaries to train their own models and avoid paying for the API.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/0*IJ1Scs21iKWs7cRb.png&quot; alt=&quot;&quot; /&gt;
&lt;em&gt;Overview of the proposed model’s extraction setup for question answering&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Questions? Concerns? Snarky comments about papers I missed? Any and all feedback welcome. Ping me in the comments below or on twitter &lt;a href=&quot;https://twitter.com/christinahkim&quot;&gt;@christinahkim&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Thu, 23 Apr 2020 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2020/04/23/the-nlp-papers-to-read-before-iclr-2020/</link>
        <guid isPermaLink="true">https://christina.kim/2020/04/23/the-nlp-papers-to-read-before-iclr-2020/</guid>
      </item>
    
      <item>
        <title>Grounded Language Learning</title>
        <description>&lt;!--end_excerpt--&gt;
&lt;h2 id=&quot;what-is-it&quot;&gt;What is it&lt;/h2&gt;
&lt;p&gt;Grounded Language Learning is the process of learning representations for words based on non-linguistic experience.&lt;/p&gt;

&lt;p&gt;Grounded language learning makes use of language, multimodal information, and interactive environments, working toward natural language understanding. Currently, natural language understanding is commonly tested with language models trained on text-only corpora. This language-model approach is based on the idea that the meaning of a word derives only from its relationship to other words, also called a distributional notion of semantics. SOTA language models are incredible at many different tasks and even come close to beating humans on natural language understanding benchmarks. However, there are critiques of NLU benchmarks, and grounded language learning argues that the words are not grounded in anything and are, therefore, actually meaningless. With grounded language learning, the hope is to create models that can understand and generalize well to their context.&lt;/p&gt;

&lt;p&gt;From VIGIL: Visually Grounded Interaction and Language&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“In neuroscience, recent progress in fMRI technology has enabled better understanding of the interaction between language, vision and other modalities suggesting that the brains share neural representations of concepts across vision and language.
In concurrent work, developmental cognitive scientists have argued that word acquisition in children is closely linked to them learning the underlying physical concepts in the real world and that they generalize surprisingly well at this from sparse evidence.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;tasks&quot;&gt;Tasks&lt;/h2&gt;
&lt;h4 id=&quot;visual-qa&quot;&gt;Visual QA&lt;/h4&gt;
&lt;p&gt;&lt;img src=&quot;/images/posts/visual_qa.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;embodied-qa&quot;&gt;Embodied QA&lt;/h4&gt;
&lt;p&gt;&lt;img src=&quot;/images/posts/embodied_qa.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;captioning&quot;&gt;Captioning&lt;/h4&gt;

&lt;h4 id=&quot;visual-audio-correspondence&quot;&gt;Visual-Audio Correspondence&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1705.08168&quot;&gt;Look, Listen, Learn&lt;/a&gt; asks: “what can be learned by training visual and audio networks simultaneously to predict whether visual information (a video frame) corresponds or not to audio information (a sound snippet)?”&lt;/p&gt;

&lt;h4 id=&quot;embodied-agents-performing-interactive-tasks&quot;&gt;Embodied Agents Performing Interactive Tasks&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://aihabitat.org/&quot;&gt;AI Habitat&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/deepmind/streetlearn&quot;&gt;StreetLearn&lt;/a&gt;
    &lt;img src=&quot;/images/posts/street_learn.png&quot; alt=&quot;image&quot; /&gt;
    &lt;img src=&quot;/images/posts/street_learn2.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/facebookresearch/House3D&quot;&gt;House3D&lt;/a&gt;&lt;/p&gt;

&lt;h4 id=&quot;other-games&quot;&gt;Other Games&lt;/h4&gt;
&lt;p&gt;Mechanical Turker Descent (MTD) trains agents to execute natural language commands grounded in a fantasy text adventure game.&lt;/p&gt;
</description>
        <pubDate>Wed, 04 Mar 2020 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2020/03/04/grounded-language-learning/</link>
        <guid isPermaLink="true">https://christina.kim/2020/03/04/grounded-language-learning/</guid>
      </item>
    
      <item>
        <title>Shaping representations through communication: community size effect in artificial learning systems</title>
<description>&lt;p&gt;paper summary of &lt;a href=&quot;https://arxiv.org/abs/1912.06208&quot;&gt;Shaping Representations Through Communication: Community Size Effect in Artificial Learning Systems&lt;/a&gt;
&lt;!--end_excerpt--&gt;&lt;/p&gt;

&lt;p&gt;This paper is motivated by the co-adaptation that can occur a) between a single encoder/speaker and decoder/listener, or b) when a shared language arises within small groups; in both cases, the information shared between agents can become too specific. As long as the encoder and decoder agree on the information, the learned representation doesn’t need to be representative, abstract, or systematic. This paper explores adding more encoders and decoders that are randomly paired up at each training step to encourage the encoders and decoders to learn a more general representation. They find that increasing the community size of encoders/decoders reduces idiosyncrasies in the learned code and prevents co-adaptation.&lt;/p&gt;
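&lt;p&gt;The mechanism itself is simple to sketch: at every training step, draw a fresh encoder/decoder pair, so no single pair trains together long enough to settle on a private code (my function names):&lt;/p&gt;

```python
import random

def sample_pairings(n_encoders, n_decoders, steps, seed=0):
    # with a community of size 1, the same pair always trains together
    # and can co-adapt; larger communities break up that co-adaptation
    rng = random.Random(seed)
    return [(rng.randrange(n_encoders), rng.randrange(n_decoders))
            for _ in range(steps)]
```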
</description>
        <pubDate>Fri, 13 Dec 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/12/13/shaping-representations-through-communication-community/</link>
        <guid isPermaLink="true">https://christina.kim/2019/12/13/shaping-representations-through-communication-community/</guid>
      </item>
    
      <item>
        <title>Learning Machine Learning</title>
        <description>&lt;p&gt;&lt;em&gt;Written for some coworkers who wanted to learn deep learning&lt;/em&gt;&lt;/p&gt;

&lt;!--end_excerpt--&gt;
&lt;p&gt;&lt;a href=&quot;#how-to-get-started&quot;&gt;How to get started&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#ml-resources&quot;&gt;Classical ML resources&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#deep-learning-resources&quot;&gt;DL resources&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#nlp-resources&quot;&gt;NLP resources&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#brushing-up-on-math&quot;&gt;Brushing up on math&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;#paper-reading&quot;&gt;Paper reading&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#how-to-manage-papers&quot;&gt;How to manage papers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#how-to-figure-out-what-to-read-check-out-these-sources&quot;&gt;How to figure out what to read&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#how-to-read-a-paper&quot;&gt;How to read a paper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;how-to-get-started&quot;&gt;How to get started&lt;/h2&gt;
&lt;p&gt;I spent some time learning classical ML first since it was most relevant for my job. You can learn deep learning first without any other ML experience/knowledge.&lt;/p&gt;

&lt;h3 id=&quot;ml-resources&quot;&gt;ML resources&lt;/h3&gt;
&lt;p&gt;I started off with a homemade ML in 10 weeks course. TL;DR, here’s the course, using content primarily from &lt;a href=&quot;https://github.com/yanshengjia/ml-road/blob/master/resources/Hands%20On%20Machine%20Learning%20with%20Scikit%20Learn%20and%20TensorFlow.pdf&quot;&gt;Hands-On Machine Learning with Scikit-Learn and TensorFlow&lt;/a&gt; and Andrew Ng’s &lt;a href=&quot;https://www.coursera.org/learn/machine-learning&quot;&gt;Coursera course on ML&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- Chapter 2 End-to-End Machine Learning Project
- Chapter 3 Classification (precision/recall, multiclass)
- Text feature extraction (from sklearn docs)
- Chapter 4 Training Models (linear/logistic regression, regularization)
- Advice for Applying Machine Learning
- Chapter 5 SVMs (plus kernels)
- Chapter 6 Decision Trees (basics)
- Chapter 7 Ensemble Learning and Random Forests (xgboost, RandomForest)
- Chapter 8 Dimensionality Reduction (PCA, t-SNE, LDA)
- Machine Learning System Design
- (Google) Best Practices for ML Engineering
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A group of friends and I worked through this content at a cadence of one meeting every other Wednesday, starting late June 2018 and wrapping up at the end of 2018.&lt;/p&gt;

&lt;h3 id=&quot;deep-learning-resources&quot;&gt;Deep learning resources&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Neural Networks and Deep Learning by Michael Nielsen &lt;a href=&quot;http://neuralnetworksanddeeplearning.com/index.html&quot;&gt;http://neuralnetworksanddeeplearning.com/index.html&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;fast.ai
    &lt;ul&gt;
      &lt;li&gt;Practical Deep Learning for Coders &lt;a href=&quot;https://course.fast.ai/videos/?lesson=1&quot;&gt;https://course.fast.ai/videos/?lesson=1&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;Part 2: Deep Learning from the Foundations &lt;a href=&quot;https://course.fast.ai/videos/?lesson=8&quot;&gt;https://course.fast.ai/videos/?lesson=8&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Distill is a good resource for specific topics, e.g.:
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://distill.pub/2017/momentum/&quot;&gt;https://distill.pub/2017/momentum/&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://distill.pub/2016/augmented-rnns/&quot;&gt;https://distill.pub/2016/augmented-rnns/&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;nlp-resources&quot;&gt;NLP resources&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Kyunghyun Cho’s lecture notes on “Natural Language Processing with Representation Learning”: &lt;a href=&quot;https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf&quot;&gt;https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Jacob Eisenstein’s textbook on “Natural Language Processing”: &lt;a href=&quot;https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf&quot;&gt;https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;brushing-up-on-math&quot;&gt;Brushing up on math&lt;/h3&gt;
&lt;p&gt;It’s easy to get intimidated by the math in papers. I found that taking the time to relearn linear algebra and some calculus has had compounding returns!&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://explained.ai/matrix-calculus/&quot;&gt;Matrix Calculus by Terence Parr and Jeremy Howard&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;backprop chapter in &lt;a href=&quot;http://neuralnetworksanddeeplearning.com/chap2.html&quot;&gt;Neural Networks and Deep Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.quantstart.com/articles/matrix-algebra-linear-algebra-for-deep-learning-part-2&quot;&gt;Matrix Algebra - Linear Algebra for Deep Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.3blue1brown.com/essence-of-linear-algebra-page&quot;&gt;3blue1brown for practical and visual linear algebra&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Finite-Dimensional-Vector-Spaces-Paul-Halmos/dp/178139573X&quot;&gt;for theoretical linear algebra: Finite Dimensional Vector Spaces&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;paper-reading&quot;&gt;Paper reading&lt;/h2&gt;

&lt;p&gt;Once you’ve understood common concepts, the best way to keep up to date with research and continue learning beyond courses is by reading and reimplementing papers.&lt;/p&gt;
&lt;h4 id=&quot;how-to-manage-papers&quot;&gt;How to manage papers&lt;/h4&gt;
&lt;p&gt;I recommend tracking papers through either &lt;a href=&quot;https://www.zotero.org/&quot;&gt;Zotero&lt;/a&gt; or &lt;a href=&quot;https://www.mendeley.com/&quot;&gt;Mendeley&lt;/a&gt;. I started off using Zotero but switched to Mendeley to share folders/papers with the groups I was in. I don’t have a strong opinion on which one is better.&lt;/p&gt;

&lt;h4 id=&quot;how-to-figure-out-what-to-read-check-out-these-sources&quot;&gt;How to figure out what to read? Check out these sources:&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;twitter - follow 20+ practitioners/researchers you admire on twitter to find interesting papers&lt;/li&gt;
  &lt;li&gt;ML subreddit&lt;/li&gt;
  &lt;li&gt;AI/DL fb groups&lt;/li&gt;
  &lt;li&gt;arXiv - there are 10-20 new papers on arXiv every day in AI/computational linguistics, so you could browse arXiv daily for the latest papers on the topics you’re most interested in&lt;/li&gt;
  &lt;li&gt;AI blogs
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://jack-clark.net/&quot;&gt;Import AI&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/dair-ai/nlp_newsletter&quot;&gt;NLP Newsletter&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;how-to-read-a-paper&quot;&gt;How to read a paper:&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;your objective is to figure out quickly which papers NOT to read&lt;/li&gt;
  &lt;li&gt;spend time in the conclusions&lt;/li&gt;
  &lt;li&gt;try to answer the question &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;what is novel&lt;/code&gt;?&lt;/li&gt;
  &lt;li&gt;create a reading group! Even just one other person can already save you 50% of the time.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Mon, 09 Dec 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/12/09/learning-plan/</link>
        <guid isPermaLink="true">https://christina.kim/2019/12/09/learning-plan/</guid>
      </item>
    
      <item>
        <title>On NMT Search Errors and Model Errors: Cat Got Your Tongue?</title>
        <description>&lt;p&gt;paper summary of &lt;a href=&quot;'https://arxiv.org/abs/1908.10090'&quot;&gt;On NMT Search Errors and Model Errors: Cat Got Your Tongue?&lt;/a&gt;
&lt;!--end_excerpt--&gt;&lt;/p&gt;

&lt;p&gt;Current NMT models may not be working correctly! Given near-perfect conditions, i.e. an infinite search space, the models predict no translation at all (an empty sequence) over half of the time.&lt;/p&gt;

&lt;p&gt;In this paper, the authors explore what NMT models predict as the best translation when the beam width is “infinite.” With an infinite beam width, the model can consider every possible sequence. Their technical contribution is a way to search with an effectively infinite beam width. They found that in 51.8% of cases, the model scored an empty sequence as the best translation! Under beam search, longer sequences accumulate lower probabilities, and in many cases these fall below the probability of a single EOS token.&lt;/p&gt;
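To make the finding concrete, here is a toy illustration (all probabilities below are hypothetical, not taken from the paper) of why an empty sequence can outscore a full translation under raw log-probability:

```python
import math

# A fluent 10-token translation where each token has probability 0.7:
long_seq_logprob = sum(math.log(0.7) for _ in range(10))  # about -3.57

# An "empty" translation: a single EOS token with probability 0.05:
empty_seq_logprob = math.log(0.05)  # about -3.00

# Under raw (unnormalized) log-probability, the empty sequence scores higher,
# because every extra token multiplies in another probability below 1.
assert empty_seq_logprob > long_seq_logprob
```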

&lt;p&gt;This paper is a startling contribution to NMT: it exposes a very troubling bug in current models. It matters broadly because many NLP problems are modeled as sequence-to-sequence problems, and many SOTA language models decode with beam search. The implications are exciting and raise new questions for NLP research.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 Nov 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/11/18/nmt-search-errors/</link>
        <guid isPermaLink="true">https://christina.kim/2019/11/18/nmt-search-errors/</guid>
      </item>
    
      <item>
        <title>Beam Search</title>
        <description>&lt;!--end_excerpt--&gt;
&lt;h2 id=&quot;description&quot;&gt;Description&lt;/h2&gt;
&lt;p&gt;Beam search is a search algorithm to find the best choice from many options. It explores a graph by expanding the most promising node in a limited set. A beam search is usually used in situations where there isn’t a way to store the entire search tree in memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;where is it used in DL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;sequence to sequence models like neural machine translation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;how is it used in DL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;used to find the next best word, which can be the EOS token&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In NMT there are lots of possible combinations of words for a translation, but we want to pick the best one, aka the one with the maximum probability, and not one at random.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interesting things to note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When beam width = 1, beam search reduces to a greedy algorithm&lt;/p&gt;

&lt;p&gt;Beam search sums the log probabilities of the words to score each candidate (equivalent to multiplying the raw probabilities). Since each log probability is negative, longer sequences accumulate lower scores, which leads beam search to favor very short translations. This is partly fixed by length normalization: dividing the score by the number of tokens.&lt;/p&gt;
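A toy sketch of the scoring and of the length-normalization fix (the token probabilities are made up for illustration):

```python
import math

# Hypothetical token probabilities for two candidate translations.
short = [0.5, 0.4]
long_ = [0.5, 0.5, 0.5, 0.5, 0.4]

# Beam search scores a candidate by the sum of its token log probabilities.
score_short = sum(math.log(p) for p in short)
score_long = sum(math.log(p) for p in long_)
assert score_short > score_long  # raw scores favor the shorter candidate

# Length normalization: divide each score by its token count.
norm_short = score_short / len(short)
norm_long = score_long / len(long_)
assert norm_long > norm_short  # the bias toward short outputs is removed
```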

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;beam_search_decoder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encoded_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beam_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# each candidate is [sequence, cumulative negative log probability]
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;sequences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# walk over each step in sequence
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoded_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;all_candidates&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# expand each current candidate
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sequences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sequences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;# why do we sum the logs of the probabilities
&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;# instead of multiplying the probabilities directly?
&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;# the probabilities are all numbers less than 1,
&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;# multiplying a lot of numbers less than 1 will result in a very smol number
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;candidate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])]&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;all_candidates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;candidate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# order all candidates by score (lowest cumulative -log prob first)
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;ordered&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sorted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;all_candidates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# select beam_width best
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;sequences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ordered&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;beam_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sequences&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;h2 id=&quot;math&quot;&gt;Math&lt;/h2&gt;
&lt;p&gt;How the algorithm works from Andrew Ng’s coursera course:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/posts/beam_search1.png&quot; alt=&quot;image&quot; /&gt;
&lt;img src=&quot;/images/posts/beam_search2.png&quot; alt=&quot;image&quot; /&gt;
&lt;img src=&quot;/images/posts/beam_search3.png&quot; alt=&quot;image&quot; /&gt;
&lt;img src=&quot;/images/posts/beam_search4.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;history&quot;&gt;History&lt;/h2&gt;
&lt;p&gt;The term “beam search” was coined by Raj Reddy of Carnegie Mellon University in 1977.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 Nov 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/11/18/beam-search/</link>
        <guid isPermaLink="true">https://christina.kim/2019/11/18/beam-search/</guid>
      </item>
    
      <item>
        <title>Negated LAMA: Birds cannot fly</title>
        <description>&lt;p&gt;paper summary of &lt;a href=&quot;'https://arxiv.org/abs/1911.03343'&quot;&gt;Negated LAMA: Birds cannot fly&lt;/a&gt;
&lt;!--end_excerpt--&gt;&lt;/p&gt;

&lt;p&gt;When pre-trained language models such as GPT-2 came out, I was curious about what applications an LM like GPT-2 could have. What exactly was the model learning?&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;/2019/11/14/lm-as-a-knowledge-base/&quot;&gt;Language Models as a Knowledge Base?&lt;/a&gt; the authors offer a possible answer to that question. The idea is that an LM learns facts and some understanding of them, so could you then use it as a knowledge base? Their experiments focused on masking words in cloze sentences to get the LM to predict the answer. A cloze statement “is generated from a subject-relation-object triple from a knowledge base and from a template statement for the relation that contains variables X and Y for subject and object (e.g., “X was born in Y”).” For instance, “birds can MASK” would return fly.&lt;/p&gt;

&lt;p&gt;The authors of Negated LAMA find that LMs are not great with negation. When given “birds cannot MASK,” the model returns fly as well. This suggests that LMs are not actually understanding the text shoveled into them. It also shows that LMs will answer any question regardless of accuracy; the authors suggest that in such cases the model perhaps should not give an answer at all.&lt;/p&gt;
</description>
        <pubDate>Thu, 14 Nov 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/11/14/negated-lama/</link>
        <guid isPermaLink="true">https://christina.kim/2019/11/14/negated-lama/</guid>
      </item>
    
      <item>
        <title>Language Models as a Knowledge Base?</title>
        <description>&lt;p&gt;paper summary of &lt;a href=&quot;'https://arxiv.org/pdf/1909.01066.pdf'&quot;&gt;Language Models as Knowledge Bases?&lt;/a&gt;
&lt;!--end_excerpt--&gt;&lt;/p&gt;

&lt;p&gt;This paper explores using language models as a knowledge base. Structured knowledge bases are pretty restrictive and hard to maintain. For that reason, there hasn’t been much research interest in them since the 80s, since they appeared impractical for Q/A and other fact-based NLP tasks.&lt;/p&gt;

&lt;p&gt;Language models make an attractive substitute for structured knowledge bases since they don’t run into the same issues: they don’t require schema engineering, they allow queries about an open class of relations, they easily extend to more data, and they require no human supervision to train.&lt;/p&gt;

&lt;p&gt;The authors propose LAMA (LAnguage Model Analysis), a probe for analyzing the factual and commonsense knowledge contained in pretrained language models, and use it to evaluate BERT. They found that BERT-large performs on par with a knowledge base built from text. In their experiments, pretrained BERT-large recalled knowledge better than its pretrained LM competitors and at a level remarkably competitive with non-neural and supervised alternatives.&lt;/p&gt;
</description>
        <pubDate>Thu, 14 Nov 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/11/14/lm-as-a-knowledge-base/</link>
        <guid isPermaLink="true">https://christina.kim/2019/11/14/lm-as-a-knowledge-base/</guid>
      </item>
    
      <item>
        <title>Backpropagation</title>
        <description>&lt;!--end_excerpt--&gt;
&lt;h2 id=&quot;description&quot;&gt;Description&lt;/h2&gt;
&lt;p&gt;Backpropagation is short for backward propagation of errors. It’s used to compute gradients for our loss function in machine learning. It does this by computing the partial derivatives of each training example in regards to the weights and bias. Cost can be written as a function of the outputs from the neural network&lt;/p&gt;

&lt;p&gt;Another neat feature of backpropagation is that it relates the network’s error to the weights and biases of the last layer, then relates those to the weights and biases of the second-to-last layer, and so on backward through the network, using the chain rule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are lots of ways to compute the gradients of the loss function, but most of them take too long for deep learning, requiring a separate pass to compute the cost with respect to each parameter.&lt;/p&gt;

&lt;p&gt;Backpropagation is way faster: it allows us to compute all the partial derivatives simultaneously, using one forward pass through the network followed by one backward pass.&lt;/p&gt;
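A minimal sketch of both passes on a single sigmoid neuron with squared loss (the one-neuron setup and all numbers are my own illustration, not from the book):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical input, weight, bias, and target.
x, w, b, target = 1.0, 0.5, 0.0, 1.0

# Forward pass: the cost is a function of the network's output.
z = w * x + b
a = sigmoid(z)
loss = 0.5 * (a - target) ** 2

# Backward pass: the chain rule yields every partial derivative in one sweep.
dloss_da = a - target          # dL/da
da_dz = a * (1.0 - a)          # da/dz for the sigmoid
delta = dloss_da * da_dz       # error term at the output
dloss_dw = delta * x           # dL/dw
dloss_db = delta               # dL/db
```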

&lt;h2 id=&quot;math&quot;&gt;Math&lt;/h2&gt;

&lt;p&gt;The actual algorithm:
&lt;img src=&quot;/images/posts/nielson_backprop.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Taken from Neural Networks and Deep Learning by Michael Nielsen&lt;/p&gt;
</description>
        <pubDate>Wed, 06 Nov 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/11/06/backpropagation/</link>
        <guid isPermaLink="true">https://christina.kim/2019/11/06/backpropagation/</guid>
      </item>
    
      <item>
        <title>Skip Connections and Residual Blocks</title>
        <description>&lt;!--end_excerpt--&gt;
&lt;p&gt;&lt;img src=&quot;/images/posts/residual_block.png&quot; alt=&quot;image&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;description&quot;&gt;Description&lt;/h2&gt;
&lt;p&gt;This is a residual block, or skip connection. The residual is the difference between the predicted and target values. The words &lt;strong&gt;residual&lt;/strong&gt; and &lt;strong&gt;skip&lt;/strong&gt; are used interchangeably in many places. A residual block can also be thought of as an identity block: if there is no residual, then H(x) = x (aka the identity).&lt;/p&gt;

&lt;p&gt;A residual block is trying to learn the residual of the true distribution minus the input, while a normal layer tries to learn the true distribution itself.&lt;/p&gt;
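A minimal pure-Python sketch of that idea (`residual_forward` and `zero_residual` are hypothetical names for illustration): the block outputs the learned residual plus the unchanged input, H(x) = R(x) + x.

```python
# The block adds the learned residual R(x) back onto the input x elementwise.
def residual_forward(residual_fn, x):
    return [r + xi for r, xi in zip(residual_fn(x), x)]

# If the block learns a zero residual, the output is exactly the identity.
zero_residual = lambda x: [0.0] * len(x)
assert residual_forward(zero_residual, [1.0, 2.0, 3.0]) == [1.0, 2.0, 3.0]
```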

&lt;p&gt;&lt;strong&gt;why does this work better?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Theoretically this shouldn’t matter. Neural networks are function approximators, so the more layers you add, the better the approximation should be. In practice, however, this breaks down for reasons such as exploding or vanishing gradients. By allowing values to essentially pass through in a linear way, skip connections let gradients flow back to earlier layers.&lt;/p&gt;

&lt;p&gt;In other words, a deeper net is not necessarily optimal: it might not learn the right abstractions, since information learned in the earlier layers can disappear in the later layers.&lt;/p&gt;

&lt;p&gt;Fewer layers mean a simpler model, which is faster to train.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;code&lt;/h2&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ResidualBlock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;expansion&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;


       &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
           &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ResidualBlock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
           &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in_planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
           &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bn1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BatchNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
           &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
           &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bn2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BatchNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


           &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shortcut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sequential&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
           &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_planes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expansion&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
               &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shortcut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sequential&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in_planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expansion&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BatchNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expansion&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;math&quot;&gt;math&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;R(x) is the residual
H(x) is the output, i.e. the true underlying mapping
x is the input to the layer

R(x) = Output - Input = H(x) - x

The output is the residual plus the input:
H(x) = R(x) + x
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
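The identity above can be checked with a minimal sketch in plain Python. Here `residual_fn` is a hypothetical stand-in for the block's conv/bn layers, not the exact code from the snippet above:

```python
# minimal sketch of the identity H(x) = R(x) + x, in plain Python
# `residual_fn` is a hypothetical stand-in for the block's conv/bn layers
def residual_forward(x, residual_fn):
    # compute the residual R(x), then add the input back via the shortcut
    return [r + xi for r, xi in zip(residual_fn(x), x)]

# toy residual that outputs zero: the whole block then reduces to the identity,
# which is why residual blocks make "do nothing" easy for a layer to learn
zero_residual = lambda xs: [0.0 for _ in xs]
print(residual_forward([1.0, 2.0, 3.0], zero_residual))  # [1.0, 2.0, 3.0]
```

When the residual is zero, the block passes its input through unchanged, which matches the intuition that skip connections let deeper layers fall back to the identity.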

&lt;h2 id=&quot;history&quot;&gt;history&lt;/h2&gt;

&lt;p&gt;Residual blocks were first introduced by Microsoft Research in 2015, in the ResNet paper (He et al.).&lt;/p&gt;
</description>
        <pubDate>Tue, 29 Oct 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/10/29/residual-blocks-and-skip-connections/</link>
        <guid isPermaLink="true">https://christina.kim/2019/10/29/residual-blocks-and-skip-connections/</guid>
      </item>
    
      <item>
        <title>Human-Like Decision Making: Document-level Aspect Sentiment Classification via Hierarchical Reinforcement Learning</title>
        <description>&lt;p&gt;paper summary of &lt;a href=&quot;https://arxiv.org/pdf/1910.09260.pdf&quot;&gt;Human-Like Decision Making: Document-level Aspect Sentiment
Classification via Hierarchical Reinforcement Learning&lt;/a&gt;
&lt;!--end_excerpt--&gt;&lt;/p&gt;

&lt;p&gt;Document-level Aspect Sentiment Classification (DASC) is the task of predicting a user’s sentiment polarity toward different aspects of a product in a review. The authors propose using Hierarchical Reinforcement Learning (HRL) for the task, as it’s more interpretable than other neural nets that perform well on DASC.&lt;/p&gt;

&lt;p&gt;HRL is pretty cool because it makes (more) intuitive sense as to how it works compared to other neural net architectures. Instead of the single policy of regular RL, there are different tiers of policies. For DASC, the authors propose a high-level policy to find the relevant clauses and a low-level policy to select sentiment-relevant words inside the selected clauses; a sentiment rating predictor provides the reward signal to both the clause and word selection policies.&lt;/p&gt;
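The two-tier setup can be sketched roughly as follows. This is an illustrative sketch only, with hypothetical names (`high_policy`, `low_policy`, `predictor`); it is not the paper's actual code:

```python
# hypothetical sketch of the two-tier policy loop: a high-level policy picks
# relevant clauses, a low-level policy picks sentiment words inside them, and
# a rating predictor supplies the reward for both (names are illustrative)
def hierarchical_episode(clauses, high_policy, low_policy, predictor):
    selected_words = []
    for clause in clauses:
        if high_policy(clause):            # high-level: keep relevant clauses
            for word in clause.split():
                if low_policy(word):       # low-level: keep sentiment words
                    selected_words.append(word)
    reward = predictor(selected_words)     # reward signal for both policies
    return selected_words, reward

# toy policies: keep clauses about "battery", keep explicit polarity words
words, reward = hierarchical_episode(
    ["the battery life is great", "shipping was slow"],
    high_policy=lambda c: "battery" in c,
    low_policy=lambda w: w in {"great", "terrible"},
    predictor=lambda ws: 1.0 if "great" in ws else -1.0,
)
print(words, reward)  # ['great'] 1.0
```

In the real method the policies are learned networks updated from the predictor's reward; the sketch only shows the control flow that makes the selections interpretable.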

&lt;p&gt;The results of this method were comparable to state-of-the-art methods for aspect sentiment classification. A cool thing I appreciated the authors including was ablation studies for HRL: they broke down how each component of the architecture impacted the results.&lt;/p&gt;

&lt;p&gt;It was interesting to note that negated sentences didn’t work well with this method. I think negation is an interesting problem because if a neural net truly understands a sentence/clause, then I’d expect it to understand negations easily.&lt;/p&gt;
</description>
        <pubDate>Tue, 22 Oct 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/10/22/hrl/</link>
        <guid isPermaLink="true">https://christina.kim/2019/10/22/hrl/</guid>
      </item>
    
      <item>
        <title>Natural Language Understanding</title>
        <description>&lt;p&gt;Natural Language Understanding has been a focus of AI since the birth of the field. The 60s and 70s saw a burst of knowledge bases, ontologies, and math word problem solvers. NLU is considered an AI-hard problem (i.e., when we figure this out, we will have figured out general artificial intelligence).&lt;/p&gt;

&lt;p&gt;The tricky part of building a machine reading-comprehension system is that it requires a mostly-accurate model of the world (I say mostly accurate because it’s unclear how accurate humans’ models of the world are) and the ability to generalize to new contexts and situations reasonably well. The ability to reason and make pretty good predictions in new cases, given our prior knowledge and understanding of the world, is what makes us great general-purpose learners (and probably also what has helped us exist till now).&lt;/p&gt;

&lt;p&gt;These are the current types of tasks and benchmarks that test for commonsense knowledge and reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Textual Entailment&lt;/strong&gt;
Textual entailment (TE) in natural language processing is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively.
Examples of tasks:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;RTE Challenges (Dagan, Glickman, &amp;amp; Magnini, 2005)&lt;/li&gt;
  &lt;li&gt;Story Cloze Test (Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli, &amp;amp; Allen, 2016)&lt;/li&gt;
  &lt;li&gt;SWAG (Zellers, Bisk, Schwartz, &amp;amp; Choi, 2018)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question Answering&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;CommonsenseQA (Talmor, Herzig, Lourie, &amp;amp; Berant, 2018)&lt;/li&gt;
  &lt;li&gt;Winograd Schema Challenge (Levesque, 2011)&lt;/li&gt;
  &lt;li&gt;GLUE (Wang, Singh, Michael, Hill, Levy, &amp;amp; Bowman, 2018)&lt;/li&gt;
  &lt;li&gt;Event2Mind (Rashkin, Sap, Allaway, Smith, &amp;amp; Choi, 2018b)&lt;/li&gt;
  &lt;li&gt;bAbI (Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin, &amp;amp; Mikolov, 2015)&lt;/li&gt;
  &lt;li&gt;SuperGLUE (Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy, &amp;amp; Bowman, 2019)&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Tue, 18 Jun 2019 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2019/06/18/natural-language-understanding/</link>
        <guid isPermaLink="true">https://christina.kim/2019/06/18/natural-language-understanding/</guid>
      </item>
    
      <item>
        <title>Systems Sharing</title>
        <description>&lt;p&gt;A friend asked me the other day, “What are some “systems” that you have incorporated in your life that you felt have been super good ROI?”&lt;/p&gt;

&lt;p&gt;Here was my response:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Weekly review
    &lt;ul&gt;
      &lt;li&gt;Every week I reflect on what went well, what didn’t go according to plan, and how I could have prevented that. I would highly recommend doing something like this if you don’t already; it’s a really good way to make sure you’re noticing the things that are actually happening in your life and work. During this I try to be as objective as possible and not pass judgement on myself, which is the best way to make sure you’re being honest about why you did something. Attributing negative feelings to a behavior makes you less incentivized to notice it again in the future: no one wants to feel bad, and people will go to pretty fascinating lengths to build blind spots for things they want to avoid thinking about. I find that it also makes it easier to implement new habits and goals. This is a newer system I’ve adopted, but so far it has been extremely rewarding.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Scheduled deep work time
    &lt;ul&gt;
      &lt;li&gt;I block off mornings as Do Not Disturb blocks of work. No meetings, no notifications. Consistency is important to me, and having scheduled deep work time allows me to get solid hours of programming in, versus scattered, interrupted work.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Non-violent communication
    &lt;ul&gt;
      &lt;li&gt;I have been practicing (but not yet mastering) non-violent communication. Not really sure if this counts as a “system” but seems worthwhile to think about consistently.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Social spreadsheet
    &lt;ul&gt;
      &lt;li&gt;This sounds creepy but I promise it’s only used for good 🙂 I keep a spreadsheet of people and interactions. This helps me remember and keep in contact with folks, and helps me to be a more intentional friend/ally.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Fri, 27 Apr 2018 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2018/04/27/systems-sharing/</link>
        <guid isPermaLink="true">https://christina.kim/2018/04/27/systems-sharing/</guid>
      </item>
    
      <item>
        <title>Big Habits</title>
        <description>&lt;p&gt;&lt;em&gt;Disclaimer: This post was written in less than 30 mins. I’m writing this for the sake of exploring an idea and am not convinced this is a good idea. Take this as a stream of consciousness.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’ve been experimenting with doing the extreme version of a habit to make it stickier. I have this idea that instead of doing the easiest version (do 1 pushup a day), doing the harder, extreme version (100 pushups a day) makes an idea/habit stickier. The idea behind doing an easy version of a habit you want to adopt is that it makes it lower friction and simple, and eventually you will be able to work more of it into your daily life. This makes sense to me - but I think that the way to make a habit seem achievable and lower friction is actually doing the opposite. By pushing yourself to the other end of the spectrum, you learn how doable it is and you realize you’re more than capable of actually implementing a more reasonable version of your habit. (I may also enjoy starting off habits like this because I think over-indexing is powerful and I generally enjoy exploring my limits 🙂)&lt;/p&gt;
</description>
        <pubDate>Sun, 22 Apr 2018 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2018/04/22/big-habits/</link>
        <guid isPermaLink="true">https://christina.kim/2018/04/22/big-habits/</guid>
      </item>
    
      <item>
        <title>Signals</title>
        <description>&lt;p&gt;Today we had Robin Hanson come speak at The Archive. His latest book, &lt;em&gt;The Elephant in the Brain&lt;/em&gt;, exposes the hidden motivations and desires in our lives. Hanson posits that in our interactions and decisions we try to signal certain things to others. Universities are an example of signaling intelligence, even if you didn’t learn that much. Even changing the conversation can be signaling.&lt;/p&gt;

&lt;p&gt;The larger ideas from the book include a) we don’t acknowledge our real motivations for our externally-inspired actions and b) being able to recognize the latent motivations and what people actually want is critical to creating effective policies.&lt;/p&gt;

&lt;p&gt;I left his talk with two thoughts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How hard would it be to train yourself to be able to pick up hidden motivations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;His book and his talk reminded me of my experience in undergrad with HCI research and need-finding. I remember conducting countless user interviews and still not being able to come to a definite conclusion of what was missing, and what to design and build. Even interviewing “super/power” users was not useful. Other than becoming more self-aware, I wonder if there are easier hacks to discover the dissonance between what we say we want and what we actually want. I’m imagining this to work in a similar way as to how we can hear when a note is out of tune.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Given Hanson’s arguments, would publicly shaming bad behavior (e.g. predatory behavior) help create a more net positive world?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It seems to me that publicly shaming behavior would be an effective way to get behavioral outcomes. Disclaimer: this is a half-baked thought, and I will dig into it more when I have a second.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Humans aren’t what they pretend to be. But what they actually are is spectacular” — Hanson&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;What am I signaling by writing about signaling?&lt;/em&gt; 🤔&lt;/p&gt;
</description>
        <pubDate>Fri, 20 Apr 2018 00:00:00 +0000</pubDate>
        <link>https://christina.kim/2018/04/20/signals/</link>
        <guid isPermaLink="true">https://christina.kim/2018/04/20/signals/</guid>
      </item>
    

    
          <item>
            <title></title>
            <description>&lt;p&gt;researcher at openai, currently on the mid-training team. previously on post-training and reinforcement learning.&lt;/p&gt;

&lt;p&gt;i’ve worked on &lt;a href=&quot;https://openai.com/research/webgpt&quot;&gt;webgpt&lt;/a&gt;, &lt;a href=&quot;https://openai.com/blog/chatgpt&quot;&gt;chatgpt&lt;/a&gt;, &lt;a href=&quot;https://openai.com/blog/chatgpt-plugins#browsing&quot;&gt;chatgpt with browsing&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2303.08774&quot;&gt;gpt-4&lt;/a&gt;&lt;/p&gt;
</description>
            <link>https://christina.kim/index</link>
          </item>

  </channel>
</rss>