Wilson’s Homepage

Predicting Phonemes with BERT

2022-04-23T00:00:00+10:00

Our team at Bookbot is currently developing a grapheme-to-phoneme Python package for Bahasa Indonesia. The package is heavily inspired by its English counterpart, g2p. Much of our design and methodology is borrowed from that library, most notably the steps to predict phonemes. The English g2p uses the following algorithm (cf. g2p’s README):

1. Spells out arabic numbers and some currency symbols. (e.g. $200 -> two hundred dollars) (This is borrowed from Keith Ito’s code)
2. Attempts to retrieve the correct pronunciation for heteronyms based on their POS.
3. Looks up The CMU Pronouncing Dictionary for non-homographs.
4. For OOVs, we predict their pronunciations using our neural net model.
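The hand-off between the dictionary lookup and the neural fallback (the last two steps) can be sketched roughly as follows. This is only an illustration: `pronounce`, `toy_lexicon`, and `toy_model` are hypothetical names, and the real g2p library also handles number spelling and heteronyms before reaching this point.

```python
def pronounce(word, lexicon, predict_oov):
    """Look the word up in the lexicon; fall back to a model for OOV words."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    # Out-of-vocabulary: delegate to whatever predictor is plugged in.
    return predict_oov(key)


# Toy stand-ins, purely for illustration.
toy_lexicon = {"bebek": "bebek", "delapan": "dəlapan"}


def toy_model(word):
    return "<predicted:" + word + ">"


in_vocab = pronounce("Bebek", toy_lexicon, toy_model)
out_of_vocab = pronounce("wilson", toy_lexicon, toy_model)
```

Keeping the fallback as a plain callable means the predictor (GRU, LSTM, or anything else) can be swapped without touching the lookup logic.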

Steps 1-3 were relatively easy to develop, given that we were able to find an online Bahasa Indonesia lexicon from ipa-dict. Step 4, however, was particularly challenging. The authors of g2p used a recurrent sequence2sequence GRU that takes graphemes as input and outputs phonemes. This approach is useful because we would not need to determine the conversion rules by hand. The neural net would do the heavy lifting of predicting pronunciations for unseen words.

Seeing their success, we attempted a similar approach. That is, we trained a recurrent sequence2sequence LSTM on the aforementioned lexicon, which you can find here. As expected, the model worked great for words that are relatively simple and words whose sub-words may have been in the training set. It also achieved a validation accuracy of over 97% – and so we thought it would suffice.

We then converted the model to ONNX for deployment purposes and soon ended up with a working prototype g2p library, using the exact same approach as the English g2p. After playing around with it further, we quickly found an issue with the seq2seq approach. Though it performed well on the held-out validation set, it quickly crumbled when given strikingly different words, for instance names of people or places. On the one hand, this is not surprising given that its training data is relatively small. But we thought we could do better.

First, we realized that the phonemes in the IPA format used by our data were not too different from their corresponding graphemes. Here are a few examples:

sampingnya = sampiŋɲa
tayangan = tajaŋan
bepercikan = bəpərtʃikan
deduktif = deduʔtif

You may notice that there are simple mapping rules that we could infer by hand. Indeed, we found the following rules to be sufficient:

PHONETIC_MAPPING = {
    "ny": "ɲ",
    "ng": "ŋ",
    "c": "tʃ",
    "'": "ʔ",
    "aa": "aʔa",
    "ii": "iʔi",
    "oo": "oʔo",
    "əə": "əʔə",
    "j": "dʒ",
    "y": "j",
    "q": "k"
}

CONSONANTS = "bdfghjklmnprstvwxɲ"

def g2p(text):
    if text.endswith("k"):
        text = text[:-1] + "ʔ"

    for g, p in PHONETIC_MAPPING.items():
        text = text.replace(g, p)

    for c in CONSONANTS:
        text = text.replace(f"k{c}", f"ʔ{c}")

    return text

The code is written in Python, with very basic if-this-then-that rules. This approach made a lot of sense, given that changes from a grapheme to an IPA phoneme aren’t too drastic, at least in our case. A sequence2sequence model could definitely do the same, but it would probably need a larger and more diverse dataset for training.
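One detail worth spelling out is that the replacements are applied sequentially, so their order matters: the rule for j has to run before the rule for y, otherwise the j produced from y would be rewritten again. The sketch below isolates just those two rules; `apply_rules` is a hypothetical helper written for this illustration, not part of the package.

```python
def apply_rules(text, rules):
    # Apply each (grapheme, phoneme) replacement in the order given.
    for grapheme, phoneme in rules:
        text = text.replace(grapheme, phoneme)
    return text


j_then_y = [("j", "dʒ"), ("y", "j")]  # same relative order as PHONETIC_MAPPING
y_then_j = [("y", "j"), ("j", "dʒ")]  # reversed, for comparison only

intended = apply_rules("jaya", j_then_y)
clobbered = apply_rules("jaya", y_then_j)
print(intended)   # dʒaja
print(clobbered)  # dʒadʒa
```

This relies on Python dictionaries preserving insertion order (guaranteed since Python 3.7), which is why PHONETIC_MAPPING can be a plain dict.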

But that doesn’t mean that the English g2p approach using a GRU was ineffective! Notice that their phonemes are in the ARPAbet format, which is significantly more complicated than the IPA format we used. Their approach made complete sense because of the change in text domains. This is the same reason why translation tasks are better off using a sequence2sequence neural net than hand-written rules. It would take ages, if it is possible at all, to code up all the rules of translation between two languages, but a recurrent model like a GRU could automatically learn this “hidden translation rule” if there was one.

A problem with the letter E

But there was a major issue with the rule-based approach we took: there are 3 ways to pronounce the letter e in Indonesian, according to KBBI. The lexicon that we used further limited the pronunciation to only two: a close-mid front unrounded vowel e or a mid central vowel ə. For example, the word bebek (meaning: duck) is transcribed as bebek, while the word delapan (meaning: eight) is transcribed as dəlapan. Sometimes a word has more than one e, pronounced in both ways, like the word mereka (meaning: they), which is pronounced as məreka. You can hear how they sound through the Google Translate TTS here, here, and here.

To the best of our knowledge, there isn’t a linguistic rule that determines exactly how a particular e should sound. KBBI might have phonetic assistance for this purpose, particularly for homographs. Non-homographs, however, do not have phonetic assistance. I personally think that this is a huge problem, especially for new learners of the language. Native speakers like me find this distinction between e’s natural, but I can’t imagine being in the shoes of someone learning the language.

To be fair, the Indonesian language isn’t like the English language, where there are “native speakers” whom we can consult. The Indonesian language is a lingua franca, a standardized version of Malay, and was largely influenced by Dutch and many regional languages such as Javanese, Sundanese, etc. There might not necessarily be a definitive “correct” way to pronounce the letter e of a given word, because in order to do so, we need to consult the origin of the word. Furthermore, different regions of Indonesia may pronounce the same word differently, due to their dialect. You can read more about this here and here. Both discussions are in Indonesian, but Google Translate should do the job.

In any case, our g2p package needs a way to distinguish e’s from ə’s. Once that distinction has been made, we can simply pass it to the hand-written g2p algorithm that does the rest of the job.
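To make the ambiguity concrete, the small sketch below enumerates every reading a word could have if each e is allowed to be either e or ə. The helper name `candidate_readings` is made up for this illustration; the point is simply that a word with n e’s has 2^n candidates, and only one of them is in the lexicon.

```python
from itertools import product


def candidate_readings(word):
    """Return every way of reading each 'e' in the word as 'e' or 'ə'."""
    slots = [i for i, ch in enumerate(word) if ch == "e"]
    readings = []
    for choice in product("eə", repeat=len(slots)):
        chars = list(word)
        for i, vowel in zip(slots, choice):
            chars[i] = vowel
        readings.append("".join(chars))
    return readings


readings = candidate_readings("mereka")
print(readings)  # ['mereka', 'merəka', 'məreka', 'mərəka']
```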

Formulating the Problem

At first, we thought a sequence2sequence model could do the job just fine. We could simply train on pairs of data like:

bebek & bebek
delapan & dəlapan
mereka & məreka

and then simply pass their output to the hand-written g2p rules. But after more thinking, we recalled the pitfalls of this method and concluded that it would suffer from the same issues: bad OOV performance, incorrect output length, and so on. And so we formulated the problem differently.

Instead of treating the phonetic prediction as a generation problem, why not treat it as a de-masking problem? That is, instead of training an autoregressive model like an LSTM, why not train an autoencoding model like BERT instead?

Normally, a BERT model is trained as a word-level masked language model; think of a fill-in-the-blanks problem. Given the context:

The weather is good today, the ___ is bright and blue.

Have a ____ and relax.

You can probably infer what those blanks should be. And that is exactly how BERT is trained. It sees the neighbors of the masked (emptied) word, and makes a prediction based on them. Realizing this, I saw a very intriguing possibility to implement the same mechanics for our problem with the letter e. That is, frame the problem as:

Context: b _ b _ k, Output: b e b e k
Context: d _ l a p a n, Output: d ə l a p a n
Context: m _ r _ k a, Output: m ə r e k a

and so on. The hope is that, given the neighbouring letters, the BERT model will be able to infer the right phoneme of e to use.
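The context/output pairs above can be produced mechanically from a phonemic transcription. Here is a minimal sketch, assuming the transcription only uses e and ə for the letter e; `make_example` and the underscore placeholder are illustrative choices, not the names used in the actual training code.

```python
def make_example(phonemes, mask="_"):
    """Build a (context, output) pair by blanking out every e and ə."""
    output = " ".join(phonemes)
    context = " ".join(mask if ch in "eə" else ch for ch in phonemes)
    return context, output


pairs = [make_example(p) for p in ["bebek", "dəlapan", "məreka"]]
for context, output in pairs:
    print("Context:", context, "| Output:", output)
```

Because both e and ə collapse to the same blank, the model never sees which vowel was there and has to rely on the surrounding letters alone.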

As far as my research goes, I have not found anyone else using the same approach. I don’t know whether the idea is simply a bad one on paper, so I gave it a try because, why not?

Code

Dataset

This is the training dataset that I ended up with. But recall, we need to mask out the e’s later and let the model predict the suitable phonetic e. Again, this dataset originates from the ipa-dict which we pre-processed and modified. You can find our version here.

	word	target
0	- - n y a	- - n y a
1	- a n d a	- a n d a
2	- b a u r	- b a u r
3	- b e l a s	- b ə l a s
4	- c o m p e n g	- c o m p e n g
…	…	…
27547	z o h o r	z o h o r
27548	z o n a	z o n a
27549	z u h u r	z u h u r
27550	z u l k a r n a i n	z u l k a r n a i n
27551	z u r i a t	z u r i a t

Character-Level Masked Language Model

Now, I have never written a BERT Masked Language Model from scratch, so I followed a very nice guide from Keras, written by Ankur Singh. It’s very clear and easily customizable to our use case, so I went with it.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from dataclasses import dataclass
import pandas as pd
import numpy as np

@dataclass
class Config:
    MAX_LEN = 32
    BATCH_SIZE = 128
    LR = 0.001
    VOCAB_SIZE = 32
    EMBED_DIM = 128
    NUM_HEAD = 8
    FF_DIM = 128
    NUM_LAYERS = 2

config = Config()

Tokenization and Preprocessing

The tutorial used a Keras TextVectorization layer for tokenization purposes, which I also find to be easy to use and customize. The only change I made was simplifying the text standardization function.

def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        standardize=lambda input_data: tf.strings.lower(input_data),
        output_sequence_length=max_seq,
    )
    vectorize_layer.adapt(texts)

    vocab = vectorize_layer.get_vocabulary()

    vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer

vectorize_layer = get_vectorize_layer(
    df.target.values.tolist(),
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)
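As a rough mental model of the vocabulary this produces, here is a pure-Python sketch. It is not the Keras implementation: `build_vocab` is a hypothetical helper, and the alphabetical tie-break between equally frequent tokens is an assumption made only to keep the example deterministic. What it does mirror is the layout: index 0 for padding, index 1 for unknown tokens, then characters by descending frequency, and the mask token last.

```python
from collections import Counter


def build_vocab(texts, vocab_size, special_tokens=("[mask]",)):
    # Count whitespace-separated, lower-cased tokens (here: single characters).
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Most frequent first; ties broken alphabetically (an assumption).
    ordered = [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    # Reserve room for padding, [UNK], and the special tokens.
    budget = vocab_size - 2 - len(special_tokens)
    vocab = ["", "[UNK]"] + ordered[:budget] + list(special_tokens)
    return {tok: i for i, tok in enumerate(vocab)}


toy_token2id = build_vocab(["m ə r e k a", "b e b e k"], vocab_size=32)
```

With this layout the mask token always gets the highest id, which is consistent with the mask id of 30 that shows up in the example output further down.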

The masking step is where most of the changes were made. First, instead of masking characters at random, only a “hard-mask” was applied on both e and ə tokens, completely masking them out in every text. This meant that the 15% BERT masking, 90%/10% random masking, as well as the 10% random swaps were all removed. I found that masking other characters which are not e’s gave worse performance. I suspect that this just made the problem even harder for the model to learn since there is very minimal context.

# Get mask token id for masked language model
mask_token_id = vectorize_layer(["[mask]"]).numpy()[0][0]
e1_token_id = vectorize_layer(["e"]).numpy()[0][0]
e2_token_id = vectorize_layer(["ə"]).numpy()[0][0]

def encode(texts):
    encoded_texts = vectorize_layer(texts)
    return encoded_texts.numpy()

def get_masked_input_and_labels(encoded_texts):
    # BERT masking
    inp_mask = np.random.rand(*encoded_texts.shape) < 0
    # Do not mask special tokens
    inp_mask[encoded_texts <= 2] = False
    # Force mask e's
    inp_mask[encoded_texts == e1_token_id] = True
    inp_mask[encoded_texts == e2_token_id] = True
    # Set targets to -1 by default, it means ignore
    labels = -1 * np.ones(encoded_texts.shape, dtype=int)
    # Set labels for masked tokens
    labels[inp_mask] = encoded_texts[inp_mask]

    # Prepare input
    encoded_texts_masked = np.copy(encoded_texts)
    encoded_texts_masked[inp_mask] = mask_token_id
    # note: we don't randomly change chars and apply all masks

    # Prepare sample_weights to pass to .fit() method
    sample_weights = np.ones(labels.shape)
    sample_weights[labels == -1] = 0

    # y_labels would be same as encoded_texts i.e input tokens
    y_labels = np.copy(encoded_texts)

    return encoded_texts_masked, y_labels, sample_weights

Here’s an example of an input, label, and weights array, respectively. Notice that at the index of the letter e, the input is masked and has the mask token id of 30, with the target token id of 18 and 4, corresponding to e and ə, respectively. Also notice that the weights default to 0 for unmasked tokens and 1 for masked tokens. This is to facilitate training. Recall that the model will only be “graded” by its performance on the blanks.

get_masked_input_and_labels(encode("m e r d ə k a"))

(array([ 8, 30,  6, 16, 30,  7,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([ 8, 18,  6, 16,  4,  7,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]))
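The same hard-masking logic can be restated without NumPy for a single sequence, which makes it easy to check against the arrays above. This is just an illustrative sketch: `hard_mask` is a made-up name, and the ids (30 for the mask, 18 and 4 for e and ə) are copied from the example output rather than computed.

```python
def hard_mask(token_ids, e_ids, mask_id):
    """Mask every e/ə token and weight only those positions in the loss."""
    masked, weights = [], []
    for tok in token_ids:
        is_e = tok in e_ids
        masked.append(mask_id if is_e else tok)
        weights.append(1.0 if is_e else 0.0)
    # Labels are simply the original ids; the weights decide what is graded.
    return masked, list(token_ids), weights


# "m e r d ə k a" followed by two padding tokens, using the ids shown above.
ids = [8, 18, 6, 16, 4, 7, 2, 0, 0]
masked, labels, weights = hard_mask(ids, e_ids={18, 4}, mask_id=30)
```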

# Prepare data for masked language model
x_all = encode(df.target.values)
x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(x_all)

mlm_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_train, y_masked_labels, sample_weights)
)
mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)

BERT

There’s really no difference between the code written in the Keras guide and the one I have here. I’ll just note how elegant Keras code is for a model like BERT. In any case, this model is exactly the same as if we were to train a word-level masked language model. This time, the input tokens are just characters instead of words. Same old objective, same architecture, and so on.

def bert_module(query, key, value, i):
    # Multi headed self-attention
    attention_output = layers.MultiHeadAttention(
        num_heads=config.NUM_HEAD,
        key_dim=config.EMBED_DIM // config.NUM_HEAD,
        name="encoder_{}/multiheadattention".format(i),
    )(query, key, value)
    attention_output = layers.Dropout(0.1, name="encoder_{}/att_dropout".format(i))(
        attention_output
    )
    attention_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}/att_layernormalization".format(i)
    )(query + attention_output)

    # Feed-forward layer
    ffn = keras.Sequential(
        [
            layers.Dense(config.FF_DIM, activation="relu"),
            layers.Dense(config.EMBED_DIM),
        ],
        name="encoder_{}/ffn".format(i),
    )
    ffn_output = ffn(attention_output)
    ffn_output = layers.Dropout(0.1, name="encoder_{}/ffn_dropout".format(i))(
        ffn_output
    )
    sequence_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}/ffn_layernormalization".format(i)
    )(attention_output + ffn_output)
    return sequence_output


def get_pos_encoding_matrix(max_len, d_emb):
    pos_enc = np.array(
        [
            [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)]
            if pos != 0
            else np.zeros(d_emb)
            for pos in range(max_len)
        ]
    )
    pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2])  # dim 2i
    pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2])  # dim 2i+1
    return pos_enc


loss_fn = keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE
)
loss_tracker = tf.keras.metrics.Mean(name="loss")


class MaskedLanguageModel(tf.keras.Model):
    def train_step(self, inputs):
        if len(inputs) == 3:
            features, labels, sample_weight = inputs
        else:
            features, labels = inputs
            sample_weight = None

        with tf.GradientTape() as tape:
            predictions = self(features, training=True)
            loss = loss_fn(labels, predictions, sample_weight=sample_weight)

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Compute our own metrics
        loss_tracker.update_state(loss, sample_weight=sample_weight)

        # Return a dict mapping metric names to current value
        return {"loss": loss_tracker.result()}

    @property
    def metrics(self):
        return [loss_tracker]


def create_masked_language_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)

    word_embeddings = layers.Embedding(
        config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding"
    )(inputs)
    position_embeddings = layers.Embedding(
        input_dim=config.MAX_LEN,
        output_dim=config.EMBED_DIM,
        weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)],
        name="position_embedding",
    )(tf.range(start=0, limit=config.MAX_LEN, delta=1))
    embeddings = word_embeddings + position_embeddings

    encoder_output = embeddings
    for i in range(config.NUM_LAYERS):
        encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)

    mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")(
        encoder_output
    )
    mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model")

    optimizer = keras.optimizers.Adam(learning_rate=config.LR)
    mlm_model.compile(optimizer=optimizer)
    return mlm_model

id2token = dict(enumerate(vectorize_layer.get_vocabulary()))
token2id = {y: x for x, y in id2token.items()}

bert_masked_model = create_masked_language_bert_model()
bert_masked_model.summary()

Model: "masked_bert_model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_1 (InputLayer)           [(None, 32)]         0           []

 word_embedding (Embedding)     (None, 32, 128)      4096        ['input_1[0][0]']

 tf.__operators__.add (TFOpLamb  (None, 32, 128)     0           ['word_embedding[0][0]']
 da)

 encoder_0/multiheadattention (  (None, 32, 128)     66048       ['tf.__operators__.add[0][0]',
 MultiHeadAttention)                                              'tf.__operators__.add[0][0]',
                                                                  'tf.__operators__.add[0][0]']

 encoder_0/att_dropout (Dropout  (None, 32, 128)     0           ['encoder_0/multiheadattention[0]
 )                                                               [0]']

 tf.__operators__.add_1 (TFOpLa  (None, 32, 128)     0           ['tf.__operators__.add[0][0]',
 mbda)                                                            'encoder_0/att_dropout[0][0]']

 encoder_0/att_layernormalizati  (None, 32, 128)     256         ['tf.__operators__.add_1[0][0]']
 on (LayerNormalization)

 encoder_0/ffn (Sequential)     (None, 32, 128)      33024       ['encoder_0/att_layernormalizatio
                                                                 n[0][0]']

 encoder_0/ffn_dropout (Dropout  (None, 32, 128)     0           ['encoder_0/ffn[0][0]']
 )

 tf.__operators__.add_2 (TFOpLa  (None, 32, 128)     0           ['encoder_0/att_layernormalizatio
 mbda)                                                           n[0][0]',
                                                                  'encoder_0/ffn_dropout[0][0]']

 encoder_0/ffn_layernormalizati  (None, 32, 128)     256         ['tf.__operators__.add_2[0][0]']
 on (LayerNormalization)

 encoder_1/multiheadattention (  (None, 32, 128)     66048       ['encoder_0/ffn_layernormalizatio
 MultiHeadAttention)                                             n[0][0]',
                                                                  'encoder_0/ffn_layernormalizatio
                                                                 n[0][0]',
                                                                  'encoder_0/ffn_layernormalizatio
                                                                 n[0][0]']

 encoder_1/att_dropout (Dropout  (None, 32, 128)     0           ['encoder_1/multiheadattention[0]
 )                                                               [0]']

 tf.__operators__.add_3 (TFOpLa  (None, 32, 128)     0           ['encoder_0/ffn_layernormalizatio
 mbda)                                                           n[0][0]',
                                                                  'encoder_1/att_dropout[0][0]']

 encoder_1/att_layernormalizati  (None, 32, 128)     256         ['tf.__operators__.add_3[0][0]']
 on (LayerNormalization)

 encoder_1/ffn (Sequential)     (None, 32, 128)      33024       ['encoder_1/att_layernormalizatio
                                                                 n[0][0]']

 encoder_1/ffn_dropout (Dropout  (None, 32, 128)     0           ['encoder_1/ffn[0][0]']
 )

 tf.__operators__.add_4 (TFOpLa  (None, 32, 128)     0           ['encoder_1/att_layernormalizatio
 mbda)                                                           n[0][0]',
                                                                  'encoder_1/ffn_dropout[0][0]']

 encoder_1/ffn_layernormalizati  (None, 32, 128)     256         ['tf.__operators__.add_4[0][0]']
 on (LayerNormalization)

 mlm_cls (Dense)                (None, 32, 32)       4128        ['encoder_1/ffn_layernormalizatio
                                                                 n[0][0]']

==================================================================================================
Total params: 207,392
Trainable params: 207,392
Non-trainable params: 0
__________________________________________________________________________________________________
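The parameter counts in the summary can be reproduced by hand from the Config values. The sketch below assumes that each attention block consists of four dense projections (query, key, value, output) from EMBED_DIM to EMBED_DIM, which is consistent with the 66,048 reported per attention layer; it only covers the layers listed in the summary.

```python
EMBED_DIM, FF_DIM, VOCAB_SIZE, NUM_LAYERS = 128, 128, 32, 2


def dense(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out


word_embedding = VOCAB_SIZE * EMBED_DIM                      # 4,096
attention = 4 * dense(EMBED_DIM, EMBED_DIM)                  # 66,048
layer_norm = 2 * EMBED_DIM                                   # 256 (gamma and beta)
ffn = dense(EMBED_DIM, FF_DIM) + dense(FF_DIM, EMBED_DIM)    # 33,024
encoder_block = attention + layer_norm + ffn + layer_norm    # 99,584
mlm_head = dense(EMBED_DIM, VOCAB_SIZE)                      # 4,128

total = word_embedding + NUM_LAYERS * encoder_block + mlm_head
print(total)  # 207392
```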

Train!

What’s left is just for us to call .fit(), because this is Keras. The Keras guide used the Adam optimizer, which generally works well for language models.

bert_masked_model.fit(
    mlm_ds, epochs=100, callbacks=[keras.callbacks.TensorBoard(log_dir="./logs")]
)

bert_masked_model.save("bert_mlm.h5")

Epoch 1/100
216/216 [==============================] - 8s 13ms/step - loss: 0.4276
Epoch 2/100
216/216 [==============================] - 3s 13ms/step - loss: 0.3865
Epoch 3/100
216/216 [==============================] - 3s 12ms/step - loss: 0.3320
Epoch 4/100
216/216 [==============================] - 3s 12ms/step - loss: 0.3048
Epoch 5/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2887
Epoch 6/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2870
Epoch 7/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2827
Epoch 8/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2795
Epoch 9/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2939
Epoch 10/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2751
Epoch 11/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2743
Epoch 12/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2678
Epoch 13/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2671
Epoch 14/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2609
Epoch 15/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2619
Epoch 16/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2681
Epoch 17/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2689
Epoch 18/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2582
Epoch 19/100
216/216 [==============================] - 4s 16ms/step - loss: 0.2526
Epoch 20/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2559
Epoch 21/100
216/216 [==============================] - 3s 14ms/step - loss: 0.2506
Epoch 22/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2548
Epoch 23/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2584
Epoch 24/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2502
Epoch 25/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2484
Epoch 26/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2448
Epoch 27/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2502
Epoch 28/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2471
Epoch 29/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2471
Epoch 30/100
216/216 [==============================] - 4s 20ms/step - loss: 0.2422
Epoch 31/100
216/216 [==============================] - 5s 22ms/step - loss: 0.2412
Epoch 32/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2398
Epoch 33/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2500
Epoch 34/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2445
Epoch 35/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2407
Epoch 36/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2376
Epoch 37/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2351
Epoch 38/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2363
Epoch 39/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2377
Epoch 40/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2351
Epoch 41/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2467
Epoch 42/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2408
Epoch 43/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2332
Epoch 44/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2355
Epoch 45/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2371
Epoch 46/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2353
Epoch 47/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2293
Epoch 48/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2270
Epoch 49/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2258
Epoch 50/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2255
Epoch 51/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2240
Epoch 52/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2309
Epoch 53/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2336
Epoch 54/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2297
Epoch 55/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2279
Epoch 56/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2245
Epoch 57/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2239
Epoch 58/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2225
Epoch 59/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2237
Epoch 60/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2213
Epoch 61/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2210
Epoch 62/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2186
Epoch 63/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2187
Epoch 64/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2191
Epoch 65/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2165
Epoch 66/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2172
Epoch 67/100
216/216 [==============================] - 5s 23ms/step - loss: 0.2182
Epoch 68/100
216/216 [==============================] - 4s 20ms/step - loss: 0.2143
Epoch 69/100
216/216 [==============================] - 5s 23ms/step - loss: 0.2171
Epoch 70/100
216/216 [==============================] - 4s 19ms/step - loss: 0.2096
Epoch 71/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2122
Epoch 72/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2169
Epoch 73/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2134
Epoch 74/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2117
Epoch 75/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2094
Epoch 76/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2123
Epoch 77/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2134
Epoch 78/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2117
Epoch 79/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2064
Epoch 80/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2111
Epoch 81/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2130
Epoch 82/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2089
Epoch 83/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2063
Epoch 84/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2042
Epoch 85/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2032
Epoch 86/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2071
Epoch 87/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2062
Epoch 88/100
216/216 [==============================] - 3s 13ms/step - loss: 0.1999
Epoch 89/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2021
Epoch 90/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2019
Epoch 91/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2056
Epoch 92/100
216/216 [==============================] - 4s 16ms/step - loss: 0.2062
Epoch 93/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2006
Epoch 94/100
216/216 [==============================] - 3s 13ms/step - loss: 0.2034
Epoch 95/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2003
Epoch 96/100
216/216 [==============================] - 3s 12ms/step - loss: 0.2005
Epoch 97/100
216/216 [==============================] - 3s 13ms/step - loss: 0.1970
Epoch 98/100
216/216 [==============================] - 3s 13ms/step - loss: 0.1951
Epoch 99/100
216/216 [==============================] - 3s 13ms/step - loss: 0.1960
Epoch 100/100
216/216 [==============================] - 4s 20ms/step - loss: 0.1991
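To get a feel for what these loss values mean, here is a small sketch under the assumption that the reported number is the mean cross-entropy over masked positions only (which is what the 0/1 sample weights are meant to achieve). The function name and the toy probabilities are made up for illustration.

```python
import math


def masked_mean_nll(true_token_probs, weights):
    """Mean negative log-likelihood, counting only positions with weight 1."""
    total = sum(-math.log(p) * w for p, w in zip(true_token_probs, weights))
    return total / sum(weights)


# Probability the model assigned to the correct token at each position.
probs = [0.99, 0.9, 0.5, 0.7]
weights = [0.0, 1.0, 0.0, 1.0]  # only positions 1 and 3 were masked
toy_loss = masked_mean_nll(probs, weights)

# Under the same assumption, a loss of 0.1991 corresponds to a
# geometric-mean probability of roughly 0.82 on the correct vowel.
typical_prob = math.exp(-0.1991)
```

Note that this is a training loss; no held-out set is involved in the run above.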

Inference

It’s also quite simple to perform inference once the model has finished training. We first need to load the model and its weights.

# Load pretrained bert model
mlm_model = keras.models.load_model(
    "bert_mlm.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

We then write an inference function which we can reuse later. The way it works is also quite clear. Tokenize the input characters as integers, while masking the e’s to be predicted. Then, pad the inputs to the maximum sequence length (in our case 32) and feed the input array to the BERT model. Decoding the output involves finding the locations of the masked inputs, taking the most probable guess at each, and replacing the masked tokens with that prediction. Finally, we join the tokens once they are all assembled.

def inference(sequence):
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    tokens = [token2id[c] for c in sequence.split()]
    pad = [token2id[""] for _ in range(config.MAX_LEN - len(tokens))]

    tokens = tokens + pad
    input_ids = tf.convert_to_tensor(np.array([tokens]))
    prediction = mlm_model.predict(input_ids)

    # find masked idx token
    masked_index = np.where(input_ids == mask_token_id)
    masked_index = masked_index[1]

    # get prediction at those masked index only
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace mask with predicted token
    for i, idx in enumerate(masked_index):
        tokens[idx] = predicted_ids[i]

    return "".join([id2token[t] for t in tokens if t != 0])

inference("menyebabkannya")

'mənyəbabkannya'
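The decoding step on its own can be illustrated without a trained model. The sketch below uses a toy vocabulary and hand-written probabilities, so every id and number in it is hypothetical; `fill_masks` simply mirrors the argmax-and-replace logic described above.

```python
def fill_masks(token_ids, probabilities, mask_id, id2token, pad_id=0):
    """Replace each mask with its most probable token, then join the characters."""
    decoded = []
    for token_id, distribution in zip(token_ids, probabilities):
        if token_id == mask_id:
            # Index of the highest probability at this position.
            token_id = max(range(len(distribution)), key=distribution.__getitem__)
        if token_id != pad_id:
            decoded.append(id2token[token_id])
    return "".join(decoded)


toy_id2token = {0: "", 1: "[UNK]", 2: "a", 3: "ə", 4: "e", 5: "k", 6: "m", 7: "r", 8: "[mask]"}
toy_ids = [6, 8, 7, 8, 5, 2, 0, 0]  # m [mask] r [mask] k a, plus padding

# One probability row per position; only the masked rows matter here.
toy_probs = [[0.0] * len(toy_id2token) for _ in toy_ids]
toy_probs[1][3], toy_probs[1][4] = 0.9, 0.1  # first blank: ə more likely
toy_probs[3][3], toy_probs[3][4] = 0.2, 0.8  # second blank: e more likely

decoded = fill_masks(toy_ids, toy_probs, mask_id=8, id2token=toy_id2token)
print(decoded)  # məreka
```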

Not forgetting to apply the hand-written g2p rules that we came up with.

g2p(inference("menyebabkannya"))

'məɲəbabkanɲa'

And thus we are done.

In practice, I would convert the Keras model over to ONNX so that I can run the static model with only NumPy as a dependency instead of TensorFlow/Keras. But it’s really up to your use case.

Conclusion

This little weekend experiment of mine is pretty much just a proof of concept, certainly with room for improvements. But at least, I’m happy that it worked better than the LSTM. It’s much more controllable and won’t be too shabby of a guess for OOV words.

This will be available once the g2p package we’re developing becomes open source. Hopefully, it will be by the time this blog post goes live. Otherwise, we’re still working on it :)

My HuggingFace JAX Community Week Experience

2021-07-30T00:00:00+10:00

On June 23, the HuggingFace team announced that they are planning to host a community week together with the people from the Google Cloud team. The main gist of this event was getting everyone to learn and use HuggingFace’s newly integrated JAX framework. But aside from just learning from tutorials, we were equipped with blazing fast TPUs thanks to the amazing Google Cloud team 🤯.

Hearing this, I naturally gravitated toward registering for the event, and I immediately invited my good friend, Steven Limcorn, to join me since it was a group effort. We hopped into a Discord call and the brainstorming began.

📝 Plan on Paper

Right off the bat, we were thinking: Indonesian Language Model. Why? Because it is the model which we both had experience training and it would be fun to learn JAX in the process (since we usually work in PyTorch).

At the same time, we ran into a dilemma. If you know a thing or two about Indonesian NLP, there are two major players in masked language modeling (MLM): IndoNLU’s IndoBERT and IndoLEM’s IndoBERT.

We thought, okay, how can we make something different? Or perhaps something better? Rummaging through Indonesian datasets, the first thing that came to mind was, of course, the OSCAR dataset (16GB). But, we thought, if we wanted the model to perform better than the existing models, we should be using a larger dataset, shouldn’t we?

Muppets, some of which became catchy deep learning jargons.

Despite the dilemma, we ended up posting the project proposal on the HuggingFace forums anyway. Luckily, a day later, we got a reply from a user that suggested two alternative datasets: CC100 (36GB) and mC4 (230GB). So we thought, cool, let’s train the best Indonesian model!

⚙ Setting up

To kick things off, we began by setting up the TPU Virtual Machine as instructed by the HuggingFace team. We found no major issues and the installation went pretty smoothly. All tokenizing and training scripts were ready, so no major code modification was needed to get the project started.

Fast forward, we began with training a tokenizer. “So, which dataset will we use?”, I asked. OSCAR? CC100? mC4? Heck, if we want to train the best model, why not use the largest dataset? And so I trained a ByteLevelBPETokenizer on the Indonesian mC4 subset, which took a good hour or two.. Fast forward, the tokenization finished and we’re ready to train! Or are we…?

Being naive, I naturally ran the training script and boy was I wrong. It took ages for the gigantic mC4 dataset just to pre-process; I was impatient. “Since we’re only creating a “trial” RoBERTa Base model, why bother training it on a huge dataset?” I thought.

And so we took a step back and trained on OSCAR instead. Being 93% smaller in size, it took only a couple of minutes to train a tokenizer. Likewise, the pre-processing step took only a short while before the model actually began training.

🚶‍♂️ Roaming to Thai NLP

It was at this point that I left my computer to get a haircut (true story). While the model kept training and my hair was being cut, I paused and thought: “why not participate in another project?” since others were participating in >1 project at the same time.

Returning to my computer after the haircut (and shower) ended, I browsed through existing project proposals and found a less crowded one: Thai RoBERTa project. Cool! Why not join another project that similarly works on a low-resource language? Perhaps I can learn a thing or two from it…

And so I contacted the participant who was responsible for the project: Sakares Saengkew. We talked, exchanged ideas, and ultimately agreed to work on this project together. What I didn’t expect was becoming really good friends with someone whom I had never met in person, let alone someone based in Thailand 😆.

The more we talked, the stronger our friendship grew. Over the course of our conversations, we found out that we both enjoyed watching Dota 2, so that became the topic of our conversation for a good while 😂. Games aside, Sakares’ original plan kept going, though with some hurdles along the way.

🤔 Debugging

As for my Indonesian model, it kept training and training, with an estimated training time of about 18 hours. At first, we were so happy that training actually took off. Mind you, we were Linux and machine learning amateurs, so getting things off the ground was already satisfying!

Both the training and evaluation losses seemed to be decreasing well, with the accuracy attaining decent results in the first few epochs. “All is well”, we thought.

With some more hours to kill, I decided to train another language model still using HuggingFace’s JAX framework, but this time on a personal Google Colab notebook. It was trained on the very low-resource language of Sundanese and the training and evaluation loss decreased just fine. It also achieved a decent accuracy, but I noticed something odd…

Despite the accuracy reports, the language model was spitting out gibberish that did not reflect the reported results. “Maybe something is wrong?”, I thought. Indeed, I had trouble converting the JAX model to PyTorch, due to the usage of FP16. “Aha!”, that’s where I thought the problem lay.

And so I opened a GitHub Issue on the matter, to which HuggingFace’s Patrick von Platen responded quickly and professionally. Apparently, a “reverse-trick” which I attempted to do to convert FP16 JAX models to FP32 was indeed the fix my model needed. What about the model’s results, though? It remained gibberish, sadly.

At this point, I thought I did something wrong along the training pipeline. “Whatever”, I said to myself, let’s just focus on the main dish: the Indonesian model.

✌ Not One, But Two

Seeing my Indonesian model training just fine, I wanted to test its intermediate results after training it for about 6-8 hours. I pulled the model weights from the HuggingFace Hub, and voila, gibberish output! The beast which I expected to have trained was no different from its Sundanese counterpart 😓.

Bert's stare, just like mine.

Now I’m left with two problems instead of one. Badly trained models, but why? Naturally, I investigated the common ground between these two models: JAX and OSCAR dataset. The former seemed innocent though, since nobody has reported a problem with it, and I’m sure the HuggingFace team has checked the framework thoroughly…

“It must be the dataset!”, I thought. But wait, while I dug through issues in HuggingFace’s GitHub repo, I found someone who was facing a similar problem: Birger Moëll. Like mine, Birger’s models were spitting gibberish despite decent training results. Having eliminated the other possible causes, we suspected that the dataset was the culprit of it all, or was it?

We had a short interaction within the GitHub Issue which Birger raised, but it turned into an even longer conversation back in the official Slack channel. We exchanged dataset cleaning ideas and discussed our plans for this event. What we didn’t realize was that we were becoming good friends from this exchange.

💀/💸 Failure and Fortune

The event lasted for two weeks, and I learned countless lessons along the way from the people of HuggingFace, the people I met, and of course, training the model itself. We all had a happy ending with our model training at the end of the day, but it wasn’t as smooth as many of us expected, or at least as I did.

For instance, the Indonesian RoBERTa Base model turned out to be just fine. I pushed the final version after the entire training finished, converted the model to PyTorch, and somehow it wasn’t outputting gibberish?! All along, I could possibly have pulled the first epoch of the model, or maybe even epoch zero judging from the performance 🤦‍♂️.

I was so close to giving up, but seeing the Base model working just as intended, I was back on track and became motivated to work on this project once again. The next boss to conquer: RoBERTa Large.

I naively thought training the Large rendition would be as trivial as training the Base model. But it turned out to be even more frustrating than the first attempt… Why? Well, unlike the Base model, the RoBERTa Large didn’t like the same value of learning rate. The training loss fluctuated constantly, leading me to think that it is overshooting due to a learning rate that’s too high (I was using 2e-4).

And thus I decided to decrease it by about an order of magnitude (2e-5). It was, unsurprisingly, too low of a learning rate. Even from the first few epochs, I could see that the model was not learning. I killed the process and increased the learning rate to 7e-5. At that point, it was about midnight, so I crossed my fingers and went to sleep. I woke up excited the next day, and just like that, it still didn’t learn 😤. Not a lucky number 7 after all…

Training loss of RoBERTa Large.

Seeing how my time was running out (the model took ~2.5 days to train), I increased the learning rate just slightly this time (8e-5). Crossed my fingers yet again and left the model to train…

As it resumed training, I was delighted to hear that my friends were finding the light at the end of their tunnels as well. Birger’s models began to return more sensible outputs, and we found out that Sakares’ slightly incorrect tokenization scheme was the culprit of the slow model training.

As for myself, I was honestly disappointed to see the Large model still suffering from the same issue of “not learning” as the evaluation loss looked somewhat flat at first. But talking to my teammate Steven, he suggested that we leave it as is this time and see how it will fare, since we’re really out of time at this point.

To my surprise, it finally learned! After about three or four epochs (~20 hours), the evaluation loss began to decrease! I could finally sleep without anxiety about model training, at the very least. We quickly realized that with the number of epochs we had set, it was impossible for it to reach the very high accuracy we wanted. But either way, it served as a lesson in learning-rate tuning and taught us that a scheduler’s warmup steps are equally important.
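To make the point about warmup a little more concrete, here is a minimal sketch of a linear warmup followed by a linear decay, written in plain Python. It is not the scheduler code we actually ran; the function name and the step counts are made up for illustration, and only the peak learning rate of 8e-5 comes from the run described above.

```python
def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given step: ramp up linearly, then decay linearly to zero."""
    if step < warmup_steps:
        # Warmup phase: grow from 0 to peak_lr so early updates stay small.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: shrink from peak_lr back down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical schedule: 1,000 warmup steps out of 10,000 total steps.
for step in (0, 500, 1000, 5500, 10000):
    print(step, linear_warmup_decay(step, 8e-5, 1000, 10000))
```

With too few warmup steps, the model sees the peak learning rate almost immediately, which is one plausible reason a large model can look like it is "not learning" early on.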

😮 Extension == Hope

Sometime later, the HuggingFace team announced that they will be extending the TPU access (and hence the event) for several more days than the initial deadline. For me, this meant more exploring and maximizing the tools at hand.

I decided to hop on the Thai NLP train and train a very trivial Thai GPT-2 on the OSCAR dataset. Since I had only about a day at most, I could only train for very few epochs and left the model running overnight. To my surprise, it actually trained well?! The evaluation loss decreased as expected, and the predictions were relatively decent for the short window of time!

I immediately told Sakares to play around with the model as I barely understood Thai. And indeed, the model’s predictions were reflective of the training metrics reported.

What’s unfortunate is that our original plan of training a Thai RoBERTa was too late for the deadline. Regardless, Sakares said that it’s okay since it could still be trained using Colab Pro, if we wanted to.

A preview of our Indonesian RoBERTa model demo.

As the HuggingFace team announced their beta feature Spaces, my team and I began ideating for a demo of our trained models. We fine-tuned the Indonesian RoBERTa base models on existing downstream tasks from IndoNLU, including emotion classification, sentiment analysis, and part-of-speech (POS) tagging. We used the first two in our model demo, as well as the pre-trained masked language model itself.

As for my Thai NLP project with Sakares, we ended up salvaging the last-minute Thai GPT-2 for our model demo 😂. Birger similarly deployed various models into one awesome demo titled Language Explorer. In the end, we really found the light at the end of our tunnels.

🚀 Closing Thoughts

Although none of us managed to secure a spot among the top 15 projects, the virtual event was nonetheless a memorable one. I learned a ton from the people I met, and ultimately had fun participating in my first-ever online community event hosted by HuggingFace and Google Cloud. I cannot thank my friends and organizers enough for making this experience possible. And I cannot wait to join the next HuggingFace community event 🤗.

To Steven, Sakares, Birger, and the friendliest team behind HuggingFace & JAX, thank you.

Pneumonia Chest X-Ray Classification

2020-08-31T00:00:00+10:00

The dataset used for this task is from a Kaggle dataset by Paul Mooney. It consists of two kinds of chest x-rays: those infected by pneumonia and those that are normal. Our main goal is to distinguish which chest x-rays are pneumonia-infected and which aren’t. Note that the dataset is highly imbalanced, like many medical image datasets are.

Fast.ai2 Library

Fast.ai has just released its version 2 framework. It is bundled with tons of old plus new shiny features which weren’t available previously such as its brand new medical applications. Although this task isn’t related to actually using the medical applications, it serves as a stepping-stone.

Fastbook

Aside from releasing its version 2 framework, fast.ai also released a companion-book dubbed fastbook. The book is available for free in the form of Jupyter notebooks, but one can also purchase a print version on Amazon. More importantly, this task is applying what I’ve learned from the 7th Chapter of the book called Training a State-of-the-Art Model.

Transfer Learning

Lastly, I’ve also applied Transfer Learning in this task, since the model performed better with it after a couple of runs. The particular model I’ll be using is EfficientNetB3A, with weights from Ross Wightman’s timm library.

Code

import itertools
import torch
import fastai
from fastai.vision.all import *
from fastai.vision.core import *
from fastai.callback import *
from fastai.metrics import *
import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix
import matplotlib.pyplot as plt
from timm import create_model

Load Data

The dataset is very imbalanced. Firstly, it has more pneumonia-infected chest x-rays than normal ones. I tried oversampling using PyTorch’s WeightedRandomSampler, but it didn’t show much of an improvement. Secondly, it has a very small validation dataset: 16 images in total. As such, measuring the model by its validation accuracy seems unwise.

path = Path("chest_xray/chest_xray")

Data Augmentation

First up in the loading process is data augmentation. This includes normalizing the images with Imagenet stats since the pretrained model also used the same stats. Moreover, I’ll apply default augmentative transforms provided by fast.ai, coupled with a randomly resized crop transform.

batch_tfms = [Normalize.from_stats(*imagenet_stats), *aug_transforms()]

def get_dls(bs, size):
    dblock = DataBlock(blocks     = (ImageBlock, CategoryBlock),
                       get_items  = get_image_files,
                       get_y      = parent_label,
                       splitter   = GrandparentSplitter(valid_name='val'),
                       item_tfms  = RandomResizedCrop(size, min_scale=0.75),
                       batch_tfms = batch_tfms)
    return dblock.dataloaders(path, bs=bs, num_workers=0).cuda()

dls = get_dls(64, 224)

dls.show_batch()

Model

As mentioned, we’ll be using a pretrained model called EfficientNetB3A. The few blocks of code below are from Zachary Mueller’s Practical-Deep-Learning-for-Coders-2.0 notebook tutorials. In particular, his notebook titled 05 EfficientNet and Custom Pretrained Models showed how to create a timm body, load pretrained weights, create a model head accordingly, and combine the two together.

def create_timm_body(arch:str, pretrained=True, cut=None):
    model = create_model(arch, pretrained=pretrained)
    if cut is None:
        ll = list(enumerate(model.children()))
        cut = next(i for i,o in reversed(ll) if has_pool_type(o))
    if isinstance(cut, int):
        return nn.Sequential(*list(model.children())[:cut])
    elif callable(cut):
        return cut(model)
    else:
        raise TypeError("cut must be either integer or function")

body = create_timm_body('efficientnet_b3a', pretrained=True)

nf = num_features_model(nn.Sequential(*body.children())) * (2)
head = create_head(nf, dls.c)

After creating the model here, we’ll apply a Kaiming Normal initialization to the second half of the model (the head). Kaiming He’s initialization technique is introduced in this paper.

model = nn.Sequential(body, head)
apply_init(model[1], nn.init.kaiming_normal_)
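As a rough illustration of what Kaiming Normal initialization does, the sketch below draws weights from a zero-mean normal distribution with standard deviation sqrt(2 / fan_in), which is what PyTorch's kaiming_normal_ uses with its default arguments. The helper names and layer sizes are made up for illustration and use only the standard library; they are not part of fast.ai or PyTorch.

```python
import math
import random

def kaiming_std(fan_in, gain=math.sqrt(2.0)):
    """Standard deviation for Kaiming Normal init: gain / sqrt(fan_in)."""
    return gain / math.sqrt(fan_in)

def kaiming_normal(fan_out, fan_in, seed=0):
    """Return a fan_out x fan_in weight matrix as nested lists (illustrative only)."""
    rng = random.Random(seed)
    std = kaiming_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# Hypothetical linear layer mapping 512 features to 2 classes.
weights = kaiming_normal(fan_out=2, fan_in=512)
print(len(weights), len(weights[0]), round(kaiming_std(512), 4))
```

The wider the incoming layer, the smaller the initial weights, which keeps the variance of the activations roughly stable as they pass through ReLU layers.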

We’ll use LabelSmoothingCrossEntropy and MixUp callback as suggested in fastbook. Both the loss function and callback may contribute to improving the model’s accuracy. You can find papers introducing Label Smoothing here and Mixup here.

learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy, cbs=MixUp())
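For intuition, here is a small plain-Python sketch of the two ideas, separate from fast.ai's own implementations. Label smoothing replaces the one-hot target with a slightly softened distribution, and Mixup blends two samples and their targets with the same weight. All function names and numbers below are illustrative assumptions; in the Mixup paper the blending weight is drawn from a Beta distribution, whereas here it is passed in directly to keep the example deterministic.

```python
import math

def smooth_targets(label, n_classes, eps=0.1):
    """Soft target: (1 - eps) on the true class plus eps spread uniformly."""
    return [(1.0 - eps) * (1.0 if i == label else 0.0) + eps / n_classes
            for i in range(n_classes)]

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def mixup(x_a, x_b, y_a, y_b, lam):
    """Blend two inputs and their targets with the same weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

# Hypothetical two-class example (0 = normal, 1 = pneumonia).
target = smooth_targets(label=1, n_classes=2)
loss = cross_entropy([0.2, 0.8], target)
mixed_x, mixed_y = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.7)
print(target, loss, mixed_x, mixed_y)
```

Both techniques discourage the model from becoming overconfident, which is particularly relevant when the validation set is as tiny as ours.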

Since the model takes up a lot of GPU memory, using one GPU wasn’t enough. Luckily I have two NVIDIA GeForce GTX 980M, so I split the computation to both of them using PyTorch’s DataParallel.

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    learn.model = nn.DataParallel(learn.model)

Let's use 2 GPUs!

Training Model

Once everything has been set up, we can find a good learning rate to train the model.

learn.lr_find()

c:\users\wilso\appdata\local\programs\python\python38\lib\site-packages\torch\cuda\nccl.py:14: UserWarning: PyTorch is not compiled with NCCL support
  warnings.warn('PyTorch is not compiled with NCCL support')

SuggestedLRs(lr_min=0.006918309628963471, lr_steep=9.120108734350652e-05)

Here we’ll train the model for 10 epochs with the one-cycle policy and a weight decay of 0.1.

learn.fit_one_cycle(10, 6e-3, wd=0.1, cbs=SaveModelCallback(fname='best-val-loss'))
learn.save('efficientnetb3a-1')

epoch	train_loss	valid_loss	accuracy	time
0	0.762553	1.285270	0.500000	02:28
1	0.560403	1.788918	0.500000	02:27
2	0.497399	0.727971	0.562500	02:26
3	0.460750	0.842557	0.625000	02:25
4	0.532170	8.171339	0.625000	02:25
5	0.493482	2.005133	0.687500	02:30
6	0.435214	0.956962	0.562500	03:33
7	0.397469	0.727003	0.562500	03:12
8	0.377309	0.713967	0.625000	02:27
9	0.380401	0.636497	0.625000	02:25

Better model found at epoch 0 with valid_loss value: 1.2852699756622314.
Better model found at epoch 2 with valid_loss value: 0.7279710173606873.
Better model found at epoch 7 with valid_loss value: 0.7270027995109558.
Better model found at epoch 8 with valid_loss value: 0.7139670252799988.
Better model found at epoch 9 with valid_loss value: 0.6364966630935669.

Path('models/efficientnetb3a-1.pth')

learn.recorder.plot_loss()
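For readers unfamiliar with the one-cycle policy used in the training run above, the sketch below shows the general shape of such a schedule: the learning rate warms up from a small value to the maximum with cosine annealing, then anneals down to a much smaller value. The function names are my own, and the values of div, div_final and pct_start are assumptions meant to mirror what I understand to be fast.ai's defaults; treat this as an approximation rather than the library's exact code.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pos, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pos in [0, 1] under a one-cycle schedule."""
    if pos < pct_start:
        # Warmup: from lr_max / div up to lr_max.
        return cos_anneal(lr_max / div, lr_max, pos / pct_start)
    # Annealing: from lr_max down to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))

# Same maximum learning rate as the run above (6e-3), sampled at a few points.
for pos in (0.0, 0.125, 0.25, 0.5, 0.75, 1.0):
    print(f"{pos:.3f} -> {one_cycle_lr(pos, 6e-3):.6f}")
```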

Testing Model

As mentioned, the validation dataset is too small to measure our model’s performance. Fortunately, the dataset comes with a large enough test set, which we’ll be using.

Load Test Data

The method I’ll be using here and for the rest of this notebook is a patchy solution. Specifically, I’ll create a test dataloader and replace the old validation dataset with it.

def get_test_dls(bs, size, test_folder):
    dblock = DataBlock(blocks     = (ImageBlock, CategoryBlock),
                       get_items  = get_image_files,
                       get_y      = parent_label,
                       splitter   = GrandparentSplitter(valid_name=test_folder),
                       item_tfms  = Resize(size),
                       batch_tfms = batch_tfms)
    return dblock.dataloaders(path, bs=bs, num_workers=0).cuda()

test_dl = get_test_dls(64, 224, 'test')

learn.dls = test_dl

Test Accuracy

preds, targs = learn.get_preds()
accuracy(preds, targs)

tensor(0.9231)

Analyze Results

The model achieved 92% accuracy on the test data. However, using accuracy as a measure of performance on an imbalanced dataset is unwise. Say we have 95 normal chest images and 5 pneumonia-infected ones; blindly guessing all 100 of them to be normal would still yield a high 95% accuracy. Hence, Precision and Recall are better metrics to use in this case.

According to the Scikit Learn docs, precision is intuitively the ability of the classifier not to label as positive a sample that is negative. Whereas recall is intuitively the ability of the classifier to find all the positive samples.
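The 95/5 thought experiment above can be checked with a few lines of plain Python. This is a hand-rolled sketch rather than Scikit Learn's implementation; the helper name is made up, and returning 0.0 when there are no positive predictions is a choice made here for simplicity.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# 95 normal images (0), 5 pneumonia images (1), and a model that always says "normal".
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.95 0.0 0.0
```

The accuracy looks great, yet the model never finds a single pneumonia case, which is exactly what the recall of zero exposes.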

We can plot the results of the test predictions and visualize them using a confusion matrix. In fact, plotting such a diagram is available in the fast.ai library. However, for some reason I couldn’t get it to work in this new update. Thus, I decided to simply copy the actual fast.ai interpret code and modify it to fix the issue.

I found that the confusion_matrix code broke the plotting process which is dependent on it. To fix the issue, I’ve replaced the confusion matrix with Scikit Learn’s. Lastly, I specified the function to also print the recall and precision metrics, both of which are from Scikit Learn.

def plot_confusion_matrix(y_pred, y_true, vocab):
    y_pred = y_pred.argmax(dim=-1)
    cm = confusion_matrix(y_true, y_pred)

    fig = plt.figure(figsize=(8,8), dpi=60)
    plt.imshow(cm, interpolation='nearest', cmap="Blues")
    plt.title("Confusion Matrix")
    tick_marks = np.arange(len(vocab))
    plt.xticks(tick_marks, vocab, rotation=90)
    plt.yticks(tick_marks, vocab, rotation=0)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        coeff = f'{cm[i, j]}'
        plt.text(j, i, coeff, horizontalalignment="center", verticalalignment="center", color="white" if cm[i, j] > thresh else "black")

    ax = fig.gca()
    ax.set_ylim(len(vocab)-.5,-.5)

    plt.tight_layout()
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.grid(False)

    print(f"Precision: {precision_score(y_true, y_pred):.3f}")
    print(f"Recall: {recall_score(y_true, y_pred):.3f}")

plot_confusion_matrix(preds, targs, dls.vocab)

Precision: 0.896
Recall: 0.992

The model achieved about 90% precision and 99% recall!

Closing Remarks

Despite all of the issues and troubles I’ve stumbled upon during the project, I’ve learned to be more flexible in utilizing the tools available. I also attempted this task over a year ago as a beginner in deep learning. To my surprise, I actually solved it back then using a VGG19 pretrained model in Tensorflow/Keras and attained quite a satisfying result as well.

In any case, this mini project taught me tons and I am excited to learn even more deep learning related topics. Hope you’ve learned something!

Text Generation using minGPT and fast.ai

2020-08-24T00:00:00+10:00

Andrej Karpathy, Tesla’s AI Director, released minGPT, a mini version of OpenAI’s GPT. Normally a GPT would have billions of parameters and would take hours to train. Karpathy’s approach is to provide a smaller version of GPT, hence the name minGPT.

minGPT + fast.ai

Fast.ai has just released its version 2.0. This version is a total rewrite of its predecessor. It works with various other PyTorch libraries and can also integrate with pure PyTorch code. Morgan Mcguire (morganmcg1 on GitHub) shared code that incorporates Karpathy’s minGPT with fast.ai. This project builds upon Mcguire’s code. Credits to Morgan Mcguire for the code. I do not own the code; I simply changed minor bits (data, hyperparameters) in the overall code.

Yabes Elia & Zilbest

Yabes Elia is an editor of esports articles. He was, and still is, my editor. Before that, he used to blog on his own page, Zilbest.com. The blog focused on several topics, including Philosophy, Romance, and Psychology. After reading his blog posts, I got the idea to train a language model on his writing. I thought it would be interesting to let a deep learning model learn a person’s style of language. Credits to mas Yabes Elia for allowing me to use his blog posts at Zilbest.com as a data source.

Code

The following code is based on morganmcg1’s “A Quick Demo of Andrej Karpathy’s minGPT Play Char Demo”. Only fragments of the important blocks of code were included.

Loading Data

The data is simply a .txt file filled with Yabes Elia’s articles on Zilbest. I’ve uploaded the .txt file to my Google Drive, loaded it, and showed the first 100 characters.

raw_text = open(drive_path/'yabes-elia.txt', 'r').read()
raw_text[0:100]

'“You will never be happy if you continue to search for what happiness consists of. You will never li'

len(raw_text)

Transforms

class CharTransform(Transform):
    def __init__(self, data, block_size):
        chars = list(set(data))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))

        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
        self.n_sequences = math.ceil(len(self.data) / (self.block_size + 1))

    def encodes(self, o):
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i+self.block_size+1]
        dix = [self.stoi[s] for s in chunk]
        return torch.tensor(dix)

    def decodes(self, o):
        t = ''.join([self.itos[s.item()] for s in o])
        return TitledStr(t)

sl = 128
block_size = sl
n_samples = math.ceil(len(raw_text) / (block_size + 1))

tls = TfmdLists(list(range(n_samples)), tfms=[CharTransform(raw_text, 128)], split_idx=0, dl_type=LMDataLoader)

data has 227914 characters, 93 unique.

show_at(tls.train, 0)

Faktanya, mengubah sejarah dunia itu tidak akan pernah semudah membalikkan telapak tangan, atau dalam hal ini, menuliskan komenta

bs = 256
dls = tls.dataloaders(bs=bs, seq_len=sl)

dls.show_batch(max_n=2)

	text	text_
0	ibadi? Well, saya orang praktis. Saat saya masih jadi Managing Editor PC Gamer Indonesia, saya tentu lebih pro dengan MOBA di PC	badi? Well, saya orang praktis. Saat saya masih jadi Managing Editor PC Gamer Indonesia, saya tentu lebih pro dengan MOBA di PC.
1	al tidur sianah yang bisa memberikan jawaban jujur tentang siapa kita, bukan kuis-kuis di dunia maya yang tidak jelas algoritman	l tidur sianah yang bisa memberikan jawaban jujur tentang siapa kita, bukan kuis-kuis di dunia maya yang tidak jelas algoritmany
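As the batch preview above suggests, the target (text_) is simply the input (text) shifted by one character, so that the model learns to predict the next character at every position. The sketch below illustrates this with plain Python. The snippet of text and the helper names are made up for illustration; unlike the CharTransform above, it sorts the character set so that the mapping is deterministic.

```python
text = "saya orang praktis"  # hypothetical snippet standing in for the corpus

# Character-level vocabulary: string-to-index and index-to-string lookups.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def make_lm_pair(chunk):
    """Encode a chunk and split it into (input ids, next-character target ids)."""
    ids = [stoi[ch] for ch in chunk]
    return ids[:-1], ids[1:]

def decode(ids):
    """Map a list of ids back to a string."""
    return "".join(itos[i] for i in ids)

block_size = 8
inputs, targets = make_lm_pair(text[: block_size + 1])
print(repr(decode(inputs)))   # 'saya ora'
print(repr(decode(targets)))  # 'aya oran'
```

This is also why each chunk is taken with block_size + 1 characters: one extra character is needed to form the final target.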

DropOutput Callback

minGPT’s forward pass returns a tuple of logits and loss, so this callback replaces fast.ai Learner’s self.learn.pred with its first element (the logits).

class DropOutput(Callback):
    def after_pred(self):
        self.learn.pred = self.pred[0]

Model: minGPT

mconf = GPTConfig(dls.char_transform.vocab_size, sl, n_layer=8, n_head=8, n_embd=512)
model = GPT(mconf)

learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), opt_func=partial(Adam, sqr_mom=0.95, wd=0.1),
                cbs=[DropOutput])

Training Model

As per fast.ai practice, we let the Learner find a suitable learning rate; in our case we got about $0.0033 \approx 3 \times 10^{-3}$.

learn.lr_find()

/usr/local/lib/python3.6/dist-packages/fastprogress/fastprogress.py:74: UserWarning: Your generator is empty.
  warn("Your generator is empty.")

SuggestedLRs(lr_min=0.0033113110810518267, lr_steep=2.0892961401841603e-05)

With that, we proceeded to train the model for 100 epochs with the learning rate we found.

learn.fit_one_cycle(100, 3e-3)

epoch	train_loss	valid_loss	time
0	3.254104	None	00:14
1	3.155325	None	00:15
2	3.099205	None	00:15
3	3.032215	None	00:14
4	2.936109	None	00:14
5	2.849230	None	00:14
6	2.779433	None	00:14
7	2.719887	None	00:14
8	2.667851	None	00:14
9	2.624543	None	00:14
10	2.585148	None	00:14
11	2.552182	None	00:14
12	2.523161	None	00:14
13	2.498031	None	00:14
14	2.476660	None	00:14
15	2.455955	None	00:14
16	2.441366	None	00:14
17	2.427677	None	00:14
18	2.414104	None	00:14
19	2.397461	None	00:14
20	2.383328	None	00:14
21	2.368615	None	00:14
22	2.352587	None	00:14
23	2.335341	None	00:14
24	2.323342	None	00:14
25	2.305508	None	00:14
26	2.286461	None	00:14
27	2.262887	None	00:14
28	2.237531	None	00:14
29	2.211838	None	00:14
30	2.186196	None	00:14
31	2.156658	None	00:14
32	2.128527	None	00:14
33	2.104312	None	00:14
34	2.074495	None	00:14
35	2.046017	None	00:14
36	2.018104	None	00:14
37	1.990814	None	00:14
38	1.963953	None	00:14
39	1.938050	None	00:14
40	1.910195	None	00:14
41	1.882767	None	00:14
42	1.859885	None	00:14
43	1.833001	None	00:14
44	1.805273	None	00:14
45	1.777778	None	00:14
46	1.749810	None	00:14
47	1.721224	None	00:14
48	1.694282	None	00:14
49	1.668665	None	00:14
50	1.641540	None	00:14
51	1.614098	None	00:14
52	1.587708	None	00:14
53	1.560743	None	00:14
54	1.534708	None	00:14
55	1.510127	None	00:14
56	1.486278	None	00:14
57	1.461563	None	00:14
58	1.438166	None	00:14
59	1.415540	None	00:14
60	1.392969	None	00:14
61	1.371182	None	00:14
62	1.351205	None	00:14
63	1.331026	None	00:14
64	1.311882	None	00:14
65	1.293381	None	00:14
66	1.274096	None	00:14
67	1.256531	None	00:14
68	1.237806	None	00:14
69	1.221424	None	00:14
70	1.204520	None	00:14
71	1.189105	None	00:14
72	1.172827	None	00:14
73	1.156720	None	00:14
74	1.140753	None	00:14
75	1.125648	None	00:14
76	1.111875	None	00:14
77	1.097298	None	00:14
78	1.083305	None	00:14
79	1.069097	None	00:14
80	1.056546	None	00:14
81	1.044658	None	00:14
82	1.033119	None	00:14
83	1.021210	None	00:14
84	1.009997	None	00:14
85	0.999994	None	00:14
86	0.989661	None	00:14
87	0.979982	None	00:14
88	0.970661	None	00:14
89	0.961383	None	00:14
90	0.953398	None	00:14
91	0.946190	None	00:14
92	0.939140	None	00:14
93	0.932855	None	00:14
94	0.926477	None	00:14
95	0.921115	None	00:14
96	0.915792	None	00:14
97	0.911426	None	00:14
98	0.907237	None	00:14
99	0.904241	None	00:14

/usr/local/lib/python3.6/dist-packages/fastprogress/fastprogress.py:74: UserWarning: Your generator is empty.
  warn("Your generator is empty.")

learn.recorder.plot_loss()

Testing Model

After training, we can feed the model a contextual phrase/sentence and let it generate the rest of the text. We sampled the model’s result and let it predict the next 2000 steps.

from minGPT.mingpt.utils import sample
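The sampling calls in this section pass temperature and top_k arguments. As a rough sketch of what those two knobs do for a single step (not minGPT's actual code, which works on PyTorch tensors), the plain-Python function below scales the logits by the temperature, keeps only the k most likely characters, and samples one of them in proportion to their softmax weights. The function name and the toy logits are assumptions made for illustration.

```python
import math
import random

def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample one index from the k highest logits after temperature scaling."""
    scaled = [logit / temperature for logit in logits]
    # Indices of the k largest scaled logits.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax weights over the kept indices (shifted by the max for stability).
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy logits over a 4-character vocabulary.
rng = random.Random(0)
samples = [top_k_sample([0.1, 2.0, -1.0, 1.5], k=2, temperature=0.9, rng=rng)
           for _ in range(10)]
print(samples)  # only indices 1 and 3 can appear
```

A lower temperature sharpens the distribution toward the most likely character, while a smaller k cuts off the long tail of unlikely characters, which helps keep character-level output readable.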

Context 1: “Karena itu,”

In English: “Therefore,”.

context = "Karena itu,"
x = torch.tensor([dls.char_transform.stoi[s] for s in context], dtype=torch.long)[None,...].to(dls.device)
y = sample(model, x, 2000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([dls.char_transform.itos[int(i)] for i in y])
print(completion)

Karena itu, sistem gurus muluk disemungkinan di sini adalah kemuncegan banyak karakteristik pahlawan saya yang berbeda atas hidup kita bisa mendapatkan kehidupan hasil, keturunan, ataupun hilang rasa, satu hal yang bisa dijelaskan dengan kata-kata Anda tadi, kemungkinan besar, Anda terkritik dengan sendiri. Jika Anda tidak tahu kalah itu jauh lebih sulit dan lebih tertarik meraih pada konfirmasi dengan keatan kita, ataupun hal-hal lainnya karena sebenarnya ada satu hal yang menulis.

Saya pribadi, jika saya sangat merasa menyenangkan untuk komenangkap ini mungkin bisa berpikir diterakhir yang membutuhkan bahwa pribadi jika Anda tidak ada yang suka daripada satu kawan saya sudah berpasangan dalam menghubungan sesuatu Anda tidak suka dengan sekalipun, kategori saya yakin Anda juga masih sering berpikir saya bisa memaknakan para pembela atau malah sarat dengan personal tentang skeptisisme.

Saya juga tidak akan berubah waktu.

Namun, kebalikan dari kreatif seperti kita drumah, dan perspektif seorang istri ini.

Misalnya saja seperti ini, saya tidak pernah menghantarkan penutup artikel ini. Setidaknya, saya memang sudah bekerja dari sudut pandang menghasilkan kebetulanan saja, selanjutnya seperti ini. Satu hal yang membuat saya pernah menuliskan saya bekerja keras dan kembali buku sosial dan sebagai kuburan tahun berapa buku saya adalah sebagian besar tadi sebenarnya sudah tidur dari pasangan.

Di dunia riil, saya pribadi memiliki alam hidup yang berbeda dari segi yang saya pernah berada di depan kita, dan keluarga kita mau menikmati keseluarga ketimbang dua memahami segera pandai bagaimana relevan.

Dari sejumlah satu komunitas adalah seperti grafis di bawah ini untuk diri sendiri. Misalnya, satu hal yang sama-sama sekali, dan kuis-kuis di jaman sekarang ini, ada banyak h kawan-kawan saya yang memegang tidak berbagai semua saya kira semua pasti tidak menyebutkan berakhir demikian? Kita juga tidak akan lelahnya selalu berada dengan satu cara yang sama (saya). Namun juga saya tahu

Context 2: “Filosofi saya adalah”

In English: “My philosophy is”.

context = "Filosofi saya adalah"
x = torch.tensor([dls.char_transform.stoi[s] for s in context], dtype=torch.long)[None,...].to(dls.device)
y = sample(model, x, 2000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([dls.char_transform.itos[int(i)] for i in y])
print(completion)

Filosofi saya adalah ketidakpastian dan sang pernah berhasil mengolahnya kebanyakan soal image yang mengatakan bahwa saya menghabiskan waktu untuk berubah dambaan hati Anda, dan sebelum tulisan ini juga sangat baik itu memperti sebelumnya.

Saya kira semua suka saya dibelikan banyak mobile.

Ditambah lagi, atau proses berpikir lebih jauh masing-masing. Pasalnya, merasa tidak memuaskan keras untuk memperkayakan diri dengan kepentingan yang saya tadi, seperti pemilik solusi yang lebih beruntung ketimbang mendengarkan gilita semua itu tidak aktif dan marah terhadap kebebahagiaan di kondisi lainnya sebelumnya.

Akhirnya, saya pribadi juga melihat ketika mana yang semua hal yang bisa Anda tidak akan mengeluhi kegagalanan, saya kira semua bisa sampai ke titik ini – tulisan saya ditujukan di sini adalah kesatuan yang bisa kita ahadapi di kepentingan industri ini membutuhkan sebagian tadi berpikir – karena mencerita jadi sebuah pasangan atau berpikir lebih jauh.

Maksud saya terhasal menghadapi sesuai dengan keputusan yang saya rasakan. Setiap kita pasti punya keinginan bisa jadi profesional, berpikir berbasiskan bisa jadi salah satu cenderung untuk mencari tahu alias karena sifat kepintaran tersebut di sana.

Misalnya, terasal dari satu tim/ gumen, pasti pacaran tadi pacarnya, pernah saya bisa mengajak keluarga dalam membela kerap bersisa dapat membebaskan apa yang kita percayai adalah kesuksesan dan berbagi berpikir internet dan menggelitik. Tutup jawaban memang mudah mendorong untuk mencari kesadar dan satu pasangan Anda, selama 15 tahun yang berbeda dari kondisi yang lainnya.

Namun demikian, kecenderungan untuk melarang lebih besar ketimbang harus kita sedih memiliki personalitas kita bisa jadi tolak ukur dan kepintaran seseorang seperti seperti bahkan sebuah soal game, seperti seperti yang seperti apakah yang bisa semesta dan mengurus rumah tangga.

Saya adalah rekaman sepenuhnya dengan tangan Anda, Anda akan pernah mendengar adalah orang-orang yang berpikiran – siedak akan berarti saya

Context 3: “Bagi saya, hidup”

In English: “For me, life”.

context = "Bagi saya, hidup"
x = torch.tensor([dls.char_transform.stoi[s] for s in context], dtype=torch.long)[None,...].to(dls.device)
y = sample(model, x, 2000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([dls.char_transform.itos[int(i)] for i in y])
print(completion)

Bagi saya, hidup itu sebenarnya manusia itu bisa berubah-ubah dan kesamaan Anda…

So, saya pribadi mencari saya, kemungkinan besar, Anda juga akan memuaskan keruntungan keinginan sosial dan memproses bebagai sebuah kehidupannya seperti karena pil berargumen bahwa pada pengecualian. Jika Anda bisa membaca itu saja yang sebenarnya tak punya kesedihan menuntut dunia. Dari perspektif seorang berbeda jadi berusaha dengan masalah dunia nyata, termasuk sebagian dan berbeda. Sebelum kita, meminilah kepuasannya sesama seperti ini adalah satu hal yang pasti pernah merasakan hal yang sama, termasuk dalam hidup itu terjadi.

Siapakah yang saya tidak mengalahkan pusat ini seringkali disadari. Dalam perspektif yang membuat Anda pernah dengan percaya setiap orang idealisme itu biasanya merupakannya.

Namun setiap kita punya keingintahuan yang sama sama seperti ini, saya percaya bahwa adalah rekan besar tadi setiap orang yang paling suka merasa melihat hal tersebut tertarik untuk menjadi bagian dari kebencian Andscommbias yang bernada dari segi sebenarnya bisa berkurang terus bekerja atau bisa memiliki cerita semakin banyak orang suami itu terjadi ketika kita masih mencari kesalahan prestasi atau tidak berawal dengan argumen yang namanya pendapat yang berbeda, dengan sedikit juga akan lakukan dari segudang seksual. Tokoh karena kita punya sudah berpasangan tahun lalu bagaimana jika kita berada di misalnya. Namun, kenyataannya, banyak orang tua, anak ‘multiplaya yang digunakan oleh orang lain – meski tidak ada yang lainnya.

Sebenarnya apa? Kegalauan, keyakinan Anda tujuan selalu mengerti pasangan Anda selalu merasa saat mengakui sesuatu di saat Anda.

Memang, saya sudah menyarankan pertama atau sistem berbeda di sini saat ini.

Akhirnya, tidak sama seperti yang saya rasakan. Saya kira saya tahu bahwa kita bisa saja memiliki hasrat tersebut karena tidak akan pernah terlalu memang negatif lainnya.

Misalnya saja seperti ini, saya kira saya juga tidak mau berhadapan dengan soal lainnya.

Sayangnya, m

Closing Remarks

Conclusion

To sum up, here are several of my remarks:

McGuire showed how easy it is to integrate fast.ai with PyTorch models and libraries.
Fast.ai removes the need to hand-write repetitive pieces such as a Trainer for the model, learning rate scheduling, etc.
Karpathy’s minGPT is very versatile. Despite having far fewer parameters than OpenAI’s GPT, it still showed good results.
Although some of the sentences weren’t really grammatical, it’s still interesting to let the model write text in the style of mas Yabes Elia.

I’ve learned a lot by simply modifying McGuire’s code. As a novice in DL, Language Modelling is certainly something new for me. I’m excited to see what DL is capable of doing across applications. I hope you’ve learned something like I did!

Credits

morganmcg1’s A Quick Demo of Andrej Karpathy’s minGPT Play Char Demo.
karpathy’s minGPT.
Yabes Elia’s Zilbest blog posts.

MNIST Classification with Quantum Neural Network

2020-07-14T00:00:00+10:00

Tensorflow is one of the most used deep learning frameworks today, bundled with many features for end-to-end deep learning processes. Recently, the team announced a new library on top of Tensorflow, called Tensorflow Quantum. Tensorflow Quantum integrates with Cirq, which provides quantum computing algorithms, and the two work well together on tasks involving Quantum Machine Learning.

Quantum Computer Simulator

Tensorflow Quantum provides a default backend Simulator which is written in C++. It is possible, although slower, to use a Cirq Simulator as the backend, or any other backend such as a real quantum computer. However, since today’s real quantum computers are still very noisy and sensitive to interference, the QNN is run on the C++ simulator backend for simplicity. The aim is to experiment with available hybrid quantum-classical algorithms and see the potential of Quantum Machine Learning once fault-tolerant Quantum Computers become available.

Photograph of the Sycamore processor | Erik Lucero

Quantum Neural Networks

One realization of Quantum Machine Learning is the Quantum Neural Network (QNN), which, unlike the Hybrid Neural Networks discussed in the previous blog post, runs purely on a quantum circuit with only quantum gates. It does not combine classical and quantum neural network layers, and works quite differently from a classical neural network, at least for now.

MNIST Classification

Again, we’ll be classifying images from the MNIST dataset with the QNN. The following blocks of code are based on a tutorial from Tensorflow Quantum, called MNIST classification. The algorithm used is based on a paper by Farhi et al., which is a must-read for understanding the concepts and the reasoning behind the QNN being implemented.

Code

Loading Data

Rescaling Images

As mentioned, we’ll be using the MNIST dataset as usual, whose images are originally 28x28 pixels each. We’ll be rescaling the pixel values from the $[0, 255]$ range to the $[0.0, 1.0]$ range.

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., np.newaxis]/255.0, x_test[..., np.newaxis]/255.0

print("Number of original training examples:", len(x_train))
print("Number of original test examples:", len(x_test))

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
Number of original training examples: 60000
Number of original test examples: 10000

Since the final “output layer”, in this case the readout qubit, is a single qubit, we will only classify 2 distinct classes: 3s and 6s.

def filter_36(x, y):
    keep = (y == 3) | (y == 6)
    x, y = x[keep], y[keep]
    y = y == 3
    return x,y

x_train, y_train = filter_36(x_train, y_train)
x_test, y_test = filter_36(x_test, y_test)

print("Number of filtered training examples:", len(x_train))
print("Number of filtered test examples:", len(x_test))

Number of filtered training examples: 12049
Number of filtered test examples: 1968

print(y_train[0])

plt.imshow(x_train[0, :, :, 0])
plt.colorbar()

True

Downsampling Images

The images are then downsampled to 4x4 pixels each since we’ll only be using 17 qubits: 16 for the image pixels and 1 as the readout. This lowers the resolution to the point where the digits are no longer recognizable. But because the number of qubits we can simulate is limited, downsampling to such low-resolution images is required.

x_train_small = tf.image.resize(x_train, (4,4)).numpy()
x_test_small = tf.image.resize(x_test, (4,4)).numpy()

print(y_train[0])

plt.imshow(x_train_small[0,:,:,0], vmin=0, vmax=1)
plt.colorbar()

True
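To get a feel for why the qubit count is the bottleneck, here is a rough back-of-the-envelope sketch. It assumes a dense state-vector simulator that stores one complex64 amplitude (8 bytes) per basis state; that storage model and the helper name statevector_bytes are assumptions for illustration, not details taken from the tutorial.

```python
def statevector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory for a dense state vector: 2**n complex amplitudes (assumed complex64)."""
    return (2 ** num_qubits) * bytes_per_amplitude


small = statevector_bytes(16 + 1)      # 4x4 image plus the readout qubit
full = statevector_bytes(28 * 28 + 1)  # 28x28 image plus the readout qubit

print(f"17 qubits: {small / 2**20:.0f} MiB")
print(f"785 qubits: a byte count with {len(str(full))} decimal digits")
```

Under these assumptions the memory doubles with every extra qubit, so one qubit per pixel is only workable for very small images.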

Removing Contradicting Images

Additionally, after downsampling there are ambiguous examples in our dataset, where the same image appears with more than one label. We’ll remove those contradicting image-label pairs from the dataset.

def remove_contradicting(xs, ys):
    mapping = collections.defaultdict(set)
    for x,y in zip(xs,ys):
        mapping[tuple(x.flatten())].add(y)

    new_x = []
    new_y = []
    for x,y in zip(xs, ys):
        labels = mapping[tuple(x.flatten())]
        if len(labels) == 1:
            new_x.append(x)
            new_y.append(list(labels)[0])
        else:
            pass

    num_3 = sum(1 for value in mapping.values() if True in value)
    num_6 = sum(1 for value in mapping.values() if False in value)
    num_both = sum(1 for value in mapping.values() if len(value) == 2)

    print("Number of unique images:", len(mapping.values()))
    print("Number of 3s: ", num_3)
    print("Number of 6s: ", num_6)
    print("Number of contradictory images: ", num_both)
    print()
    print("Initial number of examples: ", len(xs))
    print("Remaining non-contradictory examples: ", len(new_x))

    return np.array(new_x), np.array(new_y)

x_train_nocon, y_train_nocon = remove_contradicting(x_train_small, y_train)

Number of unique images: 10387
Number of 3s:  4961
Number of 6s:  5475
Number of contradictory images:  49

Initial number of examples:  12049
Remaining non-contradictory examples:  11520
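The filtering logic above can be seen on a toy example. This is a minimal sketch with made-up four-pixel "images" (plain tuples rather than NumPy arrays), only meant to show that every copy of an image seen with both labels is dropped.

```python
from collections import defaultdict

# Toy stand-ins for downsampled images (flattened to tuples) and their labels.
images = [(0, 1, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (1, 1, 0, 0)]
labels = [True, False, False, True]  # True = "3", False = "6"

# First pass: collect every label seen for each distinct image.
seen = defaultdict(set)
for image, label in zip(images, labels):
    seen[image].add(label)

# Second pass: keep only examples whose image maps to exactly one label.
kept = [(image, label) for image, label in zip(images, labels) if len(seen[image]) == 1]
```

Here the image (0, 1, 1, 0) appears once as a 3 and once as a 6, so both of its occurrences are removed.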

Encoding Data as Quantum Circuits

We have to find a way to represent our images as qubits, and the method implemented in the tutorial is pretty straightforward. We set a certain threshold value, in our case 0.5, and if our pixel value is greater than that, we’ll append Cirq’s X-gate, which flips the qubit state from a $0$ to a $1$ (i.e. signifying the existence of a pixel value in a qubit).

THRESHOLD = 0.5

x_train_bin = np.array(x_train_nocon > THRESHOLD, dtype=np.float32)
x_test_bin = np.array(x_test_small > THRESHOLD, dtype=np.float32)

def convert_to_circuit(image):
    """Encode truncated classical image into quantum datapoint."""
    values = np.ndarray.flatten(image)
    qubits = cirq.GridQubit.rect(4, 4)
    circuit = cirq.Circuit()
    for i, value in enumerate(values):
        if value:
            circuit.append(cirq.X(qubits[i]))
    return circuit


x_train_circ = [convert_to_circuit(x) for x in x_train_bin]
x_test_circ = [convert_to_circuit(x) for x in x_test_bin]
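To make the encoding concrete without needing Cirq, here is a small sketch that lists which grid positions would receive an X gate for a hypothetical 4x4 image. The pixel values are invented for illustration; the row-major ordering matches how the image is flattened before the qubits are indexed.

```python
THRESHOLD = 0.5

# Hypothetical 4x4 grayscale image with values in [0, 1].
image = [
    [0.0, 0.7, 0.9, 0.0],
    [0.0, 0.0, 0.6, 0.0],
    [0.0, 0.8, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Row-major flattening: pixel (row, col) maps to flat index row * 4 + col.
flipped = [
    (row, col)
    for row, pixels in enumerate(image)
    for col, value in enumerate(pixels)
    if value > THRESHOLD
]

# The resulting computational-basis state, one bit per data qubit.
bitstring = "".join("1" if value > THRESHOLD else "0" for pixels in image for value in pixels)
```

Each entry in flipped corresponds to one X gate in the circuit, and every other data qubit stays in the state $0$.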

Let’s see what one of our training examples looks like once encoded into a circuit. Do note that qubits without operations aren’t printed out.

SVGCircuit(x_train_circ[0])

Sample Training Data as Circuit | Tensorflow Quantum

Lastly, to use the newly created datapoints with Tensorflow Quantum, we have to convert the circuits into tensors.

x_train_tfcirc = tfq.convert_to_tensor(x_train_circ)
x_test_tfcirc = tfq.convert_to_tensor(x_test_circ)

Quantum Neural Network

Now that our data is encoded in a form that can flow through Tensorflow Quantum’s layers, we’ll begin to create our model. The type of QNN implemented in the paper uses two-qubit gates that connect every data qubit in the circuit to the readout qubit. At the end of the circuit, the expectation value of the readout qubit is measured and serves as the basis of our model’s classification.

Building Circuit Layers

Each layer uses $n$ instances of the same gate, with each of the data qubits acting on the readout qubit. The following class adds a layer of that gate to the circuit.

class CircuitLayerBuilder():
    def __init__(self, data_qubits, readout):
        self.data_qubits = data_qubits
        self.readout = readout

    def add_layer(self, circuit, gate, prefix):
        for i, qubit in enumerate(self.data_qubits):
            symbol = sympy.Symbol(prefix + '-' + str(i))
            circuit.append(gate(qubit, self.readout)**symbol)

Let’s see what it would look like in a sample circuit.

demo_builder = CircuitLayerBuilder(data_qubits = cirq.GridQubit.rect(4,1),
                                   readout=cirq.GridQubit(-1,-1))

circuit = cirq.Circuit()
demo_builder.add_layer(circuit, gate = cirq.XX, prefix='xx')
SVGCircuit(circuit)

Sample Circuit | Tensorflow Quantum

As you can see, all data qubits (4 in this case) are connected with the readout qubit via an Ising ($XX$) Coupling gate.

Creating Quantum Model

With the quantum layer class ready for use, we can create the quantum model for our QNN. Instead of only using a single Ising ($XX$) Coupling Gate, we’ll also add an Ising ($ZZ$) Coupling Gate for every data qubit. Each of these gates has its own parameter, which our model will learn to optimize later on.

Notice that we’re adding two initial gates to the readout qubit: an $X$ gate to flip it into the state $1$, and an $H$ gate to put it in superposition. After all the Ising Coupling gates, we append another $H$ gate to the readout qubit to bring it out of superposition, before finally doing a $Z$-measurement to obtain the expectation value.

def create_quantum_model():
    data_qubits = cirq.GridQubit.rect(4, 4)  # a 4x4 grid.
    readout = cirq.GridQubit(-1, -1)         # a single qubit at [-1,-1]
    circuit = cirq.Circuit()

    # Prepare the readout qubit.
    circuit.append(cirq.X(readout))
    circuit.append(cirq.H(readout))

    builder = CircuitLayerBuilder(
        data_qubits = data_qubits,
        readout=readout)

    # Then add layers (experiment by adding more).
    builder.add_layer(circuit, cirq.XX, "xx1")
    builder.add_layer(circuit, cirq.ZZ, "zz1")

    # Finally, prepare the readout qubit.
    circuit.append(cirq.H(readout))

    return circuit, cirq.Z(readout)

model_circuit, model_readout = create_quantum_model()

The model’s pretty huge since it has 17 qubits in total, and if we try to see how it looks when laid out on a flat circuit, it looks like the following:

The Quantum Neural Network Model

Wrapping Model-Circuit in TF-Quantum Model

To bring everything we’ve built together, the Tensorflow Quantum model circuit interfaces with a normal Keras Sequential model. We’ll start with an input layer which takes the encoded data from earlier and feeds it into the quantum circuit. Since the parameters of the quantum circuit are what we would like the model to learn, we’ll wrap the circuit in a tfq.layers.PQC layer, which returns the expectation value of the readout qubit.

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    tfq.layers.PQC(model_circuit, model_readout),
])

The PQC layer returns results within the range $[-1, 1]$, so the hinge loss is a suitable choice, although it requires us to encode the target labels as follows:

y_train_hinge = 2.0*y_train_nocon-1.0
y_test_hinge = 2.0*y_test-1.0

It should be noted that we could instead shift the model’s output range to $[0, 1]$ and treat it as the probability the model assigns to class 3 to be used with the usual tf.losses.BinaryCrossentropy loss function.
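As a small illustration of the two options just described, the sketch below computes the per-example hinge loss for labels in $\{-1, +1\}$ and shows the rescaling from $[-1, 1]$ to $[0, 1]$. The helper names and the example expectation values are hypothetical and are not part of the tutorial code.

```python
def to_hinge_label(is_three):
    """Map a boolean label to -1.0 / +1.0, mirroring 2.0 * y - 1.0."""
    return 2.0 * float(is_three) - 1.0


def hinge_loss(y_true, y_pred):
    """Per-example hinge loss: max(0, 1 - y_true * y_pred)."""
    return max(0.0, 1.0 - y_true * y_pred)


def to_probability(expectation):
    """Rescale an expectation value in [-1, 1] to [0, 1]."""
    return (expectation + 1.0) / 2.0


# Hypothetical readout expectations for two examples.
confident_right = hinge_loss(to_hinge_label(True), 1.0)
mildly_wrong = hinge_loss(to_hinge_label(False), 0.25)
```

Because the prediction never leaves $[-1, 1]$, an expectation close to 0 gives a loss close to 1 whatever the label, which is worth keeping in mind when reading the training logs.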

We then specify a hinge accuracy metric which handles $[-1, 1]$ as the target labels argument.

def hinge_accuracy(y_true, y_pred):
    y_true = tf.squeeze(y_true) > 0.0
    y_pred = tf.squeeze(y_pred) > 0.0
    result = tf.cast(y_true == y_pred, tf.float32)

    return tf.reduce_mean(result)

Lastly, we’ll do the usual model.compile(), passing it our loss function, optimizer, and the metrics to be recorded.

model.compile(
    loss=tf.keras.losses.Hinge(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=[hinge_accuracy])

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
pqc (PQC)                    (None, 1)                 32
=================================================================
Total params: 32
Trainable params: 32
Non-trainable params: 0
_________________________________________________________________
None
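The 32 parameters in the summary can be accounted for by counting the symbols that CircuitLayerBuilder creates: one per data qubit per layer. The sketch below reproduces only the naming scheme from add_layer (prefix, a dash, then the qubit index); the helper layer_symbols is a hypothetical stand-in that does not touch Cirq or sympy.

```python
def layer_symbols(prefix, num_data_qubits=16):
    """Names of the trainable symbols that one add_layer call would create."""
    return [f"{prefix}-{i}" for i in range(num_data_qubits)]


# One XX layer and one ZZ layer over the 16 data qubits.
symbols = layer_symbols("xx1") + layer_symbols("zz1")
print(len(symbols), symbols[0], symbols[-1])
```

Every additional layer over the same 4x4 grid adds another 16 parameters.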

Training Quantum Neural Network

With everything in place, we can begin training our model. Luckily, Tensorflow Quantum provides a default Differentiator which handles gradient computation through the quantum circuit, so we do not need to handle that manually. It is possible, however, to provide our own Differentiator, but we won’t be doing that here.

We’ll first decide the number of epochs, the batch size, and the number of examples to be used for training. As there are quite a lot of training images, we can use a subset of them to shorten training and simply watch the model learn.

EPOCHS = 3
BATCH_SIZE = 32

NUM_EXAMPLES = 500

# Train on a small subset to keep the simulation time manageable.
x_train_tfcirc_sub = x_train_tfcirc[:NUM_EXAMPLES]
y_train_hinge_sub = y_train_hinge[:NUM_EXAMPLES]

qnn_history = model.fit(
      x_train_tfcirc_sub, y_train_hinge_sub,
      batch_size=BATCH_SIZE,
      epochs=EPOCHS,
      verbose=1,
      validation_data=(x_test_tfcirc, y_test_hinge))

# Evaluate against the same [-1, 1] labels used for validation.
qnn_results = model.evaluate(x_test_tfcirc, y_test_hinge)

Train on 500 samples, validate on 1968 samples
Epoch 1/3
500/500 [==============================] - 301s 602ms/sample - loss: 0.9929 - hinge_accuracy: 0.6199 - val_loss: 0.9887 - val_hinge_accuracy: 0.6739
Epoch 2/3
500/500 [==============================] - 300s 600ms/sample - loss: 0.9849 - hinge_accuracy: 0.6777 - val_loss: 0.9808 - val_hinge_accuracy: 0.6774
Epoch 3/3
500/500 [==============================] - 301s 602ms/sample - loss: 0.9756 - hinge_accuracy: 0.6746 - val_loss: 0.9687 - val_hinge_accuracy: 0.6809
1968/1968 [==============================] - 34s 17ms/sample - loss: 0.9687 - hinge_accuracy: 0.6809

Note that the training accuracy reports the average over the epoch, while the validation accuracy is evaluated at the end of each epoch. Here, our model obtained about 0.68 validation hinge accuracy, and just like any other quantum or hybrid neural network, this value varies from trial to trial. The highest accuracy I have obtained with the exact same subset of data and circuit was 0.80.

Classical Neural Network

A classical neural network will definitely outperform this QNN, even if we use a very simple classical Convolutional Neural Network (CNN). The tutorial showed an example of a CNN based off LeNet from a Keras tutorial.

def create_classical_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Conv2D(32, [3, 3], activation='relu', input_shape=(28,28,1)))
    model.add(tf.keras.layers.Conv2D(64, [3, 3], activation='relu'))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.25))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(128, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(1))
    return model


model = create_classical_model()
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 24, 64)        18496
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 12, 12, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 12, 12, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 9216)              0
_________________________________________________________________
dense (Dense)                (None, 128)               1179776
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129
=================================================================
Total params: 1,198,721
Trainable params: 1,198,721
Non-trainable params: 0
_________________________________________________________________

model.fit(x_train,
          y_train,
          batch_size=128,
          epochs=1,
          verbose=1,
          validation_data=(x_test, y_test))

cnn_results = model.evaluate(x_test, y_test)

Train on 12049 samples, validate on 1968 samples
12049/12049 [==============================] - 7s 557us/sample - loss: 0.0397 - accuracy: 0.9854 - val_loss: 0.0053 - val_accuracy: 0.9990
1968/1968 [==============================] - 0s 144us/sample - loss: 0.0053 - accuracy: 0.9990

In just a single epoch, the classical CNN was able to achieve over 0.99 validation accuracy. Although it looks like a simple CNN, it is fed the original 28x28 pixel images and has 1.2M parameters. Hence it’s not really fair to compare it to our QNN.

To put them on a level playing field, we’ll create a 37-parameter classical neural network which takes the same downsampled, binarized 4x4 images as input.

def create_fair_classical_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(4,4,1)))
    model.add(tf.keras.layers.Dense(2, activation='relu'))
    model.add(tf.keras.layers.Dense(1))
    return model


model = create_fair_classical_model()
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten_1 (Flatten)          (None, 16)                0
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 3
=================================================================
Total params: 37
Trainable params: 37
Non-trainable params: 0
_________________________________________________________________
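The parameter counts in both model summaries can be checked by hand. The sketch below uses the standard weights-plus-biases formulas for dense and convolutional layers; the helper names are made up for this illustration.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus biases of a 2D convolution layer."""
    return kernel_h * kernel_w * in_channels * filters + filters


# "Fair" model: 16 inputs -> Dense(2) -> Dense(1).
fair_total = dense_params(16, 2) + dense_params(2, 1)

# Full CNN: two 3x3 convolutions, then Dense(128) and Dense(1).
cnn_total = (
    conv2d_params(3, 3, 1, 32)
    + conv2d_params(3, 3, 32, 64)
    + dense_params(12 * 12 * 64, 128)
    + dense_params(128, 1)
)
print(fair_total, cnn_total)
```

The totals match the 37 and 1,198,721 parameters reported by model.summary(), which puts the size gap with the 32-parameter QNN into perspective.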

model.fit(x_train_bin,
          y_train_nocon,
          batch_size=128,
          epochs=20,
          verbose=2,
          validation_data=(x_test_bin, y_test))

fair_nn_results = model.evaluate(x_test_bin, y_test)

Train on 11520 samples, validate on 1968 samples
Epoch 1/20
11520/11520 - 1s - loss: 0.7959 - accuracy: 0.4551 - val_loss: 0.7675 - val_accuracy: 0.4853
Epoch 2/20
11520/11520 - 0s - loss: 0.7290 - accuracy: 0.5030 - val_loss: 0.7130 - val_accuracy: 0.4868
Epoch 3/20
11520/11520 - 0s - loss: 0.6995 - accuracy: 0.5031 - val_loss: 0.6968 - val_accuracy: 0.4868
Epoch 4/20
11520/11520 - 0s - loss: 0.6918 - accuracy: 0.5034 - val_loss: 0.6925 - val_accuracy: 0.4878
Epoch 5/20
11520/11520 - 0s - loss: 0.6847 - accuracy: 0.5103 - val_loss: 0.6793 - val_accuracy: 0.4924
Epoch 6/20
11520/11520 - 0s - loss: 0.6553 - accuracy: 0.5845 - val_loss: 0.6425 - val_accuracy: 0.6316
Epoch 7/20
11520/11520 - 0s - loss: 0.5934 - accuracy: 0.6980 - val_loss: 0.5676 - val_accuracy: 0.7429
Epoch 8/20
11520/11520 - 0s - loss: 0.5298 - accuracy: 0.8106 - val_loss: 0.5105 - val_accuracy: 0.8216
Epoch 9/20
11520/11520 - 0s - loss: 0.4782 - accuracy: 0.8536 - val_loss: 0.4658 - val_accuracy: 0.8323
Epoch 10/20
11520/11520 - 0s - loss: 0.4386 - accuracy: 0.8595 - val_loss: 0.4318 - val_accuracy: 0.8333
Epoch 11/20
11520/11520 - 0s - loss: 0.4068 - accuracy: 0.8617 - val_loss: 0.4039 - val_accuracy: 0.8338
Epoch 12/20
11520/11520 - 0s - loss: 0.3811 - accuracy: 0.8635 - val_loss: 0.3813 - val_accuracy: 0.8338
Epoch 13/20
11520/11520 - 0s - loss: 0.3599 - accuracy: 0.8641 - val_loss: 0.3624 - val_accuracy: 0.8328
Epoch 14/20
11520/11520 - 0s - loss: 0.3421 - accuracy: 0.8648 - val_loss: 0.3465 - val_accuracy: 0.8328
Epoch 15/20
11520/11520 - 0s - loss: 0.3270 - accuracy: 0.8720 - val_loss: 0.3329 - val_accuracy: 0.8714
Epoch 16/20
11520/11520 - 0s - loss: 0.3140 - accuracy: 0.8874 - val_loss: 0.3210 - val_accuracy: 0.8714
Epoch 17/20
11520/11520 - 0s - loss: 0.3028 - accuracy: 0.8876 - val_loss: 0.3109 - val_accuracy: 0.8714
Epoch 18/20
11520/11520 - 0s - loss: 0.2931 - accuracy: 0.8876 - val_loss: 0.3019 - val_accuracy: 0.8714
Epoch 19/20
11520/11520 - 0s - loss: 0.2846 - accuracy: 0.8876 - val_loss: 0.2941 - val_accuracy: 0.8714
Epoch 20/20
11520/11520 - 0s - loss: 0.2771 - accuracy: 0.8873 - val_loss: 0.2872 - val_accuracy: 0.8714
1968/1968 [==============================] - 0s 78us/sample - loss: 0.2872 - accuracy: 0.8714

Unsurprisingly, the model performed better and was arguably more stable than the QNN. The data is very much classical, so it’s reasonable that a classical neural network would outperform a quantum one.

qnn_accuracy = qnn_results[1]
cnn_accuracy = cnn_results[1]
fair_nn_accuracy = fair_nn_results[1]

sns.barplot(x=["Quantum", "Classical, full", "Classical, fair"],
            y=[qnn_accuracy, cnn_accuracy, fair_nn_accuracy])

Experiments

After learning how to create a QNN from the tutorial, I decided to play around with the number of parameters in the model. Instead of using only 1 Ising $(XX)$ Coupling Gate and 1 Ising $(ZZ)$ Coupling Gate, I’ve decided to use 2 of each kind, which adds an additional 32 parameters to the model, for a total of 64.

def create_quantum_model():
    data_qubits = cirq.GridQubit.rect(4, 4)
    readout = cirq.GridQubit(-1, -1)
    circuit = cirq.Circuit()

    circuit.append(cirq.X(readout))
    circuit.append(cirq.H(readout))

    builder = CircuitLayerBuilder(
        data_qubits = data_qubits,
        readout=readout)

    builder.add_layer(circuit, cirq.XX, "xx1")
    builder.add_layer(circuit, cirq.XX, "xx2")
    builder.add_layer(circuit, cirq.ZZ, "zz1")
    builder.add_layer(circuit, cirq.ZZ, "zz2")

    circuit.append(cirq.H(readout))

    return circuit, cirq.Z(readout)

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
pqc (PQC)                    (None, 1)                 64
=================================================================
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
None

I have also used a total of 1000 sample images instead of only 500, just for fun. Everything else is kept identical.

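EPOCHS = 3
BATCH_SIZE = 32

NUM_EXAMPLES = 1000

# Re-slice the training subset with the new NUM_EXAMPLES.
x_train_tfcirc_sub = x_train_tfcirc[:NUM_EXAMPLES]
y_train_hinge_sub = y_train_hinge[:NUM_EXAMPLES]

qnn_history = model.fit(
      x_train_tfcirc_sub, y_train_hinge_sub,
      batch_size=BATCH_SIZE,
      epochs=EPOCHS,
      verbose=1,
      validation_data=(x_test_tfcirc, y_test_hinge))

qnn_results = model.evaluate(x_test_tfcirc, y_test_hinge)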

Train on 1000 samples, validate on 1968 samples
Epoch 1/3
1000/1000 [==============================] - 1693s 2s/sample - loss: 0.9964 - hinge_accuracy: 0.6748 - val_loss: 0.9851 - val_hinge_accuracy: 0.7999
Epoch 2/3
1000/1000 [==============================] - 1704s 2s/sample - loss: 0.9271 - hinge_accuracy: 0.8066 - val_loss: 0.8194 - val_hinge_accuracy: 0.7964
Epoch 3/3
1000/1000 [==============================] - 1593s 2s/sample - loss: 0.6629 - hinge_accuracy: 0.7988 - val_loss: 0.5120 - val_hinge_accuracy: 0.7964
1968/1968 [==============================] - 50s 25ms/sample - loss: 0.5120 - hinge_accuracy: 0.7964

As you can see, the validation hinge accuracy this time is about 0.80, with a much lower validation loss, both better than our previous 32-parameter model. It should be noted again that these values change from trial to trial, so a single attempt does not fully represent the model’s performance.

Closing Remarks

Issues with Quantum Neural Network

As discussed in the previous post, there are still issues regarding QNNs and Quantum Computers in general. Gradients of the quantum layers cannot be obtained through ordinary backpropagation and need dedicated techniques such as the parameter shift rule, and sometimes the circuit’s gradient vanishes as the model learns. There is definitely a lot of room for improvement, as well as for research into whether Quantum Neural Networks can overcome the limits of classical neural networks.

Conclusion

It’s been a ride learning how the Quantum Neural Network was implemented. It is very different from how a classical neural network is implemented, and there are many factors to consider since the capabilities of Quantum Computers and their simulators are still limited. However, it was still a mind-blowing experience to get a glimpse of the future potential of Quantum Computers and what they can offer to the Machine Learning domain.

Credits

Portions of this page are modifications based on work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License.

MNIST Classification with Hybrid Quantum-Classical Neural Network

2020-07-13T00:00:00+10:00

Qiskit is IBM’s open-source framework for quantum computing, which gives users access to both simulators and real Quantum Computers. Today’s Quantum Computers are still in the Noisy Intermediate-Scale Quantum (NISQ) era and are very sensitive to any form of interference. Unlike real Quantum Computers, the simulators provided by Qiskit aren’t noisy and are great for prototyping.

Hybrid Quantum-Classical Neural Network

Qiskit and PyTorch provide a way to connect classical neural networks with quantum circuits, thus creating a hybrid quantum-classical NN. A tutorial is provided in the Qiskit textbook, and it will be the basis of the code shown in this post.

Forward Pass

How a hybrid NN works in forward pass is shown in the following diagram:

Hybrid Quantum-Classical Neural Network | Qiskit Textbook

As shown above, the neural network has its usual classical layers at the start, a quantum “layer” in between, followed by classical layers again. It is the parameters of the quantum layer that the neural network will learn to optimize.

The layers used in the classical part are arbitrary; however, the output of the classical layers at the start should conform to the input of the quantum layer (which we’ll see later in code). Similarly, the output of the quantum layer should be in line with the input of the following classical layer.

Backward Pass

This raises a question for backpropagation. The derivative of the quantum layer is required to perform gradient descent, a critical step in optimizing the model. To tackle the problem, we’ll use the parameter shift rule to find its gradient, which is calculated as follows:

Gradient of Quantum Layer | Qiskit Textbook

The parameter shift rule is analogous to how finite differences work: shift the parameter and calculate the change in the output with respect to that shift. Details won’t be discussed here.
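To make the comparison with finite differences concrete, here is a small NumPy sketch. It assumes a toy expectation of the form $f(\theta) = \cos\theta$ (the $\langle Z \rangle$ value of a qubit rotated from $|0\rangle$ by $RY(\theta)$), which is an illustrative choice and not the circuit used later in this post; the function names are made up for the example. For a single rotation gate like this, shifting by $\pm\pi/2$ and halving the difference gives the exact derivative, while a finite difference is only an approximation.

```python
import numpy as np

def expectation(theta):
    # Assumed toy model: <Z> after applying RY(theta) to |0> equals cos(theta).
    return np.cos(theta)

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    # Parameter shift rule for a single rotation gate:
    # df/dtheta = [f(theta + s) - f(theta - s)] / 2 with s = pi/2.
    return (f(theta + shift) - f(theta - shift)) / 2

def finite_difference_grad(f, theta, eps=1e-3):
    # Central finite difference, accurate only up to O(eps^2).
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta = 0.7
exact = -np.sin(theta)  # analytic derivative of cos(theta)
ps = parameter_shift_grad(expectation, theta)
fd = finite_difference_grad(expectation, theta)
print(exact, ps, fd)
```

Note that the backward pass further down uses the plain difference of the two shifted expectations without the factor of 1/2, which only rescales the gradient by a constant.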

MNIST Classification

MNIST is a go-to dataset for image classification as it is simple for a beginner. Similarly, we’ll be using MNIST to test out how our hybrid NN performs. In this case however, we’ll be only classifying 2 digits instead of the usual 10.

Code: Classifying 0s and 1s

Quantum Circuit

As mentioned above, we’ll create a quantum circuit whose parameter we’ll let the neural network tweak as it learns. The example given in the textbook is a very simple 1-qubit circuit with two gates, a Hadamard and an $RY$ gate. An $RY$ rotation has a parameter called $\theta$, which is precisely the parameter to be optimized.

Quantum Circuit | Qiskit Textbook

After going through the two gates, the qubit is then measured. It is the result of this measurement which we’ll use as the final output of the neural network. A 1-qubit measurement has only two possible outputs, and in our case these correspond to the two possible classes which an image can belong to. To measure the $z$-basis output, we’ll calculate the $\sigma_z$ expected value the same way we would calculate an expected value in statistics.

\[\sigma_z = \sum_{i} z_i \cdot p(z_i)\]

Later, we’ll tell the circuit how many shots, or trials, we’d like to make.
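As a quick illustration of the sum above, the sketch below computes the expected value from a dictionary of measurement counts. The counts are hypothetical, and `expectation_from_counts` is a helper invented for this example; the outcomes $z_i$ are taken to be the measured bit values 0 and 1, as in the class that follows.

```python
import numpy as np

def expectation_from_counts(counts):
    # counts maps measured bitstrings ('0' or '1' for one qubit) to how often they occurred.
    states = np.array([int(s, 2) for s in counts], dtype=float)
    freqs = np.array(list(counts.values()), dtype=float)
    probabilities = freqs / freqs.sum()  # p(z_i), estimated from the shots
    return float(np.sum(states * probabilities))

# Hypothetical outcome of 100 shots
counts = {'0': 46, '1': 54}
print(expectation_from_counts(counts))
```

More shots give a less noisy estimate of the probabilities, at the cost of a longer run.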

Let’s implement the circuit in Qiskit!

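import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Function
from torchvision import datasets, transforms

import qiskit  # the code below targets the pre-1.0 Qiskit API (qiskit.execute, qiskit.Aer)

class QuantumCircuit: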
    def __init__(self, n_qubits, backend, shots):
        # --- Circuit definition ---
        self._circuit = qiskit.QuantumCircuit(n_qubits)

        all_qubits = [i for i in range(n_qubits)]
        self.theta = qiskit.circuit.Parameter('theta')

        self._circuit.h(all_qubits)
        self._circuit.barrier()
        self._circuit.ry(self.theta, all_qubits)

        self._circuit.measure_all()
        # ---------------------------

        self.backend = backend
        self.shots = shots

    def run(self, thetas):
        job = qiskit.execute(self._circuit,
                             self.backend,
                             shots = self.shots,
                             parameter_binds = [{self.theta: theta} for theta in thetas])
        result = job.result().get_counts(self._circuit)

        counts = np.array(list(result.values()))
        states = np.array(list(result.keys())).astype(float)

        # Compute probabilities for each state
        probabilities = counts / self.shots
        # Get state expectation
        expectation = np.sum(states * probabilities)

        return np.array([expectation])

Testing Quantum Circuit

Just for fun, the textbook gave a test implementation of the circuit if we were to run it as usual. We’ll specify that we’ll need 1 qubit, provide the simulator to be used, give it 100 shots and use $\pi$ as our angle.

simulator = qiskit.Aer.get_backend('qasm_simulator')

circuit = QuantumCircuit(1, simulator, 100)
print('Expected value for rotation pi: {}'.format(circuit.run([np.pi])[0]))
circuit._circuit.draw(output='mpl')

Expected value for rotation pi: 0.5
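The printed value is an estimate from 100 shots, so it will fluctuate between runs. As a sanity check, the ideal value can be worked out with plain NumPy by multiplying the gate matrices by hand. This is a sketch independent of Qiskit, and the helper names are made up for the example. For this Hadamard-then-$RY$ circuit, the probability of measuring 1 works out to $(1 + \sin\theta)/2$.

```python
import numpy as np

def ry(theta):
    # Matrix of the RY(theta) rotation gate.
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate

def ideal_expectation(theta):
    # Start in |0>, apply H and then RY(theta), and read off P(measuring 1).
    state = ry(theta) @ H @ np.array([1.0, 0.0])
    probabilities = np.abs(state) ** 2
    # Expected measured bit: 0 * P(0) + 1 * P(1)
    return float(probabilities[1])

print(ideal_expectation(np.pi))  # the value the 100-shot estimate fluctuates around
```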

Quantum-Classical Class

After creating the designated circuit, we can utilize it to create a hybrid class/layer with PyTorch. We specify the forward pass to be pretty much running the circuit, and the backward pass to be the parameter shift rule we discussed earlier.

class HybridFunction(Function):
    @staticmethod
    def forward(ctx, input, quantum_circuit, shift):
        """ Forward pass computation """
        ctx.shift = shift
        ctx.quantum_circuit = quantum_circuit

        expectation_z = ctx.quantum_circuit.run(input[0].tolist())
        result = torch.tensor([expectation_z])
        ctx.save_for_backward(input, result)

        return result

    @staticmethod
    def backward(ctx, grad_output):
        """ Backward pass computation """
        input, expectation_z = ctx.saved_tensors
        input_list = np.array(input.tolist())

        shift_right = input_list + np.ones(input_list.shape) * ctx.shift
        shift_left = input_list - np.ones(input_list.shape) * ctx.shift

        gradients = []
        for i in range(len(input_list)):
            expectation_right = ctx.quantum_circuit.run(shift_right[i])
            expectation_left  = ctx.quantum_circuit.run(shift_left[i])

            gradient = torch.tensor([expectation_right]) - torch.tensor([expectation_left])
            gradients.append(gradient)
        gradients = np.array([gradients]).T
        return torch.tensor([gradients]).float() * grad_output.float(), None, None

With that, we can create an actual PyTorch layer that inherits from nn.Module and simply applies whatever we’ve implemented in HybridFunction.

class Hybrid(nn.Module):
    def __init__(self, backend, shots, shift):
        super(Hybrid, self).__init__()
        self.quantum_circuit = QuantumCircuit(1, backend, shots)
        self.shift = shift

    def forward(self, input):
        return HybridFunction.apply(input, self.quantum_circuit, self.shift)

Loading Data

Training Dataset

As mentioned, we’ll use MNIST but only two of its classes, specifically 0s and 1s. We’ll load up the dataset from PyTorch datasets for training and testing purposes. Only 100 samples per class were used for training and 50 per class for testing in the example.

n_samples = 100

X_train = datasets.MNIST(root='./data', train=True, download=True,
                         transform=transforms.Compose([transforms.ToTensor()]))

# Leaving only labels 0 and 1
idx = np.append(np.where(X_train.targets == 0)[0][:n_samples],
                np.where(X_train.targets == 1)[0][:n_samples])

X_train.data = X_train.data[idx]
X_train.targets = X_train.targets[idx]

train_loader = torch.utils.data.DataLoader(X_train, batch_size=1, shuffle=True)

n_samples_show = 6

data_iter = iter(train_loader)
fig, axes = plt.subplots(nrows=1, ncols=n_samples_show, figsize=(10, 3))

while n_samples_show > 0:
    images, targets = next(data_iter)

    axes[n_samples_show - 1].imshow(images[0].numpy().squeeze(), cmap='gray')
    axes[n_samples_show - 1].set_xticks([])
    axes[n_samples_show - 1].set_yticks([])
    axes[n_samples_show - 1].set_title("Labeled: {}".format(targets.item()))

    n_samples_show -= 1

Testing Dataset

n_samples = 50

X_test = datasets.MNIST(root='./data', train=False, download=True,
                        transform=transforms.Compose([transforms.ToTensor()]))

idx = np.append(np.where(X_test.targets == 0)[0][:n_samples],
                np.where(X_test.targets == 1)[0][:n_samples])

X_test.data = X_test.data[idx]
X_test.targets = X_test.targets[idx]

test_loader = torch.utils.data.DataLoader(X_test, batch_size=1, shuffle=True)

Hybrid Neural Network

With most of the things in place, we can begin to create our model. The classical layers we’ll use are normal convolution, dropout and linear layers. Notice that the final linear layer fc2 only has 1 output since our quantum layer has only 1 parameter. Also, the final output of the forward pass concatenates the two probabilities into one tensor which we’ll later pass to our loss function.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.dropout = nn.Dropout2d()
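        # 28x28 input -> 24x24 (conv1) -> 20x20 (conv2) -> 10x10 (max pool),
        # so each image flattens to 64 * 10 * 10 = 6400 features
        self.fc1 = nn.Linear(6400, 64)
        self.fc2 = nn.Linear(64, 1)
        self.hybrid = Hybrid(qiskit.Aer.get_backend('qasm_simulator'), 100, np.pi / 2)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout(x)
        # Flatten per image; view(-1, 256) would split one image into 25 rows
        x = x.view(x.size(0), -1)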
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        x = self.hybrid(x)
        return torch.cat((x, 1 - x), -1)

Training Neural Network

Finally, we’ll train our model just as we would train a normal image classification model. We’ve implemented all the backward pass processes in the quantum layer, so doing loss.backward() would correspond to the parameter shift rule previously.

We’ll train for 20 epochs and record the loss after each iteration.

model = Net()
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_func = nn.NLLLoss()

epochs = 20
loss_list = []

model.train()
for epoch in range(epochs):
    total_loss = []
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        # Forward pass
        output = model(data)
        # Calculating loss
        loss = loss_func(output, target)
        # Backward pass
        loss.backward()
        # Optimize the weights
        optimizer.step()

        total_loss.append(loss.item())
    loss_list.append(sum(total_loss)/len(total_loss))
    print('Training [{:.0f}%]\tLoss: {:.4f}'.format(
        100. * (epoch + 1) / epochs, loss_list[-1]))

Training [5%]	Loss: -0.6274
Training [10%]	Loss: -0.7605
Training [15%]	Loss: -0.7898
Training [20%]	Loss: -0.8343
Training [25%]	Loss: -0.8573
Training [30%]	Loss: -0.8514
Training [35%]	Loss: -0.8776
Training [40%]	Loss: -0.8414
Training [45%]	Loss: -0.8811
Training [50%]	Loss: -0.8226
Training [55%]	Loss: -0.8174
Training [60%]	Loss: -0.8588
Training [65%]	Loss: -0.8629
Training [70%]	Loss: -0.8767
Training [75%]	Loss: -0.8635
Training [80%]	Loss: -0.8688
Training [85%]	Loss: -0.8795
Training [90%]	Loss: -0.9021
Training [95%]	Loss: -0.8732
Training [100%]	Loss: -0.8694
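The negative loss values may look odd at first. A negative log-likelihood loss such as nn.NLLLoss expects log-probabilities and simply returns the negated entry of the target class, so feeding it raw probabilities in $[0, 1]$ yields values between $-1$ and $0$. The NumPy sketch below mimics that computation for illustration; `nll_loss` is a stand-in written for this example, and the output values are hypothetical.

```python
import numpy as np

def nll_loss(outputs, targets):
    # Mean of -outputs[n, targets[n]]: what a negative log-likelihood loss
    # computes when it is handed `outputs` directly (no log is applied inside).
    rows = np.arange(len(targets))
    return float(-np.mean(outputs[rows, targets]))

# Hypothetical model output for one image, shaped like [x, 1 - x] from Net.forward
outputs = np.array([[0.87, 0.13]])
targets = np.array([0])

print(nll_loss(outputs, targets))          # negative, since 0.87 is a probability
print(nll_loss(np.log(outputs), targets))  # non-negative once log-probabilities are used
```

Read this way, a loss approaching $-1$ means the model assigns a probability close to 1 to the correct class.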

plt.plot(loss_list)
plt.title('Hybrid NN Training Convergence')
plt.xlabel('Training Iterations')
plt.ylabel('Neg Log Likelihood Loss')

Text(0, 0.5, 'Neg Log Likelihood Loss')

Testing Neural Network

As seen in the plot above, our loss has gradually decreased and it seems that the model has learned well. To see how it fares, let’s test it out with the test data we set apart earlier.

model.eval()
with torch.no_grad():

    correct = 0
    total_loss = []  # start fresh so training losses don't leak into the test average
    for batch_idx, (data, target) in enumerate(test_loader):
        output = model(data)

        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

        loss = loss_func(output, target)
        total_loss.append(loss.item())

    print('Performance on test data:\n\tLoss: {:.4f}\n\tAccuracy: {:.1f}%'.format(
        sum(total_loss) / len(total_loss),
        correct / len(test_loader) * 100)
        )

Performance on test data:
	Loss: -0.8713
	Accuracy: 100.0%

Notice that the model has achieved 100% accuracy on the small test dataset, which is reasonable given how different 0s and 1s look from each other.

n_samples_show = 6
count = 0
fig, axes = plt.subplots(nrows=1, ncols=n_samples_show, figsize=(10, 3))

model.eval()
with torch.no_grad():
    for batch_idx, (data, target) in enumerate(test_loader):
        if count == n_samples_show:
            break
        output = model(data)

        pred = output.argmax(dim=1, keepdim=True)

        axes[count].imshow(data[0].numpy().squeeze(), cmap='gray')

        axes[count].set_xticks([])
        axes[count].set_yticks([])
        axes[count].set_title('Predicted {}'.format(pred.item()))

        count += 1

Code: Classifying 3s and 7s

With what the model can achieve, I tried to change the dataset used. Instead of using 0s and 1s, which look fairly different from each other, I replaced them with 3s and 7s to see how the model performs. Everything except the data loading is pretty much identical.

Loading Data

Training Dataset

Here we’ll specify that we want 3s and 7s, and encode their labels to 0 and 1 respectively.

n_samples = 100

X_train = datasets.MNIST(root='./data', train=True, download=True,
                         transform=transforms.Compose([transforms.ToTensor()]))

# Leaving only labels 3 and 7
idx = np.append(np.where(X_train.targets == 3)[0][:n_samples],
                np.where(X_train.targets == 7)[0][:n_samples])

X_train.data = X_train.data[idx]
X_train.targets = X_train.targets[idx]
# Encode into 0 and 1
X_train.targets = (X_train.targets == 7).long()  # 3 -> 0, 7 -> 1

train_loader = torch.utils.data.DataLoader(X_train, batch_size=1, shuffle=True)

n_samples_show = 6

data_iter = iter(train_loader)
fig, axes = plt.subplots(nrows=1, ncols=n_samples_show, figsize=(10, 3))

while n_samples_show > 0:
    images, targets = next(data_iter)

    axes[n_samples_show - 1].imshow(images[0].numpy().squeeze(), cmap='gray')
    axes[n_samples_show - 1].set_xticks([])
    axes[n_samples_show - 1].set_yticks([])
    axes[n_samples_show - 1].set_title("Labeled: {}".format(targets.item()))

    n_samples_show -= 1

Testing Dataset

The test set goes through the exact same process of selecting 3s and 7s and encoding the labels.

n_samples = 50

X_test = datasets.MNIST(root='./data', train=False, download=True,
                        transform=transforms.Compose([transforms.ToTensor()]))

idx = np.append(np.where(X_test.targets == 3)[0][:n_samples],
                np.where(X_test.targets == 7)[0][:n_samples])

X_test.data = X_test.data[idx]
X_test.targets = X_test.targets[idx]
X_test.targets = (X_test.targets == 7).long()  # 3 -> 0, 7 -> 1

test_loader = torch.utils.data.DataLoader(X_test, batch_size=1, shuffle=True)

Training Neural Network

I used the exact same training loop as before.

model = Net()
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_func = nn.NLLLoss()

epochs = 20
loss_list = []

model.train()
for epoch in range(epochs):
    total_loss = []
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_func(output, target)
        loss.backward()
        optimizer.step()

        total_loss.append(loss.item())
    loss_list.append(sum(total_loss)/len(total_loss))
    print('Training [{:.0f}%]\tLoss: {:.4f}'.format(
        100. * (epoch + 1) / epochs, loss_list[-1]))

Training [5%]	Loss: -0.4957
Training [10%]	Loss: -0.5000
Training [15%]	Loss: -0.4913
Training [20%]	Loss: -0.5009
Training [25%]	Loss: -0.5024
Training [30%]	Loss: -0.4997
Training [35%]	Loss: -0.6483
Training [40%]	Loss: -0.6767
Training [45%]	Loss: -0.6585
Training [50%]	Loss: -0.6675
Training [55%]	Loss: -0.7013
Training [60%]	Loss: -0.7226
Training [65%]	Loss: -0.7191
Training [70%]	Loss: -0.7031
Training [75%]	Loss: -0.7167
Training [80%]	Loss: -0.7193
Training [85%]	Loss: -0.7220
Training [90%]	Loss: -0.7300
Training [95%]	Loss: -0.7376
Training [100%]	Loss: -0.7249

Somehow, once it started improving, the model’s loss converged a bit more smoothly than it did before, although it sat at around $-0.5$ for the first six epochs before jumping at the seventh.

plt.plot(loss_list)
plt.title('Hybrid NN Training Convergence')
plt.xlabel('Training Iterations')
plt.ylabel('Neg Log Likelihood Loss')

Text(0, 0.5, 'Neg Log Likelihood Loss')

Testing Neural Network

Similarly, I tested the results the same way as before, except for decoding 0 and 1 back into 3s and 7s just for convenience.

model.eval()
with torch.no_grad():

    correct = 0
    total_loss = []  # start fresh so training losses don't leak into the test average
    for batch_idx, (data, target) in enumerate(test_loader):
        output = model(data)

        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

        loss = loss_func(output, target)
        total_loss.append(loss.item())

    print('Performance on test data:\n\tLoss: {:.4f}\n\tAccuracy: {:.1f}%'.format(
        sum(total_loss) / len(total_loss),
        correct / len(test_loader) * 100)
        )

Performance on test data:
	Loss: -0.7454
	Accuracy: 91.0%

n_samples_show = 8
count = 0
fig, axes = plt.subplots(nrows=1, ncols=n_samples_show, figsize=(10, 3))

model.eval()
with torch.no_grad():
    for batch_idx, (data, target) in enumerate(test_loader):
        if count == n_samples_show:
            break
        output = model(data)

        pred = output.argmax(dim=1, keepdim=True)

        axes[count].imshow(data[0].numpy().squeeze(), cmap='gray')

        axes[count].set_xticks([])
        axes[count].set_yticks([])
        axes[count].set_title('Predicted {}'.format(3 if pred.item() == 0 else 7))

        count += 1

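Notice that the model achieved a lower test accuracy (91%) than it did on 0s and 1s. There are numerous possible reasons for this, one plausible candidate being that 3s and 7s look more alike, but the details won’t matter here.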

Closing Remarks

Benefits of Hybrid Neural Networks

All the circuits we’ve used are classically simulatable, which means we’re not leveraging the potential of quantum computation, such as entanglement. The authors of the textbook also mentioned that the model would’ve trained equally well, or even better, without the quantum layer.

Without utilizing quantum phenomena, the results will probably be similar to those of a normal, classical neural network. For now, however, we can still test out these kinds of networks to see whether they offer any benefits. It would require a more sophisticated quantum layer to possibly achieve a greater “quantum advantage”.

Experienced Issues

Although the results look reasonable here in this post, I did get questionable results in one of my earliest tries. Despite using a simulator, it seemed that in one trial the network didn’t learn, or the qubit’s results were just very unlucky after each measurement. The loss stayed at around $-0.5$ after 20 epochs, and the network achieved only 50% accuracy during testing, no better than a random guess. Here is the loss graph for the network I’ve just mentioned:

Fluctuating Loss of a Hybrid NN

Overall Results

It’s very amusing to see how we can fuse quantum and classical layers together to create such neural networks, even if there’s no particular advantage of doing so. Regardless, we can always move up from here and apply the simpler concepts which the textbook has shown and see whether we can put hybrid NN to good use in the future!

Credits

Asfaw, A., Bello, L., Ben-Haim, Y., Bravyi, S., Capelluto, L., Vazquez, A. C., . . . Wootton, J. (2020). Learn Quantum Computation Using Qiskit. Retrieved from http://community.qiskit.org/textbook

Color Restoration with Generative Adversarial Network

2020-07-10T00:00:00+10:00

Fast.ai has a two-part Deep Learning Course, the first being Practical Deep Learning for Coders, and the second being Deep Learning from the Foundations, each taking a different approach and intended for a different audience. In the 7th lecture of Part 1, Jeremy Howard taught a lot about modern architectures such as Residual Network (ResNet), U-Net, and Generative Adversarial Network (GAN).

Generative Adversarial Networks

GANs were first invented by Ian Goodfellow, one of the modern figures in the Deep Learning world. GANs can be used for various tasks such as Style Transfer, Pix2Pix, CycleGAN, etc. What I’ll be experimenting with today is Image Restoration.

Style Transfer Result | Tensorflow Tutorials

Image Restoration

There are different elements of an image which one can attempt to restore. The example shown by Jeremy was restoring low resolution images into higher resolution images, which produces something like the following:

Image Restoration Result | fast.ai

Jeremy also mentioned that GANs are capable of restoring not only an image’s resolution, but also other elements, such as clearing JPEG-like artifacts, removing different kinds of noise, or even restoring colors. With that, I was immediately hooked: I finished the lecture and tried out what I’d learned, and thus came this project.

Color Restoration

Instead of turning low resolution images into high resolution images, I wanted to build a network that can recolor black-and-white images. The approach is still similar in terms of how a GAN works, except with a few tweaks which we’ll discuss further down.

Code Source

Since it is the first time I’ve worked with generative networks like GANs, I decided to base my code heavily on a fast.ai notebook, lesson7-superres-gan.ipynb.

The code provided below isn’t complete and only the important blocks of code were taken.

The GAN Approach

A GAN is sort of like a game between two entities, one being the artist (formally generator) and the other being the critic (formally discriminator). Both of them have their own respective roles: the artist has to produce an image, while the critic has to decide whether the image produced by the artist is a real image or a fake/generated image.

The two of them have to get better at what they do: the critic has to get better at differentiating real from fake images, while the artist has to improve the images it produces to fool the critic. Applying this concept to a task like image restoration works in much the same way. That is, the artist has to produce a higher resolution image from the low resolution image, while the critic learns to distinguish the generated images from the real high resolution ones.

Now, to apply that to color restoration, instead of differentiating low resolution from high resolution images, the critic has to tell artist-generated images apart from real colored images, and while doing so the artist has to learn how to better recolor the images it produces to outsmart the critic.

Data Modification

In order to build a network that is able both to recolor images and to tell real from fake images, we need to provide it with two sets of data, namely colored images and their corresponding black-and-white images. To do so, we used the Oxford-IIIT Pet dataset, whose images are colored, and created a function to grayscale the images. Jeremy called the function for this task a crappifier, which in our case only grayscales the images. Once we have our colored and grayscaled images, we can use them later to train the network.

from PIL import Image

class crappifier(object):
    def __init__(self, path_lr, path_hr):
        self.path_lr = path_lr  # destination folder for the grayscaled images
        self.path_hr = path_hr  # source folder of the colored images

    def __call__(self, fn, i):
        # `i` is an index argument that is not used here
        dest = self.path_lr/fn.relative_to(self.path_hr)
        dest.parent.mkdir(parents=True, exist_ok=True)
        img = Image.open(fn)
        img = img.convert('L')  # 'L' mode is 8-bit grayscale
        img.save(dest, quality=100)
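For intuition about what the grayscale step throws away, the sketch below applies the ITU-R 601-2 luma weights, which Pillow documents as the transform behind converting RGB to 'L', to a tiny hypothetical image using NumPy. It is an illustration only: `to_grayscale` is a made-up helper, and its rounding may differ slightly from Pillow’s own integer arithmetic.

```python
import numpy as np

def to_grayscale(rgb):
    # rgb: array of shape (height, width, 3) with values in 0..255.
    # Weighted sum of the channels using the ITU-R 601-2 luma weights.
    weights = np.array([0.299, 0.587, 0.114])
    return np.clip(np.rint(rgb @ weights), 0, 255).astype(np.uint8)

# A hypothetical 1x3 image: pure red, pure green, pure blue
pixels = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=float)
gray = to_grayscale(pixels)
print(gray)
```

Three very different colors collapse to three single brightness values, and many distinct colors share the same brightness, which is why the generator has to learn plausible colors from context rather than recover them exactly.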

Grayscaled Images

Pre-train Generator/Artist

Now, we will train our generator first before using it in a GAN. The architecture we’ll use is a U-Net with ResNet34 as its base model, and all it’s trained to do is recolor the images so they look more like their colored counterparts. Notice also that we’re using Mean Squared Error, or MSELossFlat, as our loss function.

arch = models.resnet34
loss_gen = MSELossFlat()

learn_gen = unet_learner(data_gen, arch, wd=wd, blur=True, norm_type=NormType.Weight,
                         self_attention=True, y_range=y_range, loss_func=loss_gen)

Once we have the generative model, we can train the model head for a few epochs, unfreeze, and train for several more epochs.

learn_gen.fit_one_cycle(2, pct_start=0.8)

epoch	train_loss	valid_loss	time
0	0.109306	0.111038	02:37
1	0.096312	0.102479	02:40

learn_gen.unfreeze()

learn_gen.fit_one_cycle(3, slice(1e-6,1e-3))

epoch	train_loss	valid_loss	time
0	0.089206	0.100583	02:41
1	0.087562	0.094716	02:44
2	0.086839	0.094106	02:45

The resulting generated images after a total of 5 epochs look like the following:

Generated Images

As you can see, the generator did poorly on some areas of the image, while it did great in others. Regardless, we’ll save those generated images to be used as the fake images dataset for the critic to learn from.

Train Discriminator/Critic

After generating two sets of images, we’ll feed the data to a critic and let it learn to distinguish real images from artist-generated images. Below is a sample batch of data, where the real images are labelled simply as images and the generated ones as image_gen.

Real and Generated Images

To create the critic, we’ll be using fast.ai’s built-in gan_critic, which is just a simple Convolutional Neural Network with residual blocks. Unlike the generator, the loss function we’ll use is Binary Cross Entropy, since we only have two possible predictions, and also wrap it with AdaptiveLoss.

loss_critic = AdaptiveLoss(nn.BCEWithLogitsLoss())

learn_critic = Learner(data_crit, gan_critic(), metrics=accuracy_thresh_expand, loss_func=loss_critic, wd=wd)

Once the Learner has been created, we can proceed with training the critic for several epochs.

learn_critic.fit_one_cycle(6, 1e-3)

epoch	train_loss	valid_loss	accuracy_thresh_expand	time
0	0.170356	0.105095	0.958804	03:34
1	0.041809	0.022646	0.992365	03:27
2	0.026520	0.013480	0.996638	03:26
3	0.011859	0.005585	0.999117	03:25
4	0.012674	0.005655	0.999288	03:25
5	0.013518	0.005413	0.999288	03:24

GAN

With both the generator and the critic pretrained, we can finally use them together and commence the game of outsmarting each other found in GANs. We will be utilizing AdaptiveGANSwitcher, which basically switches from generator to critic, or vice versa, when the loss goes below a certain threshold.

switcher = partial(AdaptiveGANSwitcher, critic_thresh=0.65)
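The switching rule can be pictured with a few lines of plain Python. This is a simplified sketch of the idea as I understand it, not fastai’s actual implementation: `next_mode` is a hypothetical helper, the losses are made up, and I assume that a player with no threshold hands over after every batch.

```python
def next_mode(gen_mode, last_loss, gen_thresh=None, critic_thresh=0.65):
    """Return True to train the generator on the next batch, False for the critic."""
    thresh = gen_thresh if gen_mode else critic_thresh
    # With no threshold, switch after every batch; otherwise switch only once
    # the current player's loss has dropped below its threshold.
    should_switch = thresh is None or last_loss < thresh
    return (not gen_mode) if should_switch else gen_mode

# Hypothetical per-batch losses, starting with the generator's turn
gen_mode = True
schedule = []
for loss in [3.1, 0.90, 0.70, 0.60, 2.9]:
    schedule.append('generator' if gen_mode else 'critic')
    gen_mode = next_mode(gen_mode, loss)
print(schedule)
```

Under these assumptions the critic keeps training until its loss falls below 0.65, and only then does the generator get another turn.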

Wrapping both the generator and the critic inside a GAN learner:

learn = GANLearner.from_learners(learn_gen, learn_critic, weights_gen=(1.,50.), show_img=False, switcher=switcher,
                                 opt_func=partial(optim.Adam, betas=(0.,0.99)), wd=wd)
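The weights_gen=(1.,50.) argument is worth a closer look. The sketch below assumes, as I understand it, that the two numbers scale the adversarial term (how well the critic was fooled) and the pixel-wise term (the MSE against the colored target) of the generator’s loss. The function and the loss values are hypothetical and only meant to show the weighting.

```python
def generator_loss(critic_term, pixel_term, weights_gen=(1.0, 50.0)):
    # Weighted sum of the adversarial term and the pixel-wise term; the large
    # second weight keeps the small MSE values from being drowned out.
    return weights_gen[0] * critic_term + weights_gen[1] * pixel_term

# Hypothetical values: adversarial term around 0.7, pixel MSE around 0.05
total = generator_loss(0.7, 0.05)
print(total)
```

With a weighting like this, the generator is pushed to stay close to the target image while still trying to fool the critic.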

A particular callback we’ll use is called GANDiscriminativeLR, which handles multiplying the learning rate for the critic.

learn.callback_fns.append(partial(GANDiscriminativeLR, mult_lr=5.))

Finally, we can train the GAN for 40 rounds before we use a larger image size to train for another 10 rounds.

lr = 1e-4
learn.fit(40, lr)

epoch	train_loss	valid_loss	gen_loss
0	3.718557	3.852783	03:27
1	3.262025	3.452096	03:29
2	3.241105	3.499610	03:29
3	3.098072	3.511492	03:31
4	3.161309	3.211511	03:30
5	3.108723	2.590987	03:29
6	3.049329	3.215695	03:29
7	3.156122	3.255158	03:29
8	3.039921	3.255423	03:30
9	3.136142	3.109873	03:30
10	2.969435	3.096309	03:30
11	2.967517	3.532753	03:30
12	3.066835	3.302504	03:28
13	2.979472	3.147814	03:29
14	2.848181	3.229101	03:29
15	2.981036	3.370961	03:30
16	2.874022	3.646701	03:32
17	2.816335	3.517284	03:33
18	2.886316	3.336793	03:33
19	2.851927	3.596783	03:33
20	2.885449	3.560956	03:33
21	3.081255	3.357426	03:31
22	2.812135	3.340290	03:33
23	2.933871	3.475993	03:32
24	3.084240	3.034758	03:31
25	2.983608	3.113349	03:33
26	2.746827	2.865806	03:32
27	2.789029	3.173259	03:33
28	2.952777	3.227012	03:32
29	2.825185	3.053979	03:34
30	2.782907	3.444182	03:34
31	2.805190	3.343132	03:33
32	2.901620	3.299375	03:33
33	2.744463	3.279421	03:32
34	2.818238	3.048206	03:32
35	2.755671	2.975504	03:32
36	2.764382	3.075425	03:32
37	2.714343	3.076662	03:32
38	2.805259	3.291719	03:32
39	2.787018	3.172551	03:32

learn.data = get_data(16, 192)
learn.fit(10, lr/2)

epoch	train_loss	valid_loss	gen_loss
0	2.789968	3.127500	08:28
1	2.842687	3.226334	08:22
2	2.764777	3.127393	08:24
3	2.783910	3.183345	08:23
4	2.731649	3.279976	08:21
5	2.652934	3.143363	08:23
6	2.664248	2.998718	08:22
7	2.777635	3.185632	08:27
8	2.718668	3.357025	08:26
9	2.660009	2.887908	08:23

The resulting training images look like the following:

GAN Produced Images

As you can see, our model was able to recolor the images reasonably well. This is not bad, but GANs do have their weaknesses, which we’ll discuss in the last section. Before we wrap up the GAN section, let’s try to feed the model external images, that is, images that it hasn’t seen before.

Recoloring External Images

The following pet images were taken randomly from the internet. I manually grayscaled the images before letting the model predict its output.

GAN Produced Images

The colors produced, especially for the animals’ fur, are less saturated than in the original images. However, natural backgrounds like grass and the sky are still acceptable, although different from the original.

Lastly, I tried to feed it images that are neither cats nor dogs, namely images of actual people. The top row is a black-and-white picture which was already grayscale when I received it, whereas the bottom row’s image went through the same process as the images right above.

GAN Produced Images

A few things to notice for the first prediction: first, the model is biased towards green and yellow colors, hence the floor color of the first output. Secondly, aside from coloring the person in front, the model also colored the person on the phone’s screen.

On the other hand, the second prediction was great at coloring the backdrop of mountains and the sky, but bad at coloring the supposedly bright-red car as well as the person, who remained mostly grey.

The most likely reason behind the poor recoloring of people is the dataset used to train the GAN, which in this case contains only pets.

Closing Remarks

Weaknesses of GANs

GANs are well known for being troublesome to handle, especially during training, hence the fancy configuration and knobs needed for them to behave well. Moreover, they take quite long to train in comparison to other architectures.

Possible Replacement of GANs

As shown in the remainder of Lecture 7, there are other architectures which are as good as or even better than GANs, one of which is Feature Loss coupled with U-Nets, with shorter training times and better results in several cases. I have tried that approach, but will not be discussing it here.

Conclusion

GANs are great: the tasks they can do vary from one architecture to another, and they are one of the ways to let a model “dream” and have its own form of creativity. However, they have certain weaknesses, including long training times and the need for careful tweaking. They are definitely modern, and research in the domain is still very much open and fun to do if you’re into this particular field.

That’s it! Thanks for your time and I hope you’ve learned something!

Handwritten Javanese Script Classification

2020-07-06T00:00:00+10:00

Aksara Jawa, or the Javanese Script, is the core of writing the Javanese language and has influenced various other regional languages such as Sundanese, Madurese, etc. The script is now rarely used on a daily basis, but is sometimes taught in local schools in certain provinces of Indonesia.

Specific Form of Aksara

The Javanese Script which we will be classifying is specifically Aksara Wyanjana’s Nglegena, or its basic characters. The list consists of 20 basic characters, without their respective Pasangan characters.

Dataset

Since I was not able to find a handwritten Javanese Script dataset on the internet, I decided to contact one of my high school English teachers, who once showed my class her ability to write Javanese Script. The characters were written on paper, scanned, and edited manually. Credits to Mm. Martha Indrati for the help!

Image Classification

This project is very much inspired by datasets like MNIST and QMNIST, which contain handwritten digits and are go-to datasets for starting to learn image classification. The end goal of this project is to create a deep learning model which can classify handwritten Javanese Script to a certain degree of accuracy.

Code

The main framework to be used is fastai-v2, which sits on top of PyTorch. Fastai-v2 is still under development as of the time of this writing, but is ready to be used for basic image classification tasks.

from fastai2.vision.all import *
import torch

Load Data

The data has been grouped per class folder, which we’ll load up and later split into training (70%) and validation (30%) images.

path = Path("handwritten-javanese-script-dataset")

Notice we’re using a small batch size of 5, mainly because we only have 200 images in total.

Here we’ll apply cropping and resizing as transformations to our image since most of the characters do not fully occupy the image size. Additionally, we’ll resize to 128px.

dblock = DataBlock(blocks     = (ImageBlock(cls=PILImageBW), CategoryBlock),
                   get_items  = get_image_files,
                   splitter   = GrandparentSplitter(valid_name='val'),
                   get_y      = parent_label,
                   item_tfms  = [CropPad(90), Resize(128, method=ResizeMethod.Crop)])

dls = dblock.dataloaders(path, bs=5, num_workers=0)

dls.show_batch()

There are only 20 types of characters in the type of Aksara which we’ll be classifying.

dls.vocab

(#20) ['ba','ca','da','dha','ga','ha','ja','ka','la','ma'...]

Model

We’ll be using XResNet50 as the model, which is based on the Bag of Tricks paper and is an “extension” to the ResNet50 architecture. We’ll pass our data, tell which metrics we’d like to observe, utilize LabelSmoothingCrossEntropy, and add MixUp as our callback.

learn = Learner(dls, xresnet50(c_in=1, n_out=dls.c), metrics=accuracy, loss_func=LabelSmoothingCrossEntropy(), cbs=MixUp)

Training Model

With all things in place, let’s finally train the model to learn from the given dataset and predict which class the image belongs to.

learn.lr_find()

SuggestedLRs(lr_min=0.0003019951749593019, lr_steep=6.309573450380412e-07)

learn.fit_one_cycle(30, 3e-4, cbs=SaveModelCallback(monitor='accuracy', fname='best_model'), wd=0.4)

epoch	train_loss	valid_loss	accuracy	time
0	3.067268	3.108827	0.050000	00:04
1	2.929908	2.669373	0.333333	00:04
2	2.769148	2.293764	0.383333	00:04
3	2.588481	2.215439	0.316667	00:04
4	2.416248	2.324036	0.283333	00:04
5	2.324458	1.983255	0.533333	00:04
6	2.189000	2.105889	0.383333	00:04
7	2.078479	2.350886	0.333333	00:04
8	1.922369	2.823610	0.216667	00:05
9	1.790820	1.584189	0.650000	00:05
10	1.683853	1.509675	0.583333	00:04
11	1.598790	1.570487	0.650000	00:04
12	1.528586	1.256149	0.833333	00:04
13	1.484508	1.623523	0.566667	00:04
14	1.437240	1.340925	0.750000	00:04
15	1.345987	1.138785	0.816667	00:05
16	1.350891	1.370259	0.716667	00:04
17	1.297572	1.453033	0.666667	00:04
18	1.318248	1.330522	0.750000	00:04
19	1.263931	1.023822	0.900000	00:04
20	1.247242	1.063768	0.900000	00:04
21	1.234829	1.009032	0.933333	00:05
22	1.203268	0.968369	0.950000	00:04
23	1.178766	0.965601	0.916667	00:04
24	1.156069	0.939599	0.933333	00:04
25	1.183693	0.943586	0.933333	00:04
26	1.166053	0.933629	0.933333	00:04
27	1.162939	0.936014	0.933333	00:04
28	1.132883	0.936722	0.933333	00:04
29	1.138776	0.946842	0.933333	00:04

Better model found at epoch 0 with accuracy value: 0.05000000074505806.
Better model found at epoch 1 with accuracy value: 0.3333333432674408.
Better model found at epoch 2 with accuracy value: 0.38333332538604736.
Better model found at epoch 5 with accuracy value: 0.5333333611488342.
Better model found at epoch 9 with accuracy value: 0.6499999761581421.
Better model found at epoch 12 with accuracy value: 0.8333333134651184.
Better model found at epoch 19 with accuracy value: 0.8999999761581421.
Better model found at epoch 21 with accuracy value: 0.9333333373069763.
Better model found at epoch 22 with accuracy value: 0.949999988079071.

learn.recorder.plot_loss()

learn.save('stage-1')

Analyze Results

After training, let’s see how well our model learned. Any incorrect prediction in a random batch will have its label colored red.

learn.show_results()

Instead of only viewing a batch, let’s analyze the results from the entire validation dataset.

interp =  ClassificationInterpretation.from_learner(learn)

This confusion matrix lists all the actual versus predicted labels. The darker the blue on the diagonal line, the better our model is at predicting.

interp.plot_confusion_matrix(figsize=(8,8), dpi=60)

On the other hand, this type of interpretation shows several of the predicted images, what our model thinks it is, and how confident it is with that prediction.

interp.plot_top_losses(9, figsize=(10,9))

Predicting External Images

To see how well our model generalizes, let’s feed it some external data and see what it predicts.

from PIL import Image

def open_image_bw_resize(source) -> PILImageBW:
    return PILImageBW(Image.open(source).resize((128,128)).convert('L'))

The following character is supposed to be ma and was picked randomly from available images on the internet.

test0 = open_image_bw_resize('test-image-0.jpg')
test0.show()

Feed it through the model and see its output.

learn.predict(test0)[0]

'ma'

Luckily, the model was able to predict the character correctly. To challenge the model even more, I tried to write Javanese Script characters myself and see what the model predicts. Do note that I do not have any background in writing Javanese Scripts, so pardon my skills.

The following character is supposed to be ca.

test1 = open_image_bw_resize('test-image-1.jpg')
test1.show()

learn.predict(test1)[0]

'ca'

This character is supposed to be wa.

test2 = open_image_bw_resize('test-image-2.jpg')
test2.show()

learn.predict(test2)[0]

'ca'

Well, that’s an incorrect guess, which is understandable: firstly because of my poor handwriting, and secondly because the model was trained on one person’s particular style of handwriting, which in this case is my teacher’s. Other factors could also have caused the incorrect guess, such as overfitting by the model, the small dataset, and possibly more.

Closing Remarks

There are several possible improvements which could be made, one of which is to increase the variety and the size of the dataset, since the model is only trained on a single person’s handwriting. Adding other people’s handwriting into the mix would help the model generalize better.

That’s it for this mini project of mine. Thanks for your time and I hope you’ve learned something!

Hash Tables, Collisions, and Separate Chaining

2020-04-11T00:00:00+10:00

According to Wikipedia, a hash table, sometimes called a hash map, is a data structure that implements an associative array abstract data type, a structure that can map keys to values.

Unlike Linked Lists, Hash Tables allow for O(1) average time complexity when searching, which is powerful given that a Linked List requires O(n). However, collisions are possible when inserting new data whose key maps to a slot that is already in use. This causes the search time complexity to have an O(n) worst case, just like a Linked List.

We’ll implement a Hash Table using the C language. For each key in the hash table, we’ll implement it using a node of a Singly Linked List to cater for separate chaining.

The complete code for this post can be found here.

Header Files

The only header files we’ll be using are the following

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

Hash Table Size

To keep the size of the hash table constant, we will define the maximum number of keys using the #define directive. There will be only 26 keys, based on the 26 letters of the alphabet.

#define MAX_N 26

Node Struct

Each node in the linked list consists only of a name string and a next pointer, which points to the next node in the linked list.

typedef struct node {
    char name[200];
    struct node* next;
} node;

Function Prototypes

Since we are writing in C, we need to first prototype every function we’re going to implement below our main function. Here is the list of functions we’ll be implementing

node* create_node(const char* name);
int hash(const char* name);
void insert(node* root[], const char* name);
char* search(node* root[], const char* name);
void print_list(node* head, int idx);
void print_table(node* root[]);

Creating a Node

To create our node, we will implement the following function. It is almost identical to the one in the Singly Linked List blog post.

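node* create_node(const char* name) {
    node* student = (node*) malloc(sizeof(node));

    // bounded copy so that long names cannot overflow the 200-byte buffer;
    strncpy(student->name, name, sizeof(student->name) - 1);
    student->name[sizeof(student->name) - 1] = '\0';
    student->next = NULL;

    return student;
}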

Hashing Function

The hashing function we’ll be using is a very simple one. Specifically, we take the first letter of the string, convert it to lower case, and return its offset from the letter 'a', which gives an index between 0 and 25.

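int hash(const char* name) {
    // assumes name starts with an ASCII letter; the cast keeps tolower() well-defined;
    return tolower((unsigned char) name[0]) - 'a';
}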

With that we will get only 26 possible keys for a string, which is the same as the MAX_N we’ve defined earlier. There are various ways to create a better hashing function with aims to reduce collisions, which we’ll discuss later in the blog.

Inserting a Node into the Hash Table

First we utilize create_node() to allocate the memory required and create the student node. Then, with the hashing function we can get the corresponding key for the given string name.

Inserting the first node of a key is as simple as getting the address of the corresponding head and setting it to be the newly created student.

However, we will need to address possible collisions and we do this by a method called separate chaining, which basically means appending the next node with the same key to the linked list.

void insert(node* root[], const char* name) {
    node* student = create_node(name);

    int key = hash(name);

    node** head = &root[key];

    if (*head == NULL) { // if the head of a particular key is still NULL;
        *head = student;
    } else { // separate chaining, i.e. push the new node to the back of the linked list.
        node* curr = *head;
        while(curr->next != NULL) {
            curr = curr->next;
        }
        curr->next = student;
    }
}

Searching for a Name inside the Hash Table

Searching is one of the most powerful features of a Hash Table as discussed previously. To implement searching by name, we will be using linear search since we are using linked lists.

Just like with insertion, we handle both possible scenarios: a NULL head, and otherwise a linked list which we traverse.

char* search(node* root[], const char* name) {
    int key = hash(name);

    node* head = root[key];

    if (head == NULL) {
        return NULL;
    } else {
        node* curr = head;
        while(curr != NULL) {
            if (strcmp(curr->name, name) == 0) {
                return curr->name;
            }
            curr = curr->next;
        }
        return NULL;
    }
}

Printing a Linked List

Since each key in the hash table uses a linked list, we need to prepare a function which prints a linked list. We would also like to print the index of the linked list in the hash table, so we pass another parameter called idx.

void print_list(node* head, int idx) {
    if (head == NULL) {
        return;
    } else {
        printf("[%d] ", idx);
        node* curr = head;
        while (curr != NULL) {
            printf("%s", curr->name);
            curr = curr->next;
            if (curr != NULL) {
                printf(" -> ");
            }
        }
        printf("\n");
    }
}

Printing the Hash Table

Finally, we can implement a function to print every non-empty linked list and its contents.

void print_table(node* root[]) {
    for (int i = 0; i < MAX_N; ++i) {
        print_list(root[i], i);
    }
}

Main Function

We’ll demonstrate what the main function looks like.

int main(void) {

    node* root[MAX_N] = {NULL};

    insert(root, "Apple");
    insert(root, "Orange");
    insert(root, "Papaya");
    insert(root, "Avocado");
    insert(root, "Blueberry");
    insert(root, "Peach");
    insert(root, "Plum");

    char* find_banana = search(root, "Banana");
    if (find_banana != NULL) {
        printf("%s found\n", find_banana);
    }

    char* find_avocado = search(root, "Avocado");
    if (find_avocado != NULL) {
        printf("%s found\n", find_avocado);
    }

    print_table(root);

    return 0;

}

The output looks something like this

Avocado found
[0] Apple -> Avocado
[1] Blueberry
[14] Orange
[15] Papaya -> Peach -> Plum
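
One thing the main function above leaves out is releasing the nodes it allocated. A minimal sketch of how that could be done is shown here. The names chain_node, free_chain, free_table, and TABLE_SIZE are made up for this sketch so that it stands alone; they mirror node and MAX_N from the code above.

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 26 /* mirrors MAX_N above */

/* Stand-in for the node struct above. */
typedef struct chain_node {
    char name[200];
    struct chain_node* next;
} chain_node;

/* Frees every node in one chain and returns how many nodes were freed. */
int free_chain(chain_node* head) {
    int freed = 0;
    while (head != NULL) {
        chain_node* next = head->next; /* remember the next node before freeing */
        free(head);
        head = next;
        ++freed;
    }
    return freed;
}

/* Frees every chain and resets each slot to NULL so the table can be reused. */
int free_table(chain_node* table[]) {
    assert(table != NULL);
    int freed = 0;
    for (int i = 0; i < TABLE_SIZE; ++i) {
        freed += free_chain(table[i]);
        table[i] = NULL;
    }
    return freed;
}
```

Resetting each slot to NULL means a second call simply finds nothing left to free.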

A Better Hash Table and Hashing Function

As mentioned previously, the hashing function implemented above is not the best at reducing collisions. We would like to avoid collisions as much as possible to maintain O(1) search time.

A reference code was given by my lecturer as an example of a good hashing function and hash table, as well as explanations as to why certain decisions were made.

Credit belongs to the author of this code.

There are several aspects to improve a hash table, namely its size and the hashing function being used.

Size of Hash Table

Firstly, the size of the hash table plays a role in the distribution of the key in the hash table.

A good size would be a prime number, since it has very few factors, while non-prime sizes tend to cause keys to be distributed non-uniformly.

Simply put, when the raw hash values share a common factor with the size of the hash table, they can only land on slots that are multiples of that factor, so the remaining slots have a high probability of staying empty.

For example, suppose we chose 12 as the size of the hash table. Values that are multiples of 3, a factor of 12, can only land on slots 0, 3, 6, and 9. Those slots will be more likely to be filled while the others stay empty, thus increasing the chance of collision.
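
To see this effect in numbers, here is a small sketch (not part of the reference code) that counts how many distinct slots get used when raw hash values that are all multiples of some step are reduced modulo the table size. The function name count_used_slots and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how many distinct slots are hit when the raw values
   0, step, 2*step, ... (n_keys of them) are reduced modulo table_size. */
int count_used_slots(int step, int table_size, int n_keys) {
    assert(step > 0 && table_size > 0 && n_keys >= 0);
    char* used = calloc((size_t) table_size, 1); /* one flag per slot */
    assert(used != NULL);
    int count = 0;
    for (int i = 0; i < n_keys; ++i) {
        int slot = (i * step) % table_size;
        if (!used[slot]) {
            used[slot] = 1;
            ++count;
        }
    }
    free(used);
    return count;
}
```

With a step of 3 and 100 keys, a table of size 12 only ever uses 4 slots (0, 3, 6, and 9), whereas a table of size 13 ends up using all 13.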

Hash Function

Aside from being fast to compute, a good hashing function distributes keys as uniformly as possible.

To do so, we combine the numeric value of every character in the string, rather than only the first, to make the key as unique as possible.

We also add a so-called zero-padding, in case empty strings are allowed, to prevent them from affecting universality.

In addition, every time we add a character’s value, we first multiply the running total by a base number, which is strictly greater than the number of different values each individual letter can take. Doing so further increases the range of possible keys, hence reducing collisions.

For example, since there are 26 possible lowercase letters, a base number like 31 is preferable. The base number 31 is also used by a method called hashCode() in Java’s String class.

A more detailed explanation can be found in this Wikipedia page about Universal hashing.

Code Implementation

With the said changes, our hash table now has a size of 97.

int hashTable[97];

Meanwhile, the hash function looks like this

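int hash(const char *str) {
    int len = (int) strlen(str);

    int base = 31; // greater than the 26 possible lowercase letters;

    int MODPRIME = 97; // prime size of the hash table;

    int ret = 0;

    for (int i = 0; i < len; i++) {
        // map 'a' to 1, 'b' to 2, and so on; assumes lowercase letters only;
        ret = (ret * base) + (str[i] - 'a' + 1);
        ret = ret % MODPRIME;
    }

    return (ret * base) % MODPRIME;
}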
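
To make the arithmetic concrete, here is a sketch of the same computation with the base and the modulus passed in as parameters. The name poly_hash and its signature are assumptions made for this example, and like the code above it assumes the input only contains the lowercase letters a to z.

```c
#include <assert.h>
#include <string.h>

/* Polynomial hash of a lowercase ASCII string, reduced modulo `mod`.
   Mirrors the reference hash above, but takes base and mod as parameters. */
int poly_hash(const char* str, int base, int mod) {
    assert(str != NULL && base > 0 && mod > 0);
    size_t len = strlen(str);
    int ret = 0;
    for (size_t i = 0; i < len; i++) {
        /* 'a' -> 1, 'b' -> 2, ...; reduce at every step to keep ret small */
        ret = (ret * base + (str[i] - 'a' + 1)) % mod;
    }
    return (ret * base) % mod;
}
```

Tracing "abc" with base 31 and modulus 97, the running value goes 1, then 1 * 31 + 2 = 33, then (33 * 31 + 3) mod 97 = 56, and the final step gives (56 * 31) mod 97 = 87. Because each step multiplies by the base, the order of the letters matters: "ab" hashes to 53 while "ba" hashes to 13, whereas a plain sum of letter values would give the same result for both.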

Conclusion

It is very interesting to see how the small details of a hash table, and the math behind them, can greatly affect its performance. With the capabilities of a hash table, searching can be greatly improved in comparison to the previously discussed data structures.

To read more about hash tables, the reference code also links to very useful notes from UC San Diego.

Doubly Linked List in C

2020-03-05T00:00:00+11:00

After learning how to implement a Singly Linked List, we’re going to implement a Doubly Linked List, which is similar to a Singly Linked List, but with the addition of a prev pointer which points to the node before it.

We’ll implement a Doubly Linked List using the C language. The complete code for this post can be found here.

The following code is based on a lecture by Rhio Sutoyo, S.Kom., M.Sc. in the Data Structures course.

Header Files

The only header files we’ll be using are the following

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

Node Struct

A node is just a single element inside the list, which in this case represents a student’s information with their name and gpa. It also has pointers to the next and previous nodes.

typedef struct node {
    char name[200];
    double gpa;
    struct node* next;
    struct node* prev;
} node;

Notice that we also use typedef which allows us to omit the struct keyword in the instantiation of a node.

Function Prototypes

Since we are writing in C, we need to first prototype every function we’re going to implement below our main function. Here is the list of functions we’ll be implementing

node* create_node(const char* name, double gpa);
void sorted_push(const char* name, double gpa);
void delete_node(const char* key);
void print_list(void);
void print_reversed_list(void);

Global Head and Tail

For this example, we will create two global variables called head and tail, which denote the first and the last element in the list respectively.

node *head, *tail;

Creating a Node

To create our student node, we will implement the following function.

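node* create_node(const char* name, double gpa) {
    // allocate memory of size 'node';
    node* student = (node*) malloc(sizeof(node));
    // create a new node based on the given arguments;
    strncpy(student->name, name, sizeof(student->name) - 1);
    student->name[sizeof(student->name) - 1] = '\0';
    student->gpa = gpa;
    // a new node starts out unlinked;
    student->next = NULL;
    student->prev = NULL;

    return student;
}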

Sorted Push

Instead of implementing push front, back, or middle, we’re going to create a function which will automatically insert a node in ascending order of gpa.

void sorted_push(const char* name, double gpa) {
    // create a new node;
    node* student = create_node(name, gpa);

    if (head == NULL) { // if list is empty;
        head = student;
        tail = student;
        head->next = NULL;
        tail->next = NULL;
        head->prev = NULL;
        tail->prev = NULL;
    } else {
        node* curr = head;
        // traverse to the node with gpa greater than the one being pushed;
        while (curr != NULL && curr->gpa < student->gpa) {
            curr = curr->next;
        }

        if (curr == head) { // if the head already has a value greater than the new node's;
            // append old head to the new node;
            student->next = head;
            head->prev = student;
            // set new node as new head;
            head = student;
            head->prev = NULL;
        } else if (curr == NULL) { // if we've reached the node after tail, i.e. all values are less than the value being pushed;
            // append new node to tail;
            tail->next = student;
            student->prev = tail;
            // set new node as new tail;
            tail = student;
            tail->next = NULL;
        } else { // if we have to push the new node in the middle;
            // connect the current's previous node to the new node;
            curr->prev->next = student;
            student->prev = curr->prev;
            // connect curr as the next of the new node;
            student->next = curr;
            curr->prev = student;
        }
    }
}

Delete a Node Based on Name

We can delete a particular node based on its name.

void delete_node(const char* key) {
    if (head == NULL) { // if list is empty;
        printf("List is empty.\n");
    } else {
        node* curr = head;
        // traverse to the node to be deleted;
        while (curr != NULL && strcmp(curr->name, key) != 0) {
            curr = curr->next;
        }

        if (curr == NULL) { // if key is not in the list;
            printf("\"%s\" is not in the list.\n", key);
        } else if (curr == head && curr == tail) { // if key is the only node in the list;
            // delete node;
            free(curr);
            // reset head and tail;
            head = NULL;
            tail = NULL;
        } else if (curr == head) { // if key is head;
            // set old head's next as new head;
            head = head->next;
            head->prev = NULL;
            // free old head since it's no longer used;
            free(curr);
        } else if (curr == tail) { // if key is tail;
            // set the node before old tail to be the new tail;
            tail = tail->prev;
            tail->next = NULL;
            // free old tail;
            free(curr);
        } else {
            // skip the node being deleted;
            curr->prev->next = curr->next;
            curr->next->prev = curr->prev;
            // free the deleted node;
            free(curr);
        }
    }
}
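
After a mix of insertions and deletions, it can be handy to verify that the links are still consistent. The following is a minimal sketch of such a check, not part of the lecture code. It uses a stripped-down node type (dnode) and takes head and tail as parameters instead of using the globals; the names dnode, is_well_formed, and check_list are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-in for the node struct above. */
typedef struct dnode {
    double gpa;
    struct dnode* next;
    struct dnode* prev;
} dnode;

/* Returns 1 if the list from head to tail is consistently linked and
   sorted by ascending gpa, 0 otherwise. */
int is_well_formed(const dnode* head, const dnode* tail) {
    if (head == NULL || tail == NULL) {
        return head == NULL && tail == NULL; /* empty list: both must be NULL */
    }
    if (head->prev != NULL || tail->next != NULL) {
        return 0; /* the ends must not point outside the list */
    }
    const dnode* curr = head;
    while (curr->next != NULL) {
        if (curr->next->prev != curr) return 0;    /* back link must mirror forward link */
        if (curr->next->gpa < curr->gpa) return 0; /* ascending order of gpa */
        curr = curr->next;
    }
    return curr == tail; /* walking forward must end at tail */
}

/* Convenience wrapper: aborts in debug builds if the list is corrupted. */
void check_list(const dnode* head, const dnode* tail) {
    assert(is_well_formed(head, tail));
}
```

Calling check_list after every push or delete while debugging is one way to catch a forgotten prev or next update early.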

Print Linked List

For convenience, create a function to print the entire list.

void print_list(void) {
    if (head == NULL) { // if list is empty;
        printf("List is empty.\n");
    } else {
        node* curr = head;
        // traverse through each node;
        while (curr != NULL) {
            printf("Name: %-10s GPA: %.2lf\n", curr->name, curr->gpa);
            curr = curr->next;
        }
    }
}

Print Linked List in Reverse Order

Since we have the prev pointer, we can easily print the list in reverse order.

void print_reversed_list(void) {
    if (head == NULL) { // if list is empty;
        printf("List is empty.\n");
    } else {
        node* curr = tail;
        // traverse through each node backwards;
        while (curr != NULL) {
            printf("Name: %-10s GPA: %.2lf\n", curr->name, curr->gpa);
            curr = curr->prev;
        }
    }
}

Main Function

Lastly, we’ll demonstrate what the main function looks like.

int main(void) {

    sorted_push("Steven", 3.5); // [Steven]
    sorted_push("Bill", 2.0); // [Bill, Steven]
    sorted_push("John", 3.7); // [Bill, Steven, John]
    sorted_push("Ace", 2.5); // [Bill, Ace, Steven, John]

    delete_node("Ace"); // [Bill, Steven, John]

    print_list();

    return 0;

}

Conclusion

With a doubly linked list, we can easily move forward and backward from a node, which greatly eases adding a node, printing in reverse order, and other operations which are difficult to do with a singly linked list.
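
One operation which illustrates this well: given a pointer to a node, a doubly linked list can detach it in constant time, whereas a singly linked list would first have to walk from the head to find the previous node. Here is a minimal sketch, using a stripped-down node type (dnode) and a hypothetical unlink_node function which takes head and tail as parameters instead of the globals used in this post.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-in for the node struct above. */
typedef struct dnode {
    struct dnode* next;
    struct dnode* prev;
} dnode;

/* Detaches target from the list in constant time and updates head/tail
   if needed. The caller stays responsible for freeing target. */
void unlink_node(dnode** head, dnode** tail, dnode* target) {
    assert(head != NULL && tail != NULL && target != NULL);
    if (target->prev != NULL) {
        target->prev->next = target->next;
    } else {
        *head = target->next; /* target was the head */
    }
    if (target->next != NULL) {
        target->next->prev = target->prev;
    } else {
        *tail = target->prev; /* target was the tail */
    }
    target->next = NULL;
    target->prev = NULL;
}
```

No traversal is involved here, which is exactly what the prev pointer buys us.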