For resource-poor languages such as Hmong, large datasets of annotated questions are unavailable, which means that producing an automated question classifier is a potentially challenging task. Currently, a dataset containing 411 annotated Hmong questions is publicly available. The challenge here is to produce a question classifier with adequate accuracy using this available dataset.
Here, we explore how well several models perform with this intentionally limited dataset. This will give us a better understanding of which model architectures are likely to work best in the long term for resource-poor languages, most of which will never have the kind of robust data that produces state-of-the-art results for more prominent languages.
We will test five different models with our dataset and compare their accuracy:
- Three-layer MLP using word embeddings
- Double BiLSTM
- Ordered Bidirectional GRU with a Simple RNN
- Three-layer MLP using a CountVectorizer
- Three-layer MLP using a bigram CountVectorizer and weights regularization
Let’s begin by importing the modules we need to preprocess the data.
import os
import sys
import re
import numpy as np
from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from sklearn.model_selection import train_test_split
pos_tag_interface_path = os.path.expanduser(os.path.join('~',
                                                          'python_workspace',
                                                          'medical_corpus_scripting',
                                                          'pos_tagger_interface'))
sys.path.append(pos_tag_interface_path)
from POS_Tagger import HmongPOSTagger
Next, we load the KeyedVectors based on the models that are part of the Hmong Medical Corpus (HMC):
- Subword embeddings based on the annotated HMC data (to handle syllable-based spacing in Hmong)
- POS tag embeddings based on the annotated HMC data
- Token embeddings based on the SCH (soc.culture.hmong) Corpus
subword_embeddings = KeyedVectors.load('subword_model.h5', mmap='r')
tag_embeddings = KeyedVectors.load('tag_alone_model.h5', mmap='r')
token_embeddings = KeyedVectors.load('word2vec_Hmong_SCH.model', mmap='r')
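As a quick sanity check (not in the original notebook), we can confirm that all three embedding models share the same vector dimensionality; the Input layers defined further below assume 150-dimensional vectors.
# Sanity check (added for illustration): the Input layers below assume
# 150-dimensional vectors, so all three models should report the same vector_size.
for name, kv in [('subword', subword_embeddings),
                 ('tag', tag_embeddings),
                 ('token', token_embeddings)]:
    print(name, kv.vector_size)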
Next, we load the annotated question dataset from the relevant file.
def load_question_data(filename):
    # split each 'question | label' line, skipping lines containing '???' or '<'
    with open(filename, 'r') as f:
        data = [w.strip().split(' | ') for w in f.readlines()
                if '???' not in w and '<' not in w]
    print(len(data))
    return data
data = load_question_data('question_type_training_set_2.txt')
The dataset contains an uneven distribution of examples: it reflects the kinds of questions for which annotated data are available. This means we need to produce a smaller dataset with an even number of examples for each category.
Doing so, however, shrinks our dataset from 411 examples to 135: 15 examples for each of the 9 categories that have at least 15 annotated examples.
from collections import Counter
c = Counter(w[1] for w in data)
sorted_data = sorted(list(c.items()), key=lambda item: item[1], reverse=True)
print(sorted_data)
first_filtered_data = [w for w in data if c[w[1]] >= 15]
#filtered_data = [w for w in data if c[w[1]] > 2]
total_count = {w[0]: 0 for w in sorted_data}
filtered_data = []
for item in first_filtered_data:
    if total_count[item[1]] < 15:
        filtered_data.append(item)
        total_count[item[1]] += 1
print(len(filtered_data))
print(total_count.items())
For the next step, we split the data into questions and labels using zip.
questions, labels = zip(*filtered_data)
Now, we load the Hmong POS tagger and tag the words that make up the questions. This produces tags of the form subword-POS, e.g., B-NN for the first syllable of a word whose unit is a noun.
def tag_question_data(questions):
    tagger = HmongPOSTagger()
    # separate question marks, commas, and semicolons from the preceding word
    tokenized_questions = [re.sub(r'([?,;])', r' \g<1>', q).split(' ') for q in questions]
    return tokenized_questions, tagger.tag_words(tokenized_questions)
tokenized_questions, tags = tag_question_data(questions)
Here, we split the subword tags and the POS tags, placing them in separate sentence sets.
def split_subword_pos_tags(tags):
    """This function takes tags of type B-NN (subword-POS)
    and produces separate lists for subword tags and POS tags"""
    subword_tags = []
    pos_tags = []
    for sent in tags:
        subword_sent = []
        pos_sent = []
        for word in sent:
            if word == '-PAD-':  # unknown word that needs reassignment
                subword = 'B'
                pos = 'FW'
            else:
                subword, pos = word.split('-')
            subword_sent.append(subword)
            pos_sent.append(pos)
        subword_tags.append(subword_sent)
        pos_tags.append(pos_sent)
    return subword_tags, pos_tags
subword_tags, pos_tags = split_subword_pos_tags(tags)
The next step converts the words, tags, and labels to integers using keras.preprocessing.text.Tokenizer and sets the values for -PAD- and -OUT-.
# notes: keras.preprocessing.text.one_hot, text_to_word_sequence
# special pad values are used because Keras Tokenizer does not permit 0 as a value
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokenized_questions)
# automatically converts words to sequences of numbers
sequences = tokenizer.texts_to_sequences(tokenized_questions)
word_pad_value = max(tokenizer.word_index.values()) + 1
word_out_value = word_pad_value + 1
pos_tag_tokenizer = Tokenizer()
pos_tag_tokenizer.fit_on_texts(pos_tags)
pos_sequences = pos_tag_tokenizer.texts_to_sequences(pos_tags)
pos_pad_value = max(pos_tag_tokenizer.word_index.values()) + 1
pos_out_value = pos_pad_value + 1
subword_tag_tokenizer = Tokenizer()
subword_tag_tokenizer.fit_on_texts(subword_tags)
subword_sequences = subword_tag_tokenizer.texts_to_sequences(subword_tags)
subword_pad_value = max(subword_tag_tokenizer.word_index.values()) + 1
subword_out_value = subword_pad_value + 1
# can use label_tokenizer.sequences_to_texts once done
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_sequences = [l[0] for l in label_tokenizer.texts_to_sequences(labels)]
Now, we set the maximum length of the sentences, define the variable word_index for use later, and then pad the input sequences.
MAX_LENGTH = max(len(s) for s in sequences)
word_index = tokenizer.word_index
padded_sequences = pad_sequences(sequences,
                                 maxlen=MAX_LENGTH,
                                 padding='post',
                                 value=word_pad_value)
padded_pos_sequences = pad_sequences(pos_sequences,
                                     maxlen=MAX_LENGTH,
                                     padding='post',
                                     value=pos_pad_value)
padded_subword_sequences = pad_sequences(subword_sequences,
                                         maxlen=MAX_LENGTH,
                                         padding='post',
                                         value=subword_pad_value)
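As a quick check (not in the original code), all three padded arrays should now have the same shape: the number of questions by MAX_LENGTH.
# Each padded array should have shape (number of questions, MAX_LENGTH)
print(padded_sequences.shape,
      padded_pos_sequences.shape,
      padded_subword_sequences.shape)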
Next, before we split the data into training and test sets, we convert the input sentences using CountVectorizer, so that the CountVectorizer-based models are trained and tested on the same sentences as the other models and the results remain comparable. The first step is to produce string sentences that combine the original string of the word, the subword position tag, and the POS tag. Here, I join all three into single tokens processed as single units, since intuitively this treats mobBNN (that is, the noun mob ‘sickness’ at the beginning of a word, i.e., B-NN) as a different word from mobBVV (the verb mob ‘be sick’ at the beginning of a word, i.e., B-VV).
from sklearn.feature_extraction.text import CountVectorizer
def join_data(tokenized_questions, subword_tags, pos_tags):
    joined_data = []
    for i, q in enumerate(tokenized_questions):
        joined_sent = []
        for j, word in enumerate(q):
            joined_sent.append(''.join([word, subword_tags[i][j], pos_tags[i][j]]))
        joined_data.append(' '.join(joined_sent))
    print(joined_data[:2])
    return joined_data
Next, we create a CountVectorizer object for unigrams and produce a vector for each sentence.
#CountVectorizer needs sentences made of strings.
joined_data = join_data(tokenized_questions, subword_tags, pos_tags)
vectorizer = CountVectorizer()
vectorizer.fit(joined_data)
vectors = np.array([v.toarray()[0] for v in vectorizer.transform(joined_data)])
Now, we create a CountVectorizer object for bigrams.
pure_vectorizer_bigrams = CountVectorizer(ngram_range=(2,2))
pure_vectorizer_bigrams.fit(joined_data)
pure_bigram_vectors = np.array([v.toarray()[0] for v in pure_vectorizer_bigrams.transform(joined_data)])
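Before moving on, it is worth noting how much larger the bigram feature space is than the unigram one; the following check (added here for illustration) prints both vocabulary sizes, which also determine the input dimensions of the two CountVectorizer-based MLPs below.
# The fitted vectorizers expose their vocabularies; the bigram vocabulary is
# considerably larger, which sets the input dimension of the bigram MLP below.
print(len(vectorizer.vocabulary_), len(pure_vectorizer_bigrams.vocabulary_))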
Once all the input preprocessing is complete, we can split the dataset into training and test sets. We have several kinds of input data: the word token sequences, vector sequences representing both unigrams and bigrams, the POS tag sequences, and the subword position tag sequences, along with the output question label sequences. Each of these is split into training and test sets, where all of the training sets contain the same sentences in the same order, as do the test sets.
labels_matrix = to_categorical(np.asarray(label_sequences))  # note: not used below; y_train and y_test are one-hot encoded after the split
TEST_SIZE = 0.2
VALIDATION_SIZE = 0.2
(X_train,
 X_test,
 X_vector_train,
 X_vector_test,
 X_bigram_vector_train,
 X_bigram_vector_test,
 X_pos_train,
 X_pos_test,
 X_subword_train,
 X_subword_test,
 y_train,
 y_test) = train_test_split(padded_sequences,
                            vectors,
                            pure_bigram_vectors,
                            padded_pos_sequences,
                            padded_subword_sequences,
                            label_sequences,
                            test_size=TEST_SIZE)
word_set = set(word_value for sent in X_train for word_value in sent)
pos_set = set(pos_value for sent in X_pos_train for pos_value in sent)
subword_set = set(subword_value for sent in X_subword_train for subword_value in sent)
Once this is complete, we define a function that creates an embedding matrix for each input type needed by the three embedding-based models considered below.
def create_embeddings_matrix(input_set, index_dict, word2vec_source_object, convert_caps=False):
    '''Creates an embedding matrix from a preexisting Word2Vec model for use in the Keras Embedding layer
    @param input_set: the set object containing all unique X entries in X_train as numeral values
    @param index_dict: the Tokenizer.word_index dict object containing numerical conversions
    @param word2vec_source_object: the KeyedVectors object containing the vector values for embedding
    @param convert_caps: whether to upper-case tokens before looking them up in the Word2Vec model'''
    # $PAD and $OUT remain zeros; they are max(index_dict.values()) + 1 and + 2, respectively
    pad_out_tags_length = 2
    embedding_matrix = np.zeros((max(index_dict.values()) + pad_out_tags_length + 1,
                                 word2vec_source_object.vector_size))
    for token, numeral in index_dict.items():
        if numeral in input_set:
            try:
                if convert_caps:
                    word2vec_token_value = token.upper()
                else:
                    word2vec_token_value = token
                embedding_vector = word2vec_source_object.wv[word2vec_token_value]
            except KeyError:
                embedding_vector = None
            if embedding_vector is not None:
                embedding_matrix[numeral] = embedding_vector
    return embedding_matrix
Now, we use the new function to create the matrices for the word inputs, the POS tag inputs, and the subword position tag inputs.
words_embedding_matrix = create_embeddings_matrix(word_set,
                                                  tokenizer.word_index,
                                                  token_embeddings)
pos_embedding_matrix = create_embeddings_matrix(pos_set,
                                                pos_tag_tokenizer.word_index,
                                                tag_embeddings,
                                                convert_caps=True)
subword_embedding_matrix = create_embeddings_matrix(subword_set,
                                                    subword_tag_tokenizer.word_index,
                                                    subword_embeddings,
                                                    convert_caps=True)
Here, we create another function to produce input matrices from the embedding matrices we just created.
def produce_input_matrix(sequences, embedding_matrix):
    output_sequences = []
    for sent in sequences:
        output_sent = []
        for word in sent:
            output_sent.append(embedding_matrix[word])
        output_sequences.append(output_sent)
    return output_sequences
Now, we produce sequence matrices using the function we just created. These will be the input sets for the first three models we’ll consider.
X_train_sequence_matrix = produce_input_matrix(X_train, words_embedding_matrix)
X_test_sequence_matrix = produce_input_matrix(X_test, words_embedding_matrix)
X_pos_train_sequence_matrix = produce_input_matrix(X_pos_train, pos_embedding_matrix)
X_pos_test_sequence_matrix = produce_input_matrix(X_pos_test, pos_embedding_matrix)
X_subword_train_sequence_matrix = produce_input_matrix(X_subword_train, subword_embedding_matrix)
X_subword_test_sequence_matrix = produce_input_matrix(X_subword_test, subword_embedding_matrix)
Now, we set the size of our output vectors by defining y_classes based on the total number of question-label categories in our dataset. Then, we convert our y_train and y_test sets to one-hot vectors.
y_classes = max(label_tokenizer.word_index.values()) + 1
y_train = to_categorical(y_train, num_classes=y_classes)
y_test = to_categorical(y_test, num_classes=y_classes)
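As noted in the comment above the label tokenizer, label_tokenizer.sequences_to_texts can later map predicted class indices back to their category names. The helper below is a minimal sketch of how this could be done once a model is trained; it is not part of the original code, and decode_predictions is a hypothetical name.
def decode_predictions(prediction_probs, label_tokenizer):
    """Illustrative sketch (not used below): converts softmax outputs back to
    label names using label_tokenizer.sequences_to_texts."""
    predicted_indices = np.argmax(prediction_probs, axis=1)
    return label_tokenizer.sequences_to_texts([[int(i)] for i in predicted_indices])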
Next, we import the necessary libraries from Keras to build the models.
from keras.models import Model
from keras.layers import Input, Embedding, Activation, Flatten, Add
from keras.layers import Dense, LSTM, Bidirectional, GRU, SimpleRNN
from keras.callbacks import EarlyStopping
from keras.regularizers import l2
Next, we create a function that will provide plots of the loss and accuracy results.
from matplotlib import pyplot as plt
%matplotlib inline
def plot_metrics(history_obj):
    plt.plot(history_obj.history['loss'], label='Training loss')
    plt.plot(history_obj.history['val_loss'], label='Validation loss')
    plt.legend(loc="upper left")
    plt.show()
    plt.plot(history_obj.history['accuracy'], label='Training accuracy')
    plt.plot(history_obj.history['val_accuracy'], label='Validation accuracy')
    plt.legend(loc="upper left")
    plt.show()
Next, we create a function that can train and test each model.
def run_model(model, countvectorizer=False, bigrams=False, validation=VALIDATION_SIZE):
    if countvectorizer:
        if bigrams:
            model_history = model.fit(np.array(X_bigram_vector_train),
                                      y_train,
                                      batch_size=4,
                                      epochs=50,
                                      validation_split=validation)
            plot_metrics(model_history)
            scores = model.evaluate(np.array(X_bigram_vector_test), y_test)
        else:
            model_history = model.fit(np.array(X_vector_train),
                                      y_train,
                                      batch_size=4,
                                      epochs=50,
                                      validation_split=validation)
            plot_metrics(model_history)
            scores = model.evaluate(np.array(X_vector_test), y_test)
    else:
        model_history = model.fit([X_train_sequence_matrix,
                                   X_pos_train_sequence_matrix,
                                   X_subword_train_sequence_matrix],
                                  y_train,
                                  batch_size=4,
                                  epochs=50,
                                  validation_split=validation)
        plot_metrics(model_history)
        scores = model.evaluate([X_test_sequence_matrix,
                                 X_pos_test_sequence_matrix,
                                 X_subword_test_sequence_matrix],
                                y_test)
    print("Training set accuracy: {result:.2f} percent".format(
        result=model_history.history['accuracy'][-1] * 100))
    if validation > 0.0:
        print("Validation set accuracy: {result:.2f} percent".format(
            result=model_history.history['val_accuracy'][-1] * 100))
    print("Accuracy: {result:.2f} percent".format(result=(scores[1] * 100)))
At this point, we are ready to build and test the models. Our first model is a multilayer perceptron (MLP). This network has three Dense layers: two hidden layers using a rectified linear unit (ReLU) activation and one output layer using a softmax activation. We also use a Flatten layer between the second hidden layer and the output layer, since our input is sentences made up of words while our output is a single label. That is, each question is two-dimensional as input while each label is one-dimensional as output, so we need to reduce the dimensionality, which is exactly what Flatten does.
# Ordered MLP
word_input = Input(shape=(MAX_LENGTH, 150))
pos_input = Input(shape=(MAX_LENGTH, 150))
subword_input = Input(shape=(MAX_LENGTH, 150))
addition_layer = Add()([word_input, pos_input, subword_input])
dense_1 = Dense(256, activation='relu')(addition_layer)
dense_2 = Dense(256, activation='relu')(dense_1)
flatten_layer = Flatten()(dense_2)
dense_3 = Dense(y_classes, activation='softmax')(flatten_layer)
model = Model(inputs=[word_input, pos_input, subword_input], outputs=dense_3)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
print(model.summary())
Now we run the MLP model using the function we defined above. As the graphs show, the loss effectively converged after only two iterations, but, interestingly, the validation accuracy continued to increase even as the validation loss slowly rose. In any case, this MLP produced an accuracy of only 18.52% on the test data; the variance caused by the limited amount of training data proved too problematic for the MLP with the hyperparameters chosen.
run_model(model)
Our second model is a BiLSTM model, which contains two Bidirectional LSTM (Long Short-Term Memory) hidden layers and a Dense layer with a softmax activation as output.
word_bilstm2_input = Input(shape=(MAX_LENGTH, 150))
pos_bilstm2_input = Input(shape=(MAX_LENGTH, 150))
subword_bilstm2_input = Input(shape=(MAX_LENGTH, 150))
addition_bilstm2_layer = Add()([word_bilstm2_input, pos_bilstm2_input, subword_bilstm2_input])
bilstm2_layer = Bidirectional(LSTM(256, return_sequences=True))(addition_bilstm2_layer)
bilstm2_layer_2 = Bidirectional(LSTM(256))(bilstm2_layer)
final_bilstm2_layer = Dense(y_classes, activation='softmax')(bilstm2_layer_2)
model_bilstm2 = Model(inputs=[word_bilstm2_input,
                              pos_bilstm2_input,
                              subword_bilstm2_input],
                      outputs=final_bilstm2_layer)
model_bilstm2.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])
print(model_bilstm2.summary())
Here, we run the BiLSTM. The results are markedly better than with our MLP, which is unsurprising, given that the BiLSTM architecture is more appropriate than an MLP for ordered sentence data. While our validation loss grew progressively worse after only three iterations, the validation accuracy continued to improve until the fifth iteration before leveling off. The validation accuracy of 72.73% means that this model still struggles with high variance, but the test set accuracy of 88.89% suggests that this model is promising if the variance is adequately addressed.
run_model(model_bilstm2)
Our third model is a Bidirectional GRU (gated recurrent unit) with a SimpleRNN (recurrent neural network) layer and a Dense layer with a softmax activation as output.
# Bidirectional GRU with simple RNN
word_bigru_input = Input(shape=(MAX_LENGTH, 150))
pos_bigru_input = Input(shape=(MAX_LENGTH, 150))
subword_bigru_input = Input(shape=(MAX_LENGTH, 150))
addition_bigru_layer = Add()([word_bigru_input, pos_bigru_input, subword_bigru_input])
bigru_layer = Bidirectional(GRU(256, return_sequences=True))(addition_bigru_layer)
rnn_bigru_layer = SimpleRNN(256, activation='relu')(bigru_layer)
dense_bigru_layer = Dense(y_classes, activation='softmax')(rnn_bigru_layer)
model_bigru = Model(inputs=[word_bigru_input, pos_bigru_input, subword_bigru_input], outputs=dense_bigru_layer)
model_bigru.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_bigru.summary())
Now we run the BiGRU. The results suggest that the training loss has not yet converged, which, taken together with the training accuracy of 96.51%, would suggest running the model for more iterations. However, the validation loss and accuracy show that the training and validation sets give quite divergent results. Furthermore, the test set accuracy of 37.04% means that variance hurts this architecture almost as severely as it does the MLP.
run_model(model_bigru)
Next, we try CountVectorizer data with a three-layer MLP.
# CountVectorizer with MLP
bag_word_input = Input(shape=(len(vectors[0]),))
bag_dense_1 = Dense(256, activation='relu')(bag_word_input)
bag_dense_2 = Dense(256, activation='relu')(bag_dense_1)
bag_dense_3 = Dense(y_classes, activation='softmax')(bag_dense_2)
model_cv = Model(inputs=bag_word_input, outputs=bag_dense_3)
model_cv.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(model_cv.summary())
Next, we run the model. This MLP using CountVectorizer data performs much better than the MLP using word embeddings above. The validation loss effectively bottoms out at the seventh iteration, and the validation accuracy reaches its maximum of 72.73% at around iteration 21, meaning that 50 iterations are unnecessary and in fact slightly hurt the results. Nevertheless, like the other models, this model struggles with the variance produced by the small dataset, given a final test accuracy of 74.07% versus a training set accuracy of 100.00%.
run_model(model_cv, countvectorizer=True)
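Since 50 iterations appear to be more than necessary here, one option would be to use the EarlyStopping callback imported earlier, which run_model as written does not use. The lines below are only a sketch (they are not run here) of how the fit call could be adapted to stop once the validation loss stops improving.
# Sketch only: EarlyStopping is imported above but never wired into run_model.
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
# model_cv.fit(np.array(X_vector_train), y_train,
#              batch_size=4, epochs=50,
#              validation_split=VALIDATION_SIZE,
#              callbacks=[early_stopping])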
Our fifth and final model is an MLP using bigram CountVectorizer data as input and L2 weight regularization to reduce overfitting.
# Bigram CountVectorizer with MLP and regularization
bireg_word_input = Input(shape=(len(pure_bigram_vectors[0]),))
bireg_dense_1 = Dense(256, activation='relu', kernel_regularizer=l2(l=0.003))(bireg_word_input)
bireg_dense_2 = Dense(256, activation='relu', kernel_regularizer=l2(l=0.003))(bireg_dense_1)
bireg_dense_3 = Dense(y_classes, activation='softmax')(bireg_dense_2)
model_bireg = Model(inputs=bireg_word_input, outputs=bireg_dense_3)
model_bireg.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(model_bireg.summary())
Here again, we run the model. With bigram CountVectorizer data and L2 regularization, the model still struggles: the validation accuracy oscillates wildly and declines the longer the model is trained. In the end, a validation set accuracy of 63.64% means that L2 regularization did not resolve the variance issue in this case. However, the test set accuracy of 85.19% suggests that, with other approaches to handling the variance, this model architecture could prove quite fruitful.
run_model(model_bireg, countvectorizer=True, bigrams=True, validation=0.2)
Conclusion
The results of the five models are shown in the table below.
| Model type | Input data type | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|---|
| MLP | Word embedding | 100.00% | 45.45% | 18.52% |
| BiLSTM | Word embedding | 100.00% | 72.73% | 88.89% |
| BiGRU/RNN | Word embedding | 96.51% | 45.45% | 37.04% |
| MLP | CountVectorizer unigram vectors | 100.00% | 68.18% | 74.07% |
| MLP | CountVectorizer bigram vectors | 100.00% | 63.64% | 85.19% |
The results show that the BiLSTM is the best architecture for the current problem. The small size of the dataset produced a relatively high degree of variance for all model types, as expected.
To improve accuracy, that is, to reduce variance in this case, more data combined with the BiLSTM model is the best way to proceed. Given the resource-poor nature of Hmong, a set of 30–50 additional examples targeting areas of ambiguity between class types (e.g., question words such as li cas ‘what, why, how’) would be a realistic path to accuracy above 90%.