A Stanford CoreNLP POS Tagger model for Hmong

A new Stanford CoreNLP POS Tagger model for Hmong is now available.

The model file and corresponding props files are available here: https://github.com/nathanmwhite/hmong-medical-corpus/tree/master/Stanford-CoreNLP

This model is trained and tested on the files created in the previous post, derived from the Hmong Medical Corpus:
The training data file: hmcorpus_train.conllu
The test data file: hmcorpus_test.conllu
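For readers who want to try the model from Python, one option is NLTK's StanfordPOSTagger wrapper, which calls the same edu.stanford.nlp.tagger.maxent.MaxentTagger class that CoreNLP uses (a local Java installation is required). The sketch below is not from the original post, and the file names are placeholders for the downloaded model and a Stanford tagger jar on your system.

from nltk.tag import StanfordPOSTagger

# Placeholder paths: substitute the actual model file from the repository above
# and a local Stanford POS tagger / CoreNLP jar.
hmong_tagger = StanfordPOSTagger('hmong-pos.tagger', 'stanford-postagger.jar')

# Input is pre-tokenized; multi-syllable words use underscores, as in the training data.
print(hmong_tagger.tag(['koj', 'puas', 'mob', '?']))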

Converting text data from SQL tables to CoNLL-U format

The Hmong Medical Corpus stores its tagged text data in a SQL database. To use this data with Stanford CoreNLP, it must first be converted into CoNLL-U format: a tab-separated format with ten columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC) and a blank line between sentences. This post shows how this is done.

First, let’s import the libraries needed.

from itertools import groupby
import os
import sqlite3
import pandas as pd

Next, let’s load the database. The Hmong Medical Corpus database is a SQLite database, so we load it through the sqlite3 module.

conn = sqlite3.Connection('hmcorpus.db')
crsr = conn.cursor()

Now, we use a SQL query to acquire the data we need. We can call the read_sql_query function in Pandas to facilitate creating a Pandas DataFrame.

sql_query = """SELECT doc_ind, sent_ind, token_form, type_form, pos_label, loc FROM tokens
JOIN types ON types.ind=tokens.word_type_ind
JOIN word_loc ON types.word_loc=word_loc.ind
JOIN pos ON pos.ind=types.pos_type;"""
query = pd.read_sql_query(sql_query, conn)
df = pd.DataFrame(query)

Next, we read in a tab-separated mapping file that maps the part-of-speech tags specific to the Hmong Medical Corpus project to those used in the Universal POS tag set.

conv = pd.read_csv('mapping_to_upos.txt', sep='\t')
print(conv)
   XPOS   UPOS
0    CL   NOUN
1    NN   NOUN
2    PU  PUNCT
3    FW      X
4    VV   VERB
5    PP    ADP
6    QU    NUM
7    LC    ADP
8    AD    ADV
9    DT    DET
10   CC  CCONJ
11   CV   NOUN
12   RL   NOUN
13   CS  SCONJ
14   PN   PRON
15   NR  PROPN
16   CM   PART
17   ON   INTJ
18   JJ    ADJ

We now assign descriptive column names to our DataFrame created above.

df.columns = ['doc_ind', 'sent_ind', 'FORM', 'type_form', 'XPOS', 'word_pos']
df.tail()
doc_ind sent_ind FORM type_form XPOS word_pos
9690 11 14 neeg neeg NN B
9691 11 14 nkag nkag VV B
9692 11 14 teb teb NN B
9693 11 14 chaws chaws NN I
9694 11 14 . . PU O

Now, we add the ID column that will appear in the final CoNLL-U files.

df['ID'] = df.index + 1
df.head(20)
doc_ind sent_ind FORM type_form XPOS word_pos ID
0 1 1 Tus tus CL B 1
1 1 1 Mob mob NN B 2
2 1 1 PU O 3
3 1 1 Shigellosis shigellosis FW B 4
4 1 1 Disease disease FW B 5
5 1 1 Fact fact FW B 6
6 1 1 Sheet sheet FW B 7
7 1 1 Series series FW B 8
8 1 1 Tus tus CL B 9
9 1 1 mob mob NN B 10
10 1 1 shigellosis shigellosis FW B 11
11 1 1 zoo zoo VV B 12
12 1 1 li li PP B 13
13 1 1 cas cas DT I 14
14 1 1 ? ? PU O 15
15 1 2 Shigellosis shigellosis FW B 16
16 1 2 yog yog VV B 17
17 1 2 ib ib QU B 18
18 1 2 tug tug CL I 19
19 1 2 mob mob NN B 20

The next step is the most challenging in this process: converting syllable-based tokens reflecting Hmong orthography to word-based tokens required by the CoNLL-U formatting standards. We begin by finding all syllables labeled with a word_pos value of ‘I’ (for “internal”).

i_hits = df[df['word_pos']=='I']
i_hits.tail()
doc_ind sent_ind FORM type_form XPOS word_pos ID
9661 11 14 ntsws ntsws NN I 9662
9662 11 14 qhuav qhuav VV I 9663
9682 11 14 chaws chaws NN I 9683
9687 11 14 choj choj NN I 9688
9693 11 14 chaws chaws NN I 9694

Here, we create a new DataFrame quads where we are going to combine the non-initial syllables. The DataFrame is named quads because the maximum word length in Hmong is four syllables.

quads = i_hits[['type_form', 'word_pos']]
quads.head()
type_form word_pos
13 cas I
18 tug I
24 mob I
52 sim I
56 ntuj I

Now, we reorganize quads so that each row contains four syllables with their corresponding word position tags. This is done in reverse such that type_form_L1 is the form one syllable to the left, type_form_L2 is two syllables to the left, and so on.

l1 = df.loc[quads.index - 1, ['type_form', 'word_pos']]
l1.index = l1.index + 1
quads = quads.join(l1, rsuffix="_L1")
quads.head()
#l1.head()
l2 = df.loc[quads.index - 2, ['type_form', 'word_pos']]
l2.index = l2.index + 2
quads = quads.join(l2, rsuffix="_L2")
quads.head()
l3 = df.loc[quads.index - 3, ['type_form', 'word_pos']]
l3.index = l3.index + 3
quads = quads.join(l3, rsuffix="_L3")
quads.head(10)
type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3
13 cas I li B zoo B shigellosis B
18 tug I ib B yog B shigellosis B
24 mob I kab B cov B ntawm B
52 sim I tshwm B muaj B ntau B
56 ntuj I caij B lub B rau B
57 sov I ntuj I caij B lub B
61 nplooj I caij B lub B thiab B
62 ntoo I nplooj I caij B lub B
63 zeeg I ntoo I nplooj I caij B
66 nyob I nyob B . O zeeg I

Next, if the syllable content in a row belongs to a different word from that found in column type_form, we erase that content so that the DataFrame only contains content belonging to a single word in a row.

m = quads['word_pos_L1'] != 'I'
quads.loc[m, ['type_form_L2', 'type_form_L3', 'word_pos_L2', 'word_pos_L3']] = ['', '', '', '']
m = quads['word_pos_L2'] != 'I'
quads.loc[m, ['type_form_L3', 'word_pos_L3']] = ['', '']

We then reset the index so that the original index from the corpus DataFrame becomes an ordinary column, which we can use to determine which rows represent portions of the same word and to eliminate the duplicates. To do this, we create an offset column that holds the index value of the following row.

The rationale is straightforward: if a row's offset value is exactly one more than its index value, then the row below it continues the same word, so the current row is a duplicate that represents only a portion of the full word. In other words, there is another row further down that contains the complete word.

quads = quads.reset_index()
quads['offset'] = quads['index'].shift(periods=-1)
quads.head(20)
index type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3 offset
0 13 cas I li B 18.0
1 18 tug I ib B 24.0
2 24 mob I kab B 52.0
3 52 sim I tshwm B 56.0
4 56 ntuj I caij B 57.0
5 57 sov I ntuj I caij B 61.0
6 61 nplooj I caij B 62.0
7 62 ntoo I nplooj I caij B 63.0
8 63 zeeg I ntoo I nplooj I caij B 66.0
9 66 nyob I nyob B 75.0
10 75 puas I los B 79.0
11 79 pawg I pab B 87.0
12 87 ke I ua B 94.0
13 94 li I thiaj B 115.0
14 115 sis I tab B 123.0
15 123 nyuam I me B 124.0
16 124 yaus I nyuam I me B 145.0
17 145 nyuam I me B 153.0
18 153 nyuam I me B 162.0
19 162 chaws I teb B 177.0

Since some words have more than two syllables, they occupy more than one row in the quads DataFrame; the following line of code keeps only the rows that contain the complete word. For example, the four-syllable word caij_nplooj_ntoo_zeeg occupies rows 6 through 8 above, and only row 8 (index 63) holds all four syllables, so rows 6 and 7 are dropped.

quads = quads[quads['index'] + 1 != quads['offset']]

Next, we create a FORM column in quads that contains the complete word, combining the content of the type_form_XX columns together with underscores. Using underscores for the syllable breaks is the practice used in CoNLL files for Vietnamese, which has the same syllable-based spacing as Hmong, so we adopt the practice here.

quads['FORM'] = quads['type_form_L3'] + '_' + \
                quads['type_form_L2'] + '_' + \
                quads['type_form_L1'] + '_' + \
                quads['type_form']
quads['FORM'] = quads['FORM'].str.lstrip('_')

For words shorter than four syllables, the empty type_form_L2 and type_form_L3 values leave leading underscores on FORM, which the lstrip call removes. Below, we can see the results in the FORM column on the right.

quads.head(10)
index type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3 offset FORM
0 13 cas I li B 18.0 li_cas
1 18 tug I ib B 24.0 ib_tug
2 24 mob I kab B 52.0 kab_mob
3 52 sim I tshwm B 56.0 tshwm_sim
5 57 sov I ntuj I caij B 61.0 caij_ntuj_sov
8 63 zeeg I ntoo I nplooj I caij B 66.0 caij_nplooj_ntoo_zeeg
9 66 nyob I nyob B 75.0 nyob_nyob
10 75 puas I los B 79.0 los_puas
11 79 pawg I pab B 87.0 pab_pawg
12 87 ke I ua B 94.0 ua_ke

Next, we assign a head_pos column to quads, which determines the position of the initial syllable in our original DataFrame. Then we set head_pos to be the new index and reduce quads to the two columns we need to merge into our original DataFrame: the index head_pos indicating the position where the combined word needs to appear, and FORM containing the newly combined full word.

quads['head_pos'] = quads['index'] - quads['FORM'].str.count('_')
quads.set_index('head_pos', inplace=True)
quads = quads.loc[:, ['FORM']]
quads.head(20)
FORM
0 li_cas
1 ib_tug
2 kab_mob
3 tshwm_sim
5 caij_ntuj_sov
8 caij_nplooj_ntoo_zeeg
9 nyob_nyob
10 los_puas
11 pab_pawg
12 ua_ke
13 thiaj_li
14 tab_sis
16 me_nyuam_yaus
17 me_nyuam
18 me_nyuam
19 teb_chaws
20 sib_deev
21 poj_niam
22 poj_niam
23 txiv_neej

Next, we update the combined words in the original DataFrame containing the full POS-tagged corpus.

df.update(quads)

Next, we need to update all of the POS tags so that a single POS tag that correctly reflects the role of the full word appears in the corpus DataFrame.

First, we need to handle words made up of quantifier + classifier sequences, where the part of speech of the resulting combination is a classifier. We do this with a temporary DataFrame: we extract all of the positions where a classifier appears in non-initial position, select out the instances where the preceding syllable is a quantifier, and assign the tag CL (“classifier”) to that preceding position. We then update the corpus DataFrame. For example, in ib_tug, the quantifier ib is followed by the classifier tug, so the combined word is tagged CL.

dg = df.loc[df[(df['XPOS']=='CL') & (df['word_pos']=='I')].index - 1, ['XPOS']]
dg = dg[dg['XPOS']=='QU']
dg['XPOS'] = 'CL'
df.update(dg)

Second, we handle words consisting of the associative-reciprocal prefix sib + verb, which function as verbs. We do this by finding each instance where the first three letters of the word are sib. Every word that begins with sib is a verb in our corpus, so we can use a simple assignment.

df.loc[df['FORM'].str[:3]=='sib', 'XPOS'] = 'VV'

Third, the ubiquitous unit li cas “what” functions as a whole, in Hmong, as a demonstrative used in questions, so we tag it as DT.

df.loc[df['FORM']=='li_cas', 'XPOS'] = 'DT'

Now that all of the POS tags in the XPOS column have been updated, we can add the UPOS column with the equivalent values from the Universal POS tagset.

df = df.join(conv.set_index("XPOS"), rsuffix="_match", on=["XPOS"])

We can now drop every row where the type_form is a non-initial syllable, leaving only complete words in the corpus DataFrame.

df = df[df['word_pos'] != 'I']
df.head(20)
doc_ind sent_ind FORM type_form XPOS word_pos ID UPOS
0 1 1 li_cas tus DT B 1 DET
1 1 1 ib_tug mob NN B 2 NOUN
2 1 1 kab_mob PU O 3 PUNCT
3 1 1 tshwm_sim shigellosis FW B 4 X
4 1 1 Disease disease FW B 5 X
5 1 1 caij_ntuj_sov fact FW B 6 X
6 1 1 Sheet sheet FW B 7 X
7 1 1 Series series FW B 8 X
8 1 1 caij_nplooj_ntoo_zeeg tus CL B 9 NOUN
9 1 1 nyob_nyob mob NN B 10 NOUN
10 1 1 los_puas shigellosis FW B 11 X
11 1 1 pab_pawg zoo VV B 12 VERB
12 1 1 ua_ke li PP B 13 ADP
14 1 1 tab_sis ? PU O 15 PUNCT
15 1 2 Shigellosis shigellosis FW B 16 X
16 1 2 me_nyuam_yaus yog VV B 17 VERB
17 1 2 me_nyuam ib CL B 18 NOUN
19 1 2 teb_chaws mob NN B 20 NOUN
20 1 2 sib_deev los VV B 21 VERB
21 1 2 poj_niam ntawm LC B 22 ADP

Since our ultimate goal is to create CoNLL-U files that will enable training of a Stanford CoreNLP POS-tagging model, we can simply fill the remaining required columns with underscores.

df['LEMMA'] = '_'
df['FEATS'] = '_'
df['HEAD'] = '_'
df['DEPREL'] = '_'
df['DEPS'] = '_'
df['MISC'] = '_'

Now, we use drop with inplace=True to remove the two columns containing the syllable forms and word position tags, which we only needed during processing.

df.drop(columns=['type_form', 'word_pos'], inplace=True)
df.head()
doc_ind sent_ind FORM XPOS ID UPOS LEMMA FEATS HEAD DEPREL DEPS MISC
0 1 1 li_cas DT 1 DET _ _ _ _ _ _
1 1 1 ib_tug NN 2 NOUN _ _ _ _ _ _
2 1 1 kab_mob PU 3 PUNCT _ _ _ _ _ _
3 1 1 tshwm_sim FW 4 X _ _ _ _ _ _
4 1 1 Disease FW 5 X _ _ _ _ _ _

Next, we retrieve the set of unique doc_ind and sent_ind combinations as a Numpy array.

sentence_ids = df.groupby(['doc_ind', 'sent_ind']).size().reset_index().loc[:, ['doc_ind', 'sent_ind']].values
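An equivalent and arguably more direct way to build the same array (an alternative sketch, not the code from the original post) is to drop duplicate pairs directly:

sentence_ids = df[['doc_ind', 'sent_ind']].drop_duplicates().values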

Here, we define which sentences from the corpus will appear as part of the testing dataset for training later. We select out sentences 7 and 14 from each of the documents in the corpus. Each document contains at least 14 sentences, so this selection will be suitable.

test_ids = [7, 14]

Finally, we create the CoNLL-U files that will be used for training and testing of our Stanford CoreNLP POS-tagging model.

We iterate through the sentence IDs to create a separate DataFrame for each sentence with its own consecutive index, to match CoNLL formatting requirements. Within each sentence, we no longer need the document and sentence numbers, and so we drop these and reorder the remaining columns to match the CoNLL specification. Then we write to file using to_csv.

f = open('hmcorpus_train.conllu', 'a')
g = open('hmcorpus_test.conllu', 'a')
for id in sentence_ids:
    sent_df = df[(df['doc_ind']==id[0]) & (df['sent_ind']==id[1])].reset_index(drop=True)
    sent_df.loc[:, 'ID'] = sent_df.index + 1
    sent_df.drop(columns=['doc_ind', 'sent_ind'], inplace=True)
    new_columns = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']
    sent_df = sent_df[new_columns]
    if id[1] in test_ids:
        sent_df.to_csv(g, sep='\t', header=False, index=False)
        g.write('\n')
    else:
        sent_df.to_csv(f, sep='\t', header=False, index=False)
        f.write('\n')
f.close()
g.close()
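Given the column order and the tab separator above, a sentence block in hmcorpus_train.conllu should look roughly like the following, using the rows from the df.head() output earlier (the blank line written after each block separates sentences):

1	li_cas	_	DET	DT	_	_	_	_	_
2	ib_tug	_	NOUN	NN	_	_	_	_	_
3	kab_mob	_	PUNCT	PU	_	_	_	_	_
...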

Conclusion

Altogether, using the methodology above, we can create CoNLL-U files based on our syllable-tokenized SQL database tables to use with Stanford CoreNLP. In the next post, we will train a Stanford CoreNLP POS-tagging model.

Question classification with limited annotated data

For resource-poor languages such as Hmong, large datasets of annotated questions are unavailable, which means that producing an automated question classifier is a potentially challenging task. Currently, a dataset containing 411 annotated Hmong questions is publicly available. The challenge here is to produce a question classifier with adequate accuracy using this available dataset.

What we are exploring here is how well certain models perform with an intentionally limited set of data. This will allow us to gain a better understanding of what kinds of model architectures would work best in the long term for resource-poor languages, which in most cases will never have the kind of robust data that produce SOTA results in more prominent languages.

We will test five different models with our dataset and compare their accuracy:

  1. Three-layer MLP using word embeddings
  2. Double BiLSTM
  3. Ordered Bidirectional GRU with a Simple RNN
  4. Three-layer MLP using a CountVectorizer
  5. Three-layer MLP using a bigram CountVectorizer and weights regularization

Let’s begin by importing the modules we need to preprocess the data.

import os
import sys
import re

import numpy as np
from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from sklearn.model_selection import train_test_split

pos_tag_interface_path = os.path.expanduser(os.path.join('~',
                                                         'python_workspace',
                                                         'medical_corpus_scripting',
                                                         'pos_tagger_interface'))
sys.path.append(pos_tag_interface_path)
from POS_Tagger import HmongPOSTagger

Next, we load the KeyedVectors based on the models that are part of the Hmong Medical Corpus (HMC):

  1. Subword embeddings based on the annotated HMC data (to handle syllable-based spacing in Hmong)
  2. POS tag embeddings based on the annotated HMC data
  3. Token embeddings based on the SCH (soc.culture.hmong) Corpus

subword_embeddings = KeyedVectors.load('subword_model.h5', mmap='r')
tag_embeddings = KeyedVectors.load('tag_alone_model.h5', mmap='r')
token_embeddings = KeyedVectors.load('word2vec_Hmong_SCH.model', mmap='r')

Next, we load the annotated question dataset from the relevant file.

def load_question_data(filename):
    f = open(filename, 'r')
    data = [w.strip().split(' | ') for w in f.readlines() if '???' not in w and '<' not in w]
    f.close()
    print(len(data))
    return data
data = load_question_data('question_type_training_set_2.txt')
411

The dataset contains an uneven distribution of examples: it reflects the kinds of questions for which annotated data are available. This means we need to produce a smaller dataset with an even number of examples for each category.

Ensuring this, however, means that our dataset of 411 examples shrinks to 135: 15 examples for each of the nine categories that have at least 15 examples.

from collections import Counter

c = Counter(w[1] for w in data)

sorted_data = sorted(list(c.items()), key=lambda item: item[1], reverse=True)
print(sorted_data)

first_filtered_data = [w for w in data if c[w[1]] >= 15]
#filtered_data = [w for w in data if c[w[1]] > 2]

total_count = {w[0]: 0 for w in sorted_data}

filtered_data = []
for item in first_filtered_data:
    if total_count[item[1]] < 15:
        filtered_data.append(item)
        total_count[item[1]] += 1

print(len(filtered_data))

print(total_count.items())
[('Reason', 118), ('Polar', 83), ('Action', 20), ('Description', 18), ('Person', 18), ('Location', 18), ('Time', 15), ('Duration', 15), ('Destination', 15), ('Name', 13), ('Number', 10), ('Event', 10), ('Year', 10), ('Clan', 7), ('Opinion', 6), ('Thing', 5), ('Choice', 4), ('Source', 4), ('Kind', 3), ('Translation', 3), ('Month', 2), ('Meaning', 2), ('Manner', 2), ('Goal', 2), ('Country', 2), ('Spirit', 2), ('Date', 1), ('River', 1), ('Curse', 1), ('Day', 1)]
135
dict_items([('Reason', 15), ('Polar', 15), ('Action', 15), ('Description', 15), ('Person', 15), ('Location', 15), ('Time', 15), ('Duration', 15), ('Destination', 15), ('Name', 0), ('Number', 0), ('Event', 0), ('Year', 0), ('Clan', 0), ('Opinion', 0), ('Thing', 0), ('Choice', 0), ('Source', 0), ('Kind', 0), ('Translation', 0), ('Month', 0), ('Meaning', 0), ('Manner', 0), ('Goal', 0), ('Country', 0), ('Spirit', 0), ('Date', 0), ('River', 0), ('Curse', 0), ('Day', 0)])

For the next step, we split the data into questions and labels using zip.

questions, labels = zip(*filtered_data)

Now, we load the Hmong POS tagger and tag the words that make up the questions. This produces tags of the form subword-POS, e.g., B-NN for the first syllable of a word whose part of speech is noun.

def tag_question_data(questions):
    tagger = HmongPOSTagger()
    tokenized_questions = [re.sub(r'([?,;])', r' \g<1>', q).split(' ') for q in questions]
    return tokenized_questions, tagger.tag_words(tokenized_questions)
tokenized_questions, tags = tag_question_data(questions)

Here, we split the subword tags and the POS tags, placing them in separate sentence sets.

def split_subword_pos_tags(tags):
    """This function takes tags of type B-NN (subword-POS)
    and produces separate lists for subword tags and POS tags"""
    subword_tags = []
    pos_tags = []
    for sent in tags:
        subword_sent = []
        pos_sent = []
        for word in sent:
            if word == '-PAD-': # unknown word that needs reassignment
                subword = 'B'
                pos = 'FW'
            else:
                subword, pos = word.split('-')
            subword_sent.append(subword)
            pos_sent.append(pos)
        subword_tags.append(subword_sent)
        pos_tags.append(pos_sent)
    return subword_tags, pos_tags
subword_tags, pos_tags = split_subword_pos_tags(tags)

The next step converts the words, tags, and labels to integers using keras.preprocessing.text.Tokenizer and sets the values for -PAD- and -OUT-.

# notes: keras.preprocessing.text.one_hot, text_to_word_sequence
# special pad values are used because Keras Tokenizer does not permit 0 as a value
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokenized_questions)
# automatically converts words to sequences of numbers
sequences = tokenizer.texts_to_sequences(tokenized_questions)
word_pad_value = max(tokenizer.word_index.values()) + 1
word_out_value = word_pad_value + 1

pos_tag_tokenizer = Tokenizer()
pos_tag_tokenizer.fit_on_texts(pos_tags)
pos_sequences = pos_tag_tokenizer.texts_to_sequences(pos_tags)
pos_pad_value = max(pos_tag_tokenizer.word_index.values()) + 1
pos_out_value = pos_pad_value + 1

subword_tag_tokenizer = Tokenizer()
subword_tag_tokenizer.fit_on_texts(subword_tags)
subword_sequences = subword_tag_tokenizer.texts_to_sequences(subword_tags)
subword_pad_value = max(subword_tag_tokenizer.word_index.values()) + 1
subword_out_value = subword_pad_value + 1

# can use label_tokenizer.sequences_to_texts once done
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_sequences = [l[0] for l in label_tokenizer.texts_to_sequences(labels)]

Now, we set the maximum length of the sentences, define the variable word_index for use later, and then pad the input sequences.

MAX_LENGTH = max(len(s) for s in sequences)

word_index = tokenizer.word_index

padded_sequences = pad_sequences(sequences,
                                 maxlen=MAX_LENGTH,
                                 padding='post',
                                 value=word_pad_value)
padded_pos_sequences = pad_sequences(pos_sequences,
                                     maxlen=MAX_LENGTH,
                                     padding='post',
                                     value=pos_pad_value)
padded_subword_sequences = pad_sequences(subword_sequences,
                                         maxlen=MAX_LENGTH,
                                         padding='post',
                                         value=subword_pad_value)

Next, before we split the data into training and test sets, we must convert the input sentences using CountVectorizer. Doing this before the split ensures that the models using the CountVectorizer data see the same sentences as the other models, so the comparison is fair. The first step is to produce string sentences that contain the original word, the subword position tag, and the POS tag. Here, I combine all three into single units, so that, for example, mobBNN (the noun mob ‘sickness’ at the beginning of a word, tagged B-NN) is treated as a different feature from mobBVV (the verb mob ‘be sick’ at the beginning of a word, tagged B-VV).

from sklearn.feature_extraction.text import CountVectorizer

def join_data(tokenized_questions, subword_tags, pos_tags):
    joined_data = []
    for i, q in enumerate(tokenized_questions):
        joined_sent = []
        for j, word in enumerate(q):
            joined_sent.append(''.join([word, subword_tags[i][j], pos_tags[i][j]]))
        joined_data.append(' '.join(joined_sent))
    
    print(joined_data[:2])
    return joined_data

Next, we create a CountVectorizer object for unigrams and create vectors for each sentence.

#CountVectorizer needs sentences made of strings.
joined_data = join_data(tokenized_questions, subword_tags, pos_tags)

vectorizer = CountVectorizer()
vectorizer.fit(joined_data)
vectors = np.array([v.toarray()[0] for v in vectorizer.transform(joined_data)])
['UaBVV casINN kojBPN lamBFW haisBVV liBPP koBFW rauBPP kuvBPN ?OPU', 'LeejBCL nusBFW ,OPU kojBPN pabBVV kuvBPN puasBAD tauBVV ?OPU']
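Each row of vectors is a count vector over the unigram vocabulary learned by the CountVectorizer: one row per question, one column per feature. A quick sanity check (a sketch, not part of the original notebook; the expected numbers are inferred from the 135 retained questions and the input shape of the unigram MLP summary further below):

print(vectors.shape)                # expected: (135, 340)
print(len(vectorizer.vocabulary_))  # expected: 340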

Now, we create a CountVectorizer object for bigrams.

pure_vectorizer_bigrams = CountVectorizer(ngram_range=(2,2))
pure_vectorizer_bigrams.fit(joined_data)
pure_bigram_vectors = np.array([v.toarray()[0] for v in pure_vectorizer_bigrams.transform(joined_data)])

Once all the input preprocessing is complete, we can split the dataset into training and test sets. We have several different kinds of data as input: the word token sequences, vector sequences representing both unigrams and bigrams, the POS tag sequences, and the subword position tag sequences, along with the output question label sequences. Each of these is split into training and test sets, where all of the training sets contain the same sentences in the same order, as do the test sets.

labels_matrix = to_categorical(np.asarray(label_sequences))

TEST_SIZE = 0.2
VALIDATION_SIZE = 0.2

(X_train, 
 X_test, 
 X_vector_train, 
 X_vector_test, 
 X_bigram_vector_train,
 X_bigram_vector_test,
 X_pos_train,
 X_pos_test,
 X_subword_train,
 X_subword_test,
 y_train, 
 y_test) = train_test_split(padded_sequences, 
                                   vectors,
                                   pure_bigram_vectors,
                                   padded_pos_sequences,
                                   padded_subword_sequences,
                                   label_sequences,
                                   test_size=TEST_SIZE)

word_set = set(word_value for sent in X_train for word_value in sent)
pos_set = set(pos_value for sent in X_pos_train for pos_value in sent)
subword_set = set(subword_value for sent in X_subword_train for subword_value in sent)

Once this is complete, we define a function that will create embedding matrices for each input set that requires it for three of the models we will consider below.

def create_embeddings_matrix(input_set, index_dict, word2vec_source_object, convert_caps=False):
    '''Creates an embedding matrix from preexisting Word2Vec model for use in the Keras Embedding layer
    @param input_set: the set object containing all unique X entries in X_train as numeral values
    @param index_dict: the Tokenizer.word_index dict object containing numerical conversions
    @param word2vec_source_object: the KeyedVectors object containing the vector values for embedding'''
    # $PAD and $OUT remain zeros; they are max(index_dict.values()) + 1 and + 2, respectively
    pad_out_tags_length = 2
    embedding_matrix = np.zeros((max(index_dict.values()) + pad_out_tags_length + 1,
                                 word2vec_source_object.vector_size))
    for token, numeral in index_dict.items():
        if numeral in input_set:
            try:
                if convert_caps == True:
                    word2vec_token_value = token.upper()
                else:
                    word2vec_token_value = token
                embedding_vector = word2vec_source_object.wv[word2vec_token_value]
            except KeyError:
                embedding_vector = None
            if embedding_vector is not None:
                embedding_matrix[numeral] = embedding_vector
    return embedding_matrix

Now, we use the new function to create the matrices for the word inputs, the POS tag inputs, and the subword position tag inputs.

words_embedding_matrix = create_embeddings_matrix(word_set,
                                                  tokenizer.word_index,
                                                  token_embeddings)
pos_embedding_matrix = create_embeddings_matrix(pos_set, 
                                                pos_tag_tokenizer.word_index, 
                                                tag_embeddings, 
                                                True)
subword_embedding_matrix = create_embeddings_matrix(subword_set, 
                                                    subword_tag_tokenizer.word_index,
                                                    subword_embeddings, 
                                                    True)

Here, we create another function to produce input matrices from the embedding matrices we just created.

def produce_input_matrix(sequences, embedding_matrix):
    output_sequences = []
    for sent in sequences:
        output_sent = []
        for word in sent:
            output_sent.append(embedding_matrix[word])
        output_sequences.append(output_sent)
    return output_sequences

Now, we produce sequence matrices using the function we just created. These will be the input sets for the first three models we’ll consider.

X_train_sequence_matrix = produce_input_matrix(X_train, 
                                               words_embedding_matrix)
X_test_sequence_matrix = produce_input_matrix(X_test, 
                                              words_embedding_matrix)
X_pos_train_sequence_matrix = produce_input_matrix(X_pos_train, 
                                                   pos_embedding_matrix)
X_pos_test_sequence_matrix = produce_input_matrix(X_pos_test, 
                                                  pos_embedding_matrix)
X_subword_train_sequence_matrix = produce_input_matrix(X_subword_train, 
                                                       subword_embedding_matrix)
X_subword_test_sequence_matrix = produce_input_matrix(X_subword_test, 
                                                      subword_embedding_matrix)

Now, we define how large our output vectors should be by defining y_classes on the basis of the total number of categories found among the question labels in our dataset (here 10: the nine question categories plus the unused index 0, since the Keras Tokenizer numbers from 1). Then, we convert our y_train and y_test sets to one-hot vectors.

y_classes = max(label_tokenizer.word_index.values()) + 1
y_train = to_categorical(y_train, num_classes=y_classes)
y_test = to_categorical(y_test, num_classes=y_classes)

Next, we import the necessary libraries from Keras to build the models.

from keras.models import Model
from keras.layers import Input, Embedding, Activation, Flatten, Add
from keras.layers import Dense, LSTM, Bidirectional, GRU, SimpleRNN
from keras.callbacks import EarlyStopping
from keras.regularizers import l2

Next, we create a function that will provide plots of the loss and accuracy results.

from matplotlib import pyplot as plt

%matplotlib inline

def plot_metrics(history_obj):
    plt.plot(history_obj.history['loss'], label='Training loss')
    plt.plot(history_obj.history['val_loss'], label='Validation loss')
    plt.legend(loc="upper left")
    plt.show()
    
    plt.plot(history_obj.history['accuracy'], label='Training accuracy')
    plt.plot(history_obj.history['val_accuracy'], label='Validation accuracy')
    plt.legend(loc="upper left")
    plt.show()

Next, we create a function that can train and test each model.

def run_model(model, countvectorizer=False, bigrams=False, validation=VALIDATION_SIZE):
    if countvectorizer:
        if bigrams:
            model_history = model.fit(np.array(X_bigram_vector_train),
                                      y_train,
                                      batch_size=4, 
                                      epochs=50, 
                                      validation_split=validation)
            plot_metrics(model_history)
            scores = model.evaluate(np.array(X_bigram_vector_test), y_test)
        else:
            model_history = model.fit(np.array(X_vector_train),
                                      y_train,
                                      batch_size=4, 
                                      epochs=50, 
                                      validation_split=validation)
            plot_metrics(model_history)
            scores = model.evaluate(np.array(X_vector_test), y_test)
    else:
        model_history = model.fit([X_train_sequence_matrix, 
                                   X_pos_train_sequence_matrix,
                                   X_subword_train_sequence_matrix], 
                                  y_train,
                                  batch_size=4, 
                                  epochs=50, 
                                  validation_split=validation)
        plot_metrics(model_history)
        scores = model.evaluate([X_test_sequence_matrix, 
                                 X_pos_test_sequence_matrix,
                                 X_subword_test_sequence_matrix], 
                                y_test)
    print("Training set accuracy: {result:.2f} percent".format(result= \
                                                               model_history.history['accuracy'][-1]*100))
    if validation > 0.0:
        print("Validation set accuracy: {result:.2f} percent".format(result= \
                                                                 model_history.history['val_accuracy'][-1]*100))
    print("Accuracy: {result:.2f} percent".format(result=(scores[1]*100)))

At this point, we are ready to build and test the models. Our first model is a multilayer perceptron (MLP). This network will have three Dense layers: two hidden layers using a rectified linear unit (ReLU) activation and one output layer using a softmax activation. We also use a Flatten layer between the second hidden layer and the output layer, as our input is sentences made up of words and our output is a single label. That is, each question as input is two-dimensional, while each label as output is one-dimensional, meaning that we need to reduce the dimensionality—exactly what Flatten does.

# Ordered MLP
word_input = Input(shape=(MAX_LENGTH, 150))
pos_input = Input(shape=(MAX_LENGTH, 150))
subword_input = Input(shape=(MAX_LENGTH, 150))
addition_layer = Add()([word_input, pos_input, subword_input])
dense_1 = Dense(256, activation='relu')(addition_layer)
dense_2 = Dense(256, activation='relu')(dense_1)
flatten_layer = Flatten()(dense_2)
dense_3 = Dense(y_classes, activation='softmax')(flatten_layer)

model = Model(inputs=[word_input, pos_input, subword_input], outputs=dense_3)
model.compile(optimizer='adam', 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

print(model.summary())
Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_13 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
input_14 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
input_15 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
add_4 (Add)                     (None, 34, 150)      0           input_13[0][0]                   
                                                                 input_14[0][0]                   
                                                                 input_15[0][0]                   
__________________________________________________________________________________________________
dense_15 (Dense)                (None, 34, 256)      38656       add_4[0][0]                      
__________________________________________________________________________________________________
dense_16 (Dense)                (None, 34, 256)      65792       dense_15[0][0]                   
__________________________________________________________________________________________________
flatten_2 (Flatten)             (None, 8704)         0           dense_16[0][0]                   
__________________________________________________________________________________________________
dense_17 (Dense)                (None, 10)           87050       flatten_2[0][0]                  
==================================================================================================
Total params: 191,498
Trainable params: 191,498
Non-trainable params: 0
__________________________________________________________________________________________________
None

Now we run the MLP model using the function we defined above. As we can see from the graphs, the model effectively converged after only two iterations in terms of loss, but, interestingly, the validation accuracy continued to increase even as the validation loss slowly increased. In any case, this MLP model produced an accuracy of only 18.52% on the test data; the variance produced by the limited amount of training data proved too problematic for the MLP with the hyperparameters chosen.

run_model(model)
Train on 86 samples, validate on 22 samples
Epoch 1/50
86/86 [==============================] - 1s 15ms/step - loss: 2.9749 - accuracy: 0.2558 - val_loss: 2.6418 - val_accuracy: 0.3182
...
Epoch 50/50
86/86 [==============================] - 1s 7ms/step - loss: 1.5699e-04 - accuracy: 1.0000 - val_loss: 2.8688 - val_accuracy: 0.4545
27/27 [==============================] - 0s 824us/step
Training set accuracy: 100.00 percent
Validation set accuracy: 45.45 percent
Accuracy: 18.52 percent

Our second model is a BiLSTM model, which contains two Bidirectional LSTM (Long Short-Term Memory) hidden layers and a Dense layer with a softmax activation as output.

word_bilstm2_input = Input(shape=(MAX_LENGTH, 150))
pos_bilstm2_input = Input(shape=(MAX_LENGTH, 150))
subword_bilstm2_input = Input(shape=(MAX_LENGTH, 150))
addition_bilstm2_layer = Add()([word_bilstm2_input, pos_bilstm2_input, subword_bilstm2_input])
bilstm2_layer = Bidirectional(LSTM(256, return_sequences=True))(addition_bilstm2_layer)
bilstm2_layer_2 = Bidirectional(LSTM(256))(bilstm2_layer)
final_bilstm2_layer = Dense(y_classes, activation='softmax')(bilstm2_layer_2)

model_bilstm2 = Model(inputs=[word_bilstm2_input, 
                              pos_bilstm2_input, 
                              subword_bilstm2_input], 
                      outputs=final_bilstm2_layer)
model_bilstm2.compile(loss='categorical_crossentropy', 
                      optimizer='adam', 
                      metrics=['accuracy'])

print(model_bilstm2.summary())
Model: "model_8"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_16 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
input_17 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
input_18 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
add_5 (Add)                     (None, 34, 150)      0           input_16[0][0]                   
                                                                 input_17[0][0]                   
                                                                 input_18[0][0]                   
__________________________________________________________________________________________________
bidirectional_4 (Bidirectional) (None, 34, 512)      833536      add_5[0][0]                      
__________________________________________________________________________________________________
bidirectional_5 (Bidirectional) (None, 512)          1574912     bidirectional_4[0][0]            
__________________________________________________________________________________________________
dense_18 (Dense)                (None, 10)           5130        bidirectional_5[0][0]            
==================================================================================================
Total params: 2,413,578
Trainable params: 2,413,578
Non-trainable params: 0
__________________________________________________________________________________________________
None

Here, we run the BiLSTM. The results are markedly higher than with our MLP, which is unsurprising, given that the BiLSTM architecture is more appropriate than an MLP for ordered sentence data. While our validation loss got progressively worse after only three iterations, the validation accuracy continued to improve until about the fifth iteration and then leveled off. The validation accuracy of 72.73% means that this model still struggles with high variance, but the 88.89% accuracy on the test set means that this model looks promising if the variance is adequately addressed.

run_model(model_bilstm2)
Train on 86 samples, validate on 22 samples
Epoch 1/50
86/86 [==============================] - 25s 291ms/step - loss: 1.8337 - accuracy: 0.3837 - val_loss: 1.2817 - val_accuracy: 0.5909
...
Epoch 50/50
86/86 [==============================] - 24s 278ms/step - loss: 6.3236e-05 - accuracy: 1.0000 - val_loss: 1.6352 - val_accuracy: 0.7273
27/27 [==============================] - 1s 30ms/step
Training set accuracy: 100.00 percent
Validation set accuracy: 72.73 percent
Accuracy: 88.89 percent

Our third model is a Bidirectional GRU (gated recurrent unit) with a SimpleRNN (recurrent neural network) layer and a Dense layer with a softmax activation as output.

# Bidirectional GRU with simple RNN
word_bigru_input = Input(shape=(MAX_LENGTH, 150))
pos_bigru_input = Input(shape=(MAX_LENGTH, 150))
subword_bigru_input = Input(shape=(MAX_LENGTH, 150))
addition_bigru_layer = Add()([word_bigru_input, pos_bigru_input, subword_bigru_input])
bigru_layer = Bidirectional(GRU(256, return_sequences=True))(addition_bigru_layer)
rnn_bigru_layer = SimpleRNN(256, activation='relu')(bigru_layer)
dense_bigru_layer = Dense(y_classes, activation='softmax')(rnn_bigru_layer)

model_bigru = Model(inputs=[word_bigru_input, pos_bigru_input, subword_bigru_input], outputs=dense_bigru_layer)
model_bigru.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model_bigru.summary())
Model: "model_9"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_19 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
input_20 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
input_21 (InputLayer)           (None, 34, 150)      0                                            
__________________________________________________________________________________________________
add_6 (Add)                     (None, 34, 150)      0           input_19[0][0]                   
                                                                 input_20[0][0]                   
                                                                 input_21[0][0]                   
__________________________________________________________________________________________________
bidirectional_6 (Bidirectional) (None, 34, 512)      625152      add_6[0][0]                      
__________________________________________________________________________________________________
simple_rnn_2 (SimpleRNN)        (None, 256)          196864      bidirectional_6[0][0]            
__________________________________________________________________________________________________
dense_19 (Dense)                (None, 10)           2570        simple_rnn_2[0][0]               
==================================================================================================
Total params: 824,586
Trainable params: 824,586
Non-trainable params: 0
__________________________________________________________________________________________________
None

Now we run the BiGRU. The results suggest that the training loss has not yet converged, which, taken with the training accuracy at 96.51%, would mean the model should be run for more iterations. However, the validation loss and accuracy suggest that the training set and validation sets give quite divergent results. Furthermore, the test set accuracy at 37.04% means the variance adversely affects this model architecture as severely as with the MLP.

run_model(model_bigru)
Train on 86 samples, validate on 22 samples
Epoch 1/50
86/86 [==============================] - 16s 190ms/step - loss: 2.3191 - accuracy: 0.1047 - val_loss: 2.3406 - val_accuracy: 0.0909
...
Epoch 50/50
86/86 [==============================] - 14s 164ms/step - loss: 0.1099 - accuracy: 0.9651 - val_loss: 3.7930 - val_accuracy: 0.4545
27/27 [==============================] - 0s 10ms/step
Training set accuracy: 96.51 percent
Validation set accuracy: 45.45 percent
Accuracy: 37.04 percent

Next, we try CountVectorizer data with a three-layer MLP.

# CountVectorizer with MLP
bag_word_input = Input(shape=(len(vectors[0]),))
bag_dense_1 = Dense(256, activation='relu')(bag_word_input)
bag_dense_2 = Dense(256, activation='relu')(bag_dense_1)
bag_dense_3 = Dense(y_classes, activation='softmax')(bag_dense_2)
model_cv = Model(inputs=bag_word_input, outputs=bag_dense_3)
model_cv.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

print(model_cv.summary())
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 340)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               87296     
_________________________________________________________________
dense_4 (Dense)              (None, 256)               65792     
_________________________________________________________________
dense_5 (Dense)              (None, 10)                2570      
=================================================================
Total params: 155,658
Trainable params: 155,658
Non-trainable params: 0
_________________________________________________________________
None

Next, we run the model. This MLP model using CountVectorizer data performs much better than the MLP using word embeddings above. The validation loss effectively reaches its minimum at the seventh iteration, and the validation accuracy reaches its maximum of 72.73% at around iteration 21, meaning that 50 iterations are unnecessary and in fact slightly hurt the results. Nevertheless, like the other models, this model struggles with the variance produced by the small data size, given a final test accuracy of 74.07% versus a training set accuracy of 100.00%.

run_model(model_cv, countvectorizer=True)
Train on 86 samples, validate on 22 samples
Epoch 1/50
86/86 [==============================] - 1s 14ms/step - loss: 2.2271 - accuracy: 0.1977 - val_loss: 2.1032 - val_accuracy: 0.2273
...
Epoch 50/50
86/86 [==============================] - 0s 4ms/step - loss: 2.7790e-04 - accuracy: 1.0000 - val_loss: 1.2247 - val_accuracy: 0.6818
27/27 [==============================] - 0s 565us/step
Training set accuracy: 100.00 percent
Validation set accuracy: 68.18 percent
Accuracy: 74.07 percent

Our fifth and final model is an MLP using bigram CountVectorizer data as input and L2 regularization to prevent early convergence.

# Bigram CountVectorizer with MLP and regularization
bireg_word_input = Input(shape=(len(pure_bigram_vectors[0]),))
bireg_dense_1 = Dense(256, activation='relu', kernel_regularizer=l2(l=0.003))(bireg_word_input)
bireg_dense_2 = Dense(256, activation='relu', kernel_regularizer=l2(l=0.003))(bireg_dense_1)
bireg_dense_3 = Dense(y_classes, activation='softmax')(bireg_dense_2)
model_bireg = Model(inputs=bireg_word_input, outputs=bireg_dense_3)
model_bireg.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

print(model_bireg.summary())
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         (None, 992)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 256)               254208    
_________________________________________________________________
dense_7 (Dense)              (None, 256)               65792     
_________________________________________________________________
dense_8 (Dense)              (None, 10)                2570      
=================================================================
Total params: 322,570
Trainable params: 322,570
Non-trainable params: 0
_________________________________________________________________
None

Here again, we run the model. With bigram CountVectorizer data and L2 regularization, the model still struggles: the validation accuracy oscillates wildly and declines the longer the model is trained. A final validation set accuracy of 63.64% means that L2 regularization did not succeed in handling the variance issue in this case. However, the test set accuracy of 85.19% suggests that, with other approaches to handling the variance, this model architecture could prove quite fruitful.

run_model(model_bireg, countvectorizer=True, bigrams=True, validation=0.2)
Train on 86 samples, validate on 22 samples
Epoch 1/50
86/86 [==============================] - 1s 17ms/step - loss: 3.8374 - accuracy: 0.1744 - val_loss: 3.4474 - val_accuracy: 0.1818
...
Epoch 50/50
86/86 [==============================] - 1s 7ms/step - loss: 0.0896 - accuracy: 1.0000 - val_loss: 0.9851 - val_accuracy: 0.6364
27/27 [==============================] - 0s 447us/step
Training set accuracy: 100.00 percent
Validation set accuracy: 63.64 percent
Accuracy: 85.19 percent

Conclusion

The results of the five models are shown in the table below.

Model type   Input data type                   Training accuracy   Validation accuracy   Test accuracy
MLP          Word embedding                    100.00%             45.45%                18.52%
BiLSTM       Word embedding                    100.00%             72.73%                88.89%
BiGRU/RNN    Word embedding                    96.51%              45.45%                37.04%
MLP          CountVectorizer unigram vectors   100.00%             68.18%                74.07%
MLP          CountVectorizer bigram vectors    100.00%             63.64%                85.19%

The results show that BiLSTM is the best architecture for the current problem. The small size of the dataset produced a relatively high degree of variance for all model types, which is naturally expected.

To improve accuracy, that is, to reduce variance in this case, more data combined with the BiLSTM model is the best way to proceed. Given the resource-poor nature of the Hmong language, a set of 30-50 additional examples targeting areas of ambiguity between class types (e.g., question words such as li cas ‘what, why, how’) would be a realistic way to achieve >90% accuracy.

Using Word Embeddings for Semantic Analysis of Nominal Classifiers

Word embeddings created by Word2Vec can be utilized in exploring the semantic distributions of nouns associated with nominal classifiers. In this post, we explore using dendrogram analysis and k-means clustering with word embeddings as a means to form hypotheses for research involving these distributions.

Nominal classifiers are known to have a range of semantic values that often form a sort of semantic network reflected in the nouns with which they co-occur. This means that the co-occurring nouns will often have various semantic relationships or fall into various semantic groups that would enable us to determine the various categories found in these semantic networks. For those lacking a linguistics background, a nominal classifier is a part of speech not found in European languages, but is ubiquitous in East and Southeast Asian languages, and should not be confused with classifiers in the machine learning sense.

We can perform this exploration by comparing word embeddings of the words that co-occur with a given nominal classifier, either through a dendrogram or k-means clustering.

Here, our goal is to produce a semantic analysis of the Hmong nominal classifier tus.

Ideally, we would use some form of semantic ontology-based system (e.g. WordNet) for word embeddings, but this does not yet exist for a resource-poor language like Hmong, meaning that our best option is the raw text found in the approximately 12-million-token soc.culture.hmong (SCH) corpus.

To enable this sort of analysis, we make the following assumption: given that the word embeddings are trained based on their context in Word2Vec, and this context encodes both syntactic and semantic information, the most similar words will share both syntactic context and semantic values. At the same time, many words will be moderately similar—they may share semantic values but not many syntactic contexts (still desirable for this approach), or many syntactic contexts but not semantic values (the drawback to the approach).

This means that as we pursue the analysis, we must remember that the word embeddings are not purely semantic, but reflect both syntactic and semantic properties of the words, so some words will appear in certain groups because of their syntactic properties rather than their semantics. Nevertheless, because semantics is a major determining factor of similarity in a significant portion of cases, this approach yields useful results that provide a strong basis for forming hypotheses for further research, so long as the results are considered judiciously.

Let’s begin.

Import libraries.

The first step is to import the relevant modules and classes. The NLTK library is used to manipulate the data from the corpus that we’re going to use, and Word2Vec is used to convert the corpus vocabulary into vectors that can be manipulated for the analysis.
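The modules and classes used below can be imported as follows (a sketch inferred from the code in this post, including the measures object used for scoring bigrams later):

import copy
import os

from gensim.models import Word2Vec
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import PlaintextCorpusReader

# Instance whose chi_sq attribute is used to score bigrams below.
measures = BigramAssocMeasures()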

Load training corpus.

The next step is to navigate to the folder containing the raw corpus and import it using the PlaintextCorpusReader class from the nltk.corpus module.
The corpus we’re using is the SCH Corpus, a publicly available corpus of Hmong text derived from posts to the soc.culture.hmong Usenet newsgroup.

os.chdir(os.path.expanduser(os.path.join('~','sch_corpus')))

hmong = PlaintextCorpusReader('.', '.*').sents()

Train word embeddings with Word2Vec.

Next, we use the Word2Vec class from gensim.models to create our word vectors. The argument window is set to 10 to indicate that a window of 10 around the chosen word should be used to train the vectors. size is the size of the vector for each word, set here to 150 to enable a reasonably robust yet compact set of vectors. iter is the number of iterations used in training; here, I’ve set it to 50.

model = Word2Vec(sentences=hmong, window=10, size=150, iter=50, workers=10)
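Note that size and iter are the gensim 3.x parameter names; if you are running gensim 4.0 or later (an assumption about your environment), the equivalent call uses vector_size and epochs instead:

# gensim >= 4.0 equivalent of the call above (parameter names were renamed in 4.0)
model = Word2Vec(sentences=hmong, window=10, vector_size=150, epochs=50, workers=10)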

Carry out data preprocessing to produce a high-quality set of nouns.

This step uses BigramCollocationFinder from nltk.collocations to find all the bigram collocations in the corpus. We do this because we want to find the nouns that collocate with the Hmong classifier tus. We save a copy of the finder for later use with copy.copy.

bigrams = BigramCollocationFinder.from_words([w for sent in hmong for w in sent])
bigrams_copy = copy.copy(bigrams)

Next, we apply several filters to limit the bigrams we’re considering to only those that contain tus or its variant form tug, and to ensure the co-occurrence is reasonably common in the corpus, since we want nouns that commonly occur with tus.

bigrams.apply_ngram_filter(lambda x, y: x.lower() not in ['tus', 'tug'])
bigrams.apply_freq_filter(20)

For this step, we score the remaining bigrams by their degree of association, using the chi_sq measure from nltk.collocations.BigramAssocMeasures, and keep the 700 highest-ranked bigrams, as lower-ranked members represent instances where the relationship between tus and the second word is not particularly strong. Then we extract only the second word of each bigram from the (bigram, score) pairs, as the first word will be the classifier tus itself.

out = bigrams.score_ngrams(measures.chi_sq)[:700]
out_proc = [w[0][1] for w in out]

Then, we select the 500 most common of these bigrams by frequency, extract the second word from each (bigram, frequency) pair, and lowercase each one. We then limit the set of nouns under consideration to those present both in the 500 most common bigrams and in the 700 bigrams with the highest chi-squared scores. This balances the nouns that most commonly occur with tus in the corpus against those that correlate most strongly with tus in particular.

finds = bigrams.ngram_fd.most_common(500)
finds_proc = [w[0][1] for w in finds]
finds_proc_lower = [w.lower() for w in finds_proc]
total_proc = [w.lower() for w in out_proc if w in finds_proc_lower]

Next, we need to clean the list of nouns so that it includes only clear nouns, only complete words, and only nouns from White Hmong. For languages with better available resources, we would use a POS tagger at an earlier stage of the process, where this would be done automatically; here, a list of non-nouns in our set has been provided manually.

In Hmong, classifiers like tus can be followed by elements that are not nouns, such as relative clauses or localizers (a special class of words indicating relative spatial position, common in Asian languages); in these cases, the noun is either omitted or zero.

Also, Hmong has two common orthographies: one that puts spaces between syllables, and another that puts spaces between words. As a result, we need to remove any items that are only syllables of longer words, keeping only complete words.

Finally, the SCH corpus contains data from both White Hmong and Green Mong. Including data from both would create confusion in our analysis, so we explicitly limit our nouns to those from White Hmong.

non_nouns_to_exclude = ['puav', 'me', 'hluas', 'kws', 'laus', 'twg', 'uas', \
                        'laug', 'ub', 'mos2', '22', 'hlob', 'loj', 'coj', '.', ',', \
                        'ntawd', 'yog', 'tod', 'swb', 'li', 'tuag', '#', 'sau', \
                        'niag', 'tias', 'lawm', 'ib', 'mos', 'muab', '/', 'muaj', \
                        'nrog', 'rau', 'luag', 'ua', 'los', 'nws', 'txawm', 'hais', \
                        'thaum', 'lawv', 'tsi', 'es', 'phem', 'nuav', 'tej', 'has', \
                        'xav', 'hov', 'kuv', 'ces', 'ntawm', 'tawm', 'lwm', '(', 'kiag',\
                        'hu', 'cov', 'ntseeg', 'mus', 'ko', 'mas', 'tiag', 'to', \
                        'yam', 'tag', 'nawb', 'pom', 'miv', 'no', 'peb', 'sib', 'hlub', \
                        'twb', 'thiab', 'pab', 'leej', 'tsis', '...', 'kawg', 'kom', \
                        'xwb', 'tau', 'tshiab', 'noj', 'tus', 'qub', 'lub', 'txoj', \
                        'nyuas', 'thib', 'ntse', 'nyuag', 'thiaj', 'tshab', 'nua', 'koj',\
                        'tham', 'yau', 'tham', 'saib', 'hauv', 'yees', 'teb', 'luj', \
                        'txiav', 'tswj', 'xub', 'thaub', 'cuav', 'puas', 'txheeb', 'puag', \
                        'ruam', 'siab', 'tsim', 'pluag', 'yus', 'tuav', 'rog', 'txawj',\
                        'mob', 'tub']
partial_words_to_exclude = ['poj', 'tij', 'quas', 'xf', 'dr', 'ntsuj', 'tib', 'tuab', \
                            'teeb', 'yeeb', 'xeeb', 'kas', 'cawm', 'zuj', 'npau', 'cuj',\
                            'cwj', 'xov', 'kav', 'kab', 'txheej', 'xib', 'huab', 'pej',\
                            'phooj']
green_mong_to_exclude = ['mivnyuas', 'nam', 'dlaab', 'puj', 'moob', 'tuabneeg', 'quasyawg',\
                         'quaspuj', 'dlev', 'tsaj', 'nav', 'qab']
total_proc = [w for w in total_proc if w not in non_nouns_to_exclude]
total_proc = [w for w in total_proc if w not in partial_words_to_exclude]
total_proc = [w for w in total_proc if w not in green_mong_to_exclude]
total_proc = list(set(total_proc))

The next step provides English glosses for the Hmong words, for readers’ convenience.

total_proc_english = ['stick', 'director', 'animal', 'hook', 'aunt', 'scent', 'price', 'brothers', 'pastor', 'doctor',\
                     'money', 'crossbow', 'grandfather', 'policy', 'judge', 'pig', 'human being', 'God', 'fish',\
                     'phallus', 'spirit', 'flag', 'responsibility', 'grandmother', 'water buffalo', 'behavior',\
                     'boss', 'email', 'person', 'finger', 'friend', 'bird', 'boss', 'soul', 'marriage negotiator',\
                     'creator god', 'daughter-in-law', 'form', 'tree trunk', 'cousin', 'cow', 'brother', 'member',\
                     'uncle', 'bridge', 'wife', 'system', 'leader', 'daughter', 'politician', 'enemy', 'leader', 'leader',\
                     'way', 'characteristic', 'brother', 'mother', 'government official', 'rib', 'chicken', 'grandfather',\
                     'symbol', 'tongue', 'man', 'brother', 'pillar', 'young woman', 'servant', 'horse', 'oneself',\
                     'phone', 'sister', 'Hmong', 'seed', 'snake', 'image', 'dog', 'root', 'river', 'letter', 'mistake',\
                     'rat', 'behavior', 'child', 'boss', 'president', 'tiger', 'female', 'father/husband', 'emperor',\
                     'bone', 'guest', 'son-in-law', 'life']
total_proc_dict = {h: e for h, e in zip(total_proc, total_proc_english)}

Next, we retrieve the vectors from the model for the resulting set of nouns we’ve chosen.

total_proc_vectors = [model.wv[w] for w in total_proc]

Plotting the dendrogram.

To plot a dendrogram, we need to import matplotlib and the dendrogram and linkage functions from scipy.cluster.hierarchy.

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Now, we plot the final dendrogram. The first step is to create a linkage matrix from our noun vector list. Then we plot the figure using matplotlib.pyplot and the dendrogram function. We use leaf_font_size to set the font size of the labels in the plot, and leaf_rotation to rotate the labels vertically so that they remain legible. The final argument, leaf_label_func, is defined with a lambda function that maps each leaf index to the word in our noun list total_proc that corresponds to the vector in the noun vector list total_proc_vectors.

l = linkage(total_proc_vectors, method='complete', metric='seuclidean')
plt.figure(figsize=(25,10))
plt.ylabel('distance')
plt.xlabel('word')
dendrogram(l, leaf_font_size=8., leaf_rotation=90., leaf_label_func=lambda v: total_proc[v])
plt.show()

The dendrogram groups primarily family and other relational terms on the left in blue and green, including niam ‘mother’, yawm ‘grandfather’, txiv ‘father’, and vauv ‘son-in-law’, with an additional group in mustard yellow, including kwvtij ‘male siblings’ and phoojywg ‘friend(s)’.

A group of fairly abstract terms related to humans appears in cyan toward the left, including cwjpwm ‘behavior’ and kheej ‘oneself’. Terms for human social roles appear in purple toward the left, such as the English borrowing judge and qhua ‘guest’, while an additional grouping dominated by English loans representing professional roles appears in cyan near the right, including doctor and leader, though notably, abstract English-sourced terms are grouped with these. As stated above, this is a result of the nature of the source corpus, which produces word embeddings that reflect a range of relationships, not merely semantic ones.

Two large groupings of animals also appear: the cyan group near the center containing primarily domesticated farm animals such as qaib ‘chicken’, npua ‘pig’, and nyuj ‘cow’, and the purple group to the right of center containing small, canonically wild animals such as ntses ‘fish’, nas ‘rat’, and noog ‘bird’.

Modes of communication also receive their own grouping, as with phone, email, and duab ‘picture’ in red and blue toward the left.

Finally, ncej ‘pillar’ and ntoo ‘tree, wood’ are grouped together in blue to the right of center.

One drawback of the dendrogram approach in general, of course, is how groupings are made: more sensible groupings at the macro level may be missed because of the groupings already made at the lower levels. What we see here, however, is that while there are still a number of nouns that appear in unexpected groupings, large, mostly sensible semantic categories still dominate the results: human family terms, human abstract terms, human social roles (especially English-sourced professional ones), domesticated farm animals, small wild animals, methods of communication, and cylindrical wooden things. These categories can serve as a foundational hypothesis for further research, which is exactly our goal here.

K-Means Clustering

Next, we’ll do a k-means clustering analysis of the word embeddings of the nouns.

As with the dendrogram approach above, the k-means clustering approach will still be sensitive to non-semantic features of the word embeddings. Nevertheless, we should still find robust clusters in the results that will serve well in hypothesis formation driving further research.

We begin by importing the necessary libraries. K-means clustering is enabled by sklearn.cluster.KMeans.

from itertools import groupby

from sklearn.cluster import KMeans
import numpy as np
from scipy.spatial.distance import cdist

Next, we train the k-means clustering model. We choose seven clusters here as this produces the groupings that are impressionistically the most semantically sensible, as we will see below. Because the word embeddings trained on the SCH corpus encode more than semantics, the usual approaches to determining the best k value, such as the elbow method or silhouette analysis, are not particularly helpful for our purposes: their results rely on the full range of what is encoded in the embeddings rather than on semantics alone.
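
For readers who want to check this for themselves, a minimal sketch of the elbow-method computation using the cdist function imported above might look like the following; it is illustrative only and not part of the analysis proper.

# illustrative only: mean distance to the nearest centroid ("distortion") for a range of k values
distortions = []
for k in range(2, 15):
    km_k = KMeans(n_clusters=k, init='random', n_init=30, max_iter=300, tol=1e-04, random_state=0)
    km_k.fit(total_proc_vectors)
    distortions.append(np.mean(np.min(cdist(total_proc_vectors, km_k.cluster_centers_, 'euclidean'), axis=1)))
print(list(zip(range(2, 15), distortions)))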

km = KMeans(n_clusters=7, init='random', n_init=30, max_iter=300, tol=1e-04, random_state=0)
y_km = km.fit_predict(total_proc_vectors)

Next, let’s look at the groupings.

outcomes = zip(y_km, total_proc, total_proc_english)
total_proc_grouped = groupby(sorted(outcomes, key=lambda x: x[0]), lambda x: x[0])
for key, group in total_proc_grouped:
    print(key)
    for item in group:
        print(item[1], "'" + item[2] + "'")
    print()
K-means Clustering Results
0
nqi 'price'
txiaj 'money'
tibneeg 'human being'
vajtswv 'God'
dab 'spirit'
dejnum 'responsibility'
neeg 'person'
mejkoob 'marriage negotiator'
choj 'bridge'
pojniam 'wife'
kev 'way'
hmoob 'Hmong'
duab 'image'
dej 'river'
txhaum 'mistake'
menyuam 'child'
neej 'life'

1
xibfwb 'pastor'
doctor 'doctor'
yawg 'grandfather'
chij 'flag'
tswv 'boss'
nai 'boss'
plig 'soul'
saub 'creator god'
nyab 'daughter-in-law'
thawj 'leader'
nom 'government official'
yawm 'grandfather'
qhev 'servant'
nab 'snake'
ntawv 'letter'
nais 'boss'
huabtais 'emperor'

2
niam 'mother'

3
pas 'stick'
nuv 'hook'
hneev 'crossbow'
ntses 'fish'
qau 'phallus'
ntiv 'finger'
ntoo 'tree trunk'
tav 'rib'
nplaig 'tongue'
ncej 'pillar'
noob 'seed'
cag 'root'
nas 'rat'
txha 'bone'

4
director 'director'
policy 'policy'
judge 'judge'
cujpwm 'behavior'
email 'email'
qauv 'form'
member 'member'
system 'system'
thawjcoj 'leader'
kasmoos 'politician'
yeebncuab 'enemy'
leader 'leader'
xeebceem 'characteristic'
cim 'symbol'
txivneej 'man'
kheej 'oneself'
phone 'phone'
cwjpwm 'behavior'
president 'president'

5
phauj 'aunt'
kwvtij 'brothers'
pog 'grandmother'
phoojywg 'friend'
noog 'bird'
npawg 'cousin'
nus 'brother'
uncle 'uncle'
ntxhais 'daughter'
kwv 'brother'
tijlaug 'brother'
nkauj 'young woman'
muam 'sister'
txiv 'father/husband'
qhua 'guest'
vauv 'son-in-law'

6
tsiaj 'animal'
ntxhiab 'scent'
npua 'pig'
twm 'water buffalo'
nyuj 'cow'
qaib 'chicken'
nees 'horse'
dev 'dog'
tsov 'tiger'
maum 'female'

With seven groups, several semantically coherent clusters emerge.

While group 0 has an unclear semantic basis—likely reflecting non-semantic information or other idiosyncratic relationships encoded in the word embeddings—group 1 is dominated by terms that refer to people with authority: xibfwb ‘pastor’, nai ‘boss’, nom ‘government official’, and huabtais ’emperor’, for example.

Group 3 likewise has mostly inanimate objects on the one hand—pas ‘stick’, nuv ‘hook’, ntoo ‘tree trunk’— and body parts on the other—nplaig ‘tongue’ and txha ‘bone’, for example.

Group 4 contains English borrowings related to official and other professional concepts, including occupations, such as director, judge, system, and policy, along with some abstract Hmong terms—probably associated with official concepts—such as xeebceem ‘characteristic’ and cwjpwm ‘behavior’.

Group 5 is almost a perfect fit for relationships—phauj ‘aunt’, kwvtij ‘brothers’, phoojywg ‘friend’, and so on—while niam ‘mother’ is alone in group 2 and noog ‘bird’ is inexplicably in group 5.

Group 6 is almost exclusively animal terms—tsiaj ‘animal’, npua ‘pig’, twm ‘water buffalo’—except for ntxhiab ‘scent’, which makes sense as a term associated with animals.

The results from the k-means clustering approach are a sensible starting point for hypothesis formation. Obvious categories include 1) inanimate objects characterized by a straight, rigid shape, 2) similarly “straight” body parts, 3) official concepts, 4) professional roles, 5) professional concepts, 6) relationships, and 7) animals. The following words present special difficulty for this taxonomy of seven categories, and, unsurprisingly, most of them fall into the semantically mixed group 0 above: nqi ‘price’, duab ‘image’, dej ‘river’, txhaum ‘mistake’, neej ‘life’, and txiaj ‘money’, with the remaining difficult items, chij ‘flag’, ntawv ‘letter’, and noob ‘seed’, scattered across groups 1 and 3. This can be explained, however, by the fact that the relationship between these items and the larger categories is likely the result of literal and metaphorical, but idiosyncratic, semantic extensions, in addition to the combination of various kinds of information reflected in the word embeddings.

Comparison of results

The dendrogram approach produced evidence for the following groupings:

  1. human family terms
  2. human abstract terms
  3. human social roles (especially English-sourced professional ones)
  4. domesticated farm animals
  5. small wild animals
  6. methods of communication
  7. cylindrical wooden things

The k-means clustering approach likewise produced evidence for these groupings:

  1. inanimate objects characterized by a straight, rigid shape
  2. similarly “straight” body parts
  3. official concepts
  4. professional roles
  5. professional concepts
  6. human relationships
  7. animals

Altogether, human terms related to family, society, and professional roles, animal terms, and straight, rigid inanimate objects were uncovered as groupings by both approaches. In addition, official/professional concepts, “straight” body parts, and methods of communication were groupings found by only one of the two approaches.

The Results as Hypotheses

The above results provide a strong basis for forming hypotheses for the semantic categories associated with the classifier tus. Below, we try new nouns not seen in the analyses above to check our categories. Let’s try xeebntxwv ‘grandchild’ as a human family term, neeb ‘shaman’ as a human society term, dais ‘bear’ for an animal term, and cav ‘log, pole’ for a straight, rigid inanimate object term.

We check for the co-occurrence of these with tus in terms of raw frequency in the bigrams_copy that we created above.

print("tus xeebntxwv 'the grandchild': ", bigrams_copy.ngram_fd[('tus', 'xeebntxwv')])
print("tus neeb 'the shaman': ", bigrams_copy.ngram_fd[('tus', 'neeb')])
print("tus dais 'the bear': ", bigrams_copy.ngram_fd[('tus', 'dais')])
print("tus cav 'the pole': ", bigrams_copy.ngram_fd[('tus', 'cav')])
tus xeebntxwv 'the grandchild':  5
tus neeb 'the shaman':  17
tus dais 'the bear':  8
tus cav 'the pole':  16

These four nouns each co-occur with tus fewer than 20 times in our corpus, meaning that they were removed from the analyses above by the line bigrams.apply_freq_filter(20) early in the process. Nevertheless, all four co-occur multiple times with tus, suggesting that the semantic category hypotheses based on the dendrogram and k-means clustering analyses above will likely prove correct, producing useful results that contribute to our goal of charting the semantic network associated with tus.

Conclusion

Taken together, the dendrogram analysis and the k-means clustering approach above enable hypothesis formation for a semantic network associated with the Hmong classifier tus. This is significant given the current lack of a semantic ontology for the Hmong language, which would otherwise limit this sort of research; as a result, this approach will likely prove useful for data exploration and hypothesis formation in other resource-poor languages as well.

Using a SQL database for corpus development and management

Corpora are useful tools both for analyzing human language and for NLP application development. However, finding a good platform for building a corpus is not always straightforward. Using the sqlite3 package to create a SQL database to manage our corpus data is an excellent solution, as it provides a means both to maintain the internal structure of the data and to quickly traverse that internal structure.

Let’s begin by importing the necessary libraries.

Import libraries.
import os
import sqlite3
import pickle
Create the database.

For a part-of-speech tagged database, we need to have the following tables:

  1. Documents—to keep track of the original document files
  2. Part of speech—to keep track of all of the possible parts of speech
  3. Word Types—to keep track of all attested word types (or lemmas), rather than the word tokens and their varying forms
  4. Word Tokens—to keep track of the individual word tokens in each document, as they appear in the original

For Hmong in particular, because the language’s orthography places spaces between syllables, we need to keep track of which position in the word each type/token represents. As a result, we need a fifth table:

  5. Word position

Languages with more complicated morphology may need additional tables to keep track of the various morphological categories for a given word. Hmong, however, maximally allows only one affix per word plus reduplication, and morpheme boundaries coincide with syllable boundaries—and thus spaces—and so each morpheme is already stored as a type.

We do, however, want to encode each category only once in the database and refer to it elsewhere, following standard database normalization (https://www.guru99.com/database-normalization.html). So, we refer to categories in one table using indices in another: for example, to record the part of speech of each word type, the word types table stores the index of the corresponding row in the parts of speech table.

Below, we use sqlite3.Connection(<database_filename>).cursor().execute with SQL CREATE TABLE commands to create each of the five tables, complete with index references within each table.

os.chdir(os.path.expanduser('~/corpus_location'))

# creates new database
conn = sqlite3.Connection('mycorpus.db')

# get cursor
crsr = conn.cursor()

# string lines to initialize each table in database
# (the primary key column is named "ind" rather than "index", since INDEX is a reserved word in SQLite)
create_documents = """CREATE TABLE documents (
ind INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
document_title VARCHAR(50),
document_addr VARCHAR(150));"""

create_part_of_speech = """CREATE TABLE part_of_speech (
ind INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
pos_label VARCHAR(2));"""

create_word_location = """CREATE TABLE word_location (
ind INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
location CHAR);"""

create_word_types = """CREATE TABLE word_types (
ind INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
word_type_form VARCHAR(20),
word_location INTEGER,
pos_type INTEGER,
FOREIGN KEY (word_location)
REFERENCES word_location(ind),
FOREIGN KEY (pos_type)
REFERENCES part_of_speech(ind));"""

create_word_tokens = """CREATE TABLE word_tokens (
ind INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
document_index INTEGER,
sentence_index INTEGER,
word_index INTEGER,
word_type_index INTEGER,
word_token_form VARCHAR(20),
FOREIGN KEY (document_index)
REFERENCES documents(ind),
FOREIGN KEY (word_type_index)
REFERENCES word_types(ind));"""

crsr.execute(create_documents)
crsr.execute(create_part_of_speech)
crsr.execute(create_word_location)
crsr.execute(create_word_types)
crsr.execute(create_word_tokens)

# set up word_location IOB tags
crsr.execute("INSERT INTO word_location(location) VALUES ('B'), ('I'), ('O');")
Loading the first file to insert.

Next, we use pickle to load a file that we want to insert into the database. pickle is a module that serializes Python objects to disk so that they can be loaded again later, for instance by a different script. Here, I use it to load a file whose contents have been preprocessed for insertion into the database. Note that this preprocessing step will be the subject of a later blog post.

os.chdir(os.path.expanduser('~/database_location/pickling'))
pickle_file_name = '9_txt.pkl'
f = open(pickle_file_name, 'rb')
doc_data = pickle.load(f)
f.close()
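
The loaded object doc_data is expected to be a list of sentences, each a list of (token, 'LOC-POS') pairs, since that is what the insertion loop below assumes. Schematically, with hypothetical content:

# hypothetical illustration of the structure of doc_data:
# [sentence][token] -> (token form, combined word-location/POS tag)
example_doc_data = [
    [('Tus', 'B-CL'), ('mob', 'B-NN'), ('.', 'O-PU')],
]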
Inserting the document information.

The preprocessed data contains the text of the document, but not its name or original location. We insert these here using the SQL command INSERT INTO documents, with the name of the file and its original location taken from a tuple named document. We then run crsr.execute to execute the SQL command, and use lastrowid to retrieve the number the database has assigned to our newest document, so that we can use it once we begin inserting tokens from the file into the database.

document = ('Tus Mob Acute Flaccid Myelitis', 'https://www.dhs.wisconsin.gov/publications/p01298h.pdf')
insert_doc = "INSERT INTO docs (document_title, document_addr) VALUES ('" + document[0] + "', '" + document[1] + "');"
document_index = crsr.execute(insert_doc).lastrowid
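
As an aside, the same insertion can be written with sqlite3 placeholders rather than string concatenation, which avoids problems if a title or URL ever contains a quote character. A sketch of this alternative (to be used instead of, not in addition to, the line above):

# alternative: parameterized insertion with placeholders
document_index = crsr.execute(
    "INSERT INTO documents (document_title, document_addr) VALUES (?, ?);",
    document).lastrowid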
Create a function to process each word.

Because each document contains hundreds of tokens, it would be impractical to write out a new set of SQL commands for each insertion. As a result, we create a function named insert_word below to run each time we insert a word. The function has four parameters:

  1. word_tuple—contains a tuple with the token string and a combined word position/POS tag
  2. doc_index_value—indicates the ID number for the document in the documents table
  3. sent_index_value—represents the position in sequence of the current sentence in the document
  4. word_index_value—represents the position in sequence of the current word in the current sentence
def insert_word(word_tuple, doc_index_value, sent_index_value, word_index_value):
    '''
    Inserts a word into the database, based on the word_tuple.
    @param word_tuple is 3-tuple containing the token's form, the location within a word, and the part of speech
    @param doc_index_value is the index of the document from which the word is extracted
    @param sent_index_value is the index of the sentence in the document from which the word is extracted
    @param word_index_value is the index of the position of the word within its sentence
    '''
    
    # retrieve pos value if found, otherwise add pos value
    pos_results = crsr.execute("SELECT ind FROM part_of_speech WHERE pos_label='" + word_tuple[2] + "';").fetchall()
    if len(pos_results) > 0:
        pos_label_index = pos_results[0][0]
    else:
        pos_label_index = crsr.execute("INSERT INTO part_of_speech (pos_label) VALUES ('" + word_tuple[2] + "');").lastrowid
    
    # retrieve relevant word_loc value
    if word_tuple[1] in ['B', 'I', 'O']:
        word_loc_index = crsr.execute("SELECT ind FROM word_location WHERE location='" + word_tuple[1] + "';").fetchone()[0]
    else:
        raise ValueError('Word location value is invalid at word (' + str(sent_index_value - 1) + ', ' \
                        + str(word_index_value - 1) + ').')
    
    # match word_tuple[0].lower(), word_loc_index, pos_label_index against word_types, and if a match, retrieve index
    # if not, add and get index
    type_ = word_tuple[0].lower()
    type_results = crsr.execute("SELECT ind FROM word_types WHERE word_type_form='" + type_ + "' AND word_location=" \
                                + str(word_loc_index) + " AND pos_type=" + str(pos_label_index) + ";").fetchall()
    if len(type_results) > 0:
        type_index = type_results[0][0]
    else:
        type_index = crsr.execute("INSERT INTO word_types (word_type_form, word_location, pos_type) VALUES ('" + type_ + "', " \
                                  + str(word_loc_index) + ", " + str(pos_label_index) + ");").lastrowid
        
    # insert complete values into word_tokens
    insertion = crsr.execute("INSERT INTO word_tokens (document_index, sentence_index, word_index, word_type_index, word_token_form)" \
                            + " VALUES (" + str(doc_index_value) + ", " + str(sent_index_value) + ", " \
                            + str(word_index_value) + ", " + str(type_index) + ", '" + word_tuple[0] + "');")
Add each token to the database.

The next step cycles through the tokens in the file opened with pickle above and runs insert_word to insert each token into the database. We then commit and close the database connection; once this step has run, we have finished inserting our first document into the database!

for i, sent in enumerate(doc_data):
    for j, word in enumerate(sent):
        current_word = tuple([word[0]] + word[1].split('-'))
        insert_word(current_word, document_index, i + 1, j + 1)
conn.commit()
conn.close()
Conclusion

We can create a SQL database using the sqlite3 package to store our data for our corpus. Above, we saw how to create the tables for the corpus using SQL queries and insert our first document. In later posts, we will look at the preprocessing step to convert the original PDF into data ready to insert into the database, and how to use the database to access and search our data.

A semi-supervised combined word tokenizer and POS tagger for Hmong

This post introduces a semi-supervised approach to word tokenization and POS tagging that enables support for resource-poor languages.

The Hmong language is a resource-poor language [1] for which corpora of POS-tagged data were previously unavailable, precluding supervised approaches. At the same time, the Hmong language has an unusually high number of homonyms and features syllable-based spacing in its orthography, meaning that widespread ambiguity will create serious problems for unsupervised approaches. A semi-supervised approach is in order.

The approach featured here follows a relatively unusual strategy: combining word tokenization and POS tagging as a single step. Because Hmong has an orthography where spaces are placed between syllables rather than words, word tokenization will be potentially non-trivial. However, a much more prominent language, Vietnamese, has the same issue, yet unlike Hmong, it is a relatively resource-rich language. This means that, with the relevant adaptations to handle a resource-poor language, approaches that work with Vietnamese should prove useful. One of these approaches is in fact combining word tokenization and POS tagging [2][3].

In this approach, word tokenization is combined with POS tagging as a sequence-labeling task where position in the word is handled using IOB tags, where B marks the first syllable of the word, I marks all other syllables of the word, and O marks everything that is not a word. Here, I combine these with POS tags using a hyphen, so that the first syllable of a noun is B-NN and the second syllable of a verb is I-VV.
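
To make the labeling scheme concrete, here is a small sketch with a hypothetical sentence, showing how the combined tags are read and how syllables reassemble into words; the example forms and tags are purely illustrative.

# hypothetical syllables with combined word-location/POS labels
syllables = ['tus', 'neeg', 'zoo', 'siab']
labels = ['B-CL', 'B-NN', 'B-VV', 'I-VV']

# 'B' (or 'O') starts a new unit; 'I' continues the current word
words = []
for syllable, label in zip(syllables, labels):
    loc, pos = label.split('-')
    if loc != 'I' or not words:
        words.append(([syllable], pos))
    else:
        words[-1][0].append(syllable)

print([(' '.join(parts), pos) for parts, pos in words])
# [('tus', 'CL'), ('neeg', 'NN'), ('zoo siab', 'VV')]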

In my approach here, I use pretrained word embeddings. Though Hmong is a resource-poor language, the Internet has proven popular with Hmong speakers, meaning that speakers have produced thousands of forum posts on the soc.culture.hmong listserv over the past 20 years or so. These have been organized into the approximately 12-million token SCH corpus, which is available for free download here: http://listserv.linguistlist.org/pipermail/my-hm/2015-May/000028.html.

These pretrained word embeddings are created through Word2Vec and loaded as an embedding layer into a Keras-based BiLSTM model. The BiLSTM model is well suited to the word tokenization/POS tagging task, as it is specially designed for handling sequences where individual output values depend on neighboring values.
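
The pretrained embeddings themselves can be produced in much the same way as in the earlier post on the classifier tus. A sketch of how a model file like the word2vec_Hmong_SCH.model loaded below might be trained and saved, assuming the SCH corpus sits in a local folder (the path here is an assumption):

import os
from nltk.corpus import PlaintextCorpusReader
from gensim.models import Word2Vec

# train Word2Vec on the SCH corpus and save it for reuse; the parameters follow the earlier post
sch_sentences = PlaintextCorpusReader(os.path.expanduser(os.path.join('~', 'sch_corpus')), '.*').sents()
w2v = Word2Vec(sentences=sch_sentences, window=10, size=150, iter=50, workers=10)
w2v.save('word2vec_Hmong_SCH.model')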

The model is trained on a set of eight documents—approximately 6000 (actual) words—fully tagged with the combined word position-POS tags mentioned above.

Let’s begin by importing the relevant libraries.

Import libraries.
import os
import sqlite3
from itertools import groupby
import numpy as np
from pandas import DataFrame

from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense, InputLayer, Embedding, TimeDistributed, Activation
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.optimizers import Adam
Load existing database with POS-tagged words.

Next, we navigate to the local folder containing the database file and load the database using sqlite3.

os.chdir(os.path.expanduser('~/python_workspace/medical_corpus_scripting/corpus/hminterface/static/hminterface'))
conn = sqlite3.Connection('hmcorpus.db')
crsr = conn.cursor()
Retrieve tags from database.

Next, we retrieve all of the tag types from the database using SQL and create a dictionary that converts the tags to indices that can be used in the Keras model. The result is a unique index for each combination of word-position IOB tag and POS tag actually attested in the corpus database to date.

query = """SELECT DISTINCT loc, pos_label FROM types
JOIN word_loc ON word_loc.ind=types.word_loc
JOIN pos ON pos.ind=types.pos_type;"""

# set the padding tag combination first, then add tag combinations from database
tag_combinations = [('O', 'PAD')]
tag_combinations += crsr.execute(query).fetchall()

tag_indices = {'-'.join(t): i for i, t in enumerate(tag_combinations)}
print(tag_indices.items())
dict_items([('O-PAD', 0), ('B-CL', 1), ('B-NN', 2), ('O-PU', 3), ('B-FW', 4), ('B-VV', 5), ('B-PP', 6), ('I-NN', 7), ('B-QU', 8), ('I-CL', 9), ('B-LC', 10), ('I-VV', 11), ('B-AD', 12), ('B-DT', 13), ('B-CC', 14), ('I-CC', 15), ('B-CV', 16), ('I-AD', 17), ('B-RL', 18), ('B-CS', 19), ('B-PN', 20), ('I-CS', 21), ('I-FW', 22), ('B-NR', 23), ('I-NR', 24), ('I-PU', 25), ('B-PU', 26), ('B-CM', 27), ('B-ON', 28), ('I-QU', 29), ('I-PN', 30), ('B-JJ', 31)])
Retrieve word tokens and tags as numerical codes.

The database is organized such that each “word” type (i.e., each syllable or punctuation mark demarcated by spaces) is assigned its own index in the table types. This means that a dataframe can be created from the database data to convert between indices and word types.

query = """SELECT ind, type_form FROM types;"""
word_index_list = crsr.execute(query).fetchall()

# Visualize data
index_words = DataFrame(data=word_index_list, columns=['Index', 'Word_Type'])
index_words.set_index('Index', inplace=True)
print(index_words.head(15))
         Word_Type
Index             
1              tus
2              mob
3                –
4      shigellosis
5          disease
6             fact
7            sheet
8           series
9              zoo
10              li
11             cas
12               ?
13             yog
14              ib
15             tug

The following retrieves the word indices from the eight documents stored in the corpus database that we are going to use, and uses the itertools.groupby function to organize them in sequence as a list of sentence lists.

query = """SELECT doc_ind, sent_ind, word_type_ind, loc, pos_label FROM tokens
JOIN types ON tokens.word_type_ind=types.ind
JOIN word_loc ON word_loc.ind=types.word_loc
JOIN pos ON pos.ind=types.pos_type
WHERE doc_ind<=8;"""
query_words = crsr.execute(query).fetchall()
sentences_list = []
tags_list = []
for k, g in groupby(query_words, lambda x: (x[0], x[1])):
    sent = list(g)
    sentences_list.append([(w[2],) for w in sent])
    tags_list.append([(tag_indices['-'.join(w[3:])],) for w in sent])
    
# print the second sentence as a word type index sequence
print(sentences_list[1])
# print the sentence as a word type sequence, using index_words dataframe from above
print(list(index_words.at[word[0], 'Word_Type'] for word in sentences_list[1]))
[(4,), (13,), (14,), (15,), (2,), (16,), (17,), (18,), (19,), (20,), (21,), (16,), (22,)]
['shigellosis', 'yog', 'ib', 'tug', 'mob', 'los', 'ntawm', 'cov', 'kab', 'mob', 'bacteria', 'los', '.']
Handling padding and out-of-vocabulary items.

The Keras model we will use below requires each element in the training input to have the same number of tokens. This means that we will need to pad every sentence that is not as long as the longest sentence in the training set. We can achieve this by adding a 0 index value to our index_words dataframe.

Likewise, in testing and production we will inevitably run into items that are not in the vocabulary used in training the model. This can be handled by adding a row in the index_words dataframe with an index value beyond the current maximum for the value “out of vocabulary.” This ensures compatibility with the existing database values.

index_words.loc[0] = ['$PAD']
index_words.loc[index_words.index.max() + 1] = ['$OUT']
print(index_words.tail())
              Word_Type
Index                  
951    electromyography
952                 emg
953                 tom
0                  $PAD
954                $OUT
Split data into training and testing.

Here, we split the data into training and testing components using sklearn.model_selection.train_test_split. train_test_split splits the sentences randomly, so the training and testing portions will both contain portions of all eight documents. This means that the testing portion of the data will provide a clear indication as to whether training the model below has been successful, but we will still need to test it again later on a fully unseen document. Here, we split the data based on a common threshold: 20% of the sentences for testing and 80% for training.

X_train, X_test, y_train, y_test = train_test_split(sentences_list, tags_list, test_size=0.2)
Replacing X_test terms.

Because we will train the model below on the X_train set created above, word type values that appear in X_test but not in X_train will cause trouble, as the model will not have been trained on their embeddings. We handle this here by replacing those numerical values with the out-of-vocabulary value, which is equal to index_words.index.max().

words = set(word_value for sent in X_train for word_value in sent)

pre_sample_sentence_index = 10
X_test_pre_sample = X_test[pre_sample_sentence_index]
X_test = [[word_value if word_value in words else (index_words.index.max(),) for word_value in sent] \
          for sent in X_test]

print('Original words: ', list(index_words.at[ind[0], 'Word_Type'] for ind in X_test_pre_sample))
print('Before out-of-vocabulary conversion: ', X_test_pre_sample)
print('After out-of-vocabulary conversion:  ', X_test[pre_sample_sentence_index])
Original words:  ['*', 'qees', 'tus', 'neeg', 'uas', 'muaj', 'hom', 'kab', 'mob', 'tb', 'no', 'yuav', 'kis', 'tau', 'rau', 'lwm', 'leej', 'lwm', 'tus', '.']
Before out-of-vocabulary conversion:  [(539,), (787,), (383,), (29,), (80,), (23,), (253,), (19,), (20,), (782,), (32,), (69,), (70,), (71,), (26,), (149,), (427,), (788,), (383,), (22,)]
After out-of-vocabulary conversion:   [(539,), (954,), (383,), (29,), (80,), (23,), (253,), (19,), (20,), (782,), (32,), (69,), (70,), (71,), (26,), (149,), (427,), (954,), (383,), (22,)]
Padding sentences.

Next, we need to pad the sentences so that each sentence has the same length. We do this by finding the longest sentence (in tokens) in X_train and using its length as the maxlen parameter of keras.preprocessing.sequence.pad_sequences for each of the four data sets.

LEN_MAX = len(max(X_train, key=len))

X_train = pad_sequences([[w[0] for w in line] for line in X_train], maxlen=LEN_MAX, padding='post')
y_train = pad_sequences(y_train, maxlen=LEN_MAX, padding='post')

X_test = pad_sequences([[w[0] for w in line] for line in X_test], maxlen=LEN_MAX, padding='post')
y_test = pad_sequences(y_test, maxlen=LEN_MAX, padding='post')
Load the pretrained word embedding model.

Now, we can load the Word2Vec word embedding model pretrained on the SCH corpus.

word2vec_model = Word2Vec.load('word2vec_Hmong_SCH.model')
Populate embedding matrix.

The embedding matrix in our Keras model below will use the word embedding vectors from the Word2Vec model above. However, we want to populate our embedding matrix using only those vectors that correspond to our training set. We create a matrix that can contain the full number of word indices in the database vocabulary, plus padding and out-of-vocabulary values. We then populate the matrix with the word embeddings at row positions corresponding to the word indices.

maximum_vocab_size = index_words.index.max() + 1
embedding_matrix = np.zeros((maximum_vocab_size, 150))
for ind in words:
    try:
        embedding_vector = word2vec_model.wv[index_words.at[ind[0], 'Word_Type']]
    except KeyError as e:
        embedding_vector = None
    if embedding_vector is not None:
        embedding_matrix[ind[0]] = embedding_vector
Create Keras model.

Now, we create the Keras model. We use the Sequential() model type, which simply stacks the layers of the network in order.

We use the weights parameter of Embedding to input the word embedding matrix we just created above.

We then compile the model using categorical cross-entropy as a loss, and Adam as an optimizer.

model = Sequential()
model.add(InputLayer(input_shape=(LEN_MAX, )))
model.add(Embedding(maximum_vocab_size, 150, weights=[embedding_matrix], trainable=False))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag_indices))))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 93, 150)           143250    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 93, 512)           833536    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 93, 32)            16416     
_________________________________________________________________
activation_1 (Activation)    (None, 93, 32)            0         
=================================================================
Total params: 993,202
Trainable params: 849,952
Non-trainable params: 143,250
_________________________________________________________________
Train the model.

Now we train the model using the X_train data, with y_train converted to one-hot vectors using keras.utils.np_utils.to_categorical. We choose a batch size of 16, and set aside 20% of our training set for validation, leaving the rest for training.

model.fit(X_train, to_categorical(y_train, num_classes=max(tag_indices.values()) + 1), batch_size=16, epochs=50, validation_split=0.2)
Train on 224 samples, validate on 57 samples
Epoch 1/50
224/224 [==============================] - 8s 36ms/step - loss: 1.7080 - acc: 0.8271 - val_loss: 0.3139 - val_acc: 0.9276
...
Epoch 50/50
224/224 [==============================] - 5s 24ms/step - loss: 4.7023e-04 - acc: 1.0000 - val_loss: 0.0885 - val_acc: 0.9826
Evaluate model on test set.

Now we use evaluate to evaluate the accuracy of the model on the test set. As mentioned above, the test set contains sentences from the same documents as the training set, so the results will be higher than on previously unseen documents, which we address below.

scores = model.evaluate(X_test, to_categorical(y_test, num_classes=max(tag_indices.values()) + 1))
print("Accuracy: {result} percent".format(result=(scores[1]*100)))
71/71 [==============================] - 0s 5ms/step
Accuracy: 96.68332581788721 percent
Evaluate on unseen data.

Finally, we evaluate our model on unseen data—a word position/POS-tagged ninth document, which we load from the database.

query = """SELECT doc_ind, sent_ind, word_type_ind, loc, pos_label FROM tokens
JOIN types ON tokens.word_type_ind=types.ind
JOIN word_loc ON word_loc.ind=types.word_loc
JOIN pos ON pos.ind=types.pos_type
WHERE doc_ind==9;"""
query_words = crsr.execute(query).fetchall()
sentences_list = []
tags_list = []
for k, g in groupby(query_words, lambda x: (x[0], x[1])):
    sent = list(g)
    sentences_list.append([(w[2],) for w in sent])
    tags_list.append([(tag_indices['-'.join(w[3:])],) for w in sent])

X_new = [[word_value if word_value in words else (index_words.index.max(),) for word_value in sent] \
          for sent in sentences_list]
    
X_new = pad_sequences([[w[0] for w in line] for line in X_new], maxlen=LEN_MAX, padding='post')
y_new = pad_sequences(tags_list, maxlen=LEN_MAX, padding='post')

scores = model.evaluate(X_new, to_categorical(y_new, num_classes=max(tag_indices.values()) + 1))
print("Accuracy: {result} percent".format(result=(scores[1]*100)))
23/23 [==============================] - 0s 5ms/step
Accuracy: 96.25993371009827 percent

As can be seen above, even on an unseen text, the accuracy of this model still reaches 96.26% in this case, with an input of only about 6000 tagged words.

Conclusion

Altogether, combining word tokenization and POS tagging successfully tackles the problem of syllable-spacing in Hmong, and using a BiLSTM model with pretrained word embeddings using Word2Vec overcomes the limitations on available tagged data.

References and further reading

[1] Lewis, William D. and Phong Yang. 2012. Building MT for a Severely Under-Resourced Language: White Hmong. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas. https://pdfs.semanticscholar.org/098c/96c2ad281ac617fbe0766623834eb295ec2c.pdf

[2] Takahashi, Kanji and Kazuhide Yamamoto. 2016. Fundamental tools and resource are available for Vietnamese analysis. In Proceedings of the 2016 International Conference on Asian Language Processing, p. 246–249. https://ieeexplore.ieee.org/document/7875978

[3] Nguyen, Dat Quoc, Thanh Vu, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2017. From Word Segmentation to POS Tagging for Vietnamese. In Proceedings of the Australasian Language Technology Association Workshop 2017, p. 108–113. https://www.aclweb.org/anthology/U17-1013/

Some additional inspiration for my implementation of the approach using BiLSTM above, including especially the Keras model design, can be found at https://nlpforhackers.io/lstm-pos-tagger-keras/.

Hmong Medical Corpus Blog: The Rationale

The Hmong Medical Corpus (currently hosted here: http://corpus.ap-southeast-2.elasticbeanstalk.com/hminterface/) was launched in August 2019 with a goal of making Hmong medical information readily available to members of the Hmong community in a single, searchable location and to members of the linguistic research community who need greater access to material in Hmong. This project involves natural language processing (NLP) work of several kinds, the most noteworthy dealing with part-of-speech (POS) tagging and medical entity recognition and linking.

However, unlike other projects of this nature, where large tagged corpora and robust lexical, semantic, and other medically-oriented knowledge base materials exist for the language in question, many of the materials for the Hmong language have to be created for the first time or adapted from materials that are far less than ideal. One goal of this blog is to discuss that work and the strategies pursued, as well as to announce the release of new resources.

This leads to the question of NLP in resource-poor languages in general. Algorithms that work well for languages like English or Chinese typically are not suitable for Hmong: supervised algorithms regularly presuppose a large volume of annotated material, while unsupervised algorithms generally require external lexical resources. In general, the situation for resource-poor languages like Hmong merits approaches that involve novel modifications to tried-and-true algorithms in resource-rich languages. The second goal of this blog is to feature these approaches and provide a forum for discussing them.

Finally, data science approaches to linguistic research have proven highly useful for the analysis of resource-poor languages, and this analysis in turn informs much of the critical basic work for NLP pipelines (though its value is not limited to NLP). For example, what evidence do we have to determine wordhood status? This impacts word tokenization. What parts of speech are empirically demonstrable for the language in a meaningful way? This determines the quality of POS tagging. The third goal of this blog is to feature new data science-based approaches to linguistic research as they are developed at our research center, the Language and Culture Research Centre hosted by James Cook University here in Cairns, Queensland, Australia.

Altogether, we would like to welcome any and all interested in our research to the blog.
