The Hmong Medical Corpus stores its tagged text data in a SQL database. To use this data with Stanford CoreNLP, it must first be converted into CoNLL-U format. This post shows how this is done.
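For reference, CoNLL-U is a plain-text format in which each token occupies one line of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with a blank line separating sentences. A single token line for the Hmong word mob might look like the following (the tags shown here are illustrative, not taken from the corpus):

1	mob	_	NOUN	NN	_	_	_	_	_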
First, let’s import the libraries needed.
import sqlite3
import pandas as pd
Next, let’s load the database. The Hmong Medical Corpus database is a SQLite database, so we load it through the sqlite3 module.
conn = sqlite3.connect('hmcorpus.db')
crsr = conn.cursor()
Now, we use a SQL query to pull the data we need. Pandas’ read_sql_query function runs the query against the connection and returns the results directly as a DataFrame.
sql_query = """SELECT doc_ind, sent_ind, token_form, type_form, pos_label, loc FROM tokens
JOIN types ON types.ind=tokens.word_type_ind
JOIN word_loc ON types.word_loc=word_loc.ind
JOIN pos ON pos.ind=types.pos_type;"""
df = pd.read_sql_query(sql_query, conn)
Next, we read in a CSV file that maps the project-specific part-of-speech tags of the Hmong Medical Corpus to the Universal POS tag set.
conv = pd.read_csv('mapping_to_upos.txt', sep='\t')
print(conv)
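The exact contents of mapping_to_upos.txt are project-specific, but the join we perform later assumes a tab-separated file whose header row contains at least the columns XPOS and UPOS, along these (purely illustrative) lines:

XPOS	UPOS
NN	NOUN
VV	VERB
…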
We now assign descriptive column names to our DataFrame created above.
df.columns = ['doc_ind', 'sent_ind', 'FORM', 'type_form', 'XPOS', 'word_pos']
df.tail()
Now, we add the ID column that will appear in the final CoNLL-U files. The values assigned here are provisional; they will be renumbered sentence by sentence when we write the files out.
df['ID'] = df.index + 1
df.head(20)
The next step is the most challenging in this process: converting the syllable-based tokens of Hmong orthography into the word-based tokens required by the CoNLL-U format. We begin by finding all syllables labeled with a word_pos value of ‘I’ (for “internal”).
i_hits = df[df['word_pos']=='I']
i_hits.tail()
Here, we create a new DataFrame quads where we are going to combine the non-initial syllables. The DataFrame is named quads because the maximum word length in Hmong is four syllables.
quads = i_hits[['type_form', 'word_pos']]
quads.head()
Now, we reorganize quads so that each row contains four syllables with their corresponding word position tags. This is done in reverse such that type_form_L1 is the form one syllable to the left, type_form_L2 is two syllables to the left, and so on.
l1 = df.loc[quads.index - 1, ['type_form', 'word_pos']]
l1.index = l1.index + 1
quads = quads.join(l1, rsuffix="_L1")
quads.head()
l2 = df.loc[quads.index - 2, ['type_form', 'word_pos']]
l2.index = l2.index + 2
quads = quads.join(l2, rsuffix="_L2")
quads.head()
l3 = df.loc[quads.index - 3, ['type_form', 'word_pos']]
l3.index = l3.index + 3
quads = quads.join(l3, rsuffix="_L3")
quads.head(10)
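The index arithmetic above can be hard to follow at first, so here is a minimal toy illustration of the trick, assuming a hypothetical four-syllable word noj qab haus huv and using ‘B’ as a stand-in for whatever label marks a non-internal syllable in the corpus:

toy = pd.DataFrame({'type_form': ['noj', 'qab', 'haus', 'huv'],
                    'word_pos':  ['B',   'I',   'I',    'I']})
toy_i = toy[toy['word_pos'] == 'I']                         # the non-initial syllables
prev = toy.loc[toy_i.index - 1, ['type_form', 'word_pos']]  # each left neighbor
prev.index = prev.index + 1                                 # realign with the 'I' rows
print(toy_i.join(prev, rsuffix='_L1'))

Shifting prev’s index by one lines each syllable up with the row of the syllable that follows it, so the join attaches every left neighbor in a single vectorized step.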
Next, if a syllable in a row belongs to a different word than the one in the type_form column, we blank out that content, so that each row contains syllables from one word only.
m = quads['word_pos_L1'] != 'I'
quads.loc[m, ['type_form_L2', 'type_form_L3', 'word_pos_L2', 'word_pos_L3']] = ['', '', '', '']
m = quads['word_pos_L2'] != 'I'
quads.loc[m, ['type_form_L3', 'word_pos_L3']] = ['', '']
We then reset the index so that the original index from the original dataset becomes a regular column; it tells us which rows represent portions of the same word, so that duplicates can be eliminated. We also create an offset column holding the next row’s index value, i.e. the index column shifted up by one.
The rationale is straightforward: if a row’s offset value is one more than its index value, the next row continues the same word, so the current row is a duplicate that represents only a portion of the full word. In other words, there is another row further down that contains the complete word.
quads = quads.reset_index()
quads['offset'] = quads['index'].shift(periods=-1)
quads.head(20)
Since some words have more than two syllables, they occupy more than one row in the quads DataFrame; the following line of code keeps only the rows that contain a complete word.
quads = quads[quads['index'] + 1 != quads['offset']]
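As a quick sanity check of this filter, here is the same logic on made-up indices, where rows 11 and 12 are ‘I’ syllables of one three-syllable word and row 40 belongs to a different word:

toy = pd.DataFrame({'index': [11, 12, 40]})
toy['offset'] = toy['index'].shift(periods=-1)
print(toy[toy['index'] + 1 != toy['offset']])   # drops row 11, keeps rows 12 and 40

Row 11 is discarded because row 12 continues the same word; rows 12 and 40 survive because no immediately following row continues them.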
Next, we create a FORM column in quads that contains the complete word, joining the type_form_* columns with underscores. Using underscores at syllable breaks is the practice used in CoNLL files for Vietnamese, which has the same syllable-based spacing as Hmong, so we adopt it here.
quads['FORM'] = quads['type_form_L3'] + '_' + \
quads['type_form_L2'] + '_' + \
quads['type_form_L1'] + '_' + \
quads['type_form']
quads['FORM'] = quads['FORM'].str.lstrip('_')
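The lstrip call is needed because shorter words leave empty slots on the left; for a two-syllable word, for example, the concatenation produces leading underscores:

print(('' + '_' + '' + '_' + 'noj' + '_' + 'qab').lstrip('_'))   # -> 'noj_qab'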
Below, we can see the results in the FORM column on the right.
quads.head(10)
Next, we add a head_pos column to quads, giving the position of each word’s initial syllable in our original DataFrame. We then set head_pos as the new index and reduce quads to the one column we need to merge into the original DataFrame: FORM, containing the newly combined full word, indexed by the position where it needs to appear.
quads['head_pos'] = quads['index'] - quads['FORM'].str.count('_')
quads.set_index('head_pos', inplace=True)
quads = quads.loc[:, ['FORM']]
quads.head(20)
Next, we write the combined words back into the original DataFrame containing the full POS-tagged corpus.
df.update(quads)
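update aligns on both the index and the column names, overwriting only the FORM values at the head positions and leaving everything else in df untouched. A minimal sketch of the behavior, with hypothetical data:

base = pd.DataFrame({'FORM': ['noj', 'qab', 'haus', 'huv']})
patch = pd.DataFrame({'FORM': ['noj_qab_haus_huv']}, index=[0])
base.update(patch)   # only row 0 changes
print(base)

The syllable rows that now duplicate part of a combined word are still present in df; we remove them further below, once the POS tags have been fixed up.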
Next, we update the POS tags so that each combined word in the corpus DataFrame carries a single tag that correctly reflects the role of the full word.
First, we handle words consisting of quantifier + classifier sequences, where the resulting combination functions as a classifier. Using a temporary DataFrame, we extract every position where a classifier appears as a non-initial syllable, keep the instances where the preceding syllable is a quantifier, and assign those the tag CL (“classifier”). We then update the corpus DataFrame.
# Find the syllable immediately preceding each non-initial classifier syllable
dg = df.loc[df[(df['XPOS']=='CL') & (df['word_pos']=='I')].index - 1, ['XPOS']]
dg = dg[dg['XPOS']=='QU']   # keep only quantifier + classifier sequences
dg['XPOS'] = 'CL'           # the full word functions as a classifier
df.update(dg)
Second, we handle words consisting of the associative-reciprocal prefix sib + verb, which function as verbs. We find each word that begins with sib; since every word beginning with sib is a verb in our corpus, a simple assignment suffices.
df.loc[df['FORM'].str.startswith('sib'), 'XPOS'] = 'VV'
Third, the ubiquitous expression li cas (“what”) functions, as a unit, as a demonstrative used in questions in Hmong.
df.loc[df['FORM']=='li_cas', 'XPOS'] = 'DT'
Now that all of the POS tags in the XPOS column have been updated, we can add the UPOS column with the equivalent values from the Universal POS tagset.
df = df.join(conv.set_index("XPOS"), rsuffix="_match", on=["XPOS"])
We can now drop every row where the type_form is a non-initial syllable, leaving only complete words in the corpus DataFrame.
df = df[df['word_pos'] != 'I']
df.head(20)
Since our ultimate goal is to create CoNLL-U files for training a Stanford CoreNLP POS-tagging model, we can fill the remaining required columns with underscores.
df['LEMMA'] = '_'
df['FEATS'] = '_'
df['HEAD'] = '_'
df['DEPREL'] = '_'
df['DEPS'] = '_'
df['MISC'] = '_'
Now, we use drop with inplace=True to remove the two working columns, the syllable forms and the word position tags, that we needed only during processing.
df.drop(columns=['type_form', 'word_pos'], inplace=True)
df.head()
Next, we retrieve the set of unique doc_ind and sent_ind combinations as a NumPy array.
sentence_ids = df.groupby(['doc_ind', 'sent_ind']).size().reset_index().loc[:, ['doc_ind', 'sent_ind']].values
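An equivalent and slightly more direct way to get the same pairs, assuming the corpus rows are already stored in document and sentence order, is drop_duplicates:

sentence_ids = df[['doc_ind', 'sent_ind']].drop_duplicates().values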
Here, we define which sentences from the corpus will appear in the testing dataset used for training later. We select sentences 7 and 14 from each document; since every document contains at least 14 sentences, this selection is safe.
test_ids = [7, 14]
Finally, we create the CoNLL-U files that will be used for training and testing of our Stanford CoreNLP POS-tagging model.
We iterate through the sentence IDs, creating a separate DataFrame for each sentence with its own consecutive index to match the CoNLL-U formatting requirements. Within a sentence we no longer need the document and sentence numbers, so we drop them and reorder the remaining columns to match the CoNLL-U specification. Then we write to file using to_csv.
f = open('hmcorpus_train.conllu', 'a')   # append mode: remove old files before re-running
g = open('hmcorpus_test.conllu', 'a')
for doc_ind, sent_ind in sentence_ids:
    sent_df = df[(df['doc_ind']==doc_ind) & (df['sent_ind']==sent_ind)].reset_index(drop=True)
    sent_df.loc[:, 'ID'] = sent_df.index + 1   # renumber tokens within the sentence
    sent_df.drop(columns=['doc_ind', 'sent_ind'], inplace=True)
    new_columns = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']
    sent_df = sent_df[new_columns]
    if sent_ind in test_ids:
        sent_df.to_csv(g, sep='\t', header=False, index=False)
        g.write('\n')   # blank line ends the sentence
    else:
        sent_df.to_csv(f, sep='\t', header=False, index=False)
        f.write('\n')
f.close()
g.close()
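As a quick sanity check, we can read back the first few lines of the training file and confirm the ten-column layout (output will vary with your data):

with open('hmcorpus_train.conllu') as check:
    for line in check.readlines()[:5]:
        print(line.rstrip('\n'))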
Conclusion
Altogether, using the methodology above, we can create CoNLL-U files based on our syllable-tokenized SQL database tables to use with Stanford CoreNLP. In the next post, we will train a Stanford CoreNLP POS-tagging model.