Converting text data from SQL tables to CoNLL-U format

The Hmong Medical Corpus stores its tagged text data in a SQL database. To use this data with Stanford CoreNLP, it must first be converted into CoNLL-U format. This post shows how this is done.
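Before diving in, a quick sketch of the target format: CoNLL-U is a plain-text format in which each token occupies one line of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), sentences are separated by blank lines, and an underscore fills any unused field. A token line from the files we produce at the end of this post looks like this:

1	Tus	_	NOUN	CL	_	_	_	_	_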

First, let’s import the libraries we need.

import sqlite3
import pandas as pd

Next, let’s connect to the database. The Hmong Medical Corpus is stored as a SQLite database, so we open a connection through the sqlite3 module. We don’t need a cursor here, since Pandas will run our query directly through the connection object.

conn = sqlite3.connect('hmcorpus.db')

Now, we use a SQL query to acquire the data we need. Pandas’s read_sql_query function runs the query over our connection and returns the result directly as a DataFrame, so no separate DataFrame construction is needed.

sql_query = """SELECT doc_ind, sent_ind, token_form, type_form, pos_label, loc FROM tokens
JOIN types ON types.ind=tokens.word_type_ind
JOIN word_loc ON types.word_loc=word_loc.ind
JOIN pos ON pos.ind=types.pos_type;"""
df = pd.read_sql_query(sql_query, conn)

Next, we read in a tab-separated file that maps the part-of-speech tags specific to the Hmong Medical Corpus project (XPOS) to their equivalents in the Universal POS tag set (UPOS).

conv = pd.read_csv('mapping_to_upos.txt', sep='\t')
print(conv)
   XPOS   UPOS
0    CL   NOUN
1    NN   NOUN
2    PU  PUNCT
3    FW      X
4    VV   VERB
5    PP    ADP
6    QU    NUM
7    LC    ADP
8    AD    ADV
9    DT    DET
10   CC  CCONJ
11   CV   NOUN
12   RL   NOUN
13   CS  SCONJ
14   PN   PRON
15   NR  PROPN
16   CM   PART
17   ON   INTJ
18   JJ    ADJ

We now assign descriptive column names to the DataFrame we created above.

df.columns = ['doc_ind', 'sent_ind', 'FORM', 'type_form', 'XPOS', 'word_pos']
df.tail()
doc_ind sent_ind FORM type_form XPOS word_pos
9690 11 14 neeg neeg NN B
9691 11 14 nkag nkag VV B
9692 11 14 teb teb NN B
9693 11 14 chaws chaws NN I
9694 11 14 . . PU O

Now, we add the ID column that will appear in the final CoNLL-U files. These IDs are provisional: when we write the files at the end, we renumber them so that each sentence starts again from 1, as CoNLL-U requires.

df['ID'] = df.index + 1
df.head(20)
doc_ind sent_ind FORM type_form XPOS word_pos ID
0 1 1 Tus tus CL B 1
1 1 1 Mob mob NN B 2
2 1 1 PU O 3
3 1 1 Shigellosis shigellosis FW B 4
4 1 1 Disease disease FW B 5
5 1 1 Fact fact FW B 6
6 1 1 Sheet sheet FW B 7
7 1 1 Series series FW B 8
8 1 1 Tus tus CL B 9
9 1 1 mob mob NN B 10
10 1 1 shigellosis shigellosis FW B 11
11 1 1 zoo zoo VV B 12
12 1 1 li li PP B 13
13 1 1 cas cas DT I 14
14 1 1 ? ? PU O 15
15 1 2 Shigellosis shigellosis FW B 16
16 1 2 yog yog VV B 17
17 1 2 ib ib QU B 18
18 1 2 tug tug CL I 19
19 1 2 mob mob NN B 20

The next step is the most challenging in this process: converting the syllable-based tokens of Hmong orthography into the word-based tokens required by the CoNLL-U standard. We begin by finding all syllables labeled with a word_pos value of ‘I’ (for “internal”), i.e., the non-initial syllables of multi-syllable words.
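For a concrete illustration of this labeling, a three-syllable word is stored as three consecutive rows, with the initial syllable tagged B and each following syllable tagged I (punctuation, which stands outside any word, carries O, as seen in the outputs above):

type_form word_pos
caij      B
ntuj      I
sov       I

Later in this post, these three rows will be fused into the single CoNLL-U token caij_ntuj_sov.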

i_hits = df[df['word_pos']=='I']
i_hits.tail()
doc_ind sent_ind FORM type_form XPOS word_pos ID
9661 11 14 ntsws ntsws NN I 9662
9662 11 14 qhuav qhuav VV I 9663
9682 11 14 chaws chaws NN I 9683
9687 11 14 choj choj NN I 9688
9693 11 14 chaws chaws NN I 9694

Here, we create a new DataFrame, quads, in which we will combine each non-initial syllable with the syllables that precede it. The DataFrame is named quads because the maximum word length in Hmong is four syllables.

quads = i_hits[['type_form', 'word_pos']]
quads.head()
type_form word_pos
13 cas I
18 tug I
24 mob I
52 sim I
56 ntuj I

Now, we reorganize quads so that each row contains up to four syllables with their corresponding word-position tags. We work leftward from each internal syllable: type_form_L1 is the syllable one position to the left, type_form_L2 two positions to the left, and so on. For each offset n, we select the rows of df at quads.index − n, shift their index back up by n so it aligns with quads, and join.

# One syllable to the left: select the preceding rows, realign the index, join.
l1 = df.loc[quads.index - 1, ['type_form', 'word_pos']]
l1.index = l1.index + 1
quads = quads.join(l1, rsuffix="_L1")
# Two syllables to the left.
l2 = df.loc[quads.index - 2, ['type_form', 'word_pos']]
l2.index = l2.index + 2
quads = quads.join(l2, rsuffix="_L2")
# Three syllables to the left.
l3 = df.loc[quads.index - 3, ['type_form', 'word_pos']]
l3.index = l3.index + 3
quads = quads.join(l3, rsuffix="_L3")
quads.head(10)
type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3
13 cas I li B zoo B shigellosis B
18 tug I ib B yog B shigellosis B
24 mob I kab B cov B ntawm B
52 sim I tshwm B muaj B ntau B
56 ntuj I caij B lub B rau B
57 sov I ntuj I caij B lub B
61 nplooj I caij B lub B thiab B
62 ntoo I nplooj I caij B lub B
63 zeeg I ntoo I nplooj I caij B
66 nyob I nyob B . O zeeg I

Next, we erase any cell content that belongs to a different word from the one in the type_form column. If the L1 syllable is not internal (word_pos_L1 is not ‘I’), it begins the word, so the L2 and L3 positions belong to earlier words and are blanked; likewise, if the L2 syllable is not internal, L3 is blanked. After this step, each row contains syllables from a single word only.

m = quads['word_pos_L1'] != 'I'
quads.loc[m, ['type_form_L2', 'type_form_L3', 'word_pos_L2', 'word_pos_L3']] = ['', '', '', '']
m = quads['word_pos_L2'] != 'I'
quads.loc[m, ['type_form_L3', 'word_pos_L3']] = ['', '']

We then reset the index, turning each row’s original position in the corpus into an ordinary column named index. This lets us detect rows that represent portions of the same word, so that partial duplicates can be eliminated. We create an offset column holding the index value of the following row (a shift of −1).

The rationale is straightforward: if a row’s offset is exactly one more than its index, then the next row continues the same word, and the current row is only a partial build of it. In other words, a row further down contains the more complete version of the word.
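To see the mechanics in isolation, here is a minimal sketch of this dedup logic on toy data, using the index positions from the caij_nplooj_ntoo_zeeg example above; rows 61 and 62 are partial builds of the word completed at row 63, so only indexes 57, 63, and 66 survive:

import pandas as pd

toy = pd.DataFrame({'index': [57, 61, 62, 63, 66]})
toy['offset'] = toy['index'].shift(periods=-1)
print(toy[toy['index'] + 1 != toy['offset']])
#    index  offset
# 0     57    61.0
# 3     63    66.0
# 4     66     NaN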

quads = quads.reset_index()
quads['offset'] = quads['index'].shift(periods=-1)
quads.head(20)
index type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3 offset
0 13 cas I li B 18.0
1 18 tug I ib B 24.0
2 24 mob I kab B 52.0
3 52 sim I tshwm B 56.0
4 56 ntuj I caij B 57.0
5 57 sov I ntuj I caij B 61.0
6 61 nplooj I caij B 62.0
7 62 ntoo I nplooj I caij B 63.0
8 63 zeeg I ntoo I nplooj I caij B 66.0
9 66 nyob I nyob B 75.0
10 75 puas I los B 79.0
11 79 pawg I pab B 87.0
12 87 ke I ua B 94.0
13 94 li I thiaj B 115.0
14 115 sis I tab B 123.0
15 123 nyuam I me B 124.0
16 124 yaus I nyuam I me B 145.0
17 145 nyuam I me B 153.0
18 153 nyuam I me B 162.0
19 162 chaws I teb B 177.0

Since words of three or four syllables occupy more than one row in the quads DataFrame, the following line of code keeps only the rows containing the complete word.

quads = quads[quads['index'] + 1 != quads['offset']]

Next, we create a FORM column in quads that contains the complete word, joining the type_form columns (leftmost syllable first) with underscores; the lstrip call then removes the leading underscores left behind by the empty L2 and L3 cells of shorter words. Using underscores for the syllable breaks is the practice used in CoNLL files for Vietnamese, which has the same syllable-based spacing as Hmong, so we adopt the practice here.

quads['FORM'] = quads['type_form_L3'] + '_' + \
                quads['type_form_L2'] + '_' + \
                quads['type_form_L1'] + '_' + \
                quads['type_form']
quads['FORM'] = quads['FORM'].str.lstrip('_')

Below, we can see the results in the FORM column on the right.

quads.head(10)
index type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3 offset FORM
0 13 cas I li B 18.0 li_cas
1 18 tug I ib B 24.0 ib_tug
2 24 mob I kab B 52.0 kab_mob
3 52 sim I tshwm B 56.0 tshwm_sim
5 57 sov I ntuj I caij B 61.0 caij_ntuj_sov
8 63 zeeg I ntoo I nplooj I caij B 66.0 caij_nplooj_ntoo_zeeg
9 66 nyob I nyob B 75.0 nyob_nyob
10 75 puas I los B 79.0 los_puas
11 79 pawg I pab B 87.0 pab_pawg
12 87 ke I ua B 94.0 ua_ke

Next, we compute a head_pos column for quads, giving the position of the word’s initial syllable in our original DataFrame: the number of underscores in FORM is the number of syllables minus one, so subtracting it from the index of the final syllable lands on the first. For example, caij_ntuj_sov has two underscores and its last syllable sits at index 57, so its head is at 57 − 2 = 55. We then set head_pos as the new index and reduce quads to the single column we need to merge into our original DataFrame: FORM, containing the newly combined full word, indexed by the position where it needs to appear.

quads['head_pos'] = quads['index'] - quads['FORM'].str.count('_')
quads.set_index('head_pos', inplace=True)
quads = quads.loc[:, ['FORM']]
quads.head(10)
FORM
head_pos
12 li_cas
17 ib_tug
23 kab_mob
51 tshwm_sim
55 caij_ntuj_sov
60 caij_nplooj_ntoo_zeeg
65 nyob_nyob
74 los_puas
78 pab_pawg
86 ua_ke

Next, we merge the combined words into the original DataFrame containing the full POS-tagged corpus. DataFrame.update aligns on the index and column labels, so each full word overwrites the FORM cell at its head position.

df.update(quads)

Next, we update the POS tags so that each combined word carries a single tag that correctly reflects the role of the full word in the corpus DataFrame.

First, we handle words composed of quantifier + classifier sequences, where the part of speech of the resulting combination is a classifier. Using a temporary DataFrame, we extract the position immediately preceding each classifier that appears in non-initial position, keep only the cases where that preceding (head) syllable is a quantifier, assign the tag CL (“classifier”), and update the corpus DataFrame.

# Rows immediately before a non-initial classifier syllable (XPOS 'CL', word_pos 'I')...
dg = df.loc[df[(df['XPOS']=='CL') & (df['word_pos']=='I')].index - 1, ['XPOS']]
# ...keeping only those whose own tag is quantifier (QU).
dg = dg[dg['XPOS']=='QU']
# The combined word sits at this head position; retag it as a classifier.
dg['XPOS'] = 'CL'
df.update(dg)

Second, we tag words formed from the associative-reciprocal prefix sib + verb as verbs. We do this by finding each word whose first three letters are sib. Every word that begins with sib is a verb in our corpus, so we can use a simple assignment.

df.loc[df['FORM'].str[:3]=='sib', 'XPOS'] = 'VV'

Third, the ubiquitous li cas “what” functions as a single unit in Hmong as a demonstrative used in questions, so we tag it DT.

df.loc[df['FORM']=='li_cas', 'XPOS'] = 'DT'

Now that all of the POS tags in the XPOS column have been updated, we can add the UPOS column with the equivalent values from the Universal POS tag set.

df = df.join(conv.set_index("XPOS"), rsuffix="_match", on=["XPOS"])

We can now drop every row holding a non-initial syllable (word_pos of ‘I’), leaving only complete words in the corpus DataFrame. Notice below that the rows for cas (index 13) and tug (index 18) are gone, while their full words survive at the head positions: li_cas at index 12 and ib_tug at index 17.

df = df[df['word_pos'] != 'I']
df.head(18)
doc_ind sent_ind FORM type_form XPOS word_pos ID UPOS
0 1 1 Tus tus CL B 1 NOUN
1 1 1 Mob mob NN B 2 NOUN
2 1 1   PU O 3 PUNCT
3 1 1 Shigellosis shigellosis FW B 4 X
4 1 1 Disease disease FW B 5 X
5 1 1 Fact fact FW B 6 X
6 1 1 Sheet sheet FW B 7 X
7 1 1 Series series FW B 8 X
8 1 1 Tus tus CL B 9 NOUN
9 1 1 mob mob NN B 10 NOUN
10 1 1 shigellosis shigellosis FW B 11 X
11 1 1 zoo zoo VV B 12 VERB
12 1 1 li_cas li DT B 13 DET
14 1 1 ? ? PU O 15 PUNCT
15 1 2 Shigellosis shigellosis FW B 16 X
16 1 2 yog yog VV B 17 VERB
17 1 2 ib_tug ib CL B 18 NOUN
19 1 2 mob mob NN B 20 NOUN

Since our ultimate goal is to create CoNLL-U files for training a Stanford CoreNLP POS-tagging model, we can fill the remaining required columns with underscores, the CoNLL-U convention for unused fields.

df['LEMMA'] = '_'
df['FEATS'] = '_'
df['HEAD'] = '_'
df['DEPREL'] = '_'
df['DEPS'] = '_'
df['MISC'] = '_'

Now, we use drop with inplace=True to remove the two processing columns, the syllable forms (type_form) and word-position tags (word_pos), from the DataFrame.

df.drop(columns=['type_form', 'word_pos'], inplace=True)
df.head()
doc_ind sent_ind FORM XPOS ID UPOS LEMMA FEATS HEAD DEPREL DEPS MISC
0 1 1 Tus CL 1 NOUN _ _ _ _ _ _
1 1 1 Mob NN 2 NOUN _ _ _ _ _ _
2 1 1  PU 3 PUNCT _ _ _ _ _ _
3 1 1 Shigellosis FW 4 X _ _ _ _ _ _
4 1 1 Disease FW 5 X _ _ _ _ _ _

Next, we retrieve the set of unique doc_ind and sent_ind combinations as a NumPy array.

sentence_ids = df.groupby(['doc_ind', 'sent_ind']).size().reset_index().loc[:, ['doc_ind', 'sent_ind']].values
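For illustration, the first few rows of this array would look like the following (documents and sentences are both numbered from 1 in this corpus):

array([[1, 1],
       [1, 2],
       [1, 3],
       ...])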

Here, we define which sentences from the corpus will be held out as the testing dataset for the training step later. We select sentences 7 and 14 from each document in the corpus; every document contains at least 14 sentences, so this selection is safe.

test_ids = [7, 14]

Finally, we create the CoNLL-U files that will be used for training and testing of our Stanford CoreNLP POS-tagging model.

We iterate through the sentence IDs, creating a separate DataFrame for each sentence and renumbering its ID column from 1, as CoNLL-U requires. Within each sentence, we no longer need the document and sentence numbers, so we drop them and reorder the remaining columns to match the CoNLL-U specification. We then write each sentence to the appropriate file with to_csv, adding an extra newline afterwards because CoNLL-U separates sentences with a blank line.

f = open('hmcorpus_train.conllu', 'w')   # 'w' so reruns overwrite rather than append
g = open('hmcorpus_test.conllu', 'w')
for doc_id, sent_id in sentence_ids:
    sent_df = df[(df['doc_ind']==doc_id) & (df['sent_ind']==sent_id)].reset_index(drop=True)
    sent_df.loc[:, 'ID'] = sent_df.index + 1   # renumber tokens from 1 within the sentence
    sent_df.drop(columns=['doc_ind', 'sent_ind'], inplace=True)
    new_columns = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']
    sent_df = sent_df[new_columns]
    if sent_id in test_ids:
        sent_df.to_csv(g, sep='\t', header=False, index=False)
        g.write('\n')   # blank line between sentences
    else:
        sent_df.to_csv(f, sep='\t', header=False, index=False)
        f.write('\n')
f.close()
g.close()
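As an optional sanity check (not part of the pipeline itself), the short snippet below verifies that every non-blank line in the training file has the ten tab-separated fields CoNLL-U requires:

with open('hmcorpus_train.conllu') as fh:
    for n, line in enumerate(fh, start=1):
        # Token lines must have exactly 10 tab-separated fields;
        # blank lines (the sentence separators) are skipped.
        if line.strip() and len(line.rstrip('\n').split('\t')) != 10:
            print(f'Line {n} is malformed: {line!r}')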

Conclusion

Altogether, using the methodology above, we can create CoNLL-U files based on our syllable-tokenized SQL database tables to use with Stanford CoreNLP. In the next post, we will train a Stanford CoreNLP POS-tagging model.
