Converting text data from SQL tables to CoNLL-U format

The Hmong Medical Corpus stores its tagged text data in a SQL database. To use this data with Stanford CoreNLP, it must first be converted into CoNLL-U format. This post shows how this is done.
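Before diving in, a quick sketch of the target format: CoNLL-U is a plain-text format in which each token occupies one line of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), sentences are separated by blank lines, and an underscore fills any unused field. A token line from the files we produce at the end of this post looks like this:

1	Tus	_	NOUN	CL	_	_	_	_	_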

First, let’s import the libraries we need.

import sqlite3
import pandas as pd

Next, let’s connect to the database. The Hmong Medical Corpus is stored as a SQLite database, so we open a connection through the sqlite3 module. We don’t need a cursor here, since Pandas will run our query directly through the connection object.

conn = sqlite3.connect('hmcorpus.db')

Now, we use a SQL query to acquire the data we need. Pandas’s read_sql_query function runs the query over our connection and returns the result directly as a DataFrame, so no separate DataFrame construction is needed.

sql_query = """SELECT doc_ind, sent_ind, token_form, type_form, pos_label, loc FROM tokens
JOIN types ON types.ind=tokens.word_type_ind
JOIN word_loc ON types.word_loc=word_loc.ind
JOIN pos ON pos.ind=types.pos_type;"""
df = pd.read_sql_query(sql_query, conn)

Next, we read in a tab-separated file that maps the part-of-speech tags specific to the Hmong Medical Corpus project (XPOS) to their equivalents in the Universal POS tag set (UPOS).

conv = pd.read_csv('mapping_to_upos.txt', sep='\t')
print(conv)
   XPOS   UPOS
0    CL   NOUN
1    NN   NOUN
2    PU  PUNCT
3    FW      X
4    VV   VERB
5    PP    ADP
6    QU    NUM
7    LC    ADP
8    AD    ADV
9    DT    DET
10   CC  CCONJ
11   CV   NOUN
12   RL   NOUN
13   CS  SCONJ
14   PN   PRON
15   NR  PROPN
16   CM   PART
17   ON   INTJ
18   JJ    ADJ

We now assign descriptive column names to the DataFrame we created above.

df.columns = ['doc_ind', 'sent_ind', 'FORM', 'type_form', 'XPOS', 'word_pos']
df.tail()
doc_ind sent_ind FORM type_form XPOS word_pos
9690 11 14 neeg neeg NN B
9691 11 14 nkag nkag VV B
9692 11 14 teb teb NN B
9693 11 14 chaws chaws NN I
9694 11 14 . . PU O

Now, we add the ID column that will appear in the final CoNLL-U files. These IDs are provisional: when we write the files at the end, we renumber them so that each sentence starts again from 1, as CoNLL-U requires.

df['ID'] = df.index + 1
df.head(20)
doc_ind sent_ind FORM type_form XPOS word_pos ID
0 1 1 Tus tus CL B 1
1 1 1 Mob mob NN B 2
2 1 1 PU O 3
3 1 1 Shigellosis shigellosis FW B 4
4 1 1 Disease disease FW B 5
5 1 1 Fact fact FW B 6
6 1 1 Sheet sheet FW B 7
7 1 1 Series series FW B 8
8 1 1 Tus tus CL B 9
9 1 1 mob mob NN B 10
10 1 1 shigellosis shigellosis FW B 11
11 1 1 zoo zoo VV B 12
12 1 1 li li PP B 13
13 1 1 cas cas DT I 14
14 1 1 ? ? PU O 15
15 1 2 Shigellosis shigellosis FW B 16
16 1 2 yog yog VV B 17
17 1 2 ib ib QU B 18
18 1 2 tug tug CL I 19
19 1 2 mob mob NN B 20

The next step is the most challenging in this process: converting the syllable-based tokens of Hmong orthography into the word-based tokens required by the CoNLL-U standard. We begin by finding all syllables labeled with a word_pos value of ‘I’ (for “internal”), i.e., the non-initial syllables of multi-syllable words.
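For a concrete illustration of this labeling, a three-syllable word is stored as three consecutive rows, with the initial syllable tagged B and each following syllable tagged I (punctuation, which stands outside any word, carries O, as seen in the outputs above):

type_form word_pos
caij      B
ntuj      I
sov       I

Later in this post, these three rows will be fused into the single CoNLL-U token caij_ntuj_sov.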

i_hits = df[df['word_pos']=='I']
i_hits.tail()
doc_ind sent_ind FORM type_form XPOS word_pos ID
9661 11 14 ntsws ntsws NN I 9662
9662 11 14 qhuav qhuav VV I 9663
9682 11 14 chaws chaws NN I 9683
9687 11 14 choj choj NN I 9688
9693 11 14 chaws chaws NN I 9694

Here, we create a new DataFrame, quads, in which we will combine each non-initial syllable with the syllables that precede it. The DataFrame is named quads because the maximum word length in Hmong is four syllables.

quads = i_hits[['type_form', 'word_pos']]
quads.head()
type_form word_pos
13 cas I
18 tug I
24 mob I
52 sim I
56 ntuj I

Now, we reorganize quads so that each row contains up to four syllables with their corresponding word-position tags. We work leftward from each internal syllable: type_form_L1 is the syllable one position to the left, type_form_L2 two positions to the left, and so on. For each offset n, we select the rows of df at quads.index − n, shift their index back up by n so it aligns with quads, and join.

# One syllable to the left: select the preceding rows, realign the index, join.
l1 = df.loc[quads.index - 1, ['type_form', 'word_pos']]
l1.index = l1.index + 1
quads = quads.join(l1, rsuffix="_L1")
# Two syllables to the left.
l2 = df.loc[quads.index - 2, ['type_form', 'word_pos']]
l2.index = l2.index + 2
quads = quads.join(l2, rsuffix="_L2")
# Three syllables to the left.
l3 = df.loc[quads.index - 3, ['type_form', 'word_pos']]
l3.index = l3.index + 3
quads = quads.join(l3, rsuffix="_L3")
quads.head(10)
type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3
13 cas I li B zoo B shigellosis B
18 tug I ib B yog B shigellosis B
24 mob I kab B cov B ntawm B
52 sim I tshwm B muaj B ntau B
56 ntuj I caij B lub B rau B
57 sov I ntuj I caij B lub B
61 nplooj I caij B lub B thiab B
62 ntoo I nplooj I caij B lub B
63 zeeg I ntoo I nplooj I caij B
66 nyob I nyob B . O zeeg I

Next, we erase any cell content that belongs to a different word from the one in the type_form column. If the L1 syllable is not internal (word_pos_L1 is not ‘I’), it begins the word, so the L2 and L3 positions belong to earlier words and are blanked; likewise, if the L2 syllable is not internal, L3 is blanked. After this step, each row contains syllables from a single word only.

m = quads['word_pos_L1'] != 'I'
quads.loc[m, ['type_form_L2', 'type_form_L3', 'word_pos_L2', 'word_pos_L3']] = ['', '', '', '']
m = quads['word_pos_L2'] != 'I'
quads.loc[m, ['type_form_L3', 'word_pos_L3']] = ['', '']

We then reset the index, turning each row’s original position in the corpus into an ordinary column named index. This lets us detect rows that represent portions of the same word, so that partial duplicates can be eliminated. We create an offset column holding the index value of the following row (a shift of −1).

The rationale is straightforward: if a row’s offset is exactly one more than its index, then the next row continues the same word, and the current row is only a partial build of it. In other words, a row further down contains the more complete version of the word.
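To see the mechanics in isolation, here is a minimal sketch of this dedup logic on toy data, using the index positions from the caij_nplooj_ntoo_zeeg example above; rows 61 and 62 are partial builds of the word completed at row 63, so only indexes 57, 63, and 66 survive:

import pandas as pd

toy = pd.DataFrame({'index': [57, 61, 62, 63, 66]})
toy['offset'] = toy['index'].shift(periods=-1)
print(toy[toy['index'] + 1 != toy['offset']])
#    index  offset
# 0     57    61.0
# 3     63    66.0
# 4     66     NaN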

quads = quads.reset_index()
quads['offset'] = quads['index'].shift(periods=-1)
quads.head(20)
index type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3 offset
0 13 cas I li B 18.0
1 18 tug I ib B 24.0
2 24 mob I kab B 52.0
3 52 sim I tshwm B 56.0
4 56 ntuj I caij B 57.0
5 57 sov I ntuj I caij B 61.0
6 61 nplooj I caij B 62.0
7 62 ntoo I nplooj I caij B 63.0
8 63 zeeg I ntoo I nplooj I caij B 66.0
9 66 nyob I nyob B 75.0
10 75 puas I los B 79.0
11 79 pawg I pab B 87.0
12 87 ke I ua B 94.0
13 94 li I thiaj B 115.0
14 115 sis I tab B 123.0
15 123 nyuam I me B 124.0
16 124 yaus I nyuam I me B 145.0
17 145 nyuam I me B 153.0
18 153 nyuam I me B 162.0
19 162 chaws I teb B 177.0

Since words of three or four syllables occupy more than one row in the quads DataFrame, the following line of code keeps only the rows containing the complete word.

quads = quads[quads['index'] + 1 != quads['offset']]

Next, we create a FORM column in quads that contains the complete word, joining the type_form columns (leftmost syllable first) with underscores; the lstrip call then removes the leading underscores left behind by the empty L2 and L3 cells of shorter words. Using underscores for the syllable breaks is the practice used in CoNLL files for Vietnamese, which has the same syllable-based spacing as Hmong, so we adopt the practice here.

quads['FORM'] = quads['type_form_L3'] + '_' + \
                quads['type_form_L2'] + '_' + \
                quads['type_form_L1'] + '_' + \
                quads['type_form']
quads['FORM'] = quads['FORM'].str.lstrip('_')

Below, we can see the results in the FORM column on the right.

quads.head(10)
index type_form word_pos type_form_L1 word_pos_L1 type_form_L2 word_pos_L2 type_form_L3 word_pos_L3 offset FORM
0 13 cas I li B 18.0 li_cas
1 18 tug I ib B 24.0 ib_tug
2 24 mob I kab B 52.0 kab_mob
3 52 sim I tshwm B 56.0 tshwm_sim
5 57 sov I ntuj I caij B 61.0 caij_ntuj_sov
8 63 zeeg I ntoo I nplooj I caij B 66.0 caij_nplooj_ntoo_zeeg
9 66 nyob I nyob B 75.0 nyob_nyob
10 75 puas I los B 79.0 los_puas
11 79 pawg I pab B 87.0 pab_pawg
12 87 ke I ua B 94.0 ua_ke

Next, we compute a head_pos column for quads, giving the position of the word’s initial syllable in our original DataFrame: the number of underscores in FORM is the number of syllables minus one, so subtracting it from the index of the final syllable lands on the first. For example, caij_ntuj_sov has two underscores and its last syllable sits at index 57, so its head is at 57 − 2 = 55. We then set head_pos as the new index and reduce quads to the single column we need to merge into our original DataFrame: FORM, containing the newly combined full word, indexed by the position where it needs to appear.

quads['head_pos'] = quads['index'] - quads['FORM'].str.count('_')
quads.set_index('head_pos', inplace=True)
quads = quads.loc[:, ['FORM']]
quads.head(10)
FORM
head_pos
12 li_cas
17 ib_tug
23 kab_mob
51 tshwm_sim
55 caij_ntuj_sov
60 caij_nplooj_ntoo_zeeg
65 nyob_nyob
74 los_puas
78 pab_pawg
86 ua_ke

Next, we merge the combined words into the original DataFrame containing the full POS-tagged corpus. DataFrame.update aligns on the index and column labels, so each full word overwrites the FORM cell at its head position.

df.update(quads)

Next, we update the POS tags so that each combined word carries a single tag that correctly reflects the role of the full word in the corpus DataFrame.

First, we handle words composed of quantifier + classifier sequences, where the part of speech of the resulting combination is a classifier. Using a temporary DataFrame, we extract the position immediately preceding each classifier that appears in non-initial position, keep only the cases where that preceding (head) syllable is a quantifier, assign the tag CL (“classifier”), and update the corpus DataFrame.

# Rows immediately before a non-initial classifier syllable (XPOS 'CL', word_pos 'I')...
dg = df.loc[df[(df['XPOS']=='CL') & (df['word_pos']=='I')].index - 1, ['XPOS']]
# ...keeping only those whose own tag is quantifier (QU).
dg = dg[dg['XPOS']=='QU']
# The combined word sits at this head position; retag it as a classifier.
dg['XPOS'] = 'CL'
df.update(dg)

Second, we tag words formed from the associative-reciprocal prefix sib + verb as verbs. We do this by finding each word whose first three letters are sib. Every word that begins with sib is a verb in our corpus, so we can use a simple assignment.

df.loc[df['FORM'].str[:3]=='sib', 'XPOS'] = 'VV'

Third, the ubiquitous li cas “what” functions as a single unit in Hmong as a demonstrative used in questions, so we tag it DT.

df.loc[df['FORM']=='li_cas', 'XPOS'] = 'DT'

Now that all of the POS tags in the XPOS column have been updated, we can add the UPOS column with the equivalent values from the Universal POS tag set.

df = df.join(conv.set_index("XPOS"), rsuffix="_match", on=["XPOS"])

We can now drop every row holding a non-initial syllable (word_pos of ‘I’), leaving only complete words in the corpus DataFrame. Notice below that the rows for cas (index 13) and tug (index 18) are gone, while their full words survive at the head positions: li_cas at index 12 and ib_tug at index 17.

df = df[df['word_pos'] != 'I']
df.head(18)
doc_ind sent_ind FORM type_form XPOS word_pos ID UPOS
0 1 1 Tus tus CL B 1 NOUN
1 1 1 Mob mob NN B 2 NOUN
2 1 1   PU O 3 PUNCT
3 1 1 Shigellosis shigellosis FW B 4 X
4 1 1 Disease disease FW B 5 X
5 1 1 Fact fact FW B 6 X
6 1 1 Sheet sheet FW B 7 X
7 1 1 Series series FW B 8 X
8 1 1 Tus tus CL B 9 NOUN
9 1 1 mob mob NN B 10 NOUN
10 1 1 shigellosis shigellosis FW B 11 X
11 1 1 zoo zoo VV B 12 VERB
12 1 1 li_cas li DT B 13 DET
14 1 1 ? ? PU O 15 PUNCT
15 1 2 Shigellosis shigellosis FW B 16 X
16 1 2 yog yog VV B 17 VERB
17 1 2 ib_tug ib CL B 18 NOUN
19 1 2 mob mob NN B 20 NOUN

Since our ultimate goal is to create CoNLL-U files for training a Stanford CoreNLP POS-tagging model, we can fill the remaining required columns with underscores, the CoNLL-U convention for unused fields.

df['LEMMA'] = '_'
df['FEATS'] = '_'
df['HEAD'] = '_'
df['DEPREL'] = '_'
df['DEPS'] = '_'
df['MISC'] = '_'

Now, we use drop with inplace=True to remove the two processing columns, the syllable forms (type_form) and word-position tags (word_pos), from the DataFrame.

df.drop(columns=['type_form', 'word_pos'], inplace=True)
df.head()
doc_ind sent_ind FORM XPOS ID UPOS LEMMA FEATS HEAD DEPREL DEPS MISC
0 1 1 Tus CL 1 NOUN _ _ _ _ _ _
1 1 1 Mob NN 2 NOUN _ _ _ _ _ _
2 1 1  PU 3 PUNCT _ _ _ _ _ _
3 1 1 Shigellosis FW 4 X _ _ _ _ _ _
4 1 1 Disease FW 5 X _ _ _ _ _ _

Next, we retrieve the set of unique doc_ind and sent_ind combinations as a NumPy array.

sentence_ids = df.groupby(['doc_ind', 'sent_ind']).size().reset_index().loc[:, ['doc_ind', 'sent_ind']].values
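For illustration, the first few rows of this array would look like the following (documents and sentences are both numbered from 1 in this corpus):

array([[1, 1],
       [1, 2],
       [1, 3],
       ...])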

Here, we define which sentences from the corpus will be held out as the testing dataset for the training step later. We select sentences 7 and 14 from each document in the corpus; every document contains at least 14 sentences, so this selection is safe.

test_ids = [7, 14]

Finally, we create the CoNLL-U files that will be used for training and testing of our Stanford CoreNLP POS-tagging model.

We iterate through the sentence IDs, creating a separate DataFrame for each sentence and renumbering its ID column from 1, as CoNLL-U requires. Within each sentence, we no longer need the document and sentence numbers, so we drop them and reorder the remaining columns to match the CoNLL-U specification. We then write each sentence to the appropriate file with to_csv, adding an extra newline afterwards because CoNLL-U separates sentences with a blank line.

f = open('hmcorpus_train.conllu', 'w')   # 'w' so reruns overwrite rather than append
g = open('hmcorpus_test.conllu', 'w')
for doc_id, sent_id in sentence_ids:
    sent_df = df[(df['doc_ind']==doc_id) & (df['sent_ind']==sent_id)].reset_index(drop=True)
    sent_df.loc[:, 'ID'] = sent_df.index + 1   # renumber tokens from 1 within the sentence
    sent_df.drop(columns=['doc_ind', 'sent_ind'], inplace=True)
    new_columns = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']
    sent_df = sent_df[new_columns]
    if sent_id in test_ids:
        sent_df.to_csv(g, sep='\t', header=False, index=False)
        g.write('\n')   # blank line between sentences
    else:
        sent_df.to_csv(f, sep='\t', header=False, index=False)
        f.write('\n')
f.close()
g.close()
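As an optional sanity check (not part of the pipeline itself), the short snippet below verifies that every non-blank line in the training file has the ten tab-separated fields CoNLL-U requires:

with open('hmcorpus_train.conllu') as fh:
    for n, line in enumerate(fh, start=1):
        # Token lines must have exactly 10 tab-separated fields;
        # blank lines (the sentence separators) are skipped.
        if line.strip() and len(line.rstrip('\n').split('\t')) != 10:
            print(f'Line {n} is malformed: {line!r}')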

Conclusion

Altogether, using the methodology above, we can create CoNLL-U files based on our syllable-tokenized SQL database tables to use with Stanford CoreNLP. In the next post, we will train a Stanford CoreNLP POS-tagging model.
