Machine Learning – Hmong Medical Corpus

A Stanford CoreNLP POS Tagger model for Hmong

A new Stanford CoreNLP POS Tagger model for Hmong is now available. The model file and corresponding props files are available here: https://github.com/nathanmwhite/hmong-medical-corpus/tree/master/Stanford-CoreNLP

This model is trained and tested on the files created in the previous post, derived from the Hmong Medical Corpus:

The training data file: hmcorpus_train.conllu
The test data file: hmcorpus_test.conllu

Converting text data from SQL tables to CoNLL-U format

The Hmong Medical Corpus stores its tagged text data in a SQL database. To use this data with Stanford CoreNLP, it must first be converted into CoNLL-U format. This post shows how this is done. First, let’s import the libraries needed.

    from itertools import groupby
    import os
    import sqlite3
    import pandas as pd

Next, let’s… Continue reading “Converting text data from SQL tables to CoNLL-U format”
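To give a rough idea of what such a conversion involves, here is a minimal sketch. It is not the code from the full post: the table name `tokens`, its columns, and the toy rows are all hypothetical, and an in-memory SQLite database stands in for the real corpus database. It only fills the ID, FORM, and UPOS columns of CoNLL-U and leaves the rest as underscores.

```python
from itertools import groupby
import sqlite3

# Hypothetical schema and toy rows; the real corpus tables may look different.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens (sent_id INTEGER, position INTEGER, form TEXT, upos TEXT)"
)
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?)",
    [
        (1, 1, "kuv", "PRON"),
        (1, 2, "mus", "VERB"),
        (2, 1, "nws", "PRON"),
        (2, 2, "noj", "VERB"),
        (2, 3, "mov", "NOUN"),
    ],
)


def rows_to_conllu(rows):
    """Turn (sent_id, position, form, upos) rows into CoNLL-U text.

    Rows must already be sorted by sentence and position, because
    groupby only groups consecutive rows that share a key.
    """
    blocks = []
    for sent_id, group in groupby(rows, key=lambda row: row[0]):
        lines = [f"# sent_id = {sent_id}"]
        for _, position, form, upos in group:
            # Ten tab-separated fields:
            # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            fields = [str(position), form, "_", upos, "_", "_", "_", "_", "_", "_"]
            lines.append("\t".join(fields))
        blocks.append("\n".join(lines))
    # Every sentence, including the last one, ends with a blank line.
    return "\n\n".join(blocks) + "\n\n"


rows = conn.execute(
    "SELECT sent_id, position, form, upos FROM tokens ORDER BY sent_id, position"
).fetchall()
conllu_text = rows_to_conllu(rows)
print(conllu_text)
```

The `ORDER BY` clause matters here: without it, tokens from the same sentence might not be adjacent, and `groupby` would split one sentence into several blocks.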

Question classification with limited annotated data

For resource-poor languages such as Hmong, large datasets of annotated questions are unavailable, which means that producing an automated question classifier is a potentially challenging task. Currently, a dataset containing 411 annotated Hmong questions is publicly available. The challenge here is to produce a question classifier with adequate accuracy using this available dataset. What we… Continue reading “Question classification with limited annotated data”

Using Word Embeddings for Semantic Analysis of Nominal Classifiers

Word embeddings created by Word2Vec can be utilized in exploring the semantic distributions of nouns associated with nominal classifiers. In this post, we explore using dendrogram analysis and k-means clustering with word embeddings as a means to form hypotheses for research involving these distributions. Nominal classifiers are known to have a range of semantic values… Continue reading “Using Word Embeddings for Semantic Analysis of Nominal Classifiers”
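As a small illustration of the k-means part of that idea, here is a sketch in plain Python. It is not the analysis from the full post: the two-dimensional vectors and the labels `noun_a` through `noun_d` are made up for illustration (real Word2Vec embeddings have many more dimensions), the starting centroids are picked by hand, and the loop is a bare-bones version of Lloyd's algorithm rather than a library implementation.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Bare-bones Lloyd's algorithm with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
        for point in points
    ]
    return labels, centroids


# Made-up 2-D "embeddings" for four hypothetical nouns.
embeddings = {
    "noun_a": (0.9, 0.1),
    "noun_b": (0.8, 0.2),
    "noun_c": (0.1, 0.9),
    "noun_d": (0.2, 0.8),
}
points = list(embeddings.values())
initial_centroids = [points[0], points[2]]  # hand-picked for reproducibility

labels, centroids = kmeans(points, initial_centroids)
cluster_of = dict(zip(embeddings, labels))
print(cluster_of)
```

Nouns that end up in the same cluster have similar vectors, which is the kind of grouping one could then compare against the classifiers those nouns occur with.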

A semi-supervised combined word tokenizer and POS tagger for Hmong

This post introduces a semi-supervised approach to word tokenization and POS tagging that enables support for resource-poor languages. The Hmong language is a resource-poor language [1] for which corpora of POS-tagged data were previously unavailable, precluding supervised approaches. At the same time, the Hmong language has an unusually high number of homonyms and features syllable-based spacing… Continue reading “A semi-supervised combined word tokenizer and POS tagger for Hmong”