A Stanford CoreNLP POS Tagger model for Hmong

A new Stanford CoreNLP POS Tagger model for Hmong is now available. The model file and corresponding props files are available here: https://github.com/nathanmwhite/hmong-medical-corpus/tree/master/Stanford-CoreNLP. The model was trained and tested on the files created in the previous post, derived from the Hmong Medical Corpus:

- Training data: hmcorpus_train.conllu
- Test data: hmcorpus_test.conllu
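The repository linked above contains the actual props files; as a rough illustration, a MaxentTagger training configuration for CoNLL-U data of this kind might look like the sketch below. The model filename and feature architecture here are illustrative assumptions, not the repository's actual settings.

```properties
# Hypothetical training props for a MaxentTagger model (illustrative only;
# see the linked repository for the real props files).
model = hmong.tagger
encoding = UTF-8
# CoNLL-U is tab-separated: column 1 = FORM, column 3 = UPOS
trainFile = format=TSV,wordColumn=1,tagColumn=3,hmcorpus_train.conllu
testFile = format=TSV,wordColumn=1,tagColumn=3,hmcorpus_test.conllu
arch = generic
```

Training would then be invoked with the `edu.stanford.nlp.tagger.maxent.MaxentTagger` class and a `-props` argument pointing at this file.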

Converting text data from SQL tables to CoNLL-U format

The Hmong Medical Corpus stores its tagged text data in a SQL database. To use this data with Stanford CoreNLP, it must first be converted into CoNLL-U format. This post shows how this is done. First, let's import the libraries needed:

```python
from itertools import groupby
import os
import sqlite3
import pandas as pd
```

Next, let's…
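The core of such a conversion can be sketched as follows. The schema here (a `tokens` table with sentence, position, word, and POS columns) is an illustrative assumption, not the corpus's actual schema; the conversion groups rows by sentence and emits ten-column CoNLL-U lines.

```python
import sqlite3
from itertools import groupby

def rows_to_conllu(rows):
    """Convert ordered (sentence_id, word, pos) rows into CoNLL-U text.

    Each token becomes a 10-column line: ID, FORM, a '_' LEMMA placeholder,
    UPOS, and '_' placeholders for the remaining columns.
    """
    blocks = []
    for sent_id, toks in groupby(rows, key=lambda r: r[0]):
        lines = [f"# sent_id = {sent_id}"]
        for i, (_, word, pos) in enumerate(toks, start=1):
            cols = [str(i), word, "_", pos, "_", "_", "_", "_", "_", "_"]
            lines.append("\t".join(cols))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

# Hypothetical schema: one row per token, ordered by sentence and position.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens (sentence_id INTEGER, position INTEGER, word TEXT, pos TEXT)"
)
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?)",
    [(1, 1, "kuv", "PRON"), (1, 2, "mob", "VERB"), (2, 1, "tsev", "NOUN")],
)
rows = conn.execute(
    "SELECT sentence_id, word, pos FROM tokens ORDER BY sentence_id, position"
).fetchall()
print(rows_to_conllu(rows))
```

Sentence boundaries are recovered purely from the `sentence_id` column, which is why the `ORDER BY` clause matters: `groupby` only merges adjacent rows.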

Question classification with limited annotated data

For resource-poor languages such as Hmong, large datasets of annotated questions are unavailable, making an automated question classifier potentially challenging to produce. Currently, a dataset containing 411 annotated Hmong questions is publicly available. The challenge is to produce a question classifier with adequate accuracy using this dataset. What we…
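The post's actual approach isn't shown in this excerpt. As one common small-data baseline, a character n-gram TF-IDF model with a linear SVM can be cross-validated even on a few hundred examples; the sketch below uses toy English stand-in questions, since the 411-question Hmong dataset is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data; the real annotated Hmong dataset is not reproduced here.
questions = [
    "what is the capital of France",
    "what causes malaria",
    "who discovered penicillin",
    "who is the head of the clinic",
    "where is the nearest hospital",
    "where can I get vaccinated",
    "when does the clinic open",
    "when should I take this medicine",
]
labels = ["DESC", "DESC", "HUM", "HUM", "LOC", "LOC", "TIME", "TIME"]

# Character n-grams are a reasonable choice with limited annotated data,
# since they require no language-specific tokenization.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LinearSVC(),
)
scores = cross_val_score(clf, questions, labels, cv=2)
print(scores.mean())
```

With a dataset this small, stratified cross-validation (which `cross_val_score` applies by default for classification) gives a more honest accuracy estimate than a single train/test split.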

Using Word Embeddings for Semantic Analysis of Nominal Classifiers

Word embeddings created by Word2Vec can be used to explore the semantic distributions of nouns associated with nominal classifiers. In this post, we explore using dendrogram analysis and k-means clustering with word embeddings as a means to form hypotheses for research involving these distributions. Nominal classifiers are known to have a range of semantic values…
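The clustering side of this analysis can be sketched as below. Random vectors stand in for real Word2Vec embeddings (which in practice would come from a trained model, e.g. gensim's `Word2Vec`), and the noun list and dimensionality are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

# Stand-in embeddings: in a real analysis these vectors would be looked up
# from a Word2Vec model trained on corpus text.
nouns = ["tsev", "tsheb", "dev", "npua", "teb", "liaj"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(nouns), 50))

# Hierarchical clustering: the linkage matrix is what a dendrogram plots
# (scipy.cluster.hierarchy.dendrogram accepts it directly).
Z = linkage(vectors, method="ward")
hier_labels = fcluster(Z, t=2, criterion="maxclust")

# k-means clustering of the same embeddings into 2 groups.
_, km_labels = kmeans2(vectors, k=2, minit="++", seed=0)

for noun, h, k in zip(nouns, hier_labels, km_labels):
    print(noun, h, k)
```

Comparing the two groupings is a quick way to check whether an apparent semantic cluster is stable across methods before treating it as a hypothesis.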

Using a SQL database for corpus development and management

Corpora are useful tools both for analyzing human language and for developing NLP applications. However, finding a good platform for building a corpus is not always straightforward. Using the sqlite3 package to create a SQL database to manage our corpus data is an excellent solution, as it provides a means both to maintain the internal…
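A minimal sketch of this idea, assuming one table per level of corpus structure (texts, sentences, tokens); the real corpus's schema may differ. The join at the end shows how a concordance-style query falls out of the relational layout.

```python
import sqlite3

# Illustrative schema: texts contain sentences, sentences contain tokens.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (
    text_id INTEGER PRIMARY KEY,
    title   TEXT,
    source  TEXT
);
CREATE TABLE sentences (
    sentence_id INTEGER PRIMARY KEY,
    text_id     INTEGER REFERENCES texts(text_id),
    position    INTEGER
);
CREATE TABLE tokens (
    token_id    INTEGER PRIMARY KEY,
    sentence_id INTEGER REFERENCES sentences(sentence_id),
    position    INTEGER,
    word        TEXT,
    pos         TEXT
);
""")
conn.execute("INSERT INTO texts VALUES (1, 'Sample text', 'web')")
conn.execute("INSERT INTO sentences VALUES (1, 1, 1)")
conn.executemany(
    "INSERT INTO tokens VALUES (?, 1, ?, ?, ?)",
    [(1, 1, "kuv", "PRON"), (2, 2, "mob", "VERB")],
)

# Concordance-style query: every VERB token with the title of its text.
rows = conn.execute("""
    SELECT t.word, x.title
    FROM tokens t
    JOIN sentences s ON t.sentence_id = s.sentence_id
    JOIN texts x ON s.text_id = x.text_id
    WHERE t.pos = 'VERB'
""").fetchall()
print(rows)  # → [('mob', 'Sample text')]
```

Because sqlite3 ships with Python and stores the whole database in one file, this approach needs no server setup, which suits a small corpus project.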

A semi-supervised combined word tokenizer and POS tagger for Hmong

This post introduces a semi-supervised approach to word tokenization and POS tagging that enables support for resource-poor languages. Hmong is a resource-poor language [1] for which corpora of POS-tagged data have previously been unavailable, precluding supervised approaches. At the same time, Hmong has an unusually high number of homonyms and features syllable-based spacing…
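The excerpt does not describe the post's actual method, but the tokenization problem it names can be illustrated: because Hmong orthography puts a space between syllables, multisyllabic words arrive as separate tokens and must be merged. The greedy longest-match sketch below uses a hypothetical mini-lexicon; a real semi-supervised system would induce its word list from data rather than hard-code it.

```python
def merge_syllables(syllables, lexicon, max_len=4):
    """Greedy longest-match merge of space-delimited syllables into words.

    Hmong writes a space between syllables, so a multisyllabic word like
    'tsev kho mob' (hospital) arrives as three separate tokens.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

# Hypothetical mini-lexicon of multisyllabic words.
lexicon = {"tsev kho mob", "kho mob"}
print(merge_syllables("kuv mus tsev kho mob".split(), lexicon))
# → ['kuv', 'mus', 'tsev kho mob']
```

Note that longest-match greedily prefers "tsev kho mob" over the shorter lexicon entry "kho mob"; handling the language's many homonyms well is exactly where a purely lexicon-based approach falls short and semi-supervised signal is needed.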

Hmong Medical Corpus Blog: The Rationale

The Hmong Medical Corpus (currently hosted here: http://corpus.ap-southeast-2.elasticbeanstalk.com/hminterface/) was launched in August 2019 with the goal of making Hmong medical information readily available in a single, searchable location, both to members of the Hmong community and to members of the linguistic research community who need greater access to material in Hmong. This project involves natural language…
