Corpora are useful tools both for analyzing human language and for developing NLP applications. However, finding a good platform for building a corpus is not always straightforward. Creating a SQL database with the sqlite3 package to manage our corpus data is an excellent solution: it preserves the internal structure of the data and lets us traverse that structure quickly.
Let’s begin by importing the necessary libraries.
Import libraries.
import os
import sqlite3
import pickle
Create the database.
For a part-of-speech tagged database, we need to have the following tables:
- Documents—to keep track of the original document files
- Part of speech—to keep track of all of the possible parts of speech
- Word Types—to keep track of all attested word types (or lemmas), rather than the word tokens and their varying forms
- Word Tokens—to keep track of the individual word tokens in each document, as they appear in the original
For Hmong in particular, because the language’s orthography places spaces between syllables, we need to keep track of which position in the word each type/token represents. As a result, we need a fifth table:
- Word Position—to keep track of each syllable's position within its word (implemented below as the word_location table)
Languages with more complicated morphology may need additional tables to keep track of the various morphological categories for a given word. Hmong, however, allows at most one affix per word (plus reduplication), and morpheme boundaries coincide with syllable boundaries, and thus with spaces, so each morpheme is already stored as its own type.
Following proper database structure as expressed by the normal forms (https://www.guru99.com/database-normalization.html), we want to encode each category only once in the database and have everything else refer to it. So, one table refers to the categories stored in another table by their indices. For example, to record the part of speech of each word type, the word types table stores the index of the corresponding row in the parts of speech table.
Below, we connect to the database with sqlite3.connect(<database_filename>), get a cursor, and run SQL CREATE TABLE commands through the cursor's execute method to create each of the five tables, complete with index references between tables.
os.chdir(os.path.expanduser('~/corpus_location'))
# create the database file (or open it if it already exists)
conn = sqlite3.connect('mycorpus.db')
# get cursor
crsr = conn.cursor()
# string lines to initialize each table in the database
# note: "index" is a reserved word in SQLite, so it must be double-quoted to be used as a column name
create_documents = """CREATE TABLE documents (
    "index" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    document_title VARCHAR(50),
    document_addr VARCHAR(150));"""
create_part_of_speech = """CREATE TABLE part_of_speech (
    "index" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    pos_label VARCHAR(2));"""
create_word_location = """CREATE TABLE word_location (
    "index" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    location CHAR);"""
create_word_types = """CREATE TABLE word_types (
    "index" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    word_type_form VARCHAR(20),
    word_location INTEGER,
    pos_type INTEGER,
    FOREIGN KEY (word_location)
        REFERENCES word_location("index"),
    FOREIGN KEY (pos_type)
        REFERENCES part_of_speech("index"));"""
create_word_tokens = """CREATE TABLE word_tokens (
    "index" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    document_index INTEGER,
    sentence_index INTEGER,
    word_index INTEGER,
    word_type_index INTEGER,
    word_token_form VARCHAR(20),
    FOREIGN KEY (document_index)
        REFERENCES documents("index"),
    FOREIGN KEY (word_type_index)
        REFERENCES word_types("index"));"""
crsr.execute(create_documents)
crsr.execute(create_part_of_speech)
crsr.execute(create_word_location)
crsr.execute(create_word_types)
crsr.execute(create_word_tokens)
# set up word_location IOB tags
crsr.execute("INSERT INTO word_location(location) VALUES ('B'), ('I'), ('O');")
Loading the first file to insert.
Next, we use pickle to load a file that we want to insert into the database. pickle is a module that lets Python objects be saved to a file by one script and loaded back by another. Here, I use it to load a file whose contents have been preprocessed for insertion into the database. Note that this preprocessing step will be the subject of a later blog post.
os.chdir(os.path.expanduser('~/database_location/pickling'))
pickle_file_name = '9_txt.pkl'
f = open(pickle_file_name, 'rb')
doc_data = pickle.load(f)
f.close()
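The exact contents of the pickled file will be covered in the preprocessing post; for now, judging by how the tokens are handled below, doc_data is assumed to be a list of sentences, where each sentence is a list of (token, 'position-POS') pairs. A made-up illustration of that shape:
# hypothetical illustration of the structure assumed for doc_data:
# a list of sentences, each a list of (token, 'position-POS') pairs
doc_data_example = [
    [('Tus', 'B-CL'), ('Mob', 'O-NN')],                             # made-up tags
    [('Acute', 'B-FW'), ('Flaccid', 'I-FW'), ('Myelitis', 'I-FW')]  # made-up tags
]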
Inserting the document information.
The preprocessed data contains the text of the document, but not its name or original location. We insert these here with the SQL command INSERT INTO documents, filling in the file's name and original location from a tuple named document. We then call crsr.execute to run the SQL command and use lastrowid to retrieve the index the database has assigned to our newest document, so that we can refer to it when we begin inserting tokens from the file into the database.
document = ('Tus Mob Acute Flaccid Myelitis', 'https://www.dhs.wisconsin.gov/publications/p01298h.pdf')
insert_doc = "INSERT INTO docs (document_title, document_addr) VALUES ('" + document[0] + "', '" + document[1] + "');"
document_index = crsr.execute(insert_doc).lastrowid
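A note on style: the SQL strings in this post are built by concatenation, which works fine for trusted input, but a title or URL containing a single quote would break the statement. sqlite3's parameter substitution avoids that problem; a sketch of the same insertion with placeholders:
# alternative sketch: the same insertion using sqlite3 parameter substitution
insert_doc = "INSERT INTO documents (document_title, document_addr) VALUES (?, ?);"
document_index = crsr.execute(insert_doc, document).lastrowid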
Create a function to process each word.
Because each document contains hundreds of tokens, it would be tedious and error-prone to write out a new set of SQL commands for every insertion. Instead, we create a function named insert_word below to run each time we insert a word. The function has four parameters:
- word_tuple—a 3-tuple containing the token string, its word-position tag, and its part-of-speech tag
- doc_index_value—indicates the ID number for the document in the documents table
- sent_index_value—represents the position in sequence of the current sentence in the document
- word_index_value—represents the position in sequence of the current word in the current sentence
def insert_word(word_tuple, doc_index_value, sent_index_value, word_index_value):
    '''
    Inserts a word into the database, based on the word_tuple.
    @param word_tuple is a 3-tuple containing the token's form, the location within a word, and the part of speech
    @param doc_index_value is the index of the document from which the word is extracted
    @param sent_index_value is the index of the sentence in the document from which the word is extracted
    @param word_index_value is the index of the position of the word within its sentence
    '''
    # retrieve pos value if found, otherwise add pos value
    pos_results = crsr.execute("SELECT \"index\" FROM part_of_speech WHERE pos_label='" + word_tuple[2] + "';").fetchall()
    if len(pos_results) > 0:
        pos_label_index = pos_results[0][0]
    else:
        pos_label_index = crsr.execute("INSERT INTO part_of_speech (pos_label) VALUES ('" + word_tuple[2] + "');").lastrowid
    # retrieve relevant word_location value
    if word_tuple[1] in ['B', 'I', 'O']:
        word_loc_index = crsr.execute("SELECT \"index\" FROM word_location WHERE location='" + word_tuple[1] + "';").fetchone()[0]
    else:
        raise ValueError('Word location value is invalid at word (' + str(sent_index_value - 1) + ', '
                         + str(word_index_value - 1) + ').')
    # match the lowercased form, word_loc_index, and pos_label_index against word_types, and if a match, retrieve its index
    # if not, add the new type and get its index
    type_ = word_tuple[0].lower()
    type_results = crsr.execute("SELECT \"index\" FROM word_types WHERE word_type_form='" + type_ + "' AND word_location="
                                + str(word_loc_index) + " AND pos_type=" + str(pos_label_index) + ";").fetchall()
    if len(type_results) > 0:
        type_index = type_results[0][0]
    else:
        type_index = crsr.execute("INSERT INTO word_types (word_type_form, word_location, pos_type) VALUES ('" + type_ + "', "
                                  + str(word_loc_index) + ", " + str(pos_label_index) + ");").lastrowid
    # insert complete values into word_tokens
    crsr.execute("INSERT INTO word_tokens (document_index, sentence_index, word_index, word_type_index, word_token_form)"
                 + " VALUES (" + str(doc_index_value) + ", " + str(sent_index_value) + ", "
                 + str(word_index_value) + ", " + str(type_index) + ", '" + word_tuple[0] + "');")
Add each token to the database.
The next step cycles through the tokens in the file opened with pickle above, splits each token's combined 'position-POS' tag into its two parts, and calls insert_word to insert the token into the database. We then commit the changes and close the connection; once this step has run, we have finished inserting our first document into the database!
for i, sent in enumerate(doc_data):
    for j, word in enumerate(sent):
        # split the combined 'position-POS' tag into its two parts before inserting
        current_word = tuple([word[0]] + word[1].split('-'))
        insert_word(current_word, document_index, i + 1, j + 1)
conn.commit()
conn.close()
Conclusion
We can create a SQL database using the sqlite3 package to store our corpus data. Above, we saw how to create the tables for the corpus using SQL queries and how to insert our first document. In later posts, we will look at the preprocessing step that converts the original PDF into data ready to insert into the database, and at how to use the database to access and search our data.
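As a preview of that last point, once the database has been populated, a query along the following lines joins the tables back together to find every token carrying a given part of speech. This is only a sketch (it assumes a hypothetical 'NN' label exists in the part_of_speech table), and we will look at querying properly in a later post.
import sqlite3

# reopen the database, since the connection was closed above
conn = sqlite3.connect('mycorpus.db')
crsr = conn.cursor()
# sketch: retrieve every token carrying a given (hypothetical) POS label
query = """SELECT d.document_title, t.sentence_index, t.word_index, t.word_token_form, p.pos_label
           FROM word_tokens AS t
           JOIN word_types AS ty ON t.word_type_index = ty."index"
           JOIN part_of_speech AS p ON ty.pos_type = p."index"
           JOIN documents AS d ON t.document_index = d."index"
           WHERE p.pos_label = ?;"""
for row in crsr.execute(query, ('NN',)):
    print(row)
conn.close()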