Building the arXiv classifier - II
Part II: Natural language processing
There are many great introductory tutorials for natural language processing (NLP) freely available online (some examples are here and here). Some books I recommend are Speech and Language Processing by Dan Jurafsky and James H. Martin, and Natural Language Processing with Python by Loper, Klein, and Bird.
In this project I roughly follow the pipeline below, which is also formalized as the CRISP-DM model:
- Data gathering (done in Part I)
- Text pre-processing
- Parsing, exploratory analysis
- Feature engineering
- Modeling/pattern recognition
- Testing/evaluation
This post will focus largely on pre-processing, with some exploratory analysis.
Document set
From Part I we have a document set, essentially a dictionary of dictionaries like this:
<'astro', [dictionary of astronomy articles in the form <id, abstract>]>
<'cond', [dictionary of condensed matter articles in the form <id, abstract>]>
and so on.
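In Python terms, a minimal sketch of this structure (with made-up ids and abstracts, purely for illustration) might look like:
doc_set = {
    'astro': {'1801.01234': 'We study the rotation curves of nearby galaxies ...',
              '1801.05678': 'We present new observations of ...'},
    'cond': {'1802.04321': 'We compute the band structure of ...'},
    # ... one entry per arXiv category
}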
Normalization: Tokenization, stemming, lemmatization, removing stopwords
Here I used this particular set of packages from NLTK; feel free to experiment to get better results.
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
Tokenization
Tokenization simply means breaking a sentence into tokens: from 'At eight o'clock on Thursday morning Arthur didn't feel very good.'
to ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
That's it. There are some nuances in terms of the splitting rules; nltk.word_tokenize()
is a good place to start, and I used a RegexpTokenizer
to have slightly better control.
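As a quick illustration, here is a minimal sketch comparing the two (the sentence is the NLTK documentation example; word_tokenize needs the punkt tokenizer data, which you may have to download first):
import nltk
from nltk.tokenize import RegexpTokenizer

# nltk.download('punkt')  # uncomment if the punkt data is missing

sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."

# word_tokenize keeps punctuation as tokens and splits contractions ("didn't" -> "did", "n't")
print(nltk.word_tokenize(sentence))

# RegexpTokenizer(r'\w+') keeps only alphanumeric runs, dropping punctuation entirely
print(RegexpTokenizer(r'\w+').tokenize(sentence))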
Stemming
A word stem is also known as the base form of a word; we can create new words by attaching affixes to it in a process known as inflection, e.g. organize -> organize(d), organiz(ing), organize(r), organiz(ation) …
The process of converting inflected words back to the base form is known as stemming. It helps us by reducing the diversity of observed words (in technical parlance, “decreasing the vocabulary”) when they convey similar meaning; this usually improves recall and may improve precision.
The most common and empirically effective algorithm for stemming English text is Porter’s algorithm, which I use here.
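For instance, here is a small sketch of the Porter stemmer collapsing the inflected forms above onto a common stem (the exact stems may vary slightly between NLTK versions):
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
for word in ['organize', 'organized', 'organizing', 'organizer', 'organization']:
    print(word, '->', p_stemmer.stem(word))
# in my run all of these reduce to the same stem, 'organ'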
Lemmatization
Lemmatization is similar to stemming but is a more involved process: it reduces inflected words to their lemma. To do so, the algorithm needs detailed dictionaries it can look through to link each form back to its lemma. Such a dictionary may not be available, meaning you may have to build one yourself, and for some specific cases you may have to tune a dictionary of lemmas to a specific vocabulary.
The advantage of lemmatization is a reduction of noise after pre-processing, because with stemming the inflectional forms of different lemmas can end up with the same stem, which translates into noise in our search results. In fact, it is quite common for a single surface form to be an instance of several different lemmas.
The cost of lemmatization is its significant complexity and the deep linguistic knowledge required to create the dictionaries that allow the algorithm to look up the lemma.
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
– from Stanford NLP text
My take is that if you have the time and the resources, try and see if lemmatization helps in information retrieval.
Here are some lemmatizers you can try:
- WordNet lemmatizer (shoutout to the good people at Princeton University and my thesis advisor Dr Fellbaum)
import nltk
from nltk.stem import WordNetLemmatizer
- spaCy lemmatizer
import spacy
# Initialize the spaCy English model, keeping only the tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])  # in spaCy 3+ the model name is 'en_core_web_sm'
parsed = nlp(sentence)
" ".join([token.lemma_ for token in parsed])
I did not use lemmatization in this project because we are dealing with extremely technical documents (look up Flesch–Kincaid readability), so a traditional dictionary of lemmas may perform worse than expected. Of course, if this turns out to be a demonstrably erroneous hypothesis, I’ll incorporate a lemmatizer into the pipeline.
Removing stopwords
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that occurs frequently but does not convey much meaning. Removing stop words from the vocabulary increases the discriminative power of the remaining words in the document. Any standard stop word list for English will work admirably.
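In isolation, the filtering step looks like this (using NLTK’s built-in English list; the stopwords corpus has to be downloaded once, and the token list is just a made-up example):
from nltk.corpus import stopwords
# nltk.download('stopwords')  # needed once

en_stop = set(stopwords.words('english'))
tokens = ['we', 'study', 'the', 'rotation', 'curves', 'of', 'nearby', 'galaxies']
print([t for t in tokens if t not in en_stop])
# roughly: ['study', 'rotation', 'curves', 'nearby', 'galaxies']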
The code for the entire pre-processing pipeline is as follows:
def tokenize(doc_set):
    # create English stop words list
    en_stop = stopwords.words('english')

    # create p_stemmer of class PorterStemmer
    p_stemmer = PorterStemmer()

    # create tokenizer
    tokenizer = RegexpTokenizer(r'\w+')

    doc_texts = []

    # loop through document list
    for doc in doc_set:
        # doc is a tuple in the form (id, category, text)
        # clean and tokenize document string
        raw = doc[2].lower()
        tokens = tokenizer.tokenize(raw)

        # remove pure numbers (and negative numbers) from tokens
        no_digits = [i for i in tokens if not (i.isdigit() or (i[0] == '-' and i[1:].isdigit()))]

        # remove stop words from tokens
        stopped_tokens = [i for i in no_digits if i not in en_stop]

        # stem tokens
        stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

        # add tokens to list
        doc_texts.append(stemmed_tokens)

    return doc_texts
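And, assuming the document set from Part I has been flattened into (id, category, text) tuples as the function expects (the variable names here are hypothetical), a call might look like:
# flatten the <category, {id: abstract}> structure into (id, category, text) tuples
docs = [(doc_id, category, abstract)
        for category, articles in doc_set.items()
        for doc_id, abstract in articles.items()]

doc_texts = tokenize(docs)
print(doc_texts[0][:10])  # first ten normalized tokens of the first abstract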