Transformer
The Transformer is a model that uses attention to boost the speed with which such models can be trained. Its biggest benefit comes from how readily the Transformer lends itself to parallelization.
The model consists of an encoding component, a decoding component, and connections between them. The encoding component is a stack of encoders. Each encoder has two sub-layers: a self-attention layer (a layer that helps the encoder look at other words in the input sentence as it encodes a specific word) followed by a feed-forward layer.
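To make the structure of one encoder concrete, here is a minimal sketch in NumPy. It is an illustration under simplifying assumptions, not the full architecture: a single attention head, random untrained weights, and no residual connections or layer normalization. All function and variable names here are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Every word's representation is mixed with every other word's,
    # weighted by how strongly the query of one matches the key of the other.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def encoder_layer(x, attn_weights, w1, b1, w2, b2):
    # Sub-layer 1: self-attention over the whole input sentence.
    h = self_attention(x, *attn_weights)
    # Sub-layer 2: a feed-forward network applied to each position.
    return np.maximum(0, h @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n_words = 8, 16, 5      # toy sizes, chosen for the example
x = rng.normal(size=(n_words, d_model))  # one embedding vector per word
attn = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = encoder_layer(x, attn, w1, b1, w2, b2)
print(out.shape)  # one output vector per input word: (5, 8)
```

Note that the feed-forward step treats each word independently, which is one reason the architecture parallelizes well: all positions can be computed at once.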