Building the arXiv classifier - I

Part I: Getting the dataset

The arXiv dataset

arXiv is an online repository of preprints of scientific papers in the fields of astronomy, physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics. To date it hosts more than a million papers, and more are added every day. The dataset I focused on is a relatively recent (2007-17) sample totaling approximately 800,000 metadata records, which I curated via a data dump using the arXiv APIs. It contains a significant number of papers (>5,000) from each of the roughly ten top-level categories submitted over the past decade.

Bulk access of arXiv metadata

For harvesting arXiv data year by year

(Please read here and here) (This SO thread helps a lot too)

Please do not DDoS the arXiv server; I accept no responsibility if you get into trouble doing this.

arXiv supports bulk access to its article metadata (updated daily) as well as real-time programmatic access to metadata via the arXiv API.

A sample HTTP query looks like this:

http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc&from=2007-05-23&until=2015-05-24

Here we have set=math (restricting results to the math set), with from=2007-05-23 and until=2015-05-24 bounding the datestamp range.
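If you want to poke at the API before committing to a full harvest, a minimal sketch using only the Python standard library (with the same parameter values as the URL above) looks like this:

import urllib.request
from urllib.parse import urlencode

base_url = "http://export.arxiv.org/oai2"
params = {
    "verb": "ListIdentifiers",
    "set": "math",
    "metadataPrefix": "oai_dc",
    "from": "2007-05-23",
    "until": "2015-05-24",
}

# issue the query and print the start of the XML response
response = urllib.request.urlopen(base_url + "?" + urlencode(params))
print(response.read().decode("utf-8")[:500])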

A sample response looks like this:

<Records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Record>
    <header status="">
      <identifier>oai:arXiv.org:0704.0004</identifier>
      <datestamp>2007-05-23</datestamp>
      <setSpec>math</setSpec>
    </header>
    <metadata>
      <arXiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd">
        <id>0704.0004</id>
        <created>2007-03-30</created>
        <authors>
          <author>
            <keyname>Callan</keyname>
            <forenames>David</forenames>
          </author>
        </authors>
        <title>A determinant of Stirling cycle numbers counts unlabeled acyclic
         single-source automata</title>
        <categories>math.CO</categories>
        <comments>11 pages</comments>
        <msc-class>05A15</msc-class>
        <abstract>We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant.
         </abstract>
      </arXiv>
    </metadata>
    <about/>
  </Record>
</Records>

Every response contains a list of <Record> elements under a <Records> tag.
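To get a feel for the structure before writing the harvester, you can parse a saved response like the one above and pull out the fields we will eventually care about. A quick sketch (the file name is just a placeholder for wherever you saved the response):

import io

from bs4 import BeautifulSoup

# 'sample_response.xml' is a placeholder for a saved response like the one above
soup = BeautifulSoup(io.open("sample_response.xml", encoding="utf-8"), "xml")

for record in soup.find_all("Record"):
    identifier = record.find("identifier").text
    categories = record.find("categories").text
    abstract = record.find("abstract").text.strip()
    print(identifier, categories, abstract[:60])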

However, if your query matches more than 1,000 articles, the response is truncated and ends with a resumptionToken, and the server effectively rate-limits you. To deal with that, I wrote a script that waits 20-30 seconds before issuing the HTTP query again with the resumption token. Something like this:

import io
import os
import time
import urllib.request

from bs4 import BeautifulSoup


# harvests one year's worth of arXiv articles
def harvest_by_year(year):
    save_path = "../Data/raw"
    filename = "arXiv" + str(year) + ".xml"
    filename = os.path.join(save_path, filename)
    f = io.open(filename, 'a', encoding="utf-8")
    first_url = "http://export.arxiv.org/oai2?verb=ListRecords&from=" + \
        str(year) + "-01-01&until=" + \
        str(year) + "-12-31&metadataPrefix=arXiv"
    data = urllib.request.urlopen(first_url).read()
    soup = BeautifulSoup(data, 'lxml')
    f.write(soup.prettify())

    token = soup.find('resumptiontoken').text
    resume = True

    # loop over resumption tokens until the server stops issuing them
    while resume:
        # wait so we do not hammer the server
        time.sleep(21)
        url = 'http://export.arxiv.org/oai2?verb=ListRecords&resumptionToken=' + token

        next_data = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(next_data, 'lxml')
        f.write(soup.prettify())
        # an empty or missing resumptionToken marks the last page
        token_tag = soup.find('resumptiontoken')
        if token_tag is not None and token_tag.text.strip() != "":
            token = token_tag.text
        else:
            resume = False

    f.close()
    return

I then used BeautifulSoup to clean up the responses, join the XML, and remove the resumptionToken tags. Note that the joined XML files can get very large (big data woooo) even for a single year's worth.
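The joining step itself is not shown above, but it amounts to something like the following sketch. The bulk-file name matches the one used in the next section; the glob pattern and parser choice are assumptions based on the harvesting script rather than the exact code I ran:

import glob
import io

from bs4 import BeautifulSoup

# concatenate the yearly dumps into one bulk file, dropping the
# resumptionToken tags since they carry no article metadata
# (the 'arXiv20*.xml' pattern assumes files named as in harvest_by_year)
with io.open("../Data/raw/arXivbulk.xml", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("../Data/raw/arXiv20*.xml")):
        soup = BeautifulSoup(io.open(path, encoding="utf-8"), "lxml")
        # tag names are lowercase here because the harvester stored HTML-parsed output
        for tag in soup.find_all("resumptiontoken"):
            tag.decompose()
        out.write(soup.prettify())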

Alternative: bulk download

If your small-scale tests work - and I fully encourage you to do small-scale tests - you can go here and download full data sets of arXiv articles as well as metadata.

Wrangling text (yes that’s the technical term) to get what we need

Getting everything in order

Use SoupStrainer from BeautifulSoup to parse out the identifier, abstract, and categories tags, zip them into a list of tuples, and dump the result via pickle.

identifier is a string like oai:arXiv.org:0704.0004; it suffices to take only the rear chunk, 0704.0004.

categories is a space-separated list of one or more strings like math.CO. We are taking only the first category for the purposes of this project; the natural extension would be to take the first n categories and build a multi-class, multi-label classifier.
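Concretely, the string handling for both fields boils down to a couple of one-liners (the multi-category string here is a made-up example):

identifier = "oai:arXiv.org:0704.0004"
categories = "math.CO math.PR"   # hypothetical record with two categories

arxiv_id = identifier[14:]               # '0704.0004'  (len("oai:arXiv.org:") == 14)
first_cat = categories.split(' ', 1)[0]  # 'math.CO'    (keep only the first category)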

Use utf-8 encoding for the text because of the prevalence of mathematical symbols in these scientific papers.

If you read closely, you'll notice that I opened the file three times, each time with a different strainer. This is done on purpose: by "straining" the file, only the matching tags are kept in the parse tree, so the objects you get after straining are much, much smaller than the full set of records.

    filename = "../Data/raw/arXivbulk.xml"

    strainer_id = SoupStrainer("identifier")
    soup_id = BeautifulSoup(io.open(filename, encoding="utf-8"), "xml", parse_only=strainer_id)
    # truncate just to get id
    id_list = [x[14:] for x in soup_id.strings]

    strainer_abs = SoupStrainer("abstract")
    soup_abs = BeautifulSoup(io.open(filename, encoding="utf-8"), "xml", parse_only=strainer_abs)
    # clean newline and whitespace from abs
    abs_list = [" ".join(x.text.replace('\n', ' ').strip().split()) for x in soup_abs.find_all('abstract')]

    # reduce categories to the first big category in the first word
    strainer_cat = SoupStrainer("categories")
    soup_set = BeautifulSoup(io.open(filename, encoding="utf-8"), "xml", parse_only=strainer_cat)
    set_list = [x.split(' ', 1)[0].split('.', 1)[0] for x in soup_set.strings]
    

    # build a dictionary with key = id, value = tuple of other things
    keys = id_list
    values = list(zip(set_list, abs_list))
    print(values.__len__())
    article_dic = dict(set(zip(keys, values)))
    print(article_dic.keys().__len__())

    dictname = "../Data/dict/full_articleset.p"
    pickle.dump(article_dic, open(dictname, "wb"))

This will get you a dictionary with key = id and value = a tuple of (category, abstract), like this:

<'0704.0004', ('math.CO', 'We show that a determinant ...')>
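To sanity-check the dump, reload the pickle and look up an entry:

import pickle

article_dic = pickle.load(open("../Data/dict/full_articleset.p", "rb"))
print(article_dic['0704.0004'])
# ('math.CO', 'We show that a determinant of Stirling cycle numbers counts ...')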

We are not done yet

Since there are many more possibilities once the sub-categories are involved (i.e. the 'CO' part of 'math.CO'), I'm going to combine everything under a few big categories and split the dictionary into per-category lists labeled by their big category.

The categories I've chosen are: astro(nomy), cond(ensed matter), cs, hep (high-energy physics), math, physics, qbio (quantitative biology), qfin (quantitative finance), quant(um mechanics), stat(istics), and others (everything else, not a big set).

dictname = "../Data/dict/full_articleset.p"
    article_dic = pickle.load(open(dictname, "rb"))

    # keys that look like this oai:arXiv.org:adap-org/9806001 old version, do not use
    dict9107 = {key: article_dic[key] for key in list(article_dic.keys()) if '/' in key}
    # dict9107 is currently unused

    # keys that look like this oai:arXiv.org:0704.0010
    dict0704 = {key: article_dic[key] for key in list(article_dic.keys()) if '/' not in key}

    # build individual lists
    astro = []
    cond = []
    cs = []
    hep = []
    math = []
    physics = []
    qbio = []
    qfin = []
    quant = []
    stat = []
    others = []
    for key, value in dict0704.items():
        if 'astro' in value[0]:
            astro.append((key, value[0], value[1]))
        elif 'cond' in value[0]:
            cond.append((key, value[0], value[1]))
        elif any(ext in value[0] for ext in ['chao', 'gr-qc', 'nlin', 'nucl', 'physics', 'phys']):
            physics.append((key, value[0], value[1]))
        elif 'cs' in value[0]:
            cs.append((key, value[0], value[1]))
        elif 'hep' in value[0]:
            hep.append((key, value[0], value[1]))
        elif 'math' in value[0]:
            math.append((key, value[0], value[1]))
        elif 'q-bio' in value[0]:
            qbio.append((key, value[0], value[1]))
        elif 'q-fin' in value[0]:
            qfin.append((key, value[0], value[1]))
        elif 'quant' in value[0]:
            quant.append((key, value[0], value[1]))
        elif 'stat' in value[0]:
            stat.append((key, value[0], value[1]))
        else:
            others.append((key, value[0], value[1]))

    # dictionary for pickle dump
    # this dictionary is in the form subject: (id, category, abstract)
    bigcat_dict = {'astro': astro, 'cond': cond, 'cs': cs, 'hep': hep, 'math': math, 'physics': physics,
                  'qbio': qbio, 'qfin': qfin, 'quant': quant, 'stat': stat, 'others': others}
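The dump and statistics code is not shown in full here, but a rough sketch looks like this (the pickle file name is a placeholder, and this sketch skips the per-year breakdown shown in the sample output below):

import pickle

# checkpoint the per-category dataset for the next stage
# ('bigcat_articleset.p' is a placeholder name)
pickle.dump(bigcat_dict, open("../Data/dict/bigcat_articleset.p", "wb"))

# sanity statistics
print("number of articles:", len(dict0704))
print("number of abstracts:", sum(len(v) for v in bigcat_dict.values()))

print("unique first tags")
print(sorted({value[0] for value in dict0704.values()}))

print("Categories:")
for name, articles in bigcat_dict.items():
    print(name, len(articles))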

If you print out some statistics about your article set now (the sample below is for articles from 2011), you will see something like this:

----2011----

// These two must match for obvious reasons
number of articles: 63251
number of abstracts: 63251

unique first tags
['acc-phys', 'adap-org', 'alg-geom', 'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 'astro-ph.SR', 'chao-dyn', 'chem-ph', 'cond-mat', 'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.other', 'cond-mat.quant-gas', 'cond-mat.soft', 'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con', 'cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY', 'dg-ga', 'funct-an', 'gr-qc', 'hep-ex', 'hep-lat', 'hep-ph', 'hep-th', 'math-ph', 'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 'math.DG', 'math.DS', 'math.FA', 'math.GM', 'math.GN', 'math.GR', 'math.GT', 'math.HO', 'math.KT', 'math.LO', 'math.MG', 'math.NA', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST', 'nlin.AO', 'nlin.CD', 'nlin.CG', 'nlin.PS', 'nlin.SI', 'nucl-ex', 'nucl-th', 'physics.acc-ph', 'physics.ao-ph', 'physics.atm-clus', 'physics.atom-ph', 'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph', 'physics.space-ph', 'q-alg', 'q-bio.BM', 'q-bio.CB', 'q-bio.GN', 'q-bio.MN', 'q-bio.NC', 'q-bio.OT', 'q-bio.PE', 'q-bio.QM', 'q-bio.SC', 'q-bio.TO', 'q-fin.CP', 'q-fin.GN', 'q-fin.PM', 'q-fin.PR', 'q-fin.RM', 'q-fin.ST', 'q-fin.TR', 'quant-ph', 'solv-int', 'stat.AP', 'stat.CO', 'stat.ME', 'stat.ML', 'stat.OT']

Categories:
astro 7500
cond 7927
cs 5937
hep 12420
math 16650
physics 8388
qbio 626
qfin 342
quant 2721
stat 712
others 28

I'll add pretty graphs later; this will suffice to show that you have made it this far (yay!).

I highly recommend dumping all of this into a pickle file: if you run this pipeline correctly once, the result serves as the base dataset for all the subsequent work. It is also good to have clean, separable, and verifiable checkpoints so you can confirm that the transformation of the data at each checkpoint is correct. This lets you debug faster, refactor code more easily, and have a contractual data structure between different sections of your pipeline.

Next we are ready to look at some properties of the text.

Written on October 1, 2018