Building the arXiv classifier - I
Part I: Getting the dataset
The arXiv dataset
The arXiv is an online repository of preprints of scientific papers in astronomy, physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. To date it hosts more than a million papers, and more are added every day. The dataset I focused on is a relatively recent (2007-17) sample of roughly 800,000 metadata records, which I curated from a data dump obtained via the arXiv APIs. It contains a significant number of papers (>5,000) from each of the ~10 top-level categories submitted over the past decade.
Bulk access of arXiv metadata
For harvesting arXiv data year by year
(Please read here and here) (This SO thread helps a lot too)
Please do not DDoS the arXiv server; I accept no responsibility if you get into trouble doing this.
arXiv supports bulk access of their article metadata (updated daily) as well as real-time programmatic access to metadata via the arXiv API.
A sample http query looks like this:
http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc&from=2007-05-23&until=2015-05-24
Here we have set=math, which restricts results to the math archive, and from=2007-05-23&until=2015-05-24, which bounds the date range of the records returned.
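If you just want to poke at the endpoint before writing a full harvester, a minimal sketch (reusing the same parameters as the example query above) might look like this:

import urllib.request

# build the OAI-PMH query from the pieces described above
base = "http://export.arxiv.org/oai2?verb=ListIdentifiers"
params = "&set=math&metadataPrefix=oai_dc&from=2007-05-23&until=2015-05-24"

# fetch the raw XML response and print the first few hundred characters
with urllib.request.urlopen(base + params) as response:
    xml = response.read().decode("utf-8")
print(xml[:500])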
A sample response looks like this:
<Records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Record>
    <header status="">
      <identifier>oai:arXiv.org:0704.0004</identifier>
      <datestamp>2007-05-23</datestamp>
      <setSpec>math</setSpec>
    </header>
    <metadata>
      <arXiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd">
        <id>0704.0004</id>
        <created>2007-03-30</created>
        <authors>
          <author>
            <keyname>Callan</keyname>
            <forenames>David</forenames>
          </author>
        </authors>
        <title>A determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata</title>
        <categories>math.CO</categories>
        <comments>11 pages</comments>
        <msc-class>05A15</msc-class>
        <abstract>We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant.</abstract>
      </arXiv>
    </metadata>
    <about/>
  </Record>
</Records>
Every response has a list of <Record> elements under a <Records> tag.
However, if you query more than 1000 articles at once you will get a resumptionToken, and effectively the server will rate limit you. To get around that I wrote a script that waits 20-30 seconds before issuing the HTTP query again with the resumption token. Something like this:
import io
import os
import time
import urllib.request

from bs4 import BeautifulSoup


# harvests 1 year worth of arXiv articles
def harvest_by_year(year):
    save_path = "../Data/raw"
    filename = "arXiv" + str(year) + ".xml"
    filename = os.path.join(save_path, filename)
    f = io.open(filename, 'a', encoding="utf-8")
    # first request covers the whole year
    first_url = "http://export.arxiv.org/oai2?verb=ListRecords&from=" + \
                str(year) + "-01-01&until=" + \
                str(year) + "-12-31&metadataPrefix=arXiv"
    data = urllib.request.urlopen(first_url).read()
    soup = BeautifulSoup(data, 'lxml')
    f.write(soup.prettify())
    token = soup.find('resumptiontoken').text
    resume = True
    # loop over resumption tokens till the end
    while resume:
        # wait for server
        time.sleep(21)
        url = 'http://export.arxiv.org/oai2?verb=ListRecords&resumptionToken=' + token
        next_data = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(next_data, 'html.parser')
        f.write(soup.prettify())
        if soup.find('resumptiontoken') is not None:
            token = soup.find('resumptiontoken').text
            # an empty token means we have reached the end
            if token == "":
                resume = False
                break
        else:
            resume = False
            break
    f.close()
    return
I then used BeautifulSoup to clean up the responses, join the XML, and remove the resumptionToken tags. Note that the joined XML files can get very large (big data woooo) even for one year's worth.
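That cleanup/join step is not shown in the script above; a minimal sketch of what I mean (the clean_and_join helper, the output filename and the paths are just illustrative, matching the files written by harvest_by_year) could look like this:

import io

from bs4 import BeautifulSoup


# a sketch: strip resumptionToken tags from each harvested yearly file
# and append the cleaned records to one combined file (paths are assumptions)
def clean_and_join(years, outfile="../Data/raw/arXivbulk.xml"):
    with io.open(outfile, "w", encoding="utf-8") as out:
        for year in years:
            infile = "../Data/raw/arXiv" + str(year) + ".xml"
            soup = BeautifulSoup(io.open(infile, encoding="utf-8"), "lxml")
            # drop every resumptionToken so only the records remain
            for token in soup.find_all("resumptiontoken"):
                token.decompose()
            out.write(soup.prettify())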
Alternative: bulk download
If small scale tests work - and I fully encourage you to do small scale tests - then you can go here and download full data sets of arXiv articles as well as metadata.
Wrangling text (yes that’s the technical term) to get what we need
Getting everything in order
Use SoupStrainer from BeautifulSoup to parse out the identifier, abstract and categories tags, zip them into a list of tuples, and dump it via pickle.
identifier is a string like oai:arXiv.org:0704.0004; it suffices to take only the rear chunk, 0704.0004.
categories is a list of one or more strings like math.CO; we take only the first category for the purposes of this project. The natural extension would be to take the first n categories and build a multi-class, multi-label classifier (a short sketch of this appears after the code below).
Use utf-8 encoding for the text because of the prevalence of mathematical symbols in these scientific papers.
If you read closely, I open the file three times, each time with a different strainer. This is done on purpose: by "straining" the file it is not loaded into memory in its entirety, and the objects you get after straining are much, much smaller than the full set of records.
import io
import pickle

from bs4 import BeautifulSoup, SoupStrainer

filename = "../Data/raw/arXivbulk.xml"

# strain out only the <identifier> tags
strainer_id = SoupStrainer("identifier")
soup_id = BeautifulSoup(io.open(filename, encoding="utf-8"), "xml", parse_only=strainer_id)
# truncate just to get id, e.g. "oai:arXiv.org:0704.0004" -> "0704.0004"
id_list = [x[14:] for x in soup_id.strings]

# strain out only the <abstract> tags
strainer_abs = SoupStrainer("abstract")
soup_abs = BeautifulSoup(io.open(filename, encoding="utf-8"), "xml", parse_only=strainer_abs)
# clean newline and whitespace from abs
abs_list = [" ".join(x.text.replace('\n', ' ').strip().split()) for x in soup_abs.find_all('abstract')]

# strain out only the <categories> tags and
# reduce categories to the first big category in the first word
strainer_cat = SoupStrainer("categories")
soup_set = BeautifulSoup(io.open(filename, encoding="utf-8"), "xml", parse_only=strainer_cat)
set_list = [x.split(' ', 1)[0].split('.', 1)[0] for x in soup_set.strings]

# build a dictionary with key = id, value = tuple of other things
keys = id_list
values = list(zip(set_list, abs_list))
print(len(values))
article_dic = dict(set(zip(keys, values)))
print(len(article_dic))

dictname = "../Data/dict/full_articleset.p"
pickle.dump(article_dic, open(dictname, "wb"))
This will get you a dictionary with key = id and value = a tuple of the category and abstract, like this:
'0704.0004': ('math.CO', 'We show that a determinant ...')
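As an aside, the multi-label extension mentioned earlier would only need a small change where set_list is built; a hypothetical variant (the name multi_set_list and the choice of n are mine) keeping the first n categories per article:

# hypothetical variant: keep the first n categories per article instead of only the first
n = 3
multi_set_list = [x.split()[:n] for x in soup_set.strings]
# e.g. "math.CO math.PR" -> ['math.CO', 'math.PR']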
We are not done yet
Since there are many more possibilities once the sub-categories are involved (i.e. the 'CO' part of 'math.CO'), I'm going to combine everything under a few big categories and split the dictionary into smaller dictionaries labeled by their category.
The categories I've chosen are: astro(nomy), cond(ensed matter), cs, hep (high-energy physics), math, physics, qbio (quantitative biology), qfin (quantitative finance), quant(um mechanics), stat(istics), and others (everything else; not a big set).
import pickle

dictname = "../Data/dict/full_articleset.p"
article_dic = pickle.load(open(dictname, "rb"))

# keys that look like oai:arXiv.org:adap-org/9806001 -- old-style ids, do not use
dict9107 = {key: article_dic[key] for key in list(article_dic.keys()) if '/' in key}
# dict9107 is currently unused

# keys that look like oai:arXiv.org:0704.0010 -- new-style ids
dict0704 = {key: article_dic[key] for key in list(article_dic.keys()) if '/' not in key}

# build individual lists
astro = []
cond = []
cs = []
hep = []
math = []
physics = []
qbio = []
qfin = []
quant = []
stat = []
others = []

for key, value in dict0704.items():
    if 'astro' in value[0]:
        astro.append((key, value[0], value[1]))
    elif 'cond' in value[0]:
        cond.append((key, value[0], value[1]))
    elif any(ext in value[0] for ext in ['chao', 'gr-qc', 'nlin', 'nucl', 'physics', 'phys']):
        physics.append((key, value[0], value[1]))
    elif 'cs' in value[0]:
        cs.append((key, value[0], value[1]))
    elif 'hep' in value[0]:
        hep.append((key, value[0], value[1]))
    elif 'math' in value[0]:
        math.append((key, value[0], value[1]))
    elif 'q-bio' in value[0]:
        qbio.append((key, value[0], value[1]))
    elif 'q-fin' in value[0]:
        qfin.append((key, value[0], value[1]))
    elif 'quant' in value[0]:
        quant.append((key, value[0], value[1]))
    elif 'stat' in value[0]:
        stat.append((key, value[0], value[1]))
    else:
        others.append((key, value[0], value[1]))

# dictionary for pickle dump
# this dictionary is in the form subject: list of (id, category, abstract) tuples
bigcat_dict = {'astro': astro, 'cond': cond, 'cs': cs, 'hep': hep, 'math': math, 'physics': physics,
               'qbio': qbio, 'qfin': qfin, 'quant': quant, 'stat': stat, 'others': others}
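The dump itself is a one-liner; the filename below is my choice, mirroring the earlier checkpoint path:

# checkpoint: dump the per-category dictionary (filename is an assumption)
pickle.dump(bigcat_dict, open("../Data/dict/bigcat_dict.p", "wb"))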
If you print out some statistics about your article set now, you will see something like this:
----2011----
// These two must match for obvious reasons
number of articles: 63251
number of abstracts: 63251
unique first tags
['acc-phys', 'adap-org', 'alg-geom', 'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 'astro-ph.SR', 'chao-dyn', 'chem-ph', 'cond-mat', 'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.other', 'cond-mat.quant-gas', 'cond-mat.soft', 'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con', 'cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY', 'dg-ga', 'funct-an', 'gr-qc', 'hep-ex', 'hep-lat', 'hep-ph', 'hep-th', 'math-ph', 'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 'math.DG', 'math.DS', 'math.FA', 'math.GM', 'math.GN', 'math.GR', 'math.GT', 'math.HO', 'math.KT', 'math.LO', 'math.MG', 'math.NA', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST', 'nlin.AO', 'nlin.CD', 'nlin.CG', 'nlin.PS', 'nlin.SI', 'nucl-ex', 'nucl-th', 'physics.acc-ph', 'physics.ao-ph', 'physics.atm-clus', 'physics.atom-ph', 'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph', 'physics.space-ph', 'q-alg', 'q-bio.BM', 'q-bio.CB', 'q-bio.GN', 'q-bio.MN', 'q-bio.NC', 'q-bio.OT', 'q-bio.PE', 'q-bio.QM', 'q-bio.SC', 'q-bio.TO', 'q-fin.CP', 'q-fin.GN', 'q-fin.PM', 'q-fin.PR', 'q-fin.RM', 'q-fin.ST', 'q-fin.TR', 'quant-ph', 'solv-int', 'stat.AP', 'stat.CO', 'stat.ME', 'stat.ML', 'stat.OT']
Categories:
astro 7500
cond 7927
cs 5937
hep 12420
math 16650
physics 8388
qbio 626
qfin 342
quant 2721
stat 712
others 28
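For reference, the category counts above can be produced with a couple of lines over dict0704 and bigcat_dict; this is a sketch of mine, not the exact script that generated that output:

# a sketch of the summary printout (not the exact original script)
print("number of articles:", len(dict0704))
print("Categories:")
for name, articles in bigcat_dict.items():
    print(name, len(articles))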
I'll add pretty graphs later; this will suffice to show that you have made it this far (yay!).
I highly recommend dumping all of this into a pickle file: once you have run this pipeline correctly, the result serves as the base dataset for all the subsequent work. It is also good to have clean, separable, and verifiable checkpoints to ensure that the transformation of the data at each stage is correct. This lets you debug faster, refactor code more easily, and have a contractual data structure between different sections of your pipeline.
Next we are ready to look at some properties of the text.