Web interface for browsing, search and filtering recent arxiv submissions

Overview

arxiv sanity preserver

Update Nov 27, 2021: you may wish to look at my from-scratch re-write of arxiv-sanity: arxiv-sanity-lite. It is a smaller version of arxiv-sanity that focuses on the core value proposition, is significantly less likely to ever go down, scales better, and has a few additional goodies such as multiple possible tags per account, regular emails of new papers of interest, etc. It is also running live on arxiv-sanity-lite.com.

This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, to add papers to a personal library, and to get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website to any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py.

user interface

Code layout

There are two large parts of the code:

Indexing code. Uses Arxiv API to download the most recent papers in any categories you like, and then downloads all papers, extracts all text, creates tfidf vectors based on the content of each paper. This code is therefore concerned with the backend scraping and computation: building up a database of arxiv papers, calculating content vectors, creating thumbnails, computing SVMs for people, etc.

User interface. Then there is a web server (based on Flask/Tornado/sqlite) that allows searching through the database and filtering papers by similarity, etc.

Dependencies

Several: You will need numpy, feedparser (to process xml files), scikit learn (for tfidf vectorizer, training of SVM), flask (for serving the results), flask_limiter, and tornado (if you want to run the flask server in production). Also dateutil, and scipy. And sqlite3 for database (accounts, library support, etc.). Most of these are easy to get through pip, e.g.:

$ virtualenv env                # optional: use virtualenv
$ source env/bin/activate       # optional: use virtualenv
$ pip install -r requirements.txt

You will also need ImageMagick and pdftotext, which you can install on Ubuntu as sudo apt-get install imagemagick poppler-utils. Bleh, that's a lot of dependencies isn't it.

Processing pipeline

The processing pipeline requires you to run a series of scripts, and at this stage I really encourage you to manually inspect each script, as they may contain various inline settings you might want to change. In order, the processing pipeline is:

  1. Run fetch_papers.py to query arxiv API and create a file db.p that contains all information for each paper. This script is where you would modify the query, indicating which parts of arxiv you'd like to use. Note that if you're trying to pull too many papers arxiv will start to rate limit you. You may have to run the script multiple times, and I recommend using the arg --start-index to restart where you left off when you were last interrupted by arxiv.
  2. Run download_pdfs.py, which iterates over all papers in parsed pickle and downloads the papers into folder pdf
  3. Run parse_pdf_to_text.py to export all text from pdfs to files in txt
  4. Run thumb_pdf.py to export thumbnails of all pdfs to thumb
  5. Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves a tfidf.p, tfidf_meta.p and sim_dict.p pickle files.
  6. Run buildsvm.py to train SVMs for all users (if any), exports a pickle user_sim.p
  7. Run make_cache.py for various preprocessing so that server starts faster (and make sure to run sqlite3 as.db < schema.sql if this is the very first time ever you're starting arxiv-sanity, which initializes an empty database).
  8. Start the mongodb daemon in the background. Mongodb can be installed by following the instructions here - https://docs.mongodb.com/tutorials/install-mongodb-on-ubuntu/.
  • Start the mongodb server with - sudo service mongod start.
  • Verify if the server is running in the background : The last line of /var/log/mongodb/mongod.log file must be - [initandlisten] waiting for connections on port
  1. Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers!

Optionally you can also run the twitter_daemon.py in a screen session, which uses your Twitter API credentials (stored in twitter.txt) to query Twitter periodically looking for mentions of papers in the database, and writes the results to the pickle file twitter.p.

I have a simple shell script that runs these commands one by one, and every day I run this script to fetch new papers, incorporate them into the database, and recompute all tfidf vectors/classifiers. More details on this process below.

protip: numpy/BLAS: The script analyze.py does quite a lot of heavy lifting with numpy. I recommend that you carefully set up your numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With ~25,000 papers and ~5000 users the script runs in several hours on my current machine with a BLAS-linked numpy.

Running online

If you'd like to run the flask server online (e.g. AWS) run it as python serve.py --prod.

You also want to create a secret_key.txt file and fill it with random text (see top of serve.py).

Current workflow

Running the site live is not currently set up for a fully automatic plug and play operation. Instead it's a bit of a manual process and I thought I should document how I'm keeping this code alive right now. I have a script that performs the following update early morning after arxiv papers come out (~midnight PST):

python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py

I run the server in a screen session, so screen -S serve to create it (or -r to reattach to it) and run:

python serve.py --prod --port 80

The server will load the new files and begin hosting the site. Note that on some systems you can't use port 80 without sudo. Your two options are to use iptables to reroute ports or you can use setcap to elavate the permissions of your python interpreter that runs serve.py. In this case I'd recommend careful permissions and maybe virtualenv, etc.

Owner
Andrej
I like to train Deep Neural Nets on large datasets.
Andrej
Geodesic Dome Math

dome Geodesic Dome Math Python dome tool dome.py calculates an icosahedron or 2v geodesic dome and creates 3d printable hubs as OpenSCAD sources. usag

Brian Olson 2 Feb 09, 2022
This scrypt for auto brightness control

God damn. This scrypt for auto brightness control. The scrypt has voice assistant. You should move this script to auto-upload folder. What do you need

0 Jul 25, 2022
urlwatch is intended to help you watch changes in webpages and get notified of any changes.

urlwatch is intended to help you watch changes in webpages and get notified (via e-mail, in your terminal or through various third party services) of any changes.

Thomas Perl 2.5k Jan 08, 2023
Eros is an expiremental programming language built using simple Python code.

Eros is an expiremental programming language built using simple Python code. Featuring an easy syntax and unique features like type slicing, the language remains an expirement that grows in down time

zxro 2 Nov 21, 2021
Displays Christmas-themed ASCII art

Christmas Color Scripts Displays Christmas-themed ASCII art. This was mainly inspired by DistroTube's Shell Color Scripts Screenshots ASCII Shadow Tex

1 Aug 09, 2022
Hypothesis strategies for generating Python programs, something like CSmith

hypothesmith Hypothesis strategies for generating Python programs, something like CSmith. This is definitely pre-alpha, but if you want to play with i

Zac Hatfield-Dodds 73 Dec 14, 2022
Render your templates using .txt files

PizzaX About Run Run tests To run the tests, open your terminal and type python tests.py (WIN) or python3 tests.py (UNX) Using the function To use the

Marcello Belanda 2 Nov 24, 2021
A Web app to Cross-Seed torrents in Deluge/qBittorrent/Transmission

SeedCross A Web app to Cross-Seed torrents in Deluge/qBittorrent/Transmission based on CrossSeedAutoDL Require Jackett Deluge/qBittorrent/Transmission

ccf2012 76 Dec 19, 2022
Python3 Interface to numa Linux library

py-libnuma is python3 interface to numa Linux library so that you can set task affinity and memory affinity in python level for your process which can help you to improve your code's performence.

Dalong 13 Nov 10, 2022
Export transactions for an algorand wallet to a CSV file

algorand_txn_csv_exporter - (Algorand transaction CSV exporter) This script will export transactions for an algorand wallet to a CSV file. It is inten

TeneoPython01 5 Jun 19, 2022
carrier.py is a Python package/module that's used to save time when programming

carrier.py is a Python package/module that's used to save time when programming, it helps with functions such as 24 and 12 hour time, Discord webhooks, etc

Zacky2613 2 Mar 20, 2022
External Network Pentest Automation using Shodan API and other tools.

Chopin External Network Pentest Automation using Shodan API and other tools. Workflow Input a file containing CIDR ranges. Converts CIDR ranges to ind

Aditya Dixit 9 Aug 04, 2022
Python bindings for `ign-msgs` and `ign-transport`

Python Ignition This project aims to provide Python bindings for ignition-msgs and ignition-transport. It is a work in progress... C++ and Python libr

Rhys Mainwaring 3 Nov 08, 2022
Chemical equation balancer

Chemical equation balancer Balance your chemical equations with ease! Installation $ git clone

Marijan Smetko 4 Nov 26, 2022
Test to grab m3u from YouTube live.

YouTube_to_m3u https://raw.githubusercontent.com/benmoose39/YouTube_to_m3u/main/youtube.m3u Updated m3u links of YouTube live channels, auto-updated e

136 Jan 06, 2023
Streamlit Component, for a Chatbot UI

st-chat Streamlit Component, for a Chat-bot UI, example app authors - @yashppawar & @YashVardhan-AI Installation Install streamlit-chat with pip pip i

Yash AI 99 Jan 07, 2023
An example of Connecting a MySQL Database with Python Code

An example of Connecting a MySQL Database with Python Code And How to install Table of contents General info Technologies Setup General info In this p

Mohammad Hosseinzadeh 1 Nov 23, 2021
Strong Typing in Python with Decorators

typy Strong Typing in Python with Decorators Description This light-weight library provides decorators that can be used to implement strongly-typed be

Ekin 0 Feb 06, 2022
PKU team for 2021 project 'Guangchangwu detection'.

PKU team for 2021 project 'Guangchangwu detection'.

Helin Wang 3 Feb 21, 2022
This repo is a collection of programs and websites templates too

📢 Register here for Hacktoberfest and make four pull requests (PRs) between October 1st-31st to grab free SWAGS 🔥 . IMPORTANT While making pull requ

Binayak Jha - 2 7 Oct 03, 2022