PaperRobot: a paper crawler that can quickly download numerous papers, facilitating paper studying and management

Overview

PaperRobot

PaperRobot 是一个论文抓取工具,可以快速批量下载大量论文,方便后期进行持续的论文管理与学习。

PaperRobot通过多个接口抓取论文,目前抓取成功率维持在90%以上。通过配置Config文件,可以抓取任意计算机领域相关会议的论文。

example

Installation

  • Download this tool
git clone https://github.com/mo-xiaoxi/PaperRobot.git
  • Install dependencies
sudo pip3 install -r requirements.txt

Python version: Python 3 (>=3.7).

Why build this tool?

  1. 通过这个工具可以构建自己的论文数据库。具体参考:如何建立独属于你自己的论文数据库
  2. 一个方便的论文调研工具: Secpaper. 论文调研必备!
  3. 提取论文的摘要,自动翻译推送整理一些会议的研究简报,可以快速地过一下每个会议论文的内容,感兴趣的再阅读对应的pdf。
  4. 对会议研究热点、作者变化等等进行归类与整理。 如Computer Science Rankings.

Usage

$ python run.py --help
usage: run.py [-h] [-m {d,s}] [-c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}] [-s YEAR_START] [-e YEAR_END] [-b BIBTEX] [-t TITLE] [-u URL] [--all {bibtex,pdf}]

OPTIONS:
  -h, --help            show this help message and exit
  -m {d,s}, --mode {d,s}
                        s:show info, d: download
  -c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}, --conference {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}
                        The target conference.
  -s YEAR_START, --year_start YEAR_START
                        The start year of paper.
  -e YEAR_END, --year_end YEAR_END
                        The end year of paper.
  -b BIBTEX, --bibtex BIBTEX
                        Download with bibtex file.
  -t TITLE, --title TITLE
                        Download with Google search.
  -u URL, --url URL     Dowanload with url.
  --all {bibtex,pdf}    Download all bibbex or papers,2001-2022 by default

Example

  • 基于Title下载论文 python run.py -t "A Large-scale Analysis of Email Sender Spoofing Attacks"
  • 基于URL下载论文 python run.py -u "https://www.usenix.org/conference/usenixsecurity21/presentation/shen-kaiwen"
  • 基于bib下载论文 python run.py -b bibtex/example.bib
  • 获取NDSS 2021会议论文 python run.py -c ndss -s 2021 -e 2022
  • 获取NDSS 2001-2021会议论文 python run.py -c ndss -s 2001 -e 2022
  • 获取所有会议的bibtex文件 python run.py --all bibtex
  • 获取所有会议的pdf文件 python run.py --all bibtex

其他说明:

  • PaperRobot通过dblp抓取对应会议的bibtex,以保证通用性,理论上支持任意DBLP上收录的会议。

    通过配置下列数据,可以增加新的会议支持。

    LIB = {
        "ccs": "CCS",
        "uss": "Usenix_Security",
        "sp": "S&P",
        "ndss": "NDSS",
        "dsn": "DSN",
        "raid": "RAID",
        "imc": "IMC",
        "asiaccs": "ASIACCS",
        "acsac": "ACSAC",
        "sigcomm": "SIGCOMM",
    }
  • 多个PDF辅助抓取接口:

    • 通过doi序列号在SCI-HUB抓取论文(zotera适用方法)
    • 论文官方网站抓取论文
    • 通过google搜索抓取论文
    • 通过crossRef网站抓取论文(这个接口效果不是特别好)
  • keep_cookies.py 用于维护某些站点的登陆状态,需要单独运行。

    • 维护登陆状态的原因是某些网站(如dl.acm)需要登陆才能下载pdf。

      用户需要单独配置config中的账号密码,账号密码为学校账号与密码。

    • 若在教育网IP内访问, 则不需要维护Cookie信息,教育网IP直接可以下载PDF。

    • 用户也可以手动维护cookie信息,利用burpsuite等一系列工具导出cookie,写入data/cookie.json文件即可。

TODO

  • 更好的文档说明,中英文文档分开。
  • 修改日志信息到英文版本
  • 多进程+多协程并发处理
  • 代理池构建
  • 使用重试修饰器重写需重试的函数
Owner
moxiaoxi
CTF Player of Tea-Deliverers, Blue-Lotus. Ph.D. Student at Tsinghua University. Research on Protocol Security.
moxiaoxi
让中国用户使用git从github下载的速度提高1000倍!

序言 github上有很多好项目,但是国内用户连github却非常的慢.每次都要用插件或者其他工具来解决. 这次自己做一个小工具,输入github原地址后,就可以自动替换为代理地址,方便大家更快速的下载. 安装 pip install cit 主要功能与用法 主要功能 change 将目标地址转换为

35 Aug 29, 2022
京东茅台抢购

截止 2021/2/1 日,该项目已无法使用! 京东:约满即止,仅限京东实名认证用户APP端抢购,2月1日10:00开始预约,2月1日12:00开始抢购(京东APP需升级至8.5.6版本及以上) 写在前面 本项目来自 huanghyw - jd_seckill,作者的项目地址我找不到了,找到了再贴上

abee 73 Dec 03, 2022
Current Antarctic large iceberg positions derived from ASCAT and OSCAT-2

Iceberg Locations Antarctic large iceberg positions derived from ASCAT and OSCAT-2. All data collected here are from the NASA SCP website Overview Thi

Joel Hanson 5 Jul 27, 2022
Crawler in Python 3.7, 3.8. 3.9. Pypy3

Description Python Crawler written Python 3. (Supports major Python releases Python3.6, Python3.7 and Python 3.8) Installation and Use Setup VirtualEn

Vinit Kumar 2 Mar 12, 2022
薅薅乐 - JD 测试脚本

薅薅乐 安裝 使用docker docker一键安装: docker run -d --name jd classmatelin/hhl:latest. 使用 进入容器: docker exec -it jd bash 获取JD_COOKIES: python get_jd_cookies.py,

ClassmateLin 575 Dec 28, 2022
A simple, configurable and expandable combined shop scraper to minimize the costs of ordering several items

combined-shop-scraper A simple, configurable and expandable combined shop scraper to minimize the costs of ordering several items. Features Define an

2 Dec 13, 2021
Libextract: extract data from websites

Libextract is a statistics-enabled data extraction library that works on HTML and XML documents and written in Python

499 Dec 09, 2022
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
An helper library to scrape data from Instagram effortlessly, using the Influencer Hunters APIs.

Instagram Scraper An utility library to scrape data from Instagram hassle-free Go to the website » View Demo · Report Bug · Request Feature About The

2 Jul 06, 2022
This program scrapes information and images for movies and TV shows.

Media-WebScraper This program scrapes information and images for movies and TV shows. Summary For more information on the program, read the WebScrape_

1 Dec 05, 2021
基于Github Action的定时HITsz疫情上报脚本,开箱即用

HITsz Daily Report 基于 GitHub Actions 的「HITsz 疫情系统」访问入口 定时自动上报脚本,开箱即用。 感谢 @JellyBeanXiewh 提供原始脚本和 idea。 感谢 @bugstop 对脚本进行重构并新增 Easy Connect 校内代理访问。

Ter 56 Nov 27, 2022
:arrow_double_down: Dumb downloader that scrapes the web

You-Get NOTICE: Read this if you are looking for the conventional "Issues" tab. You-Get is a tiny command-line utility to download media contents (vid

Mort Yao 46.4k Jan 03, 2023
AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player recieved

AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player recieved

5 Nov 25, 2021
Scrape plants scientific name information from Agroforestry Species Switchboard 2.0.

Agroforestry Species Switchboard 2.0 Scraper Scrape plants scientific name information from Species Switchboard 2.0. Requirements python = 3.10 (you

Mgs. M. Rizqi Fadhlurrahman 2 Dec 23, 2021
Meme-videos - Scrapes memes and turn them into a video compilations

Meme Videos Scrapes memes from reddit using praw and request and then converts t

Partho 12 Oct 28, 2022
A python script to extract answers to any question on Quora (Quora+ included)

quora-plus-bypass A python script to extract answers to any question on Quora (Quora+ included) Requirements Python 3.x

Nitin Narayanan 10 Aug 18, 2022
Nekopoi scraper using python3

Features Scrap from url Todo [+] Search by genre [+] Search by query [+] Scrap from homepage Example # Hentai Scraper from nekopoi import Hent

MhankBarBar 9 Apr 06, 2022
A simple code to fetch comments below an Instagram post and save them to a csv file

fetch_comments A simple code to fetch comments below an Instagram post and save them to a csv file usage First you have to enter your username and pas

2 Jul 14, 2022
Telegram Group Scrapper

this programe is make your work so much easy on telegrame. do you want to send messages on everyone to your group or others group. use this script it will do your work automatically with one click. a

HackArrOw 3 Dec 03, 2022
Divar.ir Ads scrapper

Divar.ir Ads Scrapper Introduction This project first asynchronously grab Divar.ir Ads and then save to .csv and .xlsx files named data.csv and data.x

Iman Kermani 4 Aug 29, 2022