An arxiv spider

Overview

An Arxiv Spider

做为一个cser,杰出男孩深知内核对连接到计算机上的硬件设备进行管理的高效方式是中断而不是轮询。每当小伙伴发来一篇刚挂在arxiv上的”热乎“好文章时,杰出男孩都会感叹道:”师兄这是每天都挂在arxiv上呀,跑的好快~“。于是杰出男孩找了找 github,借鉴了一下其他大佬们的脚本,实现了一个每天向自己的邮件发送('cs.CV','cs.AI','stat.ML','cs.LG','cs.RO')里面感兴趣的文章的spider,支持自定义key word以及感兴趣的author

How to run

  1. 配置main.py里面的邮箱用户名和密码,记得开启邮箱的pop3验证

  2. 修改run.sh里面代码的目录和运行的python env的路径

  3. 使用crontab设置定时任务

    crontab -e

    contrab内容为

    0 10 * * 1,2,3,4,5 bash your_dir/arxiv_spider/run.sh

    即每周一到周五,早上10点定时推送arxiv当天更新到邮箱

arxiv是一个非常棒的网站,用脚本高频率爬取肯定是要被谴责的行为。但文章每天只更新一次,所以建议大家每天运行一次脚本,相当于每天逛一次arxiv了~

Result

Today arxiv has 338 new papers in ['cs.CV', 'cs.AI', 'stat.ML', 'cs.LG', 'cs.RO'] area, and 127 of them is about CV, 2/2 of them contain your keywords.

Ensure your keywords is ['(?i)offline.*(RL|reinforcement learning)', '(?i)(RL|reinforcement learning).*offline'].

This is your paperlist.Enjoy!

------------1------------
arXiv:2110.12468
Title: SCORE: Spurious COrrelation REduction for Offline Reinforcement Learning
['Machine Learning (cs.LG)', 'Artificial Intelligence (cs.AI)']
https://arxiv.org/abs/2110.12468

------------2------------
arXiv:2110.13060
Title: Safely Bridging Offline and Online Reinforcement Learning
['Machine Learning (cs.LG)', 'Machine Learning (stat.ML)']
https://arxiv.org/abs/2110.13060

Ensure your authors is ['Sergey Levine', 'Song Han'].

This is your paperlist.Enjoy!

------------1------------
arXiv:2110.12080
Title: C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks
['Machine Learning (cs.LG)', 'Artificial Intelligence (cs.AI)']
https://arxiv.org/abs/2110.12080

------------2------------
arXiv:2110.12543
Title: Understanding the World Through Action
['Machine Learning (cs.LG)']
https://arxiv.org/abs/2110.12543

Acknowledgement

This code is built upon the implementation from https://github.com/ZihaoZhao/Arxiv_daily

Owner
Jie Liu
Jie Liu
An experiment to deploy a serverless infrastructure for a scrapy project.

Serverless Scrapy project This project aims to evaluate the feasibility of an architecture based on serverless technology for a web crawler using scra

José Ferraz Neto 5 Jul 08, 2022
An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post

Autoscraper-n-blogger An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post and notifies via Telegram bot

GOKUL A.P 13 Dec 21, 2022
Meme-videos - Scrapes memes and turn them into a video compilations

Meme Videos Scrapes memes from reddit using praw and request and then converts t

Partho 12 Oct 28, 2022
An application that on a given url, crowls a web page and gets all words, sorts and counts them.

Web-Scrapping-1 An application that on a given url, crowls a web page and gets all words, sorts and counts them. Installation Using the package manage

adriano atambo 1 Jan 16, 2022
High available distributed ip proxy pool, powerd by Scrapy and Redis

高可用IP代理池 README | 中文文档 本项目所采集的IP资源都来自互联网,愿景是为大型爬虫项目提供一个高可用低延迟的高匿IP代理池。 项目亮点 代理来源丰富 代理抓取提取精准 代理校验严格合理 监控完备,鲁棒性强 架构灵活,便于扩展 各个组件分布式部署 快速开始 注意,代码请在release

SpiderClub 5.2k Jan 03, 2023
mlscraper: Scrape data from HTML pages automatically with Machine Learning

🤖 Scrape data from HTML websites automatically with Machine Learning

Karl Lorey 798 Dec 29, 2022
LSpider 一个为被动扫描器定制的前端爬虫

LSpider LSpider - 一个为被动扫描器定制的前端爬虫 什么是LSpider? 一款为被动扫描器而生的前端爬虫~ 由Chrome Headless、LSpider主控、Mysql数据库、RabbitMQ、被动扫描器5部分组合而成。

Knownsec, Inc. 321 Dec 12, 2022
Python web scrapper

Website scrapper Web scrapping project in Python. Created for learning purposes. Start Install python Update configuration with websites Launch script

Nogueira Vitor 1 Dec 19, 2021
Collection of code files to scrap different kinds of websites.

STW-Collection Scrap The Web Collection; blog posts. This repo contains Scrapy sample code to scrap the following kind of websites: Do you want to lea

Tapasweni Pathak 15 Jun 08, 2022
The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

The open-source web scrapers that feed the Los Angeles Times' California coronavirus tracker. Processed data ready for analysis is available at datade

Los Angeles Times Data and Graphics Department 51 Dec 14, 2022
Web and PDF Scraper Refactoring

Web and PDF Scraper Refactoring This repository contains the example code of the Web and PDF scraper code roast. Here are the links to the videos: Par

18 Dec 31, 2022
TikTok Username Swapper/Claimer/etc

TikTok-Turbo TikTok Username Swapper/Claimer/etc I wanted to create it as fast as possible but i eventually gave up and recoded it many many many many

Kevin 12 Dec 19, 2022
抢京东茅台脚本,定时自动触发,自动预约,自动停止

jd_maotai 抢京东茅台脚本,定时自动触发,自动预约,自动停止 小白信用 99.6,暂时还没抢到过,朋友 80 多抢到了一瓶,所以我感觉是跟信用分没啥关系,完全是看运气的。

Aruelius.L 117 Dec 22, 2022
Html Content / Article Extractor, web scrapping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

Xavier Grangier 3.8k Jan 02, 2023
A Python module to bypass Cloudflare's anti-bot page.

cloudflare-scrape A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Reque

3k Jan 04, 2023
Poolbooru gelscraper - a simple python script for scraping images off gelbooru pools.

poolbooru_gelscraper a simple python script for scraping images off gelbooru pools. modules required:requests_html, and os by default saves files with

savantshuia 1 Jan 02, 2022
Crawl BookCorpus

These are scripts to reproduce BookCorpus by yourself.

Sosuke Kobayashi 590 Jan 03, 2023
simple http & https proxy scraper and checker

simple http & https proxy scraper and checker

Neospace 11 Nov 15, 2021
基于Github Action的定时HITsz疫情上报脚本,开箱即用

HITsz Daily Report 基于 GitHub Actions 的「HITsz 疫情系统」访问入口 定时自动上报脚本,开箱即用。 感谢 @JellyBeanXiewh 提供原始脚本和 idea。 感谢 @bugstop 对脚本进行重构并新增 Easy Connect 校内代理访问。

Ter 56 Nov 27, 2022
CRI Scrape is a tool for get general info about Italian Red Cross in GAIA Platform

CRI Scrape CRI Scrape is a tool for get general info about Italian Red Cross in GAIA Platform Disclaimer This code is only for educational purpose. So

Vincenzo Cardone 0 Jul 23, 2022