Bigdata - This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

Related tags

Web CrawlingBigdata
Overview

Scrapy Cluster

Build Status Documentation Join the chat at https://gitter.im/istresearch/scrapy-cluster Coverage Status License Docker Pulls

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster.

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.

Dependencies

Please see the requirements.txt within each sub project for Pip package dependencies.

Other important components required to run the cluster

Core Concepts

This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include:

  • The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
  • Scale Scrapy instances across a single machine or multiple machines
  • Coordinate and prioritize their scraping effort for desired sites
  • Persist data across scraping jobs
  • Execute multiple scraping jobs concurrently
  • Allows for in depth access into the information about your scraping job, what is upcoming, and how the sites are ranked
  • Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
  • Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
  • Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address
  • Enables completely different spiders to yield crawl requests to each other, giving flexibility to how the crawl job is tackled

Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have Docker.

Steps to launch the test environment:

  1. Build your containers (or omit --build to pull from docker hub)
docker-compose up -d --build
  1. Tail kafka to view your future results
docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -ll INFO
  1. From another terminal, feed a request to kafka
curl localhost:5343/feed -H "content-type:application/json" -d '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc123"}'
  1. Validate you've got data!
# wait a couple seconds, your terminal from step 2 should dump json data
{u'body': '...content...', u'crawlid': u'abc123', u'links': [], u'encoding': u'utf-8', u'url': u'http://dmoztools.net', u'status_code': 200, u'status_msg': u'OK', u'response_url': u'http://dmoztools.net', u'request_headers': {u'Accept-Language': [u'en'], u'Accept-Encoding': [u'gzip,deflate'], u'Accept': [u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], u'User-Agent': [u'Scrapy/1.5.0 (+https://scrapy.org)']}, u'response_headers': {u'X-Amz-Cf-Pop': [u'IAD79-C3'], u'Via': [u'1.1 82c27f654a5635aeb67d519456516244.cloudfront.net (CloudFront)'], u'X-Cache': [u'RefreshHit from cloudfront'], u'Vary': [u'Accept-Encoding'], u'Server': [u'AmazonS3'], u'Last-Modified': [u'Mon, 20 Mar 2017 16:43:41 GMT'], u'Etag': [u'"cf6b76618b6f31cdec61181251aa39b7"'], u'X-Amz-Cf-Id': [u'y7MqDCLdBRu0UANgt4KOc6m3pKaCqsZP3U3ZgIuxMAJxoml2HTPs_Q=='], u'Date': [u'Tue, 22 Dec 2020 21:37:05 GMT'], u'Content-Type': [u'text/html']}, u'timestamp': u'2020-12-22T21:37:04.736926', u'attrs': None, u'appid': u'testapp'}

Documentation

Please check out the official Scrapy Cluster documentation for more information on how everything works!

Branches

The master branch of this repository contains the latest stable release code for Scrapy Cluster 1.2.

The dev branch contains bleeding edge code and is currently working towards Scrapy Cluster 1.3. Please note that not everything may be documented, finished, tested, or finalized but we are happy to help guide those who are interested.

Owner
Hanh Pham Van
Hanh Pham Van
用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。

crawler_for_university 用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。 环境依赖 wxpy,requests,bs4等库 功能描述 该项目基于python,通过爬虫爬各高校的就业信息网,爬取招聘信

8 Aug 16, 2021
A package designed to scrape data from Yahoo Finance.

yahoostock A package designed to scrape data from Yahoo Finance. Installation The most simple installation method is through PIP. pip install yahoosto

Rohan Singh 2 May 28, 2022
An Web Scraping API for MDL(My Drama List) for Python.

PyMDL An API for MyDramaList(MDL) based on webscraping for python. Description An API for MDL to make your life easier in retriving and working on dat

6 Dec 10, 2022
This was supposed to be a web scraping project, but somehow I've turned it into a spamming project

Introduction This was supposed to be a web scraping project, but somehow I've turned it into a spamming project.

Boss Perry (Pez) 1 Jan 23, 2022
Find papers by keywords and venues. Then download it automatically

paper finder Find papers by keywords and venues. Then download it automatically. How to use this? Search CLI python search.py -k "knowledge tracing,kn

Jiahao Chen (TabChen) 2 Dec 15, 2022
Kusonime scraper using python3

Features Scrap from url Scrap from recommendation Search by query Todo [+] Search by genre Example # Get download url from kusonime import Scrap

MhankBarBar 2 Jan 28, 2022
Simple python tool for the purpose of swapping latinic letters with cirilic ones and vice versa in txt, docx and pdf files in Serbian language

Alpha Swap English This is a simple python tool for the purpose of swapping latinic letters with cirylic ones and vice versa, in txt, docx and pdf fil

Aleksandar Damnjanovic 3 May 31, 2022
原神爬虫 抓取原神界面圣遗物信息

原神圣遗物半自动爬虫 说明 直接抓取原神界面中的圣遗物数据 目前只适配了背包页面的抓取 准确率:97.5%(普通通用接口,对 40 件随机圣遗物识别,统计完全正确的数量为 39) 准确率:100%(4k 屏幕,普通通用接口,对 110 件圣遗物识别,统计完全正确的数量为 110) 不排除还有小错误的

hwa 28 Oct 10, 2022
WebScraping - Scrapes Job website for python developer jobs and exports the data to a csv file

WebScraping Web scraping Pyton program that scrapes Job website for python devel

Michelle 2 Jul 22, 2022
Lovely Scrapper

Lovely Scrapper

Tushar Gadhe 2 Jan 01, 2022
Simple tool to scrape and download cross country ski timings and results from live.skidor.com

LiveSkidorDownload Simple tool to scrape and download cross country ski timings

0 Jan 07, 2022
download NCERT books using scrapy

download_ncert_books download NCERT books using scrapy Downloading Books: You can either use the spider by cloning this repo and following the instruc

1 Dec 02, 2022
茅台抢购最新优化版本,茅台秒杀,优化了抢购协程队列

茅台抢购最新优化版本,茅台秒杀,优化了抢购协程队列

MaoTai 33 Sep 03, 2022
Using Selenium with Python to Web Scrap Popular Youtube Tech Channels.

Web Scrapping Popular Youtube Tech Channels with Selenium Data Mining, Data Wrangling, and Exploratory Data Analysis About the Data Web scrapi

David Rusho 0 Aug 18, 2021
👁️ Tool for Data Extraction and Web Requests.

httpmapper 👁️ Project • Technologies • Installation • How it works • License Project 🚧 For educational purposes. This is a project that I developed,

15 Dec 05, 2021
Download images from forum threads

Forum Image Scraper Downloads images from forum threads Only works with forums which doesn't require a login to view and have an incremental paginatio

9 Nov 16, 2022
A Scrapper with python

Scrapper-en-python Scrapper des données signifie récuperer des données pour les traiter ou les analyser. En python, il y'a 2 grands moyens de scrapper

Lun4rIum 1 Dec 05, 2021
FilmMikirAPI - A simple rest-api which is used for scrapping on the Kincir website using the Python and Flask package

FilmMikirAPI - A simple rest-api which is used for scrapping on the Kincir website using the Python and Flask package

UserGhost411 1 Nov 17, 2022
The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

The open-source web scrapers that feed the Los Angeles Times' California coronavirus tracker. Processed data ready for analysis is available at datade

Los Angeles Times Data and Graphics Department 51 Dec 14, 2022
学习强国 自动化 百分百正确、瞬间答题,分值45分

项目简介 学习强国自动化脚本,解放你的时间! 使用Selenium、requests、mitmpoxy、百度智能云文字识别开发而成 使用说明 注:Chrome版本 驱动会自动下载 首次使用会生成数据库文件db.db,用于提高文章、视频任务效率。 依赖安装 pip install -r require

lisztomania 359 Dec 30, 2022