Web crawling framework based on asyncio.

Last update: Jan 05, 2023

Overview

A web crawling framework for everyone, written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)
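
uvloop is an optional replacement for the default asyncio event loop. The sketch below is not taken from gain's source; it only illustrates one common way to use uvloop when it is installed and fall back to the standard loop when it is not. The names make_event_loop and main are hypothetical.

```python
import asyncio


def make_event_loop():
    """Return a uvloop event loop if uvloop is installed, else a standard asyncio loop."""
    try:
        import uvloop  # optional dependency, may be missing
    except ImportError:
        return asyncio.new_event_loop()
    return uvloop.new_event_loop()


async def main():
    # Placeholder coroutine standing in for real crawling work.
    await asyncio.sleep(0)
    return 'done'


loop = make_event_loop()
try:
    result = loop.run_until_complete(main())
finally:
    loop.close()
print(result)
```

Because the fallback is decided at import time, the same script runs unchanged on machines where uvloop is not available.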

Usage

Write spider.py:

from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    # Each Css selector extracts one field from a matched page.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep the regex backslashes (\d, \-) intact.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
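
The two Parser arguments in the spider above are regular expressions for URLs. As the example reads, the first describes paginated listing pages and the second describes individual posts, which are paired with Post. The sketch below uses only the standard re module to show which URLs each pattern accepts. It makes no claim about how gain applies the patterns internally; the classify helper and the sample URLs are made up for illustration.

```python
import re

# Same patterns as in the spider, written as raw strings.
LIST_PAGE = re.compile(r'https://blog.scrapinghub.com/page/\d+/')
POST_PAGE = re.compile(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/')


def classify(url):
    """Return 'post', 'list', or None depending on which pattern matches the whole URL."""
    if POST_PAGE.fullmatch(url):
        return 'post'
    if LIST_PAGE.fullmatch(url):
        return 'list'
    return None


# Hypothetical URLs, used only to exercise the patterns.
samples = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2017/01/01/some-post/',
    'https://blog.scrapinghub.com/about/',
]
labels = [classify(u) for u in samples]
print(labels)
```

Checking patterns this way before a crawl can help catch a regex that is too broad or too narrow.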

Or use XPathParser:

from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    # Optional: send requests through a proxy.
    proxy = 'https://localhost:1234'


MySpider.run()

You can add a proxy setting to the spider, as shown above.
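
Both spiders above set concurrency = 5. The sketch below shows one general asyncio technique for capping the number of requests in flight, using asyncio.Semaphore. It is an illustration under stated assumptions, not gain's actual implementation: fetch only simulates network I/O with a short sleep, and fetch, crawl and the example.com URLs are hypothetical.

```python
import asyncio


async def fetch(url, semaphore, state):
    """Pretend to download url while holding one of the semaphore's slots."""
    async with semaphore:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)  # stands in for network I/O
        state['active'] -= 1
        return url


async def crawl(urls, concurrency=5):
    """Fetch all urls, never running more than `concurrency` fetches at once."""
    semaphore = asyncio.Semaphore(concurrency)
    state = {'active': 0, 'peak': 0}  # tracks how many fetches overlap
    results = await asyncio.gather(*(fetch(u, semaphore, state) for u in urls))
    return results, state['peak']


urls = ['https://example.com/page/{}/'.format(i) for i in range(20)]
loop = asyncio.new_event_loop()
try:
    results, peak = loop.run_until_complete(crawl(urls))
finally:
    loop.close()
print(len(results), peak)
```

The peak counter is there only to make the cap observable; a real crawler would drop it and replace the sleep with an HTTP request.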

Run the spider with python spider.py.

Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.

Owner

Jiuli Gao
