Async Python 3.6+ web scraping micro-framework based on asyncio

Overview

Ruia logo

Ruia

🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio.

Write less, run faster.

travis codecov PyPI - Python Version PyPI Downloads gitter

Overview

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, aims to make crawling url as convenient as possible.

Write less, run faster:

Features

  • Easy: Declarative programming
  • Fast: Powered by asyncio
  • Extensible: Middlewares and plugins
  • Powerful: JavaScript support

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# New features
pip install git+https://github.com/howie6879/ruia

Tutorials

  1. Overview
  2. Installation
  3. Define Data Items
  4. Spider Control
  5. Request & Response
  6. Customize Middleware
  7. Write a Plugins

TODO

  • Cache for debug, to decreasing request limitation, ruia-cache
  • Provide an easy way to debug the script, ruia-shell
  • Distributed crawling/scraping

Contribution

Ruia is still under developing, feel free to open issues and pull requests:

  • Report or fix bugs
  • Require or publish plugins
  • Write or fix documentation
  • Add test cases

!!!Notice: We use black to format the code

Thanks

Comments
  • Add rtds support.

    Add rtds support.

    I notice that you have tried to use mkdoc to generate the website.

    Here's an example at readthedocs.org, powered by sphinx.

    There's a little bug, but it is still great.

    RTDs

    opened by panhaoyu 32
  • Log crucial information regardless of log-level

    Log crucial information regardless of log-level

    I've reduced the log level of a Spider in my script as I find it too verbose, however I also filter out crucial info, particularly the after completion info (number of requests, time, ect.) - https://github.com/howie6879/ruia/blob/651fac54540fe0030d3a3d5eefca6c67d0dcb3c3/ruia/spider.py#L280-L287

    This is code I currently use to reduce verbosity:

    import logging
    
    # Disable logging (for speed)
    logging.root.setLevel(logging.ERROR)
    

    I'm thinking of changing the code so that it shows regardless of log level, but will there ever be a case where you wouldn't want to see it?

    opened by abmyii 13
  • `DELAY` attribute specifically for retries

    `DELAY` attribute specifically for retries

    I assumed the DELAY attr would set the delay for retries but instead it applies to all requests. I would appreciate it if there was a DELAY attr specifically for retries (RETRY_DELAY). I'd be happy to implement it if given the go-ahead.

    Thank you for this great library!

    opened by abmyii 13
  • Calling `self.start` as an instance method for a `Spider`

    Calling `self.start` as an instance method for a `Spider`

    I have the following parent class which has reusable code for all the spiders in my project (this is just a basic example):

    class Downloader(Spider):
        concurrency = 15
        worker_numbers = 2
    
        # RETRY_DELAY (secs) is time between retries
        request_config = {
            "RETRIES": 10,
            "DELAY": 0,
            "RETRY_DELAY": 0.1
        }
    
        db_name = "DB"
        db_url = "postgresql://..."
        main_table = "test"
    
        def __init__(self, *args, **kwargs):
            # Initialise DB connection
            self.db = DB(self.db_url, self.db_name, self.main_table)
    
        def download(self):
            self.start()
            
    		# After completion, commit to DB
            self.db.commit()
    

    I use it by sub-classing for each different spider. However, it seems that self.start cannot be accessed as an instance for spiders (since it's a classmethod) - giving this error:

    Traceback (most recent call last):
      File "src/scraper.py", line 107, in <module>
        scraper = Scraper()
      File "src/downloader.py", line 31, in __init__
        super(Downloader, self).__init__(*args, **kwargs)
      File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
        self.request_session = ClientSession()
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
        loop = get_running_loop(loop)
      File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
        loop = asyncio.get_event_loop()
      File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
        raise RuntimeError('There is no current event loop in thread %r.'
    RuntimeError: There is no current event loop in thread 'MainThread'.
    Exception ignored in: <function ClientSession.__del__ at 0x7f28875e8b80>
    Traceback (most recent call last):
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
        if not self.closed:
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
        return self._connector is None or self._connector.closed
    AttributeError: 'ClientSession' object has no attribute '_connector'
    

    Any idea how I can solve this issue whilst maintaining the structure I am trying to implement?

    opened by abmyii 11
  • asyncio `RuntimeError`

    asyncio `RuntimeError`

    ERROR asyncio Exception in callback BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)
    handle: <Handle BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)>
    Traceback (most recent call last):
      File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
        self._context.run(self._callback, *self._args)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 516, in _sock_write_done
        self.remove_writer(fd)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 346, in remove_writer
        self._ensure_fd_no_transport(fd)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 251, in _ensure_fd_no_transport
        raise RuntimeError(
    RuntimeError: File descriptor 150 is used by transport <_SelectorSocketTransport fd=150 read=idle write=<polling, bufsize=0>>
    

    Getting this quite a bit still. I don't think it's ruia directly, but aiohttp. Any ideas?

    One thing that may be causing it is that in clean functions I call other functions synchronously, i.e.:

        async def clean_<...>(self, value):
            return <function>(value)
    

    Could that be causing it? I tried doing return await ... but the error still persisted.

    opened by abmyii 11
  • Show URL in Error for easier debugging

    Show URL in Error for easier debugging

    I think errors would be more useful if they also showed the URL of the parsed page. Example:

    ERROR Spider <Item: extract ... error, please check selector or set parameter named default>, https://...

    I hacked a solution together by passing around the url parameter, but I can't think of a clean solution ATM. Any ideas? I can also push my changes if you would like to see them (very hacky).

    opened by abmyii 11
  • 运行示例代码报错

    运行示例代码报错

    我参考的是 https://github.com/howie6879/ruia/blob/master/docs/en/tutorials/item.md 里的代码

    import asyncio
    from ruia import Item, TextField, AttrField
    
    
    class PythonDocumentationItem(Item):
        title = TextField(css_select='title')
        tutorial_link = AttrField(xpath_select="//a[text()='Tutorial']", attr='href')
    
    
    async def main():
        url = 'https://docs.python.org/3/'
        item = await PythonDocumentationItem.get_item(url=url)
        print(item.title)
        print(item.tutorial_link)
    
    
    if __name__ == '__main__':
        # Python 3.7 required
        asyncio.run(main())
    

    运行能获取到正常的结果 3.9.5 Documentation tutorial/index.html 但是会报错,提示RuntimeError: Event loop is closed 完整的运行结果如下所示:

    [2021:05:06 14:11:55] INFO  Request <GET: https://docs.python.org/3/>
    3.9.5 Documentation
    tutorial/index.html
    Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000018664A679D0>
    Traceback (most recent call last):
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 116, in __del__
        self.close()
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 108, in close
        self._loop.call_soon(self._call_connection_lost, None)
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 746, in call_soon
        self._check_closed()
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 510, in _check_closed
        raise RuntimeError('Event loop is closed')
    RuntimeError: Event loop is closed
    

    PC :win10 64bit Python :3.9.4 64bit

    opened by qgyhd1234 10
  • 我愿意用分布式函数调度框架合和你来比,看谁代码更少谁更自由来爬任意网站,欢迎交流。

    我愿意用分布式函数调度框架合和你来比,看谁代码更少谁更自由来爬任意网站,欢迎交流。

    https://github.com/ydf0509/distributed_framework/blob/master/test_frame/car_home_crawler_sample/car_home_consumer.py

    欢迎来对比,或者你不想用汽车之家测试,可以指定一个任何两层级网站的爬虫调度,看谁的代码少,写法更快更自由,看谁的控制手段多,看谁的运行速度更快,。

    opened by ydf0509 9
  • `TextField` strips strings which may not be desirable

    `TextField` strips strings which may not be desirable

    https://github.com/howie6879/ruia/blob/8a91c0129d38efd8fcd3bee10b78f694a1c37213/ruia/field.py#L120

    My use case is extracting paragraphs which have newlines between them, and these are stripped out by TextField. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.

    opened by abmyii 9
  • Trouble scraping deck.tk/deckstats.net

    Trouble scraping deck.tk/deckstats.net

    For example:

    import asyncio
    from ruia import Request
    
    
    async def request_example():
        url = "https://deck.tk/07Pw8tfr"
        params = {
            'name': 'ruia',
        }
        headers = {
            'User-Agent': 'Python3.6',
        }
        request = Request(url=url, method='GET', params=params, headers=headers)
        response = await request.fetch()
        json_result = await response.json()
        print(json_result)
    
    
    if __name__ == '__main__':
        asyncio.get_event_loop().run_until_complete(request_example())
    

    This simply hangs without resolution. That is, the request is never resolved, and I must Ctrl-C out of it. Scrapy handles this without issue, but I was hoping to transition to ruia. Any ideas?

    bug enhancement 
    opened by Triquetra 7
  • Would be nice to be able to pass in

    Would be nice to be able to pass in "start_urls"

    Ruia seems like a brilliant way to write simple and elegant web scrapers, but I can't figure out how to have a different "start_urls" value. I want a web scraper that can check all links on any GIVEN web page, not just whatever the start_urls lead me to, but also with the simplicity and asynchronous power that Ruia provides. Maybe this is a feature but I can't tell from the documentation or code

    opened by JacobJustice 7
  • Improve Chinese documentation

    Improve Chinese documentation

    Toc: Ruia中文文档

    • [x] 快速开始
    • [ ] 入门指南
      • [ ] 1.概览
      • [ ] 2.爱美妆
      • [ ] 3.定义Item
      • [ ] 4.运行 Spider
      • [ ] 5.个性化
      • [ ] 6.插件
      • [ ] 7.帮助
    • [ ] 基础概念
      • [ ] 1.Request
      • [ ] 2.Response
      • [ ] 3.Item
      • [ ] 4.Field
      • [ ] 5.Spider
      • [ ] 6.Middleware
    • [ ] 开发指南
      • [ ] 1.搭建开发环境
      • [ ] 2.Ruia架构
      • [ ] 3.为Ruia编写插件
      • [ ] 4.贡献代码
    • [ ] 实践指南
      • [ ] 1.谈谈对Python爬虫的理解
    enhancement 
    opened by howie6879 0
Releases(v0.8.0)
Owner
howie.hu
奇文共欣赏,疑义相与析
howie.hu
Shopee Scraper - A web scraper in python that extract sales, price, avaliable stock, location and more of a given seller in Brazil

Shopee Scraper A web scraper in python that extract sales, price, avaliable stock, location and more of a given seller in Brazil. The project was crea

Paulo DaRosa 5 Nov 29, 2022
Proxy scraper. Format: IP | PORT | COUNTRY | TYPE

proxy scraper 🔎 Installation: git clone https://github.com/ebankoff/proxy_scraper Required pip libraries (pip install library name): lxml beautifulso

Eban'ko 19 Dec 07, 2022
腾讯课堂,模拟登陆,获取课程信息,视频下载,视频解密。

腾讯课堂脚本 要学一些东西,但腾讯课堂不支持自定义变速,播放时有水印,且有些老师的课一遍不够看,于是这个脚本诞生了。 时间比较紧张,只会不定时修复重大bug。多线程下载之类的功能更新短期内不会有,如果你想一起完善这个脚本,欢迎pr 2020.5.22测试可用 使用方法 很简单,三部完成 下载代码,

163 Dec 30, 2022
ChromiumJniGenerator - Jni Generator module extracted from Chromium project

ChromiumJniGenerator - Jni Generator module extracted from Chromium project

allenxuan 4 Jun 12, 2022
A Pixiv web crawler module

Pixiv-spider A Pixiv spider module WARNING It's an unfinished work, browsing the code carefully before using it. Features 0004 - Readme.md updated, co

Uzuki 1 Nov 14, 2021
Docker containerized Python Flask API that uses selenium to scrape and interact with websites

Docker containerized Python Flask API that uses selenium to scrape and interact with websites

Christian Gracia 0 Jan 22, 2022
A simple proxy scraper that utilizes the requests module in python.

Proxy Scraper A simple proxy scraper that utilizes the requests module in python. Usage Depending on your python installation your commands may vary.

3 Sep 08, 2021
Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website by form number and returns the results as json

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website (prior form publication) by form number and returns the results as json. It provides the option to download pdfs over a ra

1 Jan 04, 2022
Scraping Thailand COVID-19 data from the DDC's tableau dashboard

Scraping COVID-19 data from DDC Dashboard Scraping Thailand COVID-19 data from the DDC's tableau dashboard. Data is updated at 07:30 and 08:00 daily.

Noppakorn Jiravaranun 5 Jan 04, 2022
一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件

QQ音乐歌词爬虫 一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件,默认去除了所有演唱会(Live)版本的歌曲。 使用方法 直接运行python run.py即可,然后输入你想获取的歌手名字,然后静静等待片刻。 output目录下保存生成的歌词和歌名文件。以周杰伦为例,会生成两

Yang Wei 11 Jul 27, 2022
This is a web crawler that works on employ email data by gmane.org and visualizes it in different ways.

crawler_to_visual_gmane Analyzing an EMAIL Archive from gmane and vizualizing the data using the D3 JavaScript library. This is a set of tools that al

Saim Zafar 1 Dec 20, 2021
NASA APOD Discord Bot - Fetches information from NASA APOD site.

NASA APOD Discord Bot - Fetches information from NASA APOD site.

Astronomy Club IITK 4 Apr 23, 2022
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage import lassie lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022
A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

cybernews A package that provides you Latest Cyber/Hacker News from website using Web-Scraping. Latest Cyber/Hacker News Using Webscraping Developed b

Hitesh Rana 4 Jun 02, 2022
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 01, 2023
a small library for extracting rich content from urls

A small library for extracting rich content from urls. what does it do? micawber supplies a few methods for retrieving rich metadata about a variety o

Charles Leifer 588 Dec 27, 2022
An automated, headless YouTube Watcher and Scraper

Searches YouTube, queries recommended videos and watches them. All fully automated and anonymised through the Tor network. The project consists of two independently usable components, the YouTube aut

44 Oct 18, 2022
Python scrapper scrapping torrent website and download new movies Automatically.

torrent-scrapper Python scrapper scrapping torrent website and download new movies Automatically. If you like it Put a ⭐ on this repo 😇 Run this git

Fazil vk 1 Jan 08, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
A Python module to bypass Cloudflare's anti-bot page.

cloudflare-scrape A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Reque

3k Jan 04, 2023