A universal package of scraper scripts for humans


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Sponsors
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides a variety of scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, removing the heavy browser overhead; this makes it extremely fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous
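The browser-free approach described above can be illustrated with a short sketch (this is not Scrapera's internal code; the endpoint URLs are placeholders and the network call is simulated so the example is self-contained):

```python
import asyncio

async def fetch(url):
    # In a real scraper this would be an async HTTP GET against a public
    # API endpoint; here we simulate the call so the sketch runs offline.
    await asyncio.sleep(0.01)
    return {"url": url, "status": 200}

async def scrape_all(urls):
    # gather() runs all requests concurrently -- the source of the speedup
    # over sequential, browser-driven scraping.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://api.example.com/items?page={i}" for i in range(3)]
results = asyncio.run(scrape_all(urls))
print([r["status"] for r in results])  # → [200, 200, 200]
```

Because no browser or DOM parsing is involved, a layout change on the target site does not break scrapers that read from API endpoints.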

The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than on the data collection process.

DISCLAIMER: The owner and contributors do not take any responsibility for misuse of data obtained through Scrapera. Contact the owner if copyright terms are violated due to any module provided by Scrapera.

Getting Started

Prerequisites

Prerequisites can be installed separately through the requirements.txt file as shown below:

pip install -r requirements.txt

Installation

Scrapera is built with Python 3 and can be installed directly via pip:

pip install scrapera

Alternatively, to install the latest version directly from GitHub, run:

pip install git+https://github.com/DarshanDeshpande/Scrapera.git

Usage

To use any sub-module, just import, instantiate, and execute it:

from scrapera.video.vimeo import VimeoScraper
scraper = VimeoScraper()
scraper.scrape('https://vimeo.com/191955190', '540p')

For more examples, please refer to the individual test folders in the respective modules.
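Scrapers that hit live endpoints can fail transiently (rate limits, dropped connections). A simple retry wrapper can make batch jobs more robust; this is a generic sketch, not part of Scrapera's API, and `flaky_scrape` below is a stand-in for a real scraper method:

```python
import time

def scrape_with_retry(scrape, *args, retries=3, delay=1.0):
    # Retry a scraper call a few times before giving up -- useful when
    # public endpoints rate-limit or drop connections intermittently.
    last_err = None
    for attempt in range(retries):
        try:
            return scrape(*args)
        except Exception as err:
            last_err = err
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise last_err

# Stand-in for a real scraper method that fails twice, then succeeds.
calls = {"n": 0}
def flaky_scrape(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return f"scraped {url}"

print(scrape_with_retry(flaky_scrape, "https://vimeo.com/191955190", delay=0.01))
# → scraped https://vimeo.com/191955190
```

The same wrapper works with any scraper object, e.g. `scrape_with_retry(scraper.scrape, url, '540p')`.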

Contributing

Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails at any point. Feel free to fork the repository and add your own scrapers to help the community!
For more guidelines, refer to CONTRIBUTING.

License

Distributed under the MIT License. See LICENSE for more information.

Sponsors

Contact

Feel free to reach out for any issues or requests related to Scrapera.

Darshan Deshpande (Owner) - Email | LinkedIn

Acknowledgements

Owner
Helping Machines Learn Better 💻😃