A universal package of scraper scripts for humans


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Sponsors
  6. License
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a Chromedriver-free package that provides scraper scripts for the domains most commonly used in machine learning and data science. It scrapes public API endpoints directly and asynchronously, which removes the overhead of driving a browser and makes it fast and robust to DOM changes. Scrapera currently supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

The main aim of this package is to gather common scraping tasks in one place, so that ML researchers and engineers can focus on their models rather than on data collection.
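To illustrate why asynchronous requests to API endpoints can be faster than driving a browser, here is a minimal sketch using only the standard library. It is not Scrapera's internal code: `fetch`, `fetch_all`, and the `example.com` URLs are hypothetical, and the network call is simulated with `asyncio.sleep` so the example runs offline.

```python
import asyncio
import time
from typing import List


async def fetch(endpoint: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a non-blocking HTTP request."""
    # asyncio.sleep simulates network latency without blocking the event loop.
    await asyncio.sleep(delay)
    return f"response from {endpoint}"


async def fetch_all(endpoints: List[str]) -> List[str]:
    """Request every endpoint concurrently; results keep the input order."""
    return await asyncio.gather(*(fetch(e) for e in endpoints))


if __name__ == "__main__":
    # Placeholder URLs, used only as labels in this simulation.
    endpoints = [f"https://example.com/api/items/{i}" for i in range(5)]
    start = time.perf_counter()
    results = asyncio.run(fetch_all(endpoints))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} responses in {elapsed:.2f}s")
```

Because the five simulated requests wait concurrently, the total time should be close to one request's delay rather than the sum of all five.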

    DISCLAIMER: The owner and contributors take no responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.

    Prerequisites

    The prerequisites can be installed separately from the requirements.txt file:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

    To use any sub-module, import its scraper class, instantiate it, and call its scrape method:

    from scrapera.video.vimeo import VimeoScraper
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')

    For more examples, refer to the test folder in each module.
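When scraping several URLs in one run, it can help to keep going after a failure and report the problems at the end. The helper below is a hypothetical sketch, not part of Scrapera: `scrape_all` assumes only that the scraper object exposes the `scrape(url, quality)` call shown in the usage example above.

```python
from typing import Dict, Iterable, Optional


def scrape_all(scraper, urls: Iterable[str], quality: str) -> Dict[str, Optional[Exception]]:
    """Call scraper.scrape(url, quality) for each URL.

    Returns a dict mapping each URL to None on success, or to the
    exception that was raised for that URL.
    """
    outcomes: Dict[str, Optional[Exception]] = {}
    for url in urls:
        try:
            scraper.scrape(url, quality)
            outcomes[url] = None
        except Exception as exc:
            # Record the failure and continue with the remaining URLs.
            outcomes[url] = exc
    return outcomes


# Possible usage with the scraper from the example above (requires scrapera
# and network access, so it is left commented out here):
# from scrapera.video.vimeo import VimeoScraper
# outcomes = scrape_all(VimeoScraper(), ['https://vimeo.com/191955190'], '540p')
# failed = {url: exc for url, exc in outcomes.items() if exc is not None}
```

Any URL that ends up in `failed` is a good candidate for an issue report, as described under Contributing.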

    Contributing

    Scrapera welcomes all contributions and scraper requests. Please raise an issue if a scraper fails. Feel free to fork the repository and add your own scrapers to help the community!
    For contribution guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors


    Contact

    Feel free to reach out with any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements
