A universal package of scraper scripts for humans

Overview


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Sponsors
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides access to a variety of scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, removing the heavy browser overhead; this makes it extremely fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than worrying about the data collection process.

DISCLAIMER: The owner and contributors do not take any responsibility for misuse of data obtained through Scrapera. Contact the owner if copyright terms are violated due to any module provided by Scrapera.

Getting Started

Prerequisites

Prerequisites can be installed separately through the requirements.txt file as follows:

    pip install -r requirements.txt

    Installation

Scrapera is built with Python 3 and can be installed directly via pip:

    pip install scrapera

Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

To use any sub-module, you just need to import, instantiate, and execute it:

    from scrapera.video.vimeo import VimeoScraper
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')

For more examples, please refer to the test folders in the respective modules.
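
The import-instantiate-execute pattern scales naturally to batches of URLs. The sketch below is illustrative, not part of Scrapera's API: `scrape_all` is a hypothetical helper that works with any scraper object exposing a `.scrape(url, quality)` method, such as the VimeoScraper above, and collects per-URL failures instead of aborting the whole run:

```python
# Hypothetical batch helper (not part of Scrapera). It accepts any scraper
# object exposing a .scrape(url, quality) method, e.g. the VimeoScraper
# shown above, and returns the URLs that failed along with their errors.
def scrape_all(scraper, urls, quality='540p'):
    failures = []
    for url in urls:
        try:
            scraper.scrape(url, quality)
        except Exception as exc:  # Scrapera's exact exception types are not assumed
            failures.append((url, exc))
    return failures
```

Collecting failures rather than raising lets a long scraping run finish and report the problem URLs at the end.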

    Contributing

Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails in any instance. Feel free to fork the repository and add your own scrapers to help the community!
For more guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors


    Contact

Feel free to reach out for any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements

    Owner
    Helping Machines Learn Better 💻😃