Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.

Last update: Jan 24, 2022

Overview

Toxicity comments crawler

Twitter

Tweets and replies are scraped from the Twitter API for a given list of users.

Twitch

Coming soon.

YouTube

Coming soon.

Facebook

Coming soon.

Instagram

Coming soon.

The toxicity level of each comment is calculated using the Perspective API.
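As an illustration of how a toxicity score might be obtained, here is a minimal sketch that calls the Perspective API's `comments:analyze` endpoint over plain HTTP with the standard library. It is not the crawler's actual code: the function names (`build_payload`, `extract_toxicity`, `is_toxic`, `score_comment`) are hypothetical, and comparing with `>=` against `PERSPECTIVE_THRESHOLD` is an assumption about how the threshold is applied.

```python
import json
from urllib import request

# Public Perspective API endpoint for comment analysis.
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_payload(text: str) -> dict:
    """Build a request body that asks only for the TOXICITY attribute."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def extract_toxicity(response: dict) -> float:
    """Read the summary TOXICITY score (between 0.0 and 1.0) from a response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def is_toxic(score: float, threshold: float = 0.5) -> bool:
    """Hypothetical helper: assume a comment is toxic at or above the threshold."""
    return score >= threshold


def score_comment(text: str, api_key: str) -> float:
    """Send one comment to the Perspective API and return its toxicity score."""
    req = request.Request(
        f"{PERSPECTIVE_URL}?key={api_key}",
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return extract_toxicity(json.load(resp))
```

`score_comment` needs a valid `PERSPECTIVE_API_KEY` and network access; the other helpers are pure functions and can be exercised offline with a saved response.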

Architecture

Usage

To run the crawler, you need to provide the following environment variables:

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| `AWS_ROLE_ARN` | AWS Role ARN | `None` | Optional |
| `AWS_WEB_IDENTITY_TOKEN_FILE` | AWS Web Identity Token File | `None` | Optional |
| `AWS_ACCESS_KEY_ID` | AWS Access Key ID | `None` | Optional |
| `AWS_SECRET_ACCESS_KEY` | AWS Secret Access Key | `None` | Optional |
| `AWS_S3_BUCKET` | AWS S3 Bucket | `None` | Required |
| `AWS_S3_BUCKET_PREFIX` | AWS S3 Bucket Prefix | `None` | Required |
| `LOG_LEVEL` | Log level | `INFO` | Optional |
| `PERSPECTIVE_API_KEY` | Perspective API Key | `None` | Required |
| `PERSPECTIVE_THRESHOLD` | Perspective Threshold | `0.5` | Required |
| `FILTER_TOXIC_COMMENTS` | Filter Toxic Comments | `True` | Required |
| `TWITTER_CONSUMER_KEY` | Twitter Consumer Key | `None` | Required |
| `TWITTER_CONSUMER_SECRET` | Twitter Consumer Secret | `None` | Required |
| `TWITTER_ACCESS_TOKEN` | Twitter Access Token | `None` | Required |
| `TWITTER_ACCESS_TOKEN_SECRET` | Twitter Access Token Secret | `None` | Required |
| `TWITTER_MAX_TWEETS` | Twitter Max Tweets or replies | `None` | Required |
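To make the table concrete, the following sketch shows one way these variables could be read and validated before a run. It is an assumption-laden illustration, not the crawler's own loader: `load_settings` is a hypothetical name, variables that have a default are treated as satisfied by that default, and the type conversions (float, bool, int) are guesses based on the default values shown.

```python
import os
from typing import Mapping, Optional

# Variables the table marks as Required.
REQUIRED = (
    "AWS_S3_BUCKET", "AWS_S3_BUCKET_PREFIX",
    "PERSPECTIVE_API_KEY", "PERSPECTIVE_THRESHOLD", "FILTER_TOXIC_COMMENTS",
    "TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET",
    "TWITTER_MAX_TWEETS",
)
# Variables the table marks as Optional.
OPTIONAL = (
    "AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "LOG_LEVEL",
)
# Non-None defaults listed in the table.
DEFAULTS = {
    "LOG_LEVEL": "INFO",
    "PERSPECTIVE_THRESHOLD": "0.5",
    "FILTER_TOXIC_COMMENTS": "True",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical loader: collect known variables and fail fast on gaps."""
    env = os.environ if env is None else env
    values = dict(DEFAULTS)
    for name in REQUIRED + OPTIONAL:
        if env.get(name):  # ignore unset and empty variables
            values[name] = env[name]

    missing = [name for name in REQUIRED if name not in values]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))

    # Assumed types, inferred from the defaults in the table.
    values["PERSPECTIVE_THRESHOLD"] = float(values["PERSPECTIVE_THRESHOLD"])
    values["FILTER_TOXIC_COMMENTS"] = (
        values["FILTER_TOXIC_COMMENTS"].strip().lower() in ("1", "true", "yes")
    )
    values["TWITTER_MAX_TWEETS"] = int(values["TWITTER_MAX_TWEETS"])
    return values
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.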

If `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` are both provided, the crawler uses them to assume a role and ignores `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
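The precedence rule above can be sketched as a small decision function. The name `select_aws_auth` and the returned labels are hypothetical, and the fallback when neither pair is set is an assumption, since the README does not say what happens in that case.

```python
from typing import Mapping


def select_aws_auth(env: Mapping[str, str]) -> str:
    """Hypothetical helper: decide which AWS credentials would be used."""
    # Role assumption wins whenever both role variables are present.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web_identity"
    # Otherwise fall back to static access keys, if both are set.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "access_keys"
    # Assumption: with neither pair set, defer to the SDK's default lookup.
    return "default_chain"
```

For example, an environment that sets all four variables yields `"web_identity"`, which mirrors the statement that the access keys are not used when a role can be assumed.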

Running

Prerequisites

Docker

With Docker installed and the environment variables above saved in a `.env` file, run the crawler with the following command:

docker run --env-file .env -d dougtrajano/toxicity-crawler:latest

License

The project is licensed under the Apache 2.0 License.

Releases (0.2.1)

0.2.1 (Dec 27, 2021)
What's Changed

Add wait_on_rate_limit in TwitterAPI by @DougTrajano in https://github.com/DougTrajano/toxicity-crawler/pull/29

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.2.0...0.2.1

0.2.0 (Dec 25, 2021)
What's Changed

Fixed an issue with tweet content in TwitterAPI by @DougTrajano

Added an exploratory notebook to test TwitterAPI by @DougTrajano

Bump pyyaml from 5.4.1 to 6.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/12

Bump google-api-python-client from 2.22.0 to 2.33.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/26

Bump metaflow from 2.3.6 to 2.4.7 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/28

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.1.4...0.2.0

0.1.4 (Sep 26, 2021)
Changes

Bump google-api-python-client from 2.21.0 to 2.22.0 #3

Fix Python path in Dockerfile

0.1.3 (Sep 24, 2021)
Changes

Updated GitHub Action

Fix error in Docker execution

0.1.2 (Sep 24, 2021)

Updated GitHub Action

0.1.1 (Sep 24, 2021)

Updated GitHub Action

0.1.0 (Sep 24, 2021)

Initial version

Owner

Douglas Trajano

Data Scientist
