The Toxicity Dataset

Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity — so you can skip the slog and get to work.

We hope you find this dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.

Need a larger dataset of toxicity to train your ML models, or toxicity in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and Safety companies around the world. Reach out to [email protected]!

Dataset

This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Open toxicity_en.csv to see a spreadsheet of all 1,000 English examples. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic, so the labels reflect annotator judgment rather than a fixed rubric.

Columns

text: the text of the comment
is_toxic: whether or not the comment is toxic
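
As a rough sketch of how the two columns might be read, the snippet below parses a CSV with a text and an is_toxic column and counts the labels. It uses only the Python standard library. The helper names (load_comments, label_counts) are hypothetical, and the label strings "Toxic" and "Not Toxic" in the sample are an assumption about how is_toxic is encoded; check the actual values in toxicity_en.csv before relying on them.

```python
import csv
import io
from collections import Counter


def load_comments(source):
    """Read (text, is_toxic) pairs from an open CSV file object.

    Assumes the header row names the columns `text` and `is_toxic`,
    as described in the Columns section.
    """
    reader = csv.DictReader(source)
    return [(row["text"], row["is_toxic"]) for row in reader]


def label_counts(rows):
    """Count how many comments carry each is_toxic value."""
    return Counter(label for _, label in rows)


# Tiny in-memory stand-in for toxicity_en.csv so the sketch runs anywhere.
# The label strings here are an assumption, not taken from the dataset.
sample = io.StringIO(
    "text,is_toxic\n"
    '"Thanks, this was really helpful.",Not Toxic\n'
    '"You are an idiot.",Toxic\n'
)

rows = load_comments(sample)
counts = label_counts(rows)
print(counts)
```

To run this against the real file, pass an open file handle instead of the sample, for example `open("toxicity_en.csv", newline="", encoding="utf-8")`. If the file matches the description above, the two label counts should come out at 500 each.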

Future

We'll be adding more languages and annotations over time (e.g., a severity ranking and categories for each comment).

If you're also interested in a dataset of profanity, check out our obscenity list.

Owner

Surge AI