In this project, we aim to achieve the task of predicting emojis from tweets. We aim to investigate the relationship between words and emojis.

Overview

Making Emojis More Predictable

by Karan Abrol, Karanjot Singh and Pritish Wadhwa, Natural Language Processing (CSE546) under the guidance of Dr. Shad Akhtar from Indraprastha Institute of Information Technology, Delhi.

Introduction

The advent of social media platforms like WhatsApp, Facebook (Meta) and Twitter, etc. has changed natural language conversations forever. Emojis are small ideograms depicting objects, people, and scenes (Cappallo et al., 2015). Emojis are used to complement short text messages with a visual enhancement and have become a de-facto standard for online communication. Our aim is to predict a single emoji that appears in the input tweets.

In this project, we aim to achieve the task of predicting emojis from tweets. We aim to investigate the relationship between words and emojis.

Project Pipeline Summary

We started off by collecting the data. The data was then thoroughly studied and preprocessed. Key features were also extracted at this stage. Due to computational restrictions, a subset of data was taken which was further divided into training, test- ing and validation split, such that the distribution of any class in any two sets were same. After this, various machine learning and deep learning models were applied on the data set and the results were generated and analysed.

Deployment

Emoji Prediction Website

Screenshots

Prediction Website1 Prediction Website2

Dataset

The data we used consists of a list of tweets associated with a single emoji, indexed by 20 labels for each of the 20 emojis. 5,00,000 Tweets by users in the United States, from October 2015 to Jan 2018, were retrieved using the Twitter API. The script for scraping this dataset was made available by the SemEval 2018 challenge. Due to computational limitations we merged the test and trial data, and further divided that into training, trial and test data with a split of 70:10:20. We maintained the label ratios for each emoji across the three sets to best reflect how frequently they are used in real life.

Models

  • Machine Learning Models:

    • Logistic Regression
    • K-Nearest Neighbours
    • Stochastic Gradient Descent
    • Random Forest Classifier
    • Naive Bayes
    • Adaboost Classifier
    • Support Vector Machine
  • Deep Learning Models:

    • RNN
    • LSTM
    • BiLSTM

Contact

For further queries feel free to reach out to following contributors.
Karan Abrol ([email protected])
Karanjot Singh ([email protected])
Pritish Wadhwa ([email protected])

Final Report

Final Report 1
Final Report 2
Final Report 3
Final Report 4
Final Report 5
Final Report 6
Final Report 7

Owner
Karanjot Singh
GDSC Lead @dsc-iiitd | Outside Collaborator @oppia | Flutter/ Kotlin Developer | Cloud Enthusiast | CSE Junior @IIIT-Delhi
Karanjot Singh
The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Graformer The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models Graformer (also named BridgeTransformer in t

22 Dec 14, 2022
Adversarial Examples for Extreme Multilabel Text Classification

Adversarial Examples for Extreme Multilabel Text Classification The code is adapted from the source codes of BERT-ATTACK [1], APLC_XLNet [2], and Atte

1 May 14, 2022
An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

Khalid Saifullah 37 Sep 05, 2022
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 902 Jan 06, 2023
Code for paper: An Effective, Robust and Fairness-awareHate Speech Detection Framework

BiQQLSTM_HS Code and data for paper: Title: An Effective, Robust and Fairness-awareHate Speech Detection Framework. Authors: Guanyi Mou and Kyumin Lee

Guanyi Mou 2 Dec 27, 2022
Translation to python of Chris Sims' optimization function

pycsminwel This is a locol minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

Gustavo Amarante 1 Mar 21, 2022
Few-shot Natural Language Generation for Task-Oriented Dialog

Few-shot Natural Language Generation for Task-Oriented Dialog This repository contains the dataset, source code and trained model for the following pa

172 Dec 13, 2022
An open source framework for seq2seq models in PyTorch.

pytorch-seq2seq Documentation This is a framework for sequence-to-sequence (seq2seq) models implemented in PyTorch. The framework has modularized and

International Business Machines 1.4k Jan 02, 2023
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Spanish Language Models 💃🏻 A repository part of the MarIA project. Corpora 📃 Corpora Number of documents Number of tokens Size (GB) BNE 201,080,084

Plan de Tecnologías del Lenguaje - Gobierno de España 203 Dec 20, 2022
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
Implementation of some unbalanced loss like focal_loss, dice_loss, DSC Loss, GHM Loss et.al

Implementation of some unbalanced loss for NLP task like focal_loss, dice_loss, DSC Loss, GHM Loss et.al Summary Here is a loss implementation reposit

121 Jan 01, 2023
A python package to fine-tune transformer-based models for named entity recognition (NER).

nerblackbox A python package to fine-tune transformer-based language models for named entity recognition (NER). Resources Source Code: https://github.

Felix Stollenwerk 13 Jul 30, 2022
LeBenchmark: a reproducible framework for assessing SSL from speech

LeBenchmark: a reproducible framework for assessing SSL from speech

11 Nov 30, 2022
A workshop with several modules to help learn Feast, an open-source feature store

Workshop: Learning Feast This workshop aims to teach users about Feast, an open-source feature store. We explain concepts & best practices by example,

Feast 52 Jan 05, 2023
Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
Задания КЕГЭ по информатике 2021 на Python

КЕГЭ 2021 на Python В этом репозитории мои решения типовых заданий КЕГЭ по информатике в 2021 году, БЕСПЛАТНО! Задания Взяты с https://inf-ege.sdamgia

8 Oct 13, 2022
Unofficial PyTorch implementation of Google AI's VoiceFilter system

VoiceFilter Note from Seung-won (2020.10.25) Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I've released this open-sour

MINDs Lab 881 Jan 03, 2023
Japanese NLP Library

Japanese NLP Library Back to Home Contents 1 Requirements 1.1 Links 1.2 Install 1.3 History 2 Libraries and Modules 2.1 Tokenize jTokenize.py 2.2 Cabo

Pulkit Kathuria 144 Dec 27, 2022