NLP-Project - Used an API to scrape 2,000 Reddit posts, performed NLP analysis, and built classification models with mixed success

Overview

Project 3: Web APIs & NLP

Problem Statement

How do r/Libertarian and r/Neoliberal differ on Biden post-inauguration?

The goal of the project is to see how these two ideologically similar subreddits perceive Biden and his term as president so far.

Success in this project isn't necessarily to develop a model that predicts accurately and consistently, but rather to convey which issues these two ideologies care about and the overall sentiment each subreddit has toward Biden. Because much of this work is focused on EDA, it is hard to judge the project on the individual models created; success is determined primarily by the EDA, visualization, and presentation sections of the project. That said, I still use a wide variety of models to determine the predictive value of the data I gathered.

Hypothesis: I believe the two subreddits will differ significantly in which issues they discuss and in their sentiment toward Biden, and that because of these differences a model can be built to accurately predict which subreddit a post belongs to. I will focus primarily on the differences between the subreddits in sentiment and word choice.

Data Collection

When collecting data, I didn't have the problem statement fully in mind before I started. I knew I wanted to do something political, specifically on the Biden administration post-inauguration, but I also wanted to experiment with different subreddits along the way, which made for an interesting situation.

I definitely learned a lot about the API by going into the data collection process blind, such as excluding "[deleted]" selftext to avoid deleted posts, and especially about using score and created_utc for gathering posts. The most difficult part was finding subreddits, checking whether they had enough posts, and constructing candidate problem statements around the viable ones.
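A minimal sketch of that collection step. The endpoint and field names here assume the Pushshift API (which exposes the score and created_utc fields mentioned above), and the helper names are illustrative, not the project's actual code:

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: Pushshift's submission search API.
API = "https://api.pushshift.io/reddit/search/submission"

def build_url(subreddit, before_utc, size=100):
    """Build one paged query; pass the oldest created_utc seen so far to page backward."""
    params = {
        "subreddit": subreddit,
        "before": before_utc,
        "size": size,
        "fields": "id,title,selftext,score,created_utc",
    }
    return API + "?" + urllib.parse.urlencode(params)

def is_usable(post):
    """Exclude deleted/removed posts before they enter the dataset."""
    return post.get("selftext") not in ("[deleted]", "[removed]")

# Usage (network call, not run here):
# with urllib.request.urlopen(build_url("Libertarian", 1617235200)) as resp:
#     posts = [p for p in json.load(resp)["data"] if is_usable(p)]
```

Looping on the smallest created_utc returned by each page is what makes it possible to walk back through a subreddit's history until enough posts are collected.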

In the end, I settled on r/neoliberal and r/Libertarian. There might have been easier options for model creation, but I found this pairing much more interesting, especially since I already browse r/neoliberal fairly frequently and was invested in the analysis.

Data Cleaning and EDA

I performed data cleaning and EDA across two separate notebooks: my logistic regression notebook and a notebook dedicated to EDA and data cleaning. Initially I had only the logistic regression notebook, but I wanted to do further analysis on the vectorized sets, so I created a dedicated notebook for that while still referencing the ideal vectorizer parameters I found in the logistic regression notebook.

Truth be told, I also did some cleaning in the data-gathering notebook, checking for duplicates and other oddities. I didn't find much; a few removed posts may have snuck into my analysis, but nothing that warranted changing my data-gathering techniques or discarding the data I had already collected.
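The duplicate check amounts to keeping the first occurrence of each post ID; a minimal, order-preserving sketch (the function name is illustrative):

```python
def drop_duplicates(posts, key="id"):
    """Keep the first occurrence of each post id, preserving scrape order."""
    seen, unique = set(), []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique
```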

EDA primarily involved finding words that stood out using count vectorizers; luckily, that was fairly easy, since the NLP process came naturally to me. I used lemmatizers for model creation but rarely for EDA, where I mostly used a basic tokenizer without added features. The bulk of my presentation comes directly from this EDA combined with domain knowledge: together they presented a narrative that resulted in the conclusions found in my presentation.
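The core of that word-frequency EDA can be sketched without sklearn; this stands in for the count-vectorizer comparison (the stopword list is a tiny illustrative placeholder, not the one actually used):

```python
import re
from collections import Counter

# Illustrative stopword set; the real analysis would use a fuller list.
STOP = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def top_words(texts, n=10):
    """Tokenize all posts from one subreddit and return the n most common words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP)
    return [word for word, _ in counts.most_common(n)]
```

Running this separately over each subreddit's corpus and comparing the two ranked lists is what surfaces the distinctive terms discussed in the conclusion.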

Another critical part of EDA was using sentiment analysis to find the difference in overall tone toward Biden between the two subreddits; this was especially important because it also ended up being part of my preprocessing. In my presentation, sentiment analysis both illustrated the differences in tone toward Biden and emphasized the amount of neutrality in the posts themselves, due primarily to posts being the titles of politically neutral news articles or tweets.
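A common tool for this step is a compound polarity score in [-1, 1] (VADER's SentimentIntensityAnalyzer, for example, works this way); the toy scorer below only illustrates the idea of mapping a post to one signed score, with a hand-made lexicon standing in for a real one:

```python
# Tiny illustrative lexicons; a real analysis would use a full sentiment lexicon.
POS = {"good", "great", "win", "support"}
NEG = {"bad", "fail", "corrupt", "crisis"}

def compound(text):
    """Return a signed polarity score in [-1, 1]; 0.0 means neutral."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = sum((w in POS) - (w in NEG) for w in words)
    return max(-1.0, min(1.0, score / len(words) ** 0.5))
```

A post made of neutral news-headline vocabulary scores 0.0, which is exactly the neutrality effect described above; the scores can then be appended as a numeric feature alongside the vectorized text.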

Preprocessing and Modelling

Modelling and preprocessing were both arduous: much of the work was memory-intensive, which meant a lot of time spent babysitting my laptop, but it ultimately provided valuable information about both the data I was investigating and the models I was using. I used bagging classifiers, logistic regression models, decision trees, random forest models, and boosted models, all with very mixed success. Logistic regression was the most consistent, especially on selftext-only posts. I had high expectations for random forests, decision trees, and boosted models, but they were not as consistently effective as logistic regression. Given the general underperformance, I will primarily discuss the logistic regression models in the logreg notebook, since I dedicated the most time to fine-tuning them and they performed more consistently than the others.

I had particular trouble predicting r/Neoliberal posts, while r/Libertarian posts I predicted at a decent rate; my specificity was much better than my sensitivity. When judging each model's predictive ability, I looked at selftext-only, title-only, and total-text inputs. This let me see what each model was good at predicting and also what data to gather the next time I interact with this API.
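To make the specificity-versus-sensitivity gap concrete, here is the calculation from a confusion matrix, treating r/Neoliberal as the positive class (a sketch; the real notebooks presumably use sklearn's metrics):

```python
def sens_spec(y_true, y_pred, positive="neoliberal"):
    """Sensitivity = recall on the positive class; specificity = recall on the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

A model that labels nearly everything "libertarian" will show exactly the pattern described: high specificity (libertarian posts caught) but low sensitivity (neoliberal posts missed).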

My preprocessing was very meticulous, particularly in experimenting with different vectorizer parameters for the logistic regression model. I adjusted parameters and added sentiment scores to try to improve the model's performance, and vectorizer settings such as binary were heavily tweaked depending on the X variable used (selftext, title, totaltext).
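The vectorizer-plus-logistic-regression setup can be sketched as a scikit-learn Pipeline; the parameter values and the toy corpus below are illustrative assumptions, not the tuned settings from the notebooks:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the scraped posts (labels = source subreddit).
X = [
    "end the fed audit the fed",
    "taxation is theft leave me alone",
    "carbon tax and free trade now",
    "expand nato support global institutions",
]
y = ["libertarian", "libertarian", "neoliberal", "neoliberal"]

# binary and ngram_range are the kind of vectorizer knobs tweaked per X variable.
pipe = Pipeline([
    ("vec", TfidfVectorizer(binary=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
```

Keeping the vectorizer inside the pipeline means each candidate parameter set (binary on/off, n-gram range, etc.) is refit per X variable without leaking test vocabulary into training.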

Conclusion

When analyzing this data, it is clear that there are four key takeaways from my modeling process and EDA stage.

  1. The overwhelming neutrality in the text (specifically the titles) can hide the true opinions of those in the subreddit.

  2. Predictive models are incredibly difficult to build for these subreddits in particular, and potentially for other political subreddits.

  3. The issues on which the subreddits differ most: r/Libertarian focuses more on surveillance and misinformation in the media, while r/Neoliberal is concerned with global politics, climate, and sitting senators.

  4. Both discuss taxes, COVID, the stimulus, China, and other current topics relatively often.

Sources Used

Britannica Definition of Libertarianism

Neoliberal Project

Stanford Philosophy: Libertarianism

Stanford Philosophy: Neoliberalism

Neoliberal Podcast: Defining Neoliberalism

r/Libertarian

r/neoliberal

Owner
Adam Muhammad Klesc
Hopeful data scientist. Currently in General Assembly and taking their data science immersive course!