NLP

T5 Project proposal

Topic Modeling and Clustering of News-Articles-and-Essays

Students:

Nasser Alshehri
Abdullah Bushnag
Abdulrhman Alqurashi

OVERVIEW

News comes in different formats, types, and categories. Here we use topic modeling and clustering to determine what each article is about, first based on its content and then based only on its title.

The process is as follows: load the data, keep only the fields we need, and clean the text (e.g., remove stopwords).

Build the bag of words over all documents (the shared vocabulary), then the bag of words for each document.

Vectorize the data and run the LDA model. Run the model on all the data and save the output to a DataFrame.

Run the clustering algorithm, save the data to CSV, and make the charts.
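The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses scikit-learn's `CountVectorizer`, `LatentDirichletAllocation`, and `KMeans` in place of gensim and nltk to keep the example short, and the sample sentences, the function name `topic_cluster`, and the choice of 3 topics and 3 clusters are all assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in documents; the real input would be the 'content'
# (or 'title') column of the news dataset.
docs = [
    "The team won the football match after a late goal",
    "The striker scored twice and the coach praised the team",
    "Parliament passed the budget after a long debate",
    "The minister defended the new tax law in parliament",
    "Scientists discovered a new planet orbiting a distant star",
    "The telescope captured images of the distant galaxy",
]


def topic_cluster(texts, n_topics=3, n_clusters=3, seed=0):
    """Bag of words -> LDA topic mixtures -> KMeans clusters, as a DataFrame."""
    # Clean and vectorize: lowercase, drop English stopwords, count words.
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    bow = vectorizer.fit_transform(texts)

    # LDA gives each document a distribution over n_topics topics.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(bow)

    # Cluster documents by their topic mixtures (an assumed design choice).
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    clusters = km.fit_predict(doc_topics)

    # Collect everything in one DataFrame.
    out = pd.DataFrame(doc_topics, columns=[f"topic_{i}" for i in range(n_topics)])
    out.insert(0, "text", list(texts))
    out["dominant_topic"] = doc_topics.argmax(axis=1)
    out["cluster"] = clusters
    return out


result = topic_cluster(docs)

# With no path, to_csv returns the CSV as a string; pass a file path to save it.
csv_text = result.to_csv(index=False)
print(result[["text", "dominant_topic", "cluster"]])
```

Clustering on the topic mixtures rather than on the raw word counts is only one option; the clustering step could equally run on the vectorized text directly.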

DATA

The data is acquired from: https://components.one/datasets/all-the-news-articles-dataset

The raw data contains 12 features: id, title, author, date, content, year, month, publication, category, digital, section, url.

The features we are using are only the 'title' and 'content'.

The remaining features are dropped.

The 'title' is the headline of the news article or essay. The 'content' is the body text of the article or essay.
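Keeping only these two features could look roughly like the sketch below. The in-memory sample rows, the helper name `load_articles`, and the decision to drop rows with a missing title or content are assumptions for illustration; with the real dataset, the downloaded CSV path would be passed instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical two-row sample with the same 12 columns as the raw data.
SAMPLE_CSV = """id,title,author,date,content,year,month,publication,category,digital,section,url
1,Markets rally,A. Writer,2017-01-02,Stocks rose sharply on Monday.,2017,1,Example Post,business,1,markets,http://example.com/1
2,,B. Writer,2017-01-03,Text with no headline.,2017,1,Example Post,newspaper,0,,http://example.com/2
"""


def load_articles(source):
    """Read the CSV and keep only the 'title' and 'content' columns."""
    df = pd.read_csv(source, usecols=["title", "content"])
    # Assumption: rows missing either field are not useful for modeling.
    return df.dropna(subset=["title", "content"]).reset_index(drop=True)


articles = load_articles(io.StringIO(SAMPLE_CSV))
print(articles)
```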

TOOLS

Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, NLTK, gensim
