offline-training-pipeline

This is the offline-training-pipeline for our project.

The system follows an offline-training, online-prediction design: the model is trained and updated offline, and the trained model is then used to serve predictions online.
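The split between the two stages can be sketched as follows. This is a minimal illustration, not the project's actual code: the word-count "model", the function names `train_offline` and `load_predictor`, and the `model.json` artifact are all hypothetical stand-ins chosen so the example runs with the standard library alone. The point is only that the offline stage writes an artifact to disk and the online stage loads it and answers requests without retraining.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def train_offline(examples, artifact_path):
    """Offline stage: fit on labelled examples and persist the result to disk."""
    # Stand-in "model": per-label word counts (hypothetical, not DistilBERT).
    counts = {"fake": Counter(), "real": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    artifact_path.write_text(json.dumps(counts))


def load_predictor(artifact_path):
    """Online stage: load the saved artifact once and return a predict function."""
    counts = json.loads(artifact_path.read_text())

    def predict(text):
        tokens = text.lower().split()
        # Score each label by how often its training words appear in the input.
        scores = {
            label: sum(words.get(token, 0) for token in tokens)
            for label, words in counts.items()
        }
        return max(scores, key=scores.get)

    return predict


# Offline: train and save. Online: load and predict.
artifact = Path(tempfile.mkdtemp()) / "model.json"
train_offline(
    [
        ("miracle cure kills virus overnight", "fake"),
        ("health ministry reports new cases", "real"),
    ],
    artifact,
)
predict = load_predictor(artifact)
label = predict("ministry reports cases")
```

Because the two stages only share the artifact file, the offline stage can be re-run on fresh data and the online stage picks up the new model the next time it loads the file.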

We took the pre-trained DistilBERT language model and fine-tuned it for the downstream fake news classification task.
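A fine-tuning step of this kind might look like the sketch below, which uses the Hugging Face `transformers` library with PyTorch. Several things here are assumptions rather than details taken from this repository: the `distilbert-base-uncased` checkpoint, the `real`/`fake` label names, and the helper names `encode_labels`, `build_model`, and `fine_tune_step`. The heavy libraries are imported inside the functions so the helpers can be defined without loading them.

```python
# Assumed label set; adjust to match the labels used in the training data.
LABEL2ID = {"real": 0, "fake": 1}
ID2LABEL = {index: name for name, index in LABEL2ID.items()}


def encode_labels(labels):
    """Map string labels to the integer class ids the model expects."""
    return [LABEL2ID[label] for label in labels]


def build_model(checkpoint="distilbert-base-uncased"):
    """Load a pre-trained DistilBERT with a fresh two-class classification head."""
    # Imported lazily: requires the `transformers` package and a model download.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABEL2ID),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def fine_tune_step(model, tokenizer, optimizer, texts, labels):
    """Run one gradient update on a batch of texts and return the loss value."""
    import torch

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    batch["labels"] = torch.tensor(encode_labels(labels))
    model.train()
    loss = model(**batch).loss  # cross-entropy over the two classes
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

A typical use would be to call `build_model()`, create an optimizer such as `torch.optim.AdamW(model.parameters(), lr=5e-5)` (the learning rate is also an assumption), and call `fine_tune_step` once per batch.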

The initial fine-tuning dataset comes from the CONSTRAINT workshop at AAAI-21. For routine offline training and updates in the future, we will adopt FakeNewsNet, a data repository with news content, social context, and spatiotemporal information for studying fake news on social media. FakeNewsNet offers up-to-date datasets and is updated on a regular basis. By retraining the model offline on this data, we hope to track the latest trends in popular fake news, cover broader fake news topics, and achieve better performance in online prediction.

References:

@misc{patwa2020fighting, title={Fighting an Infodemic: COVID-19 Fake News Dataset}, author={Parth Patwa and Shivam Sharma and Srinivas PYKL and Vineeth Guptha and Gitanjali Kumari and Md Shad Akhtar and Asif Ekbal and Amitava Das and Tanmoy Chakraborty}, year={2020}, eprint={2011.03327}, archivePrefix={arXiv}, primaryClass={cs.CL} }

@article{sanh2019distilbert, title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter}, author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas}, journal={arXiv preprint arXiv:1910.01108}, year={2019} }

@article{shu2020fakenewsnet, title={Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media}, author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan}, journal={Big data}, volume={8}, number={3}, pages={171--188}, year={2020}, publisher={Mary Ann Liebert, Inc., publishers 140 Huguenot Street, 3rd Floor New~…} }
