A naive Bayes model for cancer classification using a set of documents

Overview

Naivebayes text classifcation model for cancer and noncancer documents

Author: Alex King


  1. Purpose
  2. Requirements/files included
  3. How to use

1. Purpose

The Purpose of this program is to read in from csv files containing two columns:
                    Document | classifcation
                    xxxxxx   | cancer/nocancer
                    xxxxxx   | cancer/nocancer
                    xxxxxx   | cancer/nocancer

This program uses the data to read into classes containing each documents one file is used as the training set, and the other as the testing set. Each set goes through the same tokenization. From there one is trained and the other is tested.

2. Requirements/files used

* python3 * numpy library - for calculating log * pandas library - for reading in csv files * main.py and naivesbayes.py * stopwords.txt - list of stop words * Scoring.docx - list of scoring for precsion, Recall, F-score

3. How to use

This program has 3 modes of operation for tokenizing your sets:
                $python3 main.py -train 1 -test 1 

This first command will execute std tokenization on training set 1 and test set 1. To change which training set just change the 1 into a 2.

                $python3 main.py -train 2 -test 1 

#NOTE do not change testing set number leave it as 1 it was intended for multiple testing sets

For binary:

                $python3 main.py -train # -test 1 -b

For stopwords:

                $python3 main.py -train # -test 1 -s

For both stopwords and binary:

                $python3 main.py -train # -test 1 -b -s
Owner
Alex W King
Alex W King
Cohort Intelligence used to solve various mathematical functions

Cohort-Intelligence-for-Mathematical-Functions About Cohort Intelligence : Cohort Intelligence ( CI ) is an optimization technique. It attempts to mod

Aayush Khandekar 2 Oct 25, 2021
Banpei is a Python package of the anomaly detection.

Banpei Banpei is a Python package of the anomaly detection. Anomaly detection is a technique used to identify unusual patterns that do not conform to

Hirofumi Tsuruta 282 Jan 03, 2023
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

imbalanced-learn imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-cla

6.2k Jan 01, 2023
Tools for Optuna, MLflow and the integration of both.

HPOflow - Sphinx DOC Tools for Optuna, MLflow and the integration of both. Detailed documentation with examples can be found here: Sphinx DOC Table of

Telekom Open Source Software 17 Nov 20, 2022
Tribuo - A Java machine learning library

Tribuo - A Java prediction library (v4.1) Tribuo is a machine learning library in Java that provides multi-class classification, regression, clusterin

Oracle 1.1k Dec 28, 2022
This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev

MLProject_01 This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev Context Dataset English question data set file F

Hadi Nakhi 1 Dec 18, 2021
In this Repo a simple Sklearn Model will be trained and pushed to MLFlow

SKlearn_to_MLFLow In this Repo a simple Sklearn Model will be trained and pushed to MLFlow Install This Repo is based on poetry python3 -m venv .venv

1 Dec 13, 2021
moDel Agnostic Language for Exploration and eXplanation

moDel Agnostic Language for Exploration and eXplanation Overview Unverified black box model is the path to the failure. Opaqueness leads to distrust.

Model Oriented 1.2k Jan 04, 2023
Lingtrain Alignment Studio is an ML based app for texts alignment on different languages.

Lingtrain Alignment Studio Intro Lingtrain Alignment Studio is the ML based app for accurate texts alignment on different languages. Extracts parallel

Sergei Averkiev 186 Jan 03, 2023
Python module for performing linear regression for data with measurement errors and intrinsic scatter

Linear regression for data with measurement errors and intrinsic scatter (BCES) Python module for performing robust linear regression on (X,Y) data po

Rodrigo Nemmen 56 Sep 27, 2022
using Machine Learning Algorithm to classification AppleStore application

AppleStore-classification-with-Machine-learning-Algo- using Machine Learning Algorithm to classification AppleStore application. the first step : 1: p

Mohammed Hussien 2 May 02, 2022
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

TD Ameritrade 2.5k Jan 06, 2023
A python library for easy manipulation and forecasting of time series.

Time Series Made Easy in Python darts is a python library for easy manipulation and forecasting of time series. It contains a variety of models, from

Unit8 5.2k Jan 04, 2023
This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment to test the algorithm

Martin Huber 59 Dec 09, 2022
Climin is a Python package for optimization, heavily biased to machine learning scenarios

climin climin is a Python package for optimization, heavily biased to machine learning scenarios distributed under the BSD 3-clause license. It works

Biomimetic Robotics and Machine Learning at Technische Universität München 177 Sep 02, 2022
GAM timeseries modeling with auto-changepoint detection. Inspired by Facebook Prophet and implemented in PyMC3

pm-prophet Pymc3-based universal time series prediction and decomposition library (inspired by Facebook Prophet). However, while Faceook prophet is a

Luca Giacomel 314 Dec 25, 2022
Winning solution for the Galaxy Challenge on Kaggle

Winning solution for the Galaxy Challenge on Kaggle

Sander Dieleman 483 Jan 02, 2023
A python fast implementation of the famous SVD algorithm popularized by Simon Funk during Netflix Prize

⚡ funk-svd funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Neflix Prize co

Geoffrey Bolmier 171 Dec 19, 2022
A Tools that help Data Scientists and ML engineers train and deploy ML models.

Domino Research This repo contains projects under active development by the Domino R&D team. We build tools that help Data Scientists and ML engineers

Domino Data Lab 73 Oct 17, 2022
Test symmetries with sklearn decision tree models

Test symmetries with sklearn decision tree models Setup Begin from an environment with a recent version of python 3. source setup.sh Leave the enviro

Rupert Tombs 2 Jul 19, 2022