Full-featured Decision Trees and Random Forests learner.

Last update: Aug 15, 2022

Overview

CID3

CID3 is a full-featured Decision Tree and Random Forest learner. It can save trees or forests to disk for later use, query saved trees and random forests, and fill in an unlabeled file with the predicted classes. Documentation is not yet available, but the program options can be listed with the command:

% java -jar cid3.jar -h

usage: java -jar cid3.jar
 -a,--analysis <name>    show causal analysis report
 -c,--criteria <name>    input criteria: c[Certainty], e[Entropy], g[Gini]
 -f,--file <name>        input file
 -h,--help               print this message
 -o,--output <name>      output file
 -p,--partition          partition train/test data
 -q,--query <type>       query model, enter: t[Tree] or r[Random forest]
 -r,--forest <amount>    create random forest, enter # of trees
 -s,--save               save tree/random forest
 -t,--threads <amount>   maximum number of threads (default is 500)
 -v,--validation         create 10-fold cross-validation
 -ver,--version          version

List of features

Uses a new Certainty formula as the splitting criterion.
Provides a causal analysis report, which shows how some attribute values cause a particular classification.
Creates full trees, showing error rates for train and test data, attribute importance, causes, and false positives/negatives.
If no test data is provided, it can split the training dataset into 80% for training and 20% for testing.
Creates random forests, showing error rates for train and test data, attribute importance, causes, and false positives/negatives. Random forests are built in parallel to reduce training time.
Creates 10-fold cross-validation for trees and random forests, showing error rates, their mean and standard error, and false positives/negatives. Cross-validation folds are also created in parallel.
Saves trees and random forests to disk as compressed files (e.g. model.tree, model.forest).
Queries trees and random forests from saved files. Queries can contain missing values: just enter the character “?”.
Makes predictions and fills out case files with those predictions, using either a single tree or a random forest.
Imputes missing values in train and test data. Continuous attributes are imputed with the mean value; discrete attributes are imputed with the mode, that is, the most frequent value.
Supports ignoring attributes: in the .names file, set the attribute type to ignore.
Supports three splitting criteria: Certainty, Entropy and Gini. If no criterion is specified, Certainty is used.
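
The imputation rule in the list above (mean for continuous attributes, mode for discrete ones) can be illustrated with a short sketch. This is not CID3's code: the class and method names are hypothetical, the use of "?" as the missing-value marker in data columns is an assumption carried over from the query syntax, and ties for the mode are broken by first occurrence, which CID3 may or may not do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CID3: illustrates mean/mode imputation.
final class ImputationSketch {

    // Assumed marker for a missing value.
    static final String MISSING = "?";

    // Replace missing entries of a continuous column with the mean of the known values.
    static List<String> imputeMean(List<String> column) {
        double sum = 0;
        int known = 0;
        for (String v : column) {
            if (!MISSING.equals(v)) {
                sum += Double.parseDouble(v);
                known++;
            }
        }
        if (known == 0) return new ArrayList<>(column); // nothing to estimate from
        String mean = String.valueOf(sum / known);
        List<String> out = new ArrayList<>();
        for (String v : column) out.add(MISSING.equals(v) ? mean : v);
        return out;
    }

    // Replace missing entries of a discrete column with the most frequent known value.
    static List<String> imputeMode(List<String> column) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : column) {
            if (!MISSING.equals(v)) counts.merge(v, 1, Integer::sum);
        }
        String mode = null;
        int best = 0;
        // Strict ">" keeps the first value seen when two values are equally frequent.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                mode = e.getKey();
            }
        }
        if (mode == null) return new ArrayList<>(column); // every value was missing
        List<String> out = new ArrayList<>();
        for (String v : column) out.add(MISSING.equals(v) ? mode : v);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(imputeMean(List.of("22", "?", "38")));
        System.out.println(imputeMode(List.of("S", "C", "?", "S")));
    }
}
```

In this sketch the statistics are computed from the column being imputed; whether CID3 reuses the training statistics when imputing test data is not stated in this README.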

Example run with titanic dataset

% java -jar cid3.jar -f titanic

CID3 [Version 1.1]              Saturday October 30, 2021 06:34:11 AM
------------------
[ ✓ ] Read data: 891 cases for training. (10 attributes)
[ ✓ ] Decision tree created.

Rules: 276
Nodes: 514

Importance Cause   Attribute Name
---------- -----   --------------
      0.57   yes ············ Sex
      0.36   yes ········· Pclass
      0.30   yes ··········· Fare
      0.28   yes ······· Embarked
      0.27   yes ·········· SibSp
      0.26   yes ·········· Parch
      0.23    no ············ Age


[==== TRAIN DATA ====] 

Correct guesses:  875
Incorrect guesses: 16 (1.8%)

# Of Cases  False Pos  False Neg   Class
----------  ---------  ---------   -----
       549         14          2 ····· 0
       342          2         14 ····· 1

Time: 0:00:00
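
As a point of reference for the Entropy and Gini criteria, the class counts in the table above (549 cases of class 0 and 342 of class 1) are enough to compute the impurity of the root node before any split. The sketch below uses the textbook definitions of Shannon entropy and Gini impurity. The class and method names are illustrative and are not taken from the CID3 source, and the Certainty formula is not reproduced because it has not been published yet.

```java
// Hypothetical helper, not part of CID3: textbook impurity measures from class counts.
final class ImpuritySketch {

    // Shannon entropy in bits: -sum(p_i * log2(p_i)) over the non-empty classes.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue; // 0 * log(0) is taken as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gini impurity: 1 - sum(p_i^2).
    static double gini(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0.0;
        double sumSquares = 0.0;
        for (int c : counts) {
            double p = (double) c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    public static void main(String[] args) {
        int[] titanicTrain = {549, 342}; // class 0 and class 1 counts from the run above
        System.out.printf("Entropy: %.4f bits%n", entropy(titanicTrain));
        System.out.printf("Gini:    %.4f%n", gini(titanicTrain));
    }
}
```

For the Titanic counts this gives roughly 0.96 bits of entropy and a Gini impurity of about 0.47, both close to the two-class maximums of 1.0 and 0.5. A split is chosen by how much it reduces such a measure across the child nodes.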

Requirements

CID3 requires JDK 15 or higher.

The data format is similar to that of C4.5 and C5.0. Data files are in CSV format and can be split into two separate files, such as titanic.data and titanic.test. The class attribute must be the last column of the file. The other required file is the "names" file, named like titanic.names, which contains the names and types of the attributes. Its first line lists the possible values of the class attribute; this line can be left empty, with just a dot (.). Below is an example of the titanic.names file:

0,1.  
PassengerId: ignore.  
Pclass: 1,2,3.  
Sex : male,female.  
Age: continuous.  
SibSp: discrete.  
Parch: discrete.  
Ticket: ignore.  
Fare: continuous.  
Cabin: ignore.  
Embarked: discrete.
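
To make the layout of the names file concrete, here is a minimal sketch of how such lines could be read. It is based only on the example above and is not CID3's parser: the class names, the "enumerated" label for attributes that list their values, and the tolerance for spaces around the colon are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CID3: reads lines in the names format shown above.
final class NamesFileSketch {

    // One attribute declaration from the names file.
    static final class Attribute {
        final String name;
        final String type;         // "continuous", "discrete", "ignore" or "enumerated" (our label)
        final List<String> values; // listed values; empty unless type is "enumerated"

        Attribute(String name, String type, List<String> values) {
            this.name = name;
            this.type = type;
            this.values = values;
        }
    }

    // Trim whitespace and drop the terminating dot, if present.
    static String stripDot(String line) {
        String s = line.trim();
        return s.endsWith(".") ? s.substring(0, s.length() - 1).trim() : s;
    }

    // Split a comma-separated list, ignoring empty entries.
    static List<String> splitValues(String s) {
        List<String> out = new ArrayList<>();
        for (String v : s.split(",")) {
            if (!v.trim().isEmpty()) out.add(v.trim());
        }
        return out;
    }

    // First line: possible class values; a lone dot yields an empty list.
    static List<String> parseClasses(String firstLine) {
        return splitValues(stripDot(firstLine));
    }

    // Remaining lines: "name: type." or "name: value1,value2,...".
    static Attribute parseAttribute(String line) {
        String body = stripDot(line);
        int colon = body.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Expected 'name: type.' but got: " + line);
        }
        String name = body.substring(0, colon).trim();
        String spec = body.substring(colon + 1).trim();
        if (spec.equals("continuous") || spec.equals("discrete") || spec.equals("ignore")) {
            return new Attribute(name, spec, List.of());
        }
        return new Attribute(name, "enumerated", splitValues(spec));
    }

    public static void main(String[] args) {
        System.out.println(parseClasses("0,1."));
        Attribute sex = parseAttribute("Sex : male,female.");
        System.out.println(sex.name + " -> " + sex.type + " " + sex.values);
        Attribute age = parseAttribute("Age: continuous.");
        System.out.println(age.name + " -> " + age.type);
    }
}
```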

Example of causal analysis

% java -jar cid3.jar -f adult -a education

From this example we can see that the attribute "education" is a cause, based on the certainty-raising inequality. Once we know that it is a cause, we compare the causal certainties of its values. When its value is "Doctorate", it causes earnings to be greater than $50,000 with a probability of 0.73. A paper will soon be published with all the formulas used to calculate the Certainty for splitting nodes and the certainty-raising inequality used for causal analysis.

Importance Cause   Attribute Name
---------- -----   --------------
      0.56   yes ······ education

Report of causal certainties
----------------------------

[ Attribute: education ]

    1st-4th --> <=50K  (0.97)

    5th-6th --> <=50K  (0.95)

    7th-8th --> <=50K  (0.94)

    9th --> <=50K  (0.95)

    10th --> <=50K  (0.94)

    11th --> <=50K  (0.95)

    12th --> <=50K  (0.93)

    Assoc-acdm --> <=50K  (0.74)

    Assoc-voc --> <=50K  (0.75)

    Bachelors --> Non cause.

    Doctorate --> >50K  (0.73)

    HS-grad --> <=50K  (0.84)

    Masters --> >50K  (0.55)

    Preschool --> <=50K  (0.99)

    Prof-school --> >50K  (0.74)

    Some-college --> <=50K  (0.81)

Releases(v1.2.4)

v1.2.4(Apr 28, 2022)

Fixed a bug when entering an attribute name for causal analysis report.
cid3.jar(4.72 MB)
v1.2.3(Mar 10, 2022)

Implemented progress animation when option -s is invoked.
cid3.jar(4.72 MB)
v1.2.2(Mar 2, 2022)

Added progress animation to the analysis report.
cid3.jar(4.72 MB)
v1.2.1(Jan 21, 2022)

Replaced a problematic character.
cid3.jar(4.72 MB)
v1.2(Nov 9, 2021)

This version includes the correct calculation of causal certainties and the certainty-raising inequality. The analysis report is now also sorted by attribute values.
cid3.jar(4.72 MB)
v1.1.5(Nov 7, 2021)

Correctly implemented the causal analysis, using the certainty-raising inequality and the causal certainties.
cid3.jar(4.72 MB)
v1.1.3(Nov 7, 2021)

Implemented causes for specific attribute values.
cid3.jar(4.72 MB)
v1.1.2(Nov 6, 2021)

Minor patch.
cid3.jar(4.72 MB)
v1.1.1(Oct 31, 2021)

This is a hurried patch to fix a problem in the causal analysis report. The report now works as intended.
cid3.jar(4.72 MB)
v1.1(Oct 30, 2021)

Release v1.1 contains many new features and fixes. Implemented a report of causal certainties, which shows how certain attribute values cause a particular classification.
cid3.jar(4.72 MB)
v1.0.7(Oct 28, 2021)

Code cleanup and new features implemented. Querying a tree now checks for invalid input and asks for correct input. This will be the last patch until version v1.1.
cid3.jar(4.72 MB)
v1.0.6(Oct 28, 2021)

Correctly aligned text on console.
cid3.jar(4.72 MB)
v1.0.5(Oct 27, 2021)

Reintroduced attribute importance for Entropy and Gini criteria.
cid3.jar(5.62 MB)
v1.0.4(Oct 27, 2021)

Removed causal analysis from Entropy and Gini criteria. It only makes sense with Certainty.
cid3.jar(5.62 MB)
v1.0.3(Oct 23, 2021)

Rolled back the parallel tests of Random Forests. It is much faster now.
cid3.jar(5.62 MB)
v1.0.2(Oct 23, 2021)

Minor changes.
cid3.jar(5.62 MB)
v1.0.1(Oct 23, 2021)

Random Forests are now tested in parallel.
cid3.jar(5.62 MB)
v1.0(Oct 18, 2021)

Releasing version v1.0
cid3.jar(5.62 MB)

Owner

Alejandro Penate-Diaz
