Winning solution for the Galaxy Challenge on Kaggle

kaggle-galaxies

Winning solution for the Galaxy Challenge on Kaggle (http://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

Documentation about the method and the code is available in doc/documentation.pdf. Information on how to generate the solution file can also be found below.

Generating the solution

Install the dependencies

Instructions for installing Theano and getting it to run on the GPU can be found in the Theano documentation. It should be possible to install NumPy, SciPy, scikit-image and pandas using pip or easy_install. To install pylearn2, simply run:

git clone git://github.com/lisa-lab/pylearn2.git

and add the resulting directory to your PYTHONPATH.
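
Before moving on, it can help to confirm that every dependency is importable, including pylearn2 via PYTHONPATH. The following is a minimal sketch and not part of the repository; the helper name `find_missing` is made up for illustration, and it assumes a Python 3 interpreter, which may differ from the one the repository's scripts expect.

```python
import importlib.util

# Import names of the required packages (scikit-image is imported as "skimage").
REQUIRED_MODULES = ["theano", "numpy", "scipy", "skimage", "pandas", "pylearn2"]


def find_missing(modules):
    """Return the module names that cannot be located on the current Python path."""
    missing = []
    for name in modules:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised packages.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```

If pylearn2 shows up as missing, the cloned directory is probably not on PYTHONPATH yet.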

The optional dependencies listed in the documentation don't have to be installed to reproduce the winning solution: the generated data files are already provided, so they don't have to be regenerated (but of course you can if you want to). If you want to install them, please refer to their respective documentation.

Download the code

To download the code, run:

git clone git://github.com/benanne/kaggle-galaxies.git

A bunch of data files (extracted SExtractor parameters, ID files, training labels in NumPy format, ...) are also included. I decided to include these since generating them is a bit tedious and requires extra dependencies. It's about 20MB in total, so depending on your connection speed it could take a minute. Cloning the repository should also create the necessary directory structure (see doc/documentation.pdf for more info).

Download the training data

Download the data files from Kaggle. Place and extract the files in the following locations:

  • data/raw/training_solutions_rev1.csv
  • data/raw/images_train_rev1/*.jpg
  • data/raw/images_test_rev1/*.jpg

Note that the zip file with the training images is called images_training_rev1.zip, but they should go in a directory called images_train_rev1. This is just for consistency.
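
A quick way to catch a misplaced file or a wrongly named image directory is to check the layout listed above before running anything else. This is a small sketch under the assumption that it is run from the repository root; `check_layout` is a hypothetical helper, not a script shipped with the code.

```python
import glob
import os


def check_layout(root="."):
    """Return a list of human-readable problems with the data/raw layout."""
    problems = []
    csv_path = os.path.join(root, "data", "raw", "training_solutions_rev1.csv")
    if not os.path.isfile(csv_path):
        problems.append("missing file: " + csv_path)
    # Both image directories must contain at least one .jpg file.
    for dirname in ("images_train_rev1", "images_test_rev1"):
        pattern = os.path.join(root, "data", "raw", dirname, "*.jpg")
        if not glob.glob(pattern):
            problems.append("no .jpg files matching: " + pattern)
    return problems


if __name__ == "__main__":
    for problem in check_layout():
        print(problem)
```

An empty result means the three locations exist; it does not verify that all images were extracted.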

Create data files

This step may be skipped. The necessary data files have been included in the git repository. Nevertheless, if you wish to regenerate them (or make changes to how they are generated), here's how to do it.

  • create data/train_ids.npy by running python create_train_ids_file.py.
  • create data/test_ids.npy by running python create_test_ids_file.py.
  • create data/solutions_train.npy by running python convert_training_labels_to_npy.py.
  • create data/pysex_params_extra_*.npy.gz by running python extract_pysex_params_extra.py.
  • create data/pysex_params_gen2_*.npy.gz by running python extract_pysex_params_gen2.py.
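
If you only want to regenerate the files that are missing, the steps above could be driven by a small wrapper like the one below. This is a sketch, not part of the repository: `pending_steps` and `run_pending` are hypothetical names, and the script is assumed to be run from the repository root. Note that a wildcard pattern counts as done as soon as any one matching file exists.

```python
import glob
import subprocess
import sys

# (output file or pattern, script that creates it), in the order listed above.
STEPS = [
    ("data/train_ids.npy", "create_train_ids_file.py"),
    ("data/test_ids.npy", "create_test_ids_file.py"),
    ("data/solutions_train.npy", "convert_training_labels_to_npy.py"),
    ("data/pysex_params_extra_*.npy.gz", "extract_pysex_params_extra.py"),
    ("data/pysex_params_gen2_*.npy.gz", "extract_pysex_params_gen2.py"),
]


def pending_steps(steps, exists=lambda pattern: bool(glob.glob(pattern))):
    """Return the scripts whose output pattern matches no existing file."""
    return [script for pattern, script in steps if not exists(pattern)]


def run_pending(steps):
    """Run each pending script with the current interpreter, stopping on failure."""
    for script in pending_steps(steps):
        subprocess.check_call([sys.executable, script])
```

Calling `run_pending(STEPS)` in a fresh clone should do nothing, since the generated files are already included.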

Copy data to RAM

Copy the train and test images to /dev/shm by running:

python copy_data_to_shm.py

If you don't want to do this, you'll need to modify the realtime_augmentation.py file in a few places. Please refer to the documentation for more information.
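
For readers unfamiliar with /dev/shm: it is a RAM-backed filesystem on Linux, so copying the images there avoids disk reads during training. The sketch below only illustrates the idea; it is not the contents of copy_data_to_shm.py, and the destination naming (source directory name directly under /dev/shm) is an assumption that may not match what realtime_augmentation.py expects.

```python
import os
import shutil


def copy_tree_to_ram(src_dir, ram_root="/dev/shm"):
    """Copy src_dir into ram_root and return the destination path.

    If the destination directory already exists it is left untouched.
    """
    name = os.path.basename(os.path.normpath(src_dir))
    dest = os.path.join(ram_root, name)
    if os.path.isdir(dest):
        return dest
    shutil.copytree(src_dir, dest)
    return dest
```

For example, `copy_tree_to_ram("data/raw/images_train_rev1")` would create /dev/shm/images_train_rev1 under these assumptions.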

Train the networks

To train the best single model, run:

python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py

On a GeForce GTX 680, this took about 67 hours to run to completion. The prediction file generated by this script, predictions/final/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.csv.gz, should get you a score that's good enough to land in the #1 position (without any model averaging). You can similarly run the other try_*.py scripts to train the other models I used in the winning ensemble.

If you have more than 2GB of GPU memory, I recommend disabling Theano's garbage collector with allow_gc=False in your .theanorc file or in the THEANO_FLAGS environment variable, for a nice speedup. Please refer to the Theano documentation for more information on how to get the most out of Theano's GPU support.
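
THEANO_FLAGS is a comma-separated list of key=value pairs that Theano reads when it is first imported. As an alternative to editing .theanorc, the flag can be set from Python before the import. This is a minimal sketch; `with_flag` is a hypothetical helper that keeps any flags already present in the environment.

```python
import os


def with_flag(existing, key, value):
    """Return a THEANO_FLAGS string with key=value set, replacing any earlier value."""
    parts = [p.strip() for p in existing.split(",")]
    parts = [p for p in parts if p and not p.startswith(key + "=")]
    parts.append("{}={}".format(key, value))
    return ",".join(parts)


# This must run before the first "import theano" in the process.
os.environ["THEANO_FLAGS"] = with_flag(
    os.environ.get("THEANO_FLAGS", ""), "allow_gc", "False"
)
```

Setting the variable in the shell before launching a try_*.py script has the same effect and requires no code changes.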

Generate augmented predictions

To generate predictions which are averaged across multiple transformations of the input, run:

python predict_augmented_npy_maxout2048_extradense.py

This takes just over 4 hours on a GeForce GTX 680, and will create two files: predictions/final/augmented/valid/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz and predictions/final/augmented/test/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz. You can similarly run the corresponding predict_augmented_npy_*.py files for the other models you trained.
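
The idea of averaging predictions across transformations can be illustrated with a small, dependency-free sketch. It is not the code in predict_augmented_npy_*.py: the set of transformations here (four 90-degree rotations, each with and without a horizontal flip) is an assumption chosen for simplicity, and `predict` stands in for a trained model. The transformations actually used are described in the documentation.

```python
def rotate90(image):
    """Rotate a 2D list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D list-of-rows image left to right."""
    return [row[::-1] for row in image]


def transformed_views(image):
    """Yield 8 views: 4 rotations, each with and without a horizontal flip."""
    current = image
    for _ in range(4):
        yield current
        yield flip_horizontal(current)
        current = rotate90(current)


def predict_augmented(predict, image):
    """Average predict(view) element-wise over all transformed views."""
    outputs = [predict(view) for view in transformed_views(image)]
    n = float(len(outputs))
    return [sum(column) / n for column in zip(*outputs)]
```

Because galaxy images have no preferred orientation, each view is an equally valid input, and averaging tends to reduce the variance of the prediction.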

Blend augmented predictions

To generate blended prediction files from all the models for which you generated augmented predictions, run:

python ensemble_predictions_npy.py

The script checks which files are present in predictions/final/augmented/test/ and uses this to determine the models for which predictions are available. It will create three files:

  • predictions/final/blended/blended_predictions_uniform.npy.gz: uniform blend.
  • predictions/final/blended/blended_predictions.npy.gz: weighted linear blend.
  • predictions/final/blended/blended_predictions_separate.npy.gz: weighted linear blend, with separate weights for each question.
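
The three blend types differ only in how the per-model predictions are weighted. The sketch below shows the arithmetic for a single galaxy, without NumPy; it is not taken from ensemble_predictions_npy.py, the function names are hypothetical, and how the script chooses its weights is not covered here. `predictions[m][k]` is model m's value for output column k, and `question_columns` groups column indices by question.

```python
def uniform_blend(predictions):
    """Plain average over models for each output column."""
    n = float(len(predictions))
    return [sum(column) / n for column in zip(*predictions)]


def weighted_blend(predictions, weights):
    """One weight per model, shared by all output columns."""
    return [
        sum(w * p for w, p in zip(weights, column))
        for column in zip(*predictions)
    ]


def separate_blend(predictions, question_columns, weights_per_question):
    """One weight per model and per question; columns of a question share weights."""
    blended = [0.0] * len(predictions[0])
    for columns, weights in zip(question_columns, weights_per_question):
        for k in columns:
            blended[k] = sum(w * model[k] for w, model in zip(weights, predictions))
    return blended
```

With weights that sum to one, each blended value stays within the range spanned by the individual models' values.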

Convert prediction file to CSV

Finally, in order to prepare the predictions for submission, the prediction file needs to be converted from .npy.gz format to .csv.gz. Run the following to do so (or similarly for any other prediction file in .npy.gz format):

python create_submission_from_npy.py predictions/final/blended/blended_predictions_uniform.npy.gz
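
The output side of this conversion, writing a gzipped CSV with one row per galaxy, can be sketched with the standard library alone. This is not the contents of create_submission_from_npy.py: `write_submission` is a hypothetical helper, loading the .npy.gz file is left out, and the header passed in must match the column names Kaggle expects (the ones in the usage line below are placeholders). The sketch assumes Python 3.

```python
import csv
import gzip


def write_submission(path, header, ids, rows):
    """Write a gzipped CSV: a header line, then one line per (id, row) pair."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for galaxy_id, row in zip(ids, rows):
            writer.writerow([galaxy_id] + list(row))
```

A call could look like `write_submission("out.csv.gz", ["GalaxyID", "Class1.1", "Class1.2"], [100008], [[0.5, 0.25]])`.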

Submit predictions

Submit the file predictions/final/blended/blended_predictions_uniform.csv.gz on Kaggle to get it scored. Note that the process of generating this file involves considerable randomness: the weights of the networks are initialised randomly, the training data for each chunk is randomly selected, ... so I cannot guarantee that you will achieve the same score as I did. I did not use fixed random seeds. This might not have made much of a difference though, since different GPUs and CUDA toolkit versions will also introduce different rounding errors.

Owner
Sander Dieleman