Distributed deep learning on Hadoop and Spark clusters.

Overview

Note: we're lovingly marking this project as Archived since we're no longer supporting it. You are welcome to read the code, fork your own version, and continue to use it under the terms of the project license.

CaffeOnSpark

What's CaffeOnSpark?

CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning using their existing LMDB data files and minor adjustments to their network configuration.

CaffeOnSpark is a Spark package for deep learning. It is complementary to non-deep-learning libraries such as MLlib and Spark SQL. CaffeOnSpark's Scala API provides Spark applications with an easy mechanism to invoke deep learning (see sample) over distributed datasets.
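
For illustration, here is a minimal Scala sketch of a Spark application driving CaffeOnSpark, modeled on the project's sample applications. The class names (com.yahoo.ml.caffe.CaffeOnSpark, Config, DataSource) follow the samples, but treat the exact constructors and signatures as assumptions and check the API reference for your build.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}

object CaffeOnSparkSketch {
  def main(args: Array[String]): Unit = {
    val ctx = new SparkContext(new SparkConf().setAppName("CaffeOnSparkSketch"))

    // CaffeOnSpark options (e.g. -train, -features, -conf <solver prototxt>, -devices, -model)
    // are parsed into a Config object, as in the bundled sample applications.
    val conf = new Config(ctx, args)
    val cos  = new CaffeOnSpark(ctx)

    // Train against the data source declared in the solver/net prototxt.
    val trainSource = DataSource.getSource(conf, true)
    cos.train(trainSource)

    // Extract features (or run tests) over a second data source; the result is a DataFrame.
    val testSource = DataSource.getSource(conf, false)
    val featuresDF = cos.features(testSource)
    featuresDF.show()

    ctx.stop()
  }
}
```

In practice such an application is packaged as a jar and launched with spark-submit, as described in the getting-started guides on the wiki.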

CaffeOnSpark was developed by Yahoo for large-scale distributed deep learning on Hadoop clusters in Yahoo's private cloud. It has been used at Yahoo for image search, content classification, and several other use cases.

Why CaffeOnSpark?

CaffeOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

  • It enables model training, testing, and feature extraction directly on Hadoop datasets stored in HDFS on Hadoop clusters.
  • It turns your Hadoop or Spark cluster(s) into a powerful platform for deep learning, without the need to set up a separate dedicated cluster.
  • Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottlenecks.
  • Caffe users' existing datasets (e.g. LMDB) and configurations can be applied to distributed learning without any conversion.
  • A high-level API empowers Spark applications to easily conduct deep learning.
  • Incremental learning is supported, leveraging previously trained models or snapshots.
  • Additional data formats and network interfaces can be easily added.
  • It can be easily deployed in a public cloud (e.g. AWS EC2) or a private cloud.

Using CaffeOnSpark

Please check the CaffeOnSpark wiki for detailed documentation, including build instructions, an API reference, and getting-started guides for standalone and AWS EC2 clusters.

Note the following when preparing network configurations:

  • Batch sizes specified in prototxt files are per device.
  • Memory layers should not be shared among GPUs, so "share_in_parallel: false" is required in their layer configuration (see the sketch below).
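
As an illustration, a memory data layer in the style of the project's sample prototxt files might look like the sketch below. The field set (notably source and share_in_parallel inside memory_data_param) follows the bundled samples, while the path and sizes here are purely hypothetical.

```
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  memory_data_param {
    # hypothetical HDFS path; point this at your own LMDB dataset
    source: "hdfs:///user/me/mnist_train_lmdb"
    batch_size: 64            # per device, not per cluster
    channels: 1
    height: 28
    width: 28
    share_in_parallel: false  # memory layers must not be shared among GPUs
  }
}
```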

Building for Spark 2.X

CaffeOnSpark supports both Spark 1.x and 2.x. For Spark 2.0, our default settings are:

  • spark-2.0.0
  • hadoop-2.7.1
  • scala-2.11.7

You may want to adjust these versions in caffe-grid/pom.xml (see the sketch below).
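
As a sketch only, such version overrides would typically live in a Maven properties block like the one below; the property names are illustrative assumptions, so match them against what caffe-grid/pom.xml actually defines.

```xml
<!-- caffe-grid/pom.xml (illustrative; verify the real property names in the file) -->
<properties>
  <spark.version>2.0.0</spark.version>
  <hadoop.version>2.7.1</hadoop.version>
  <scala.version>2.11.7</scala.version>
</properties>
```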

Mailing List

Please join the CaffeOnSpark user group for discussions and questions.

License

The use and distribution terms for this software are covered by the Apache 2.0 license. See the LICENSE file for details.
