Studies and projects built with PySpark.

Last update: Nov 06, 2022

Overview

PySpark (Spark with Python)

PySpark is the Python API for Apache Spark. Its goal is to allow interactive data analysis in a distributed environment. It matters most when working with large volumes of data (big data), because Spark processes large datasets efficiently.

Documentation

Data

The data for this tutorial comes from Kaggle. The dataset is small, so in theory using PySpark here is like "killing a fly with a cannon". Since the goal is to explore the main functions, it serves the purpose well.

To download this dataset you need a Kaggle account.

Topics

We will explore the main functions:

Count
Describe
Select
OrderBy
WithColumnRenamed
WithColumn
When
Drop
Filter
Where
GroupBy

Requirements

You will need Python 3 and pip. Using a virtual environment, created with either virtualenv or conda, is highly recommended. Install the project's dependencies from the requirements.txt file:

Conda

$ conda create --name nameenv python
$ conda activate nameenv
$ pip install -r requirements.txt

Virtualenv

$ pip3 install virtualenv
$ virtualenv venv -p python3
$ source venv/bin/activate
$ pip install -r requirements.txt

Note

To run PySpark, you also need Java installed.
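
One way to confirm the Java requirement before starting a Spark session is a quick lookup on the `PATH`. The sketch below uses only the Python standard library; the helper names `find_java` and `java_version_banner` are hypothetical, and the check assumes that Java is exposed as a `java` executable on the `PATH` (it does not inspect `JAVA_HOME`).

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return its banner text."""
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    # `java -version` typically writes to stderr, so fall back to stdout only if needed.
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("Java not found on PATH; install it before starting PySpark.")
else:
    print(java_version_banner(java_path))
```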

Owner

Karinne Cristina
