Exploratory Data Analysis for Employee Retention Dataset

Overview

  • Employee turnover is a very costly problem for companies.
  • The cost of replacing an employee is often more than 100K USD, taking into account the time spent interviewing and finding a replacement, placement fees, sign-on bonuses, and the loss of productivity for several months.
  • It is only natural, then, that data science has started being applied to this area.
  • Understanding why and when employees are most likely to leave can lead to actions that improve employee retention and help plan new hiring in advance. This application of data science is sometimes called people analytics or people data science.
  • We got employee data from a few companies. We have data about all employees who joined from 2011/01/24 to 2015/12/13. For each employee, we also know whether they are still at the company as of 2015/12/13 or have quit.
  • Besides that, we have general information about the employee, such as average salary during her tenure, department, and years of experience.

Goal:

In this challenge, you have a data set with info about the employees and have to predict when employees are going to quit by understanding the main drivers of employee churn.

  • Assume, for each company, that the headcount starts from zero on 2011/01/23. Estimate employee headcount, for each company, on each day, from 2011/01/24 to 2015/12/13. That is, if by 2012/03/02 2000 people have joined company 1 and 1000 of them have already quit, then company headcount on 2012/03/02 for company 1 would be 1000.
  • You should create a table with 3 columns: day, employee_headcount, company_id.
  • What are the main factors that drive employee churn? Do they make sense? Explain your findings.
  • If you could add to this data set just one variable that could help explain employee churn, what would that be?
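
One possible way to build the headcount table described above is sketched here. It assumes that an employee counts from the join date onward and stops counting on the quit date itself; that convention, the function name build_headcount_table, and the tiny sample frame are illustrative assumptions, not part of the challenge statement.

```python
import pandas as pd


def build_headcount_table(df, start="2011-01-24", end="2015-12-13"):
    """Return a table with columns day, employee_headcount, company_id."""
    days = pd.date_range(start, end, freq="D")
    frames = []
    for company_id, group in df.groupby("company_id"):
        # Number of people joining and quitting on each calendar day.
        joins = pd.to_datetime(group["join_date"]).value_counts()
        quits = pd.to_datetime(group["quit_date"]).dropna().value_counts()
        joins = joins.reindex(days, fill_value=0)
        quits = quits.reindex(days, fill_value=0)
        # Running total of joins minus quits (assumed convention: a person
        # is no longer counted on the day they quit).
        headcount = (joins - quits).cumsum()
        frames.append(pd.DataFrame({
            "day": days,
            "employee_headcount": headcount.to_numpy(),
            "company_id": company_id,
        }))
    return pd.concat(frames, ignore_index=True)


# Hypothetical three-employee sample, only to show the shape of the result.
sample = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "company_id": [1, 1, 2],
    "join_date": ["2011-01-24", "2011-01-25", "2011-01-25"],
    "quit_date": ["2011-01-26", None, None],
})
table = build_headcount_table(sample, start="2011-01-24", end="2011-01-27")
print(table)
```

Counting daily joins and quits and then taking a cumulative sum avoids comparing every employee against every day, which matters over a range of almost five years.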

Data: (data/employee_retention_data.csv)

Columns:

  • employee_id : id of the employee. Unique by employee per company
  • company_id : company id.
  • dept : employee dept
  • seniority : number of yrs of work experience when hired
  • salary: avg yearly salary of the employee during her tenure within the company
  • join_date: when the employee joined the company, it can only be between 2011/01/24 and 2015/12/13
  • quit_date: when the employee left her job (if she is still employed as of 2015/12/13, this field is NA)
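
A minimal loading sketch for a file with these columns follows. The inline sample rows, the ISO date format, and the helper name load_employee_data are assumptions made for illustration; with the real file, the path data/employee_retention_data.csv would be passed instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical rows in the documented column layout (not real data).
SAMPLE_CSV = """employee_id,company_id,dept,seniority,salary,join_date,quit_date
1,1,engineer,5,120000,2011-01-24,2013-02-01
2,1,sales,2,80000,2012-06-11,NA
3,2,marketing,10,150000,2014-03-03,2015-01-09
"""


def load_employee_data(source):
    """Read the CSV and parse both date columns; NA quit dates become NaT."""
    return pd.read_csv(source, parse_dates=["join_date", "quit_date"])


employees = load_employee_data(io.StringIO(SAMPLE_CSV))
print(employees.dtypes)
```

Parsing the dates at load time keeps join_date and quit_date out of the categorical columns in the later questions.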

Question 1

Function that returns a list of the names of categorical variables

  • Define a function with name get_categorical_variables
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return list of all categorical fields available.
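
One way this function could look is sketched below. It treats every column that is neither numeric nor a parsed date as categorical, which is an assumption: integer identifiers such as company_id are numeric by dtype even though they label groups. The sample frame is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def get_categorical_variables(df):
    """Return the names of columns that are neither numeric nor datetime."""
    return [
        column for column in df.columns
        if not is_numeric_dtype(df[column])
        and not is_datetime64_any_dtype(df[column])
    ]


# Hypothetical sample in the documented column layout.
sample = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "company_id": [1, 1, 2],
    "dept": ["engineer", "sales", "engineer"],
    "seniority": [5, 2, 10],
    "salary": [120000.0, 80000.0, 150000.0],
    "join_date": pd.to_datetime(["2011-01-24", "2012-06-11", "2014-03-03"]),
    "quit_date": pd.to_datetime(["2013-02-01", None, "2015-01-09"]),
})
print(get_categorical_variables(sample))
```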

Question 2

Function that returns the list of the names of numeric variables

  • Define a function with name get_numerical_variables
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return list of all numerical fields available.
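
A short sketch of this function, assuming that "numerical" means any column with a numeric dtype (so identifier columns are included). The sample frame is hypothetical.

```python
import pandas as pd


def get_numerical_variables(df):
    """Return the names of all columns with a numeric dtype."""
    return df.select_dtypes(include="number").columns.tolist()


# Hypothetical sample in the documented column layout.
sample = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "company_id": [1, 1, 2],
    "dept": ["engineer", "sales", "engineer"],
    "seniority": [5, 2, 10],
    "salary": [120000.0, 80000.0, 150000.0],
})
print(get_numerical_variables(sample))
```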

Question 3

Function that returns, for numeric variables, the mean, median, and 25th, 50th, and 75th percentiles

  • Define a function with name get_numerical_variables_percentile
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dataframe with following columns:
    • variable name
    • mean
    • median
    • 25th percentile
    • 50th percentile
    • 75th percentile
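
The summary table could be assembled as in the sketch below, which uses the column names listed above and pandas' default linear interpolation for percentiles. The sample values are hypothetical.

```python
import pandas as pd


def get_numerical_variables_percentile(df):
    """Return one row per numeric column with mean, median and percentiles."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "variable name": numeric.columns.tolist(),
        "mean": numeric.mean().to_numpy(),
        "median": numeric.median().to_numpy(),
        "25th percentile": numeric.quantile(0.25).to_numpy(),
        "50th percentile": numeric.quantile(0.50).to_numpy(),
        "75th percentile": numeric.quantile(0.75).to_numpy(),
    })


# Hypothetical sample values, chosen so the statistics are easy to check.
sample = pd.DataFrame({
    "dept": ["engineer", "sales", "sales", "marketing", "engineer"],
    "seniority": [1, 2, 3, 4, 5],
    "salary": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0],
})
summary = get_numerical_variables_percentile(sample)
print(summary)
```

The median and the 50th percentile are the same quantity, so those two columns always agree.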

Question 4

For categorical variables, get modes

  • Define a function with name get_categorical_variables_modes
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dict object with following keys:
    • converted
    • country
    • new_user
    • source
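
A possible shape for this function is sketched below. Rather than hard-coding key names, it keys the dict by whichever categorical columns the dataframe contains, and it stores every tied mode as a list; both choices are assumptions. The sample frame is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def get_categorical_variables_modes(df):
    """Map each non-numeric, non-datetime column to its most frequent value(s)."""
    modes = {}
    for column in df.columns:
        if is_numeric_dtype(df[column]) or is_datetime64_any_dtype(df[column]):
            continue
        # Series.mode() returns all values tied for the highest count, sorted.
        modes[column] = df[column].mode().tolist()
    return modes


# Hypothetical sample in the documented column layout.
sample = pd.DataFrame({
    "company_id": [1, 1, 2],
    "dept": ["engineer", "sales", "engineer"],
    "salary": [120000.0, 80000.0, 150000.0],
})
print(get_categorical_variables_modes(sample))
```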

Question 5

For each column, list the count of missing values

  • Define a function with name get_missing_values_count
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dataframe with following columns:
    • var_name
    • missing_value_count
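
A minimal sketch of the missing-value count, using the two column names listed above. The sample frame is hypothetical.

```python
import pandas as pd


def get_missing_values_count(df):
    """Return one row per column with the number of missing values."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "var_name": counts.index.tolist(),
        "missing_value_count": counts.to_numpy(),
    })


# Hypothetical sample: one employee is still employed, so quit_date is missing.
sample = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "dept": ["engineer", "sales", "engineer"],
    "quit_date": pd.to_datetime(["2013-02-01", None, "2015-01-09"]),
})
missing = get_missing_values_count(sample)
print(missing)
```

In this dataset quit_date is expected to be missing for everyone still employed as of 2015/12/13, so a non-zero count there is not a data-quality problem.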

Question 6

Plot histograms of all the numerical variables as separate subplots in a single figure

  • Define a function with name plot_histogram_with_numerical_values
  • Pass dataframe and list of columns you want to plot as parameter
  • Plot the graph
  • Add column names as plot names (In case you don't understand this, please contact the instructor)
  • Change the histogram colour to yellow
  • Fit a normal curve on those histograms (In case you don't understand this, please contact the instructor)
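
The plotting steps above might be sketched as follows, assuming matplotlib is available. Here "fit a normal curve" is interpreted as overlaying a normal density with the column's own mean and standard deviation on a density-scaled histogram; that interpretation, the bin count, and the sample values are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histogram_with_numerical_values(df, columns):
    """Draw one yellow histogram per column with a normal curve on top."""
    fig, axes = plt.subplots(
        1, len(columns), figsize=(5 * len(columns), 4), squeeze=False
    )
    for ax, column in zip(axes[0], columns):
        values = df[column].dropna().to_numpy(dtype=float)
        # density=True puts the histogram on the same scale as the curve.
        ax.hist(values, bins=20, density=True, color="yellow", edgecolor="black")
        mu, sigma = values.mean(), values.std()
        if sigma > 0:
            x = np.linspace(values.min(), values.max(), 200)
            pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
            ax.plot(x, pdf, color="black")
        ax.set_title(column)  # column name as the subplot title
    fig.tight_layout()
    return fig


# Hypothetical sample values, only to exercise the function.
sample = pd.DataFrame({
    "seniority": [1, 2, 3, 4, 5, 5, 6, 8],
    "salary": [60000.0, 75000.0, 80000.0, 95000.0, 110000.0, 120000.0, 150000.0, 200000.0],
})
fig = plot_histogram_with_numerical_values(sample, ["seniority", "salary"])
```

The function returns the figure instead of displaying it, so the caller can decide whether to call plt.show() or fig.savefig().
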
Owner
kana sudheer reddy
Currently studying at Presidency University, Bangalore