Mining the Stack Overflow Developer Survey

Overview

Mining the Stack Overflow Developer Survey

A prototype data mining application to compare the accuracy of decision tree and random forest regression models to predict annual compensation of tech workers in the US and Europe.

Objectives

Usage

To run, download the repository and execute the file main.py in the src directory with your python path variable. For example, python3 main.py.

Dependencies

  • python 3.8.1 and up
  • pandas 1.3.4 and up
  • matplotlib 3.4.3 and up
  • numpy 1.21.0 and up
  • sklearn 1.0.1 and up

Methodology

Preprocessing

The original data set provided by Stack Overflow contained 48 attribute columns and 83439 data records. Due to the large size of the data set, we wanted to narrow our focus to a certain subset of the data. In the preprocessing of the original data file, we decided to discard any records that were not employed full-time in the technology industry. Any record that did not contain country, converted annual salary, or yeared coded was also discarded, as this data is vital to our model. We also discarded some of the columns from the original data set that were open-ended. Out of the records that fit our requirements, we exported them to two output csv files. Records of United States data were put together in one output file, and records of European countries were put in the other. Data from any other countries were discarded. Once we have the two cleaned files, we applied additional preprocessing techniques. Any missing attributes that remained were replaced with 'NA' if the attributes were nominal. Two special cases existed in the columns for years coded and years coded professionally. Most contained a numerical value for the years, but some had a string for 'Less than one year' and 'More than 50 years'. These strings were replaced with 0 and 50, respectively, to keep these columns numerical. With these preprocessing steps complete, the data files are now ready to be processed to generate the models.

Models

We evaluated a variety of data mining models and algorithms to find the ones that would make the most sense for our data set and objectives. With our goal of predicting a numerical value for annual salary, we knew we needed to use a compatible regression model. We found regression models for decision trees and random forests and wanted to compare their accuracy. We wanted to see how the accuracy of a single decision tree compares to the accuracy of a random forest model, which is a number of trees together. The results are detailed in the results and analysis section. Below are the implementation details of each model.

Decision tree model

We selected the DecisionTreeRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameter we changed was the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that a maximum depth of ADD RES HERE resulted in the most accurate decision tree model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Random forest model

We selected the RandomForestRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameters we changed were the number of trees to estimate with and the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that ADD RES HERE trees in the forest with a maximum depth of ADD RES HERE resulted in the most accurate random forest model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Results and Analysis

Authors

Instant search for and access to many datasets in Pyspark.

SparkDataset Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure). Drop a star if you like the project. 😃 Motiv

Souvik Pratiher 31 Dec 16, 2022
NumPy aware dynamic Python compiler using LLVM

Numba A Just-In-Time Compiler for Numerical Functions in Python Numba is an open source, NumPy-aware optimizing compiler for Python sponsored by Anaco

Numba 8.2k Jan 07, 2023
vartests is a Python library to perform some statistic tests to evaluate Value at Risk (VaR) Models

gg I wasn't satisfied with any of the other available Gemini clients, so I wrote my own. Requires Python 3.9 (maybe older, I haven't checked) and opti

RAFAEL RODRIGUES 5 Jan 03, 2023
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
This repo contains a simple but effective tool made using python which can be used for quality control in statistical approach.

📈 Statistical Quality Control 📉 This repo contains a simple but effective tool made using python which can be used for quality control in statistica

SasiVatsal 8 Oct 18, 2022
PyChemia, Python Framework for Materials Discovery and Design

PyChemia, Python Framework for Materials Discovery and Design PyChemia is an open-source Python Library for materials structural search. The purpose o

Materials Discovery Group 61 Oct 02, 2022
Flood modeling by 2D shallow water equation

hydraulicmodel Flood modeling by 2D shallow water equation. Refer to Hunter et al (2005), Bates et al. (2010). Diffusive wave approximation Local iner

6 Nov 30, 2022
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
Performance analysis of predictive (alpha) stock factors

Alphalens Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open sour

Quantopian, Inc. 2.5k Jan 09, 2023
MidTerm Project for the Data Analysis FT Bootcamp, Adam Tycner and Florent ZAHOUI

MidTerm Project for the Data Analysis FT Bootcamp, Adam Tycner and Florent ZAHOUI Hallo

Florent Zahoui 1 Feb 07, 2022
A set of functions and analysis classes for solvation structure analysis

SolvationAnalysis The macroscopic behavior of a liquid is determined by its microscopic structure. For ionic systems, like batteries and many enzymes,

MDAnalysis 19 Nov 24, 2022
DaCe is a parallel programming framework that takes code in Python/NumPy and other programming languages

aCe - Data-Centric Parallel Programming Decoupling domain science from performance optimization. DaCe is a parallel programming framework that takes c

SPCL 330 Dec 30, 2022
Repository created with LinkedIn profile analysis project done

EN/en Repository created with LinkedIn profile analysis project done. The datase

Mayara Canaver 4 Aug 06, 2022
An easy-to-use feature store

A feature store is a data storage system for data science and machine-learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

ByteHub AI 48 Dec 09, 2022
An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

1 Feb 11, 2022
A data parser for the internal syncing data format used by Fog of World.

A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data struc

Zed(Zijun) Chen 40 Dec 12, 2022
Multiple Pairwise Comparisons (Post Hoc) Tests in Python

scikit-posthocs is a Python package that provides post hoc tests for pairwise multiple comparisons that are usually performed in statistical data anal

Maksim Terpilowski 264 Dec 30, 2022
Learn machine learning the fun way, with Oracle and RedBull Racing

Red Bull Racing Analytics Hands-On Labs Introduction Are you interested in learning machine learning (ML)? How about doing this in the context of the

Oracle DevRel 55 Oct 24, 2022
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022
Using approximate bayesian posteriors in deep nets for active learning

Bayesian Active Learning (BaaL) BaaL is an active learning library developed at ElementAI. This repository contains techniques and reusable components

ElementAI 687 Dec 25, 2022