Using Python to derive insights on particular Pokemon, Types, Generations, and Stats

Overview

Pokémon Analysis

Andreas Nikolaidis

February 2022

Introduction

In this project, I use Python to analayze stats on all Pokemon in Generations 1 - 8, and calculate some interesting statistics based on a number of factors.

We can use this data to answer questions such as:

  • Does a Pokemon's Type determine it's stats like: HP, Attack, Defense, etc.?
  • What is the most important stat for predicting other stats? i.e. which stats have a high correlation?

In the following sections, I will walk through my process of extracting and analyzing the information using in pandas DataFrames, creating some visualizations and perform modeling using scikit-learn.

Exploratory Analysis

Start by importing all the necessary packages into Python:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px
import plotly.graph_objects as go

sns.set_style('whitegrid')
%matplotlib inline

# Import for Linear Regression
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

Read Data File:

df = pd.read_excel("pokemon.xlsx")

Create a separate dataframe including just the necessary stats:

df_stats = df[["Name","HP","Attack","Defense","SP_Attack","SP_Defense","Speed"]]

Although each stat is important in it's own right, the total value of all stats is what determines the category of a pokemon, therefore let's concatenate a column into the df that sums up the total values:

df['total'] = df.HP + df.Attack + df.Defense + df.SP_Attack + df.SP_Defense + df.Speed

Now let's view the range of total stats by each generation:

#palette: https://seaborn.pydata.org/tutorial/color_palettes.html?highlight=color
plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='Gen', y='total', data=df, scale='width', inner='quartile', palette='Set2') 
plt.title('Violin Plot of Total Stats by Generation', fontsize=22)
plt.show()

2df65225-732a-4581-af16-46cbaf14b931

In the above violinplot we can see that each generation has quite a different range of total stats with Gens IV, VII, & VIII having the longest range, while Gen V had a relatively tight range of stats. All Generations from IV onwards had higher medians than the first 3 generations.

Looking at individual stats, Speed is one of (if not THE) most important stat in competitive play, so let's examine which generations had the best overall speed stats.

plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='Gen', y='Speed', data=df, scale='width', inner='quartile', palette='Set2')

plt.title('Violin Plot of Total Stats by Generation', fontsize=22)
plt.show()

speed

Here we can clearly see Generation VIII has some of the fastest pokemon ever seen in games. Let's create a function to return the top 10 fastest pokemon in Gen VIII and their respective speed stat values:

def top_n(df, category, n):
    return (df.loc[df['Gen'] == 'VIII'].sort_values(category, ascending=False)[['Name','Gen',category]].head(n))
print('Top 10 Pokemon Speed')
top_n(df, 'Speed', 10)

speed_gen8

Those are definitely some fast pokemon!

Let's now see if we can get any indication of whether a particular pokemon's type has an advantage over others in total stats.

types_color_dict = {
    'grass':'#8ED752', 'fire':'#F95643', 'water':'#53AFFE', 'bug':"#C3D221", 'normal':"#BBBDAF", \
    'poison': "#AD5CA2", 'electric':"#F8E64E", 'ground':"#F0CA42", 'fairy':"#F9AEFE", \
    'fighting':"#A35449", 'psychic':"#FB61B4", 'rock':"#CDBD72", 'ghost':"#7673DA", \
    'ice':"#66EBFF", 'dragon':"#8B76FF", 'dark':"#1A1A1A", 'steel':"#C3C1D7", 'flying':"#75A4F9" }

plt.figure(figsize=(15,12), dpi=80)
sns.violinplot(x='Primary', y='total', data=df, scale='width', inner='quartile', palette=types_color_dict)

plt.title('Violin Plot of Total Stats by Type', fontsize=20)
plt.show()

total_type_stats

The dragon type definitely has quite a high upper interquartile range compared to other types. Meanwhile water & fairy types seem to have quite a large variance in total stats.

Let's see what the most common type of pokemon is:

types_color_dict = {
    'grass':'#8ED752', 'fire':'#F95643', 'water':'#53AFFE', 'bug':"#C3D221", 'normal':"#BBBDAF", \
    'poison': "#AD5CA2", 'electric':"#F8E64E", 'ground':"#F0CA42", 'fairy':"#F9AEFE", \
    'fighting':"#A35449", 'psychic':"#FB61B4", 'rock':"#CDBD72", 'ghost':"#7673DA", \
    'ice':"#66EBFF", 'dragon':"#8B76FF", 'dark':"#1A1A1A", 'steel':"#C3C1D7", 'flying':"#75A4F9" }


Type1 = pd.value_counts(df['Primary'])
sns.set()
dims = (11.7,8.27) #A4 dimensions
fig, ax=plt.subplots(figsize=dims)
BarT = sns.barplot(x=Type1.index, y=Type1, data=df, palette=types_color_dict, ax=ax)
BarT.set_xticklabels(BarT.get_xticklabels(), rotation= 90, fontsize=12)
BarT.set(ylabel = 'Freq')
BarT.set_title('Distribution of Primary Pokemon Types')
FigBar = BarT.get_figure()

type_distribution

We can see that the water and normal type pokemon are the most frequently appearing 'primary' types in the game.

Let's see how many pokemon are mono types vs dual-types so we can get a better sense of whether primary is sufficient.

labels = ['Mono type pokemon', 'Dual type pokemon']
sizes = [monotype, dualtype]
colors = ['lightskyblue', 'lightcoral']

patches, texts, _ = plt.pie(sizes, colors=colors, autopct='%1.1f%%', startangle=90, explode=(0,0.1))
plt.legend(patches, labels, loc="best")
plt.axis('equal')
plt.title('Dual-Type Ratio', fontsize=12)
plt.tight_layout()
plt.show()

mono_dual

Looks like there's actually more dual types than mono-types!

Aside from types, there are also 5 categories of pokemon: Regular, Pseudo-Legendary, Sub-Legendary, Legendary and Mythical. (There are of course also pre-evolutions, final evolutions, mega-evolutions etc.. but for the purposes of this analysis we will just bundle those together under 'regular' along with Pseudo-Legendary which are regular pokemon that have generally higher stats of 600 total. As for Sub Legendaries, Legendaries and Mythical - these pokemon typically exhibit 2 types of traits:

  1. Rarity: There is usually only 1 of those pokemon available in every game (some may not even be obtainable in certain games)
  2. Stats: These pokemon generally have much higher stats than the average 'regular' pokemon.

Let's create a diverging bar to determine the rate at which legendary pokemon appear in each generation:

#Sub-Legendary, Legendary or Mythical:
df.loc[df["is_sllm"]==False,"sllmid"] = 0
df.loc[df["is_sllm"]==True,"sllmid"] = 1

# calculate proportion of SL, L, M #
sllm_ratio = df.groupby("Gen").mean()["sllmid"]
sllm_ratio.round(4)*100
sns.set_style('darkgrid')
df_plot = pd.DataFrame(columns={"Gen","Rate","colors"})
x = sllm_ratio.values
df_plot["Gen"] = sllm_ratio.index
df_plot['Rate'] = (x - x.mean())/x.std()
df_plot['colors'] = ['red' if x < 0 else 'green' for x in df_plot['Rate']]
df_plot.sort_values('Rate', inplace=True)
df_plot.reset_index(inplace=True)

plt.figure(figsize=(14, 10))
plt.hlines(
    y=df_plot.index, xmin=0, xmax=df_plot.Rate,
    color=df_plot.colors,
    alpha=.4,
    linewidth=5)

plt.gca().set(xlabel='Rate', ylabel='Gen')
plt.yticks(df_plot.index, df_plot.Gen, fontsize=12)
plt.title('Diverging Bars of SubL, Legendary & Mythical Rate', fontdict={'size':20})
plt.show()

sub, legend,myth

Seems like Gen 7's Alola region has a huge volume of these 'legendaries & mythical' pokemon, which after digging further into it makes perfect sense given the introduction of a plethora of legendaries called 'ultra beasts' which were only ever introduced in that generation.

Correlations & Descriptive Statistics

Let's move to explore some correlations between stats.

#Correlation
Base_stats = ['Primary','Secondary','Classification','%Male','%Female',
              'Height','Weight','Capture_Rate','Base_Steps','HP','Attack','Defense',
              'SP_Attack','SP_Defense','Speed','is_sllm']

df_BS = df[Base_stats]
df_BS.head()
plt.figure(figsize=(14,12))

heatmap = sns.heatmap(df_BS.corr(), vmin=-1,vmax=1, annot=True, cmap='Blues')

heatmap.set_title('Correlation Base Stats Heatmap', fontdict={'fontsize':15}, pad=12)
plt.show()

correlation_plot

p1 = sns.jointplot(x="SP_Attack",y="SP_Defense",data=df,kind="hex",color="lightgreen")
p1.fig.suptitle("Hex Plot of Special Attack and Special Defense - Some Correlation")
p1.fig.subplots_adjust(top=0.95)
p2 = sns.jointplot(x="Defense",y="SP_Defense",data=df,kind="hex",color="lightblue")
p2.fig.suptitle("Hex Plot of Defense and Special Defense - Some Correlation")
p2.fig.subplots_adjust(top=0.95)
p3 = sns.jointplot(x="SP_Attack",y="Speed",data=df,kind="hex",color="pink")
p3.fig.suptitle("Hex Plot of Special Attack and Speed - Some Correlation")
p3.fig.subplots_adjust(top=0.95)
p4 = sns.jointplot(x="Attack",y="SP_Attack",data=df,kind="hex",color="orange")
p4.fig.suptitle("Hex Plot of Attack and Special Attack - Some Correlation")
p4.fig.subplots_adjust(top=0.95)
p5 = sns.jointplot(x="Attack",y="Defense",data=df,kind="hex",color="purple")
p5.fig.suptitle("Hex Plot of Attack and Defense - Some Correlation")
p5.fig.subplots_adjust(top=0.95)

hex_green hex_blue hex_red hex_orange hex_purple

from pandas import plotting
type1 = list(set(list(df['Primary'])))
cmap = plt.get_cmap('viridis')
colors = [cmap((type1.index(c) + 1) / (len(type1) + 2)) for c in df['Primary'].tolist()]
plotting.scatter_matrix(df.iloc[:, 13:18], figsize=(15, 15), color=colors, alpha=0.7) 
plt.show()

corrplot

import numpy as np
pd.DataFrame(np.corrcoef(df.iloc[:, 13:18].T.values.tolist()), 
             columns=df.iloc[:, 13:18].columns, index=df.iloc[:, 13:18].columns)

corrplot values

labels = ["Defense", "Attack"]
dims = (11.7, 8.27) #a4
fig, ax = plt.subplots(figsize=dims)
Defhist = sns.distplot(df['Defense'],color='g', hist=True, ax=ax)
Atthist = sns.distplot(df['Attack'],color='r', hist=True, ax=ax)
Atthist.set(title='Distribution of Defense & Attack')
plt.legend(labels, loc="best")
FigHist = Atthist.get_figure()

attack_defense

fig, ax = plt.subplots(2, 3, figsize=(14, 8), sharey=True)

spines = ["top","right","left"]
for i, col in enumerate(["HP", "Attack", "Defense", "SP_Attack", "SP_Defense", "Speed"]):
    sns.kdeplot(x=col, data=df, label=col, ax=ax[i//3][i%3],
                fill=True, color='lightblue', linewidth=2
               )
    
    ax[i//3][i%3].set_xlim(-5, 250)
    
    for s in spines:
        ax[i//3][i%3].spines[s].set_visible(False)
        

plt.tight_layout()
plt.show()

density_plots

df.describe()

std_dev_att_def

Looking at the summary statistics, we can see that the assumption about the variance and skewness of both plots was correct. The ‘std’ metric of the Attack is less than Defense, meaning that Defense statistics are more spread. Similarly, the Sp.Atk ‘std’ is larger than that of the Sp.Def. Skewness is determined by the positions of the median (50%) and the mean. Since in all instances (Attack, Defense, Sp.Attack and Sp.Defense) the mean is greater than the median, it is emphasised that the distribution is right-skewed (positively skewed).

Principal Component Analysis (PCA)

Let's analyze 800+ Pokemon as principal components and plot them in a two-dimensional plane using the first and second principal components. Principal component analysis (PCA) is a type of multivariate analysis method that is often used as a dimensionality reduction method.

In this data, the characteristics of 800+ Pokemon are represented by 6 types of "observed variables" (x1, x2, x3, x4, x5, x6). These 6 variables are used as explanatory variables. On the other hand, the synthetic variable synthesized by PCA is called "principal component score" and is given by a linear combination as shown in the following equation:

formula

In principal component analysis, the larger the eigenvalue (= variance of the principal component score), the more important the principal component score is. PCA is also sometimes regarded as a type of "unsupervised machine learning" and reveals the structure of the data itself. So let's start by importing PCA from Scikit-learn

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df.iloc[:, 13:18])
feature = pca.transform(df.iloc[:, 13:18])
plt.figure(figsize=(15, 15))
plt.scatter(feature[:, 0], feature[:, 1], alpha=0.8)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid()
plt.show()

PCA

import matplotlib.ticker as ticker
import numpy as np
plt.gca().get_xaxis().set_major_locator(ticker.MaxNLocator(integer=True))
plt.plot([0] + list( np.cumsum(pca.explained_variance_ratio_)), "-o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative contribution ratio")
plt.grid()
plt.show()

components

Let's see if we can determine what makes a 'legendary' pokemon

pca = PCA()
pca.fit(df.iloc[:, 13:18])
feature = pca.transform(df.iloc[:, 13:18])
plt.figure(figsize=(15, 15))
for binary in [True, False]:
    plt.scatter(feature[df['is_sllm'] == binary, 0], feature[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(loc = 'best')
plt.grid()
plt.show()

pca_color

Nice! Although it's not 'exact' we can clearly see that when the first principal component (PC1) reaches 50, we start to see a significantly higher concentration of legendary pokemon! Now, let's illustrate how much PC1 actually contributes to the explanatory variable (parameter) with a loading plot.

components_stats

Assuming that the first principal component (PC1) is actually a strong indicator of whether or not a pokemon is classified as legendary, sub-legendary or mythical, it seems like Special Attack is the best indicator out of all stats (follow by Physical Attack)

In the second principal component (PC2), Defense and Speed contribute to the opposite: Positive & Negative.

"Factor Analysis" is a method that is similar to principal component analysis.

In PCA, we synthesized the "principal component" yPC1 which is a linear combination of the weight matrix (eigenvector) a for the explanatory variables. Here, define as many principal components as there are explanatory variables.

yPC1 = a1,1 x1 + a1,2 x2 + a1,3 x3 + a1,4 x4 + a1,5 + ...

In factor analysis, based on the idea that the explanatory variable (observed variable) x is synthesized from a latent variable called "factor", the factor score f, the weight matrix (factor load) w, and the unique factor e are specified. (There is no idea of ​​a unique factor in principal component analysis).

x1 = w1,1 f1 + w1,2 f2 + e1

x2 = w2,1 f1 + w2,2 f2 + e2

x3 = w3,1 f1 + w3,2 f2 + e3

x4 = w4,1 f1 + w4,2 f2 + e4

x5 = w5,1 f1 + w5,2 f2 + e5

x6 = w6,1 f1 + w6,2 f2 + e6

The factor score f is a latent variable unique to each individual (sample). The linear sum of the factor score and the factor load (w1,1 f1 + w1,2 f2, etc.) is called the "common factor" and can be observed as an "observed variable" by adding it to the "unique factor" e unique to the observed variable. It's a way of thinking. The number of factors is usually smaller than the explanatory variables and must be decided in advance.

(However, terms such as common factors and factors are very confusing because it seems that different people have different definitions as far as I can see)

from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=2, max_iter=500)
factors = fa.fit_transform(df.iloc[:, 13:18])
plt.figure(figsize=(12, 12))
for binary in [True, False]:
    plt.scatter(factors[df['is_sllm'] == binary, 0], factors[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.legend(loc = 'best')
plt.grid()
plt.show()

pca_color2

In this instance, the determining factor of a 'legendary' is whether or not the sum of factor 1 and factor 2 exceeds a certain level, but it seems that it is slightly biased toward the larger factor 2. So which parameters do factor 2 and factor 1 allude to?

plt.figure(figsize=(8, 8))
for x, y, name in zip(fa.components_[0], fa.components_[1], df.columns[13:18]):
    plt.text(x, y, name)
plt.scatter(fa.components_[0], fa.components_[1])
plt.grid()
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.show()

component_stats(factor2)

Factor 1 highest value = "Defense" Factor 2 highest value = "Special Attack"

Let's create some charts!

Firstly I created a dendrogram (dendro = greek word for tree :)) for all pokemon (Image file is way too large to display clearly)

dfs = df.iloc[:, 13:18].apply(lambda x: (x-x.mean())/x.std(), axis=0)
from scipy.cluster.hierarchy import linkage, dendrogram
result1 = linkage(dfs, 
                  metric = 'euclidean', 
                  method = 'average')
plt.figure(figsize=(15, 150))
dendrogram(result1, orientation='right', labels=list(df['Name']), color_threshold=2)
plt.title("Dedrogram of Pokemon")
plt.xlabel("Threshold")
plt.grid()
plt.show()
def get_cluster_by_number(result, number):
    output_clusters = []
    x_result, y_result = result.shape
    n_clusters = x_result + 1
    cluster_id = x_result + 1
    father_of = {}
    x1 = []
    y1 = []
    x2 = []
    y2 = []
    for i in range(len(result) - 1):
        n1 = int(result[i][0])
        n2 = int(result[i][1])
        val = result[i][2]
        n_clusters -= 1
        if n_clusters >= number:
            father_of[n1] = cluster_id
            father_of[n2] = cluster_id

        cluster_id += 1

    cluster_dict = {}
    for n in range(x_result + 1):
        if n not in father_of:
            output_clusters.append([n])
            continue

        n2 = n
        m = False
        while n2 in father_of:
            m = father_of[n2]
            #print [n2, m]
            n2 = m

        if m not in cluster_dict:
            cluster_dict.update({m:[]})
        cluster_dict[m].append(n)

    output_clusters += cluster_dict.values()

    output_cluster_id = 0
    output_cluster_ids = [0] * (x_result + 1)
    for cluster in sorted(output_clusters):
        for i in cluster:
            output_cluster_ids[i] = output_cluster_id
        output_cluster_id += 1

    return output_cluster_ids
clusterIDs = get_cluster_by_number(result1, 50)
print(clusterIDs)

cluster_ids

plt.hist(clusterIDs, bins=50)
plt.show()

histo

Here we've created a histogram of clusters of pokemon that exhibit similar traits with each other. Here we've created 50 bins so there will be 50 different clusters of pokemon. That's quite a large number of charts to display so I'll just display several so you get the idea.

cluster4 cluster5 cluster6 cluster8 cluster10 cluster50

Some pokemon exhibit lots of traits similar to each other while others (like Regieleki) stand out.

Cross Validation & Regression Analysis

Since we saw earlier that Special Attack is a huge contributing factor to determining whether a pokemon is classified as 'legendary', let's use the rest of the stats to see if we can predict Special Attack!

X = df.iloc[:, 13:18]
y = df['total']
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)

print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(X, y))
df.columns[[12, 13, 14, 16, 17]]
X = df.iloc[:, [12, 13, 14, 16, 17]]
y = df['SP_Attack']

Cross Validation

In machine learning, in order to evaluate performance, known data is divided into training and test data. Training (learning) is performed using training data to build a prediction model, and performance evaluation is performed based on how accurately the test data that was not used to build the prediction model can be predicted. Such an evaluation method is called "cross-validation".

Training data (60% of all data) X_train: Explanatory variables for training data y_train: Objective variable for training data Test data (40% of all data) X_test: Explanatory variable for test data y_test: Objective variable for test data We aim to learn the relationship between X_train and y_train and predict y_test from X_test. If the training data seems to show good performance, but the test data not used for training has poor performance, the model is said to be "overfitted".

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination(train)= ", regr.score(X_train, y_train))
print("Coefficient of Determination(test)= ", regr.score(X_test, y_test))
Regression Coefficient=  [ 0.15598049  0.09796333 -0.11115187  0.47986509  0.32513351]
Intercept=  5.4684249031776915
Coefficient of Determination(train)=  0.39594357153305826
Coefficient of Determination(test)=  0.38127048972638855

The above values change with each calculation because the division into training data and test data is random. If you want to find a regression equation, you can do as above, but by standardizing the explanatory variables and objective variables and then regressing, you can find the "standard regression coefficient", which is an index of "importance of variables".

Xs = X.apply(lambda x: (x-x.mean())/x.std(), axis=0)
ys = list(pd.DataFrame(y).apply(lambda x: (x-x.mean())/x.std()).values.reshape(len(y),))
from sklearn import linear_model
regr = linear_model.LinearRegression()

regr.fit(Xs, ys)

print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(Xs, ys))
Regression Coefficient=  [ 0.152545    0.11255532 -0.09718819  0.40725508  0.28208903]
Intercept=  1.1730486200365748e-16
Coefficient of Determination=  0.3958130072204933
pd.DataFrame(regr.coef_, index=list(df.columns[[12, 13, 14, 16, 17]])).sort_values(0, ascending=False).style.bar(subset=[0])

sp attack prediction

It seems that Special Defense & Speed are very important in predicting "Special Attack"

Conclusion

Regression analysis, such as multiple regression analysis, uses numerical data as an explanatory variable and predicts numerical data as an objective variable. On the other hand, quantification type I predicts using non-numeric categorical data as an explanatory variable and numerical data as an objective variable. When the explanatory variables are a mixture of numerical data and categorical data, they are called extended quantification type I.

We saw that Special Attack is definitely a strong predictor for determining whether a pokemon is legendary or not - and we also saw that Special Defense & Speed are also important indicators of Special Attack Value.

Overall this was a way of exploring different pokemon traits and taking into account multiple factors. There's plenty more we can look into such as 'strengths', 'weaknesses' etc.. I hope you all enjoyed this, and thanks for reading all the way through!

Owner
Andreas
I love all kinds of data, though you will most likely see a heavier concentration of insights on categories like gaming, movies and sports
Andreas
Developed for analyzing the covariance for OrcVIO

about This repo is developed for analyzing the covariance for OrcVIO environment setup platform ubuntu 18.04 using conda conda env create --file envir

Sean 1 Dec 08, 2021
Data-sets from the survey and analysis

bachelor-thesis "Umfragewerte.xlsx" contains the orginal survey results. "umfrage_alle.csv" contains the survey results but one participant is cancele

1 Jan 26, 2022
Handle, manipulate, and convert data with units in Python

unyt A package for handling numpy arrays with units. Often writing code that deals with data that has units can be confusing. A function might return

The yt project 304 Jan 02, 2023
Office365 (Microsoft365) audit log analysis tool

Office365 (Microsoft365) audit log analysis tool The header describes it all WHY?? The first line of code was written long time before other colleague

Anatoly 1 Jul 27, 2022
Titanic data analysis for python

Titanic-data-analysis This Repo is an analysis on Titanic_mod.csv This csv file contains some assumed data of the Titanic ship after sinking This full

Hardik Bhanot 1 Dec 26, 2021
BErt-like Neurophysiological Data Representation

BENDR BErt-like Neurophysiological Data Representation This repository contains the source code for reproducing, or extending the BERT-like self-super

114 Dec 23, 2022
Working Time Statistics of working hours and working conditions by industry and company

Working Time Statistics of working hours and working conditions by industry and company

Feng Ruohang 88 Nov 04, 2022
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
Incubator for useful bioinformatics code, primarily in Python and R

Collection of useful code related to biological analysis. Much of this is discussed with examples at Blue collar bioinformatics. All code, images and

Brad Chapman 560 Jan 03, 2023
Mining the Stack Overflow Developer Survey

Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m

1 Nov 16, 2021
Useful tool for inserting DataFrames into the Excel sheet.

PyCellFrame Insert Pandas DataFrames into the Excel sheet with a bunch of conditions Install pip install pycellframe Usage Examples Let's suppose that

Luka Sosiashvili 1 Feb 16, 2022
Conduits - A Declarative Pipelining Tool For Pandas

Conduits - A Declarative Pipelining Tool For Pandas Traditional tools for declaring pipelines in Python suck. They are mostly imperative, and can some

Kale Miller 7 Nov 21, 2021
Flenser is a simple, minimal, automated exploratory data analysis tool.

Flenser Have you ever been handed a dataset you've never seen before? Flenser is a simple, minimal, automated exploratory data analysis tool. It runs

John McCambridge 79 Sep 20, 2022
2019 Data Science Bowl

Kaggle-2019-Data-Science-Bowl-Solution - Here i present my solution to kaggle 2019 data science bowl and how i improved it to win a silver medal in that competition.

Deepak Nandwani 1 Jan 01, 2022
This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

Overview Welcome to the Step-X repository. This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP. Be

Keanu Pang 0 Jan 20, 2022
Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.

Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the

SALib 663 Jan 05, 2023
Zipline, a Pythonic Algorithmic Trading Library

Zipline is a Pythonic algorithmic trading library. It is an event-driven system for backtesting. Zipline is currently used in production as the backte

Quantopian, Inc. 15.7k Jan 07, 2023
Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apac

Chris Carbonell 1 Dec 03, 2021
DefAP is a program developed to facilitate the exploration of a material's defect chemistry

DefAP is a program developed to facilitate the exploration of a material's defect chemistry. A large number of features are provided and rapid exploration is supported through the use of autoplotting

6 Oct 25, 2022
Random dataframe and database table generator

Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien

Tirthajyoti Sarkar 249 Jan 08, 2023