Random dataframe and database table generator

Overview

Random database/dataframe generator

Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA

Introduction

Beginners in SQL or data science often struggle to find an easily accessible, large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library that generates a large database with multiple tables, filled with data of one's own choice?

After all, databases break every now and then and it is safest to practice with a randomly generated one :-)

https://imgs.xkcd.com/comics/exploits_of_a_mom.png

While it is easy to generate random numbers or simple words for learning Pandas or dataframe operations, it is often non-trivial to generate full data tables with meaningful yet random entries for the fields most commonly encountered in the database world, such as:

  • name,
  • age,
  • birthday,
  • credit card number,
  • SSN,
  • email id,
  • physical address,
  • company name,
  • job title.

This Python package generates a random database TABLE (or a Pandas dataframe, or an Excel file) based on the user's choice of data types (database fields). The user can specify the number of samples needed and can also designate a "PRIMARY KEY" for the database table. Finally, the TABLE is inserted into a new or existing database file of the user's choice.

https://raw.githubusercontent.com/tirthajyoti/pydbgen/master/images/Top_image_1.png

Dependency and Acknowledgement

At its core, pydbgen uses Faker as the default random-data-generating engine for most of the data types. Original functions are written for a few data types, such as realistic email IDs and license plates. Also, the default phone number generated by Faker is free-format and does not follow the US 10-digit format, so a simple phone-number data type is introduced in pydbgen. The original contribution of pydbgen is to take the single data-generating functions from Faker and use them to generate Pandas data series, dataframes, or SQLite database tables per the user's specification. Here is the link if you want to read more about the Faker package:

Faker Documentation Home
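To make the phone-number point concrete, here is a minimal sketch of the idea; it assumes nothing about pydbgen's internals, and simple_us_phone is a hypothetical helper, not part of either package:

# Faker's phone_number() is free-format, while a US-style number can be
# assembled from random digits in a fixed xxx-xxx-xxxx pattern.
import random
from faker import Faker

fake = Faker()
print(fake.phone_number())   # free-format, e.g. '001-461-0525x158'

def simple_us_phone():
    # Hypothetical helper, not part of pydbgen or Faker.
    return '{:03d}-{:03d}-{:04d}'.format(
        random.randint(200, 999),   # US area codes do not start with 0 or 1
        random.randint(0, 999),
        random.randint(0, 9999),
    )

print(simple_us_phone())     # always 10 digits, e.g. '415-862-0034'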

Installation

On Linux and Windows, you can use pip to install pydbgen:

pip install pydbgen

On macOS, first install pip:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Then proceed as above.

Usage

The current version (1.0.0) of pydbgen comes with the following primary methods:

  • gen_data_series()
  • gen_dataframe()
  • gen_table()
  • gen_excel()

The gen_table() method allows you to build a database with as many tables as you want, filled with random data and fields of your choice. But first, you have to create an object of the pydb class:

from pydbgen import pydbgen

myDB = pydbgen.pydb()

gen_data_series()

Returns a Pandas series object with the desired number of entries and data type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Phone numbers can be of two types:

  • phone_number_simple generates a 10-digit US number in the xxx-xxx-xxxx format
  • phone_number_full may generate an international number in a different format

Code example:

se = myDB.gen_data_series(data_type='date')
print(se)

0    1995-08-09
1    2001-08-01
2    1980-06-26
3    2018-02-18
4    1972-10-12
5    1983-11-12
6    1975-09-04
7    1970-11-01
8    1978-03-23
9    1976-06-03
dtype: object
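The two phone-number types can be generated the same way. A small sketch, assuming the names listed above are also the accepted data_type strings and that gen_data_series takes a num argument for the number of entries:

ph_simple = myDB.gen_data_series(num=5, data_type='phone_number_simple')
ph_full = myDB.gen_data_series(num=5, data_type='phone_number_full')
print(ph_simple)   # always in xxx-xxx-xxxx format
print(ph_full)     # free-format, possibly international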

gen_dataframe()

Generates a Pandas dataframe filled with random entries. The user can specify the number of rows and the data types of the fields/columns. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

The following customization choices are available:

  • real_email: If True, and if a person's name is also included in the fields, a realistic email address corresponding to that name will be generated. For example, the name Tirtha Sarkar with this choice enabled will generate emails like Tirtha.Sarkar@gmail.com or Sarkar.Tirtha@outlook.com.
  • real_city: If True, a real US city name will be picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10-digit US number in the xxx-xxx-xxxx format will be generated. Otherwise, an international number in a different format may be returned.

Code example:

testdf = myDB.gen_dataframe(
    25, fields=['name', 'city', 'phone', 'license_plate', 'email'],
    real_email=True, phone_simple=True
)
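The returned object is a regular Pandas dataframe, so it can be inspected as usual:

print(testdf.head())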

gen_table()

Attempts to create a table in a database (.db) file using Python's built-in SQLite engine. The user can specify various data types to be included as database table fields.

All data types (fields) in the SQLite table will be of VARCHAR type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

The following customization choices are available:

  • real_email: If True, and if a person's name is also included in the fields, a realistic email address corresponding to that name will be generated. For example, the name Tirtha Sarkar with this choice enabled will generate emails like Tirtha.Sarkar@gmail.com or Sarkar.Tirtha@outlook.com.
  • real_city: If True, a real US city name will be picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10-digit US number in the xxx-xxx-xxxx format will be generated. Otherwise, an international number in a different format may be returned.
  • db_file: Name of the database file where the TABLE will be created or updated. A default database name is used if none is specified.
  • table_name: Name of the table. A default table name is used if none is specified.
  • primarykey: The user can choose a PRIMARY KEY from among the various fields. If nothing is specified, the first data field becomes the PRIMARY KEY. If the user chooses a field that is not in the specified list, an error is thrown and no table is generated.

Code example:

myDB.gen_table(
    20, fields=['name', 'city', 'job_title', 'phone', 'company', 'email'],
    db_file='TestDB.db', table_name='People',
    primarykey='name', real_city=False
)
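To verify the result, the generated file can be queried with Python's built-in sqlite3 module (the file and table names are taken from the example above):

import sqlite3

conn = sqlite3.connect('TestDB.db')
print(conn.execute('SELECT * FROM People LIMIT 5').fetchall())
conn.close()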

gen_excel()

Attempts to create an Excel file using Pandas' ExcelWriter. The user can specify various data types to be included. All data types (fields) in the Excel file will be of text type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

The following customization choices are available:

  • real_email: If True, and if a person's name is also included in the fields, a realistic email address corresponding to that name will be generated. For example, the name Tirtha Sarkar with this choice enabled will generate emails like Tirtha.Sarkar@gmail.com or Sarkar.Tirtha@outlook.com.
  • real_city: If True, a real US city name will be picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10-digit US number in the xxx-xxx-xxxx format will be generated. Otherwise, an international number in a different format may be returned.
  • filename: Name of the Excel file to be created or updated. A default file name is used if none is specified.

Code example:

myDB.gen_excel(
    15, fields=['name', 'year', 'email', 'license_plate'],
    filename='TestExcel.xlsx', real_email=True
)
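The generated file can be read back with Pandas for a quick check (pd.read_excel needs an Excel engine such as openpyxl installed):

import pandas as pd

df = pd.read_excel('TestExcel.xlsx')
print(df.head())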

Other auxiliary methods available

A few other auxiliary functions are available in this package.
