A library for generating synthetic time series data with easy-to-use factors and a generator

Last update: Dec 20, 2022

Overview

timeseries-generator

This repository consists of a Python package that generates synthetic time series datasets in a generic way (under /timeseries_generator) and demo notebooks that show how to generate synthetic time series data (under /examples). The goal is to have non-sensitive data available for demoing solutions and for testing the effectiveness of those solutions and algorithms. To test an algorithm, you want time series that contain different kinds of trends. The package helps create such time series while remaining maintainable.

`timeseries_generator` package

For this package, it is assumed that a time series is composed of a base value multiplied by many factors, with noise added on top.

ts = base_value * factor1 * factor2 * ... * factorN + Noiser

These factors can be anything from random noise and linear trends to seasonality. Factors can affect different features: for example, some features in your time series may have a seasonal component while others do not.

Different factors are represented by different classes, which inherit from the BaseFactor class. Factor classes are the input for the Generator class, which creates a dataframe containing the features, the base value, each factor acting on the base value, and the final combined factor and value.
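As a rough illustration of this composition, the plain-Python sketch below builds one row per date and product, applies a trend factor to every product and a seasonal factor to only one of them, and multiplies the result onto a base value. The function and column names (`linear_trend`, `summer_season`, `total_factor`) are assumptions made for this example and are not the package's API; the noise term is left out so the output stays deterministic.

```python
from datetime import date, timedelta
from typing import Dict, List

BASE_VALUE = 100.0

def linear_trend(day_index: int, coef: float = 0.01, offset: float = 1.0) -> float:
    """Hypothetical trend factor that grows linearly with the day index."""
    return offset + coef * day_index

def summer_season(day: date, product: str) -> float:
    """Hypothetical seasonal factor that only affects one product."""
    if product != "basketball_top":
        return 1.0  # neutral factor: this feature value has no seasonality
    return 1.5 if day.month in (6, 7, 8) else 1.0

def build_rows(start: date, n_days: int, products: List[str]) -> List[Dict]:
    """Build one row per (date, product) with each factor and the final value."""
    rows = []
    for i in range(n_days):
        day = start + timedelta(days=i)
        for product in products:
            trend = linear_trend(i)
            season = summer_season(day, product)
            total_factor = trend * season  # factors combine by multiplication
            rows.append({
                "date": day,
                "product": product,
                "base_value": BASE_VALUE,
                "trend": trend,
                "season": season,
                "total_factor": total_factor,
                "value": BASE_VALUE * total_factor,
            })
    return rows

rows = build_rows(date(2020, 6, 1), n_days=3, products=["yoga_mat", "basketball_top"])
for row in rows:
    print(row["date"], row["product"], round(row["value"], 2))
```

Each row keeps the individual factor columns next to the combined factor, which makes it easy to see which factor drove a given value.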

Core concept

Generator: a Python class that generates the time series. A generator contains a list of factors and a noiser. By overlaying the factors and the noiser, the generator produces a customized time series.
Factor: a Python class that generates trend, seasonality, holiday effects, etc. Factors take effect by multiplying the base value of the generator.
Noiser: a Python class that generates time series noise. A noiser takes effect by being added on top of the "factorized" time series. The formula above summarizes these concepts.
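The relationship between the three concepts can be sketched with a few small classes. This is a minimal sketch under the assumptions stated in the text (factors multiply, noise is added); the class names `SketchFactor`, `SketchNoiser`, and `SketchGenerator` are hypothetical and do not mirror the package's real classes or signatures.

```python
import random
from typing import Callable, List

class SketchFactor:
    """Hypothetical factor: maps a time step to a multiplier."""
    def __init__(self, name: str, fn: Callable[[int], float]):
        self.name = name
        self.fn = fn

    def value(self, t: int) -> float:
        return self.fn(t)

class SketchNoiser:
    """Hypothetical noiser: additive Gaussian noise from a seeded RNG."""
    def __init__(self, stdev: float, seed: int = 0):
        self.stdev = stdev
        self.rng = random.Random(seed)

    def value(self, t: int) -> float:
        return self.rng.gauss(0.0, self.stdev)

class SketchGenerator:
    """Hypothetical generator: overlays all factors, then adds noise."""
    def __init__(self, factors: List[SketchFactor], noiser: SketchNoiser,
                 base_value: float = 1.0):
        self.factors = factors
        self.noiser = noiser
        self.base_value = base_value

    def generate(self, n_steps: int) -> List[float]:
        series = []
        for t in range(n_steps):
            factorized = self.base_value
            for factor in self.factors:
                factorized *= factor.value(t)  # factors multiply onto the base value
            series.append(factorized + self.noiser.value(t))  # noise is summed on top
        return series

trend = SketchFactor("trend", lambda t: 1.0 + 0.1 * t)
weekend = SketchFactor("weekend", lambda t: 1.2 if t % 7 in (5, 6) else 1.0)
gen = SketchGenerator([trend, weekend], SketchNoiser(stdev=0.0), base_value=10.0)
series = gen.generate(7)
print([round(v, 2) for v in series])
```

Setting `stdev=0.0` switches the noise off, which is convenient for checking the multiplicative part in isolation before adding randomness.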

Built-in Factors

LinearTrend: gives a linear trend based on the input slope and intercept
CountryYearlyTrend: gives a yearly market-cap factor based on GDP per capita
EUEcoTrendComponents: gives a monthly changing factor based on public EU industry product data
HolidayTrendComponents: simulates holiday sales peaks; it adapts the holiday dates to each country
BlackFridaySaleComponents: simulates the Black Friday sale event
WeekendTrendComponents: more sales at weekends than on weekdays
FeatureRandFactorComponents: sets up different sales amounts for different stores and different products
ProductSeasonTrendComponents: simulates season-sensitive product sales. In this example code, we have 3 different types of product:
- winter jacket: inversely proportional to the temperature, more sales in winter
- basketball top: proportional to the temperature, more sales in summer
- Yoga Mat: temperature insensitive
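To make the season-sensitive behaviour concrete, here is a small sketch of how a temperature-driven factor could treat the three product types differently. The temperature curve, the linear scaling, and the function names are all assumptions for illustration; the package's own implementation may differ.

```python
import math

def temperature(day_of_year: int) -> float:
    """Assumed yearly temperature curve (0 to 20 degrees C), coldest at the
    start of the year and warmest mid-year."""
    return 10.0 - 10.0 * math.cos(2 * math.pi * day_of_year / 365)

def season_factor(product: str, day_of_year: int) -> float:
    """Hypothetical seasonal multiplier per product type."""
    warmth = temperature(day_of_year) / 20.0  # normalised to the range 0..1
    if product == "winter jacket":
        return 1.5 - warmth   # decreases as it gets warmer: more sales in winter
    if product == "basketball top":
        return 0.5 + warmth   # increases as it gets warmer: more sales in summer
    return 1.0                # e.g. "Yoga Mat": temperature insensitive

for product in ("winter jacket", "basketball top", "Yoga Mat"):
    winter = season_factor(product, 0)
    summer = season_factor(product, 182)
    print(f"{product}: winter={winter:.2f}, summer={summer:.2f}")
```

A simple linear mapping is used here instead of a strict inverse proportion so that the factor stays bounded when the temperature approaches zero.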

Installation

pip install timeseries-generator

Usage

from timeseries_generator import LinearTrend, Generator, WhiteNoise, RandomFeatureFactor
import pandas as pd

# set up a linear trend
lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")
g = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-20-2020"))
g.generate()
g.plot()

# update by adding some white noise to the generator
wn = WhiteNoise(stdev_factor=0.05)
g.update_factor(wn)
g.generate()
g.plot()

Example Notebooks

We currently have 2 example notebooks available:

generate_stationary_process: Good for introducing the basics of the timeseries_generator. Shows how to apply simple linear trends and how to introduce features and labels, as well as random noise.
use_external_factors: Goes into more detail and shows how to use the external_factors submodule. Shows how to create seasonal trends.

Web based prototyping UI

We also use Streamlit to build a web-based UI that demonstrates how to use this package to generate synthetic time series data interactively.

streamlit run examples/streamlit/app.py

License

This package is released under the Apache License, Version 2.0


Comments

Time series data augmentation

Is there a code example that shows how to increase the amount of time series data, either by adding slightly modified copies of existing time series or by creating new synthetic series from existing data?

opened by YAYAYru 0

KeyError: 'country'

From the following code,

import pandas as pd
from timeseries_generator import HolidayFactor, LinearTrend, Generator

lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")

g: Generator = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-01-2021"))

holiday_factor = HolidayFactor(
    country_feature_name="country",
)
g.add_factor(holiday_factor)
g.generate()

I get the following error. I am not sure whether this is the expected behavior.

File /usr/local/Caskroom/miniconda/base/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py:10083, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
...
-> 1849     raise KeyError(key)
   1851 # Check for duplicates
   1852 if values.ndim > 1:

KeyError: 'country'

opened by twobitunicorn 0

[Feature request] Customizable feature combinations

Hi team, Thanks for the useful library! I wonder if you'd be open to this idea:

I would like to be able to:

- Set up categorizing features (let's say, for illustration, CATEGORY=[footwear, t-shirts, socks], SIZE=[S, M, L, US-Mens-8, US-Womens-6]) and define Factors on them
- Generate time series with more restricted feature combinations than the outer product (again for illustration, "t-shirt sizes for t-shirts, shoe sizes for footwear")

Today, it seems like Generator.generate() hard-codes the assumption that time-series should be generated for the product of all provided feature values.

It'd be helpful if, instead, we could have the option of customizing this join to limit down generated combinations?

Some options I can think of:

- Leave the library as-is: Users generate full outer product and limit down what they want in post-processing
  - This seems possible already, but very RAM-intensive if your desired combinations are sparse?
- Accept an optional dataframe of factor combinations as parameter to the generate() method
  - Gives full flexibility over which combinations are kept / ignored, without assuming any particular rigid hierarchies between features
  - ...But might need to do a bit of validation to protect against user errors? May not be super easy to use without some documented examples / functions to generate the dataframe
- Some more complex API for feature configuration that accommodates specifying valid/invalid feature combinations
  - Might be nicer for usability, but difficult to make general: E.g. a straightforward hierarchy could be represented as a nested dict, but in practice many applications have multiple intersecting views of product category information e.g. brand, type, target segment, etc.

opened by athewsey 1

Generate hourly data

First of all, thank you for making this repository public! I enjoy its ease of use and the built-in factors.

Problem description

I'm currently trying to generate revenue data for a bar/restaurant on an hourly basis. As far as I can see, the timeseries-generator only supports generating one data point per day, not per hour.

I tried to generate hourly data like g = Generator(factors={lt}, features=None, date_range=pd.date_range(start='15/9/2021', end='30/9/2021', freq='h')) which didn't work.

Potential solution

Add the possibility to generate hourly data too. If this is a promising idea in your opinion, I'm willing to contribute to the implementation.

Thank you in advance!

opened by nileger 1

Releases (v0.1.0)

v0.1.0 (Jul 20, 2021)

First release of time series generators, including:

- base factor
- linear trend factor
- sinusoidal factor
- white noise factor
- random factor
- holiday factor
- weekday factor
- country GDP factor
- EU industry index factor

Examples:

- notebooks which include some simple examples
- streamlit dashboard


Owner

Nike Inc.

