Backtesting an algorithmic trading strategy using Machine Learning and Sentiment Analysis.

Overview

Trading Tesla with Machine Learning and Sentiment Analysis

An interactive program that trains a Random Forest Classifier to predict Tesla daily prices using technical indicators and sentiment scores of Twitter posts, backtests the resulting trading strategy and produces performance metrics.

The project leverages techniques, paradigms and data structures such as:

  • Functional and Object-Oriented Programming
  • Machine Learning
  • Sentiment Analysis
  • Concurrency and Parallel Processing
  • Directed Acyclic Graph (DAG)
  • Data Pipeline
  • Idempotence
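
As a rough illustration of how a data pipeline can be organised as a directed acyclic graph, the sketch below orders a handful of tasks with Python's standard-library graphlib module (Python 3.9+). The task names are hypothetical and are not taken from the project's code; the sketch only shows the idea of running each step after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key depends on the tasks listed in its set.
TASKS = {
    "download_prices": set(),
    "download_tweets": set(),
    "technical_indicators": {"download_prices"},
    "sentiment_scores": {"download_tweets"},
    "aggregate": {"technical_indicators", "sentiment_scores"},
}


def execution_order(graph):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(TASKS)
print(order)
```

Because the graph has no cycles, at least one such order always exists; independent branches (here, prices and tweets) are also natural candidates for running in parallel.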

Scope

The intention behind this project was to implement the end-to-end workflow for backtesting an algorithmic trading strategy in a program with a sleek interface. The level of automation lets the user tailor the details of the strategy and the output of the program by entering a minimal amount of data, partly in an interactive way. This should make the program reusable: it is easy to backtest the trading strategy on a different asset. Furthermore, the modular software design should make it easier to adapt the program to different requirements (e.g. different data or ML models).

Strategy Backtesting Results

The Random Forest classifier was trained and optimised with scikit-learn's GridSearchCV. After predicting the trading signals and backtesting the strategy, the following performance was recorded:

Performance Indicators Summary

    Return Buy and Hold (%)               273.94
    Return Buy and Hold Ann. (%)           91.5
    Return Trading Strategy (%)          1555.54
    Return Trading Strategy Ann. (%)      298.53
    Sharpe Ratio                            0.85
    Hit Ratio (%)                          93.0
    Average Trades Profit (%)               3.99
    Average Trades Loss (%)                -1.15
    Max Drawdown (%)                       -7.69
    Days Max Drawdown Recovery              2
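
For readers unfamiliar with these indicators, here is a minimal sketch of how three of them are commonly defined. It uses made-up numbers, and the function names are illustrative; the project may compute its metrics differently (for example with other conventions for annualisation or drawdown).

```python
def hit_ratio(trade_returns):
    """Share of trades with a positive return, in percent."""
    wins = sum(1 for r in trade_returns if r > 0)
    return 100.0 * wins / len(trade_returns)


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline of an equity curve, in percent (<= 0)."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)                    # highest value seen so far
        worst = min(worst, (value - peak) / peak)  # decline from that peak
    return 100.0 * worst


def annualised_return(total_return_pct, years):
    """Geometric annualisation of a total return given in percent."""
    growth = 1.0 + total_return_pct / 100.0
    return 100.0 * (growth ** (1.0 / years) - 1.0)


# Made-up sample data, not taken from the backtest above.
trades = [0.04, -0.01, 0.03, 0.05]
equity = [100.0, 104.0, 102.96, 106.05, 111.35]

print(hit_ratio(trades))
print(round(max_drawdown(equity), 2))
print(round(annualised_return(100.0, 2), 2))
```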

[Figure: drawdown]

[Figure: returns]

Running the Program

Running the whole program takes only a few variables and method calls, as illustrated in the steps below:

  1. Provide the variables in download_params, a dictionary containing all the strategy and data downloading details.

    download_params = {'ticker' : 'TSLA',
                       'since' : '2010-06-29', 
                       'until' : '2021-06-02',
                       'twitter_scrape_by_account' : {'elonmusk': {'search_keyword' : '',
                                                                   'by_hashtag' : False},
                                                      'tesla': {'search_keyword' : '',
                                                                'by_hashtag' : False},
                                                      'WSJ' : {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                      'Reuters' : {'search_keyword' : 'Tesla',
                                                                   'by_hashtag' : False},
                                                      'business': {'search_keyword' : 'Tesla',
                                                                   'by_hashtag' : False},
                                                      'CNBC': {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                      'FinancialTimes' : {'search_keyword' : 'Tesla',
                                                                          'by_hashtag' : True}},
                       'twitter_scrape_by_most_popular' : {'all_twitter_1': {'search_keyword' : 'Tesla',
                                                                           'max_tweets_per_day' : 30,
                                                                           'by_hashtag' : True}},
                       'language' : 'en'                                      
                       }
  2. Initialise an instance of the Pipeline class:

    TSLA_data_pipeline = Pipeline()
  3. Call the run method on the Pipeline instance:

    TSLA_pipeline_outputs = TSLA_data_pipeline.run()

    This returns a dictionary with the outputs of the Pipeline functions, which in this example is assigned to TSLA_pipeline_outputs. It also prints messages about the status and operations of the data downloading and manipulation process.

  4. Retrieve the path to the aggregated data to feed into the Backtest_Strategy class:

    import glob

    data = glob.glob('data/prices_TI_sentiment_scores/*')[0]
  5. Initialise an instance of the Backtest_Strategy class with the data variable assigned in the previous step.

    TSLA_backtest_strategy = Backtest_Strategy(data)
  6. Call the preprocess_data method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.preprocess_data()

    This method shows a summary of the data preprocessing results, such as missing values, infinite values and feature statistics.
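
Since a full download can take a long time, it may be worth sanity-checking the download_params dictionary from step 1 before running the pipeline. The helper below is a hypothetical sketch and is not part of the project: check_download_params and the rules it enforces are assumptions based only on the keys shown in the example above.

```python
from datetime import date


def check_download_params(params):
    """Return a list of problems found in a download_params-style dict."""
    problems = []
    for key in ("ticker", "since", "until"):
        if key not in params:
            problems.append(f"missing key: {key}")
    if not problems:
        # Dates are assumed to be ISO strings, as in the example above.
        since = date.fromisoformat(params["since"])
        until = date.fromisoformat(params["until"])
        if since >= until:
            problems.append("'since' must be earlier than 'until'")
    for account, options in params.get("twitter_scrape_by_account", {}).items():
        for field in ("search_keyword", "by_hashtag"):
            if field not in options:
                problems.append(f"{account}: missing {field}")
    return problems


example = {
    "ticker": "TSLA",
    "since": "2010-06-29",
    "until": "2021-06-02",
    "twitter_scrape_by_account": {
        "elonmusk": {"search_keyword": "", "by_hashtag": False},
    },
    "language": "en",
}
print(check_download_params(example))
```

An empty list means no problem was detected by these particular checks; it does not guarantee that the pipeline accepts the dictionary.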

From this point the program becomes interactive, and the user is able to input data, save and delete files related to the training and testing of the Random Forest model, and proceed to display the strategy backtesting summary and graphs.

  1. Call the train_model method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.train_model()

    Here you can train the model with scikit-learn's GridSearchCV using your own parameter grid, and save or delete the files containing the parameter grid and the best set of parameters found.

  2. Call the test_model method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.test_model()

    This method will allow you to test the model by selecting one of the model's best parameters files saved during the training process (or the "default_best_param.json" file created by default by the program, if no other file was saved by the user).

    Once the process is complete, it will display the testing summary metrics and graphs.

    If you are satisfied with the testing results, you can display the backtesting summary from here, which is equivalent to calling the final method below. In this case, the program will also save a csv file with the data needed to compute the strategy performance metrics.

  3. Call the strategy_performance method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.strategy_performance()

    This method displays the backtesting summary shown earlier in this document. If a testing session has been completed and the csv file for computing the performance metrics exists, the program displays the backtesting results straight away from that file, which is overwritten every time a testing process completes. Otherwise, it prompts you to run a training/testing session first.
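
The fallback described for test_model (use a saved best-parameters file, or "default_best_param.json" if none was saved) could be sketched as follows. This is an assumption about one possible implementation, not the project's code: load_best_params, the folder layout and the sample parameter values are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

DEFAULT_FILE = "default_best_param.json"  # file name mentioned in the text above


def load_best_params(folder, chosen=None):
    """Load the chosen parameters file, falling back to the default one."""
    folder = Path(folder)
    if chosen is not None and (folder / chosen).is_file():
        path = folder / chosen
    else:
        path = folder / DEFAULT_FILE
    with path.open() as handle:
        return json.load(handle)


# Demonstration in a temporary folder with a made-up default file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / DEFAULT_FILE).write_text(json.dumps({"n_estimators": 100}))
    print(load_best_params(tmp))                  # no file chosen: default is used
    print(load_best_params(tmp, "missing.json"))  # chosen file absent: default is used
```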

Tips

If the required data (historical prices and Twitter posts) have already been downloaded, the only long execution time you may encounter is during model training: the larger the parameter grid, the longer the search. I recommend familiarising yourself with the program by using the data already provided in the repo (backtesting on Tesla stock).

This is because downloading new data over a period long enough to train the model reliably will likely take many hours, mainly due to the Twitter scraping process.

That said, please also be aware that as soon as you change the variables in the download_params dictionary and run the Pipeline instance, all the existing data files will be overwritten. The program works out on its own which data need to be downloaded according to the parameters passed in download_params; overwriting is a deliberate design choice.
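
One way a program could detect that the parameters have changed since the stored data were downloaded is to compare a fingerprint of the dictionary. The sketch below is purely illustrative and is not the project's actual mechanism; params_fingerprint, needs_download and the sample values are assumptions.

```python
import hashlib
import json


def params_fingerprint(params):
    """Stable hash of a parameters dict, independent of key order."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_download(params, stored_fingerprint):
    """True when the parameters differ from those used for the stored data."""
    return params_fingerprint(params) != stored_fingerprint


old = {"ticker": "TSLA", "since": "2010-06-29", "until": "2021-06-02"}
same = {"until": "2021-06-02", "since": "2010-06-29", "ticker": "TSLA"}
new = dict(old, ticker="AAPL")  # hypothetical change of asset

stored = params_fingerprint(old)
print(needs_download(same, stored), needs_download(new, stored))
```

Re-running with unchanged parameters then triggers no new download, which is the idempotent behaviour one would want from a pipeline step.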

That's all! Clone the repository and play with it. Any feedback welcome.

Disclaimer

Please be aware that the content and results of this project do not represent financial advice. You should conduct your own research before trading or investing in the markets. Your capital is at risk.

Owner
Renato Votto