Research using python - Guide for development of research code (using Anaconda Python)

TL;DR:

One time setup

  1. Install git and go through its one time setup, bare minimum:
    git config --global user.name "First Last"
    git config --global user.email "first[email protected]"
    git config --global core.editor editor_of_choice
    
    Editor option for the few folks on Windows (haven't tried it myself):
    git config --global core.editor "'input/path/to/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
    
  2. Install git-lfs and run git lfs install.
  3. Install miniconda.
  4. Sign up for a GitHub account.
  5. Generate an SSH key and add it to your GitHub account.

Once per repository setup

  1. Create an empty repository on GitHub, let's call it my_project.
  2. Make the initial commit into a local repository and push to remote:
    0. Create the local repository (also creates a new directory): git init my_project
    1. Create a markdown file, README.md describing the project.
    2. Create an environment_dev.yml file based on this example. Change the environment name to an appropriate one and add relevant packages.
    3. Copy this pre-commit configuration file.
    4. Copy this .gitignore file and add file types you want git to ignore.
    5. Add file types to be tracked by git-lfs based on file extension, creates the .gitattributes file (e.g. git lfs track "*.pth")
    6. Copy this .flake8 file to customize the tool settings.
    git add README.md environment_dev.yml .pre-commit-config.yaml .gitattributes .gitignore .flake8
    git commit
    git branch -M main
    git remote add origin git@github.com:user_name/my_project.git
    git push -u origin main
  3. Create the virtual environment, activate it, and set up pre-commit:
    conda env create -f environment_dev.yml
    conda activate my_project_dev
    pre-commit install
    

Start working

  1. Activate the virtual environment: conda activate my_project_dev
  2. Create a new branch off of main:
    git checkout main
    git checkout -b my_new_branch
  3. Work.
  4. Commit locally and push to remote (origin can be either a fork, if using a triangular workflow, or the original repository, if using a centralized workflow):
    git add file1 file2 file3
    git commit
    git push origin my_new_branch
  5. Create a pull request on GitHub and, after tests pass, merge into the main branch.

If code is not in the remote repository, consider it lost.

Long version

Why should you care?

Most scientists need to write code as part of their research. This code is a "physical" embodiment of the underlying algorithmic and mathematical theory. Traditionally, the software engineering standards applied to research code have been rather low (rampant code duplication...). In the past decade we have seen this change, primarily because it is now much more common for researchers to share their code in all its glory (often due to the "encouragement" of funding agencies).

When sharing code, we expect it to comply with some minimal software engineering standards including design, readability, and testing.

I strive to follow the guidance below, but don't always. Still, it's important to have a goal to strive towards. To paraphrase Lewis Carroll: if you don't know where you're going, any road will take you there. From Alice's Adventures in Wonderland:

“Would you tell me, please, which way I ought to go from here?” “That depends a good deal on where you want to get to,” said the Cat. “I don’t much care where-” said Alice. “Then it doesn’t matter which way you go,” said the Cat. “-so long as I get somewhere,” Alice added as an explanation. “Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”

Personal pet peeves, in no particular order:

  • A single commit of all the code in the GitHub repository. Yes, you're sharing code, but it did not magically materialize in its final form. Be transparent so that we can trust the code and see how it developed over time. We can learn from paths that did not pan out almost as much as from the path that did: the full history shows which algorithmic paths were attempted and did not work out, and helps others avoid going down the same dead ends.
  • Repository contains .DS_Store files. Yes, we know you are proud of your Mac. I like OSX too, but seriously, you should have added this file type to the .gitignore file when setting up the repository.
  • Deep learning code sans-data, sans-weight files. This is completely useless in terms of reproducibility. Don't "share" like this.
  • Code duplication with minor, hard to detect, differences between copies.

Version control

  1. Use a version control system. Currently Git is the VCS of the day. Learn how to use it (introduction to git slide deck).
  2. Use a remote repository as your cloud backup. Keep it private during development and then make it public upon publication acceptance. Free services include GitHub and Bitbucket.
  3. Do not commit binary or large files into the repository. Use git-lfs. Beware the Jupyter notebook. Do not commit notebooks with output as this will cause the repository size to blow up, particularly if output includes images. Clear the output before committing.
  4. Use the pre-commit framework to (1) improve compliance with code style and (2) avoid commits of large/binary files, AWS credentials and private keys. We all need a little help complying with our self-imposed constraints (example configuration file). Note that git pre-commit hooks do not preclude non-compliant commits, as a determined user can bypass the hooks with git commit --no-verify.
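
As a minimal sketch of the "clear the output before committing" advice in item 3, the function below strips outputs and execution counts from a notebook using only the standard library. It assumes the notebook follows the nbformat 4 JSON layout (a top-level cells list whose code cells carry outputs and execution_count). The function name clear_notebook_outputs is hypothetical, and in practice a dedicated tool or a pre-commit hook would normally do this job.

```python
import json
from pathlib import Path


def clear_notebook_outputs(notebook_path):
    """Remove outputs and execution counts from all code cells, in place.

    Assumes the nbformat 4 JSON layout. Returns the number of code cells
    that were modified.
    """
    path = Path(notebook_path)
    notebook = json.loads(path.read_text(encoding="utf-8"))

    modified = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        if cell.get("outputs") or cell.get("execution_count") is not None:
            cell["outputs"] = []
            cell["execution_count"] = None
            modified += 1

    # Write the cleaned notebook back with a one-space indent to keep diffs small.
    path.write_text(
        json.dumps(notebook, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return modified
```

Calling clear_notebook_outputs("analysis.ipynb") (a hypothetical file name) before git add keeps embedded images and other bulky results out of the repository history.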

Writing code (Python as a use case)

Many languages have style, testing and documentation tools and conventions. Here we focus on Python, but the concepts are similar for all languages.

  1. Style - Use consistent style and enforce it. Other human beings need to read the code and readily understand it. Write code that is compliant with PEP8 (the Python style guide):
    • Use flake8 to enforce PEP8.
    • Use the Black code formatter, which works for scripts and Jupyter notebooks (for Jupyter notebook support, pip install "black[jupyter]" instead of the regular pip install black). It does not completely agree with flake8, so use both?
    • Some folks don't like the Black formatting; it isn't all roses. An alternative is autopep8.
  2. Testing - Write nominal regression tests at the same time you implement the functionality. Non-rigorous regression testing is acceptable in a research setting as we explore various solutions. The more rigorous the testing the easier it will be for a development team to get code into production. Use pytest for this task.
  3. Documentation - Write the documentation while you are implementing. Start by adding a README file to your repository (use markdown or reStructuredText). It should include a general description of the repository contents, how to run the programs and possibly instructions on how to build them from source code. Generally, when we postpone writing documentation we will likely never do it. That's fine too, as long as you are willing to admit to yourself that you are consciously choosing to not document your code. In Python, use a consistent docstring format. Two popular ones are Google style and NumPy style.
  4. Reproducible environment - Include instructions or configuration files to reproduce the environment in which the code is expected to work. In Python you provide files listing all package dependencies, enabling the creation of the appropriate virtual environment in which to run the program: a requirements.txt for plain Python, or an environment.yml for the Anaconda Python distribution. For development we often rely on additional packages not required for usage (e.g. pytest). Consequently, we include a requirements_dev.txt (environment_dev.yml) in addition to the requirements.txt (environment.yml) files. Sample requirements.txt, requirements_dev.txt and environment.yml, environment_dev.yml files.
  5. Your code is a mathematical multi-parametric function that depends on many parameters beyond the input. These parameters are either:
    • Hard coded - best avoided if they need to be changed for different inputs.
    • Given as arguments on the command-line, appropriate when you have only a few, fewer than five. Several popular Python modules/packages support parsing command-line arguments: argparse, click and docopt. Personally I use argparse (example usage available here).
    • Specified in a configuration file. These usually use XML or JSON formats. I use JSON (example configuration file and short script that reads it). The parameters file is given on the command-line so we also get to use argparse.
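
The sketch below ties together items 2, 3 and 5: parameters read from a JSON file whose path is given on the command-line via argparse, NumPy style docstrings, and a nominal regression test that pytest would collect. It is an illustration under stated assumptions, not the example script referenced above: the moving-average function smooth, the window_size parameter, and the helper names load_parameters and main are all hypothetical stand-ins for actual research code.

```python
import argparse
import json


def load_parameters(config_path):
    """Read algorithm parameters from a JSON configuration file.

    Parameters
    ----------
    config_path : str
        Path to a JSON file containing a single object.

    Returns
    -------
    dict
        Parameter names mapped to their values.
    """
    with open(config_path, encoding="utf-8") as fp:
        return json.load(fp)


def smooth(values, window_size):
    """Moving average, a hypothetical stand-in for the research algorithm.

    Parameters
    ----------
    values : list of float
        Input samples.
    window_size : int
        Number of consecutive samples averaged per output value.

    Returns
    -------
    list of float
        One average per full window, len(values) - window_size + 1 entries.
    """
    if window_size < 1 or window_size > len(values):
        raise ValueError("window_size must be between 1 and len(values)")
    return [
        sum(values[i:i + window_size]) / window_size
        for i in range(len(values) - window_size + 1)
    ]


def main(argv=None):
    """Parse the command-line, read the parameters file, run the algorithm."""
    parser = argparse.ArgumentParser(description="Smooth a list of numbers.")
    parser.add_argument("config", help="path to the JSON parameters file")
    parser.add_argument("values", type=float, nargs="+", help="input samples")
    args = parser.parse_args(argv)

    params = load_parameters(args.config)
    result = smooth(args.values, params["window_size"])
    print(result)
    return result


def test_smooth_nominal():
    """Nominal regression test; pytest collects functions named test_*."""
    assert smooth([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]
```

Saved as a script that calls main() at the bottom (say, a hypothetical smooth.py), this would be run as python smooth.py params.json 1 2 3 4, where params.json contains {"window_size": 2}. Keeping the parameter in the file rather than in the code means the same commit can be rerun with different settings, and the settings themselves can be versioned alongside the results.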

Continuous integration

Automate testing and possibly delivery using continuous integration. There are many CI services that readily integrate with remotely hosted git services. In the past I've used TravisCI and CircleCI; currently I use GitHub Actions. All of these rely on YAML-based configuration files to define workflows.

An example GitHub Actions workflow, which runs the same tests as the pre-commit configuration defined above, is available here.

Owner
Ziv Yaniv