Python script for transferring data between three drives in two separate stages

Last update: Nov 10, 2021

Waterlock

Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs hash verification and persistently tracks data transfer progress using SQLite.

I am not responsible for any lost data. This was an evening coding project. Use at your own discretion.

Use Case & Features

Waterlock was designed for moving files from one computer (e.g. your home server) to an intermediary drive (e.g. a portable hard drive), and then from that drive to another computer (e.g. an offsite backup server).

It will fill the intermediary drive with as many files as it can, aside from a user-configurable amount of reserve-space.
It computes a BLAKE2 checksum with every file copy and compares it to the initial hash value stored in the SQLite database to ensure that the data has not been corrupted.
It uses a SQLite database to track what data has been moved. As a result, you can incrementally move data from one location to another with minimal user input.
Every time Waterlock is run on the source location, it will check for any files that have been recently modified (based on timestamp, not hash). Any modified files will have their hash and modification timestamp updated in the database and will be marked as unmoved, so that they are transferred again and updated. Note that Waterlock does not version files. Because change detection relies on timestamps, silently corrupted files should theoretically not be transferred over unless their modification timestamp has also changed.
Every time Waterlock is run on the source location, it will check for any files that were previously moved to the intermediary drive but did not reach the destination. If these files are no longer on the intermediary drive due to accidental deletion for instance, Waterlock will move those files to the intermediary drive again.
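
The copy-and-verify behaviour in the feature list above can be illustrated with a minimal sketch. This is not Waterlock's actual code: the function names (blake2_of, fits_with_reserve, copy_verified) are hypothetical, and both the choice of BLAKE2b (rather than BLAKE2s) and a reserve expressed in bytes are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def blake2_of(path, chunk_size=1024 * 1024):
    """Return the BLAKE2b hex digest of a file, read in chunks."""
    digest = hashlib.blake2b()  # assumption: BLAKE2b; the README only says "blake2"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fits_with_reserve(file_size, free_bytes, reserve_bytes):
    """True if copying file_size bytes still leaves reserve_bytes free."""
    return free_bytes - file_size >= reserve_bytes


def copy_verified(src, dst_dir, reserve_bytes):
    """Copy src into dst_dir if space allows, then confirm the hashes match.

    Returns the destination Path, or None when the file was skipped
    because it would eat into the reserved space.
    """
    src = Path(src)
    dst_dir = Path(dst_dir)
    free_bytes = shutil.disk_usage(dst_dir).free
    if not fits_with_reserve(src.stat().st_size, free_bytes, reserve_bytes):
        return None
    expected = blake2_of(src)
    dst = dst_dir / src.name
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    if blake2_of(dst) != expected:
        dst.unlink()  # do not leave a corrupted copy behind
        raise IOError(f"hash mismatch after copying {src}")
    return dst
```

In this sketch the hash is computed from the source just before copying; a tool that tracks files in a database would more likely compare against the hash it stored earlier.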

Example Use Case: I use Waterlock to transfer files that are too large to send over the network to an offsite backup location at a relative's house. Each time I visit, I run the script on my home server to load the external drive, then run it again on the offsite backup server.

Usage

Change the settings at the top of the script, using absolute file paths. While relative paths may work, they are more error-prone due to string formatting issues. Store the script on the intermediary drive itself and run it from there. It will automatically create waterlock.db and a cargo folder where the data will be stored. Note that Waterlock will not delete data on the intermediary drive after the final transfer to the destination.

python waterlock.py
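
Since the settings are expected to be absolute paths, a quick check before running can catch mistakes early. The helper below is a hypothetical sketch and not part of Waterlock; it simply reports which of the given settings are not absolute paths on the current platform.

```python
from pathlib import Path


def non_absolute_settings(**paths):
    """Return the names of settings whose value is not an absolute path."""
    return [name for name, value in paths.items()
            if not Path(value).is_absolute()]
```

For example, calling non_absolute_settings(source_directory=source_directory, end_directory=end_directory) should return an empty list when both settings are absolute.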

If you are familiar with Python, you can also fully verify all the files on the intermediary or destination drives to ensure that their hashes match what is stored in the database. This is done using two additional methods of the Waterlock class, verify_middle() and verify_destination(). The code to verify files on the destination would be as follows:

if __name__ == "__main__":
    wl = Waterlock(
        source_directory=source_directory,
        end_directory=end_directory,
        reserved_space=reserved_space,
    )
    wl.start()
    # Check the destination files against the hashes stored in the database.
    wl.verify_destination()
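
Waterlock's internals are not shown here, so the following is only a hypothetical sketch of what a full verification pass could look like: recompute the hash of every file recorded in a SQLite table and report the ones that no longer match. The table name files, its columns rel_path and hash, the function names, and the use of BLAKE2b are all assumptions for illustration, not Waterlock's actual schema or API.

```python
import hashlib
import sqlite3
from pathlib import Path


def blake2_of(path, chunk_size=1024 * 1024):
    """Return the BLAKE2b hex digest of a file, read in chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(db_path, root):
    """Return relative paths whose on-disk hash differs from the stored one.

    Assumes a hypothetical table: files(rel_path TEXT, hash TEXT).
    Files that are missing on disk are reported as mismatches too.
    """
    root = Path(root)
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute("SELECT rel_path, hash FROM files").fetchall()
    finally:
        conn.close()
    bad = []
    for rel_path, stored_hash in rows:
        target = root / rel_path
        if not target.is_file() or blake2_of(target) != stored_hash:
            bad.append(rel_path)
    return bad
```

An empty list from find_mismatches would mean every recorded file is present and matches its stored hash.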

Why 'Waterlock'?

It is named Waterlock after marine locks used to move ships through waterways of different water levels in multiple stages.

Owner

David Swanlund
