Catalogue data - Python scripts to prepare catalogue data

Last update: Mar 03, 2022


Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies:

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) on the Hugging Face Hub at https://huggingface.co/settings/token and set the following environment variables in the .env file at the root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
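
Before running the scripts, it can help to confirm that all four variables are filled in. The following is a minimal sketch using only the standard library; it assumes the .env file holds plain KEY=VALUE lines, and the helper names `read_env_file` and `missing_keys` are hypothetical (they are not part of this repo, and the scripts may load the file differently).

```python
from pathlib import Path

# The four variable names come from the .env template in the Setup section.
REQUIRED_KEYS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_keys(read_env_file(env_path))
        print("Missing values:", ", ".join(missing) if missing else "none")
    else:
        print("No .env file found in the current directory")
```

Run from the repository root, this only reports which variables still need a value; it never prints the token itself.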

Create metadata

To create the dataset metadata (in the file dataset_infos.json), run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad
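
The example ID above has the shape `<namespace>/<name>`. As a small sketch, a check like the following could catch a mistyped ID before the script is run. It assumes every repo ID passed to the script has exactly that two-part form, which the README does not state, and `split_repo_id` is a hypothetical helper, not part of `create_metadata.py`.

```python
from typing import Tuple


def split_repo_id(repo_id: str) -> Tuple[str, str]:
    """Split '<namespace>/<name>' into its two parts, or raise ValueError."""
    namespace, sep, name = repo_id.strip().partition("/")
    # Reject IDs with no slash, an empty part, or more than one slash.
    if not sep or not namespace or not name or "/" in name:
        raise ValueError(f"expected '<namespace>/<name>', got {repo_id!r}")
    return namespace, name


if __name__ == "__main__":
    print(split_repo_id("bigscience-catalogue-lm-data/lm_ca_viquiquad"))
```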

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict that maps dataset names (keys) to their ratios (values), each between 0 and 1.
<dir_path_to_save_aggregated_dataset>: directory path where the aggregated dataset is saved.
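
As an illustration of the ratios file described above, here is a minimal sketch that validates a dict of ratios and writes it as JSON. The file name `dataset_ratios.json`, the second dataset name, and the helpers `validate_ratios` and `write_ratios_file` are all hypothetical; whether dataset names are given as full Hub repo IDs is also an assumption based on the example ID used earlier.

```python
import json
from pathlib import Path


def validate_ratios(ratios):
    """Check that every value is a number between 0 and 1 (inclusive)."""
    for name, ratio in ratios.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
            raise TypeError(f"ratio for {name!r} must be a number, got {ratio!r}")
        if not 0 <= ratio <= 1:
            raise ValueError(f"ratio for {name!r} must be between 0 and 1, got {ratio}")
    return ratios


def write_ratios_file(ratios, path):
    """Validate the ratios and save them as a JSON dict at the given path."""
    validate_ratios(ratios)
    target = Path(path)
    target.write_text(json.dumps(ratios, indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    example_ratios = {
        "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
        "example-org/example_dataset": 0.25,  # hypothetical dataset name
    }
    print(write_ratios_file(example_ratios, "dataset_ratios.json"))
```

The resulting file can then be passed to `--dataset_ratios_path`.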


Owner

BigScience Workshop
