Catalogue data - A Python Scripts to prepare catalogue data

Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install dependencies:

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create virtual environment, activate it and install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create User Access Token (with write access) at Hugging Face Hub: https://huggingface.co/settings/token and set environment variables in the .env file at the root directory:

HF_USERNAME=
   
    
HF_USER_ACCESS_TOKEN=
    
     
GIT_USER=
     
      
GIT_EMAIL=
      

      
     
    
   

Create metadata

To create dataset metadata (in file dataset_infos.json) run:

python create_metadata.py --repo <repo_id>

where you should replace , e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

  • path_to_file_with_dataset_ratios: path to JSON file containing a dict with dataset names (keys) and their ratio (values) between 0 and 1.
  • : directory path to save the aggregated dataset
Owner
BigScience Workshop
Research workshop on large language models - The Summer of Language Models 21
BigScience Workshop
Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

PyUpBit CS490 Large Scale Data Analytics — Implementation of Updatable Compressed Bitmap Indexing Paper Table of Contents About The Project Usage Cont

Hyeong Kyun (Daniel) Park 1 Jun 28, 2022
Randomisation-based inference in Python based on data resampling and permutation.

Randomisation-based inference in Python based on data resampling and permutation.

67 Dec 27, 2022
Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Aryan Raj 7 Sep 04, 2022
Projects that implement various aspects of Data Engineering.

DATAWAREHOUSE ON AWS The purpose of this project is to build a datawarehouse to accomodate data of active user activity for music streaming applicatio

2 Oct 14, 2021
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. The story turns inhabited in a fictional world, tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
Feature engineering and machine learning: together at last

Feature engineering and machine learning: together at last! Lambdo is a workflow engine which significantly simplifies data analysis by unifying featu

Alexandr Savinov 14 Sep 15, 2022
Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apac

Chris Carbonell 1 Dec 03, 2021
Modular analysis tools for neurophysiology data

Neuroanalysis Modular and interactive tools for analysis of neurophysiology data, with emphasis on patch-clamp electrophysiology. Functions for runnin

Allen Institute 5 Dec 22, 2021
Bigdata Simulation Library Of Dream By Sandman Books

BIGDATA SIMULATION LIBRARY OF DREAM BY SANDMAN BOOKS ================= Solution Architecture Description In the realm of Dreaming, its ruler SANDMAN,

Maycon Cypriano 3 Jun 30, 2022
PyPSA: Python for Power System Analysis

1 Python for Power System Analysis Contents 1 Python for Power System Analysis 1.1 About 1.2 Documentation 1.3 Functionality 1.4 Example scripts as Ju

758 Dec 30, 2022
A library to create multi-page Streamlit applications with ease.

A library to create multi-page Streamlit applications with ease.

Jackson Storm 107 Jan 04, 2023
Ejercicios Panda usando Pandas

Readme Below we add configuration details to locally test your application To co

1 Jan 22, 2022
Scraping and analysis of leetcode-compensations page.

Leetcode compensations report Scraping and analysis of leetcode-compensations page.

utsav 96 Jan 01, 2023
Binance Kline Data With Python

Binance Kline Data by seunghan(gingerthorp) reference https://github.com/binance/binance-public-data/ All intervals are supported: 1m, 3m, 5m, 15m, 30

shquant 5 Jul 13, 2022
A distributed block-based data storage and compute engine

Nebula is an extremely-fast end-to-end interactive big data analytics solution. Nebula is designed as a high-performance columnar data storage and tabular OLAP engine.

Columns AI 131 Dec 26, 2022
Pip install minimal-pandas-api-for-polars

Minimal Pandas API for Polars Install From PyPI: pip install minimal-pandas-api-for-polars Example Usage (see tests/test_minimal_pandas_api_for_polars

Austin Ray 6 Oct 16, 2022
Sample code for Harry's Airflow online trainng course

Sample code for Harry's Airflow online trainng course You can find the videos on youtube or bilibili. I am working on adding below things: the slide p

102 Dec 30, 2022
Repository created with LinkedIn profile analysis project done

EN/en Repository created with LinkedIn profile analysis project done. The datase

Mayara Canaver 4 Aug 06, 2022
Wafer Fault Detection - Wafer circleci with python

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

Avnish Yadav 14 Nov 21, 2022
Projeto para realizar o RPA Challenge . Utilizando Python e as bibliotecas Selenium e Pandas.

RPA Challenge in Python Projeto para realizar o RPA Challenge (www.rpachallenge.com), utilizando Python. O objetivo deste desafio é criar um fluxo de

Henrique A. Lourenço 1 Apr 12, 2022