Rootski - Full codebase for rootski.io (without the data)

Overview


📣 Welcome to the Rootski codebase!

This is the codebase for the application running at rootski.io.

🗒 Note: You can find information and training on the architecture, ticket board, development practices, and how to contribute on our knowledge base.

Rootski is a full-stack application for studying the Russian language by learning roots.

Rootski uses an A.I. algorithm called a "transformer" to break Russian words into roots. Rootski enriches the word breakdowns with data such as definitions, grammar information, related words, and examples and then displays this information to users for them to study.

How is the Rootski project run? (Hint, get involved here 😃 )

Rootski is developed by volunteers!

We use Rootski as a platform to learn and mentor anyone with an interest in frontend/backend development, developing data science models, data engineering, MLOps, DevOps, UX, and running a business. Although the code is open-source, the license for reuse and redistribution is tightly restricted.

The premise for building Rootski "in the open" is this: possibly the best way to learn to write production-ready, high-quality software is to

  1. explore other high-quality software that is already written
  2. develop an application meant to support a large number of users
  3. work with experienced mentors

For better or worse, it's hard to find code for large software systems built to be hosted in the cloud and used by a large number of customers. This is because virtually all apps that fit this description... are proprietary 🤣 . That makes (1) hard.

(2) can be inaccessible because of how long it takes to build a well-written software system without a team (or mentorship). If you're only interested in one part of engineering, or if you are a beginner, it can be infeasible to build an entire production system on your own. Think of Rootski as working on a personal project... with a bunch of other fun people working on it with you.

Contributors

Onboarded and contributed features :D

  • Eric Riddoch - Been working on Rootski for 3 years and counting!
  • Ryan Gardner - Helping with all of the legal/business aspects and dabbling in development

Friends

Completed a lot of the Rootski onboarding and chat with us in our Slack workspace about miscellaneous code questions, careers, advice, etc.

  • Isaac Robbins - Learning and building experience in MLOps and DevOps!
  • Colin Varney - Full-stack Python guy. Working his first full-time software job!
  • Fazleem Baig - MLOps guy. Quite experienced with Python and learning about AWS. Working for an AI startup in Canada.
  • Ayse (Aysha) Arslan - Learning about all things MLOps. Working her first MLE/MLOps job!
  • Sebastian Sanchez - Learning about frontend development.
  • Yashwanth (Yash) Kumar - Finishing up the Georgia Tech online master's in CS.






The Technical Stuff

How to deploy an entire Rootski environment from scratch

Going through this, you'll notice that there are several one-time, manual steps. This is common even for teams with a heavily automated infrastructure-as-code workflow, particularly when it comes to the creation of users and storing of credentials.

Once these steps are complete, all subsequent interactions with our Rootski infrastructure can be done using our infrastructure as code and other automation tools.

1. Create an AWS account and user

  1. Create an IAM user with programmatic access
  2. Install the AWS CLI
  3. Run aws configure --profile rootski and copy the credentials from step (1). Set the region to us-west-2.

🗒 Note: this IAM user will need sufficient permissions to create and access the infrastructure that will be discussed below. This includes creating several types of infrastructure using CloudFormation.
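As a quick sanity check after running aws configure, the following minimal sketch looks for a rootski profile in the AWS credentials file. It assumes the CLI wrote the profile to the usual ~/.aws/credentials location as a [rootski] section; the function name profile_is_configured is hypothetical and not part of the Rootski codebase.

```python
import configparser
from pathlib import Path


def profile_is_configured(profile: str, credentials_path: Path) -> bool:
    """Return True if the credentials file has `profile` with both key fields.

    Hypothetical helper for illustration; it only inspects the file and
    never contacts AWS, so it cannot tell whether the keys are valid.
    """
    parser = configparser.ConfigParser()
    # read() quietly skips files that do not exist, leaving no sections.
    parser.read(credentials_path)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section
```

For example, profile_is_configured("rootski", Path.home() / ".aws" / "credentials") should return True once step (3) is done. Whether the user has sufficient permissions still has to be checked against AWS itself.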

2. Create an SSH key pair

  1. In the AWS console, go to EC2 and create an SSH key pair named rootski.
  2. Download the key pair.
  3. Save the key pair somewhere you won't forget. If the key files aren't already named this way, I like to rename them and store them at ~/.ssh/rootski/rootski.id_rsa (private key) and ~/.ssh/rootski/rootski.id_rsa.pub (public key).
  4. Create a new GitHub account for a "Machine User". Copy/paste the contents of rootski.id_rsa.pub into whatever boxes you have to in order to make this work :D This "machine user" is now authorized to clone the rootski repository!
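OpenSSH refuses to use a private key that other users can read, so it is worth locking down the files once they are in place. Here is a minimal sketch of that, assuming a POSIX system and the ~/.ssh/rootski/ layout suggested above; the helper name secure_key_files is hypothetical.

```python
import os
from pathlib import Path


def secure_key_files(key_dir: Path, private_key_name: str = "rootski.id_rsa") -> Path:
    """Restrict the key directory and the private key to the current user.

    Hypothetical helper for illustration. Returns the private key path.
    """
    key_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(key_dir, 0o700)  # owner may read/write/enter; nobody else
    private_key = key_dir / private_key_name
    if not private_key.is_file():
        raise FileNotFoundError(f"expected a private key at {private_key}")
    os.chmod(private_key, 0o600)  # owner may read/write; nobody else
    return private_key
```

Calling secure_key_files(Path.home() / ".ssh" / "rootski") after saving the key pair would apply these permissions.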

3. Create several parameters in AWS SSM Parameter Store

Parameter                        Description
/rootski/ssh/private_key         The contents of the private key needed to clone the rootski repository.
/rootski/prod/database_config    A stringified JSON object with database connection information (see below).

The JSON object for /rootski/prod/database_config:

{
    "postgres_user": "rootski-db-user",
    "postgres_password": "rootski-db-pass",
    "postgres_host": "database.rootski.io",
    "postgres_port": "5432",
    "postgres_db": "rootski-db-database-name"
}
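To make the shape of the /rootski/prod/database_config value concrete, here is a minimal sketch of how a consumer might parse the stringified JSON and turn it into a PostgreSQL connection URL. The function names and the URL format are assumptions for illustration; the source does not show how the Rootski backend actually reads this parameter.

```python
import json
from urllib.parse import quote

# Keys taken from the example database config above.
REQUIRED_KEYS = {
    "postgres_user",
    "postgres_password",
    "postgres_host",
    "postgres_port",
    "postgres_db",
}


def parse_database_config(raw: str) -> dict:
    """Parse the stringified JSON and verify that every expected key is present."""
    config = json.loads(raw)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"database config is missing keys: {sorted(missing)}")
    return config


def build_connection_url(config: dict) -> str:
    """Build a postgresql:// URL (hypothetical format, not confirmed by the source)."""
    # Percent-encode credentials so characters like '@' or ':' cannot break the URL.
    user = quote(config["postgres_user"], safe="")
    password = quote(config["postgres_password"], safe="")
    port = int(config["postgres_port"])  # the port is stored as a string
    return f"postgresql://{user}:{password}@{config['postgres_host']}:{port}/{config['postgres_db']}"
```

The value to store in Parameter Store can be produced with json.dumps on the same dictionary. Note that postgres_port is a string in the example, so the sketch converts it before use.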

4. Purchase a domain name that happens to be rootski.io

You know, the domain name rootski.io is hard coded in a few places throughout the Rootski infrastructure. It felt wasteful to parameterize this everywhere since... it's unlikely that we will ever change our domain name.

If we ever have a need for this, we can revisit it :D

5. Create an ACM TLS certificate verified with the DNS challenge for *.rootski.io

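You'll need to do this in the AWS console. This certificate will allow us to access the subdomains of rootski.io over HTTPS. Keep in mind that a wildcard name like *.rootski.io matches single-level subdomains such as www.rootski.io but not the bare rootski.io, so if the bare domain needs HTTPS too, include rootski.io as an additional name on the certificate. You'll need the ARN of this certificate for a later step.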

6. Create the rootski infrastructure

Before running these commands, copy/paste the ARN of the *.rootski.io ACM certificate into the appropriate place in infrastructure/iac/cloudformation/front-end/static-website.yml.

# create the S3 bucket and Route53 hosted zone for hosting the React application as a static site
...

# create the AWS Cognito user pool
...

# create the AWS Lightsail instance with the backend database (simultaneously deploys the database)
...

# deploy the API Gateway and Lambda function
...

7. Deploy the frontend site

make deploy-frontend

DONE!
