T'rex Park is a Youzan-sponsored project offering Chinese NLP and image models pretrained on e-commerce datasets.

Overview

T'rex Park(霸王龙公园)

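T'rex Park is open-sourced by Youzan's Data Intelligence team and is the first open-source NLP and image project in China trained on large-scale e-commerce data. We plan to gradually release NLP and image models pretrained on NLP corpora such as product titles, reviews, and customer service conversations, as well as on product main images and brand logos.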


Why a T. rex?


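The T. rex is Youzan's mascot. Well, strictly speaking it is less a mascot than a totem that people at Youzan use to spur themselves on. In the early days our website crashed often, and the browser would then show a T. rex image to indicate that the page had crashed. So we adopted the T. rex as our mascot, to keep everyone alert to failures and defects.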


Why open-source the models?

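Unlike platform e-commerce companies, Youzan is a merchant services company, and our mission is to help every merchant who cares about products and service succeed. We therefore decided against offering these capabilities through open APIs. Instead, we release the underlying capabilities directly to the merchants and small and medium-sized e-commerce businesses that need them, so they can quickly build their own machine learning applications on top of the data Youzan has accumulated.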


Why build domain-specific pretrained models?

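Today's open-source large models are mostly trained on general-purpose corpora. Language models trained on general corpora often perform poorly on domain-specific machine learning tasks, or require fine-tuning part of the pretrained model. In our experience, pretrained models fine-tuned on e-commerce data learn domain knowledge better and, on many tasks, achieve good results with no additional training or with training only the prediction part of the model.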

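Our pretrained models, trained on e-commerce corpora, are well suited to small-sample machine learning tasks and address the few-shot problem faced by small and medium-sized e-commerce businesses and merchants. Take product title classification: with only 100 samples per category, the model achieves good classification results. A concrete example can be found here.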
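To illustrate the few-shot idea of keeping the pretrained encoder fixed and fitting only a light prediction step, here is a minimal sketch of a nearest-centroid classifier. It is not the project's own example: the function names (`fit_centroids`, `predict`) are hypothetical, and the toy 3-dimensional vectors stand in for real title embeddings, which are assumed to have been computed beforehand.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(embeddings, labels):
    """Group embeddings by label and keep one mean vector per category."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}


def predict(centroids, embedding):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], embedding))


# Toy vectors standing in for real title embeddings (assumption for illustration).
train_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["soap", "soap", "toys", "toys"]
centroids = fit_centroids(train_vecs, train_labels)
print(predict(centroids, [0.7, 0.3, 0.0]))
```

With real data, each category would contribute its roughly 100 labelled titles, and the encoder itself would not be updated.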

Our models are published on the HuggingFace model hub. Using them takes only a few lines of code:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("youzanai/bert-product-title-chinese")
model = AutoModel.from_pretrained("youzanai/bert-product-title-chinese")

Once the model is loaded, we can run a simple encoding task:

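# Tokenize as a padded batch of PyTorch tensors so the model can consume it
batch = tokenizer(["青蒿精油手工皂", "超级飞侠乐迪太空车"], padding=True, return_tensors="pt")
outputs = model(**batch)
# AutoModel loads the bare encoder, which returns hidden states rather than logits
print(outputs.last_hidden_state)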
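The encoder returns one vector per token, so a single vector per title needs a pooling step. One common choice, not necessarily the one this project uses, is to average the token vectors while skipping padded positions. The sketch below uses plain Python lists so it runs without downloading the model, and `masked_mean_pool` is a hypothetical helper name. With real outputs, the same arithmetic would apply to the per-token hidden states and the tokenizer's `attention_mask`.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions (mask 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


# Toy example: three token positions, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

Ignoring the padded positions matters when titles in a batch differ in length; otherwise the padding vectors would skew the average of the shorter titles.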

The project's src directory contains the complete code and test data, which can be run directly to see the results.


Documentation and help

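Detailed usage documentation is still being written. In the meantime, please refer to the sample code in the src directory. To make the code easier to understand, we have simplified it as much as possible. T'rex Park is built on HuggingFace's Transformers library; the Transformers documentation can be found here.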
