Question and answer retrieval in Turkish with BERT

Overview

trfaq

Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! 🎉

What is this?

At this repo, I'm releasing the training script and a full working inference example for my model mys/bert-base-turkish-cased-nli-mean-faq-mnr published on HuggingFace. Please note that the training code at finetune_tf.py is a simplified version of the original, which is intended for educational purposes and not optimized for anything. However, it contains an implementation of the Multiple Negatives Symmetric Ranking loss, and you can use it in your own work. Additionally, I cleaned and filtered the Turkish subset of the clips/mqa dataset, as it contains lots of mis-encoded texts. You can download this cleaned dataset here.

Model

This is a finetuned version of mys/bert-base-turkish-cased-nli-mean for FAQ retrieval, which is itself a finetuned version of dbmdz/bert-base-turkish-cased for NLI. It maps questions & answers to 768 dimensional vectors to be used for FAQ-style chatbots and answer retrieval in question-answering pipelines. It was trained on the Turkish subset of clips/mqa dataset after some cleaning/ filtering and with a Multiple Negatives Symmetric Ranking loss. Before finetuning, I added two special tokens to the tokenizer (i.e., for questions and for answers) and resized the model embeddings, so you need to prepend the relevant tokens to the sequences before feeding them into the model. Please have a look at my accompanying repo to see how it was finetuned and how it can be used in inference. The following code snippet is an excerpt from the inference at the repo.

Usage

see inference.py for a full working example.

" + q for q in questions] answers = ["" + a for a in answers] def answer_faq(model, tokenizer, questions, answers, return_similarities=False): q_len = len(questions) tokens = tokenizer(questions + answers, padding=True, return_tensors='tf') embs = model(**tokens)[0] attention_masks = tf.cast(tokens['attention_mask'], tf.float32) sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True) masked_embs = embs * tf.expand_dims(attention_masks, axis=-1) masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32) a = tf.math.l2_normalize(masked_embs[:q_len, :], axis=1) b = tf.math.l2_normalize(masked_embs[q_len:, :], axis=1) similarities = tf.matmul(a, b, transpose_b=True) scores = tf.nn.softmax(similarities) results = list(zip(answers, scores.numpy().squeeze().tolist())) sorted_results = sorted(results, key=lambda x: x[1], reverse=True) sorted_results = [{"answer": answer.replace("", ""), "score": f"{score:.4f}"} for answer, score in sorted_results] return sorted_results for question in questions: results = answer_faq(model, tokenizer, [question], answers) print(question.replace("", "")) print(results) print("---------------------") ">
questions = [
    "Merhaba",
    "Nasılsın?",
    "Bireysel araç kiralama yapıyor musunuz?",
    "Kurumsal araç kiralama yapıyor musunuz?"
]

answers = [
    "Merhaba, size nasıl yardımcı olabilirim?",
    "İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?",
    "Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?",
    "Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?"
]


questions = ["" + q for q in questions]
answers = ["" + a for a in answers]


def answer_faq(model, tokenizer, questions, answers, return_similarities=False):
    q_len = len(questions)
    tokens = tokenizer(questions + answers, padding=True, return_tensors='tf')
    embs = model(**tokens)[0]

    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)
    a = tf.math.l2_normalize(masked_embs[:q_len, :], axis=1)
    b = tf.math.l2_normalize(masked_embs[q_len:, :], axis=1)

    similarities = tf.matmul(a, b, transpose_b=True)
        
    scores = tf.nn.softmax(similarities)
    results = list(zip(answers, scores.numpy().squeeze().tolist()))
    sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
    sorted_results = [{"answer": answer.replace("", ""), "score": f"{score:.4f}"} for answer, score in sorted_results]
    return sorted_results


for question in questions:
    results = answer_faq(model, tokenizer, [question], answers)
    print(question.replace("", ""))
    print(results)
    print("---------------------")

And the output is:

Merhaba
[{'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.2931'}, {'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2751'}, {'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2200'}, {'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.2118'}]
---------------------
Nasılsın?
[{'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2808'}, {'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.2623'}, {'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2320'}, {'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.2249'}]
---------------------
Bireysel araç kiralama yapıyor musunuz?
[{'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2861'}, {'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.2768'}, {'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2215'}, {'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.2156'}]
---------------------
Kurumsal araç kiralama yapıyor musunuz?
[{'answer': 'Evet, kurumsal araç kiralama hizmetleri sağlıyoruz. Size nasıl yardımcı olabilirim?', 'score': '0.3060'}, {'answer': 'Hayır, sadece Kurumsal Araç Kiralama operasyonları gerçekleştiriyoruz. Size başka nasıl yardımcı olabilirim?', 'score': '0.2929'}, {'answer': 'İyiyim, teşekkür ederim. Size nasıl yardımcı olabilirim?', 'score': '0.2066'}, {'answer': 'Merhaba, size nasıl yardımcı olabilirim?', 'score': '0.1945'}]
---------------------
Owner
M. Yusuf Sarıgöz
AI research engineer and Google Developer Expert on Machine Learning. Open to new opportunities.
M. Yusuf Sarıgöz
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

中文开放信息抽取系统, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

7 Nov 02, 2022
Search msDS-AllowedToActOnBehalfOfOtherIdentity

前言 现在进行RBCD的攻击手段主要是搜索mS-DS-CreatorSID,如果机器的创建者是我们可控的话,那就可以修改对应机器的msDS-AllowedToActOnBehalfOfOtherIdentity,利用工具SharpAllowedToAct-Modify 那我们索性也试试搜索所有计算机

Jumbo 26 Dec 05, 2022
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
A benchmark for evaluation and comparison of various NLP tasks in Persian language.

Persian NLP Benchmark The repository aims to track existing natural language processing models and evaluate their performance on well-known datasets.

Mofid AI 68 Dec 19, 2022
Interpretable Models for NLP using PyTorch

This repo is deprecated. Please find the updated package here. https://github.com/EdGENetworks/anuvada Anuvada: Interpretable Models for NLP using PyT

Sandeep Tammu 19 Dec 17, 2022
KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

KoBERTopic 모델 소개 KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다. 기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d

Won Joon Yoo 26 Jan 03, 2023
News-Articles-and-Essays - NLP (Topic Modeling and Clustering)

NLP T5 Project proposal Topic Modeling and Clustering of News-Articles-and-Essays Students: Nasser Alshehri Abdullah Bushnag Abdulrhman Alqurashi OVER

2 Jan 18, 2022
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022
DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time

DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time. While it efficiently searches the answers out of 60 billion phrases in Wikipedia, it is also v

Jinhyuk Lee 543 Jan 08, 2023
A simple word search made in python

Word Search Puzzle A simple word search made in python Usage $ python3 main.py -h usage: main.py [-h] [-c] [-f FILE] Generates a word s

Magoninho 16 Mar 10, 2022
ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost LOVE is accpeted by ACL22 main conference as a long pape

Lihu Chen 32 Jan 03, 2023
Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script fo

Iker García-Ferrero 41 Dec 15, 2022
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
Knowledge Management for Humans using Machine Learning & Tags

HyperTag helps humans intuitively express how they think about their files using tags and machine learning. Represent how you think using tags. Find what you look for using semantic search for your t

Ravn Tech, Inc. 166 Jan 07, 2023
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
Data preprocessing rosetta parser for python

datapreprocessing_rosetta_parser I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity,

ASReview hackathon for Follow the Money 2 Nov 28, 2021
Generating new names based on trends in data using GPT2 (Transformer network)

MLOpsNameGenerator Overall Goal The goal of the project is to develop a model that is capable of creating Pokémon names based on its description, usin

Gustav Lang Moesmand 2 Jan 10, 2022