BERT-based Financial Question Answering System

Overview

In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-based Financial Question Answering System. We adapt a passage reranking approach: we first retrieve the top-50 candidate answers, then rerank them using FinBERT-QA, a BERT-based model fine-tuned on the FiQA dataset that achieved state-of-the-art results.
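
The retrieve-then-rerank idea can be illustrated with a minimal, library-free sketch. It is not the project's implementation: the real system relies on Jina for retrieval and on FinBERT-QA for reranking, whereas every name below (`overlap_score`, `density_score`, `retrieve`, `rerank`, `search`) and both toy scoring functions are hypothetical stand-ins chosen only to show the two-stage control flow.

```python
from typing import Callable, List, Tuple

# A scorer maps (question, answer) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def overlap_score(question: str, answer: str) -> float:
    """Toy first-stage score: share of question tokens that occur in the answer."""
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    return len(q_tokens & a_tokens) / len(q_tokens) if q_tokens else 0.0


def density_score(question: str, answer: str) -> float:
    """Toy second-stage score (stand-in for a BERT reranker): matching tokens per answer token."""
    q_tokens = set(question.lower().split())
    a_tokens = answer.lower().split()
    if not a_tokens:
        return 0.0
    return sum(token in q_tokens for token in a_tokens) / len(a_tokens)


def retrieve(question: str, answers: List[str], scorer: Scorer, top_k: int = 50) -> List[str]:
    """Stage 1: keep only the top_k answers under a cheap scorer."""
    ranked = sorted(answers, key=lambda a: scorer(question, a), reverse=True)
    return ranked[:top_k]


def rerank(question: str, candidates: List[str], scorer: Scorer) -> List[Tuple[str, float]]:
    """Stage 2: re-score the short candidate list with a more expensive scorer."""
    scored = [(a, scorer(question, a)) for a in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def search(question: str, answers: List[str], top_k: int = 50) -> List[Tuple[str, float]]:
    """Run both stages: retrieve top_k candidates, then rerank them."""
    candidates = retrieve(question, answers, overlap_score, top_k)
    return rerank(question, candidates, density_score)


if __name__ == "__main__":
    corpus = [
        "a dividend is a payment to shareholders",
        "a mortgage is a loan secured by property",
        "the weather is nice",
    ]
    for answer, score in search("what is a dividend", corpus, top_k=2):
        print(f"{score:.3f}  {answer}")
```

The point of the split is cost: the expensive scorer only ever sees `top_k` candidates, never the whole collection.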

🦉 Please refer to this tutorial for a step-by-step guide and detailed explanations.

Motivation

The financial industry has a growing demand for the automatic analysis of unstructured and structured data at scale. QA systems can give companies a competitive advantage by supporting the decision making of financial advisers. The goal of our system is to retrieve a list of relevant answer passages for a given question. Here is an example of a question and a ground truth answer from the FiQA dataset:

[Figure: an example question and its ground truth answer from the FiQA dataset]

Set up

Clone:

git clone https://github.com/yuanbit/jina-financial-qa-search.git

We will use jina-financial-qa-search/ as our working directory.

Install:

pip install -r requirements.txt

Download data and model:

bash get_data.sh

Index Answers

We want to index a subset of the answer passages from the FiQA dataset, dataset/test_answers.csv:

398960	From  http://financial-dictionary.thefreedictionary.com/Business+Fundamentals  The  facts  that  affect  a  company's      underlying  value.  Examples  of  business      fundamentals  include  debt,  cash  flow,      supply  of  and  demand  for  the  company's      products,  and  so  forth.  For  instance,      if  a  company  does  not  have  a      sufficient  supply  of  products,  it  will      fail.  Likewise,  demand  for  the  product      must  remain  at  a  certain  level  in      order  for  it  to  be  successful.  Strong      business  fundamentals  are  considered      essential  for  long-term  success  and      stability.  See  also:  Value  Investing,      Fundamental  Analysis.  For  a  stock  the  basic  fundamentals  are  the  second  column  of  numbers  you  see  on  the  google  finance  summary  page,    P/E  ratio,  div/yeild,  EPS,  shares,  beta.      For  the  company  itself  it's  generally  the  stuff  on  the  'financials'  link    (e.g.  things  in  the  quarterly  and  annual  report,    debt,  liabilities,  assets,  earnings,  profit  etc.
19183	If  your  sole  proprietorship  losses  exceed  all  other  sources  of  taxable  income,  then  you  have  what's  called  a  Net  Operating  Loss  (NOL).  You  will  have  the  option  to  "carry  back"  and  amend  a  return  you  filed  in  the  last  2  years  where  you  owed  tax,  or  you  can  "carry  forward"  the  losses  and  decrease  your  taxes  in  a  future  year,  up  to  20  years  in  the  future.  For  more  information  see  the  IRS  links  for  NOL.  Note:  it's  important  to  make  sure  you  file  the  NOL  correctly  so  I'd  advise  speaking  with  an  accountant.  (Especially  if  the  loss  is  greater  than  the  cost  of  the  accountant...)
327002	To  be  deductible,  a  business  expense  must  be  both  ordinary  and  necessary.  An  ordinary  expense  is  one  that  is  common  and  accepted  in  your  trade  or  business.  A  necessary  expense  is  one  that  is  helpful  and  appropriate  for  your  trade  or  business.  An  expense  does  not  have  to  be  indispensable  to  be  considered  necessary.    (IRS,  Deducting  Business  Expenses)  It  seems  to  me  you'd  have  a  hard  time  convincing  an  auditor  that  this  is  the  case.    Since  business  don't  commonly  own  cars  for  the  sole  purpose  of  housing  $25  computers,  you'd  have  trouble  with  the  "ordinary"  test.    And  since  there  are  lots  of  other  ways  to  house  a  computer  other  than  a  car,  "necessary"  seems  problematic  also.

To index the full dataset instead, change the path to answer_collection.tsv.
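
For readers who want to inspect the answer file before indexing, here is a small sketch of how rows like the ones above could be parsed. It assumes, based only on the sample rows, that each line is `docid<TAB>answer text` with no header row (despite the .csv extension); the helper name `load_answers` is hypothetical and is not part of the repository.

```python
from typing import Dict


def load_answers(path: str) -> Dict[int, str]:
    """Read `docid<TAB>answer text` lines into a {docid: text} dictionary."""
    answers: Dict[int, str] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on the first tab only, so tabs inside the text are kept.
            docid, sep, text = line.rstrip("\n").partition("\t")
            if not sep or not docid.strip():
                continue  # skip blank or malformed lines
            # Collapse the runs of spaces seen in the sample rows.
            answers[int(docid)] = " ".join(text.split())
    return answers
```

Calling `load_answers("dataset/test_answers.csv")` from the working directory would then return a dictionary keyed by answer ID, assuming the file follows the layout shown above.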

Run

python app.py index

At the end you will see the following:

✅ done in ⏱ 1 minute and 54 seconds 🐎 7.7/s
        [email protected][S]:terminated
    [email protected][I]:recv ControlRequest from ctl▸doc_indexer▸⚐
    [email protected][I]:Terminating loop requested by terminate signal RequestLoopEnd()
    [email protected][I]:#sent: 56 #recv: 56 sent_size: 1.7 MB recv_size: 1.7 MB
    [email protected][I]:request loop ended, tearing down ...
    [email protected][I]:indexer size: 865 physical size: 3.1 MB
    [email protected][S]:artifacts of this executor (vecidx) is persisted to ./workspace/doc_compound_indexer-0/vecidx.bin
    [email protected][I]:indexer size: 865 physical size: 3.2 MB
    [email protected][S]:artifacts of this executor (docidx) is persisted to ./workspace/doc_compound_indexer-0/docidx.bin

Search Answers

We need to build a custom Executor to rerank the top-50 candidate answers. We can do this with the Jina Hub API. Let's make sure that the Jina Hub extension is installed:

pip install "jina[hub]"

We can build the custom Ranker, FinBertQARanker, by running:

jina hub build FinBertQARanker/ --pull --test-uses --timeout-ready 60000
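
To give a feel for what a reranker of this kind computes, the sketch below turns per-candidate classifier outputs into relevance scores and sorts by them. It is an illustration under stated assumptions, not the FinBertQARanker code: it assumes the model emits two logits per question-answer pair in the order [not relevant, relevant], and the names `relevance_probability` and `rank_by_relevance` are hypothetical. The logits are passed in as plain numbers so that the example runs without downloading a model.

```python
import math
from typing import List, Sequence, Tuple


def relevance_probability(logits: Sequence[float]) -> float:
    """Softmax over [not relevant, relevant] logits; return P(relevant)."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits per candidate")
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def rank_by_relevance(
    candidates: Sequence[str], logits: Sequence[Sequence[float]]
) -> List[Tuple[str, float]]:
    """Pair each candidate with P(relevant) and sort from most to least relevant."""
    if len(candidates) != len(logits):
        raise ValueError("need one pair of logits per candidate")
    scored = [(c, relevance_probability(l)) for c, l in zip(candidates, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In a real reranker the logits would come from one model forward pass per question-answer pair, which is why scoring 50 candidates is noticeably slower than the first retrieval stage.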

Run

We can now use our Financial QA search engine by running:

python app.py search

The Ranker might take some time to compute the relevancy scores since it is using a BERT-based model. You can try out this list of questions from the FiQA dataset:

• What does it mean that stocks are “memoryless”?
• What would a stock be worth if dividends did not exist?
• What are the risks of Dividend-yielding stocks?
• Why do financial institutions charge so much to convert currency?
• Is there a candlestick pattern that guarantees any kind of future profit?
• 15 year mortgage vs 30 year paid off in 15
• Why is it rational to pay out a dividend?
• Why do companies have a fiscal year different from the calendar year?
• What should I look at before investing in a start-up?
• Where do large corporations store their massive amounts of cash?

Community

  • Slack channel - a communication platform for developers to discuss Jina
  • Community newsletter - subscribe for the latest Jina updates, releases, and event news
  • LinkedIn - get to know Jina AI as a company and find job opportunities
  • Twitter - follow Jina AI and interact using the hashtag #JinaSearch
  • Company - learn more about Jina AI, which is fully committed to open source

License

Copyright (c) 2021 Jina's friend. All rights reserved.

Owner
Bithiah Yuan