Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

Overview

Auto-Research

Auto-Research

A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting artifacts from a single research query.

Data Provider: arXiv Open Archive Initiative OAI

Requirements:

  • python 3.7 or above
  • poppler-utils
  • list of requirements in requirements.txt
  • 8GB disk space
  • 13GB CUDA(GPU) memory - for a survey of 100 searched papers(max_search) and 25 selected papers(num_papers)

Demo :

Video Demo : https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo : https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

([TIP] click 'edit and run' to run the demo for your custom queries on a free GPU)

Steps to run (pip coming soon):

apt install -y poppler-utils libpoppler-cpp-dev
git clone https://github.com/sidphbot/Auto-Research.git

cd Auto-Research/
pip install -r requirements.txt
python survey.py [options] <your_research_query>

Artifacts generated (zipped):

  • Detailed survey draft paper as txt file
  • A curated list of top 25+ papers as pdfs and txts
  • Images extracted from above papers as jpegs, bmps etc
  • Heading/Section wise highlights extracted from above papers as a re-usable pure python joblib dump
  • Tables extracted from papers(optional)
  • Corpus of metadata highlights/text of top 100 papers as a re-usable pure python joblib dump

Example run #1 - python utility

python survey.py 'multi-task representation learning'

Example run #2 - python class

from survey import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')

Research tools:

These are independent tools for your research or document text handling needs.

*[Tip]* :(models can be changed in defaults or passed on during init along with `refresh-models=True`)
  • abstractive_summary - takes a long text document (string) and returns a 1-paragraph abstract or “abstractive” summary (string)

    Input:

      `longtext` : string
    

    Returns:

      `summary` : string
    
  • extractive_summary - takes a long text document (string) and returns a 1-paragraph of extracted highlights or “extractive” summary (string)

    Input:

      `longtext` : string
    

    Returns:

      `summary` : string
    
  • generate_title - takes a long text document (string) and returns a generated title (string)

    Input:

      `longtext` : string
    

    Returns:

      `title` : string
    
  • extractive_highlights - takes a long text document (string) and returns a list of extracted highlights ([string]), a list of keywords ([string]) and key phrases ([string])

    Input:

      `longtext` : string
    

    Returns:

      `highlights` : [string]
      `keywords` : [string]
      `keyphrases` : [string]
    
  • extract_images_from_file - takes a pdf file name (string) and returns a list of image filenames ([string]).

    Input:

      `pdf_file` : string
    

    Returns:

      `images_files` : [string]
    
  • extract_tables_from_file - takes a pdf file name (string) and returns a list of csv filenames ([string]).

    Input:

      `pdf_file` : string
    

    Returns:

      `images_files` : [string]
    
  • cluster_lines - takes a list of lines (string) and returns the topic-clustered sections (dict(generated_title: [cluster_abstract])) and clustered lines (dict(cluster_id: [cluster_lines]))

    Input:

      `lines` : [string]
    

    Returns:

      `sections` : dict(generated_title: [cluster_abstract])
      `clusters` : dict(cluster_id: [cluster_lines])
    
  • extract_headings - [for scientific texts - Assumes an ‘abstract’ heading present] takes a text file name (string) and returns a list of headings ([string]) and refined lines ([string]).

    [Tip 1] : Use extract_sections as a wrapper (e.g. extract_sections(extract_headings(“/path/to/textfile”)) to get heading-wise sectioned text with refined lines instead (dict( heading: text))

    [Tip 2] : write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well !!

    Input:

      `text_file` : string 		
    

    Returns:

      `refined` : [string], 
      `headings` : [string]
      `sectioned_doc` : dict( heading: text) (Optional - Wrapper case)
    

Access/Modify defaults:

  • inside code
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)

or,

  • Modify static config file - defaults.py

or,

  • At runtime (utility)
python survey.py --help
usage: survey.py [-h] [--max_search max_metadata_papers]
                   [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                   [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                   [--dump_dir dump_dir] [--models_dir save_models_dir]
                   [--title_model_name title_model_name]
                   [--ex_summ_model_name extractive_summ_model_name]
                   [--ledmodel_name ledmodel_name]
                   [--embedder_name sentence_embedder_name]
                   [--nlp_name spacy_model_name]
                   [--similarity_nlp_name similarity_nlp_name]
                   [--kw_model_name kw_model_name]
                   [--refresh_models refresh_models] [--high_gpu high_gpu]
                   query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximium number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximium number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model(for abstractive summary) name/tag in
                        hugging-face, defaults to 'allenai/led-
                        large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model(for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        Refresh model downloads with given names (needs
                        atleast one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False

  • At runtime (code)

    during surveyor object initialization with surveyor_obj = Surveyor()

    • pdf_dir: String, pdf paper storage directory - defaults to arxiv_data/tarpdfs/
    • txt_dir: String, text-converted paper storage directory - defaults to arxiv_data/fulltext/
    • img_dir: String, image image storage directory - defaults to arxiv_data/images/
    • tab_dir: String, tables storage directory - defaults to arxiv_data/tables/
    • dump_dir: String, all_output_dir - defaults to arxiv_dumps/
    • models_dir: String, directory to save to huge models, defaults to saved_models/
    • title_model_name: String, title model name/tag in hugging-face, defaults to Callidior/bert2bert-base-arxiv-titlegen
    • ex_summ_model_name: String, extractive summary model name/tag in hugging-face, defaults to allenai/scibert_scivocab_uncased
    • ledmodel_name: String, led model(for abstractive summary) name/tag in hugging-face, defaults to allenai/led-large-16384-arxiv
    • embedder_name: String, sentence embedder name/tag in hugging-face, defaults to paraphrase-MiniLM-L6-v2
    • nlp_name: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to en_core_sci_scibert
    • similarity_nlp_name: String, spacy downstream trained model(for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to en_core_sci_lg
    • kw_model_name: String, keyword extraction model name/tag in hugging-face, defaults to distilbert-base-nli-mean-tokens
    • high_gpu: Bool, High GPU usage permitted, defaults to False
    • refresh_models: Bool, Refresh model downloads with given names (needs atleast one model name param above), defaults to False

    during survey generation with surveyor_obj.survey(query="my_research_query")

    • max_search: int maximium number of papers to gaze at - defaults to 100
    • num_papers: int maximium number of papers to download and analyse - defaults to 25
You might also like...
 NLP topic mdel LDA - Gathered from New York Times website
NLP topic mdel LDA - Gathered from New York Times website

NLP topic mdel LDA - Gathered from New York Times website

This repo stores the codes for topic modeling on palliative care journals.

This repo stores the codes for topic modeling on palliative care journals. Data Preparation You first need to download the journal papers. bash 1_down

topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using API
topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using API

NLP Space News Topic Modeling Photos by nasa.gov (1, 2, 3, 4, 5) and extremetech.com Table of Contents Project Idea Data acquisition Primary data sour

Biterm Topic Model (BTM): modeling topics in short texts
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

Topic Inference with Zeroshot models

zeroshot_topics Table of Contents Installation Usage License Installation zeroshot_topics is distributed on PyPI as a universal wheel and is available

Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

A python framework to transform natural language questions to queries in a database query language.

__ _ _ _ ___ _ __ _ _ / _` | | | |/ _ \ '_ \| | | | | (_| | |_| | __/ |_) | |_| | \__, |\__,_|\___| .__/ \__, | |_| |_| |___/

Code for
Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

README Code for Two-stage Identifier: "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022. For details of the model a

Releases(0.0.2)
Owner
Sidharth Pal
Deep learning researcher with a huge passion for open source and an undying motivation to help the community.
Sidharth Pal
本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。

【关于 NLP】那些你不知道的事 作者:杨夕、芙蕖、李玲、陈海顺、twilight、LeoLRH、JimmyDU、艾春辉、张永泰、金金金 介绍 本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。 目录架构 一、【

1.4k Dec 30, 2022
official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Plugin 3 Jan 12, 2022
Faster, modernized fork of the language identification tool langid.py

py3langid py3langid is a fork of the standalone language identification tool langid.py by Marco Lui. Original license: BSD-2-Clause. Fork license: BSD

Adrien Barbaresi 12 Nov 05, 2022
A Persian Image Captioning model based on Vision Encoder Decoder Models of the transformers🤗.

Persian-Image-Captioning We fine-tuning the Vision Encoder Decoder Model for the task of image captioning on the coco-flickr-farsi dataset. The implem

Hamtech-ai 15 Aug 25, 2022
Tracking Progress in Natural Language Processing

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Sebastian Ruder 21.2k Dec 30, 2022
DiY Oxygen Concentrator based on the OxiKit

M19O2 DiY Oxygen Concentrator based on / inspired by the OxiKit, OpenOx, Marut, RepRap and Project Apollo platforms. About Read about the project on H

Maker's Asylum 62 Dec 22, 2022
Generating Korean Slogans with phonetic and structural repetition

LexPOS_ko Generating Korean Slogans with phonetic and structural repetition Generating Slogans with Linguistic Features LexPOS is a sequence-to-sequen

Yeoun Yi 3 May 23, 2022
Korean extractive summarization. 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

korean extractive summarization 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드 Leaderboard Notice Text Summarization with Pretrained Encoders에 나오는 bertsumext모델(ext

3 Aug 10, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

张博 1 Feb 02, 2022
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Microsoft 105 Jan 08, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Jan 03, 2023
Mednlp - Medical natural language parsing and utility library

Medical natural language parsing and utility library A natural language medical

Paul Landes 3 Aug 24, 2022
CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021
pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks

297 Dec 29, 2022
Graph Coloring - Weighted Vertex Coloring Problem

Graph Coloring - Weighted Vertex Coloring Problem This project proposes several local searches and an MCTS algorithm for the weighted vertex coloring

Cyril 1 Jul 08, 2022
Blender addon - Scrub timeline from viewport with a shortcut

Viewport scrub timeline Move in the timeline directly in viewport and snap to nearest keyframe Note : This standalone feature will be added in the nat

Samuel Bernou 40 Nov 07, 2022
📝An easy-to-use package to restore punctuation of the text.

✏️ rpunct - Restore Punctuation This repo contains code for Punctuation restoration. This package is intended for direct use as a punctuation restorat

Daulet Nurmanbetov 72 Dec 30, 2022