Dataset and Source code of paper 'Enhancing Keyphrase Extraction from Academic Articles with their Reference Information'.

Overview

Enhancing Keyphrase Extraction from Academic Articles with their Reference Information

Overview

Dataset and code for paper "Enhancing Keyphrase Extraction from Academic Articles with their Reference Information".

The research content of this project is to analyze the impact of the introduction of reference title in scientific literature on the effect of keyword extraction. This project uses three datasets: SemEval-2010, PubMed and LIS-2000, which are located in the dataset folder. At the same time, we use two unsupervised methods: TF-IDF and TextRank, and three supervised learning methods: NaiveBayes, CRF and BiLSTM-CRF. The first four are traditional keywords extraction methods, located in the folder ML, and the last one is deep learning method, located in the folder DL.

Directory structure

Keyphrase_Extraction:                 Root directory
│  dl.bat:                            Batch commands to run deep learning model
│  ml.bat:                            Batch commands to run traditional models
│ 
├─Dataset:                            Store experimental datasets
│      SemEval-2010:                  Contains 244 scientific papers 
│      PubMed:                        Contains 1316 scientific papers
│      LIS-2000:                      Contains 2000 scientific papers
│ 
├─DL:                                 Store the source code of the deep learning model
│  │  build_path.py:                  Create file paths for saving preprocessed data
│  │  crf.py:                         Source code of CRF algorithm implementation(Use pytorch framework)
│  │  main.py:                        The main function of running the program
│  │  model.py:                       Source code of BiLSTM-CRF model
│  │  preprocess.py:                  Source code of preprocessing function
│  │  textrank.py:                    Source code of TextRank algorithm implementation.
│  │  tf_idf.py:                      Source code of TF-IDF algorithm implementation.
│  │  utils.py:                       Some auxiliary functions
│  ├─models:                          Parameter configuration of deep learning models
│  └─datas
│        tags:                        Label settings for sequence labeling
│ 
└─ML:                                 Store the source code of the traditional models
    │  build_path.py:                 Create file paths for saving preprocessed data
    │  configs.py:                    Path configuration file
    │  crf.py:                        Source code of CRF algorithm implementation(Use CRF++ Toolkit)
    │  evaluate.py:                   Source code for result evaluation
    │  naivebayes.py:                 Source code of naivebayes algorithm implementation(Use KEA-3.0 Toolkit)
    │  preprocessing.py:              Source code of preprocessing function
    │  textrank.py:                   Source code of TextRank algorithm implementation
    │  tf_idf.py:                     Source code of TF-IDF algorithm implementation
    │  utils.py:                      Some auxiliary functions
    ├─CRF++:                          CRF++ Toolkit
    └─KEA-3.0:                        KEA-3.0 Toolkit

Dataset Description

The dataset includes the following three json files:

  • SemEval-2010: SemEval-2010 Task 5 dataset, it contains 244 scientific papers and can be visited at: https://semeval2.fbk.eu/semeval2.php?location=data.
  • PubMed: Contains 1316 scientific papers from PubMed (https://github.com/boudinfl/ake-datasets/tree/master/datasets/PubMed).
  • LIS-2000: Contains 2000 scientific papers from journals in Library and Information Science (LIS).

    Each line of the json file includes:

  • title (T): The title of the paper.
  • abstract (A): The abstract of the paper.
  • introduction (I): The introduction of the paper.
  • conclusion (C): The conclusion of the paper.
  • body1 (Fp): The first sentence of each paragraph.
  • body2 (Lp): The last sentence of each paragraph.
  • full_text (F): The full text of the paper.
  • references (R): references list and only the title of each reference is provided.
  • keywords (K): the keywords of the paper and these keywords were annotated manually.

    Quick Start

    In order to facilitate the reproduction of the experimental results, the project uses bat batch command to run the program uniformly (only in Windows Environment). The dl.bat file is the batch command to run the deep learning model, and the ml.bat file is the batch command to run the traditional algorithm.

    How does it work?

    In the Windows environment, use the key combination Win + R and enter cmd to open the DOS command box, and switch to the project's root directory (Keyphrase_Extraction). Then input dl.bat, that is, run deep learning model to get the result of keyword extraction; Enter ml.bat to run traditional algorithm to get keywords Extract the results.

    Experimental results

    The following figures show that the influence of reference information on keyphrase extraction results of TF*IDF, TextRank, NB, CRF and BiLSTM-CRF.

    Table 1: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of SemEval-2010 Table1

    Table 2: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of PubMed Table2

    Table 3: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of LIS-2000 Table3

    Note: The yellow, green and blue bold fonts in the table represent the largest of the P, R and F1 value obtained from different corpora using the same model, respectively.

    Dependency packages

    Before running this project, check that the following Python packages are included in your runtime environment.

  • pytorch 1.7.1
  • nltk 3.5
  • numpy 1.19.2
  • pandas 1.1.3
  • tqdm 4.50.2

    Citation

    Please cite the following paper if you use this codes and dataset in your work.

    Chengzhi Zhang, Lei Zhao, Mengyuan Zhao, Yingyi Zhang. Enhancing Keyphrase Extraction from Academic Articles with their Reference Information. Scientometrics, 2021. (in press) [arXiv]

  • Owner
    Professor at iSchool of Nanjing University of Science and Technology
    Pytorch implementation of the paper: "A Unified Framework for Separating Superimposed Images", in CVPR 2020.

    Deep Adversarial Decomposition PDF | Supp | 1min-DemoVideo Pytorch implementation of the paper: "Deep Adversarial Decomposition: A Unified Framework f

    Zhengxia Zou 72 Dec 18, 2022
    Deduplicating Training Data Makes Language Models Better

    Deduplicating Training Data Makes Language Models Better This repository contains code to deduplicate language model datasets as descrbed in the paper

    Google Research 431 Dec 27, 2022
    Official code repository for A Simple Long-Tailed Rocognition Baseline via Vision-Language Model.

    This is the official code repository for A Simple Long-Tailed Rocognition Baseline via Vision-Language Model.

    peng gao 42 Nov 26, 2022
    A pytorch-based deep learning framework for multi-modal 2D/3D medical image segmentation

    A 3D multi-modal medical image segmentation library in PyTorch We strongly believe in open and reproducible deep learning research. Our goal is to imp

    Adaloglou Nikolas 1.2k Dec 27, 2022
    BanditPAM: Almost Linear-Time k-Medoids Clustering

    BanditPAM: Almost Linear-Time k-Medoids Clustering This repo contains a high-performance implementation of BanditPAM from BanditPAM: Almost Linear-Tim

    254 Dec 12, 2022
    RL-driven agent playing tic-tac-toe on starknet against challengers.

    tictactoe-on-starknet RL-driven agent playing tic-tac-toe on starknet against challengers. GUI reference: https://pythonguides.com/create-a-game-using

    21 Jul 30, 2022
    Official code for "Towards An End-to-End Framework for Flow-Guided Video Inpainting" (CVPR2022)

    E2FGVI (CVPR 2022) English | 简体中文 This repository contains the official implementation of the following paper: Towards An End-to-End Framework for Flo

    Media Computing Group @ Nankai University 537 Jan 07, 2023
    Embeddinghub is a database built for machine learning embeddings.

    Embeddinghub is a database built for machine learning embeddings.

    Featureform 1.2k Jan 01, 2023
    CC-GENERATOR - A python script for generating CC

    CC-GENERATOR A python script for generating CC NOTE: This tool is for Educationa

    Lêkzï 6 Oct 14, 2022
    Weakly Supervised Posture Mining with Reverse Cross-entropy for Fine-grained Classification

    Fine-grainedImageClassification Weakly Supervised Posture Mining with Reverse Cross-entropy for Fine-grained Classification We trained model here: lin

    ZhenchaoTang 14 Oct 21, 2022
    Recall Loss for Semantic Segmentation (This repo implements the paper: Recall Loss for Semantic Segmentation)

    Recall Loss for Semantic Segmentation (This repo implements the paper: Recall Loss for Semantic Segmentation) Download Synthia dataset The model uses

    32 Sep 21, 2022
    ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation

    ClevrTex This repository contains dataset generation code for ClevrTex benchmark from paper: ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi

    Laurynas Karazija 26 Dec 21, 2022
    nextPARS, a novel Illumina-based implementation of in-vitro parallel probing of RNA structures.

    nextPARS, a novel Illumina-based implementation of in-vitro parallel probing of RNA structures. Here you will find the scripts necessary to produce th

    Jesse Willis 0 Jan 20, 2022
    Code for EMNLP 2021 paper Contrastive Out-of-Distribution Detection for Pretrained Transformers.

    Contra-OOD Code for EMNLP 2021 paper Contrastive Out-of-Distribution Detection for Pretrained Transformers. Requirements PyTorch Transformers datasets

    Wenxuan Zhou 27 Oct 28, 2022
    Hide screen when boss is approaching.

    BossSensor Hide your screen when your boss is approaching. Demo The boss stands up. He is approaching. When he is approaching, the program fetches fac

    Hiroki Nakayama 6.2k Jan 07, 2023
    Voice Conversion by CycleGAN (语音克隆/语音转换):CycleGAN-VC3

    CycleGAN-VC3-PyTorch 中文说明 | English This code is a PyTorch implementation for paper: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectr

    Kun Ma 110 Dec 24, 2022
    Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

    Focal Transformer This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transf

    Microsoft 486 Dec 20, 2022
    🕹️ Official Implementation of Conditional Motion In-betweening (CMIB) 🏃

    Conditional Motion In-Betweening (CMIB) Official implementation of paper: Conditional Motion In-betweeening. Paper(arXiv) | Project Page | YouTube in-

    Jihoon Kim 81 Dec 22, 2022
    FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics

    FusionNet_Pytorch FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics Requirements Pytorch 0.1.11 Pyt

    Choi Gunho 102 Dec 13, 2022
    Inverse Optimal Control Adapted to the Noise Characteristics of the Human Sensorimotor System

    Inverse Optimal Control Adapted to the Noise Characteristics of the Human Sensorimotor System This repository contains code for the paper Schultheis,

    2 Oct 28, 2022