BERTMap: A BERT-Based Ontology Alignment System

About

BERTMap is a BERT-based ontology alignment system that uses the textual knowledge in ontologies to fine-tune BERT and make predictions. It also incorporates sub-word inverted indices for candidate selection, plus (graph-based) extension and (logic-based) repair modules for mapping refinement.
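
To make the candidate-selection idea concrete, here is a minimal sketch of an inverted index from tokens to class identifiers. It is an illustration under stated assumptions, not BERTMap's implementation: labels are split on whitespace instead of being passed through the BERT sub-word tokenizer, candidates are ranked simply by the number of shared tokens, and the names `tokenize`, `build_inverted_index` and `select_candidates` are hypothetical. The `cut` argument mirrors the option of the same name in config.json (a key is kept only if its length exceeds `cut`).

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def tokenize(label: str) -> List[str]:
    # Assumption: whitespace tokens stand in for BERT sub-word tokens.
    return label.lower().split()


def build_inverted_index(labels: Dict[str, List[str]], cut: int = 0) -> Dict[str, Set[str]]:
    """Map each token to the set of class ids whose labels contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for class_id, class_labels in labels.items():
        for label in class_labels:
            for token in tokenize(label):
                if len(token) > cut:  # keep a key only if it is longer than `cut`
                    index[token].add(class_id)
    return index


def select_candidates(query_labels: List[str], index: Dict[str, Set[str]], limit: int) -> List[str]:
    """Return up to `limit` class ids, ranked by the number of shared tokens."""
    hits: Counter = Counter()
    query_tokens = {t for label in query_labels for t in tokenize(label)}
    for token in query_tokens:
        for class_id in index.get(token, ()):
            hits[class_id] += 1
    # Sort by descending hit count, then by class id for a stable order.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [class_id for class_id, _ in ranked[:limit]]


# Toy target ontology: class id -> labels (hypothetical data).
target_labels = {
    "tgt:1": ["heart valve", "cardiac valve"],
    "tgt:2": ["heart muscle"],
    "tgt:3": ["lung"],
}
index = build_inverted_index(target_labels, cut=0)
candidates = select_candidates(["valve of heart"], index, limit=2)
print(candidates)
```

Only the selected candidates would then need to be scored by the fine-tuned classifier, which is what keeps the number of BERT forward passes manageable on large ontologies.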

Essential dependencies

The following packages are necessary but not sufficient for running BERTMap:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch  # pytorch
pip install cython  # the optimized parser of owlready2 relies on Cython
pip install owlready2  # for managing ontologies
pip install tensorboard  # tensorboard logging (optional)
pip install transformers  # huggingface library
pip install datasets  # huggingface datasets

Running BERTMap

IMPORTANT NOTICE: BERTMap relies on class labels for training, but different ontologies use different annotation properties to define aliases (synonyms). Preprocessing is therefore required to add all synonyms to rdfs:label before running BERTMap. The preprocessed ontologies used in our paper, together with their reference mappings, are available in data.zip.
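
For ontologies that are not covered by data.zip, the preprocessing step might look like the sketch below. It is a rough illustration, not the script used for the paper: it assumes an RDF/XML serialization in which synonyms appear as literal child elements of `owl:Class`, and it assumes the synonym property is `oboInOwl:hasExactSynonym`. The function name `copy_synonyms_to_labels` is hypothetical; substitute whichever annotation properties your ontology actually uses.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
# Assumption: synonyms are stored with oboInOwl:hasExactSynonym.
OBO_IN_OWL = "http://www.geneontology.org/formats/oboInOwl#"
SYNONYM_TAGS = (f"{{{OBO_IN_OWL}}}hasExactSynonym",)


def copy_synonyms_to_labels(rdf_xml: str, synonym_tags: Iterable[str] = SYNONYM_TAGS) -> str:
    """Add one rdfs:label per synonym literal found directly under each owl:Class."""
    tags = tuple(synonym_tags)
    label_tag = f"{{{RDFS}}}label"
    root = ET.fromstring(rdf_xml)
    # Materialize the iterator first because the tree is modified in the loop.
    for cls in list(root.iter(f"{{{OWL}}}Class")):
        seen = {el.text for el in cls.findall(label_tag)}
        for tag in tags:
            for syn in cls.findall(tag):
                if syn.text and syn.text not in seen:  # skip empty and duplicate labels
                    ET.SubElement(cls, label_tag).text = syn.text
                    seen.add(syn.text)
    return ET.tostring(root, encoding="unicode")


# Tiny hypothetical ontology fragment used only to exercise the function.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://example.org/onto#Heart">
    <rdfs:label>heart</rdfs:label>
    <oboInOwl:hasExactSynonym>cardiac organ</oboInOwl:hasExactSynonym>
  </owl:Class>
</rdf:RDF>"""

updated = ET.fromstring(copy_synonyms_to_labels(SAMPLE))
labels = sorted(el.text for el in updated.iter(f"{{{RDFS}}}label"))
print(labels)
```

Synonyms expressed in other ways (for example as separate annotation axioms or as references to other resources) are not handled by this sketch and would need an ontology library such as owlready2.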

Clone the repository and run:

# fine-tune BERTMap and evaluate its predictions
python run_bertmap.py -c config.json -m bertmap

# mapping extension (-e specifies which mapping set {src, tgt, combined} to extend)
python extend_bertmap.py -c config.json -e src

# evaluate extended bertmap 
python eval_bertmap.py -c config.json -e src

# repair and evaluate final outputs (-t specifies the best validation threshold)
python repair_bertmap.py -c config.json -e src -t 0.999

# baseline models (edit similarity and pretrained bert embeddings)
python run_bertmap.py -c config.json -m nes
python run_bertmap.py -c config.json -m bertembeds
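
The value passed to `-t` is the threshold that performed best on the validation mappings. As a hedged illustration of what such a selection involves (this is not the repository's code), the sketch below sweeps a few candidate thresholds over scored mappings and keeps the one with the highest F1 against a validation reference. The function names, the input format, the `>=` comparison and the candidate thresholds are all assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

Mapping = Tuple[str, str]


def f1_at(scored: Dict[Mapping, float], reference: Set[Mapping], threshold: float) -> float:
    """F1 of the mappings whose score is at least `threshold`."""
    predicted = {m for m, score in scored.items() if score >= threshold}
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored: Dict[Mapping, float], reference: Set[Mapping],
                   thresholds: Iterable[float]) -> float:
    # Assumption: ties in F1 are broken in favour of the higher threshold.
    return max(thresholds, key=lambda t: (f1_at(scored, reference, t), t))


# Hypothetical scored mappings and validation reference.
scored = {("s:1", "t:1"): 0.9995, ("s:2", "t:2"): 0.995, ("s:3", "t:9"): 0.95}
validation = {("s:1", "t:1"), ("s:2", "t:2")}
chosen = best_threshold(scored, validation, [0.9, 0.99, 0.999])
print(chosen)
```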

Data construction runs only once: after the data has been built for the first time, the script skips this step so that all of the models share the same set of pre-processed data.

The fine-tuning model is implemented with the Hugging Face Trainer, which uses multiple GPUs by default. To restrict it to GPUs with specific indices, run (for example):

# only devices 1 and 2 are visible to the script
CUDA_VISIBLE_DEVICES=1,2 python run_bertmap.py -c config.json -m bertmap 
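
When the script is launched from Python rather than from a shell, the same restriction can be applied through the child process environment. The sketch below is one possible way to do that; the helper names `gpu_env` and `run_bertmap` are hypothetical, and the call is a dry run by default so that nothing is executed unless requested.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional


def gpu_env(devices: Optional[List[int]] = None) -> Dict[str, str]:
    """Copy of the current environment, optionally restricted to some GPUs."""
    env = dict(os.environ)
    if devices is not None:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    return env


def run_bertmap(config: str, mode: str, devices: Optional[List[int]] = None,
                dry_run: bool = True) -> List[str]:
    """Build (and optionally run) the run_bertmap.py command shown above."""
    cmd = [sys.executable, "run_bertmap.py", "-c", config, "-m", mode]
    if not dry_run:
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(cmd, env=gpu_env(devices), check=True)
    return cmd


cmd = run_bertmap("config.json", "bertmap", devices=[1, 2])  # dry run only
env = gpu_env([1, 2])
print(cmd[1:], env["CUDA_VISIBLE_DEVICES"])
```

The variable has to be set before the training process starts, which is why it is passed to the subprocess instead of being changed inside an already running script.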

Configurations

The variables used in config.json for customizing a BERTMap run are explained below.

  • data:
    • task_dir: directory for saving all the output files.
    • src_onto: source ontology name.
    • tgt_onto: target ontology name.
    • task_suffix: any suffix of the task if needed, e.g. the LargeBio track has 'small' and 'whole'.
    • src_onto_file: source ontology file in .owl format.
    • tgt_onto_file: target ontology file in .owl format.
    • properties: list of textual properties used for constructing semantic data, default is class labels: ["label"].
    • cut: threshold length for the keys of the sub-word inverted index; a key is kept only if its length > cut, default is 0.
  • corpora:
    • sample_rate: number of (soft) negative samples for each positive sample generated in corpora (not the ultimate fine-tuning data).
    • src2tgt_mappings_file: reference mapping file, used for evaluation and for the semi-supervised learning setting, in .tsv format with columns: "Entity1", "Entity2" and "Value".
    • ignored_mappings_file: file in .tsv format that stores mappings to be ignored by the evaluator.
    • train_map_ratio: proportion of training mappings to use in the semi-supervised setting, default is 0.2.
    • val_map_ratio: proportion of validation mappings to use in the semi-supervised setting, default is 0.1.
    • test_map_ratio: proportion of test mappings to use in the semi-supervised setting, default is 0.7.
    • io_soft_neg_rate: number of soft negative samples for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • io_hard_neg_rate: number of hard negative samples for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • co_soft_neg_rate: number of soft negative samples for each positive sample generated in the fine-tuning data at the cross-ontology level.
    • depth_threshold: classes with depths larger than this threshold will not be considered in hard negative generation, default is null.
    • depth_strategy: strategy to compute the depths of the classes if any threshold is set, default is max, choices are max and min.
  • bert:
    • pretrained_path: real or huggingface library path for pretrained BERT, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
    • tokenizer_path: real or huggingface library path for BERT tokenizer, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
  • fine-tune:
    • include_ids: include identity synonyms in the positive samples or not.
    • learning: choice of learning setting ss (semi-supervised) or us (unsupervised).
    • warm_up_ratio: proportion of warm-up steps.
    • max_length: maximum length for tokenizer (highly important for large task!).
    • num_epochs: number of training epochs, default is 3.0.
    • batch_size: batch size for fine-tuning BERT.
    • early_stop: whether or not to apply early stopping (patience has been set to 10), default is false.
    • resume_checkpoint: path to previous checkpoint if any, default is null.
  • map:
    • candidate_limits: list of candidate limits used for mapping computation, suggested values are [25, 50, 100, 150, 200].
    • batch_size: batch size used for mapping computation.
    • nbest: number of top results to be considered.
    • string_match: whether or not to use string match before others.
    • strategy: strategy for classifier scoring method, default is mean.
  • eval:
    • automatic: whether or not to automatically evaluate the mappings.
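
To show how these options fit together, the sketch below builds a configuration as a Python dictionary and serializes it to JSON. Treat it as an assumption-laden illustration rather than a copy of the shipped config.json: the nesting follows the list above, values with a documented default use that default, and every other value (paths, ontology names, rates, batch sizes, `max_length`, and so on) is a placeholder chosen only for the example. The check that the three mapping ratios sum to 1 is likewise an assumption based on the documented defaults.

```python
import json
import math

config = {
    "data": {
        "task_dir": "path/to/output_dir",            # placeholder
        "src_onto": "src_name",                      # placeholder
        "tgt_onto": "tgt_name",                      # placeholder
        "task_suffix": "small",                      # example suffix from the list above
        "src_onto_file": "path/to/source.owl",       # placeholder
        "tgt_onto_file": "path/to/target.owl",       # placeholder; verify the key spelling
        "properties": ["label"],                     # documented default
        "cut": 0,                                    # documented default
    },
    "corpora": {
        "sample_rate": 10,                           # placeholder
        "src2tgt_mappings_file": "path/to/refs.tsv",     # placeholder
        "ignored_mappings_file": "path/to/ignored.tsv",  # placeholder
        "train_map_ratio": 0.2,                      # documented default
        "val_map_ratio": 0.1,                        # documented default
        "test_map_ratio": 0.7,                       # documented default
        "io_soft_neg_rate": 1,                       # placeholder
        "io_hard_neg_rate": 1,                       # placeholder
        "co_soft_neg_rate": 1,                       # placeholder
        "depth_threshold": None,                     # documented default (null)
        "depth_strategy": "max",                     # documented default
    },
    "bert": {
        "pretrained_path": "emilyalsentzer/Bio_ClinicalBERT",
        "tokenizer_path": "emilyalsentzer/Bio_ClinicalBERT",
    },
    "fine-tune": {
        "include_ids": False,                        # placeholder
        "learning": "us",                            # "ss" or "us"
        "warm_up_ratio": 0.0,                        # placeholder
        "max_length": 128,                           # placeholder
        "num_epochs": 3.0,                           # documented default
        "batch_size": 32,                            # placeholder
        "early_stop": False,                         # documented default
        "resume_checkpoint": None,                   # documented default (null)
    },
    "map": {
        "candidate_limits": [25, 50, 100, 150, 200], # suggested values
        "batch_size": 32,                            # placeholder
        "nbest": 1,                                  # placeholder
        "string_match": True,                        # placeholder
        "strategy": "mean",                          # documented default
    },
    "eval": {"automatic": True},                     # placeholder
}


def check_config(cfg: dict) -> None:
    """Sanity checks derived from the option descriptions (assumed, not official)."""
    corpora = cfg["corpora"]
    total = corpora["train_map_ratio"] + corpora["val_map_ratio"] + corpora["test_map_ratio"]
    if not math.isclose(total, 1.0):
        raise ValueError(f"mapping ratios sum to {total}, expected 1.0")
    if corpora["depth_strategy"] not in ("max", "min"):
        raise ValueError("depth_strategy must be 'max' or 'min'")
    if cfg["fine-tune"]["learning"] not in ("ss", "us"):
        raise ValueError("learning must be 'ss' or 'us'")


check_config(config)
config_text = json.dumps(config, indent=2)  # Python None becomes JSON null
print(config_text)
```

Writing `config_text` to a file and passing that file with `-c` would be the next step; compare the key names against the config.json in the repository before relying on them.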

Should you need any further customizations, especially for the evaluation part, please set eval: automatic to false and use your own evaluation script.
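
A custom evaluation script could start from something like the sketch below, which computes precision, recall and F1 from mapping files. Several things here are assumptions rather than documented behaviour: the predicted mappings are taken to use the same .tsv columns as the reference file ("Entity1", "Entity2", "Value"), ignored mappings are removed from both the predictions and the reference before scoring, and the function names are hypothetical.

```python
import csv
import io
from typing import Dict, FrozenSet, Set, Tuple

Mapping = Tuple[str, str]


def read_mappings(tsv_text: str) -> Set[Mapping]:
    """Read (Entity1, Entity2) pairs from TSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["Entity1"], row["Entity2"]) for row in reader}


def evaluate(predicted: Set[Mapping], reference: Set[Mapping],
             ignored: FrozenSet[Mapping] = frozenset()) -> Dict[str, float]:
    # Assumption: ignored mappings are dropped from both sides before scoring.
    predicted = predicted - ignored
    reference = reference - ignored
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"P": precision, "R": recall, "F1": f1}


# Hypothetical file contents; in practice read them with open(path).read().
REFERENCE = "Entity1\tEntity2\tValue\na:1\tb:1\t1.0\na:2\tb:2\t1.0\n"
PREDICTED = "Entity1\tEntity2\tValue\na:1\tb:1\t0.99\na:3\tb:3\t0.97\na:4\tb:4\t0.95\n"
IGNORED = "Entity1\tEntity2\tValue\na:4\tb:4\t1.0\n"

scores = evaluate(
    read_mappings(PREDICTED),
    read_mappings(REFERENCE),
    frozenset(read_mappings(IGNORED)),
)
print(scores)
```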

Acknowledgements

The repair module is credited to Ernesto Jiménez Ruiz et al., and the code can be found here.

Owner
KRR
Knowledge Representation and Reasoning Group - University of Oxford