LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms

Last update: Aug 06, 2022

Overview

Based on the work by Smith et al. (2021)

Querying both structured and unstructured data via a single common query interface, such as SQL or natural language, has been a long-standing research goal. Moreover, as methods for extracting information from unstructured data become ever more powerful, the desire to integrate the output of such extraction processes with "clean", structured data grows. We are convinced that, for successful integration into databases, such extracted information in the form of "triples" needs to be 1) of high quality and 2) general enough to link up with varying forms of structured data. It is the combination of these two aspects, which have so far usually been treated in isolation, where our approach breaks new ground.
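To make the integration goal concrete, here is a minimal sketch of extracted triples being queried together with a structured table through plain SQL. It is not part of LILLIE: the table names (compounds, triples), the column names, and all rows are hypothetical placeholders, and the join on a lower-cased subject string is only the simplest possible linking strategy.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one structured table and one triple table.

    All table names, columns, and rows are hypothetical illustration data.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE compounds (name TEXT PRIMARY KEY, category TEXT);
        CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT);
        """
    )
    # "Clean" structured data.
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?)",
        [("compound_a", "inhibitor"), ("compound_b", "agonist")],
    )
    # Triples as they might come out of an extraction step.
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [
            ("Compound_A", "reduces", "inflammation"),
            ("compound_c", "binds to", "receptor_x"),
        ],
    )
    return conn


def facts_for_category(conn: sqlite3.Connection, category: str) -> list:
    """Return extracted facts whose subject matches a structured row of the given category."""
    rows = conn.execute(
        """
        SELECT c.name, t.predicate, t.object
        FROM compounds AS c
        JOIN triples AS t ON lower(t.subject) = lower(c.name)
        WHERE c.category = ?
        ORDER BY c.name
        """,
        (category,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(facts_for_category(build_demo_db(), "inhibitor"))
```

The join only succeeds when a triple's subject is general enough to match an entry in the structured table, which is why the generality of the extracted triples matters alongside their quality.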

The cornerstone of our work is a novel, generic method for extracting open information triples from unstructured text, using a combination of linguistics and learning-based extraction methods, thus uniquely balancing both precision and recall. Our system, called LILLIE (LInked Linguistics and Learning-Based Information Extractor), uses dependency tree modification rules to refine triples from a high-recall learning-based engine, and combines them with syntactic triples from a high-precision engine to increase effectiveness. In addition, our system features several augmentations, which modify the generality and the degree of granularity of the output triples. Even though our focus is on addressing both quality and generality simultaneously, our new method substantially outperforms current state-of-the-art systems on the two widely used CaRB and Re-OIE16 benchmark sets for information extraction.
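The combination step can be pictured with a small sketch. This is not LILLIE's actual merging logic, which the text above does not specify: it assumes a triple is a (subject, predicate, object) tuple and simply takes a deduplicating union of the two sets, comparing triples after lower-casing and whitespace normalization. The names Triple, normalize, and combine are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Triple(NamedTuple):
    """A (subject, predicate, object) triple; hypothetical representation."""
    subject: str
    predicate: str
    obj: str


def normalize(triple: Triple) -> Triple:
    """Lower-case each field and collapse runs of whitespace for comparison."""
    return Triple(*(" ".join(part.lower().split()) for part in triple))


def combine(refined: Iterable[Triple], syntactic: Iterable[Triple]) -> List[Triple]:
    """Union two triple collections, dropping duplicates after normalization.

    Triples from the high-precision (syntactic) source are visited first,
    so their surface form is the one kept when both sources agree.
    """
    seen = set()
    merged: List[Triple] = []
    for triple in list(syntactic) + list(refined):
        key = normalize(triple)
        if key not in seen:
            seen.add(key)
            merged.append(triple)
    return merged


if __name__ == "__main__":
    refined = [
        Triple("The  system", "extracts", "triples"),
        Triple("the system", "uses", "dependency trees"),
    ]
    syntactic = [Triple("the system", "extracts", "triples")]
    for triple in combine(refined, syntactic):
        print(triple)
```

Visiting the high-precision triples first is a design choice made for this sketch only; a real system would likely weigh or refine overlapping triples rather than keep whichever arrives first.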

Installation

Requires Python 3.6.9.

1. Install the Python dependencies: pip install -r requirements.txt
2. Download the spaCy model: python3 -m spacy download en_core_web_md
3. Clone ClausIE to ./learning_based/pyclausie (https://github.com/AnthonyMRios/pyclausie)
4. Install ClausIE: cd ./learning_based/pyclausie && python3 setup.py install
5. Clone OpenIE5 to ./learning_based/OpenIE-standalone (https://github.com/dair-iitd/OpenIE-standalone)
6. Run OpenIE5: cd ./learning_based/OpenIE-standalone && java -Xmx16g -jar openie-assembly-5.0-SNAPSHOT.jar --httpPort 9000
7. Download Stanford CoreNLP Server 3.9.2 to ./rule_based/parser (https://stanfordnlp.github.io/CoreNLP/history.html)
8. Run the parser: java -mx6g -cp "./rule_based/parser/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 10000 -timeout 30000
9. Run the learning-based extractor: python3 ./learning_based/paralleloie.py -i data/pubmedabstracts.json
10. Run the rule-based extractor-refiner: python3 ./rule_based/extract_refine.py -i extracted_triples_learning.csv
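
Before running the two extractors, it can help to confirm that both background servers are up. The following is a small optional sketch, not part of the repository: it assumes the OpenIE5 and CoreNLP servers listen on localhost at the ports given in the commands above (9000 and 10000), and the helper name port_is_open is hypothetical. It only checks that a TCP connection can be opened, not that the server answers requests correctly.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the OpenIE5 and CoreNLP commands above;
    # "localhost" is an assumption about where the servers run.
    servers = {"OpenIE5": 9000, "Stanford CoreNLP": 10000}
    for name, port in servers.items():
        status = "reachable" if port_is_open("localhost", port) else "NOT reachable"
        print(f"{name} on port {port}: {status}")
```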
