A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Overview

Crosslingual Coreference

Coreference is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore assumes that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, four models are available: "spanbert", "info_xlm", "xlm_roberta" and "minilm", which scored 83, 77, 74 and 74 F1 on the OntoNotes Release 5.0 English data, respectively.

  • The "minilm" model offers the best quality/speed trade-off for both multi-lingual and English texts.
  • The "info_xlm" model produces the best quality for multi-lingual texts.
  • The AllenNLP "spanbert" model produces the best quality for English texts.
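
The recommendations above can be written down as a small lookup. This is only an illustrative sketch: the scores are the OntoNotes figures quoted in this section, while `ONTONOTES_F1` and `pick_model` are hypothetical names that are not part of the package.

```python
# OntoNotes 5.0 (English) F1 scores as quoted in this section.
ONTONOTES_F1 = {"spanbert": 83, "info_xlm": 77, "xlm_roberta": 74, "minilm": 74}


def pick_model(multilingual: bool, prefer_speed: bool = False) -> str:
    """Hypothetical helper that encodes the three recommendations above."""
    if prefer_speed:
        return "minilm"    # best quality/speed trade-off, any language
    if multilingual:
        return "info_xlm"  # best quality for multi-lingual texts
    return "spanbert"      # best quality for English texts


model_name = pick_model(multilingual=True)
print(model_name, ONTONOTES_F1[model_name])
```

The returned string is meant to be passed as `model_name` when constructing a `Predictor`.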

Chunking/batching to resolve out-of-memory (OOM) errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
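
To give an intuition for the two chunking parameters, here is a minimal sketch of overlapping chunking in plain Python. It assumes that `chunk_size` is a character budget per chunk and that `chunk_overlap` is the number of sentences repeated at the start of the next chunk; the package's actual splitting logic may differ, and `chunk_sentences` is a hypothetical helper, not part of the library.

```python
from typing import List


def chunk_sentences(
    sentences: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Group sentences into chunks of roughly `chunk_size` characters.

    Each new chunk starts with the last `chunk_overlap` sentences of the
    previous chunk, so mentions near a boundary keep some context.
    A single sentence longer than `chunk_size` still forms its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    length = 0
    for sentence in sentences:
        # Close the current chunk once adding this sentence would exceed the budget.
        if current and length + len(sentence) > chunk_size:
            chunks.append(current)
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(current)
    return chunks


example = chunk_sentences(["aaaa", "bbbb", "cccc", "dddd"], chunk_size=8, chunk_overlap=1)
print(example)
```

Under these assumptions, a larger overlap gives the model more shared context across chunk boundaries at the cost of processing some sentences twice.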

Use spaCy pipeline

import spacy

# importing the package registers the "xx_coref" component with spaCy
import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
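
The cluster output above appears to hold inclusive `[start, end]` token indices. The sketch below maps them back to mention strings without running the model. The token list is hand-written and assumed to match spaCy's tokenization of the example text, `cluster_mentions` is a hypothetical helper, and treating the first mention as the representative of a cluster is an assumption made for illustration.

```python
from typing import List

# Hand-tokenized example text (assumed to match the spaCy tokenization).
tokens = (
    "Do not forget about Momofuku Ando ! He created instant noodles in Osaka . "
    "At that location , Nissin was founded . Many students survived by eating "
    "these noodles , but they do n't even know him ."
).split()

# Cluster output copied from the example above.
clusters = [
    [[4, 5], [7, 7], [27, 27], [36, 36]],
    [[12, 12], [15, 16]],
    [[9, 10], [27, 28]],
    [[22, 23], [31, 31]],
]


def cluster_mentions(tokens: List[str], clusters: List[List[List[int]]]) -> List[List[str]]:
    """Turn inclusive [start, end] token spans into mention strings."""
    return [
        [" ".join(tokens[start : end + 1]) for start, end in cluster]
        for cluster in clusters
    ]


mentions = cluster_mentions(tokens, clusters)
heads = [cluster[0] for cluster in mentions]  # first mention of each cluster
print(heads)
# ['Momofuku Ando', 'Osaka', 'instant noodles', 'Many students']
```

With a real spaCy `doc`, the same mention can be read as `doc[start : end + 1].text`.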

Comments
  • Which language model is used for minilm?

    I am using the following code snippet for coreference resolution

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
    

    While checking the below source code,

    "minilm": {
            "url": (
                "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
            ),
            "f1_score_ontonotes": 74,
            "file_extension": ".tar.gz",
        },
    

    it seems that the language model used here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Is this the same one that I can see in https://huggingface.co/models like https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main or any other huggingface model?

    opened by pradeepdev-1995 7
  • Error when using coref as a spaCy pipeline

    Hi all, while trying to run a spacy test

    import spacy
    import crosslingual_coreference
    
    text = """
        Do not forget about Momofuku Ando!
        He created instant noodles in Osaka.
        At that location, Nissin was founded.
        Many students survived by eating these noodles, but they don't even know him."""
    
    # use any model that has internal spacy embeddings
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(
        "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})
    
    doc = nlp(text)
    
    print(doc._.coref_clusters)
    print(doc._.resolved_text)
    

    I encountered the following issue:

    [nltk_data] Downloading package omw-1.4 to
    [nltk_data]     /home/user/nltk_data...
    [nltk_data]   Package omw-1.4 is already up-to-date!
    Traceback (most recent call last):
      File "/home/user/test_coref/test.py", line 12, in <module>
        nlp.add_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
        pipe_component = self.create_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
        resolved, _ = cls._make(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
        filled, _, resolved = cls._fill(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
        getter_result = getter(*args, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
        return SpacyPredictor(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
        super().__init__(language, device, model_name, chunk_size, chunk_overlap)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
        self.set_coref_model()
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
        self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
        load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
        dataset_reader, validation_dataset_reader = _load_dataset_readers(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
        dataset_reader = DatasetReader.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
        kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
        constructed_arg = pop_and_construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
        return construct_arg(class_name, name, popped_params, annotation, default, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
        value_dict[key] = construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
        result = annotation.from_params(params=popped_params, **subextras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
        return constructor_to_call(**kwargs)  # type: ignore
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
        self._matched_indexer = PretrainedTransformerIndexer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
        self._allennlp_tokenizer = PretrainedTransformerTokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
        self.tokenizer = cached_transformers.get_tokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
        tokenizer = transformers.AutoTokenizer.from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
        return cls._from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
        super().__init__(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: EOF while parsing a list at line 1 column 4920583
    

    Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

    (.venv) $ pip freeze
    aiohttp==3.8.1
    aiosignal==1.2.0
    allennlp==2.9.3
    allennlp-models==2.9.3
    async-timeout==4.0.2
    attrs==21.4.0
    base58==2.1.1
    blis==0.7.7
    boto3==1.23.5
    botocore==1.26.5
    cached-path==1.1.2
    cachetools==5.1.0
    catalogue==2.0.7
    certifi==2022.5.18.1
    charset-normalizer==2.0.12
    click==8.0.4
    conllu==4.4.1
    crosslingual-coreference==0.2.4
    cymem==2.0.6
    datasets==2.2.1
    dill==0.3.5.1
    docker-pycreds==0.4.0
    en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
    en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
    fairscale==0.4.6
    filelock==3.6.0
    frozenlist==1.3.0
    fsspec==2022.5.0
    ftfy==6.1.1
    gitdb==4.0.9
    GitPython==3.1.27
    google-api-core==2.8.0
    google-auth==2.6.6
    google-cloud-core==2.3.0
    google-cloud-storage==2.3.0
    google-crc32c==1.3.0
    google-resumable-media==2.3.3
    googleapis-common-protos==1.56.1
    h5py==3.6.0
    huggingface-hub==0.5.1
    idna==3.3
    iniconfig==1.1.1
    Jinja2==3.1.2
    jmespath==1.0.0
    joblib==1.1.0
    jsonnet==0.18.0
    langcodes==3.3.0
    lmdb==1.3.0
    MarkupSafe==2.1.1
    more-itertools==8.13.0
    multidict==6.0.2
    multiprocess==0.70.12.2
    murmurhash==1.0.7
    nltk==3.7
    numpy==1.22.4
    packaging==21.3
    pandas==1.4.2
    pathtools==0.1.2
    pathy==0.6.1
    Pillow==9.1.1
    pluggy==1.0.0
    preshed==3.0.6
    promise==2.3
    protobuf==3.20.1
    psutil==5.9.1
    py==1.11.0
    py-rouge==1.1
    pyarrow==8.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydantic==1.8.2
    pyparsing==3.0.9
    pytest==7.1.2
    python-dateutil==2.8.2
    pytz==2022.1
    PyYAML==6.0
    regex==2022.4.24
    requests==2.27.1
    responses==0.18.0
    rsa==4.8
    s3transfer==0.5.2
    sacremoses==0.0.53
    scikit-learn==1.1.1
    scipy==1.6.1
    sentence-transformers==2.2.0
    sentencepiece==0.1.96
    sentry-sdk==1.5.12
    setproctitle==1.2.3
    shortuuid==1.0.9
    six==1.16.0
    smart-open==5.2.1
    smmap==5.0.0
    spacy==3.2.4
    spacy-alignments==0.8.5
    spacy-legacy==3.0.9
    spacy-loggers==1.0.2
    spacy-sentence-bert==0.1.2
    spacy-transformers==1.1.5
    srsly==2.4.3
    tensorboardX==2.5
    termcolor==1.1.0
    thinc==8.0.16
    threadpoolctl==3.1.0
    tokenizers==0.12.1
    tomli==2.0.1
    torch==1.10.2
    torchaudio==0.10.2
    torchvision==0.11.3
    tqdm==4.64.0
    transformers==4.17.0
    typer==0.4.1
    typing-extensions==4.2.0
    urllib3==1.26.9
    wandb==0.12.16
    wasabi==0.9.1
    wcwidth==0.2.5
    word2number==1.1
    xxhash==3.0.0
    yarl==1.7.2
    

    Do you have any recommendations? Is there an installation step missing?

    Thanks in advance!

    opened by alexander-belikov 4
  • Comparatively high initial prediction time for first predict() hit

    I am using the minilm model with language 'en_core_web_sm'. When timing predictor.predict(text), the first call is always somewhat slower than the following ones. Suppose that, after creating a predictor object, I call predict as follows:

    predictor.predict(text)  # first call
    predictor.predict(text)  # second call
    predictor.predict(text)  # third call

    The first call takes noticeably longer (0.2 s) than the subsequent calls (0.05 s). Could you please help me understand why the initial call is slower?

    opened by nemeer 2
  • Why does this package need to install google cloud auth, storage, api etc?

    Hi,

    After installing the library, I saw that google-api-core-2.10.1, google-auth-2.12.0, google-cloud-core-2.3.2 and google-cloud-storage-1.44.0 had been installed as well. In fact, these packages can be found in the poetry.lock file.

    Is there a reason (I don't get) why this library needs these packages?

    Thanks

    opened by GiacomoCherry 1
  • HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Python 3.8.13, spaCy 3.1.0, en_core_web_sm 3.1.0, crosslingual_coreference 0.2.8

    requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

    opened by jscoder1009 1
  • Retrieving cluster heads without replacing corefs

    I am interested in being able to extract the cluster heads with something like doc._.coref_cluster_heads to get the cluster heads without getting the reconstituted text. It could be a separate function that also acts as input into replace_corefs potentially.

    opened by MikeMikeMikeMike 1
  • [Errno 101] Network is unreachable

    Hello, when I try to run the code below

    predictor = Predictor(
        language="en_core_web_sm", device=1, model_name="info_xlm"
    )
    

    I get the following error:

    ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

    Is this url still valid and what should I use instead?

    opened by ttranslit 1
  • spaCy issues and suggestions

    @martin-kirilov It might be worth looking into including batching + training a model for Spanish/Italian. See this issue from spaCy.

    • batching
    • empty cluster issue (resolved)
    • additional model for pro-drop languages
    bug enhancement 
    opened by davidberenstein1957 0
  • feat: look into ONNX enhanced transformer embeddings

    Creating embeddings takes roughly 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py holds the logic for creating these embeddings. Make sure we can call them in a faster way.

    enhancement 
    opened by davidberenstein1957 3
Releases(0.2.9)
Owner
Pandora Intelligence
Pandora Intelligence is an independent intelligence company, specialized in security risks.