Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution

Overview

Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution

Abstract

Within the Latin (and ancient Greek) production, it is well known that peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on syllabic quantity, i.e., on on the length of the involved syllables (which can be long or short), and there is much evidence suggesting that certain authors held a preference for certain rhythmic schemes over others.
In this project, we investigate the possibility to employ syllabic quantity as a base to derive rhythmic features for the task of computational Authorship Attribution of Latin prose texts. We test the impact of these features on the attribution task when combined with other topic-agnostic features, employing three datasets and two different learning algorithms.

Syllabic Quantity for Authorship Attribution

Authorship Attribution (AA) is a subtask of the field of Authorship Analysis, which aims to infer various characteristics of the writer of a document, its identity included. In particular, given a set of candidate authors A1... Am and a document d, the goal of AA is to find the most probable author for the document d among the set of candidates; AA is thus a single-label multi-class classification problem, where the classes are the authors in the set.
In this project, we investigate the possibility to employ features extracted from the quantity of the syllables in a document as discriminative features for AA on Latin prose texts. Syllables are sound units a single word can be divided into; in particular, a syllable can be thought as an oscillation of sound in the word pronunciation, and is characterized by its quantity (long or short), which is the amount of time required to pronounce it. It is well known that classical Latin (and Greek) poetry followed metric patterns based on sequences of short and long syllables. In particular, syllables were combined in what is called a "foot", and in turn a series of "feet" composed the metre of a verse. Yet, similar metric schemes were followed also in many prose compositions, in order to give a certain cadence to the discourse and focus the attention on specific parts. In particular, the end of sentences and periods was deemed as especially important in this sense, and known as clausola. During the Middle Ages, Latin prosody underwent a gradual but profound change. The concept of syllabic quantity lost its relevance as a language discriminator, in favour of the accent, or stress. However, Latin accentuation rules are largely dependent on syllabic quantity, and medieval writers retained the classical importance of the clausola, which became based on stresses and known as cursus. An author's practice of certain rhythmic patterns might thus play an important role in the identification of that author's style.

Datasets

In this project, we employ 3 different datasets:

  • LatinitasAntiqua. The texts can be automatically downloaded with the script in the corresponding code file (src/dataset_prep/LatinitasAntiqua_prep.py). They come from the Corpus Corporum repository, developed by the University of Zurich, and in particular its sub-section called Latinitas antiqua, which contains various Latin works from the Perseus Digital library; in total, the corpus is composed of 25 Latin authors and 90 prose texts, spanning through the Classical, Imperial and Early-Medieval periods, and a variety of genres (mostly epistolary, historical and rhetoric).
  • KabalaCorpusA. The texts can be downloaded from the followig [link](https://www.jakubkabala.com/gallus-monk/). In particular, we use Corpus A, which consists of 39 texts by 22 authors from the 11-12th century.
  • MedLatin. The texts can be downloaded from the following link: . Originally, the texts were divided into two datasets, but we combine them together. Note that we exclude the texts from the collection of Petrus de Boateriis, since it is a miscellanea of authors. We delete the quotations from other authors and the insertions in languages other than Latin, marked in the texts.
The documents are automatically pre-processed in order to clean them from external information and noise. In particular, headings, editors' notes and other meta-information are deleted where present. Symbols (such as asterisks or parentheses) and Arabic numbers are deleted as well. Punctuation marks are normalized: every occurrence of question and exclamation points, semicolons, colons and suspension points are exchanged with a single point, while commas are deleted. The text is finally lower-cased and normalized: the character v is exchanged with the character u and the character j with the character i, and every stressed vowels is exchanged with the corresponding non-stressed vowel. As a final step, each text is divided into sentences, where a sentence is made of at least 5 distinct words (shorter sentences are attached to the next sentence in the sequence, or the previous one in case it is the last one in the document). This allows to create the fragments that ultimately form the training, validation and and test sets for the algorithms. In particular, each fragment is made of 10 consecutive, non-overlapping sentences.

Experiments

In order to transform the Latin texts into the corresponding syllabic quantity (SQ) encoding, we employ the prosody library available on the [Classical Language ToolKit](http://cltk.org/).
We also experiment with the four Distortion Views presented by [Stamatatos](https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/asi.23968?casa_token=oK9_O2SOpa8AAAAA%3ArLsIRzk4IhphR7czaG6BZwLmhh9mk4okCj--kXOJolp1T70XzOXwOw-4vAOP8aLKh-iOTar1mq8nN3B7), which, given a list Fw of function words, are:

  • Distorted View – Multiple Asterisks (DVMA): every word not included in Fw is masked by replacing each of its characters with an asterisk.
  • Distorted View – Single Asterisk (DVSA): every word not included in Fw is masked by replacing it with a single asterisk.
  • Distorted View – Exterior Characters (DVEX): every word not included in Fw is masked by replacing each of its characters with an asterisk, except the first and last one.
  • Distorted View – Last 2 (DVL2): every word not included in Fw is masked by replacing each of its characters with an asterisk, except the last two characters.
BaseFeatures (BFs) and it's made of: function words, word lengths, sentence lengths.
We experiment with two different learning methods: Support Vector Machine and Neural Network. All the experiments are conducted on the same train-validation-test split.
For the former, we compute the TfIdf of the character n-grams in various ranges, extracted from the various encodings of the text, which we concatenate to BaseFeatures, and feed the resulting features matrix to a LinearSVC implemented in the [scikit-learn package](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).
For the latter, we compute various parallel, identical branches, each one processing a single encoding or the Bfs matrix, finally combining the different outputs into a single decision layer. The network is implimented with the [PyTorch package](https://pytorch.org/). Each branch outputs a matrix of probabilities, which are stacked together, and an average-pooling operation is applied in order to obtain the average value of the decisions of the different branches. The final decision is obtained through a final dense layer applying a softmax (for training) or argmax (for testing) operation over the classes probabilities. The training of the network is conducted with the traditional backpropagation method; we employ cross-entropy as the loss function and the Adam optimizer.
We employ the macro-F1 and micro-F1 as measures in order to assess the performance of the methods. For each method employing SQ-based features, we compute the statistical significance against its baseline (the same method without SQ-based features); to this aim, we employ the McNemar's paired non-parametric statistical hypothesis test, taking $0.05$ as significance value.

NN architecture

Code

The code is organized as follows int the src directory:

  • NN_models: the directory contains the code to build the Neural Networks tried in the project, one file for each architecture. The one finally used in the project is in the file NN_cnn_deep_ensemble.py.
  • dataset_prep: the directory contains the code to preprocess the various dataset employed in the project. The file NN_dataloader.py prepares the data to be processed for the Neural Network.
  • general: the directory contains: a) helpers.py, with various functions useful for the current project; b) significance.py, with the code for the significance test; c) utils.py, with more comme useful functions; d) visualization.py, with functions for drawing graphics and similar.
  • NN_classification.py: it performs the Neural Networks experiments.
  • SVM_classification.py: it performs the Support Vector Machine experiments.
  • feature_extractor.py: it extract the features for the SVM experiments.
  • main.py

A powerful framework for decentralized federated learning with user-defined communication topology

Scatterbrained Decentralized Federated Learning Scatterbrained makes it easy to build federated learning systems. In addition to traditional federated

Johns Hopkins Applied Physics Laboratory 7 Sep 26, 2022
this is a lite easy to use virtual keyboard project for anyone to use

virtual_Keyboard this is a lite easy to use virtual keyboard project for anyone to use motivation I made this for this year's recruitment for RobEn AA

Mohamed Emad 3 Oct 23, 2021
Repo for CVPR2021 paper "QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information"

QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information by Masato Tamura, Hiroki Ohashi, and Tomoaki Yosh

105 Dec 23, 2022
A transformer model to predict pathogenic mutations

MutFormer MutFormer is an application of the BERT (Bidirectional Encoder Representations from Transformers) NLP (Natural Language Processing) model wi

Wang Genomics Lab 2 Nov 29, 2022
Pretrained models for Jax/Flax: StyleGAN2, GPT2, VGG, ResNet.

Pretrained models for Jax/Flax: StyleGAN2, GPT2, VGG, ResNet.

Matthias Wright 169 Dec 26, 2022
Official Code for "Non-deep Networks"

Non-deep Networks arXiv:2110.07641 Ankit Goyal, Alexey Bochkovskiy, Jia Deng, Vladlen Koltun Overview: Depth is the hallmark of DNNs. But more depth m

Ankit Goyal 567 Dec 12, 2022
OpenMMLab Pose Estimation Toolbox and Benchmark.

Introduction English | 简体中文 MMPose is an open-source toolbox for pose estimation based on PyTorch. It is a part of the OpenMMLab project. The master b

OpenMMLab 2.8k Dec 31, 2022
Official pytorch implementation of the paper: "SinGAN: Learning a Generative Model from a Single Natural Image"

SinGAN Project | Arxiv | CVF | Supplementary materials | Talk (ICCV`19) Official pytorch implementation of the paper: "SinGAN: Learning a Generative M

Tamar Rott Shaham 3.2k Dec 25, 2022
A Benchmark For Measuring Systematic Generalization of Multi-Hierarchical Reasoning

Orchard Dataset This repository contains the code used for generating the Orchard Dataset, as seen in the Multi-Hierarchical Reasoning in Sequences: S

Bill Pung 1 Jun 05, 2022
Markov Attention Models

Introduction This repo contains code for reproducing the results in the paper Graphical Models with Attention for Context-Specific Independence and an

Vicarious 0 Dec 09, 2021
Build a medical knowledge graph based on Unified Language Medical System (UMLS)

UMLS-Graph Build a medical knowledge graph based on Unified Language Medical System (UMLS) Requisite Install MySQL Server 5.6 and import UMLS data int

Donghua Chen 6 Dec 25, 2022
Differentiable Optimizers with Perturbations in Pytorch

Differentiable Optimizers with Perturbations in PyTorch This contains a PyTorch implementation of Differentiable Optimizers with Perturbations in Tens

Jake Tuero 54 Jun 22, 2022
The implementation of ICASSP 2020 paper "Pixel-level self-paced learning for super-resolution"

Pixel-level Self-Paced Learning for Super-Resolution This is an official implementaion of the paper Pixel-level Self-Paced Learning for Super-Resoluti

Elon Lin 41 Dec 15, 2022
UniLM AI - Large-scale Self-supervised Pre-training across Tasks, Languages, and Modalities

Pre-trained (foundation) models across tasks (understanding, generation and translation), languages (100+ languages), and modalities (language, image, audio, vision + language, audio + language, etc.

Microsoft 7.6k Jan 01, 2023
SymmetryNet: Learning to Predict Reflectional and Rotational Symmetries of 3D Shapes from Single-View RGB-D Images

SymmetryNet SymmetryNet: Learning to Predict Reflectional and Rotational Symmetries of 3D Shapes from Single-View RGB-D Images ACM Transactions on Gra

26 Dec 05, 2022
Official PyTorch Implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning (NeurIPS 2021 Spotlight)

[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning [Paper] This is Official PyTorch implementatio

42 Nov 01, 2022
[ICRA 2022] An opensource framework for cooperative detection. Official implementation for OPV2V.

OpenCOOD OpenCOOD is an Open COOperative Detection framework for autonomous driving. It is also the official implementation of the ICRA 2022 paper OPV

Runsheng Xu 322 Dec 23, 2022
A Pytorch Implementation of ClariNet

ClariNet A Pytorch Implementation of ClariNet (Mel Spectrogram -- Waveform) Requirements PyTorch 0.4.1 & python 3.6 & Librosa Examples Step 1. Downlo

Sungwon Kim 286 Sep 15, 2022
a minimal terminal with python 😎😉

Meterm a terminal with python 😎 How to use Clone Project: $ git clone https://github.com/motahharm/meterm.git Run: in Terminal: meterm.exe Or pip ins

Motahhar.Mokfi 5 Jan 28, 2022
Code for "Universal inference meets random projections: a scalable test for log-concavity"

How to use this repository This repository contains code to replicate the results of "Universal inference meets random projections: a scalable test fo

Robin Dunn 0 Nov 21, 2021