A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

Python Client for MLflow Tracking Server

Python Client for MLflow Python client for MLflow REST API. Features: Unlike MLflow Tracking client all REST API methods are exposed to user. All clas

MTS 35 Dec 23, 2022
NFTs Upload to OpenSea CuseEdition

NFTs-Upload-to-OpenSea-CuseEdition YOUTUBE VIDEO - Soon... Download Python and

Lil Cuse 2 Jan 04, 2022
A python script to download twitter space, only works on running spaces (for now).

A python script to download twitter space, only works on running spaces (for now).

279 Jan 02, 2023
PRNT.sc Image Grabber

PRNTSender PRNT.sc Image Grabber PRNTSender is a script that takes images posted on PRNT.sc and sends them to a Discord webhook, if you want to know h

neox 2 Dec 10, 2021
Modern, privacy-friendly, and detailed web analytics that works without cookies or JS.

Modern, privacy-friendly, and cookie-free web analytics. Getting started » Screenshots • Features • Office Hours Motivation There are a lot of web ana

R. Miles McCain 2.1k Jan 03, 2023
Buscar y descargar canciones de YouTube automáticamente desde la web

🎶 DescargarCanciones 🎶 Buscar y descargar canciones o playlist de Spotify o YouTube automáticamente con todos los metadatos de la canciones en forma

1 Dec 20, 2021
DragDev Maintained Instance Of discord.py

discord.py - DragDev Flavour A modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. The Future of discord.py

DragDev Studios 3 Aug 27, 2022
Cool Discord bot for you

BountyBot Баунти – современный бот созданный с целью сделать ваш сервер лучше! В кратце В нем присутствует множество основных и интересных функций, та

Leestarb Original 1 Nov 22, 2021
Unofficial python api for MicroBT Whatsminer ASICs

whatsminer-api Unofficial python api for MicroBT Whatsminer ASICs Code adapted from a python file found in the Whatsminer Telegram group that is credi

Satoshi Anonymoto 16 Dec 23, 2022
Converts a text file of songs to a playlist on your Spotify account.

Playlist Converter Convert a text file of songs to a playlist on your Spotify account. Create your playlists faster instead of manually searching for

Priya Aggarwal 18 Dec 21, 2022
This is a discord bot, which tells you food recipes.

Discord Chef Bot You have a friend, familiy or other group / channel where the topic is the food? You cannot really decide what's for Saturday lunch?

2 Apr 25, 2022
Represents a Lavalink client used to manage nodes and connections.

lavaplayer Represents a Lavalink client used to manage nodes and connections. setup pip install lavaplayer setup lavalink you need to java 11* LTS or

HazemMeqdad 37 Nov 21, 2022
Python 3 tools for interacting with Notion API

NotionDB Python 3 tools for interacting with Notion API: API client Relational database wrapper Installation pip install notiondb API client from noti

Viet Hoang 14 Nov 24, 2022
Ig-Crackv2 - Crack Instagram Version 2.9

★★ Information ★★ ★★Menu Special Crack Melalui Pengikut Crack Melalui Mengikuti

Risky [ Zero Tow ] 11 Aug 30, 2022
C Y B Ξ R UserBot is a project that simplifies the use of Telegram.

C Y B Ξ R USΞRBOT 🇦🇿 C Y B Ξ R UserBot is a project that simplifies the use of Telegram. All rights reserved. Automatic Setup Android: open Termux p

FVREED 4 Dec 07, 2022
Simple Telegram Bot to extract various types of archives from a telegram file or a direct link

Unzipper Bot A Telegram Bot to Extract Various Types Of Archives Features Extract various types of archives like rar, zip, tar, 7z, tar.xz etc. Passwo

I'm Not A Bot #Left_TG 93 Dec 27, 2022
Bill is a bot capable to Chat with you, search everything on web to you, and send message to yours contacts for you.

Bill Bot The inteligent Bot Bill is a intelligent bot, it can chat, search and send messages to you. Chat with You Send messages on WhatsApp for you S

João Assalim 3 Sep 12, 2021
A python package to fetch results of various national examinations done in Tanzania.

Necta-API Get a formated data of examination results scrapped from necta results website. Note this is not an official NECTA API and is still in devel

vincent laizer 16 Dec 23, 2022
This is an Advanced Calculator maybe with Discord Buttons in python.

Welcome! This is an Advanced Calculator maybe with Discord Buttons in python. This was the first version of the calculator, made for my discord bot, P

Polsulpicien 18 Dec 24, 2022
Select random winners for a Twitter giveaway

twitter_picker Select random winners for a Twitter giveaway Once the Twitter giveaway (or airdrop) is closed, assign a number to each participant. The

Michael Rawner 1 Dec 11, 2021