Overview

catbridge_tools

Tools for working with MARC data in Catalogue Bridge.

Borrows heavily from PyMarc (https://pypi.org/project/pymarc/).

Requirements

Requires the regex module from https://bitbucket.org/mrabarnett/mrab-regex. The built-in re module is not sufficient.

Also requires py2exe.

Installation

From GitHub:

git clone https://github.com/victoriamorris/catbridge_tools
cd catbridge_tools

To install as a Python package:

python setup.py install

To create stand-alone executable (.exe) files for individual scripts:

python setup.py py2exe 

Executable files will be created in the folder \dist, and should be copied to an executable path.

Both of the above commands can be carried out by running the shell script:

compile_catbridge_tools.sh

Scripts

The scripts listed below can be run from anywhere, once the package is installed and the .exe files have been copied to an executable path.

Correspondence with original Catalogue Bridge tools

Original Catalogue Bridge tool    New tool    Original syntax    Corresponding new syntax
cn-find                           cn_find     CN-FIND            cn_find -i <input_file> -o <output_file> -c <config_file>
cn-tidy                           cn_find     CN-FIND            cn_find -i <input_file> -o <output_file> -c <config_file> --tidy

Features common to all scripts

File formats

Unless otherwise specified, MARC files are in MARC 21 format, with .lex file extensions. Unless otherwise specified, text files are UTF-8-encoded, with .txt, .csv or .tsv file extensions. Config files are also text files, but may have the file extension .cfg for convenience.

Help

For any script, use the option --help to display help text.

Logs and debugging

Logs will be written to catbridge.log within the working directory. This is a UTF-8-encoded text file and can be read in any text editor. The default logging level is INFO; if option --debug is set, the logging level is changed to DEBUG. See https://docs.python.org/3/library/logging.html#levels for information about logging levels.
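
For reference, a minimal sketch of the logging set-up this describes (the format string and the way --debug is detected are assumptions, not taken from the actual scripts):

    import logging
    import sys

    # Log to catbridge.log in the working directory, UTF-8 encoded (Python 3.9+),
    # at INFO level by default or DEBUG when --debug is passed.
    debug = '--debug' in sys.argv
    logging.basicConfig(
        filename='catbridge.log',
        encoding='utf-8',
        level=logging.DEBUG if debug else logging.INFO,
        format='%(asctime)s %(levelname)s %(message)s',  # assumed format
    )
    logging.info('script started')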

cn_find

cn_find is a utility which extracts control numbers from specified fields and subfields within a file of MARC records.

The fields and subfields to be extracted are specified in a config file.

Usage: cn_find -i <input_file> -o <output_file> -c <config_file> [options]

Options:
    --conv  Convert 10-digit ISBNs to 13-digit form where possible
    --rid   Include record ID as the first column of the output file
    --tidy  Sort and de-duplicate list

    --debug	Debug mode.
    --help	Show help message and exit.

Files

<input_file> is the name of the input file, which must be a file of MARC 21 records.

<output_file> is the name of the file to which the control numbers will be written. This should be a text file.

<config_file> is the name of the file containing the configuration directives.
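
For orientation, the core extraction step can be sketched with pymarc (which catbridge_tools borrows from). The file names, field/subfield pair and pattern below are illustrative assumptions, not the script's actual implementation:

    import regex                   # https://bitbucket.org/mrabarnett/mrab-regex
    from pymarc import MARCReader  # https://pypi.org/project/pymarc/

    # Illustrative values: search 035 $a for OCLC numbers, as in the example
    # config later in this README.
    FIELD, SUBFIELD = '035', 'a'
    PATTERN = regex.compile(r'\(OCoLC\)[0-9]+\b')

    with open('records.lex', 'rb') as marc_file, \
            open('control_numbers.txt', 'w', encoding='utf-8') as output:
        for record in MARCReader(marc_file):
            if record is None:     # skip records pymarc could not parse
                continue
            for field in record.get_fields(FIELD):
                for value in field.get_subfields(SUBFIELD):
                    for match in PATTERN.finditer(value):
                        output.write(match.group(0) + '\n')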

The config file

The format of the configuration file is as follows, with one entry per line:

FIELD TAG $ subfield character [tab] control number specification

Each line must match the regular expression

^([0-9A-Z]{3})\s*\$?\s*([a-z0-9]?)\s*\t(.*?)\s*$

The field tag is specified using three numbers or UPPERCASE letters.

The subfield code is specified using a single number or lowercase letter. If '$' appears without a following subfield character, all subfields will be searched for control numbers.

The control number specification tells the script what kind of control number to search for within the subfield. This can either take a value from a pre-defined list, or a regular expression can be used to search for control numbers with any other structure. Regular expressions are case-sensitive.

Control number specification    Description / Regular expression

ISBN      Any structurally plausible ISBN*
          \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b|\b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b
ISBN10    Any structurally plausible 10-digit ISBN*
          \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b
ISBN13    Any structurally plausible 13-digit ISBN*
          \b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b
ISSN      8 digits with a hyphen in the middle, where the last digit may be an X
          \b[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b
BL001     9 digits
          \b[0-9]{9}\b
BNB       See https://www.bl.uk/collection-metadata/metadata-services/structure-of-the-bnb-number
          \bGB([0-9]{7}|[A-Z][0-9][A-Z0-9][0-9]{4})\b
LCCN      See https://www.loc.gov/marc/bibliographic/bd010.html
          \b[a-z][a-z ][a-z ]?[0-9]{2}[0-9]{6} ?\b
OCLC      "(OCoLC)" followed by digits
          \(OCoLC\)[0-9]+\b
ISNI      16 digits separated into groups of 4 with spaces or hyphens
          \b[0]{4}[ -]?[0-9]{4}[ -]?[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b
FAST      "fst" followed by digits
          \bfst[0-9]{8}\b

*Note: The ISBN check digit is not validated.
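
As an illustration of how these specifications behave, the ISSN pattern from the table can be applied directly to a subfield string (the sample text below is made up):

    import regex

    # ISSN specification from the table above
    ISSN = regex.compile(r'\b[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b')

    # Made-up subfield text
    print(ISSN.findall('ISSN 0317-8471 (print); ISSN 1550-7416 (online)'))
    # ['0317-8471', '1550-7416']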

Multiple fields and subfields may be specified. Fields may be repeated with different subfields.

Example:

001 BL001
015$a	BNB
020	ISBN
020$z	ISBN10
500$a	\b[a-z]{7}\b
035$a	OCLC

In the example above, field 500 subfield $a is being searched for 7-character words.
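
Each config line can be checked against the regular expression quoted earlier in this section; a minimal sketch (the helper function is hypothetical, not part of cn_find):

    import regex

    # Pattern quoted earlier: field tag, optional subfield, tab, control number spec
    CONFIG_LINE = regex.compile(r'^([0-9A-Z]{3})\s*\$?\s*([a-z0-9]?)\s*\t(.*?)\s*$')

    def parse_config_line(line):
        """Return (field tag, subfield code, specification), or None if invalid."""
        match = CONFIG_LINE.match(line.rstrip('\n'))
        return match.groups() if match else None

    print(parse_config_line('020$z\tISBN10'))  # ('020', 'z', 'ISBN10')
    print(parse_config_line('015$a\tBNB'))     # ('015', 'a', 'BNB')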

Options

--conv

If option --conv is used, 10-digit ISBNs will be converted to 13-digit form whenever possible (i.e. whenever they are valid ISBNs).
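
For reference, the conversion prefixes the nine data digits with 978 and recalculates the check digit; a sketch only, since cn_find's own conversion and validation logic may differ:

    def isbn10_to_isbn13(isbn10):
        """Convert a 10-digit ISBN to 13-digit form (illustrative sketch)."""
        # Keep digits and X, drop hyphens/spaces, then drop the old check digit
        chars = [c for c in isbn10 if c.isdigit() or c in 'Xx']
        core = '978' + ''.join(chars[:9])
        # ISBN-13 check digit: alternating weights of 1 and 3
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)

    print(isbn10_to_isbn13('0-306-40615-2'))  # 9780306406157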

--rid

By default, the output file consists of a single column of strings. If option --rid is used, the output file will consist of two columns: the first column will be the record control number from field 001 and the second column will be as per the default output.

--tidy

If option --tidy is used, the list of control numbers in the output file will be sorted and de-duplicated. Any duplicate control numbers will be written to an additional output file named with the prefix "dp-".

Note: option --tidy cannot be used at the same time as option --rid.
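
A minimal sketch of the sort/de-duplicate behaviour described above (the function and file names are assumptions; only the "dp-" prefix comes from this section):

    def tidy(control_numbers, output_path='output.txt'):
        """Write a sorted, de-duplicated list, plus a 'dp-' file of duplicates."""
        seen, duplicates = set(), set()
        for cn in control_numbers:
            (duplicates if cn in seen else seen).add(cn)
        with open(output_path, 'w', encoding='utf-8') as out:
            out.writelines(cn + '\n' for cn in sorted(seen))
        with open('dp-' + output_path, 'w', encoding='utf-8') as dupes:
            dupes.writelines(cn + '\n' for cn in sorted(duplicates))

    tidy(['9780306406157', '0317-8471', '9780306406157'])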
