Extract XML from the OS X dictionaries.

Overview

Before You Start

Apple-peeler was written using Python 3.9 (but it should be trivial to support earlier versions, back to Python 3.5).

Installation

pip install apple-peeler

Dependencies

BeautifulSoup 4, lxml, and click

Usage

Apple likes to move the dictionaries' location around from one macOS version to the next. If the dictionaries are no longer at the path below, you can tell apple-peeler where to look by exporting DICT_BASE in your environment or by using the --base option below.

export DICT_BASE="/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"

After that, usage is straightforward.

Usage: apple-peeler [OPTIONS]

Extract XML from Apple Dictionary files.

Options:
--base DIRECTORY                The root directory of the OS X dictionaries.
                                (Default: /System/Library/AssetsV2/com_apple
                                _MobileAsset_DictionaryServices_dictionaryOS
                                X/) [Env var DICT_BASE]
--out DIRECTORY                 The path to place extracted XML files.
-d, --dictionary [
    all|Arabic - English|Danish|Duden Dictionary Data Set I|Dutch|
    Dutch - English|French|French - English|French - German|German - English|
    Hebrew|Hindi|Hindi - English|Indonesian - English|Italian|
    Italian - English|Korean|Korean - English|New Oxford American Dictionary|
    Norwegian|Oxford American Writer's Thesaurus|
    Oxford Dictionary of English|Oxford Thesaurus of English|
    Polish - English|Portuguese|Portuguese - English|Russian|
    Russian - English|Sanseido Super Daijirin|
    Sanseido The WISDOM English-Japanese Japanese-English Dictionary|
    Simplified Chinese - English|Simplified Chinese - Japanese|Spanish|
    Spanish - English|Swedish|Thai|Thai - English|
    The Standard Dictionary of Contemporary Chinese|Traditional Chinese|
    Traditional Chinese - English|Turkish|Vietnamese - English]
                                The dictionary to extract or 'all'.
                                (Default: all) [Accepts multiple]
--format-xml / --no-format-xml  Format the XML files using BeautifulSoup.
                                (Default: False)
--debug                         Output debug information to STDERR.
                                (Default: False)
--help                          Show this message and exit.
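As an illustration of how the options above fit together, here is a minimal sketch that assembles an apple-peeler command line from Python. The helper name build_command is hypothetical and not part of apple-peeler. It uses only flags listed in the help text, and it assumes that "[Accepts multiple]" means -d is repeated once per dictionary.

```python
import shlex


def build_command(out_dir, dictionaries=(), base=None, format_xml=False):
    """Return an apple-peeler argument list (hypothetical helper, not part of the tool)."""
    cmd = ["apple-peeler"]
    if base is not None:
        # Overrides both the built-in default and the DICT_BASE environment variable.
        cmd += ["--base", base]
    cmd += ["--out", out_dir]
    for name in dictionaries:
        # Assumption: "[Accepts multiple]" means the flag is passed once per dictionary.
        cmd += ["-d", name]
    if format_xml:
        cmd.append("--format-xml")
    return cmd


if __name__ == "__main__":
    cmd = build_command("./xml", ["French - English", "Spanish"], format_xml=True)
    # Print a shell-quoted version that can be pasted into a terminal.
    print(shlex.join(cmd))
```

Leaving base as None lets the tool fall back to DICT_BASE or its default path. The returned list can be passed to subprocess.run once apple-peeler is installed.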

Introduction

I need a ton of dictionary data for prototyping my language-learning tool, Parsnip, and licensing 40 dictionaries seems too expensive for a bootstrapper who is prototyping / working on an MVP (I look forward to the day this is no longer true). [Note: I am not planning to redistribute or otherwise use the data in an unlicensed manner.]

Parsnip uses Natural Language Processing and Dictionaries to decouple the word <-> sentence tug-of-war that's existed as long as flashcards have been used for language learning. I.e., should I make a word (concept) or a sentence (example) flashcard?

I care about what words I know for tracking purposes, but I want those words in context when I'm practicing. So the learning system breaks sentences down into lemmas (the dictionary form of each word) and keeps a database of example sentences in which those words appear. This resolves the conceptual tug-of-war for flashcards.
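The lemma and example-sentence split described above can be sketched roughly as follows. This is not Parsnip's actual implementation: the LEMMAS table is a hypothetical stand-in for a real NLP lemmatizer, and the whitespace tokenizer is an assumption made to keep the example short.

```python
from collections import defaultdict

# Hypothetical toy lemma table; a real system would use an NLP lemmatizer.
LEMMAS = {"cats": "cat", "sat": "sit", "sits": "sit"}


def lemmatize(token):
    """Map a token to its dictionary form, falling back to the token itself."""
    return LEMMAS.get(token, token)


def build_index(sentences):
    """Map each lemma to the example sentences it appears in."""
    index = defaultdict(list)
    for sentence in sentences:
        # Naive tokenization: lowercase, drop trailing punctuation, split on spaces.
        tokens = sentence.lower().rstrip(".!?").split()
        # Use a set so a sentence is listed once per lemma even if the word repeats.
        for lemma in {lemmatize(t) for t in tokens}:
            index[lemma].append(sentence)
    return index


if __name__ == "__main__":
    index = build_index(["The cats sat.", "A cat sits."])
    for lemma in sorted(index):
        print(lemma, index[lemma])
```

Tracking then happens per lemma (the keys), while practice draws on the sentences stored under each key.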

But having removed reference data from the flashcards themselves, I need to integrate reference material directly into Parsnip's UI. JMDict is a great open source project for this, but it only covers a single language. So, I've been keeping my eyes open for people working on extracting the data from Apple's bundled dictionaries.

This has been a community effort that's spanned several years. My contribution is to collect the results, clear up some details about the file format, and package it into a general command-line tool.

References

This is inspired by Reverse-Engineering Apple Dictionary and by the discussion of it on Hacker News, Hacker News: Reverse-Engineering Apple Dictionary (2020). Special thanks to tim-- and enragedcacti, who introduced me to binwalk, and to dunham, who mentioned that the random bytes look like ints of payload sizes.

Additionally, I've found these posts informative:

Releases(v0.1.1)
Owner
Joshua Olson