Archive, organize, and watch for changes to publicly available information.

Overview

0. Overview

The Trapper Keeper is a collection of scripts that support archiving information from around the web to make it easier to study and use. If you are a researcher working with online material, an educator creating openly licensed content, or a curious person who likes to learn more about different subjects, then Trapper Keeper might be helpful to you. Trapper Keeper can currently archive and clean web pages and pdfs.

Trapper Keeper supports these features:

  • Archive data from multiple sources;
  • Clean data and save it as text;
  • List out embedded media and links;
  • Retain a copy of embedded images in the source text;
  • Track the source material for changes;
  • Organize your cleaned, archived data into arbitrary collections - a "collection" can be anything that unifies a set of information; ie, a set of urls that all related to a specific topic; or a set of information that will be remixed into chapters;
  • Export a list of all tracked URLs.

1. General Use

Identify urls that contain content or data you would like to research or use. Collect those urls into a csv file.

Use archive.py retain a clean copy that you can work with locally. You can run as many imports as you want.

When you want to work with a specific set of content, create a csv that lists the urls you want to examine. Run collect_texts.py to get a copy of the specific texts you want.

Periodically, check if any of the urls you have archived have been changed or updated by running archive.py -p update.

To see the specific content of any changes, create a csv that lists the urls where you want to examine diffs, and run show_diffs.py

A common use case here is developing and maintaining open content. You have researched multiple pages on the web that contain information shared under an open license, and you want to incorporate that information into your own material. Over time, if any of your source materials are updated, you'd like to know.

2. Additional Details

2A. Getting Started

Create a list of urls you want to archive. At present, the script processes and extracts text from web pages and pdfs.

For every web page you want to archive, you will need to designate three text snippets: the first phrase/few words of the page you want to archive; the final phrase/words of the section you want to archive, and a phrase in the middle of the text you want to archive.

For every pdf you want to archive, you only need to specify the url to the pdf.

When you have the url and snippet information, save it in a csv with four columns 'source_urls','opening', 'middle', and 'closing'.

Put the csv file in the "source" directory and update the filename in the archive.py file.

Run the intial import by running archive.py -p csv.

Running archive.py -p csv does a few things:

  • for web pages, the script saves three versions of the url: the complete, raw version; a second snippet that contains the html around the opening and closing snippet; and a cleaned version that just contains text. The cleaned version also contains a list of all linked urls in the page, and a list of all images linked in the page.
  • for web pages, the script retrieves any images in the page and stores them in a "media" directory
  • for pdfs, the script creates a copy of the pdf, and extracts text from the pdf, and stores a copy of both.
  • for both web pages and pdfs, the script creates a json file that stores metadata about the url and the content at that url. This metadata includes a hash of the content that is used to track changes over time.

2B. Track additional URLs

To track additional urls, create a new csv file with new information about urls, and run archive.py -p csv

2C. Getting updated content

To retrieve new versions of urls that have been archived, run archive.py -p update

2D. Check Diffs

To check whether or not a url has been updated, prepare a csv with at least 4 columns: source_urls, yyyy, mm, and dd.

source_urls should contain the urls you want to check for updates. yyyy, mm, and dd are used to designate the cutoff date for diffs. For example, if you wanted to check whether or not updates have happened after September 1, 2021, you would enter 2021 as yyyy, 09 as mm, and 01 as dd.

The cutoff date can be specified for each individual url.

Once you have the csv created, add it to the "source" directory and update the filename in the show_diffs.py file.

Running show_diffs.py creates a single html page for each changed file that displays the changes side by side. Each html page is stored in the "diffs" directory.

2E. Getting data from a subset of URLs; ie, the Point of It All

Saving content and cleaning it is great, but ultimately we need to organize this information and work with it. The collect_texts.py allows us to choose exactly the urls we want to work with, and to make a copy of the cleaned text.

To export cleaned text that we have archived, create a csv with two columns: source_urls and collection.

Once you have the csv created, add it to the "source" directory and update the filename in the collect_texts.py file.

Then, run collect_texts.py and the cleaned text will be copied into the "delivery" directory, and the files will be sorted by "collection".

2F. Export all data

The export.py file allows you to export a csv of all records, or a csv of only the urls that are current.

To export every record, run export.py -e all.

To export current records only (ie, only pointer to the latest version of each url), run export.py -e current.

2G. Housekeeping

The housekeeping.py script runs basic maintenance tasks. Currently, it cleans unneeded files and moves them into a "manual_review" directory.

Other housekeeping tasks will be added in the near future.

3. Current files:

  • archive.py - requires a csv - takes two arguments: csv, or update
  • housekeeping.py
  • show_diffs.py
  • collect_texts.py - requires a csv
  • export.py - requires a csv- takes two arguments: current, or all
Comments
  • Document setup procedure for OSX

    Document setup procedure for OSX

    I did not get far.

    1. Created a csv with 3 pdf URLs, 3 web URLs, as specified, file name named keep_this.csv in a source/ directory (note, the readme suggests using this, it might help if the distro included a sample csv one could run as a first test)
    2. Modified archive.py with the new file name
    3. Ran python archive.py -p csv

    I get an error message

      File "archive.py", line 143
        print(f"Processing {url}\n")
                                  ^
    SyntaxError: invalid syntax
    

    I know nothing of python and it's likely a rookie error.

    opened by cogdog 16
  • Use pdfminer for OSX; retain ocrmypdf for Linux

    Use pdfminer for OSX; retain ocrmypdf for Linux

    Pdfminer is already used to extract metadata, and ocrmypdf is not behaving well in testing with OSX (although that's likely due to my human error).

    In any case, pdfminer has the ability to extract text from pdfs, and it is working without issue in OSX (so far, anyways).

    This thread has info from one of the pdfminer maintainers, and will be a good starting point: https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python/61855361#61855361

    opened by billfitzgerald 2
  • In archive.py, add

    In archive.py, add "initial_save" and "pdf" value to the json file that contains url metadata

    I'm not 100% sure this is needed, as it can be largely inferred from the accessed_on value, but it could be useful over time as an additional way to track the evolution of a url over time.

    This will also have implications for other functionality in different scripts, so any other element of the toolkit that parses the url_data will need to be reviewed prior to making any change here.

    opened by billfitzgerald 2
  • fix small typo add requirements.txt

    fix small typo add requirements.txt

    Thanks for sharing, I've been meaning to look at beautiful soup for a while.

    Looks like there is a typo on line 50 of collect_texts.py

    Added requirements.txt to capture dependencies to easily install via pip install -r requirements.txt especially convenient if using conda environments to avoid cluttering main python env.

    This also required liblept5 and firefox-geckodriver installed via apt on ubuntu before anything would run. Maybe I'll try to document the full set via a Dockerfile as I'm sure there are other dependencies I already had installed.

    There appear to be other issues I'm struggling through, not sure if user error, documentation, or something other. I'll try to grok things a bit better so I can articulate the other issues and either create issues or PRs.

    opened by jgraham909 1
  • Implement shuffle and checks on urls to reduce the load on any specific site

    Implement shuffle and checks on urls to reduce the load on any specific site

    When checking sites for updates, we want to make sure that we don't place any performance burden on any of the sites we are working with. This will be most relevant for running updates on policies.

    We'll get two checks in place:

    1. Shuffle the dataframe with urls.
    2. Check the url that was just archived; if it's from the same domain that was just archived, pause (or navigate away) before continuing.
    opened by billfitzgerald 0
  • Rework collect_texts.py to cover two options: text only, and text will all html files

    Rework collect_texts.py to cover two options: text only, and text will all html files

    The current default brings text and the truncated snippet over.

    Rework the script to take an argument on the command line:

    -c --collect all or -c --collect text

    opened by billfitzgerald 0
  • Expand collect_texts.py to include source texts

    Expand collect_texts.py to include source texts

    Both extracted text and source will be exported by default.

    Over time, this can be made configurable, but for now getting both is a more common use case (for me, anyways).

    opened by billfitzgerald 0
  • Adjust how opening text is used to trip text output in archive.py

    Adjust how opening text is used to trip text output in archive.py

    It works well in most cases, but in some instances (generally, where the opening snippet includes text that from multiple tags or with unpredictable spacing) the trimming can be too aggressive.'

    I have some urls where this issue occurs, and some ideas on the best way to address this.

    Creating this issue to track progress.

    opened by billfitzgerald 0
  • Implement cleaner handling of opening and closing text

    Implement cleaner handling of opening and closing text

    Sometimes, depending on how the opening and closing lines are formatted, the closing text and opening text can have spaces and other odd characters that complicate identifying the exact end and beginning of a text when the text isn't compressed.

    Needs to be addressed.

    opened by billfitzgerald 0
  • Meta-Issue: PDF Cleanup

    Meta-Issue: PDF Cleanup

    Currently, PDF cleanup is working in both Linux and OSX.

    The way it's working definitely needs improvement. This issue documents the current approach, some of the rationale behind this less-than-ideal approach, and some general thoughts on moving forward.

    The thoughts in this ticket are reflected in the update pushed to the repo here: https://github.com/billfitzgerald/trapper-keeper/commit/d3f26ad76adb06d448f956f8663a871128da9657

    Current approach and rationale

    The original version was written and tested in Linux, and used OCRMyPDF. The results from OCRMyPDF are good.

    However, using OCRMyPDF in OSX didn't work cleanly, even when using an "ifmain" guard as specified by the documentation here: https://ocrmypdf.readthedocs.io/en/latest/api.html

    To address this issue, I switched over to PDFMiner.six, which worked in OSX, and did not throw exceptions. However, the results were not as clean, and part of what's nice about OCRMyPDF is that it will also OCR text from images.

    The current "solution" (which isn't awesome) is to check for the OS of the machine running the script. OSX users are routed to use PDFMiner; Linux users are routed to use OCRMyPDF. Windows users should probably be routed to use PDFMiner as well, but I don't have a Windows machine to test against, so Windows is not currently supported.

    Future path

    In the future, I'd rather use a single method for cleaning PDFs.

    Additionally, even using OCRMyPDF, the average PDF still has a lot of cruft that needs to be cleaned from the output, so future will will also include better text cleanup.

    opened by billfitzgerald 0
  • Add a check for iframes within a page and flag results for additional review

    Add a check for iframes within a page and flag results for additional review

    For pages that embed iframes - test and see if the full content is rendered, and therefore accessible.

    This might be a non-issue; putting this here so it doesn't get lost.

    opened by billfitzgerald 1
Releases(v00.01.01pre_alpha)
  • v00.01.01pre_alpha(Jan 24, 2022)

Owner
Bill Fitzgerald
Bill Fitzgerald
Data derived from the OpenType specification

This package currently provides the opentypespec.tags module, which exports FEATURE_TAGS, SCRIPT_TAGS, LANGUAGE_TAGS and BASELINE_TAGS dictionaries, representing data from the Layout Tag Registry

Simon Cozens 4 Dec 01, 2022
Python Osmium Examples

Python Osmium Examples This is a set (currently of size 1) of examples showing practical usage of PyOsmium, a thin wrapper around the osmium library.

Martijn van Exel 1 Jan 26, 2022
Use Ghidra Structs in Python

Strudra Welcome to Strudra, a way to craft Ghidra structs in python, using ghidra_bridge. Example First, init Strudra - you can pass in a custom Ghidr

Dominik Maier 27 Nov 24, 2022
A community-driven python bot that aims to be as simple as possible to serve humans with their everyday tasks

JARVIS on Messenger Just A Rather Very Intelligent System, now on Messenger! Messenger is now used by 1.2 billion people every month. With the launch

Swapnil Agarwal 1.3k Jan 07, 2023
ChieriBot,词云API版,用于统计群友说过的怪话

wordCloud_API 词云API版,用于统计群友说过的怪话,基于wordCloud 消息储存在mysql数据库中.数据表结构见table.sql 为啥要做成API:这玩意太吃性能了,如果和Bot放在同一个服务器,可能会影响到bot的正常运行 你服务器性能够用的话就当我在放屁 依赖包 pip i

chinosk 7 Mar 20, 2022
A supercharged version of paperless: scan, index and archive all your physical documents

Paperless-ng Paperless (click me) is an application by Daniel Quinn and contributors that indexes your scanned documents and allows you to easily sear

Jonas Winkler 5.3k Jan 09, 2023
frida-based ceserver. iOS analysis is possible with Cheat Engine.

frida-ceserver frida-based ceserver. iOS analysis is possible with Cheat Engine. Original by Dark Byte. Usage Install frida on iOS. python main.py Cyd

KenjiroIchise 89 Jan 08, 2023
A simple script written using symbolic python that takes as input a desired metric and automatically calculates and outputs the Christoffel Pseudo-Tensor, Riemann Curvature Tensor, Ricci Tensor, Scalar Curvature and the Kretschmann Scalar

A simple script written using symbolic python that takes as input a desired metric and automatically calculates and outputs the Christoffel Pseudo-Tensor, Riemann Curvature Tensor, Ricci Tensor, Scal

2 Nov 27, 2021
A series of basic programs written in Python

Primeros programas en Python Una serie de programas básicos escritos en Python

Madirex 1 Feb 15, 2022
Fully coded Apps by Codex.

OpenAI-Codex-Code-Generation Fully coded Apps by Codex. How I use Codex in VSCode to generate multiple completions with autosorting by highest "mean p

nanowell 47 Jan 01, 2023
An execution framework for systematic strategies

WAGMI is an execution framework for systematic strategies. It is very much a work in progress, please don't expect it to work! Architecture The Django

Rich Atkinson 10 Mar 28, 2022
A custom advent of code I am completing

advent-of-code-custom A custom advent of code I am doing in python. The link to the problems I am solving is here: https://github.com/seldoncode/Adven

Rocio PV 2 Dec 11, 2021
Static bytecode simulator

SEA Static bytecode simulator for creating dependency/dependant based experimental bytecode format for CPython. Example a = random() if a = 5.0:

Batuhan Taskaya 23 Jun 10, 2022
Tracing and Observability with OpenFaaS

Tracing and Observability with OpenFaaS Today we will walk through how to add OpenTracing or OpenTelemetry with Grafana's Tempo. For this walk-through

Lucas Roesler 8 Nov 17, 2022
A program that analyzes data from inertia measurement units installeed in aircraft and generates g-exceedance curves

A program that analyzes data from inertia measurement units installeed in aircraft and generates g-exceedance curves

Pooya 1 Nov 23, 2021
Notebooks for computing approximations to the prime counting function using Riemann's formula.

Notebooks for computing approximations to the prime counting function using Riemann's formula.

Tom White 2 Aug 02, 2022
This is a far more in-depth and advanced version of "Write user interface to a file API Sample"

Fusion360-Write-UserInterface This is a far more in-depth and advanced version of "Write user interface to a file API Sample" from https://help.autode

4 Mar 18, 2022
Make discord server By Coding!

Discord Server Maker Make discord server by Coding! FAQ How can i get role permissons? Open discord with chrome developer tool, go to network and clic

1 Jul 17, 2022
Wisdom Tree is a concentration app i am working on.

Wisdom Tree Wisdom Tree is a tui concentration app I am working on. Inspired by the wisdom tree in Plants vs. Zombies which gives in-game tips when it

NO ONE 241 Jan 01, 2023
Run CodeServer on Google Colab using Inlets in less than 60 secs using your own domain.

Inlets Colab Run CodeServer on Colab using Inlets in less than 60 secs using your own domain. Features Optimized for Inlets/InletsPro Use your own Cus

2 Dec 30, 2021