Command line program to download documents from web portals.

Overview

command line document download made easy



Highlights

  • list available documents in json format or download them
  • filter documents using
    • string matching
    • regular expressions or
    • jq queries
  • display captcha or QR codes for interactive input
  • writing new plugins is easy
  • existing plugins (some of them even work):
    • amazon
    • ing.de
    • dkb.de
    • o2.de
    • kabel.vodafone.de
    • conrad.de
    • elster.de



Dependencies



Installation

$ git clone --recursive https://github.com/heeplr/document-dl
$ cd document-dl
$ pip install .



Usage

Display Help:

$ document-dl -h
Usage: document-dl [OPTIONS] COMMAND [ARGS]...

  download documents from web portals

Options:
  -u, --username TEXT             login id  [env var: DOCDL_USERNAME]
  -p, --password TEXT             secret password  [env var: DOCDL_PASSWORD]
  -m, --match <ATTRIBUTE PATTERN>...
                                  only output documents where attribute
                                  contains pattern string  [env var:
                                  DOCDL_MATCH]

  -r, --regex <ATTRIBUTE REGEX>...
                                  only output documents where attribute value
                                  matches regex  [env var: DOCDL_REGEX]

  -j, --jq JQ_EXPRESSION          only output documents if json query matches
                                  document's attributes (see
                                  https://stedolan.github.io/jq/manual/ )
                                  [env var: DOCDL_JQ]

  -H, --headless BOOLEAN          show browser window if false  [env var:
                                  DOCDL_HEADLESS; default: True]

  -b, --browser [chrome|edge|firefox|ie|opera|safari|webkitgtk]
                                  webdriver to use for selenium based plugins
                                  [env var: DOCDL_BROWSER; default: chrome]

  -t, --timeout INTEGER           seconds to wait for data before terminating
                                  connection  [env var: DOCDL_TIMEOUT;
                                  default: 15]

  -i, --image-loading BOOLEAN     Turn off image loading when False  [env var:
                                  DOCDL_IMAGE_LOADING; default: False]

  -a, --action [download|list]    download or just list documents  [env var:
                                  DOCDL_ACTION; default: list]

  -h, --help                      Show this message and exit.

Commands:
  amazon    amazon.de (invoices)
  conrad    conrad.de (invoices)
  dkb       dkb.de with photoTAN (postbox)
  elster    elster.de with path to .pfx certfile as username (postbox)
  ing       banking.ing.de with photoTAN (postbox)
  o2        o2online.de (invoices/postbox)
  vodafone  kabel.vodafone.de (postbox, invoices)

Display plugin-specific help: (currently there is a bug in click that prompts for username and password before displaying the help)

$ document-dl ing --help
Usage: document-dl ing [OPTIONS]

  banking.ing.de with photoTAN (postbox)

Options:
  -k, --diba-key TEXT  DiBa Key  [env var: DOCDL_DIBA_KEY]
  -h, --help           Show this message and exit.



Examples

List all documents from vodafone.de, prompt for username/password:

$ document-dl vodafone

Same, but show browser window this time:

$ document-dl --headless=false vodafone

Download all documents from conrad.de, pass credentials as commandline arguments:

$ document-dl --username mylogin --password mypass --action download conrad

Download all documents from conrad.de, pass credentials as env vars:

$ DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download conrad

Download all documents from o2online.de where "doctype" attribute contains "BILL":

$ document-dl --match doctype BILL --action download o2

You can also use regular expressions to filter documents:

$ document-dl --regex date '^(2021-04|2021-05).*$' o2

List all documents from o2online.de where year >= 2019:

$ document-dl --jq 'select(.year >= 2019)' o2

Download document from elster.de with id == 15:

$ document-dl --jq 'contains({id: 15})' --action download elster



Writing a plugin

Plugins are click-plugins which in turn are normal @click.command's registered in setup.py

  • put your plugin into "docdl/plugins"

  • write your plugin class:

    • if you just need requests, inherit from docdl.WebPortal and use self.session that's initialized for you
    • if you need selenium, inherit from docdl.SeleniumWebPortal and use self.webdriver that's initialized for you
    • add click glue code
    • add your plugin to setup.py docdl_plugins registry
import docdl
import docdl.util

class MyPlugin(docdl.WebPortal):

    URL_LOGIN = "https://myservice.com/login"

    def login(self):
        request = self.session.get(self.URL_LOGIN)
        # ... authenticate ...
        if not_logged_in:
            return False
        return True

    def logout(self):
        # ... logout ...

    def documents(self):
        # iterate over all available documents
        for count, document in enumerate(all_documents):

            # scrape:
            #  * document attributes
            #    * it's recommended to assign an incremental "id"
            #      attribute to every document
            #    * if you set a "filename" attribute, it will be used to
            #      rename the downloaded file
            #    * dates should be parsed to datetime.datetime objects
            #      docdl.util.parse_date() should parse the most common strings
            #
            # also you must scrape either:
            #  * the download URL
            #
            # or (for SeleniumWebPortal plugins):
            #  * the DOM element that triggers download. It is expected
            #    that the download starts immediately after click() on
            #    the DOM element
            # or implement a custom download() method

            yield docdl.Document(
                url = this_documents_url,
                # download_element = <some selenium element to click>
                attributes = {
                    "id": count,
                    "category": "invoices",
                    "title": this_documents_title,
                    "filename": this_documents_target_filename,
                    "date": docdl.util.parse_date(some_date_string)
                }
            )


    def download(self, document):
        """you shouldn't need this for most web portals"""
        # ... save file to os.getcwd() ...
        return self.rename_after_download(document, filename)


@click.command()
@click.pass_context
def myplugin(ctx):
    """plugin description (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MyPlugin)

and in setup.py:

# ...
setup(
    # ...
    packages=find_packages(
        # ...
        entry_points={
            'docdl_plugins': [
                # ...
                'myplugin=docdl.plugins.myplugin:myplugin',
                # ...
            ],
            # ...
        }
)



Security

Beware that your login credentials are most probably saved in your shell history when you pass them as commandline arguments. You can use the input prompt to avoid that or set environment variables safely.



Bugs

document-dl is still in a very early state of development and a lot of things don't work, yet. Especially a ton of edge cases need to be covered. If you find a bug, please open an issue or send a pull request.

  • --browser settings beside chrome probably don't work unless you help to test them
  • some services offer more documents/data than currently scraped



TODO

  • logging
  • better documentation
  • properly parse rfc6266
  • delete action
ChromiumJniGenerator - Jni Generator module extracted from Chromium project

ChromiumJniGenerator - Jni Generator module extracted from Chromium project

allenxuan 4 Jun 12, 2022
爬取各大SRC当日公告 | 通过微信通知的小工具 | 赏金工具

OnTimeHacker V1.0 OnTimeHacker 是一个爬取各大SRC当日公告,并通过微信通知的小工具 OnTimeHacker目前版本为1.0,已支持24家SRC,列表如下 360、爱奇艺、阿里、百度、哔哩哔哩、贝壳、Boss、58、菜鸟、滴滴、斗鱼、 饿了么、瓜子、合合、享道、京东、

Bywalks 95 Jan 07, 2023
A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com

TDTV2-Direct Version 1.00.1 • A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com :) How to Works?? install all dependancies v

Danushka-Madushan 1 Nov 28, 2021
Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website by form number and returns the results as json

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website (prior form publication) by form number and returns the results as json. It provides the option to download pdfs over a ra

1 Jan 04, 2022
Using Selenium with Python to Web Scrap Popular Youtube Tech Channels.

Web Scrapping Popular Youtube Tech Channels with Selenium Data Mining, Data Wrangling, and Exploratory Data Analysis About the Data Web scrapi

David Rusho 0 Aug 18, 2021
Web Scraping Practica With Python

Web-Scraping-Practica Integrants: Guillem Vidal Pallarols. Lídia Bandrés Solé Fitxers: Aquest document és el primer que trobem. A continuació trobem u

2 Nov 08, 2021
Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Video Games Web Scraper Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages. This

Albert Marrero 1 Jan 12, 2022
python+selenium实现的web端自动打卡 + 每日邮件发送 + 金山词霸 每日一句 + 毒鸡汤(从2月份稳定运行至今)

python+selenium实现的web端自动打卡 说明 本打卡脚本适用于郑州大学健康打卡,其他web端打卡也可借鉴学习。(自己用的,从2月分稳定运行至今) 仅供学习交流使用,请勿依赖。开发者对使用本脚本造成的问题不负任何责任,不对脚本执行效果做出任何担保,原则上不提供任何形式的技术支持。 为防止

Sunday 1 Aug 27, 2022
Google Maps crawler using Selenium

Google Maps Crawler using Selenium Built as part of the Antifragile Dev Project Selenium crawler that browses Google Maps as a regular user and stores

Guilherme Latrova 46 Dec 16, 2022
自动完成每日体温上报(Github Actions)

体温上报助手 简介 每天 10:30 GMT+8 自动完成体温上报,如想修改定时运行的时间,可修改 .github/workflows/SduHealthReport.yml 中 schedule 属性。 如果当日有异常,请手动在小程序端/PC 端填写!

Teng Zhang 23 Sep 15, 2022
OSTA web scraper, for checking the status of school buses in Ottawa

OSTA-La-Vista OSTA web scraper, for checking the status of school buses in Ottawa. Getting Started Using a Raspberry Pi, download Python 3, and option

1 Jan 28, 2022
中国大学生在线 四史自动答题刷分(现仅支持英雄篇)

中国大学生在线 “四史”学习教育竞答 自动答题 刷分 (现仅支持英雄篇,已更新可用) 若对您有所帮助,记得点个Star 🌟 !!! 中国大学生在线 “四史”学习教育竞答 自动答题 刷分 (现仅支持英雄篇,已更新可用) 🥰 🥰 🥰 依赖 本项目依赖的第三方库: requests 在终端执行以下

XWhite 229 Dec 12, 2022
This is a simple website crawler which asks for a website link from the user to crawl and find specific data from the given website address.

This is a simple website crawler which asks for a website link from the user to crawl and find specific data from the given website address.

Faisal Ahmed 1 Jan 10, 2022
学习强国 自动化 百分百正确、瞬间答题,分值45分

项目简介 学习强国自动化脚本,解放你的时间! 使用Selenium、requests、mitmpoxy、百度智能云文字识别开发而成 使用说明 注:Chrome版本 驱动会自动下载 首次使用会生成数据库文件db.db,用于提高文章、视频任务效率。 依赖安装 pip install -r require

lisztomania 359 Dec 30, 2022
Scraping web pages to get data

Scraping Data Get public data and save in database This is project use Python How to run a project 1 - Clone the repository 2 - Install beautifulsoup4

Soccer Project 2 Nov 01, 2021
Audio media crawler for lbry.

Audio media crawler for lbry. Requirements Python 3.8 Poetry 1.1.7 Elasticsearch 7.14.0 Lbry-sdk 0.99.0 Development This project uses poetry as a depe

Hound.fm 4 Dec 03, 2022
This is a python api to scrape search results from a url.

googlescrape Installation Installation is simple! # Stable version pip install googlescrape Examples from googlescrape import client scrapeClient=cli

1 Dec 15, 2022
A list of Python Bots used to extract data from several websites

A list of Python Bots used to extract data from several websites. Data extraction is for products on e-commerce (ecommerce) websites. Data fetched i

Sahil Ladhani 1 Jan 14, 2022
A Very simple free proxy list scraper.

Scrappp A Very simple free proxy list scraper, made in python The tool scrape proxy from diffrent sites and api's. Screenshots About the script !!! RE

Joji aka Moncef 12 Oct 27, 2022
Web scrapper para cotizar articulos

WebScrapper Este web scrapper esta desarrollado en python 3.10.0 para buscar en la pagina de cyber puerta articulos dentro del catalogo. El programa t

Jordan Gaona 1 Oct 27, 2021