A webmining CLI tool & library for python.

Overview

Minet

minet is a webmining command line tool & library for python (>= 3.6) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, CrowdTangle, YouTube, Twitter, Media Cloud, etc.

It adopts a very simple approach to various webmining problems by letting you perform a variety of actions from the comfort of the command line. No database needed: raw CSV files should be sufficient to do most of the work.

In addition, minet also exposes its high-level programmatic interface as a python library so you can tweak its behavior at will.

Shortcuts: Command line documentation, Python library documentation.

What it does

Minet can single-handedly:

  • Extract URLs from a text file (or a table)
  • Parse URLs (get useful information, including Facebook- and YouTube-specific details)
  • Join two CSV files by matching the columns containing URLs
  • From a list of URLs, resolve their redirections
    • ...and check their HTTP status
    • ...and download the HTML
    • ...and extract hyperlinks
    • ...and extract the text content and other metadata (title...)
    • ...and scrape structured data (using a declarative language to define your heuristics)
  • Crawl (using a declarative language to define a browsing behavior, and what to harvest)
  • Mine or search:
  • Scrape (without requiring special access):
  • Grab & dump cookies from your browser
  • Dump Hyphe data

Documented use cases

Features (from a technical standpoint)

  • Multithreaded, memory-efficient fetching from the web.
  • Multithreaded, scalable crawling using a comfy DSL.
  • Multiprocessed raw text content extraction from HTML pages.
  • Multiprocessed scraping from HTML pages using a comfy DSL.
  • URL-related heuristics utilities such as extraction, normalization and matching.
  • Data collection from various APIs such as CrowdTangle.
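
As an illustration of the multithreaded fetching idea, the sketch below checks the HTTP status of several URLs concurrently with a thread pool from the standard library. This is an assumption-laden toy, not how minet is implemented: the function names (fetch_status, fetch_all), the default thread count and the timeout are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return a (url, HTTP status or None, error message or None) tuple."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return url, response.status, None
    except HTTPError as error:
        # The server answered, but with an error status such as 404 or 503.
        return url, error.code, None
    except (ValueError, OSError) as error:
        # Malformed URL, DNS failure, refused connection, timeout...
        return url, None, str(error)


def fetch_all(urls, fetch=fetch_status, threads=8):
    """Run `fetch` over `urls` in a thread pool, yielding results in input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        yield from pool.map(fetch, urls)


# Offline demonstration with a stand-in fetcher (no network access needed).
demo = list(fetch_all(["a", "b"], fetch=lambda url: (url, 200, None), threads=2))
print(demo)
```

Threads suit this workload because each worker spends most of its time waiting on the network; passing the fetcher as a parameter keeps the pooling logic testable without real requests.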

Installation

minet can be installed as a standalone CLI tool (currently only on mac >= 10.14, ubuntu & similar) by running the following command in your terminal:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

Don't trust us enough to pipe the result of an HTTP request into bash? We wouldn't either, so feel free to read the installation script here and run it on your end if you prefer.

On ubuntu & similar you might need to install curl and unzip before running the installation script if you don't already have them:

sudo apt-get install curl unzip

Otherwise, minet can be installed directly as a python CLI tool and library using pip:

pip install minet
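
As a quick sanity check after a pip install, the following sketch reports which version of a package is installed. It relies only on the standard library's importlib.metadata, which requires Python 3.8 or later (newer than the minimum version listed for minet); the helper name installed_version is ours and not part of minet.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("minet")
    print("minet is not installed" if found is None else "minet " + found)
```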

If you need more help to install and use minet from scratch, you can check these installation documents.

Finally, if you want to install the standalone binaries yourself (even for Windows), you can find them in each release here.

Upgrading

To upgrade the standalone version, simply run the install script once again:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

To upgrade the python version you can use pip thusly:

pip install -U minet

Uninstallation

To uninstall the standalone version:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash

To uninstall the python version:

pip uninstall minet

Documentation

Contributing

To contribute to minet you can check out this documentation.

How to cite

minet is published on Zenodo with a DOI (given in the citation below).

You can cite it thusly:

Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, & Amélie Pellé. (2019, October 14). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399

Comments
  • casanova.exceptions.EmptyFileError

    I am trying to run minet in a github action. It fails with the following message:

      minet tw scrape tweets -o tweets.csv "from:@taniki #tutotal2022"
      shell: /usr/bin/bash -e {0}
      env:
        pythonLocation: /opt/hostedtoolcache/Python/3.9.5/x64
        LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.5/x64/lib
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 151, in __init__
        fieldnames = next(self.reader)
    StopIteration
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/__init__.py", line 33, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/scrape.py", line 45, in twitter_scrape_action
        enricher = casanova.enricher(
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/enricher.py", line 31, in __init__
        super().__init__(input_file, no_headers=no_headers, **kwargs)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 157, in __init__
        raise EmptyFileError
    casanova.exceptions.EmptyFileError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Error: Process completed with exit code 1.
    
    opened by taniki 16
  • Get Retweeters

    Hi, thanks for the last release. I'm glad to see there is a Retweeters tool, but I have been running into issues with it for a few days. I may not have understood how it should be used. When I run it, I get this error: [image]. Could someone who has managed to use it help me?

    Thank you

    opened by jlbreeeez 15
  • Twitter API scraper: acquire guest_token by API

    New method to acquire the guest_token through the activate API. Relates to #384 and #382.

    Method taken from @JustAnotherArchivist in snscrape, see: https://github.com/JustAnotherArchivist/snscrape/commit/0336ce13edbd195b3e91487061a0e7a2857f0c68 Thanks for sharing the solution.

    For now this edit is simply a new method to acquire the token. The token is used as a cookie as before but it's not preserved on disk in case of multiple calls.

    opened by paulgirard 11
  • tw scrape fails on some queries due to Over capacity error

    minet tw scrape tweets '#5gcovid' > tweets.csv

    <class 'minet.twitter.exceptions.TwitterPublicAPIInvalidResponseError'>

    {'errors': [{'message': 'Over capacity', 'code': 130}]} 503

    bug 
    opened by Yomguithereal 10
  • [retweeters] KeyError: 'url'

    Hi, when I try to retrieve the list of retweeters from a file of tweets previously collected from Twitter with the minet scraper, I get this error after a few tweets have been processed (after 7, 10 or 30 tweets, depending on the dataset). Has anyone encountered this error before? Thanks for helping :-) [image]

    opened by tloops329384 8
  • unable to extract all the tweets matching a query

    When I run a query combining a keyword and a user, the result is very erratic: sometimes 0 tweets, sometimes 1, sometimes 20, sometimes 80, etc., without ever reaching a complete extraction (even though there are only about 200 tweets in total). I have rerun this query many times without ever extracting all of the tweets in question.

    What should I do to achieve this? Thank you

    opened by parisGH 8
  • [twitter] unable to get user tweets

    Hello,

    Thanks for sharing the lib with the community. I am not able to get user tweets; I get this error:

    Traceback (most recent call last):
      File "/home/bafou/.local/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/__main__.py", line 198, in main
        to_close = resolve_arg_dependencies(cli_args, config)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 290, in resolve_arg_dependencies
        setattr(cli_args, name, value.resolve(config))
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 253, in resolve
        return getpath(config, self.key, self.default)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/ebbe/utils.py", line 72, in getpath
        target = target[step]
    TypeError: string indices must be integers
    

    when executing minet tw user-tweets screen_name users.csv > tweets.csv with users.csv

    Regards.

    bug 
    opened by billmetangmo 6
  • GH actions + Minet Scrap Twitter fail.

    Hi,

    I have this GitHub Action to generate a CSV of scraped tweets (written by @taniki):

    name: scrape bfm
    
    on:
      workflow_dispatch:
      schedule:
        - cron:  '0 9 * * *'
    
    jobs:
      scrape_bfm:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/[email protected]
          - uses: actions/[email protected]
            with:
              python-version: '3.x'
          - name: install minet
            run: |
              python -m pip install --upgrade pip
              pip install minet==0.56.2
          - name: scrape @BFMTV tweets
            shell: bash
            run: |
              minet tw scrape tweets "from:@BFMTV since:2021-09-01" > bfmtv-tweets.csv
          - name: commit
            uses: ./.github/actions/commit
            with:
              message: lol @bfmtv
    

    Sometimes it works without any problem. Sometimes GitHub returns this error log:

    Run minet tw scrape tweets "from:@CNEWS since:2021-09-01" > cnews-tweets.csv
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                            
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                   
    Searching for "from:@CNEWS since:2021-09-01"
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.10.1/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/__init__.py", line 31, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/scrape.py", line 69, in twitter_scrape_action
        for tweet, meta in iterator:
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 370, in search
        new_cursor, tweets = retryer(self.request_search, query, cursor, refs=refs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
        do = self.iter(retry_state=retry_state)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 349, in iter
        return fut.result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 438, in result
        return self.__get_result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
        raise self._exception
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 407, in __call__
        result = fn(*args, **kwargs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 72, in wrapped
        self.acquire_guest_token()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 261, in acquire_guest_token
        raise TwitterGuestTokenError
    minet.twitter.exceptions.TwitterGuestTokenError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]
    Error: Process completed with exit code 1.
    

    I don't understand. Has anyone had the same problem? Does Twitter sometimes ban GitHub?

    Thanks for Minet, great tool!

    opened by stefw 6
  • Access denied

    Foreword: sorry, I am new to GitHub and I'm not sure this is the appropriate place to post my question. Is it?

    Hi. First, thank you for the tool, which will help me a lot in my research! I have a problem, which I think is not that complicated: when I run Minet to get the "friends" of the twitter_users contained in the data_users.csv file, I cannot access the file: "Permission Denied". I tried opening CMD as an Administrator but it didn't solve the problem. Can you help me?

    opened by jlbreeeez 6
  • error in installing pip install mineit

    Installing mineit via pip does not work. It says: "Collecting mineit Could not install packages due to an EnvironmentError: 404 Client Error: Not Found for url: https://pypi.org/simple/mineit/"

    Is this issue already solved?

    opened by moonisali 6
  • Twitter scrape: systematic TwitterGuestTokenError with v0.56.2 or v0.56.1

    As in #382, I am experiencing systematic TwitterGuestTokenError exceptions. This was not the case a few weeks ago. I have not tested versions other than 0.56.1 and 0.56.2.

    Looks like we need to review the twitter scrape heuristic. I will try to have a look later today or tomorrow.

    bug 
    opened by paulgirard 5
  • instagram

    • [ ] get comments from a post id: https://www.instagram.com/api/v1/media/POST_ID/comments/?can_support_threading=true&permalink_enabled=false
    • [x] get user info from username: https://i.instagram.com/api/v1/users/web_profile_info/?username=USERNAME
    • [ ] other route for posts associated with hashtag (more info but don't know how to change page): https://www.instagram.com/api/v1/tags/web_info/?tag_name=HASHTAG
    • [ ] get post info from post id: https://www.instagram.com/api/v1/media/POST_ID/info/
    • [ ] get post likers from post id (it seems that we can only have access to a limited number of them): https://www.instagram.com/api/v1/media/POST_ID/likers/

    Need 'cookie' and 'x-ig-app-id'

    enhancement 
    opened by MiguelLaura 0
Releases(0.66.1)
Owner
médialab Sciences Po
SciencesPo's médialab is an interdisciplinary research laboratory gathering engineers, designers & social science researchers.