Libextract: extract data from websites


Libextract is a statistics-enabled data extraction library, written in Python, that works on HTML and XML documents. Originating from eatiht, the extraction algorithm makes one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.
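To make the "repetitive elements" assumption concrete, here is a minimal sketch that ranks parent elements by how often their most common child tag repeats. It is an illustration only, not libextract's actual implementation: the function name `rank_by_repetition` and the sample markup are hypothetical, and it uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def rank_by_repetition(markup, count=5):
    """Return up to `count` (frequency, parent_tag, child_tag) tuples,
    highest frequency first. Hypothetical helper, not part of libextract."""
    root = ET.fromstring(markup.strip())
    scored = []
    for parent in root.iter():
        # Count how many direct children share each tag name.
        tags = Counter(child.tag for child in parent)
        if tags:
            child_tag, freq = tags.most_common(1)[0]
            scored.append((freq, parent.tag, child_tag))
    # Parents with the most repeated children are the likeliest data holders.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:count]


# Made-up, well-formed sample document.
SAMPLE = """
<html><body>
  <div><p>one</p><p>two</p><p>three</p></div>
  <table><tr><td>a</td></tr><tr><td>b</td></tr></table>
</body></html>
"""

top = rank_by_repetition(SAMPLE)
print(top[0])  # the <div> holding three <p> children ranks first
```

In this sample the `div` with three `p` children outranks the `table` with two `tr` children, which is the intuition behind treating repetition as a signal for data.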
  

  
   Overview 

  
 
  
   
  libextract.api.extract(document, encoding='utf-8', count=5)
 
   
 
  
   
  Given an HTML document and, optionally, its encoding, return a list of the nodes most likely to contain data (5 by default).
 
   

 
  

  
   Installation 
pip install libextract
  

  
   Usage 
Because our definition of "data" is so simple, the library exposes a single method. Post-processing is up to you.

 
 
from requests import get
from libextract.api import extract

# Fetch the page, then ask libextract for the nodes most likely to hold data.
r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

  
Using lxml's built-in methods for post-processing: 

 
 
>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

  
The extraction algorithm is agnostic to the kind of content, so it handles tabular data the same way it handles article text:

 
 
height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))

  

 
 
>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]
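Since post-processing is left to the caller, one possible next step is to pair each row's cells with the header names. The sketch below is an assumption about how you might do that, not something libextract provides: `cell_text` and `table_to_records` are hypothetical helpers, and the sample table with its values is made up. It relies only on `iter()` and `itertext()`, which both lxml elements and the standard library's `ElementTree` elements support, so the stand-in table here is built with `ElementTree`.

```python
import xml.etree.ElementTree as ET


def cell_text(elem):
    """Concatenate all text inside an element and trim whitespace."""
    return "".join(elem.itertext()).strip()


def table_to_records(table):
    """Pair each data row's <td> cells with the table's <th> headers."""
    headers = [cell_text(th) for th in table.iter("th")]
    records = []
    for row in table.iter("tr"):
        cells = [cell_text(td) for td in row.iter("td")]
        if cells:  # skip rows without <td> cells, such as the header row
            records.append(dict(zip(headers, cells)))
    return records


# Hypothetical stand-in for a node returned by extract(); values are made up.
SAMPLE_TABLE = ET.fromstring(
    "<table>"
    "<tr><th>Country/Region</th><th>Average male height</th></tr>"
    "<tr><td>Exampleland</td><td>170 cm</td></tr>"
    "<tr><td>Samplestan</td><td>175 cm</td></tr>"
    "</table>"
)

records = table_to_records(SAMPLE_TABLE)
print(records[0]["Country/Region"])
```

With a real result you would pass `tabs[0]` instead of `SAMPLE_TABLE`. Real-world tables often have merged cells or header cells inside data rows, so treat this as a starting point.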

 
  

  
   Dependencies 
lxml
statscounter
  

  
   Disclaimer 
This project is still in its infancy, and any advice or suggestions as to what this library could and should be would be greatly appreciated :)
