A training task for web scraping using Python multithreading and a real-time-updated list of available proxy servers.

Last update: Feb 10, 2022

Overview

Parallel web scraping

The project is a training task for web scraping using Python multithreading and a real-time-updated list of available proxy servers.

Goal

The script extracts the names and prices of the Top-100 crypto coins and stores the data in a database.

Disclaimer

The task is quite contrived and serves mainly for study purposes. There are numerous mature sources of both real-time and historical cryptocurrency data.

Solved problems within the project

Multiple pages with one level of nesting are scraped. The crawl gathers internal links from the main page and then loops over them.
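
The link-gathering step described above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the `LinkCollector` class, the `internal_links` helper, the `example.com` URLs, and the inline HTML are all hypothetical stand-ins.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, base_url):
    """Return absolute same-host links found in html, deduplicated, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if urlparse(url).netloc == host and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical main page; a real run would download it first.
MAIN_PAGE = """
<a href="/coins/alpha">Alpha</a>
<a href="/coins/beta">Beta</a>
<a href="https://other.example/ad">Ad</a>
<a href="/coins/alpha">Alpha again</a>
"""

links = internal_links(MAIN_PAGE, "https://example.com/")
for link in links:
    print(link)  # the per-page scraping step would go here
```

Because only one level of nesting is needed, the loop does not follow links found on the nested pages themselves.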
To avoid getting banned by the remote server, requests are routed through proxy servers.
Free public proxy servers are generally considered unreliable in terms of availability. To overcome this issue:
- a separate scraping script extracts a list of free public proxy servers from a website.
- on each launch of the script, the list of 10 proxy servers is refreshed with currently available proxy servers.
- during script execution, some proxy servers become unavailable. Each scraping query therefore walks through this list until it finds a live proxy server to execute the query.
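
The refresh and failover logic in the list above might look something like the sketch below. All names here (`refresh_proxies`, `fetch_via_proxies`, `fetch_with_urllib`, `fake_fetch`) and the proxy addresses are hypothetical; the project may structure this differently. The download function is passed in as a parameter so the failover loop can be exercised without network access.

```python
import urllib.request


def refresh_proxies(candidates, is_alive, limit=10):
    """Keep the first `limit` candidates that pass the availability check."""
    alive = []
    for proxy in candidates:
        if is_alive(proxy):
            alive.append(proxy)
            if len(alive) == limit:
                break
    return alive


def fetch_via_proxies(url, proxies, fetch):
    """Try each proxy in order and return the first successful response.

    `fetch(url, proxy)` is expected to raise OSError when the proxy fails.
    """
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except OSError:
            continue  # this proxy is unavailable, try the next one
    raise RuntimeError(f"no working proxy for {url}")


def fetch_with_urllib(url, proxy, timeout=5):
    """One possible real `fetch`; urllib.error.URLError is an OSError subclass."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


# Offline demonstration with a fake fetch function and made-up addresses.
DEAD = {"http://10.0.0.1:8080"}


def fake_fetch(url, proxy):
    if proxy in DEAD:
        raise OSError("proxy unreachable")
    return f"{url} via {proxy}"


candidates = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
proxies = refresh_proxies(candidates, is_alive=lambda proxy: True)
result = fetch_via_proxies("https://example.com/coins/alpha", proxies, fake_fetch)
print(result)
```

In a real run, `fetch_with_urllib` (or an equivalent built on another HTTP client) would replace `fake_fetch`, and `is_alive` would perform a short test request through the proxy.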
Multithreading is used to speed up the scraping of the 101 web pages in total. The work is divided among 4 threads running almost simultaneously.
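
One way to divide the pages among 4 threads is a thread pool, sketched below. This uses `concurrent.futures.ThreadPoolExecutor` as an assumption; the project may instead manage `threading.Thread` objects by hand. The `scrape_page` function and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_page(url):
    """Placeholder for the real download-and-parse step."""
    return url, len(url)


# Hypothetical per-coin page URLs.
urls = [f"https://example.com/coins/{i}" for i in range(1, 101)]

# Four worker threads pull URLs from a shared queue until none are left.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(scrape_page, urls))

print(len(results), "pages processed")
```

Threads suit this workload because most of the time is spent waiting on network responses, so the interpreter lock is rarely the bottleneck. `pool.map` returns results in the same order as the input URLs.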
The extracted data is written directly to a database.
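
The database engine is not named here, so the following sketch assumes SQLite through the standard `sqlite3` module. The `coins` table, its columns, the `save_coin` helper, and the sample values are illustrative only.

```python
import sqlite3

# In-memory database for the demonstration; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS coins (name TEXT PRIMARY KEY, price REAL)")


def save_coin(conn, name, price):
    """Insert a coin, or overwrite its price if the name already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO coins (name, price) VALUES (?, ?)",
            (name, price),
        )


save_coin(conn, "Alpha", 1.5)
save_coin(conn, "Beta", 0.25)
save_coin(conn, "Alpha", 2.0)  # a later scrape replaces the earlier price

rows = conn.execute("SELECT name, price FROM coins ORDER BY name").fetchall()
print(rows)
```

When several threads write at once, each thread would typically open its own connection, or writes would be funneled through a single thread, since a `sqlite3` connection is by default bound to the thread that created it.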

Owner

Kushal Shingote

FilmMikirAPI - A simple REST API for scraping the Kincir website using Python and Flask

A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

This is a web crawler that works on employ email data by gmane.org and visualizes it in different ways.

A database scraper created with mechanical soup and sqlite

LSpider - a front-end crawler customized for passive scanners

Scrapy-soccer-games - Scraping information about soccer games from a few websites

Simple tool to scrape and download cross country ski timings and results from live.skidor.com

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

A simple app to scrape data from Twitter.

A crawler that gets new products with the maximum discount from the banimode website

Telegram group scraper tool

A package designed to scrape data from Yahoo Finance.

Explore scraping with BeautifulSoup!

Binance Smart Chain Contract Scraper + Contract Evaluator

Script to scrape user data such as "id,username,fullname,followers,tweets .. etc" via Twitter's search engine.

A simple flask application to scrape gogoanime website.

HTML Content / Article Extractor, web scraping lib in Python

A social networking service scraper in Python

Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Console application for downloading images from Reddit in Python