🐞 Douban Movie / Douban Book Scarpy

Overview

ScrapyDouban

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

The purpose of maintaining this project is to share some of my practice in the process of using Scrapy, the project covers about 80% of my knowledge of Scrapy, I hope to help friends who are learning Scrapy, please note that the current version of the project is Scrapy 2.5.0.

Docker


Project contains douban_scrapyd douban_db douban_adminer three containers.

The douban_scrapyd container is based on python:3.9-slim-buster, the default installed Python3 libraries are scrapy scrapyd pymysql pillow arrow, default mapping port 6800:6800 to facilitate user access to scrapyd management interface via host IP:6800, login required parameters, username:scrapyd password:public.

The douban_db container is based on mysql:8, root password is public, and the default initialization is to import the docker/mysql/douban.sql file to the douban database.

douban_adminer container is based on adminer:4, default mapping port 8080:8080 to facilitate users to access the database management interface through the host IP:8080, login required parameters, server:mysql username:root password:public.

Project SQL


The path to the SQL file used by the project is docker/mysql/douban.sql.

Collection Process


First collect Subject ID --> then crawl the detail page by Subject ID to collect data --> finally collect comments by Subject ID

method


$ git clone https://github.com/xjia77/ScrapyDouban.git
# Build and run containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# enter douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# enter scrapy content
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
# Grabbing movie data
$ scrapy crawl movie_subject # collect movie Subject ID
$ scrapy crawl movie_meta # collect movie data
$ scrapy crawl movie_comment # collect movie comment
# Grabbing book data
$ scrapy crawl book_subject # collect book Subject ID
$ scrapy crawl book_meta # collect book data
$ scrapy crawl book_comment # collect book comment

If you want to make changes to your code more easily while testing, you can mount your project in the scrapy directory to the douban_scrapyd container. If you are used to working with scrapyd, you can deploy your project directly to the douban_scrapyd container via scrapyd-client.

Proxy IP


Due to douban's anti-crawler mechanism, the only way to bypass it now is through a proxy IP. ProxyMiddleware middleware is not enabled in the default settings.py. If you really need to use Douban's data to do some research, you can go rent a paid proxy pool.

image download


douban.pipelines.CoverPipeline processes the cover download logic by filtering spider.name, and the save path of the downloaded image files is the /srv/ScrapyDouban/storage directory of the douban_scrapy container.

Owner
Xingbo Jia
~1 year of professional Experience as a Software Engineer with a background in web development data science. Actively interested in software engineering interns
Xingbo Jia
A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

2 Jun 06, 2022
Web Scraping OLX with Python and Bsoup.

webScrap WebScraping first step. Authors: Paulo, Claudio M. First steps in Web Scraping. Project carried out for training in Web Scrapping. The export

claudio paulo 5 Sep 25, 2022
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

2.3k Jan 04, 2023
Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit for

Dan Claudiu Pop 79 Nov 27, 2022
Web3 Pancakeswap Sniper bot written in python3

Pancakeswap_BSC_Sniper_Bot Web3 Pancakeswap Sniper bot written in python3, Please note the license conditions! The first Binance Smart Chain sniper bo

Treading-Tigers 295 Dec 31, 2022
Crawl BookCorpus

These are scripts to reproduce BookCorpus by yourself.

Sosuke Kobayashi 590 Jan 03, 2023
A web crawler script that crawls the target website and lists its links

A web crawler script that crawls the target website and lists its links || A web crawler script that lists links by scanning the target website.

2 Apr 29, 2022
Download images from forum threads

Forum Image Scraper Downloads images from forum threads Only works with forums which doesn't require a login to view and have an incremental paginatio

9 Nov 16, 2022
New World Market Scraper

Bean Seller A New Worlds market scraper. Deployment This must be installed on Windows as it uses the Windows api to do its stuff Install Prerequisites

4 Sep 21, 2022
A tool can scrape product in aliexpress: Title, Price, and URL Product.

Scrape-Product-Aliexpress A tool can scrape product in aliexpress: Title, Price, and URL Product. Usage: 1. Install Python 3.8 3.9 padahal halaman ins

Rahul Joshua Damanik 1 Dec 30, 2021
This program scrapes information and images for movies and TV shows.

Media-WebScraper This program scrapes information and images for movies and TV shows. Summary For more information on the program, read the WebScrape_

1 Dec 05, 2021
Web crawling framework based on asyncio.

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Requirements Python3.5+ Installation pip install gain pip install uvloo

Jiuli Gao 2k Jan 05, 2023
Shopee Scraper - A web scraper in python that extract sales, price, avaliable stock, location and more of a given seller in Brazil

Shopee Scraper A web scraper in python that extract sales, price, avaliable stock, location and more of a given seller in Brazil. The project was crea

Paulo DaRosa 5 Nov 29, 2022
Newsscraper - A simple Python 3 module to get crypto or news articles and their content from various RSS feeds.

NewsScraper A simple Python 3 module to get crypto or news articles and their content from various RSS feeds. πŸ”§ Installation Clone the repo locally.

Rokas 3 Jan 02, 2022
a way to scrape a database of all of the isef projects

ISEF Database This is a simple web scraper which gets all of the projects and abstract information from here. My goal for this is for someone to get i

William Kaiser 1 Mar 18, 2022
Danbooru scraper with python

Danbooru Version: 0.0.1 License under: MIT License Dependencies Python: = 3.9.7 beautifulsoup4 cloudscraper Example of use Danbooru from danbooru imp

Sugarbell 2 Oct 27, 2022
Telegram group scraper tool

Telegram Group Scrapper

Wahyusaputra 2 Jan 11, 2022
The first public repository that provides free BUBT website scraping API script on Github.

BUBT WEBSITE SCRAPPING SCRIPT I think this is the first public repository that provides free BUBT website scraping API script on github. When I was do

Md Imam Hossain 3 Feb 10, 2022
OSTA web scraper, for checking the status of school buses in Ottawa

OSTA-La-Vista OSTA web scraper, for checking the status of school buses in Ottawa. Getting Started Using a Raspberry Pi, download Python 3, and option

1 Jan 28, 2022