Web Downloader With Python

Overview

Web Downloader

Introduction

This module provides an API to download the components of a web page (HTML file, image files, CSS files, JavaScript files, and files linked by href) for a given input URL. The URL must start with 'http' or 'https'.

To process multiple URLs at once, list all the URLs you want to download in the file "urllist.txt", as shown below:

# Add the URL you want to download line by line(The url must start with 'http' or 'https' ):
# example: https://www.google.com
https://www.google.com
https://www.carousell.sg/
https://www.google.com/search?q=github&sxsrf=AOaemvJh3t5_h8H85AE8Ajbb1IMnBrRISA%3A1636698503535&source=hp&ei=hwmOYY6mHdGkqtsPq8S9sAY&iflsig=ALs-wAMAAAAAYY4Xl7GLWS16_xc2Q9XrG0p3q277DpkL&oq=&gs_lcp=Cgdnd3Mtd2l6EAEYADIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzINCC4QxwEQowIQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJ1AAWABgjgdoAXAAeACAAQCIAQCSAQCYAQCwAQo&sclient=gws-wiz
https://stackoverflow.com/questions/66022042/how-to-let-kubernetes-pod-run-a-local-script/66025424
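
The file format above (one URL per line, '#' for comments) could be read with a few lines of Python. This is a minimal sketch and not the code in webDownload.py; the function name read_url_list is hypothetical, and the check for 'http://' or 'https://' is an assumption that is slightly stricter than the "starts with 'http'" rule stated above.

```python
from typing import Iterable, List


def read_url_list(lines: Iterable[str]) -> List[str]:
    """Return the usable URLs from the lines of a urllist.txt-style file."""
    urls = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith('#'):
            continue
        # Assumption: only keep lines with an explicit http(s) scheme.
        if line.startswith(('http://', 'https://')):
            urls.append(line)
    return urls


# Usage with in-memory lines; an open file object works the same way.
sample = [
    "# example: https://www.google.com\n",
    "\n",
    "https://www.google.com\n",
    "ftp://not-supported.example\n",
    "http://a.b\n",
]
print(read_url_list(sample))
```

Passing an open file handle (for example, open('urllist.txt')) instead of the sample list should work unchanged, since the function only iterates over lines.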

Program Setup

Development Environment : python 3.7.4
Additional Lib/Software Needed
  1. beautifulsoup4 4.10.0

    install:

    pip install beautifulsoup4
    

    Lib link: https://pypi.org/project/beautifulsoup4/

Hardware Needed : None
Program File List

version: v0.1

Program File      Execution Env   Description
webDownload.py    python 3        Main executable program that uses the API.
urllist.txt       -               URL record list.

Program Usage

Module API Usage
  1. Downloader init:
soup = urlDownloader(imgFlg=True, linkFlg=True, scriptFlg=True)
  • imgFlg: Set to "True" to download all the files referenced by "<img>" tags.
  • linkFlg: Set to "True" to download all the HTML sections, images, icons, and CSS files imported by "<link>" tags (href).
  • scriptFlg: Set to "True" to download all the JS files.
  2. Call the API method savePage to scrape the URL and save the data in a folder:

    soup.savePage('<url>', '<folder name>')

    # Example:
    soup.savePage('https://www.google.com', 'www_google_com')
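
To illustrate what the three flags select, here is a rough sketch of how the resource URLs of a page might be collected. It is not the module's implementation: the module depends on beautifulsoup4, while this sketch swaps in the standard-library html.parser so it runs without extra packages. The class name ResourceCollector and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of <img>, <link> and <script> resources."""

    def __init__(self, base_url, imgFlg=True, linkFlg=True, scriptFlg=True):
        super().__init__()
        self.base_url = base_url
        # Map each enabled tag name to the attribute that holds its URL.
        self.wanted = {}
        if imgFlg:
            self.wanted['img'] = 'src'
        if linkFlg:
            self.wanted['link'] = 'href'
        if scriptFlg:
            self.wanted['script'] = 'src'
        self.found = {tag: [] for tag in self.wanted}

    def handle_starttag(self, tag, attrs):
        attr_name = self.wanted.get(tag)
        if attr_name is None:
            return  # Tag type is disabled or not a resource tag.
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative references against the page URL.
            self.found[tag].append(urljoin(self.base_url, value))


html = (
    '<html><head><link href="/a.css">'
    '<script src="js/app.js"></script></head>'
    '<body><img src="https://cdn.example.com/logo.png">'
    '<script>var x = 1;</script></body></html>'
)
collector = ResourceCollector('https://www.example.com/page/')
collector.feed(html)
print(collector.found)
```

Inline scripts without a src attribute are skipped because there is no file to download for them.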
    
          
         
Program Execution
  1. Copy the URLs you want to check into the URL record file "urllist.txt".

  2. cd to the program folder and run the execution command:

    python webDownload.py
    
  3. Check the result:

    For example, if "https://www.carousell.sg/" is the first URL in "urllist.txt", all of its HTML, image, and JS files will be saved under the folder "1_www.carousell.sg_files":

    • The main web page will be saved as: "1_www.carousell.sg_files/1_www.carousell.sg.html"
    • The images used in the page will be saved in folder: "1_www.carousell.sg_files/img"
    • The HTML/image/CSS files imported by href will be saved in folder: "1_www.carousell.sg_files/link"
    • The JS files used by the page will be saved in folder: "1_www.carousell.sg_files/script"
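
The naming pattern in the example above can be reproduced with a small helper. This is only a sketch inferred from that single example: it assumes the folder name is built from the URL's position in the list and its host name, and the function name output_layout is hypothetical. How the program names folders for URLs with longer paths or query strings is not stated here.

```python
from urllib.parse import urlparse


def output_layout(index, url):
    """Return the output paths for the URL at the given 1-based list position."""
    host = urlparse(url).netloc  # e.g. 'www.carousell.sg'
    base = f"{index}_{host}"
    folder = f"{base}_files"
    return {
        'folder': folder,
        'page': f"{folder}/{base}.html",
        'img': f"{folder}/img",
        'link': f"{folder}/link",
        'script': f"{folder}/script",
    }


layout = output_layout(1, 'https://www.carousell.sg/')
for name, path in layout.items():
    print(name, path)
```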

Problem and Solution

Problem[0]: Downloaded files are slightly different

Why is there a slight difference between the files downloaded by this program and the files saved with a web browser's "Save page as" for the same URL, such as www.google.com?

OS Platform : n.a

Error Message: n.a

Type: n.a

Solution:

This is normal, because web scraping and browser display follow different logic. If different people open www.google.com in their own browsers, the pages they see also differ. The browser cache, tokens in local storage, and cookies all influence the "GET" request, which is why each person sees their own Gmail icon in the top right corner. If you clear the cache, local storage tokens, and cookies of your browser and then use "Save page as", the saved file should be the same as the one downloaded by the program.
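
The effect of cookies on the request can be shown without touching the network. The sketch below builds two "GET" requests for the same URL with the standard-library urllib.request: one as a script with no stored state would send it, and one carrying a cookie as a logged-in browser would. The helper name build_request and the cookie value are made up for illustration.

```python
from urllib.request import Request


def build_request(url, cookie=None):
    """Build a GET request, optionally carrying a Cookie header."""
    req = Request(url, method='GET')
    if cookie:
        req.add_header('Cookie', cookie)
    return req


# A fresh script has no cookies, so the server sees an anonymous visitor.
anonymous = build_request('https://www.google.com')
# A browser with a saved session sends its cookie with the same URL.
logged_in = build_request('https://www.google.com',
                          cookie='SID=example-session-token')

print(anonymous.header_items())
print(logged_in.header_items())
```

Because the server receives different headers, it is free to return different HTML for the two requests, even though the URL is identical.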

Problem[1]: Some downloaded images are empty

OS Platform : n.a

Error Message: n.a

Type: n.a

Solution:

If a website stores its images on third-party storage that requires authorization before download, the program's download request will be rejected and it gets 'null' when downloading the file. The saved image will then be empty.
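
One possible way to avoid leaving empty image files on disk is to check the payload before writing it. This is a sketch of that idea, not behavior the program is known to have; the function name save_if_not_empty is hypothetical.

```python
import os
import tempfile


def save_if_not_empty(data, path):
    """Write data to path only when it has content; return True if saved."""
    if not data:
        # Covers both None and b'' (for example, a rejected download).
        return False
    with open(path, 'wb') as fh:
        fh.write(data)
    return True


# Usage: one payload with content and one rejected (empty) download.
with tempfile.TemporaryDirectory() as tmp:
    ok = save_if_not_empty(b'\x89PNG', os.path.join(tmp, 'logo.png'))
    skipped = save_if_not_empty(b'', os.path.join(tmp, 'blocked.png'))
    saved_files = sorted(os.listdir(tmp))

print(ok, skipped, saved_files)
```

A caller could log the skipped URLs so that resources needing authorization are easy to find afterwards.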


Last edit by LiuYuancheng([email protected]) at 13/11/2021
