A Python package that scrapes Google News article data while remaining undetected by Google.

Last update: Aug 10, 2022

Overview

googlenewsscraper

Getting Started

Installation

pip install GoogleNewsScraper

Reference

Importing

from GoogleNewsScraper import GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver)

Constructor Parameters

Name	Type	Required
driver	web driver	no

Possible values:

'chrome': The driver will default to use this package's chrome driver
A path to a driver (Firefox, for instance) stored on the user's system

Methods

This method is publicly accessible, but it is intended for internal use by the class

locate_html_element(self, driver, element, selector, wait_seconds)

Name	Type	Required	Description
driver	web driver	yes	A web driver (Chrome, Firefox, etc.)
element	string	yes	Id or class selector of an HTML element
selector	By constant	yes	Locator strategy imported from selenium (see below)
wait_seconds	int	no	Number of seconds to wait while locating the HTML element

To configure the 'selector' param:

First install selenium

pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

By.ID
By.CLASS_NAME
By.CSS_SELECTOR
By.LINK_TEXT
By.NAME
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.XPATH
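
To illustrate how a By value, an element string, and wait_seconds fit together, here is a minimal sketch built on selenium's WebDriverWait. The name locate_or_none, the default of 10 seconds, and the choice to return None on timeout are assumptions made for this example; this is not the package's actual locate_html_element implementation.

```python
def locate_or_none(driver, element, selector, wait_seconds=10):
    """Hypothetical helper: wait up to wait_seconds for one element, else return None."""
    # selenium is imported here so that defining the helper does not require it.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        # selector is a By constant (for example By.ID); element is the value to match.
        return WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((selector, element))
        )
    except TimeoutException:
        # Nothing matched before the deadline.
        return None
```

With a real driver, a call might look like locate_or_none(driver, 'main', By.ID, wait_seconds=5), where 'main' is a placeholder id.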

GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -> list or None

Name	Type	Required	Description
search_text	str	yes	One or more words to enter into the Google search engine
date_range	str	no	Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives
pages	str or int	no	Number of pages to scrape (defaults to 'max')
pagination_pause_per_page	int	no	Number of seconds to wait before each new page is scraped (defaults to 2). Increase this value if Google prevents you from scraping all pages.
cb	function	no	Callback that is called once per scraped page with all article data from that page (defaults to False)
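
Since a bad argument may only surface after a browser has started, it can help to check arguments first. The sketch below is one way to do that. The helper check_search_args is hypothetical and not part of the package. It assumes that 'max' is the only string accepted for pages and that date_range can simply be omitted, neither of which this README confirms.

```python
# Values copied verbatim from the date_range row of the table above.
DATE_RANGES = (
    "Past hours", "Past 24 hours", "Past week",
    "Past month", "Past year", "Archives",
)


def check_search_args(search_text, date_range=None, pages="max",
                      pagination_pause_per_page=2):
    """Hypothetical helper: validate arguments and return them as keyword arguments."""
    if not isinstance(search_text, str) or not search_text.strip():
        raise ValueError("search_text must be a non-empty string")
    if date_range is not None and date_range not in DATE_RANGES:
        raise ValueError(f"date_range must be one of {DATE_RANGES}")
    # Assumption: 'max' is the only valid string; bool is rejected although it is an int.
    is_positive_int = (
        isinstance(pages, int) and not isinstance(pages, bool) and pages >= 1
    )
    if pages != "max" and not is_positive_int:
        raise ValueError("pages must be 'max' or a positive integer")
    if pagination_pause_per_page < 0:
        raise ValueError("pagination_pause_per_page must not be negative")

    kwargs = {
        "search_text": search_text,
        "pages": pages,
        "pagination_pause_per_page": pagination_pause_per_page,
    }
    if date_range is not None:
        kwargs["date_range"] = date_range
    return kwargs
```

The returned dictionary could then be passed on as GoogleNewsScraper('chrome').search(**kwargs), assuming the method accepts these names as keyword arguments.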

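Example using 'cb' parameter:

from GoogleNewsScraper import GoogleNewsScraper

def handle_page_data(page_data: list):
  # Do something with page_data, which holds the articles from one page
  print(page_data)

# 'chrome' selects this package's Chrome driver; 'search text' is a placeholder
GoogleNewsScraper('chrome').search('search text', cb=handle_page_data)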
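
Because the callback receives one page at a time, it can persist results while the scrape is still running. The sketch below appends each article to a JSON Lines file. The factory make_jsonl_writer is a hypothetical name, and the code assumes that every article object is a JSON-serializable dict, which this README does not state explicitly.

```python
import json


def make_jsonl_writer(path):
    """Hypothetical factory: return a callback that appends articles to a JSON Lines file."""

    def handle_page_data(page_data: list) -> None:
        # Open in append mode so that earlier pages are kept if a later page fails.
        with open(path, "a", encoding="utf-8") as fh:
            for article in page_data:
                fh.write(json.dumps(article, ensure_ascii=False) + "\n")

    return handle_page_data
```

The result of make_jsonl_writer('articles.jsonl') would be passed as the cb argument; the file name is a placeholder.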

NOTE:

If no argument is provided for 'cb', the search method will return a two-dimensional list
Each inner list represents one page and contains an object of news article data for every news article on that page

Example of the data that every article-object will contain:

'id': A unique id for every article data object
'description': The preview description of the news article
'title': The title of the news article
'source': The source of the news article (New York Times, for instance)
'image_url': The url of the preview news article image
'url': A link to the news article
'date_time': A datetime string that represents when the article was published
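
When no callback is used, the two-dimensional return value usually needs flattening before further processing. The sketch below shows one way to do that and to count articles per source. The helpers flatten_pages and count_by_source are hypothetical, the sample data is invented for illustration, and the code assumes each article object can be indexed by the keys listed above.

```python
from collections import Counter
from itertools import chain


def flatten_pages(pages):
    """Turn the list of pages into one flat list of articles (None becomes [])."""
    # search() is documented as returning a list or None, so None is handled too.
    return list(chain.from_iterable(pages or []))


def count_by_source(articles):
    """Count how many articles each source contributed."""
    return Counter(article["source"] for article in articles)


# Invented sample in the documented shape: one inner list per scraped page.
sample_pages = [
    [{"id": 1, "title": "First", "source": "Source A"},
     {"id": 2, "title": "Second", "source": "Source B"}],
    [{"id": 3, "title": "Third", "source": "Source A"}],
]

articles = flatten_pages(sample_pages)
per_source = count_by_source(articles)
```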

Owner

Geminid Systems, Inc
