Snowflake database loading utility with Scrapy integration

Overview

Snowflake Stage Exporter

Snowflake database loading utility with Scrapy integration.
Meant for streaming ingestion of JSON serializable objects into Snowflake stages and tables.

Installation

> pip install git+https://github.com/hermit-crab/snowflake-stage-exporter.git

Basic example

from snowflake_stage_exporter import SnowflakeStageExporter

with SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    table_path='MY_DATABASE.PUBLIC.{item_type_name}',
) as exporter:
    exporter.export_item({'name': 'Jack', 'salary': 100}, item_type_name='employee')
    exporter.export_item({'name': 'Sal', 'salary': 90, 'extra_info': {'age': 20}}, item_type_name='employee')
    exporter.export_item({'title': 'Steel Grill', 'price': 5.5}, item_type_name='product')
    exporter.finish_export()  # flushes all stage buffers, creates tables and populates them with data inside stages

After you call finish() 2 tables will be created: EMPLOYEE (2 rows, 3 columns) and PRODUCT (1 row, 2 columns) located inside database MY_DATABASE and database schema PUBLIC (Snowflake default database schema).

Same thing achieved via Scrapy integration.

How this works

For each object that you feed into the exporter it will write it into a local buffer (temporary JSON file). Once a configurable maximum buffer size is reached the file is uploaded to Snowflake internal stage via PUT statement. Upon the end of the execution exporter will create all specified tables then instruct Snowflake to populate each table from every staged JSON file via COPY INTO

statements

  • If you output to multiple tables then a buffer is maintained for each.
  • Alternatively you can create / populate tables as soon as the buffers are flushed via *_on parameters described below.
  • *_on parameters also allow you to disable any table creation / population and just deal with the stages yourself.
  • For table creation the exporter will try to figure out column types dynamically during execution, otherwise you can pass them explicitly via parameter.

Why "Stages"?

The use of local buffers and stages opposed to typical SQL INSERT statements is motivated largely by Snowflake performance implications and their billing model (see https://community.snowflake.com/s/question/0D50Z00008JpBymSAF/implications-of-multiple-insert-statements-vs-copy-into).

An illustrative example can be a long running Scrapy / ScrapyCloud job that constantly outputs data. If the job was to keep the connection constantly executing the INSERTs - Snowflake would also keep the warehouse running / consuming the credits for the entire duration of the job.

Secondary consideration was for allowing the user to be able to work with purely just the stages like one would work with S3 or similar blob file storage. This covers cases when user would needs their own sophisticated table management approach and simply wants a convenient place to store raw data.

IMPORTANT NOTE: it won't make much sense to use this library if you're already working with S3 or similar storages (including just local machine) where your data is one of the Snowflake supported file formats. Snowflake has built-in support of ingesting several 3rd party blob storages and for local files you can upload them via PUT statements.

Configuration

All of the configurations are done via arguments to main exporter class SnowflakeStageExporter.

  • user/password/account - Snowflake account credentials, passed as is to snowflake.connector.connect.
  • connection_kwargs - any additional parameters to snowflake.connector.connect.
  • table_path - table path to use.
    • If you specify database / database schema in connection_kwargs you won't need to specify them in the table path.
    • The path can include template variables which are expanded when you feed an item to exporter. By default only item variable is passed (e.g. 'MY_DB.PUBLIC.TABLE_{item[entity_type]}' here it's assumed all of your items have "entity_type" field).
    • Any additional variables you can pass yourself as keyword arguments when calling exporter.export_item().
    • Additionally in Scrapy integration the following fields are passed:
      • spider - spider instance.
      • item_type_name - type(item).__name__. In the basic example above you passed this explicitly yourself.
  • stage - which internal stage to use. By default user stage ("@~") is used.
  • stage_path - naming for the files being uploaded to the stage.
    • By default it's "{table_path}/{instance_ms}_{batch_n}.jl" where table_path is table_path with all variables resolved, instance_ms epoch milliseconds when exporter was instantiated and batch_n being sequential number of the buffer.
    • In Scrapy integration by default this is "{table_path}/{job}/{instance_ms}_{batch_n}.jl" where job is the key of the ScrapyCloud job or "local" if spider ran locally.
  • max_file_size - maximum buffer size in bytes. 1GiB by default.
  • predefined_column_types - dictionary of table_path to Snowflake columns types for table creation.
    • e.g. {"MY_DB.PUBLIC.PRODUCT": {"title": "STRING", "price": "NUMBER"}, "MY_DB.PUBLIC.EMPLOYEE": {"name": "STRING", "salary": "NUMBER", "extra_info": "OBJECT"}}.
  • ignore_unexpected_fields - ignore fields not passed in predefined_column_types during table creation / population.
    • True by default but only takes effect when table does have predefined column types.
    • The data is still exported in full to the staged files.
  • allow_varying_value_types - if False during table creation / population skip columns that had multiple value types.
    • True by default. VARIANT type is assigned to such column.
    • Error is logged when False and such column is encountered.
    • Takes effect only when there is a need for exporter to figure out the column type.
    • The data is still exported in full to the staged files.
  • create_tables_on - one of "finish/flush/never". "finish" by default. "flush" is for each time a file is staged.
  • populate_tables_on - ditto.
  • clear_stage_on - same as above but "never" is default. Each file is removed from stage individually when enabled.

Configuration (Scrapy)

All of the exporter instance parameters are exposed as Scrapy settings like SNOWFLAKE_ (e.g. SNOWFLAKE_MAX_FILE_SIZE).

Once a Scrapy job ends, all remaining buffers are flushed. If the job outcome is not "finished" (something went wrong) then no table creation / table population / stage clear takes place.

TODO

  • Unit tests >_>.
  • Test on windows?
.
Owner
Oleg T.
Oleg T.
The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

The open-source web scrapers that feed the Los Angeles Times' California coronavirus tracker. Processed data ready for analysis is available at datade

Los Angeles Times Data and Graphics Department 51 Dec 14, 2022
Instagram_scrapper - This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or excel file easily.

Instagram_scrapper This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or exce

Lakhdar Belkharroubi 5 Oct 17, 2022
淘宝茅台抢购最新优化版本,淘宝茅台秒杀,优化了茅台抢购线程队列

淘宝茅台抢购最新优化版本,淘宝茅台秒杀,优化了茅台抢购线程队列

MaoTai 118 Dec 16, 2022
Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

1 Jan 28, 2022
Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes.

Pyrics Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes. ./test/run.py provides the full function in terminal cmd

MisterDK 1 Feb 12, 2022
Simple proxy scraper made by using ProxyScrape's api.

What is Moon? Moon is a lightweight and fast proxy scraper made by using ProxyScrape's api. What can i do with this? You can use proxies for varietys

1 Jul 04, 2022
Tool to scan for secret files on HTTP servers

snallygaster Finds file leaks and other security problems on HTTP servers. what? snallygaster is a tool that looks for files accessible on web servers

Hanno Böck 2k Dec 28, 2022
Python framework to scrape Pastebin pastes and analyze them

pastepwn - Paste-Scraping Python Framework Pastebin is a very helpful tool to store or rather share ascii encoded data online. In the world of OSINT,

Rico 105 Dec 29, 2022
A database scraper created with mechanical soup and sqlite

WebscrapingDatabases a database scraper created with mechanical soup and sqlite author: Mariya Sha Watch on YouTube: This repository was created to su

Mariya 30 Aug 08, 2022
tweet random sand cat pictures

sandcatbot setup pip3 install --user -r requirements.txt cp sandcatbot.example.conf sandcatbot.conf vim sandcatbot.conf running the first parameter i

jess 8 Aug 07, 2022
Twitter Eye is a Twitter Information Gathering Tool With Twitter Eye

Twitter Eye is a Twitter Information Gathering Tool With Twitter Eye, you can search with various keywords and usernames on Twitter.

Jolanda de Koff 19 Dec 12, 2022
🕷 Phone Crawler with multi-thread functionality

Phone Crawler: Phone Crawler with multi-thread functionality Disclaimer: I'm not responsible for any illegal/misuse actions, this program was made for

Kmuv1t 3 Feb 10, 2022
Subscrape - A Python scraper for substrate chains

subscrape A Python scraper for substrate chains that uses Subscan. Usage copy co

ChaosDAO 14 Dec 15, 2022
Scrapes Every Email Address of Every Society in Every University

society-email-scrape Site Live at https://kcsoc.github.io/society-email-scrape/ How to automatically generate new data Go to unis.yml Add your uni Cre

Krishna Consciousness Society 18 Dec 14, 2022
联通手机营业厅自动做任务、签到、领流量、领积分等。

联通手机营业厅自动完成每日任务,领流量、签到获取积分等,月底流量不发愁。 功能 沃之树领流量、浇水(12M日流量) 每日签到(1积分+翻倍4积分+第七天1G流量日包) 天天抽奖,每天三次免费机会(随机奖励) 游戏中心每日打卡(连续打卡,积分递增至最高

2k May 06, 2021
LSpider 一个为被动扫描器定制的前端爬虫

LSpider LSpider - 一个为被动扫描器定制的前端爬虫 什么是LSpider? 一款为被动扫描器而生的前端爬虫~ 由Chrome Headless、LSpider主控、Mysql数据库、RabbitMQ、被动扫描器5部分组合而成。

Knownsec, Inc. 321 Dec 12, 2022
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

IST Research 1.1k Jan 06, 2023
Kusonime scraper using python3

Features Scrap from url Scrap from recommendation Search by query Todo [+] Search by genre Example # Get download url from kusonime import Scrap

MhankBarBar 2 Jan 28, 2022
Scraping Thailand COVID-19 data from the DDC's tableau dashboard

Scraping COVID-19 data from DDC Dashboard Scraping Thailand COVID-19 data from the DDC's tableau dashboard. Data is updated at 07:30 and 08:00 daily.

Noppakorn Jiravaranun 5 Jan 04, 2022