Snowflake Notes
User Roles
● There are 4 main system-defined roles in Snowflake.
○ SYSADMIN - default role; can create warehouses, databases, and other objects
○ PUBLIC - a pseudo-role granted automatically to every user; has the fewest privileges
○ ACCOUNTADMIN - top-level role, with additional options for account administration, billing, and notifications
○ SECURITYADMIN - additional options for managing users, roles, and grants
Points to consider
● RDBMS: Databases with tables made up of rows and columns. These tables are
situated inside schemas.
● Schemas: They are organisational buckets. One schema can have multiple
tables.
● A user might have access to one schema but not another schema in the same
database. Also, a user might have access to one object in a schema but not
others.
● View: A saved SELECT statement that we can use as if it were a table. It is a named
definition of a query.
Warehouse
● A warehouse is a collection of computing power used for loading, unloading,
and querying data. It does NOT store/contain data; it's not a place.
● Each warehouse has a single cluster of servers. The number of servers in that
cluster varies based on the warehouse's size designation, so choosing the warehouse
size is the same as choosing the number of servers in the cluster.
● All Snowflake warehouses have access to all the data, all the time.
● A smaller warehouse costs less than a larger one running for the same amount
of time. Use 'Auto Suspend' to control costs.
● Scaling Up - Moving to a larger warehouse
● Scaling Down - Moving to a smaller warehouse
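A minimal sketch of scaling and cost control (the warehouse name is hypothetical):
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';   -- scale up
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';   -- scale down
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 60;          -- suspend after 60 s idle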
Staging
● Stages or Staging Areas are places to put things temporarily before moving
them to a more stable location, e.g. cloud folders or directories.
● There are two types of stages:
○ Internal Stages - act mostly like directories inside a Snowflake Account’s
local storage.
○ External Stages - act mostly like a secure gateway between Cloud Storage
Services and Snowflake Services.
● We can list the files in a stage with: LIST @stage_name
Training Sessions
Database
Collection of organised data for querying and transactional processing
Data warehouse
Collection of organised data from multiple sources for analytical processing rather than
transactional processing
Features
● Snowflake is a cloud data warehouse that runs entirely on cloud infrastructure
and cannot be run on private cloud or hosted infrastructure.
● It is available on AWS, Azure and Google Cloud.
● PK/FK constraints can be declared, but they are not enforced (only NOT NULL is enforced).
● SQL commands: DDL, DML, Stored Procedures (JS), UDF, SQL functions
(** Stored procedures are prepared codes which can be saved and reused)
● Views - Materialized and Non-Materialized
● ACID Transactions - Concurrency and Workload Separation
● Database and Object Cloning
● Database failover and replication facility between accounts
● Continuous data protection - Time Travel and Fail Safe
Integration Support
● Self-Service BI: Tableau, QlikView
● Big Data Tools: Kafka, Spark, Databricks
● JDBC/ODBC Drivers: for connecting applications (e.g., Java via JDBC) to Snowflake
● Languages: Python, Go, Node.js
Unique Offerings
● Scalability - both Storage and Compute
● User Experience - No indexing, no performance tuning, no partitioning, no
physical storage design
● Tunable pay per use - adjust costs
Architecture
It has 3 layers: the Service layer, the Compute layer and the Storage layer. The Compute
and Storage layers are billed pay-per-use, while the Service layer is charged only when its
usage exceeds a threshold (see Pricing).
Snowflake Objects
Why Snowflake?
On AWS deployments, it uses Amazon S3 as its Storage Layer.
Features of S3:
● High Availability
● Durability
● Uses API to read data parts (range based)
S3 does not read unnecessary volumes; it reads data in parts (via range-based API calls).
Snowflake takes the uncompressed data, divides it into micro-partitions, reorganises each
into columnar format (column values are stored together), and compresses the columns
individually. It then adds headers containing offsets (metadata) for easy access and
stores the result in S3.
Tables
Fail-safe by table type: permanent tables have a Fail-safe period; transient, temporary,
and external tables have no Fail-safe.
The CREATE TABLE ... LIKE option creates an empty table with the same metadata
(column definitions), while the CLONE option creates a copy of the entire table,
including its data.
Similar to the other table types (transient and permanent), temporary tables belong to a
specified database and schema; however, because they are session-based, they aren’t
bound by the same uniqueness requirements. This means you can create temporary
and non-temporary tables with the same name within the same schema.
However, note that the temporary table takes precedence in the session over any other
table with the same name in the same schema.
To create a temporary table, simply specify the TEMPORARY keyword (or TEMP
abbreviation) in CREATE TABLE. For example:
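A minimal sketch (the table and column names are hypothetical):
CREATE TEMPORARY TABLE mytemptable (id NUMBER, creation_date DATE);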
Transient tables are similar to permanent tables with the key difference that they
do not have a Fail-safe period. As a result, transient tables are specifically
designed for transitory data that needs to be maintained beyond each session (in
contrast to temporary tables), but does not need the same level of data
protection and recovery provided by permanent tables.
To create a transient table, schema, database, simply specify the TRANSIENT keyword
when creating the object. For example, to create a transient table:
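A minimal sketch (the table and column names are hypothetical):
CREATE TRANSIENT TABLE mytranstable (id NUMBER, creation_date DATE);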
The Time Travel retention period for a table can be specified when the table is created
or any time afterwards. Within the retention period, all Time Travel operations can be
performed on data in the table (e.g. queries) and the table itself (e.g. cloning and
restoration). The Fail-safe period is not configurable for any table type.
Because transient tables do not have a Fail-safe period, they provide a good option for
managing the cost of very large tables used to store transitory data. It is recommended
to use transient tables only for data that does not need to be protected against failures
or data that can be reconstructed outside of Snowflake.
External tables are read-only, therefore no DML operations can be performed on them;
however, external tables can be used for query and join operations. Views can be
created against external tables. Querying data stored external to the database is likely
to be slower than querying native database tables; however, materialized views based
on external tables can improve query performance.
Views
A view is a named definition of a query. Snowflake supports two types of views:
non-materialized (standard) views and materialized views.
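A minimal sketch of each (names are hypothetical; materialized views require Enterprise Edition or higher):
CREATE VIEW active_emp_v AS SELECT id, name FROM employees WHERE active = TRUE;
CREATE MATERIALIZED VIEW active_emp_mv AS SELECT id, name FROM employees WHERE active = TRUE;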
Virtual Warehouses
● A warehouse is a cluster of compute resources. It provides required resources
such as CPU, memory, and temporary storage to perform operations.
● The larger the warehouse, the more resources it has.
● While running, a virtual warehouse consumes Snowflake credits.
Access Control
● Determines who can access what and perform which operations on those
specified objects.
● Securable object: Entity to which access is granted.
● Role: Entity to which privileges are granted. Roles are then assigned to users.
● Privilege: level of access to an object.
● User - Snowflake user
● Snowflake has a combination of:
○ Discretionary Access Control (DAC) - owner has the privileges for the
objects, who can then grant access to others
○ Role-based Access Control (RBAC) - privileges are assigned to roles, and
roles are then assigned to users
● Roles:
○ Entities to which privileges on securable objects can be granted and
revoked
● Enforcement Model:
○ Every session has a 'current role', regardless of the access mechanism.
○ Here's how the current role is determined:
■ The role specified when the connection is established, if any
■ Otherwise, the user's default role
■ Otherwise, if no default role is set, PUBLIC
○ We can override the current role using the USE ROLE <Role_Name>
command.
○ There are no Root users (with full access) in Snowflake.
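A minimal RBAC sketch (role, object, and user names are hypothetical):
CREATE ROLE analyst;
GRANT USAGE ON DATABASE sales_db TO ROLE analyst;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;
GRANT ROLE analyst TO USER alice;
USE ROLE analyst;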
Data Types
Snowflake supports most SQL data types:
Structured data types have a predefined schema, while semi-structured data types lack
a predefined structure.
VARIANT:
● Can store values of any other data type, including ARRAY and OBJECT
● Max size is 16 MB compressed
● Any value can be implicitly cast to the VARIANT type
● Sub-columns of a VARIANT column in a relational table may be stored internally as
separate physical columns
● Non-native data types such as dates and timestamps are stored as strings when
loaded into a VARIANT column
ARRAY:
● Used to represent dense or sparse arrays of arbitrary size, where the index is a
non-negative integer (up to 2^31 - 1) and the values have VARIANT type
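A minimal sketch of storing and querying VARIANT data, including array access (names are illustrative):
CREATE TABLE events (v VARIANT);
INSERT INTO events SELECT PARSE_JSON('{"user": "alice", "tags": ["a", "b"]}');
SELECT v:user::STRING AS user_name, v:tags[0]::STRING AS first_tag FROM events;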
Semi-Structured Functions:
● OBJECT_KEYS
● Conversion/Casting
● Type Predicates
Flatten:
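FLATTEN is a table function that explodes a compound value (array or object) into rows. A minimal sketch:
SELECT f.value FROM TABLE(FLATTEN(INPUT => PARSE_JSON('[1,2,3]'))) f;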
Data Movement
● Copy Command
● File Formats
● Stages
○ Internal
○ External
● Loading/Unloading from local file system
● Loading/Unloading from external location
Copy Command
Used to load data from staged files on internal or external locations to an existing table
or vice versa.
Two types:
● Copy into <table> - Used to load data from internal/external stage to table
● Copy into <location> - Used to unload data from table into internal/external stage
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
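Minimal sketches of both directions (stage, table, and path names are hypothetical):
COPY INTO mytable FROM @my_stage/data.csv FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
COPY INTO @my_stage/unload/ FROM mytable;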
File Format
● A named file format object provides a convenient means to store all of the
format information required for loading data files into tables.
● We can set a wide variety of parameters to set attributes such as Compression,
File delimiters, Skipping headers, Date/Time format, Encoding, Null Handling, etc
for the data stored in staged files before ingestion.
● By specifying a named file format object (or individual file format options) for the
stage, it isn't necessary to later specify the same file format options in the COPY
command used to load data from the stage.
CREATE [ OR REPLACE ] FILE FORMAT [ IF NOT EXISTS ] <name>
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
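A minimal sketch (the format name is hypothetical):
CREATE FILE FORMAT my_csv_format TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1 COMPRESSION = 'GZIP';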
Stages
An intermediary space where we can upload the files so that we can use the COPY
command to load or unload files.
Types:
● Internal Stage
○ User Stage
○ Table Stage
○ Named Internal stage
■ Temporary
■ Permanent
● External Stage
Internal Stage:
What Each user has a Each table has a Internal stages are
stage allocated to stage allocated to it named database
them by default for by default for objects that we can
storing files. storing files. use in-place of user
and table stage.
16
They are
recommended
stages to load the
tables.
● For user and table stages, data can be loaded only via the SnowSQL CLI; for named
stages we can also use the console, but the console only supports files up to 50 MB.
Implementation:
CREATE TABLE patient (   -- table name assumed; the original line was lost
pname VARCHAR,
drug VARCHAR,
gender VARCHAR,
age INT);
CREATE STAGE patient_stage FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
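Usage sketch for loading through each internal stage type (the local file path is hypothetical; PUT runs from SnowSQL):
PUT file:///tmp/patients.csv @~;              -- user stage
PUT file:///tmp/patients.csv @%patient;       -- table stage
PUT file:///tmp/patients.csv @patient_stage;  -- named stage
COPY INTO patient FROM @patient_stage;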
External Stages
● Storage locations outside the Snowflake environment, in another cloud storage location.
● External stages store the files in an external location that is referenced by the stage. An
external stage specifies location and credential information, if required.
● Supported cloud storage services are: Amazon S3 buckets, Google Cloud Storage
buckets, and Microsoft Azure containers.
CREATE [ OR REPLACE ] [ TEMPORARY ] STAGE [ IF NOT EXISTS ]
<external_stage_name>
externalStageParams
[ COPY_OPTIONS = ( copyOptions ) ]
[ COMMENT = '<string_literal>' ]
Query Profile
It provides execution details for a query: a graphical representation of the main
components of the query's processing plan, with statistics for each component.
● Steps: If the query was processed in multiple steps, you can toggle between each
step.
● Operator Tree: The middle pane displays a graphical representation of all
operator nodes for the selected step, including the relationship between each
operator node.
● Node List: The middle pane includes a collapsible list of operator nodes by
execution time.
● Overview: The right pane displays an overview of the query profile. The display
changes to operator details when an operator node is selected.
Caching
Types of Cache:
● Result Cache: If the user fires a query identical to one that has already been run,
Snowflake returns the cached result. The following conditions must be met for
the result set to be reused:
○ The user must have the necessary privileges on the tables used in the query.
○ The new query must exactly match the previous one.
○ The underlying table data has not changed.
● Metadata Cache:
○ Metadata about Tables and Micro-partitions are collected and managed
by snowflake automatically.
○ Snowflake doesn't need to use compute to provide values such as MIN, MAX,
number of distinct values, NULL count, row count, and clustering information;
these come from the metadata cache.
● Warehouse Cache:
○ Every warehouse has attached SSD storage.
○ The SSD is used to cache micro-partitions pulled from the storage layer. While
processing a query, the SSD cache is scanned first, and only the missing data
is pulled from the storage layer.
-------------Caching---------------
-- Reconstructed over the TPC-H sample data (SNOWFLAKE_SAMPLE_DATA.TPCH_SF1);
-- the SELECT/FROM/GROUP BY lines were lost and are assumed.
SELECT R_NAME,
SUM(O_TOTALPRICE) AS TOTALPRICE,
COUNT(*) AS COUNT_ROWS
FROM CUSTOMER C
JOIN ORDERS O
ON C_CUSTKEY = O_CUSTKEY
JOIN NATION N
ON N_NATIONKEY = C_NATIONKEY
JOIN REGION R
ON R.R_REGIONKEY = N_REGIONKEY
GROUP BY R_NAME;
------------Result Cache-----------
-----------WAREHOUSE CACHE---------
-----------METADATA CACHE----------
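Hedged sketches of how each cache shows up (tables assume the TPC-H sample data):
-- Result cache: re-running the exact same query returns almost instantly from the cached result.
-- Warehouse cache: a second query touching the same micro-partitions reads them from local SSD instead of remote storage.
-- Metadata cache: answered from metadata, without using the warehouse at all:
SELECT COUNT(*), MIN(O_ORDERDATE), MAX(O_ORDERDATE) FROM ORDERS;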
Micro-Partitions
● Traditional data warehouses rely on static partitioning of large tables to achieve
acceptable performance and enable better scaling.
● Static partitioning has a number of well-known limitations, such as maintenance
overhead and data skew, which can result in disproportionately-sized partitions.
● In Snowflake, all data in tables is automatically divided into micro-partitions,
which are contiguous units of storage.
● Snowflake stores data in columnar form inside micro-partitions, and tables are
horizontally partitioned: all the columns of a given row are stored in the same
micro-partition.
Benefits:
● Derived automatically; they don't need to be defined or maintained by users.
● Small size (50-500 MB of uncompressed data) enables fine-grained pruning for
faster queries.
● Columns are stored independently within micro-partitions (columnar storage), so
only the columns referenced by a query are scanned.
● Columns are compressed individually.
Clustering
● As data is loaded/inserted into a table, clustering metadata is collected for each
micro-partition created during the process. Snowflake then leverages this
clustering information to avoid unnecessary scanning of micro-partitions.
● To improve the clustering of underlying table micro-partitions, you can always
manually sort rows on key table columns and reinsert them into the table;
however, performing these tasks can be cumbersome and expensive.
Benefits:
● Improved scan efficiency in queries by skipping data that does not match
filtering predicates.
● After a clustering key has been defined on a table, no additional administration is
required, unless you choose to drop or modify the key. All future maintenance on
the rows in the table (to ensure optimal clustering) is performed automatically by
Snowflake.
● Typically, queries benefit from clustering when the queries are filtered or sorted
on the clustering key for the table.
Note:
● Clustering can substantially improve the performance and reduce the cost of
some queries, but compute resources are used to perform clustering, which
consumes credits.
● Clustering keys are not intended for all the tables. The size as well as query
performance of a table should dictate whether to define a clustering key for the
table.
● In general, tables in the multi-terabyte (TB) range will experience the most benefit
from clustering, particularly if DML is performed regularly/continually on these
tables.
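A minimal sketch of defining and changing a clustering key (names are hypothetical):
CREATE TABLE sales (dt DATE, region STRING, amount NUMBER) CLUSTER BY (dt, region);
ALTER TABLE sales CLUSTER BY (region);   -- redefine the key later
ALTER TABLE sales DROP CLUSTERING KEY;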
Pricing
Snowflake’s pricing model includes only two items: storage and compute resources.
The charge for storage is per terabyte, compressed, per month. For the US, Snowflake's
storage costs begin at a flat rate of $23/TB of average compressed data per month,
accrued daily.
The charge for compute is based on processing units, which are called credits.
Compute charges are billed on actual usage, per second. For the US, compute costs
$0.00056 per second, per credit, for Snowflake On Demand Standard Edition.
● Credits are billed per second, with a 60-second (1 minute) minimum. Even if a
warehouse runs for only the first 30 seconds, we pay for the full first minute;
from the second minute onward, billing is per second.
● Each time a warehouse is started or resized to a larger size, the warehouse is
billed for 1 minute’s worth of usage.
● After 1 minute, all subsequent billing is per-second.
● Stopping and restarting a warehouse within the first minute does not change the
amount billed; the minimum billing charge is 1 minute.
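A worked example (assuming an X-Small warehouse, which consumes 1 credit per hour):
running it for 90 seconds bills the 60-second minimum plus 30 seconds = 90 seconds,
i.e. 90/3600 = 0.025 credits, or about $0.05 at $0.00056 per second.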
Usage for data storage is calculated on the daily average amount of data (in bytes)
stored in the system for:
● Files staged in Snowflake locations (i.e. user and table stages or internal named
stages) for bulk data loading/unloading (can be stored compressed or
uncompressed).
● Database tables, including historical data for Time Travel (always compressed by
Snowflake).
● Fail-safe for database tables (always compressed by Snowflake).
● Clones of database tables that reference data deleted in the table that owns the
clones.
Usage for cloud services is charged only if the daily consumption of cloud services
exceeds 10% of the daily compute usage.
If the total compute credits consumed in a day are 100, the free allowance for cloud
services is 10% of that, i.e. 10 credits. If 12 cloud services credits are used, only the
2 credits above the allowance are billed: 100 + 2 = 102.
Time Travel
Using Time Travel, you can perform the following actions within a defined period of
time:
● Query data in the past that has since been updated or deleted.
● Create clones of entire tables, schemas, and databases at or before specific
points in the past.
● Restore tables, schemas, and databases that have been dropped.
Once the time period has elapsed, the data is moved to Snowflake Fail Safe and these
actions can no longer be performed.
Time Travel SQL Extensions:
To support Time Travel, the following SQL extensions have been implemented:
● The AT | BEFORE clause, which can be specified in SELECT statements and
CREATE ... CLONE commands (with a TIMESTAMP, OFFSET, or STATEMENT
parameter to pinpoint the moment in time).
● The UNDROP command for tables, schemas, and databases.
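Minimal sketches (the table name and query ID are hypothetical):
SELECT * FROM mytable AT(OFFSET => -60*5);                -- state 5 minutes ago
SELECT * FROM mytable BEFORE(STATEMENT => '<query_id>');  -- state before a given statement
UNDROP TABLE mytable;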
Fail Safe:
● A means to recover your lost data in the event of a system failure or other
catastrophic event, e.g. a hardware failure or security breach
● Done by Snowflake (not self-service)
● Not to be used as a mode to access historical data
● Recovery may take from several hours to a few days
● Fail safe Period starts immediately after the time travel retention period ends.
Currently, when a database is dropped, the data retention period for child schemas or
tables, if explicitly set to be different from the retention of the database, is not honored.
The child schemas or tables are retained for the same period of time as the database.
To honor the data retention period for these child objects (schemas or tables), drop
them explicitly before you drop the database or schema.
If an object with the same name already exists, UNDROP fails. You must rename the
existing object, which then enables you to restore the previous version of the object.
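A sketch of the rename-then-restore flow (names are hypothetical):
ALTER TABLE mytable RENAME TO mytable_current;
UNDROP TABLE mytable;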
Data Sharing
Data flows from a Provider account, through Snowflake's services layer and metadata
store, to Consumer accounts.
For the consumer:
● Always up-to-date data
● No ETL
● No storage
● Pay only for what you use
Share:
Share data from one account and use the shared data from another account.
All sharing is accomplished through Snowflake's unique services layer and metadata
store.
Shared data does not take up any storage in a consumer account and thus it does not
contribute to the consumer’s monthly data storage charges.
Shares are named Snowflake objects that encapsulate all of the information required to
share a database. Each share consists of:
● The privileges that grant access to the database and the schema containing the
objects to share.
● The privileges that grant access to the specific objects in the database.
● The consumer accounts with which the database and its objects are shared.
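A provider/consumer sketch (account, share, and object names are hypothetical):
-- Provider side:
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;
-- Consumer side:
CREATE DATABASE shared_sales FROM SHARE provider_account.sales_share;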
Data Providers:
● Any Snowflake account that creates data shares and makes them available to
other Snowflake accounts to consume. Providers can share with one or more
Snowflake accounts.
● For each database you share, Snowflake supports using grants to provide
granular access control to selected objects in the database.
● Snowflake does not place any hard limits on the number of shares you can create
or the number of accounts you can add to a share.
Data Consumers:
● Any account that chooses to create a database from a share made available by a
data provider.
● Once a consumer adds a shared database to their account, they can access and
query the objects in the database just as they would with any other database in
their account.
● Snowflake does not place any hard limits on the number of shares you can
consume from data providers; however, you can only create one database per
share.
Reader Accounts:
Data sharing is only supported between Snowflake accounts. If the consumer doesn't
have a Snowflake account and/or is not ready to become a licensed Snowflake
customer, Snowflake lets providers create reader accounts to facilitate data sharing
with these consumers.
Similar to standard Snowflake accounts, the provider account uses shares to share
databases with reader accounts; a reader account can only consume data from the
provider account that created it.
● Data sharing is only supported between accounts in the same Snowflake region.
● Snowflake allows sharing data across regions when the data is replicated to the
consumer's region.
● Shared databases are read-only
● Following actions are not supported:
○ Creating a clone of a shared database or any schemas/tables in the
database.
○ Time travel for a shared database or any schemas/tables in the database.
○ Editing the comments for a shared database.
● A shared database and all the objects in it cannot be re-shared with other
accounts.
Marketplace
You can discover and access a variety of third-party datasets and have them available
directly in your Snowflake account to query without transformation, and join them with
your own data. If you need several different vendors for data sourcing, the
Data Marketplace gives you a single location from which to get the data.
Implementation:
Creating internal stages and loading data:
USE DATABASE GEM_DB;
CREATE TABLE patient (   -- table name assumed; the original line was lost
pname VARCHAR,
drug VARCHAR,
gender VARCHAR,
age INT);
CREATE STAGE patient_stage FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
CREATE FILE FORMAT product_format   -- format name assumed; the original line was lost
type = csv
record_delimiter = '\n'
field_delimiter = ','
skip_header = 1
empty_field_as_null = true
FIELD_OPTIONALLY_ENCLOSED_BY = '0x22';   -- for rows with commas in values
CREATE TABLE product (   -- table name assumed; the original line was lost
productName VARCHAR,
customerName VARCHAR,
param1 INT,
param2 INT,
productType VARCHAR);
CREATE STAGE product_stage   -- stage name assumed; the original line was lost
url = 's3://mysnowbucket16/product.csv'
credentials = (
AWS_KEY_ID = '********',
AWS_SECRET_KEY = '********');
CREATE WAREHOUSE my_wh WITH   -- warehouse name assumed; the original line was lost
warehouse_size = 'X-SMALL'
auto_suspend = 180
auto_resume = true
initially_suspended = true;
---- With a non-materialized view, re-running the SELECT statement takes the full time
again and again, since the output is not cached. -----
-------- Materialized views store precomputed results, so queries take less time when
they are re-run. --------
CREATE FUNCTION multiply_udf(v1 FLOAT, v2 FLOAT)   -- function name and the WHEN branch are assumed; the original lines were lost
RETURNS FLOAT
AS '
CASE
WHEN v1 IS NULL OR v2 IS NULL THEN 0
ELSE v1 * v2
END
';
CREATE FUNCTION check_id(ID FLOAT)   -- function name, return values, and branch bodies are assumed; the original lines were lost
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS $$
try {
if (ID < 0) {
return "negative";
} else {
return "non-negative";
}
} catch (err) {
return "error: " + err;
}
$$;
Flatten Implementation
select * from table(flatten(input => parse_json('[1, ,77]'))) f;
------Effect of Path------
------Effect of Outer------
------Effect of Recursive------
------Effect of Mode------
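Hedged sketches of each FLATTEN argument (the inputs are illustrative):
select * from table(flatten(input => parse_json('{"a":[1,2]}'), path => 'a')) f;                             -- path
select * from table(flatten(input => parse_json('[]'), outer => true)) f;                                    -- outer: emits a row even for empty input
select * from table(flatten(input => parse_json('{"a":{"b":1}}'), recursive => true)) f;                     -- recursive
select * from table(flatten(input => parse_json('{"a":1,"b":[2]}'), recursive => true, mode => 'ARRAY')) f;  -- mode: expand only arrays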
-- Reconstructed; the JSON values and the missing lines are assumed from the visible fragments.
create or replace table persons as
select column1 as id, parse_json(column2) as c
from values
(12712555,
'{ "name": { "first": "John", "last": "Smith" },
"contact": [
{ "business": [
{ "type": "phone", "content": "555-1234" },
{ "type": "email", "content": "j.smith@example.com" } ] } ] }'),
(98127771,
'{ "name": { "first": "Jane", "last": "Doe" },
"contact": [
{ "business": [
{ "type": "phone", "content": "555-1236" },
{ "type": "email", "content": "j.doe@example.com" } ] } ] }') v;
select id as "ID",
f.value as "Contact",
f1.value:type as "Type",
f1.value:content as "Details"
from persons p,
lateral flatten(input => p.c, path => 'contact') f,
lateral flatten(input => f.value:business) f1;
select id as "ID",
f.this:first as "First",
f.this:last as "Last"
from persons p,
lateral flatten(input => p.c, path => 'name') f;
Time Travel
create database new_db;
-----current data------
------cloning table------
------cloning schema------
------cloning db--------
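Hedged sketches of the commands these markers referred to (table/schema names are hypothetical):
select * from t1 at(offset => -60);         -- Time Travel: data as of 60 seconds ago
create table t1_clone clone t1;             -- cloning a table
create schema public_clone clone public;    -- cloning a schema
create database new_db_clone clone new_db;  -- cloning a database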