Seeing the snow for the first time — Snowflake Adventures (Part 1)

The architecture of Snowflake is a hybrid of the traditional shared-disk and shared-nothing architectures. How so? According to the Snowflake documentation, Snowflake processes queries using MPP (massively parallel processing) compute clusters, where each node stores a portion of the data locally, while a central data repository holds the persisted data and is accessible from all compute nodes.

Source: Ben Stopford

Digging into Snowflake a bit more, it has a unique architecture consisting of three layers: Database Storage, Query Processing, and Cloud Services.

Source: Snowflake Computing

Data ingested into Snowflake is intelligently compressed, reorganised into a columnar format, and stored in cloud storage (AWS S3, Azure Blob Storage, or, soon, Google Cloud Storage). To protect against failure, it is replicated three or more ways.
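To make this concrete, here is a minimal sketch, using the snowflake-connector-python library, of loading a local file into a table; Snowflake then compresses, columnarises, and replicates it behind the scenes. The connection parameters, file path, and table name are placeholders, not anything from a real setup.

```python
# Minimal sketch: loading a local CSV into Snowflake, which then compresses it
# and stores it as columnar micro-partitions behind the scenes.
# All connection parameters and object names below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="my_password",    # placeholder
    warehouse="LOAD_WH",       # placeholder warehouse
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Hypothetical landing table
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (id INT, ts TIMESTAMP, payload VARCHAR)")

# Upload the file to the table's internal stage, then load it.
cur.execute("PUT file:///tmp/events.csv @%raw_events")   # @%<table> = table stage
cur.execute("COPY INTO raw_events FROM @%raw_events FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")

cur.close()
conn.close()
```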

Micro-Partition

Snowflake stores this newly created compressed columnar data as micro-partitions in a format called FDN (Flocon De Neige, snowflake in French). Each micro-partition has a maximum size of about 16 MB compressed (50 MB to 500 MB uncompressed), and they all live on logical storage within Snowflake, accessible only through queries, never directly. Because micro-partitions are immutable by design, they enable cool features such as Time Travel and Zero-Copy Cloning.
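As a quick illustration of what those immutable micro-partitions buy you, here is a hedged sketch of a Time Travel query with the Python connector; the connection details, warehouse, and the raw_events table are hypothetical.

```python
# Sketch: querying a previous version of a table via Time Travel.
# This works because immutable micro-partitions from before a change are retained.
# Connection details, warehouse, and table name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password",
                                   warehouse="ANALYTICS_WH")
cur = conn.cursor()

# State of the table as it was 15 minutes ago (OFFSET is in seconds).
cur.execute("SELECT COUNT(*) FROM demo_db.public.raw_events AT (OFFSET => -60 * 15)")
print(cur.fetchone())

# Or restore an accidentally dropped table from its retained micro-partitions:
# cur.execute("UNDROP TABLE demo_db.public.raw_events")

cur.close()
conn.close()
```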

Snowflake creates these units of storage automatically, partitioning the data in the order it is ingested. Remember how traditional data warehouses rely on static partitioning of large tables for performance and scaling, and how that often leads to wildly variable partition sizes? Snowflake's micro-partitioning deals with that issue as well, sweet!

Query performance is very important in Snowflake, and the way it intelligently creates these micro-partitions plays a big part in that.

Data Clustering

In the real world, data is usually loaded into the DWH sequentially, ordered by sequential IDs or timestamps, and Snowflake uses this natural clustering to co-locate column data with the same or similar values. As you can see below, co-locating the same values results in non-overlapping micro-partitions and the least overlap depth. This improves query performance because Snowflake can skip (prune) micro-partitions instead of scanning them unnecessarily.

Source: Snowflake Computing

You can define your own clustering key if you want, but assess the situation closely first: this should only be done for very large tables, and reclustering can be expensive, so be cautious. Please visit the docs for more information.
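If you do decide a clustering key is warranted, the change itself is a one-liner. Below is a sketch assuming a hypothetical sales table and placeholder connection details.

```python
# Sketch: defining an explicit clustering key on a hypothetical large table.
# Only worth it for very large tables; reclustering consumes credits.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Cluster on the columns most queries filter on (here: event date + customer).
cur.execute("ALTER TABLE demo_db.public.sales CLUSTER BY (sale_date, customer_id)")

# To undo it later:
# cur.execute("ALTER TABLE demo_db.public.sales DROP CLUSTERING KEY")

cur.close()
conn.close()
```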

Metadata

Snowflake also maintains clustering metadata, which contains information such as the total number of micro-partitions in the table, the number of overlapping micro-partitions, and the depth of the overlaps.

It also maintains micro-partition metadata, storing information such as table statistics (row count, size) and micro-partition statistics (MIN/MAX column values, distinct count, NULL count). You can view these without spinning up any Snowflake compute.

This metadata is stored in the Cloud Services layer.
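Here is a sketch of peeking at that metadata, again with placeholder connection details and a hypothetical sales table; SYSTEM$CLUSTERING_INFORMATION is a built-in Snowflake function that returns the clustering stats as JSON.

```python
# Sketch: reading clustering metadata for a hypothetical table.
# SYSTEM$CLUSTERING_INFORMATION returns JSON with partition counts,
# average overlaps, and average depth.
import json
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password",
                                   warehouse="ANALYTICS_WH")
cur = conn.cursor()

cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('demo_db.public.sales', '(sale_date)')")
info = json.loads(cur.fetchone()[0])
print(info["total_partition_count"], info["average_overlaps"], info["average_depth"])

# Simple aggregates like this can be answered from micro-partition statistics,
# so they typically don't need warehouse compute at all.
cur.execute("SELECT COUNT(*), MIN(sale_date), MAX(sale_date) FROM demo_db.public.sales")
print(cur.fetchone())

cur.close()
conn.close()
```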

Storage

Snowflake decouples storage from compute, and what you pay for storage is essentially what the cloud provider charges. Remember, you are charged for storage of data that is active, inactive, in Time Travel, and in Fail-safe.

Let's talk about the different types of tables within Snowflake and how they can help you manage storage costs. There are three types (see the sketch after this list):

Temporary — persists only within a particular session and stores non-permanent data; think of it as your landing table. Can use Time Travel (0–1 days) but has no Fail-safe.

Transient — persists beyond the session, but doesn't need the same level of data protection as a permanent table. Can use Time Travel (0–1 days) but has no Fail-safe.

Permanent — the default when creating a table; has the highest level of protection and recovery, hence both Time Travel (0–1 days on Standard edition, 0–90 days on Enterprise edition and above) and Fail-safe.

Source: Snowflake Computing
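Here is the sketch promised above: one table of each type created with the Python connector. The table names, columns, and the 1-day retention value are illustrative only.

```python
# Sketch: one table of each type, with Time Travel retention set explicitly.
# Table names, columns, and the retention value are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password",
                                   database="DEMO_DB", schema="PUBLIC")
cur = conn.cursor()

# Temporary: lives only for this session, no Fail-safe.
cur.execute("CREATE TEMPORARY TABLE stg_landing (id INT, payload VARIANT)")

# Transient: survives the session, no Fail-safe, so cheaper to keep around.
cur.execute("CREATE TRANSIENT TABLE stg_history (id INT, payload VARIANT) DATA_RETENTION_TIME_IN_DAYS = 1")

# Permanent (the default): Time Travel plus 7 days of Fail-safe.
cur.execute("CREATE TABLE fact_orders (id INT, amount NUMBER(12,2)) DATA_RETENTION_TIME_IN_DAYS = 1")

cur.close()
conn.close()
```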

Zero-copy Cloning

Snowflake also allows you to take a snapshot of your table, schema, or database at no extra storage cost. This process is called zero-copy cloning, and it works by sharing the underlying micro-partitions between the source and the clone; only subsequent changes take up new storage. Think of how easy it is to spin up your DEV, UAT, and PROD environments!

Source: Snowflake Computing
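A sketch of what that looks like in practice, assuming hypothetical prod_db and dev_db databases and placeholder credentials:

```python
# Sketch: zero-copy cloning a hypothetical PROD database into a DEV environment.
# The clone shares micro-partitions with the source, so it is near-instant and
# initially consumes no extra storage.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

cur.execute("CREATE DATABASE dev_db CLONE prod_db")

# Cloning also works at schema and table level, and can be combined with Time Travel:
# cur.execute("CREATE TABLE orders_backup CLONE prod_db.public.orders AT (OFFSET => -3600)")

cur.close()
conn.close()
```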

For the Query Processing layer, Snowflake uses the concept of a Virtual Warehouse, which is essentially an MPP compute cluster that processes queries and scales on demand. Snowflake provides two types of warehouses: Standard (a single compute cluster, no scaling out) and Multi-cluster (can add additional compute clusters and scale out). Virtual warehouses consume credits only while they are running, so don't forget to suspend your compute or scale it back down when it's idle.
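Here is a sketch of creating a warehouse that won't quietly burn credits while idle; the warehouse name and the 60-second auto-suspend value are just placeholder choices.

```python
# Sketch: a hypothetical warehouse that suspends itself after 60 seconds of
# inactivity and resumes automatically when a query arrives, so idle time
# doesn't consume credits.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS analytics_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND   = 60
      AUTO_RESUME    = TRUE
""")

# You can also suspend it explicitly when you're done
# (it would auto-suspend after 60 idle seconds anyway):
# cur.execute("ALTER WAREHOUSE analytics_wh SUSPEND")

cur.close()
conn.close()
```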

Scaling Up

Snowflake defines its virtual warehouses in T-shirt sizes: X-Small, Small, Medium, and so on. Each step up gives you double the compute of the previous size, and costs double the credits per hour (X-Small = 1, Small = 2, Medium = 4, ...). Since the smallest warehouse, X-Small, has 8 vCPUs/threads, a Medium warehouse has 32 threads and can, in the best case, run your query about four times faster, but at four times the credit cost.
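Scaling up is a single ALTER statement, so a common pattern is to bump the size just for a heavy job and drop it straight back down. A sketch, with a hypothetical warehouse and table:

```python
# Sketch: temporarily scaling a hypothetical warehouse up for a heavy job,
# then back down. Each size up doubles compute and credits per hour.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password",
                                   warehouse="ANALYTICS_WH")
cur = conn.cursor()

cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM'")

# The expensive workload would run here (hypothetical table).
cur.execute("SELECT COUNT(*) FROM demo_db.public.sales")
print(cur.fetchone())

cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL'")

cur.close()
conn.close()
```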

You know what they say, with great power comes great responsibility, so BE CAUTIOUS WHEN SCALING YOUR VIRTUAL DWH! It could cost you a fortune, like what happened to me.

Remember, even though the amazing technology behind Snowflake can help you get your results faster, it's not going to help if you or your users write bad SQL queries. My take is that Snowflake should not be handed off as just another SQL data warehouse; rather, pair it with educating your users on writing efficient SQL.

Scaling Out

With multi-cluster warehouses, you can scale out with all clusters running (Maximized mode) or auto-scale, where additional clusters are started and stopped on demand based on the number of concurrent users/queries. When would you scale out? When you need concurrency, such as during peak times, so queries can be load-balanced across clusters.
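Here is what configuring that could look like, as a sketch with placeholder names and cluster counts (note that multi-cluster warehouses require Enterprise edition or higher):

```python
# Sketch: a hypothetical multi-cluster warehouse in auto-scale mode.
# Snowflake starts up to MAX_CLUSTER_COUNT clusters when concurrency grows
# and shrinks back to MIN_CLUSTER_COUNT when the queue drains.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS bi_wh
      WAREHOUSE_SIZE    = 'SMALL'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4      -- auto-scale between 1 and 4 clusters
      SCALING_POLICY    = 'STANDARD'
      AUTO_SUSPEND      = 60
      AUTO_RESUME       = TRUE
""")

cur.close()
conn.close()
```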

Caching

Similar to AWS ElastiCache, Snowflake caches your query results, and this result cache lives in the Cloud Services layer. That means if you run the same query many times, the result can be reused for up to 24 hours, and no virtual warehouse compute is needed for it, as long as the underlying data hasn't changed.
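A sketch of the result cache in action, with a hypothetical sales table and placeholder credentials; USE_CACHED_RESULT is the session parameter that toggles it:

```python
# Sketch: hitting the result cache. The second identical query is answered
# from cached results (no warehouse compute), as long as the underlying data
# hasn't changed in the meantime.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password",
                                   warehouse="ANALYTICS_WH", database="DEMO_DB", schema="PUBLIC")
cur = conn.cursor()

query = "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"

cur.execute(query)   # first run: executed on the warehouse
cur.execute(query)   # second run within 24h: served from the result cache

# Disable the result cache for this session (e.g. when benchmarking):
cur.execute("ALTER SESSION SET USE_CACHED_RESULT = FALSE")

cur.close()
conn.close()
```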

Source: Snowflake Computing

Similar to a brain, the Cloud Services layer is a collection of services that orchestrate and control activities across Snowflake, such as ensuring ACID compliance. On top of that, authentication, access control, query optimisation, infrastructure management, and metadata management all happen in this layer.

Although the Cloud Services layer is shared across all editions of Snowflake, with VPS (Virtual Private Snowflake) you get your own dedicated cloud services layer tied to your infrastructure in the cloud, ensuring your environment is isolated.

Nevertheless, each non-VPS Snowflake deployment within a region sits inside a Virtual Private Cloud (VPC), and this is where your accounts live. This allows Snowflake to share resources across multiple accounts for better resource utilisation and performance.

Talking to some of my non-IT friends, they all worry about putting their data in the cloud, because one of their biggest concerns is cloud security and "what happens when it gets hacked".

Here is the magical thing: your data within Snowflake is automatically encrypted by default for its entire lifecycle, from loading to storage of the data at rest (end-to-end encryption).

AES 256-bit keys are used to encrypt data within Snowflake. Don't worry about someone getting hold of your key and decrypting your data by brute force: Snowflake uses a hierarchical key model and regularly rotates its keys.

All communication between the client and the server is also protected with TLS, but if you're still concerned about communicating over the internet, you can opt for PrivateLink or Direct Connect, both AWS services; you can read more about them here.