Top 10 Best Practices for Snowflake
Snowflake is one of the top data warehousing solutions on the market. It is likely on top because of its great features and ease of use. It has been adopted by companies of all sizes in recent years, as Snowflake kept improving, adding excellent features backed by some of the best documentation around. I have personally been working with Snowflake for many years now and have done numerous migrations from Redshift to Snowflake [discussed here]. Today, I am going to share some of the best practices that should be considered when working with Snowflake.
These practices could apply to other warehouses that support similar features; however, this article focuses on Snowflake. I may not have considered some of the most recent features while writing it.
To make the most of Snowflake as a data warehouse, it's best to follow practices that optimize performance, make good use of its features, and save costs. Here are the top 10 best practices:
Virtual Warehouse
Picking the right warehouse size is important. Jobs that scan small amounts of data, or jobs that run on a routine basis, should use a smaller warehouse, while critical and complex production jobs should run on larger ones with auto-scaling enabled. Start small and scale up incrementally.
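As a rough sketch of what this can look like in SQL (the warehouse names are hypothetical, and multi-cluster auto-scaling requires Enterprise edition or above):

```sql
-- Small warehouse for routine jobs; suspend quickly to save credits.
CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60          -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE;

-- Larger multi-cluster warehouse for critical, concurrent production jobs.
CREATE WAREHOUSE IF NOT EXISTS prod_wh
  WAREHOUSE_SIZE = 'LARGE'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4      -- scales out automatically under load
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;
```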
Clustering Keys
Snowflake provides a clustering feature that groups related rows together within the same micro-partition for better performance. Defining appropriate clustering keys for tables can boost query performance by a large factor. However, this only makes sense once a table has grown large and query performance has started to degrade; starting without keys and adding them later is often a good choice.
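For example, on a hypothetical events table that has grown large and is usually filtered by date and account:

```sql
-- Define a clustering key on the columns queries filter by most.
ALTER TABLE events CLUSTER BY (event_date, account_id);

-- Check how well the table is clustered on those columns.
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, account_id)');
```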
Query Optimization
Typical SQL optimization applies here, although a lot of optimization already happens under the hood in Snowflake's query optimizer. Best practices for columnar data retrieval should still be followed, e.g. avoiding SELECT * and selecting only the columns you need. Utilize Snowflake's query profiling and monitoring tools to identify and address performance bottlenecks.
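A small illustration, using a hypothetical orders table and the account usage views that ship with Snowflake:

```sql
-- Prefer explicit columns over SELECT * so less data is scanned.
SELECT order_id, order_date, total_amount
FROM orders
WHERE order_date >= '2023-01-01';

-- Find the slowest queries of the past week to profile further.
SELECT query_id, query_text, total_elapsed_time / 1000 AS seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20;
```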
Materialized Views
Creating materialized views for frequently accessed data can reduce cost and lower latency by avoiding re-computation of aggregated metrics. Materialized views are fast because their results are precomputed: a query can be served from the materialized results for data that hasn't changed, while only the data that has changed is read from the source table.
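A minimal sketch (the table and column names are hypothetical; note that materialized views require Enterprise edition or above):

```sql
-- Precompute a frequently queried daily aggregate.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(total_amount) AS revenue
FROM orders
GROUP BY order_date;
```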
Time Travel & Snapshots
When data recovery and auditing are needed, Time Travel and snapshots can be very helpful. However, it is important to set the right retention policies to avoid unnecessary storage costs. The retention period usually depends on several factors, such as the cost of losing the data, the cost of reprocessing it, and its business value.
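For instance, assuming a hypothetical orders table (retention beyond one day requires Enterprise edition or above):

```sql
-- Keep 7 days of history; longer retention means more storage cost.
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 7;

-- Query the table as it looked one hour ago.
SELECT * FROM orders AT (OFFSET => -3600);

-- Recover an accidentally dropped table within the retention window.
UNDROP TABLE orders;
```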
Tagging
Snowflake provides a tagging feature that can help with compliance, discovery, and tracking resource usage. A tag is a schema-level object that can be applied to Snowflake objects like tables, schemas, and warehouses; tagging can provide a lot of value when used consistently.
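A small sketch, with a hypothetical cost_center tag (object tagging requires Enterprise edition or above):

```sql
CREATE TAG IF NOT EXISTS cost_center;

-- Apply the same tag consistently across different object types.
ALTER WAREHOUSE prod_wh SET TAG cost_center = 'analytics';
ALTER TABLE orders SET TAG cost_center = 'sales';

-- Review tag assignments for chargeback or discovery.
SELECT *
FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
WHERE tag_name = 'COST_CENTER';
```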
External Tables
External tables allow you to query data that lives outside Snowflake storage; Snowflake does not store or manage the data. They are read-only, and performance is slower compared to regular tables. However, external tables should be considered when the data is already present in cheaper storage like a data lake (S3, GCS) and performance is not critical.
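A rough sketch against a hypothetical S3 bucket (the stage, storage integration, and file layout are all assumptions):

```sql
-- Stage pointing at files that already live in the data lake.
CREATE STAGE lake_stage
  URL = 's3://my-data-lake/events/'
  STORAGE_INTEGRATION = my_s3_integration;

-- Read-only table over those files; the data stays in S3.
CREATE EXTERNAL TABLE raw_events (
  event_ts TIMESTAMP AS (VALUE:ts::TIMESTAMP)
)
LOCATION = @lake_stage
FILE_FORMAT = (TYPE = PARQUET);
```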
Variant Types
Snowflake's VARIANT type is very powerful, allowing you to store semi-structured data like JSON easily. This means teams don't have to spend time flattening out the data or building Kimball-style data models; instead, one wide data model works great. Remember, querying data inside a VARIANT column can be slower; extracting nested data into materialized views can help overcome performance issues.
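For example, with a hypothetical table holding raw JSON payloads:

```sql
CREATE TABLE raw_payloads (payload VARIANT);

-- Path notation reaches into nested fields; FLATTEN unnests arrays.
SELECT
  payload:user.id::NUMBER AS user_id,
  item.value:sku::STRING  AS sku
FROM raw_payloads,
LATERAL FLATTEN(input => payload:items) AS item;
```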
Functions & Tasks
Functions (UDFs) and stored procedures let you encapsulate reusable logic, while tasks are Snowflake's built-in scheduler. You can schedule SQL scripts, such as view refreshes, cleanups, or dedupes, through tasks, and you can leverage functions for common utilities like extracting nested columns.
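A minimal sketch of both, using hypothetical object names:

```sql
-- A nightly cleanup job, scheduled with a task.
CREATE OR REPLACE TASK nightly_cleanup
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  DELETE FROM staging_orders
  WHERE loaded_at < DATEADD('day', -30, CURRENT_TIMESTAMP());

ALTER TASK nightly_cleanup RESUME;  -- tasks are created suspended

-- A SQL UDF for a common utility: extracting a nested column.
CREATE OR REPLACE FUNCTION first_sku(payload VARIANT)
  RETURNS STRING
AS $$ payload:items[0]:sku::STRING $$;
```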
Snowpark
Snowpark allows you to run non-SQL code, e.g. Python, Java, and Scala, within the Snowflake Data Cloud. That means you don't need to pull data into your own instance to run your code; instead, you push the code to the data and leverage Snowflake's compute through Snowpark. This support also means you can migrate existing code more easily, e.g. converting pandas code into Snowpark DataFrames.
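One way to see this from SQL is a Snowpark Python stored procedure; this sketch assumes a hypothetical orders table:

```sql
CREATE OR REPLACE PROCEDURE count_cancelled()
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.10'
  PACKAGES = ('snowflake-snowpark-python')
  HANDLER = 'run'
AS
$$
def run(session):
    # The filter and count run inside Snowflake; no data is pulled out.
    df = session.table('orders').filter("status = 'CANCELLED'")
    return f'cancelled orders: {df.count()}'
$$;

CALL count_cancelled();
```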
For further details, please visit the Snowflake Documentation.
These best practices help in different areas of Snowflake, from cost savings to query performance, while taking full advantage of Snowflake's features. You can apply a similar approach to your existing data warehouse, depending on the features it supports. Other notable practices are more generic, like data organization, alerting, monitoring, access control, and security.
If you are a Snowflake user, let me know what else should have made the list.