Databricks Delta Lake — A Friendly Intro

​Introduction to Apache Spark

​Building data lakes

Credits — Databricks

Problem with data lakes

Why do data lake projects struggle with reliability and performance?

Reliability issues

  • Failed jobs leave data in a corrupt state, which requires tedious cleanup afterwards. Cloud storage services do not natively support atomic transactions, so a failed job can leave incomplete or corrupt files behind, and those files can break the queries and jobs that read from them.
  • No schema enforcement leads to inconsistent, low-quality data. Mismatched data types between files or partitions cause failures when the data is processed and force workarounds. A common workaround is to store every field as a string/varchar and cast it to the intended data type when fetching data or running OLAP (online analytical processing) queries.
  • Lack of consistency when mixing appends and reads, or when both batch and streaming jobs write to the same location. This is because cloud storage, unlike an RDBMS, is not ACID compliant.
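
The "everything is a string, cast on read" workaround mentioned above can be sketched in plain Python. This is only a toy illustration: the records, field names and the two "files" are made up for the example and do not come from any real dataset.

```python
from datetime import date

# Two hypothetical "files" written by different jobs. With no schema
# enforcement, the second job was free to write amount as free text.
file_a = [{"order_id": "1", "amount": "19.99", "day": "2021-03-01"}]
file_b = [{"order_id": "2", "amount": "5 USD", "day": "2021-03-02"}]


def cast_row(row):
    """Cast the all-string fields to the types the reader actually wants."""
    return {
        "order_id": int(row["order_id"]),
        "amount": float(row["amount"]),
        "day": date.fromisoformat(row["day"]),
    }


good, bad = [], []
for row in file_a + file_b:
    try:
        good.append(cast_row(row))
    except ValueError:
        # The bad value is only discovered at read time,
        # long after the writing job has finished.
        bad.append(row)

print(len(good), "row(s) cast cleanly,", len(bad), "row(s) rejected at read time")
```

The point of the sketch is that the error surfaces in the reader, not in the job that wrote the bad value, which is why this workaround tends to hide data quality problems rather than solve them.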

Performance issues

  • Inconsistent file sizes, with files that are either too small or too big. Having too many small files means workers spend more time accessing, opening and closing files than reading data, which hurts performance.
  • Partitioning, while useful, can be a performance bottleneck when a query selects too many fields.
  • Slow read performance of cloud storage compared to file system storage. Throughput for cloud object/blob storage is between 20 and 50 MB per second, whereas local SSDs can reach 300 MB per second.
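
To get a feel for the throughput figures quoted above, here is a back-of-the-envelope calculation. The 10 GB dataset size is an arbitrary assumption, and the sketch models a single sequential reader; a real cluster reads many files in parallel, so absolute times will differ.

```python
# Hypothetical dataset size, chosen only for illustration.
DATASET_MB = 10 * 1024  # 10 GB expressed in MB


def scan_seconds(size_mb, throughput_mb_per_s):
    """Time to read size_mb sequentially at the given throughput."""
    return size_mb / throughput_mb_per_s


cloud_slow = scan_seconds(DATASET_MB, 20)   # lower bound quoted for object storage
cloud_fast = scan_seconds(DATASET_MB, 50)   # upper bound quoted for object storage
local_ssd = scan_seconds(DATASET_MB, 300)   # figure quoted for local SSD

print(f"cloud storage: {cloud_fast:.0f} to {cloud_slow:.0f} s")
print(f"local SSD:     {local_ssd:.0f} s")
```

Under these assumptions the same scan is roughly 6 to 15 times slower against cloud storage than against a local SSD, which is the gap that caching tries to close.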

What is Delta Lake?

Performance key features

  • Compaction: Delta Lake can speed up read queries on a table by coalescing small files into larger ones.
  • Data skipping: When you write data into a Delta table, column statistics (minimum and maximum values) are collected automatically. Delta Lake on Databricks uses this information at query time to skip files that cannot contain matching rows. You do not need to configure data skipping; the feature is activated whenever applicable.
  • Caching: Delta caching accelerates reads by creating copies of remote files in the nodes’ local storage using a fast intermediate data format. The data is cached automatically when a file is fetched from a remote source. Successive reads of the same data are performed locally, which significantly improves read speed.
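
Compaction and data skipping can be illustrated with a small toy model in plain Python. This is a sketch of the ideas only, not the Delta Lake implementation: the `DataFile` class, the file names and the target of 4 rows per file are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str
    rows: list  # values of a single numeric column, to keep the model small

    @property
    def min_value(self):
        return min(self.rows)

    @property
    def max_value(self):
        return max(self.rows)


def compact(files, target_rows):
    """Coalesce small files into larger ones holding up to target_rows rows."""
    all_rows = [value for f in files for value in f.rows]
    return [
        DataFile(f"compacted-{i // target_rows}", all_rows[i:i + target_rows])
        for i in range(0, len(all_rows), target_rows)
    ]


def files_to_read(files, low, high):
    """Data skipping: keep only files whose [min, max] range overlaps the query."""
    return [f for f in files if f.max_value >= low and f.min_value <= high]


# Six tiny files holding the values 0..11, two rows each.
small_files = [DataFile(f"part-{i}", [2 * i, 2 * i + 1]) for i in range(6)]
compacted = compact(small_files, target_rows=4)

# A query for values between 5 and 6 only needs one of the compacted files.
hits = files_to_read(compacted, low=5, high=6)
print(len(small_files), "files before,", len(compacted), "after compaction")
print("files read for the query:", [f.name for f in hits])
```

In this model the minimum and maximum are recomputed on demand; a real system would store them as metadata at write time so that files can be skipped without opening them.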

Reliability key features

  • ACID transactions: Finally! Serializable isolation ensures that readers never see inconsistent data, just as in an RDBMS.
  • Schema enforcement: Delta Lake automatically validates that the schema of the data frame being written is compatible with the table’s schema. Before writing from a data frame to a table, Delta Lake checks that every column in the data frame exists in the table, that the columns’ data types match, and that column names do not differ only by case.
  • Data versioning: The transaction log for a Delta table contains protocol versioning information that supports the evolution of Delta Lake. Delta Lake tracks the minimum reader version and the minimum writer version separately.
  • Time travel: Delta’s time travel capabilities simplify rolling back and auditing changes to data. Every operation on a Delta table or directory is automatically versioned. You can time travel by timestamp or by version number.
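
The schema enforcement checks can be sketched as a small validation function. This is a toy model, not Delta Lake's actual validation code: schemas are plain dicts of column name to type name, and the sketch assumes the rule is that every data frame column must already exist in the table with the same type and the same spelling.

```python
def schema_violations(table_schema, frame_schema):
    """Return a list of reasons why frame_schema cannot be written to the table."""
    problems = []
    lower_to_table = {name.lower(): name for name in table_schema}
    for name, dtype in frame_schema.items():
        if name in table_schema:
            if table_schema[name] != dtype:
                problems.append(
                    f"type mismatch for {name}: {dtype} vs {table_schema[name]}"
                )
        elif name.lower() in lower_to_table:
            problems.append(
                f"{name} differs from {lower_to_table[name.lower()]} only by case"
            )
        else:
            problems.append(f"{name} is not a column of the table")
    return problems


# Hypothetical table schema used for the examples below.
table = {"id": "long", "name": "string"}

print(schema_violations(table, {"id": "long", "name": "string"}))  # compatible
print(schema_violations(table, {"id": "string"}))                  # wrong type
print(schema_violations(table, {"ID": "long"}))                    # case differs
print(schema_violations(table, {"id": "long", "email": "string"})) # unknown column
```

A write would go ahead only when the returned list is empty; otherwise the whole write is rejected rather than partially applied.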

Taking a look at versioning in Delta
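
One way to picture versioning is as an append-only log in which every commit creates a new table version, and reading "as of" a version replays the log up to that point. The class below is a toy model with hypothetical names; the real Delta transaction log format is different.

```python
class ToyDeltaTable:
    """Append-only log of commits; each commit is one table version."""

    def __init__(self):
        self._log = []  # one entry per committed version

    def commit(self, add=(), remove=()):
        """Record the files added and removed; return the new version number."""
        self._log.append({"add": list(add), "remove": list(remove)})
        return len(self._log) - 1

    def files_as_of(self, version=None):
        """Replay the log up to `version` (latest if omitted)."""
        if version is None:
            version = len(self._log) - 1
        live = []
        for entry in self._log[: version + 1]:
            live = [f for f in live if f not in entry["remove"]]
            live.extend(entry["add"])
        return live


t = ToyDeltaTable()
v0 = t.commit(add=["a.parquet"])
v1 = t.commit(add=["b.parquet"])
# A compaction-style commit: replace two small files with one larger file.
v2 = t.commit(add=["ab.parquet"], remove=["a.parquet", "b.parquet"])

print("version", v0, "->", t.files_as_of(v0))
print("version", v1, "->", t.files_as_of(v1))
print("latest   ->", t.files_as_of())
```

Because old entries are never rewritten, earlier versions stay readable after later commits. In Delta Lake itself, an older version is read by passing the `versionAsOf` or `timestampAsOf` option to the Spark reader.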

Leading big data and AI-powered solution company https://www.sertiscorp.com/
