Scalable ETL with Azure Databricks

What is Azure Databricks?

Azure Databricks is a managed data engineering and AI platform running on Azure. It is an integrated platform that simplifies developing and working with Apache Spark.

Most traditional BI set-ups rely on traditional ETL tools (e.g. SSIS) to transform data. The transformation phase sometimes requires heavy computation. These traditional tools, however, typically run on a single node, so the work cannot be spread across multiple machines. As both data volumes and the number of jobs grow, BI and data teams need a scalable compute platform for ETL and ELT jobs: this is Azure Databricks.

Azure Databricks offers an integrated platform where data engineers can collaborate and write their ETL and ELT jobs in classic SQL, Python or Scala. Once written, jobs can be scheduled using Azure Data Factory and become part of a broader ETL sequence.

Databricks isn't an ETL tool like SSIS. Rather, it works together with other tools such as Azure Data Factory to jointly offer an end-to-end ETL and ELT solution covering Extract (with Azure Data Factory), Transform (with Databricks) and Load (with Databricks).
If you are looking for a purely visual, drag-and-drop tool, you can use Azure Data Factory's Data Flow service, which also runs on Databricks.
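The Transform and Load steps that Databricks takes on in this set-up can be sketched as a small PySpark job. This is a minimal illustration, not a reference implementation: the storage paths, the `raw_orders` view and the `customer_id` and `amount` columns are all hypothetical, and the function expects an existing Spark session to be passed in.

```python
# Hypothetical locations and column names, chosen only for illustration.
RAW_ORDERS_PATH = "/mnt/datalake/raw/orders.csv"
CURATED_PATH = "/mnt/datalake/curated/customer_totals"

# The transformation is plain SQL; Spark distributes the work across the cluster.
CUSTOMER_TOTALS_SQL = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer_id
"""


def run_job(spark, source_path=RAW_ORDERS_PATH, target_path=CURATED_PATH):
    """Read raw orders, aggregate them per customer and write the result."""
    # Data that an upstream Extract step has already landed in the lake.
    raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(source_path)
    )

    # Transform: expose the DataFrame to SQL and run the aggregation.
    raw.createOrReplaceTempView("raw_orders")
    totals = spark.sql(CUSTOMER_TOTALS_SQL)

    # Load: write the curated result back to the lake as Parquet.
    totals.write.mode("overwrite").parquet(target_path)
    return totals
```

In a Databricks notebook a session named `spark` is already available, so the job would be started with `run_job(spark)`. Such a notebook could then be scheduled as one activity in an Azure Data Factory pipeline.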

Continue reading about running BI with Databricks on our website.

What does Databricks look like?

Recommendations for using Azure Databricks

  • Use Azure Databricks as an AI Workbench

    A lot of organizations start data science development locally on a laptop or a virtual machine. However, organizations that embrace the power of AI will at some point need both more compute power and the ability to truly collaborate across teams. Databricks is a hassle-free platform that offers both IT and data users (analysts and scientists) the tools they already know (Python, R, Scala).

    Continue reading about Machine Learning on Databricks on our website.

  • Leverage Databricks Delta from day 1

    Over time, your data lake will typically have multiple data pipelines reading and writing data concurrently, which makes it hard to maintain data integrity.

    Databricks Delta is a managed version of Delta Lake, the open-source storage layer for Spark, which runs on top of an existing data lake (Azure Data Lake Store, Azure Blob Storage). Delta brings data integrity through ACID transactions while allowing concurrent reads and writes from/to the same directory or table. Additionally, it brings benefits such as unified batch and streaming code, the ability to run updates and deletes in your data lake, time travel, flexible schema evolution and much more.

    Continue reading about Databricks Delta on our website.
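The Delta features recommended above (schema evolution, time travel, updates and deletes) can be sketched with a few small PySpark helpers. This is a hedged illustration only: the table path, the helper names and the example conditions are hypothetical, and the functions expect a Spark session or DataFrame from an environment where Delta Lake is available.

```python
# Hypothetical Delta table location, for illustration only.
TABLE_PATH = "/mnt/datalake/delta/sales"


def append_with_schema_evolution(df, path=TABLE_PATH):
    """Append rows and let Delta add any new columns to the table schema."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(path)
    )


def read_version(spark, version, path=TABLE_PATH):
    """Time travel: read the table as it was at an earlier version number."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


def delete_statement(condition, path=TABLE_PATH):
    """Build a Delta SQL DELETE for a path-based table.

    The condition is inserted as-is, so it must come from trusted code.
    """
    return f"DELETE FROM delta.`{path}` WHERE {condition}"


def update_statement(assignments, condition, path=TABLE_PATH):
    """Build a Delta SQL UPDATE for a path-based table (trusted input only)."""
    return f"UPDATE delta.`{path}` SET {assignments} WHERE {condition}"
```

In a notebook, a delete would then be run as `spark.sql(delete_statement("order_date < '2015-01-01'"))`, and `read_version(spark, 0)` would return the table as it was first written. Each write, update or delete is committed as a single transaction, so concurrent readers do not see partial results.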
