This year’s Microsoft Build Conference brings us Fabric.
Now that the hype is out and the dust has settled, we, the Azuriens tribe, are eager to dive fully into this new technology. After all, it’s not every day that Microsoft bills a new technology as the future of analytics in the age of AI. Rest assured, we are as curious as you are.
This blog series is intended to dig in, discover the ins and outs, the pros and cons, and see if Fabric can live up to Microsoft’s claim of being a revolution in end-to-end analytics. We’ve all seen the demos, but now it’s time to put this to the test!
Whether you are a data engineer who loves coding in Spark, someone whose true passion lies in writing T-SQL, an analyst, or an experienced dev, this blog is for you.
First, let’s try and demystify Microsoft Fabric. We hope the articles will be helpful to you.
If you want anything added, please comment or contact us directly.
This is a work in progress, and articles will be added as we go along. We are Azuriens – Azure data platform experts – with a holistic view of data strategy. This is our Fabric journey.
If we can believe the hype, Fabric is the first of its kind. It offers a SaaS experience for all services needed to build a modern analytics estate.
The key differentiator seems to be that any piece of data anyone will ever need is available in one place. OneLake, the equivalent of OneDrive for all your analytics needs, provides a single experience and a tenant-wide store for data that serves both professional and citizen developers, Spark enthusiasts and SQL gurus.
When adopting Fabric, organizations no longer need to stitch together stand-alone services; instead, one platform allows for connecting, onboarding, and operating all data.
Data professionals are increasingly expected to work with data at scale, in a secure, compliant, and cost-effective way. At the same time, business users want to use that data for decision making, more effectively and more quickly.
The foundation of Microsoft Fabric is a Lakehouse, built on top of OneLake, a scalable storage layer. It combines the flexibility and scalability of a data lake with the structure and the query and analysis capabilities of a data warehouse.
A Lakehouse presents as a database and is built on top of a data lake using Delta tables.
Lakehouses combine the SQL-based analytical capabilities of a relational data warehouse and the flexibility and scalability of a data lake.
Some benefits of a lakehouse: it stores all data formats, it can be used with various analytics tools and programming languages, and, as a cloud-based solution, it can scale automatically and provide high availability and disaster recovery.
A Lakehouse is a great option if you want a scalable analytics solution that maintains data consistency. It’s important to evaluate your specific requirements to determine which solution is the best fit.
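The data consistency mentioned above comes largely from the Delta format, which records every change to a table as an ordered log of commits. As a rough sketch of that idea in plain Python (not the Delta Lake library itself): each commit lists data files that were added or removed, and replaying the log gives the files that make up the current table version. The file names and the two commits below are made up for illustration, and real commits carry more fields than shown here.

```python
import json

# Two hypothetical commits, each a set of newline-delimited JSON actions.
# Real Delta commits contain more metadata; only add/remove are sketched here.
commits = [
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}',
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
]


def live_files(commit_texts):
    """Replay commits in order and return the files in the latest version."""
    files = set()
    for text in commit_texts:
        for line in text.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


print(live_files(commits))
```

Because readers only ever see the result of fully written commits, a half-finished write never shows up in a query, which is where the consistency guarantee comes from.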
In Microsoft Fabric, you can create a lakehouse in any premium tier workspace. After creating a lakehouse, you can load data – in any common format – from various sources, including local files, databases, or APIs. Data ingestion can also be automated using Data Factory Pipelines or Dataflows (Gen2) in Microsoft Fabric. Additionally, you can create Fabric shortcuts to data in external sources, such as Azure Data Lake Storage Gen2 or a Microsoft OneLake location outside of the lakehouse’s own storage. The Lakehouse Explorer enables you to browse files, folders, shortcuts, and tables, and view their contents within the Fabric platform.
After you’ve ingested the data into the Lakehouse, you can use Notebooks or Dataflows (Gen2) to explore and transform it.
Dataflows (Gen2) are based on Power Query – a familiar tool to data analysts using Excel or Power BI that provides a visual representation of transformations as an alternative to traditional programming.
Data Factory Pipelines can be used to orchestrate Spark, Dataflow, and other activities, enabling you to implement complex data transformation processes.
After transforming your data, you can query it using SQL, use it to train machine learning models, perform real-time analytics, or develop reports in Power BI.
You can also apply data governance policies to your Lakehouse, such as data classification and access control.
What do I choose?
Microsoft Fabric makes this easy for you: its lake-centric nature stores all structured, semi-structured, and unstructured data in one location and in one technology. OneLake is built on top of ADLS (Azure Data Lake Storage) Gen2, a flexible and scalable service that handles huge volumes of data at blazing speeds.
Fabric covers the full set of capabilities an end-to-end analytics platform needs, including data movement, data lakes, data engineering, data science, real-time analytics and business intelligence. All of this is baked into one shared platform providing a robust set of security, governance and compliance measures.
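Because OneLake sits on ADLS Gen2, items stored in it can be addressed with ADLS-style URIs. The helper below is a small sketch, assuming the `abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<item>.<itemtype>/<path>` pattern; the workspace, lakehouse and table names are hypothetical.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_abfss_path(workspace: str, lakehouse: str, relative_path: str) -> str:
    """Build an abfss URI for a path inside a lakehouse (assumed URI pattern)."""
    rel = relative_path.strip("/")
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/{rel}"


# Hypothetical names, for illustration only.
print(onelake_abfss_path("SalesWorkspace", "SalesLakehouse", "Tables/orders"))
```

A path like this is what a Spark notebook or an external ADLS-compatible tool would typically be pointed at.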
A lot of focus has been put on the audiences working with the data. Whether you are Spark-skilled, SQL-oriented, without a preference, or not technical at all, there is something for every position on the Spark, SQL, or no-code spectrum.
If you want to develop in Spark, then you will most likely spin up a notebook in seconds. If you’re a SQL guru, then you will most likely find your way in the Warehouse. Something that will be loved is that SQL endpoints are automatically created for the Lakehouse. On the visualization side, a big advantage is that Power BI datasets are created for you by default. Autogenerating a Power BI report is literally a click away.
On the low-code end of the spectrum, datamarts are ready at your disposal.
One thing to keep in mind is that when you create a Spark-managed Lakehouse, its artefacts can only be updated by the Spark engine, although they remain readable by the SQL engine, and vice versa.
From now on, Fabric blends the goodness of Synapse SQL serverless and dedicated pools, offering a highly elastic, super performant engine. Proprietary formats are a thing of the past, replaced by an open data format based on Delta (more on Delta here). All compute engines in Microsoft Fabric can reason over the Delta format, and the Synapse Warehouse happens to do this extremely fast.
The Synapse Warehouse is based on Polaris, a distributed query engine designed to enable lots of important features like the ability to resize live workloads, deliver predictable performance at scale, and efficiently handle both structured and unstructured data.
One of the advantages of serverless compute is that workload management is handled for you automatically. Compute capacity is autonomously scaled up or down as queries flow through. In the spirit of SaaS, the goal is to offer a ‘no knobs’ experience: challenges that used to be the customer’s responsibility are now handed to Microsoft’s engineers to solve. Still, if knobs prove beneficial for extra tweaking, they will surely be added.
For each query, workload characteristics determine how it is isolated so that optimal performance is ensured. Ad hoc and analytical workloads will not share compute resources with ETL jobs, which frees you from worrying about scheduled loading jobs interfering with reports initiated by executives.
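To make the isolation idea concrete, here is a deliberately simplified sketch. It assumes, purely for illustration, that a statement's leading keyword is enough to tell loading work from read-only queries; the pool names and the keyword list are hypothetical, and the actual engine makes this decision internally.

```python
# Hypothetical rule: statements that load or change data go to one pool,
# everything else (typically SELECT) goes to another.
INGESTION_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "MERGE", "COPY"}


def assign_pool(sql: str) -> str:
    """Return the name of the (hypothetical) compute pool for a statement."""
    words = sql.split()
    if not words:
        raise ValueError("empty statement")
    return "etl_pool" if words[0].upper() in INGESTION_KEYWORDS else "query_pool"


print(assign_pool("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(assign_pool("COPY INTO sales FROM 'landing/sales.parquet'"))
```

With the two kinds of work kept on separate pools, a heavy nightly load cannot starve an executive's report of compute, which is the behaviour the paragraph above describes.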
There are a few options when you are in this predicament. First of all, Synapse SQL pools (Gen2) is a GA PaaS offering, and will remain so for years to come.
The Lake First nature of Fabric can be seen as an evolution of the existing Synapse platform, where all things are Delta (more on the Delta format here), allowing the compute engine of your choice to leverage data. The big pro that Delta brings to the table is that data can be instantly leveraged by SQL and Spark fans alike.
For those interested in a migration track, Microsoft will not keep you hanging and is developing migration tooling. Follow this blog to get updates on this.
Domains can fully isolate workspaces into logical groups. The various departments can be assigned to their domain containing one or more Workspaces. These domains are fully isolated and have their own Domain Admins, who have full control over the Domain.
2. Capacity & performance
The amount of compute power is dictated by the capacity of your Fabric tenant. Most analytics platforms see large usage spikes at specific times of the day, when loading new data or during intense query activity. The problem is that these spikes need more compute power and capacity than the quieter moments of the day. This is where Smoothing and Bursting play their part.
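The intuition can be sketched with a few lines of Python. This is a toy model, not Fabric's actual accounting: bursting is represented by a job being allowed to consume more than the capacity in a single period, and smoothing by spreading that consumption evenly over a window of following periods. The usage numbers, the capacity of 4 units and the window of 4 periods are all assumptions chosen for the example.

```python
def smooth_usage(usage, window):
    """Spread each period's consumption evenly over `window` periods."""
    smoothed = [0.0] * (len(usage) + window - 1)
    for start, units in enumerate(usage):
        share = units / window
        for offset in range(window):
            smoothed[start + offset] += share
    return smoothed


capacity = 4                  # units available per period (assumed)
usage = [0, 0, 12, 0, 0, 0]   # one burst of 12 units in period 2 (assumed)

smoothed = smooth_usage(usage, window=4)
print(max(usage) > capacity)      # the raw burst exceeds the capacity
print(max(smoothed) <= capacity)  # the smoothed load fits within it
```

The total consumption is unchanged; only its distribution over time is, which is why a capacity sized for the average load can still absorb short peaks.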
Typically a Data Lakehouse is the merger of the best parts a data warehouse offers, think governance and structure, plus the flexibility and scale of a data lake.
Lakehouses in Fabric fit into the whole a little differently. Each Lakehouse Artifact (spawned by the Data Engineering persona) essentially represents a single layer, but several of them contained in a workspace can enable a medallion architecture. Sitting on Delta, they take advantage of the many capabilities and the performance that come with it.
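For readers new to the medallion pattern, the sketch below walks a few made-up order records through the usual bronze (raw), silver (cleaned) and gold (aggregated) layers. It uses plain Python lists so that it runs anywhere; in Fabric each layer would more likely be a Lakehouse with Delta tables filled from a Spark notebook. The field names and cleaning rules are assumptions for the example.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (hypothetical data).
bronze = [
    {"order_id": "1", "country": "be", "amount": "10.50"},
    {"order_id": "2", "country": "NL", "amount": "n/a"},
    {"order_id": "3", "country": "BE", "amount": "4.50"},
]


def to_silver(rows):
    """Clean and type the raw rows; drop rows with an unparseable amount."""
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate cleaned rows into revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


print(to_gold(to_silver(bronze)))
```

Keeping each step in its own layer means the raw data stays available for reprocessing while reports only ever read the curated gold output.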
Spark allows data engineers to build out robust metadata-driven data transformation and processing pipelines. Via Notebooks (part of the Data Engineering persona), data engineers are also free to work locally through a VS Code extension that connects to the Spark cluster.
X. VertiPaq & Direct Lake
Power BI’s VertiPaq compression engine has been deeply integrated into Fabric. Since VertiPaq and Parquet are both column-oriented, the two make a dynamic duo: combining them results in smaller files and therefore faster writing to and reading from Delta in Fabric. Direct Lake, on the other hand, is a Power BI feature that uses VertiPaq and enables blazing-fast refresh performance through a direct connection to the underlying Delta tables. DirectQuery mode and Import mode, previously the only alternatives in Power BI, each came with a trade-off between query performance and data latency. With Direct Lake, Power BI can now scan OneLake directly, resulting in faster performance, all thanks to Delta.
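Why does a column-oriented layout compress so well? The sketch below shows two generic techniques often used on columns, dictionary encoding followed by run-length encoding. It is only an illustration of the principle on a made-up column of country codes and makes no claim about how VertiPaq or Parquet implement compression internally.

```python
def dictionary_encode(column):
    """Replace each value by a small integer code; return (dictionary, codes)."""
    dictionary = []
    index = {}
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, count) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]


column = ["BE", "BE", "BE", "NL", "NL", "BE"]  # hypothetical column values
dictionary, codes = dictionary_encode(column)
print(dictionary)                 # ['BE', 'NL']
print(run_length_encode(codes))   # [(0, 3), (1, 2), (0, 1)]
```

Values within one column tend to repeat far more than values across a row, so storing data column by column gives encodings like these much more to work with.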
Copilot is so deeply integrated into Fabric that it can help us style reports, model datasets, write code snippets, or query data. It is truly one of the most interesting parts of how Fabric will enable business engagement, citizen development, and self-service BI.