Exploring the evolution of Power BI using Direct Lake

May 25, 2023

What happens to #PowerBI now? We asked Microsoft for an update on what current Power BI professionals can expect.

In session 3, Lars Andersen, Senior Program Manager on the Power BI CAT team at Microsoft, will cover the exciting new Direct Lake storage mode for Power BI datasets, and more!

Hey everyone, and welcome to a full day about the new Microsoft Fabric. Today in the room I have Lars Andersen from Microsoft, Senior Program Manager on the CAT team, who will explore one of the specific topics of Microsoft Fabric. But for any new viewers out there, I'll just add a few words about what Microsoft Fabric is as a whole.

Microsoft Fabric is the newest product launch from Microsoft, a software-as-a-service solution for handling the entire data stack: not just reporting and data modeling, but the whole end-to-end stack, everything from data ingestion through transformation to storage, data science, machine learning, real time, and of course the data modeling and reporting as we know it. In Microsoft Fabric we have expanded our toolbox with a bunch of new tools. Some of them are not the Power BI tools we know, but they are bundled together in a unified platform and they work really well together with Power BI. We now have a total of six experiences covering different areas, where we can do everything from data engineering, data science, and data integration to the reporting we already know and love in Power BI. And that's why we invited you to the session here today. We also have some new things, and one feature especially comes to mind here. Do you want to elaborate on that? And glad to have you.

Yes, and thanks for having me. What I will be talking about today is this new feature in Microsoft Fabric, and in Power BI specifically, that we call Direct Lake. I will go through some slides that set the scene for what it is, and then I will do a demo of how to work with it. But before I get started: as mentioned, I'm Lars Andersen from Microsoft. I work in what I should probably still formally call the Power BI CAT team, because eventually we will be renamed the Fabric CAT team, where I work with
customers in Europe, specifically in Northern Europe, Eastern Europe, Spain, and Italy. That's what I do on a daily basis. I've been with Microsoft for more than nine years, so I've been on the whole Power BI journey. We have been very excited about what we launched at Build on Tuesday, so it's great that we can finally talk publicly about this and see how all of you as customers look at it and start using it.

So let me take you through a little presentation about Direct Lake and what it's about. As was just shown very briefly in the slide on what Fabric is, one of the ideas behind Fabric is OneLake: we should only have data in one place, we should not duplicate data, and we should have this one data lake. That's the foundation in Fabric, and you don't have to think about creating this OneLake; that's done for you automatically. Direct Lake uses the Delta Lake Parquet format that all the artifacts in OneLake are stored in, and those files can be reused in a Power BI dataset without having to move your data from your source into the dataset.

A few slides to show how it has been until Tuesday this week. We have basically had two modes in Power BI datasets. We have had DirectQuery mode, where everything still resides in the underlying data source: whenever you had a visual in Power BI, that visual queried the underlying data source, and the data source sent the result back. It worked, it still works, and it supports very large volumes of data. Then we also have Import mode, which is the preferred mode of working with Power BI, because that is where we take data from the source and actually import it into a Power BI dataset, so you get low latency and it's very fast. But in those scenarios you actually have
to move data into Power BI. What we have done with Direct Lake, and this slide says it's the perfect mode, is that your Parquet files are the source. We're just scanning the Parquet files; we're not moving the files into another storage layer.

If we look at it another way, these are the modes that we have today: DirectQuery and Import. For small datasets with DirectQuery, there's always the question about query speed, because we are 100 percent dependent on the underlying data source, but we don't need to import data, and the model size is unlimited when it comes to the Power BI model. For smaller models, imported models work great. But when we grow to larger models, and it may be hard to see in this slide, the faces for time to import and model size are not smiling a lot, because it can take a long time to refresh your dataset and there's a limit on how much data you can actually have in memory in Power BI. If we go to Direct Lake mode instead, then we are happier, because it still works great for smaller datasets, and for larger datasets it's also great: we don't have to spend time importing data, and we can, at least eventually, have model sizes that are bigger than you can have with Import today.

This is maybe a slightly funny-looking overview, and not really an architecture diagram, but it is basically the way Fabric works. We have OneLake in the middle, our Delta Lake with Parquet files, and they can be queried and modified using different languages and different technologies. What we will be looking at specifically in this session about Direct Lake is using a pipeline to take data from a source and load it into our lakehouse, and then using Power BI to query those files, which basically
are Delta tables in our lakehouse. What you will also notice here is this concept that we call fallback mode, and I'll come back to that in a second.

So why do we want to use Parquet? It is the new standard in Fabric; everything is Parquet files. Most of you listening probably know all about it, but let's take a very quick recap. It's open source and it's an open data format, so you can take your file and move it from Microsoft technology to somewhere else. We want to make it open, and we want to make sure you can actually use a technology that works. It's a column-oriented file format, and it's very efficient for storing data and for retrieving it as well. If you've been in the industry for a long time, you know XML, JSON, CSV files, and so on. They work great, but they're not always great for querying, and that is where Parquet is a great asset. We can also do efficient compression, and we'll see in a slide in a second how the compression is done in Fabric, because we're doing something special here. As you saw in the previous slide, you can work with your Parquet files using basically whatever language you choose. And finally, it's a common data storage format for a lot of the solutions out there: for Microsoft technology, for Databricks, for Snowflake, and for other technologies. So a lot of people are already used to working with Parquet files, and now we want to make it the core of Fabric and the way you should be working with data going forward.

Talking about compression, this is an example with Microsoft sales data. It's 162 tables, and if we have that as CSV files it's 880 gigabytes, which is a decent size. If we just load that into, let's say, standard Parquet files, then we reduce the volume to 268 gigabytes, which is significant.
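
[Editorial sketch] The sizes quoted for this example can be checked with a little arithmetic. The Python below uses the 880 GB (CSV) and 268 GB (standard Parquet) figures from this point in the talk, plus the 84 GB V-Order figure that the speaker gives immediately afterwards. The function and variable names are illustrative only, not part of any Fabric API.

```python
# Sizes quoted in the talk, in gigabytes.
CSV_GB = 880       # raw CSV files
PARQUET_GB = 268   # standard Parquet
VORDER_GB = 84     # V-Order compressed Parquet (figure quoted next in the talk)


def ratio(before_gb: float, after_gb: float) -> float:
    """Return how many times smaller `after_gb` is than `before_gb`."""
    return before_gb / after_gb


csv_to_parquet = ratio(CSV_GB, PARQUET_GB)
parquet_to_vorder = ratio(PARQUET_GB, VORDER_GB)
csv_to_vorder = ratio(CSV_GB, VORDER_GB)

print(f"CSV -> Parquet:     {csv_to_parquet:.1f}x")
print(f"Parquet -> V-Order: {parquet_to_vorder:.1f}x")
print(f"CSV -> V-Order:     {csv_to_vorder:.1f}x")
```

Read this way, the factor of 3.2 mentioned in the talk is the step from standard Parquet to V-Order, while the reduction relative to the original CSV files is roughly tenfold.
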
Yeah. But what we do in Fabric is use V-Order compression, which compresses the files even further. Going from these 268 gigabytes to the V-Order compressed version takes this example down to 84 gigabytes. That's a factor of 3.2 in compression, and of course it means less I/O and faster querying. That is the whole magic that makes Direct Lake work as fast as Import mode. A lot of this V-Order compression is done automatically when you work with the lakehouse in Fabric.

You'll see in a second how we create a Direct Lake dataset, but initially we have a dataset that is empty. If you have an imported dataset in Power BI, let's say a 10-gigabyte dataset, then it takes up 10 gigabytes of memory when you load it into memory. A Direct Lake dataset takes up zero memory at first, because there's no data in there. Whenever you start querying your dataset, it loads the columns that are used in your report, in your visualizations, into memory as you request them. So they're loaded from these Parquet files, let's call it on demand. And if we have columns in memory that have not been used for a while, Fabric will automatically evict them, which means that it removes those columns from memory. That frees up memory for the new columns that are being queried.

Then, as I mentioned on the overview slide, we have this concept of fallback. There can be cases where Direct Lake cannot serve your query from memory, and there are different reasons for that. It's not documented right now, and I don't have all the details on when it's going to happen. But for instance, if your column is too large to fit into memory, what do we do then? Then we fall back to DirectQuery, so instead of
querying the Parquet files directly, we use the SQL endpoint in Fabric. We are querying the same data and the same files, but using another technology, so of course it will be a little slower.

That's nothing short of amazing. The way I see it, if we wanted to do something close to this today, or a week ago, we would have to mix DirectQuery and Import in a composite model and configure aggregate tables, and even then, it sounds like this can do all that out of the box and more.

Yes, that's the whole idea, and it is great. We are very excited about this. Remember, we just launched it in public preview two days ago. It's still not 100 percent compatible with an imported model, but we will get there.

I have goosebumps, literally.

Looking at a visual representation of this fallback: in the upper left-hand corner we have our Power BI dataset, which receives DAX or MDX queries. The first question the dataset has to answer is: is this a fallback to DirectQuery against our SQL endpoint, which queries the Delta Lake, or can we use Direct Lake? Hopefully in most cases you'll be able to use Direct Lake, so that you get the amazingly fast performance, but in some cases it will fall back to DirectQuery. Everything happens as needed. Where it says "on-demand transcoding as needed" here, transcoding is the on-demand loading of your Delta files into memory in Power BI.

I'm not going to go into details about licensing, because we don't have all the details available publicly yet. But you will still need some capacity to handle all this memory, because we need to load the data into memory. It's not as if you can do everything with a very small capacity: if you have large datasets, you will still need a large capacity that can handle that workload.

And I just want to mention that in 45 minutes
we will have the last session of today, which will also cover what we do know about capacity and pricing. I should say that is definitely not all of it, but we know something about the capacity sizes and so forth, and we will cover that later. But this is definitely really cool. I have some questions here; we can take them later.

Yeah, let's take them.

People are asking how this will change the DAX we are writing. Will it change the DAX we're writing? Maybe we don't know the answer to this yet.

It will not change the DAX. It's still a Power BI dataset. When we look at the demo in a second and you go into the service, you still have a Power BI dataset. Now it's in Direct Lake mode; it's not DirectQuery, it's not Import, it's Direct Lake. So it's the same.

Interesting.

But of course, moving to this new storage mode is not something you can do by clicking three times and then you're ready to go. It is a new architecture, and you will need to redo some of your existing data architecture. But this is something you should definitely think about for new projects, and if you have models today where you are, let's say, limited in size, then this could also be a way to go. As hopefully everyone knows, what we have been saying for a lot of years, I think we had the first blog post in 2019, is that Power BI datasets will be the superset of Analysis Services. In 2019 it was a great blog post, but we were not there technology-wise. Today we are. So if you're starting a new project, you should always consider doing it as a Power BI Premium dataset as your starting point, and then maybe look at Azure Analysis Services as your second option if that doesn't work. And this will hopefully just make the choice easier.

Cool. So for anyone out there who may be worried about what
they built a week ago: that will still continue to work unchanged, and they don't have to do anything going forward?

Yeah. We get questions on a regular basis about Azure Analysis Services: are we going to deprecate Azure Analysis Services? There's no plan to deprecate Azure Analysis Services. At some point we may want to deprecate it, but you can continue to use it for many years going forward. We will not invest in new features for Azure Analysis Services; everything new we launch is in Power BI. But you can still continue to use what you have.

And I mean, if it ain't broke, don't fix it.

Exactly.

That's good, and a relief. Are there more questions now, or should we continue?

Good. So the last slide I have here is a link to the documentation that explains what Direct Lake is. Let me go to my demo environment. Now I want to show you how you can build a dataset from scratch using Direct Lake mode. I have a workspace here in my Fabric environment, and one of the new things with Fabric is that in the lower left-hand corner we now have this selector for which experience you want to be in. I'm in my Power BI experience now, which means that when I click New, I see Power BI objects. If I go to my Data Engineering experience, then I see something new here, and I can go into a workspace and see my data engineering objects or elements in here. Whatever experience you're in, if you want to create something else, you can always click Show all, and then you will see all the experiences that you have access to.

The first thing I need to do in order to create a Direct Lake dataset is to create a lakehouse and load my data into the lakehouse, into these Delta tables that are the foundation for my Direct Lake dataset.

Creating a lakehouse? That sounded like something that would be a six-month project.

Yeah, so let's see how long
it takes; we can time it. It may be a six-second project. You click on this Lakehouse button and give it a name, so we will call this one Fellowmind, and click Create, and then it creates your lakehouse in the Fabric environment. So now we have our lakehouse. It's an empty lakehouse, so it's not much fun, but we have it. We can see the list of our tables and the list of files, and right now we don't have any data in here. So what we want to do initially is load some data. We can do that using the next generation of dataflows, Dataflows Gen2. We could use a pipeline, which you may know from Synapse pipelines or Azure Data Factory. You can create a notebook if you would like to use PySpark or something similar to generate your data. Or you could create a shortcut to something else, even to something in AWS.

Nice. I like how those buttons are right there where you need them.

Exactly. In this example I have my data in an Azure SQL database. I could use dataflows, but the most enterprise way of doing it would probably be to use a data pipeline. So I click on Data pipeline, and I have to give it a name. I'll just keep the default and click Create. Now I will create a new pipeline. What you may know that we launched in Power BI not too long ago is this multitasking experience, which I'll show you in a second, so I can easily switch between my pipeline and my lakehouse. Since I told my lakehouse that I want to copy data, I go straight into this copy data experience, and because it was started from the lakehouse, I'm effectively telling it that I want to copy the data into my lakehouse. But you could also use a pipeline to copy data to another database or whatever you want to do.

And we have made it easy for you: if you don't have any data yourself
then you can start by using some sample data. You can see here that we have the New York taxi data, which you've probably seen, and that is a two-gigabyte Parquet file, so that's a decent size. I'm not going to use that large a dataset for this demo. Instead I'm going to go to Azure and Azure SQL Database. I've used my database before, so I have a connection I can reuse. This is my Azure SQL database, with all the credentials and everything in place, so I don't need to create the connection more than once. I click Next, and then I'm connecting to my database and I can select the tables that I want to use. I will select a few tables in here: a customer dimension, a date dimension, a product dimension, and a fact table. I click Next, and in the next step the pipeline asks me what my data destination should be. Again, since I launched it from the lakehouse, it will by default select the lakehouse that I launched it from. I could create a new lakehouse, or select another lakehouse if I had one, but this is what I want to do.

So nice and easy. Yes, we're already in there, and of course that's where we want to send it.

Exactly. So I click Next. I selected four tables in my database, and now I have to choose how to land them in my data destination: do I want to load them into tables, Delta tables, or into files? I want to use Delta tables, because that is the foundation for Direct Lake, as we talked about. For each table I can choose an action: Append, to append the data every time I load, or Overwrite. I'll just keep the default for now, and I can do that for each table. Then I click Next, I get a summary of what I've done, and I click OK. So now I have created this pipeline that is going to load data from Azure SQL Database into my lakehouse.

We don't want to spend time waiting for this data to load. It's not
taking a long time, but for the sake of time, it's not very interesting to watch. So I have another workspace here where I already created a lakehouse with data. You can see here that when I create a lakehouse I get three elements, or three artifacts, in Fabric. The bottom one here is my pipeline, so that's not created with my lakehouse. But I get the lakehouse itself, which is my files, then I get my SQL endpoint, and then I get a default dataset. The default dataset can be used, but it's good practice, at least today, to create your own. Whether this default dataset will stay or we will remove it at some point, I don't know, but if you want to make sure that Direct Lake is working, then you will need to create your own.

They all have the same name; this one is called "lake house with data". When I click on the SQL endpoint, it takes me to the SQL endpoint experience, where I can start writing some T-SQL on top of these four files if I like. I can also create a new report, or I can do a visual query using the Power Query Online experience. While I'm in this view, I can toggle between the SQL endpoint view and the lakehouse view: if I go from SQL endpoint to Lakehouse in the upper right-hand corner, it takes me to the lakehouse view instead. If you look at the documentation that I linked to, we write that right now Direct Lake mode is available for the lakehouse. At some point it will also be available for the SQL endpoint, so you can create it from there; that is in private preview right now.

So while I'm in lakehouse mode here, with my four tables loaded, you'll see that I have this option to create a new Power BI dataset. That's what I want to do: New dataset. I have to select which tables I want to have in my dataset, and I have these four, so I will select all four of them and click Confirm. So what's
happening now is that Power BI, or Fabric, is launching the data model editing experience in the browser that we launched a few months ago. Today this is the only way to create a Direct Lake dataset. We will also enable this functionality in Power BI Desktop in the future.

But what you're saying is that Direct Lake is not something we can hope to test out in three or four months; it's actually right there in the service?

The tenant I'm using here is a public tenant. This is not a Microsoft-internal one.

So anyone spinning up a trial can actually check out these features today?

You can do this today, now. But wait until we've finished.

Yes, don't leave.

OK, so this is the experience you see here, data modeling on the web. There's a yellow warning that says: keep in mind that all changes will be permanent and automatically saved. That's the experience right now. Sometimes that's good, sometimes it's bad. I think we've all tried working in a Power BI Desktop file that starts to crash for some reason, and maybe you forgot to save it. That's not going to happen here. But if you do something that you want to undo, then it's already saved. Again, this is a journey, and we will change this eventually.

Quick question. Someone's asking: is copying data into the lakehouse even necessary? Could we mount a source instead? And would that ruin the purpose of using Direct Lake? The way I understand it, we need to load the data into the lakehouse to get the V-Order compression that makes it work, so we'd be stuck with a kind of DirectQuery scenario if we were just mounting an external database. I don't even know if you can do that in Microsoft Fabric.

No, to be honest I don't know exactly if we can do it today. If you want to do Direct Lake then you have to
do this.

Yes. And as you saw in the slide, I think the compression speaks for itself.

Yeah, absolutely.

So that is the magic. Coming back to the model editor here, I have these four tables and I want to create my relationships. I have a customer dimension, so I will drag it. If you've worked with this, it's the exact same experience that you have in the data modeling that we launched in Power BI two months ago. For some reason it's a little slow now. It is a live demo.

OK. There's a bit more chat about the whole copying thing. The way I understand it, we are copying data to our lakehouse, or to our OneLake.

Yes, and that's where we want to have it.

Then anything here can be shortcut or virtualized, but we do need to make that copy into our platform to get the benefits of the compression and the file format.

That's correct.

So we're unifying on Fabric. It's not that we never need to move data: we need to move it into Fabric, and then we can use it everywhere.

Yes. And in many cases when you move data into Fabric, you need to do some transformation anyway. I'm an older guy, so I was creating data warehouses many years ago, and we did a lot of ETL to take data from different systems and combine it into an enterprise data warehouse. That is what happens in most organizations anyway. I think everyone has always had this wish of being able to virtualize everything so you don't have to move data. It works, at least in PowerPoint, but I don't think it works in reality, especially not with large data volumes. So you will need to move or duplicate data once, but if you only do it once, then I think it's OK.

Yep. It's loaded up.

Yes. So I create this relationship between my fact
table and my customer dimension, and let's hope that the next one is a little faster. So, my product dimension here. I don't know why it was a little slow; maybe it signed out. And then I do the same for my date dimension as well. I'm not going to create the prettiest data model in here, we don't have time for that, but I want to do one thing: I want to hide all the columns in my fact table, because we should only have measures in our fact table. Then I will create one measure. Notice that all this is working in the service; you could not do this in the service two or three months ago. We will call this Total Sales, and it's going to be a very simple sum of sales amount, so let's just keep it at that for now.

So now I have this data model, which looks like a data model that you create in Power BI Desktop. The difference is that the files here are directly accessible in the report, as you'll see in a second, and I only moved them once, as we just discussed. That's the beauty of it. What I will show you once we have created a report is that there's a mechanism for making new data in your lakehouse available in your dataset. By default it will be available immediately, but you can also choose to determine manually when the new data should be available.

All right, so it will automatically include new tables or new columns?

Well, new data.

New data in your columns. OK. So if you load yesterday's data today, and tomorrow you load today's data, you can freeze it at the data as of right now and not include new data?

Yeah, I'll show you the settings. So now I have created my measure in my fact table, and from in here I can create a new report. One thing I should probably show is that the table has this blue dotted line here. Oops, it's not working with the zoom, but you can actually see here that the storage
mode is Direct Lake. So that is a thing.

And it's not something you're setting?

No, no. So this report here is just an ordinary report, and this is my dataset. Since I only have a measure visible here, my fact table is on top. I add Total Sales, it's loading, and then let's just show it by calendar year. As I talked about before, I'm loading this into memory as I select the columns. This is not a large dataset. If you saw the demos at Build, they were using some of the Microsoft sales data, which is gigabytes of data, and it still works blazingly fast. So it works as it's supposed to. I can save this, and I can call it my first Delta, sorry, Direct Lake report.

So now we have basically done everything that we talked about. We loaded our data into the lakehouse from our source, we created our data model using Direct Lake, and we created a report on top of that. That was what I promised, and now you're seeing that it is actually working. You can do this now.

Going back to our workspace, we now have a dataset, and it has this default name; I could change that, of course. And then I have my report, which is running on top of the new dataset. If I go into the settings of my dataset, and you can also see that I can open the data model here, which was the view we were looking at before, then I have some data source credentials. These credentials are not for my Azure SQL database; they are for my Delta Lake, or lakehouse. And then under Refresh we now have a new option, and that's the top option. It doesn't fit on the screen when I zoom, but this is the one that says "Keep your Direct Lake data up to date". When this is turned on, which it is by default, then when I load new data from my Azure SQL database into, in most cases, the fact table, the data will be available in the dataset immediately. If I turn this one off then
I will need to invoke, let's say, a refresh on my dataset. But the refresh is just doing something that we call framing: getting the new data that I loaded.

So it's never a full refresh? You just get the new data, or something happens in the engine?

Whether it's a full refresh or not depends on how your pipelines are working. If you like to do a full refresh, then you do that in your pipelines, but I think most organizations will do incremental refresh on the back end as well. So that's the way it works.

That's really cool.

So you don't need to configure a refresh in here. And all the APIs that work against a Power BI dataset also work here.

I heard you say the word "immediately". What is "immediately"? Are we talking minutes, are we talking seconds?

We don't have time to do that demo now, but it is instant. If I go into the report and click Refresh once new data is loaded, well, I don't have new data right now, but when I click Refresh once new data has been loaded, it will be available.

OK. Do we know the interval? If you had it set to auto-update, how often would it check for new data?

It's a new engine, and I don't know the exact details on that. But we call this transcoding: it's taking the data from the files into the memory of your dataset. So something happens in the transcoding that moves the data when you need it.

OK, so in a push kind of manner? It arrives and then it knows to push it onward?

No, I think it's more of a pull. When I open this report, let's say that I have three visuals that are querying five columns, and my dataset has ten columns. Then it will load these five columns into my dataset.

Right.

Then if I go to another report page that uses another two columns, it will load those two columns.

Right. And so again
the best of DirectQuery, in a sense: it notices that a specific visual needs some data, and then it checks for that data in your lakehouse.

Yes, and it loads it into memory. The eviction part is also very important: if you haven't been using a column for a while, then we unload it from memory.

Nice.

Right now everything is done automatically and there's no configuration, but that's the way it works.

Really nice. Cool. I'm a bit blown away, to be honest. This is very, very promising.

I don't have more demo, and I don't know if there are more questions. But as you have said a number of times, this is available today. You don't even need to have your own data to play with this: we have a lot of sample data that you can start using today to get a feel for how it's working, and then eventually, when you're ready, start playing with your own data. And again, we just launched it in public preview, and the recommendation from Microsoft is not to go into production with preview features. It's a preview. But we want to learn from you who have the actual scenarios, so please play with it and bring us feedback. That is the way we can improve and make the product better.

Cool. Well, I think this has been amazing, and I'm much more mind-blown than before about what Direct Lake will mean for the Power BI world going forward. It's the best new thing since sliced bread, I think.

Yeah. Let's be honest, that is the most significant element of Fabric that has been launched now. We have also talked about Copilot, which will also do a lot of great things for Power BI, but that is not available yet. It will be available initially in private preview for a selected group of customers, and then
eventually as public preview. And we also have the Git integration, which of course also applies to Power BI. If you saw the session from Build with [names unclear] and Rui Romano, they actually show how you can use the Git integration for DevOps, natively built in.

Amazing, amazing. So if, before this session, I thought that Microsoft Fabric was going to be a huge thing, and it is, while for Power BI we "only" had Direct Lake, then at least you changed one person's mind here, because to me it seems like Direct Lake will be the revolution of Power BI.

Yeah, absolutely.

Well, thank you so much for coming here and showing these things. It's been an absolute pleasure. I still have goosebumps and chills and I don't know what. This was incredible, and I can't wait to see what we will find out going forward about Direct Lake and Power BI in general. And thank you for tuning in. We have the fourth and last session in around 20 minutes, at 2 pm Central European Time, so tune in and hear more about what we should think about for our organization in terms of capacity and pricing, collaboration, and all the important things we should also ask ourselves beyond the new gadgets and the amazing technologies. Nice to have you all. Enjoy, see you later. Bye, guys.
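
[Editorial sketch] The behaviour described in the session (columns loaded on demand when a visual queries them, unused columns evicted, and a fallback to DirectQuery when the data cannot be held in memory) can be pictured with a small toy model in Python. This is not how the Fabric engine is implemented; the speaker notes that the exact rules are not documented. The class name, the memory budget, the column sizes, the least-recently-used eviction policy, and the fallback condition are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyColumnCache:
    """Toy model of on-demand column loading (not Fabric's real engine)."""

    def __init__(self, budget_mb, column_sizes_mb):
        self.budget_mb = budget_mb              # hypothetical memory budget
        self.column_sizes_mb = column_sizes_mb  # hypothetical column sizes
        self.resident = OrderedDict()           # column -> size, least recently used first

    def used_mb(self):
        """Total size of the columns currently held in memory."""
        return sum(self.resident.values())

    def query(self, columns):
        """Serve a query touching `columns`; return the mode that served it."""
        wanted = list(dict.fromkeys(columns))   # drop duplicates, keep order
        needed = sum(self.column_sizes_mb[c] for c in wanted)
        if needed > self.budget_mb:
            # Assumed fallback rule: the requested columns cannot all fit.
            return "DirectQuery fallback"
        for col in wanted:
            if col in self.resident:
                self.resident.move_to_end(col)  # mark as recently used
                continue
            size = self.column_sizes_mb[col]
            while self.used_mb() + size > self.budget_mb:
                # Evict the least recently used column this query does not need.
                victim = next(c for c in self.resident if c not in wanted)
                del self.resident[victim]
            self.resident[col] = size           # load the column into memory
        return "Direct Lake"


# Ten 10 MB columns plus one oversized column, with a 60 MB budget.
sizes = {f"col{i}": 10 for i in range(10)}
sizes["huge"] = 100
cache = ToyColumnCache(budget_mb=60, column_sizes_mb=sizes)

page1 = cache.query(["col0", "col1", "col2", "col3", "col4"])  # first report page
page2 = cache.query(["col5", "col6"])                          # second report page
too_big = cache.query(["huge"])                                # cannot fit in memory

print(page1, page2, too_big)
print(list(cache.resident))
```

The example mirrors the one given in the talk: a first report page touches five of ten columns, a second page touches two more, and only the columns actually queried are ever loaded. The talk describes eviction as happening when a column has not been used "for a while"; the sketch substitutes eviction under memory pressure because that is easier to show in a few lines.
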