how to build a data lake on-premise

You would [00:34:00] have multi-protocol access; there's a lot of engineering that's gone into building the FlashBlade from the ground up. Good evening, wherever you may be. So let's get started. Navin Albert is a Sr. It's on Medium, I can throw that in the chat, and we also have a glossy solution sheet to walk you through the solution and some of the benefits to you. We want you to keep the storage that you have and never pay for storage that you already bought. So basically what the user [00:33:30] is asking, what the customer is asking, is: how is it different from just putting SSDs together? Right? The second shift that we're seeing is, of course, with the cloud, people are moving towards object storage with more unstructured data. [00:26:00] And finally it simplifies operations as you scale. Like I said, Pure [inaudible 00:26:05] is managed from the cloud, it can be consumed as a service, it's completely storage as a service, you only pay for what you use and you never have to be down for any upgrade or patching, and even if you need to do a controller upgrade, that's all covered with Pure's Evergreen guarantee. All right, so that's all the questions we have time for in the session. So box. Naveen: [crosstalk 00:35:52] That's essentially what I answered. [00:10:00] So let's talk about the infrastructure underlying these cloud data lakes on premises, see how we build those, and give you some actual examples of how to build those, right? Second, it needs to be an intelligent architecture built on today's technologies and today's storage demands: flash, [00:16:00] right?

So [00:33:00] all right, let me copy the link. There's a lot of design that has gone into FlashBlade to create the three things, right? Thank you guys. What we're seeing in 2020 and beyond, especially with innovators like Dremio, is you're [00:09:00] seeing Cloud Data Lakes that are built on open data, where there's a separation between compute and data, where you have an open data layer on top of your storage that may be built on open metadata standards, open file formats like Parquet, and open table formats such as Delta Lake and Iceberg and other data formats. Then you've got this open data layer on top of your storage, [00:09:30] and that open data layer is accessed by various applications via Dremio, Spark or [inaudible 00:09:37] or whatever the application may be. So [00:00:30] we're going to start with some challenges with legacy data structure today, and how a modern data architecture solves some of these problems. Then we're going to talk about some requirements for modern infrastructure to create these Cloud Data Lakes on premises, then we'll show you how to accelerate data insights at your organization, and finally, we'll conclude with some pointers to where you can find more technical, [00:01:00] in-depth resources about what I'm talking about, to show you some of the proof points and some examples of how other people have done it. So let's get started. It's got something called SafeMode, which locks it against ransomware attacks, so you bring consistent [00:25:00] performance, security and everything. FlashBlade is managed from the cloud, so it's very simple to manage, and it can be managed with the APIs and the latest [inaudible 00:25:10], so you can just forget about managing storage.

First, we have unpredictable performance. You've got data pipelines that service various teams with various requirements, and their jobs [00:04:00] might be slow, their queries might be slowing them down. Anybody that has a query that's stuck is going to just give up and not use the system, right? So your users, your business leaders and your customers are impatient and they want predictable performance, but it's hard to tune every system and figure out where the bottlenecks are: whether it's a latency bottleneck or a throughput bottleneck, or just a process that's stuck. It's hard to [00:04:30] find out where the performance is bad. Okay.

Note: Readers who are not yet ready but plan to install an on-premise data lake can skip the middle part of the article and read only the Introduction and the Conclusion at the bottom. Readers who are interested in an AWS data lake should refer to the article on migrating on-premise data to an AWS data lake. Finally, from an organizational perspective, from an environmental perspective, you're seeing security becoming a big concern, because that data is now the new oil, that is your IP, and you have to protect [00:14:00] it. There are ransomware attacks everywhere, locking up your data and demanding ransom, and so you want to keep it safe. A RAID array of SSDs is that direct attached storage architecture that we spoke about, where you have compute and storage co-located in a single node and you're just scaling the nodes, and we spoke about some of the disadvantages of that, right? You can put a bunch of SSDs together and make it work for a few terabytes of data. There's MLOps, that's also a super big buzzword in the industry right now. If we didn't get your questions, I think we've got them all, but if we didn't, then you can hit up Naveen in Slack, [00:31:00] but before you go, we would appreciate it if you would please fill out the super short Slido session survey, which you'll find in the chat. The next session is coming up; I think we have a panel actually, or a keynote, a fireside chat, I believe. As you support various use cases and more data sources, you go from simple dashboards to machine learning, to actually [productionizing 00:25:31] [00:25:30] machine learning based software, right?
It also needs dynamic scalability, [00:17:30] so as you scale data, you are usually faced with more complexity and more performance issues, and you don't want to deal with that. As you scale your data, you don't want to have downtime, and the performance should go up with scale. Unified Fast File and Object is a platform that is engineered from the ground up with flash to deliver simplicity. Literally nobody wants to [00:19:00] deal with storage; it just needs to work and it needs to be out of sight. So it delivers simplicity, and at the same time delivers the multi-dimensional performance that you need for today's unstructured analytics workloads. We call this category of devices Unified Fast File and Object Storage, and Pure Storage FlashBlade just happens to be the industry-leading platform for that. Let's say you have a certain analytics cluster. You can either add higher capacity nodes, [00:06:00] and when you add higher capacity nodes, what's going to happen is that when one of those nodes fails, it's going to cause a huge amount of rebalancing in your cluster, especially with direct attached storage, right? I'm talking about direct attached storage, where you have a hyper-converged infrastructure, where you have nodes.
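To put rough numbers on the rebalancing point, here is a minimal Python sketch. It is not from the talk: the function name, the 70% fill ratio, the node sizes, and the assumption that lost data is spread evenly over the surviving nodes are all hypothetical and chosen only for illustration.

```python
def rebalance_load_tb(node_count, capacity_per_node_tb, fill_ratio=0.7):
    """Estimate how much data must be re-replicated after one node fails.

    Assumes (hypothetically) that every node is equally full and that the
    lost data is re-spread evenly across the surviving nodes.
    """
    if node_count < 2:
        raise ValueError("need at least two nodes to rebalance")
    lost_tb = capacity_per_node_tb * fill_ratio       # data held by the failed node
    per_survivor_tb = lost_tb / (node_count - 1)      # extra data each survivor absorbs
    return {"lost_tb": lost_tb, "per_survivor_tb": per_survivor_tb}


# Same raw capacity (1,000 TB), two hypothetical cluster shapes.
few_big = rebalance_load_tb(node_count=10, capacity_per_node_tb=100)
many_small = rebalance_load_tb(node_count=100, capacity_per_node_tb=10)

print("10 x 100 TB nodes:", few_big)
print("100 x 10 TB nodes:", many_small)
```

Under these assumptions the cluster with fewer, larger nodes has to move ten times as much data per failure, while the cluster with many small nodes sees failures more often. That is the trade-off described above for direct attached storage.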

Let's get started with the agenda.

Okay, thanks Naveen. That's the value that [inaudible 00:35:18]. And finally, multi-protocol support: you don't want to bank all your dollars on one particular protocol. That's the old architecture, the old way of doing things, [00:28:30] the hyper-converged architecture, where you have a fixed amount of compute and a fixed amount of storage, and if you need to add storage, the compute just comes along with it. Let's say I have very few queries, but I'm getting more data and I need to add storage. I'd have to add a couple of extra nodes there, right?
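The coupling described here can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the function names, the 50 TB and 32 cores per node, and the workload figures are hypothetical and do not describe any real product.

```python
import math


def hyperconverged_nodes(data_tb, cores_needed, tb_per_node, cores_per_node):
    """Nodes to buy when each node bundles a fixed amount of storage and compute."""
    for_storage = math.ceil(data_tb / tb_per_node)
    for_compute = math.ceil(cores_needed / cores_per_node)
    # Whichever resource runs out first dictates the node count.
    return max(for_storage, for_compute)


def disaggregated_units(data_tb, cores_needed, tb_per_storage_unit, cores_per_node):
    """Storage units and compute nodes sized independently of each other."""
    return {
        "storage_units": math.ceil(data_tb / tb_per_storage_unit),
        "compute_nodes": math.ceil(cores_needed / cores_per_node),
    }


# Hypothetical workload: lots of data, very few queries.
nodes = hyperconverged_nodes(data_tb=500, cores_needed=40, tb_per_node=50, cores_per_node=32)
idle_cores = nodes * 32 - 40
split = disaggregated_units(data_tb=500, cores_needed=40, tb_per_storage_unit=50, cores_per_node=32)

print("hyper-converged nodes:", nodes, "idle cores:", idle_cores)
print("disaggregated:", split)
```

In this made-up case the hyper-converged cluster is sized entirely by storage, so most of the compute that "comes along with it" sits idle, whereas the disaggregated layout buys only the compute the queries need.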

The storage that I would build to meet these requirements would be an object store. It'll be capable of many things, so let's see what capabilities we should bring here. You're not creating pipelines for the sake of creating data pipelines. You may encounter new tools, you want to use [00:03:00] the latest and greatest tools, and you want to allocate the right amount of resources to the right project at the right time, right? [00:05:00] That's going to be a problem for you to be agile and create value quickly for your end user teams. And finally complexity, management complexity. You'd have to have people who are extremely technically smart to manage petabytes and petabytes [00:35:00] of data storage, and with Pure you can literally set it and forget it. It will just exist and you don't have to manage it, you don't have to tune performance for it, it's just going to keep delivering that simplicity, performance and scale. That's it. If you have more nodes, then you have frequent failures. With hundreds of nodes, again, managing hundreds of nodes [00:06:30] is complex, patching them and securing them, and there are going to be lots of failures happening all the time. Either one has its problems. All of these add cost and complexity, and you guys are well aware of that.
You don't want to pay for storage that you did not use. [inaudible 00:16:45] Let me pay only for the gigs that I use today; let me not plan capacity for five years and then buy everything today and spend all my money up front. People are very operationally focused, so only pay for [00:17:00] what you use. And it needs to be reliable and always available. Even if you're doing upgrades or you want to add capacity, you don't want to take the storage down. It needs to be always available, no matter what you're doing, upgrades or patches, and the data needs to be protected against ransomware attacks and against any kind of failure scenario. Okay. [inaudible 00:10:24] Spark, Dremio, [inaudible 00:10:26], or whatever you're using. You [00:10:30] could be doing data science environments, you could be streaming real-time analytics, or you could be doing scale-out SQL analytics. Next, it needs to be cloud ready. Even if you're on premises, it needs to be an agile infrastructure, which [00:16:30] gives you the flexibility to bring compute to the data, with disaggregated compute and storage, and also provides you consumption choices that are cloud-like, right? So what you want to do is run some nodes with just the operating system and the basic functions on the local SSDs, on the local drives, and keep all your data on a centralized file and object store that we call UFFO. That way you create an open data layer that can be used by any application, so you're not locking yourself into silos. To me, that's the difference between FlashBlade and a RAID array of SSDs. Let's go ahead and open it up for Q and A. Good afternoon.
I'm sure you've seen several slides like this throughout this conference. Everybody starts with one of these slides, and everybody knows that today's environment is in silos. You have data warehouses, you have a team working on streaming analytics, there's a backup copy [00:01:30] of some data somewhere, a data lake, there's a team working on AI and ML, and many times you have to create copies of your data in all of these different environments. These different environments have different teams managing them, different levels of service, different reliability standards and different security, right?

Thank you so [00:31:30] much for the session. And again, go check out that Field Day by Brian Gold from Pure Storage; he's going to explain how we built this ground-up architecture to scale. It's clear that in 10 years, as forward-thinking companies say, most of the code generated would be AI and ML code. Below the Kubernetes layer, you're going to have a layer that's for data management services for Kubernetes. So as a container is spun up and spun down, the data management services layer is going to provide the storage to the Kubernetes layer, and then you're going to have a layer, [00:11:30] which is your modern data lake layer, which is based on open data formats. This software layer is going to be built on top of block or object storage, or it could be more legacy systems; it's going to be built on a [inaudible 00:11:49]. So let's double click into that storage layer. I'm from Pure Storage, so obviously I'm going to double click into that storage layer and find out: what are some [00:12:00] of the requirements of that storage layer in this modern data analytics world? And what are some of the key market drivers, actually not just for the storage layer, but for modern data delivery today? All right. But as you start scaling, you're going to see all these problems with complexity and performance.

This is a fantastic shift; it really brought elasticity and agility to the cloud world. [00:36:30] See you on Slack. Dave: Yep. Naveen: Bye. This is what you have in mind, so let's look at [00:11:00] storage and how we bring this paradigm to storage. We got one more question here. Okay. Folks, if you have any other questions, let's go ahead and get them in. You were able to do this with a combination of a Hive source of [00:23:00] data and Dremio and Flash registry. We've kind of divided this into three trends, depending on whether you look at it from a business angle or from a data angle. [00:12:30] The first trend is that workloads are shifting towards more AI and ML workloads, and your data is more machine generated data.

It could be some kind of deep learning software. You have to keep performance tuning, and users are always complaining about query speeds not being there or something not functioning, so you have to keep performance tuning. So I think we should [inaudible 00:36:14], and I did post the link there. You might just go check out Slack to see if anybody else is there, but at this point, I think we should just wrap it up. Or what are some of the challenges that we face today? So let me introduce to you Unified Fast File and Object Storage. This is not the name of a product; this is what we call a storage platform that meets the requirements that I just outlined, right? With containers, you get that elasticity and agility that you need.
It can be used for building, automating, and protecting your cloud native applications, with modules for core storage, backup, disaster recovery, application data migration, security, and infrastructure automation. All of that is taken care [00:21:00] of with this hundred percent software solution called the Kubernetes data services platform. And if you're using multiple clusters, different types of clusters, you may be underutilizing resources in one area and overutilizing resources in another area, and you cannot keep trying to rebalance those. It works with any storage, any infrastructure; with FlashBlade, you're getting a much better version, a version built much more from the ground up [00:30:30] for your specific needs, low latency and other characteristics, simplicity and other characteristics. [00:27:30] All right, let me get set up here. So you can bring compute to whatever you need, rather than allocating specific compute silos. Good morning. First, Dremio is very versatile: you can access data [00:21:30] in any area with any protocol, so with the mix of NFS and S3, data stored in FlashBlade can be accessed through NFS or S3 or SMB. You can use the multi-protocol approach to do batch, streaming, random access, whatever the workload might be. Also, you can start small and then just keep slipping in blades with no [00:22:00] downtime, with no need to do anything. That [00:29:30] answers the question.
You had nodes, these hyper-converged nodes, and you've given a certain number of nodes to a particular application, whether it's Hadoop or Spark or whatever [00:08:00] application that may be, and you had these nodes that you just [inaudible 00:08:06] to scale, like hundreds of nodes, to 200 nodes, 300 nodes. So this is what most data teams want and we know that, but what are the infrastructure challenges that are preventing us from getting there?

You cannot do that and just manage it with one or two guys and forget storage, right? [00:22:30] How can this help you migrate away from older infrastructure to a more modern architecture? This slide shows you that.
