DATA LAKES AS A SERVICE

     A data lake is a data store for holding and processing large volumes of raw data in its native format, typically collected before datasets are used for analytics. Data Lake-as-a-Service solutions provide enterprise big data processing in the cloud, delivering faster business outcomes at relatively low cost.

     Data lakes are often used to collect raw data before datasets move into a production analytic environment, like a data warehouse. The main difference between a data lake and a data warehouse lies in how data is structured (or not) and stored, which in turn affects load times, pre-processing requirements, and analytic performance. A data warehouse is based on relational database technology, which can only store consistent, structured data. A data lake is based on technologies that let you store raw data and then incrementally apply structure as analytic requirements dictate, an approach often called schema-on-read. Data lakes typically offer fast ingest/write speeds and low-cost storage, because they are designed to manage high-volume, high-velocity raw data (think millions or billions of records per day). Their analytic capabilities vary widely from one technology to another.
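     To make the schema-on-read idea concrete, the sketch below uses plain Python; the file layout and field names are illustrative only, not tied to any particular product. Raw events land in the lake exactly as they arrive, and a lightweight structure is applied only when an analysis reads them.

import json
from pathlib import Path

# Hypothetical landing zone for raw events (illustrative path).
LAKE_DIR = Path("lake/raw/clickstream")
LAKE_DIR.mkdir(parents=True, exist_ok=True)

# 1. Ingest: write raw records exactly as they arrive; no upfront schema is enforced.
raw_events = [
    '{"user": "u1", "ts": "2016-03-01T10:00:00Z", "page": "/home", "ms": 120}',
    '{"user": "u2", "ts": "2016-03-01T10:00:05Z", "page": "/cart"}',  # missing "ms"
]
(LAKE_DIR / "events.json").write_text("\n".join(raw_events))

# 2. Analyze: apply structure only at read time, keeping just the fields this
#    analysis needs and tolerating records that do not fit the expected shape.
def read_page_views(path):
    for line in path.read_text().splitlines():
        event = json.loads(line)
        yield {"user": event.get("user"), "page": event.get("page")}

for row in read_page_views(LAKE_DIR / "events.json"):
    print(row)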

     “Data Lake-as-a-Service” is a data lake that leverages cloud resources, which are managed and maintained by a vendor “as a service.” It’s often advantageous to deploy data lakes in the cloud because of easy scalability for large data volumes and inexpensive storage, and because raw big data is increasingly generated in the cloud from sources like sensors, mobile apps, and social media.

     The performance and SLA requirements of a data lake depend heavily on its role within production data processes and on how important analytics is to a company’s success. If those processes affect the bottom line, reliable performance is critical. No one wants highly skilled, expensive data scientists and analysts spending time troubleshooting software, or waiting around, when they could be focused on analytics. This is another reason companies are using Data Lakes-as-a-Service, which often offer more predictable performance, built-in monitoring, and experts to help troubleshoot. That kind of monitoring can be hard to build yourself or to integrate with existing management systems.
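     As a rough illustration of the kind of SLA monitoring involved, the sketch below checks whether fresh data has landed in the lake recently enough; the paths and threshold are hypothetical, and a managed service would typically provide this sort of check out of the box.

import time
from pathlib import Path

# Hypothetical landing zone and SLA threshold, purely for illustration.
LAKE_DIR = Path("lake/raw/clickstream")
MAX_INGEST_LAG_SECONDS = 15 * 60  # SLA: new data must land at least every 15 minutes

def check_ingest_sla(lake_dir, max_lag_seconds):
    """Return True if the newest file in the lake is fresh enough to meet the SLA."""
    files = list(lake_dir.glob("*.json"))
    if not files:
        return False
    newest_mtime = max(f.stat().st_mtime for f in files)
    return (time.time() - newest_mtime) <= max_lag_seconds

if not check_ingest_sla(LAKE_DIR, MAX_INGEST_LAG_SECONDS):
    print("ALERT: ingest lag exceeds SLA; investigate the pipeline")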

     For the near future, most companies will have hybrid architectures, with some data and analytic processing in the cloud (say, for large raw datasets) and some kept on-premises (say, for highly regulated data). Given this trend, data movement and integration technologies are critical considerations when choosing a data lake technology or designing a hybrid architecture.
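     As a minimal sketch of one data movement pattern, the example below copies locally generated files into cloud object storage, assuming an Amazon S3-backed data lake; the bucket name, prefix, and local directory are placeholders.

import boto3
from pathlib import Path

# Hypothetical names: replace the bucket, prefix, and local directory with your own.
BUCKET = "example-data-lake"
PREFIX = "raw/onprem-exports/"
LOCAL_DIR = Path("/data/exports")

s3 = boto3.client("s3")

# Copy each locally generated file into cloud object storage, preserving file names,
# so cloud-side analytics can read the raw data without further transformation.
for path in LOCAL_DIR.glob("*.csv"):
    s3.upload_file(str(path), BUCKET, PREFIX + path.name)
    print(f"uploaded {path.name} to s3://{BUCKET}/{PREFIX}{path.name}")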