Sunday, May 8, 2022

data lake

  •  What is a data lake?

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits.


A data lake provides a scalable and secure platform that allows enterprises to: ingest any data from any system at any speed—even if the data comes from on-premises, cloud, or edge-computing systems; store any type or volume of data in full fidelity; process data in real time or batch mode; and analyze data using SQL, Python, R, or any other language, third-party data, or analytics application.

https://cloud.google.com/learn/what-is-a-data-lake#:~:text=A%20data%20lake%20is%20a,of%20it%2C%20ignoring%20size%20limits.



  • Four key differences between a data lake and a data warehouse


Data Lake Data Warehouse

Data Structure Raw Processed

Purpose of Data Not yet determined Currently in use

Users Data scientists Business professionals

Accessibility Highly accessible and quick to update More complicated and costly to make changes


Organizations often need both. Data lakes were born out of the need to harness big data and benefit from the raw, granular structured and unstructured data for machine learning, but there is still a need to create data warehouses for analytics use by business users


https://www.talend.com/resources/data-lake-vs-data-warehouse/#:~:text=Data%20lakes%20and%20data%20warehouses,processed%20for%20a%20specific%20purpose.