Data Lake Architecture vs. Traditional Data Warehouse Architecture
A data lake is a repository for storing very large volumes of structured, semi-structured and unstructured data. It places no restriction on the format of the data that can be stored, and file sizes are limited only by the underlying storage system.
Data lake architectural components
- Data ingestion- It contains connectors to extract data from multiple data sources (databases, servers, emails, etc.) in a variety of formats (structured, semi-structured and unstructured). It also provides data curation options.
- Data storage- It stores raw and curated data in any format, and allows the data to be compressed and encrypted.
- Security- Security is enforced at every stage of the information flow in the data lake: data ingestion, data storage, data consumption and data discovery.
- Data quality management- The data lake implementation provides options for setting data quality rules, reporting on data quality and remediating issues.
- Metadata management- The data lake has mechanisms for data audits, data lineage checks, data lifecycle management and policy enforcement.
- Data auditing- The data lake provides options for complete data auditing and for recording data transformations from a risk and compliance perspective. It helps audit who changed a data element, how and when.
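As a rough illustration of how the data quality and data auditing components might work together, the sketch below checks each incoming record against a few rules and writes an audit entry saying who ingested it and when. All names here (QualityRule, AuditLog, the two sample rules, the record fields "id" and "amount") are hypothetical and chosen only for the example; a real data lake platform would supply its own mechanisms.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class QualityRule:
    """A named data quality rule (hypothetical structure for this sketch)."""
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes


@dataclass
class AuditEntry:
    user: str
    action: str
    record_id: str
    timestamp: str


@dataclass
class AuditLog:
    """Keeps a trail of who did what to which record, and when."""
    entries: list = field(default_factory=list)

    def record(self, user: str, action: str, record_id: str) -> None:
        now = datetime.now(timezone.utc).isoformat()
        self.entries.append(AuditEntry(user, action, record_id, now))


# Two illustrative rules; the field names are assumptions.
RULES = [
    QualityRule("has_id", lambda r: bool(r.get("id"))),
    QualityRule(
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def ingest(record: dict, user: str, log: AuditLog) -> list:
    """Check a record against the rules and audit the outcome.

    Returns the names of the rules that failed, so they can be
    reported and remediated later.
    """
    failures = [rule.name for rule in RULES if not rule.check(record)]
    action = "flagged" if failures else "ingested"
    log.record(user, action, str(record.get("id")))
    return failures
```

Calling ingest({"id": "a1", "amount": 10}, "alice", AuditLog()) returns an empty list, while a record with a missing id or a negative amount returns the names of the failed rules and is logged as flagged rather than silently dropped.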
Flow of information in a data lake
The architecture is organised into multiple tiers:
- Ingestion Tier- This layer ingests data in various formats.
- Storage Tier- This layer stores the raw data.
- Insights Tier- This layer provides insights into the input data.
- Distillation Tier- This layer consumes data from storage and converts it into a structured format for easier analysis.
- Processing Tier- This layer runs analytical algorithms and processes user queries.
- Presentation Tier- This layer presents the results and analysis.
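The flow through these tiers can be sketched in a few lines of code. The example below is a toy model only: the in-memory list standing in for storage, the function names and the "city" and "sales" fields are all assumptions made for illustration. It shows raw payloads in two formats being ingested and stored untouched, then distilled into structured rows, queried and presented.

```python
import csv
import io
import json

# Storage tier: raw payloads are kept exactly as received.
raw_store = []


def ingest(payload: str, fmt: str) -> None:
    """Ingestion tier: accept data in any format and tag it with that format."""
    raw_store.append({"format": fmt, "payload": payload})


def distill() -> list:
    """Distillation tier: convert raw payloads into uniform structured rows."""
    rows = []
    for item in raw_store:
        if item["format"] == "json":
            obj = json.loads(item["payload"])
            rows.append({"city": obj["city"], "sales": float(obj["sales"])})
        elif item["format"] == "csv":
            for rec in csv.DictReader(io.StringIO(item["payload"])):
                rows.append({"city": rec["city"], "sales": float(rec["sales"])})
    return rows


def total_sales_by_city(rows: list) -> dict:
    """Processing tier: answer a user query over the structured rows."""
    totals = {}
    for row in rows:
        totals[row["city"]] = totals.get(row["city"], 0.0) + row["sales"]
    return totals


def present(totals: dict) -> str:
    """Presentation tier: format the results for display."""
    return "\n".join(
        f"{city}: {amount:.2f}" for city, amount in sorted(totals.items())
    )


# Hypothetical sample data in two different source formats.
ingest('{"city": "Pune", "sales": 120.5}', "json")
ingest("city,sales\nPune,30\nDelhi,50\n", "csv")
report = present(total_sales_by_city(distill()))
```

Note that structure is applied only in distill(), after the data has already been stored in raw form. This is the main contrast with a warehouse, where data is transformed before it is loaded.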
Traditional data warehouse architecture
It consists of three tiers:
- 1st Tier (Bottom Tier)- It contains the database server, which extracts data from the data sources.
- 2nd Tier (ETL Tier or Middle Tier)- The data is extracted, transformed and loaded into the enterprise data warehouse and then into data marts.
- 3rd Tier (Client Tier)- The data prepared for analysis is analysed by high-level data analytics tools and presented as reports.
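The three tiers above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production ETL pipeline: the sample source rows, the table names (warehouse_sales, mart_revenue_by_region) and the transformation rules are all assumptions made for the example.

```python
import sqlite3

# Bottom tier: rows extracted from a hypothetical operational source.
source_rows = [
    ("2024-01-05", "north", "  Widget ", 3, 9.99),
    ("2024-01-06", "south", "gadget", 1, 24.50),
    ("2024-01-06", "north", "widget", 2, 9.99),
]


def transform(row):
    """Middle tier: clean the values and derive revenue before loading."""
    sale_date, region, product, quantity, unit_price = row
    return (
        sale_date,
        region.upper(),
        product.strip().lower(),
        quantity,
        round(quantity * unit_price, 2),
    )


conn = sqlite3.connect(":memory:")

# Load the transformed rows into the warehouse table.
conn.execute(
    "CREATE TABLE warehouse_sales ("
    "sale_date TEXT, region TEXT, product TEXT, quantity INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?, ?)",
    [transform(r) for r in source_rows],
)

# Then derive a subject-specific data mart from the warehouse.
conn.execute(
    "CREATE TABLE mart_revenue_by_region AS "
    "SELECT region, SUM(revenue) AS revenue "
    "FROM warehouse_sales GROUP BY region"
)

# Client tier: a reporting query against the data mart.
report = conn.execute(
    "SELECT region, ROUND(revenue, 2) FROM mart_revenue_by_region ORDER BY region"
).fetchall()
```

Here the schema is fixed and the data is cleaned before it is loaded, so the client tier only ever queries structured, prepared tables.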