ETL process in a data warehouse.

August 16, 2021

ETL stands for Extract, Transform, and Load. It is a data processing flow that companies use to manage and make use of large amounts of data. It combines data stored in multiple locations, such as a data warehouse, a data store, or a data lake. Earlier, companies ran ETL tools on data stored on-premises, with both the storage and the software that loaded the data hosted locally. Nowadays, both the storage and the processing tools have moved to the cloud. ETL creates a pipeline through which data flows with the help of data processing tools. It is the complete process of retrieving structured or unstructured data from wherever the organization stores it, transforming it, and delivering it to a place where it can be used to solve business problems.

Extract:

Extract in ETL means retrieving data from one or more data sources (local, in the cloud, or both) into a processing environment.

Transform:

Transform means taking the extracted data and cleaning it: handling null values, removing duplicate rows or columns, removing outliers, joining multiple tables, or performing any other data-related operation.
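The operations listed above can be sketched in plain Python. This is only a minimal illustration, not a recommended implementation: the field names (order_id, customer_id, amount), the sample rows, the median fill rule, and the max_amount outlier threshold are all assumptions made up for the example.

```python
from statistics import median


def clean_and_join(orders, customers, max_amount=10_000.0):
    """Deduplicate orders, fill null amounts, drop outliers, join to customers."""
    # 1. Remove duplicate rows, keeping the first row seen for each order_id.
    unique = {}
    for order in orders:
        unique.setdefault(order["order_id"], order)
    rows = list(unique.values())

    # 2. Handle null values: a missing amount is replaced by the median amount.
    known = [r["amount"] for r in rows if r["amount"] is not None]
    fill = median(known) if known else 0.0

    # 3. Build a lookup so each order can be joined to its customer name.
    names = {c["customer_id"]: c["name"] for c in customers}

    result = []
    for r in rows:
        amount = fill if r["amount"] is None else r["amount"]
        if amount > max_amount:  # 4. Remove outliers above the assumed threshold.
            continue
        result.append({
            "order_id": r["order_id"],
            "customer": names.get(r["customer_id"], "unknown"),
            "amount": amount,
        })
    return result


# Hypothetical extracted data: one duplicate, one null, one outlier.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 20.0},
    {"order_id": 1, "customer_id": 10, "amount": 20.0},
    {"order_id": 2, "customer_id": 11, "amount": None},
    {"order_id": 3, "customer_id": 10, "amount": 40.0},
    {"order_id": 4, "customer_id": 12, "amount": 999999.0},
]
customers = [
    {"customer_id": 10, "name": "Asha"},
    {"customer_id": 11, "name": "Ben"},
]

cleaned = clean_and_join(orders, customers)
```

With this sample input, cleaned holds three rows: the duplicate of order 1 and the outlier order 4 are gone, and order 2 has its missing amount filled in. On real data, the fill rule and the outlier threshold would depend on the business problem.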

Load:

Load means pushing the transformed data to a target storage location or, in some cases, to a machine learning model.
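To see the three stages working together, here is a small end-to-end sketch using only the Python standard library. It is an illustration under stated assumptions: the two CSV strings stand in for a local source and a cloud source, an in-memory SQLite database stands in for the target store, and the readings table and its columns are hypothetical names.

```python
import csv
import io
import sqlite3

# Hypothetical sources: in a real pipeline these would be files or cloud objects.
LOCAL_CSV = "id,city,temp_c\n1,Pune,31.5\n2,Delhi,\n"
CLOUD_CSV = "id,city,temp_c\n3,Mumbai,29.0\n1,Pune,31.5\n"


def extract(*sources):
    """Extract: read every source into one list of row dictionaries."""
    rows = []
    for text in sources:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def transform(rows):
    """Transform: skip rows with a missing reading or a repeated id, cast types."""
    seen, out = set(), []
    for r in rows:
        if r["id"] in seen or not r["temp_c"]:
            continue
        seen.add(r["id"])
        out.append((int(r["id"]), r["city"], float(r["temp_c"])))
    return out


def load(rows, conn):
    """Load: push the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(id INTEGER PRIMARY KEY, city TEXT, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real target store
load(transform(extract(LOCAL_CSV, CLOUD_CSV)), conn)
loaded = conn.execute(
    "SELECT id, city, temp_c FROM readings ORDER BY id"
).fetchall()
```

Keeping each stage in its own function means a source or a target can be swapped without touching the other two stages, which is the same separation a cloud pipeline relies on.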

Many tools are available for ETL on-premises. For ETL in the cloud on Google Cloud Platform (GCP), we use the services that GCP provides.

The procedure for implementing an ETL pipeline on GCP is described below.

Storage Services:

Data can be stored in Google Cloud Storage, Cloud Filestore, or BigQuery.

Google Cloud Storage is an object-based online data store that offers affordable, reliable, highly available storage for many types of files. Cloud Filestore is NFS-based file storage. It is mounted on virtual machines by attaching it to Compute Engine instances, and it provides a low-latency file system for applications that need fast access. BigQuery is a serverless data warehouse with analytical capabilities. It offers 10 GB of free storage, with affordable pricing for additional capacity.

Extraction and transformation services:

Google Cloud Dataflow is a data processing service that handles both batch and real-time (streaming) data. It is used for building data pipelines and transforming data. Cloud Functions is a fully managed, serverless environment for running code, and it supports several programming languages such as Java and Python. It lets us write custom, user-defined functions that can be used for extraction and transformation. Cloud Dataprep is a serverless service for visually exploring, cleaning, and preparing data. These are some of the services used to extract data from multiple data stores, which may be in the cloud or on local devices connected to the internet.

Loading services:

The processed data is the final data used to solve business problems. It can be fed to other services for many purposes. It can be passed to AI Platform, which uses it for training and testing ML models. Cloud SQL provides an efficient storage option for structured, tabular data. Cloud Datastore provides a scalable, fully managed NoSQL storage option. These services are used to load the data and then either process it for a specific purpose or store it for later use.

GCP provides fast and efficient services that can replace tools based on local systems. The ETL pipeline proposed in this blog can be used to build an efficient ETL pipeline that runs entirely in the cloud.