If you have worked in R or Python to build machine learning models, you know how time-consuming it can be to choose the right syntax, perform EDA, select features, decide which packages to import, and settle on hyperparameter values.
Google Cloud Platform has made this easier not just for coders but also for those who do not know how to do machine learning in R or Python. Google's BigQuery ML comes to the rescue: it requires only a knowledge of SQL to build and deploy ML models and generate insights from almost any kind of data.
BigQuery is Google's managed data warehouse in the cloud. It is incredibly fast, scanning billions of rows in seconds, and it is comparatively inexpensive and easy to use. A BigQuery ML workflow broadly consists of:
Creating and training a model: You write a CREATE MODEL statement for any of the supported model types, including time-series (ARIMA), deep neural networks, imported TensorFlow models, and more. Running the statement splits the data into training and evaluation sets and trains the model.
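As a minimal sketch of that first step, the statement below trains a logistic regression model. The dataset, table, and column names (`mydataset.tip_model`, `mydataset.taxi_trips`, `tipped`, etc.) are illustrative placeholders, not real resources:

```sql
-- Train a logistic regression model on a hypothetical taxi-trips table.
-- BigQuery ML automatically holds out an evaluation split by default.
CREATE OR REPLACE MODEL `mydataset.tip_model`
OPTIONS(
  model_type = 'logistic_reg',     -- the model type to train
  input_label_cols = ['tipped']    -- column the model learns to predict
) AS
SELECT
  trip_distance,
  passenger_count,
  payment_type,
  tipped
FROM `mydataset.taxi_trips`;
```

Everything in the SELECT except the label column is treated as a feature, so feature selection is largely a matter of which columns you include.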
Model evaluation: This step provides functions such as ML.EVALUATE, ML.CONFUSION_MATRIX, and ML.ROC_CURVE to evaluate the different types of models.
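Continuing the hypothetical model above, evaluation is a single query (the model name is again a placeholder):

```sql
-- Report metrics such as precision, recall, accuracy, and log loss
-- computed on the automatically held-out evaluation data.
SELECT * FROM ML.EVALUATE(MODEL `mydataset.tip_model`);

-- For classification models, a confusion matrix is also available.
SELECT * FROM ML.CONFUSION_MATRIX(MODEL `mydataset.tip_model`);
```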
Model prediction: Finally, prediction functions score whatever data you pass to them, with a function for each family of models: ML.FORECAST for time-series models, ML.PREDICT for regression and classification models, and ML.RECOMMEND for matrix factorization (recommendation) models.
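A sketch of scoring new rows with ML.PREDICT, still using the hypothetical names from the earlier example (`mydataset.new_trips` is an assumed table of unscored trips):

```sql
-- ML.PREDICT adds predicted_<label> columns to the rows you supply.
SELECT
  predicted_tipped,
  trip_distance,
  payment_type
FROM ML.PREDICT(
  MODEL `mydataset.tip_model`,
  (SELECT trip_distance, passenger_count, payment_type
   FROM `mydataset.new_trips`));
```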
BigQuery ML also provides functions for model and feature inspection, where you can use the built-in functions to check model diagnostics and see how valuable each feature is.
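Two of those inspection functions, applied to the same hypothetical model as above:

```sql
-- Learned weight for each feature of a linear/logistic model.
SELECT * FROM ML.WEIGHTS(MODEL `mydataset.tip_model`);

-- Summary statistics (min, max, mean, null count) for each input feature.
SELECT * FROM ML.FEATURE_INFO(MODEL `mydataset.tip_model`);
```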
Building and deploying models using just SQL could be a game changer for the data science market in the future. I personally cannot wait to get my hands dirty with Google's new tools and generate meaningful insights from data.
Written by Sanjula Kaur
Hadoop is an open-source software framework used to store and process big data. Hadoop uses clusters of computers to analyze big data sets in parallel. The distributed processing of data sets can