The volume of data and the demand for it have increased significantly in recent years, driven by the development of Big Data analysis technologies and the growing capabilities of IT infrastructure. Nowadays, people can also gain insights from their data and make more accurate predictions with machine learning models. Thus, a qualified data scientist is able to:
Education in data engineering is still limited in spite of its importance. In many cases, the only feasible way to get training in data engineering is to learn on the job, which can sometimes be too late. In this post I share my experience of passing the Google Cloud Professional Data Engineer Certification Exam without three years of practical experience.
Over the past few months, I have spent my time taking online courses. The topics included the basics of machine learning, SQL, data processing, the Hadoop ecosystem and Google Cloud. From my experience, I believe the foundation (collect, move/store, explore/transform) of the pyramid shown above is the most important part. You should know how to collect or access the raw data and how to properly design table schemas or build data pipelines (ETL, i.e. extract, transform, and load). Without a clean dataset, nobody can perform a good analysis or build a model on top of it.
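To make the extract, transform and load steps concrete, here is a minimal sketch using only the Python standard library. The raw CSV text, the orders table and its columns are made up for illustration, and an in-memory SQLite database stands in for whatever warehouse you actually use; a real pipeline on a cloud platform would look different.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: stray whitespace, mixed case, one missing amount.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names, cast types, drop rows with no amount."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records rather than guess a value
        clean.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return clean


def load(conn, records):
    """Load: create the target table with an explicit schema and insert rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text):
    """Run extract -> transform -> load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn


if __name__ == "__main__":
    conn = run_pipeline(RAW_CSV)
    for row in conn.execute("SELECT customer, amount FROM orders ORDER BY order_id"):
        print(row)
```

Declaring the column types and NOT NULL constraints up front is the small-scale version of designing the table schema properly: bad records are rejected or filtered before anyone builds an analysis on top of them.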
Because of the large volume of data and the need for low-latency analysis, cloud computing is preferred: the workload is distributed across many machines so that large datasets can be handled. Among the available platforms, I chose Google Cloud Platform (GCP), mostly because of its machine learning landscape (ahead of the pack and moving quickly). To obtain the certification from GCP, you should know how to use the appropriate product to store, process and analyze data, and you need knowledge of machine learning. You are also expected to know the IAM roles and how to monitor workflows on GCP using Stackdriver. My suggestion is to work through as many mock exam questions as you can and to dive deep into the GCP documentation.
There are many public datasets available online, and you should use them to test your knowledge. I suggest that you analyze a dataset and run some queries on it on your own, or write your own code to build a machine learning model from scratch.
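As an illustration of what building a model from scratch can mean at the smallest scale, the sketch below fits a straight line by ordinary least squares in plain Python. The distance and fare numbers are invented toy data, not taken from any public dataset, and the function names are my own; in practice you would replace the two lists with columns pulled from the dataset you are exploring.

```python
from statistics import mean


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Invented toy data: trip distance in km and the fare paid.
distance_km = [1.0, 2.0, 3.0, 4.0]
fare = [5.0, 7.0, 9.0, 11.0]

if __name__ == "__main__":
    model = fit_line(distance_km, fare)
    print("slope, intercept:", model)
    print("predicted fare for 5 km:", predict(model, 5.0))
```

Writing even a model this simple by hand forces you to check the data first (for example, that the inputs are not all identical), which is the same habit the exam questions on data quality are testing.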
If you know nothing about data engineering and you are interested in jumping into the world of data, start learning today using all the online resources and practice your skills with the public datasets.
Alpha Reply is the Reply company specialized in Finance, Data and Predictive Analytics. I joined Alpha in July after completing my Math PhD at Oxford and immediately started my training in Finance and Data Science.
How Alpha Reply can Help
Alpha Reply's mission is to excel in helping Financial Institutions become truly Data-Driven. Building on our extensive experience with incumbent banks and fintechs, we help our clients transform their business in the Risk & Compliance, Digital and Customer Engagement areas. For more information, contact us at alpha@reply.com.
References
Monica Rogati, "The AI Hierarchy of Needs", accessed on 4 September 2019.