The Google Data Engineering certificate covers the data engineering lifecycle, machine learning, Google case studies, and GCP’s storage, compute, and big data products. Here is part 1:
What is Data Engineering?
Data engineering enables data-driven decision making by collecting, transforming, and visualizing data. A data engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault-tolerance, scalability, fidelity, and efficiency of such systems.
A data engineer also analyzes data to gain insight into business outcomes, builds statistical models to support decision-making, and creates machine learning models to automate and simplify key business processes.
• Build/maintain data structures and databases
• Design data processing systems
• Analyze data and enable machine learning
• Design for reliability
Google Cloud Platform (GCP)
GCP is a collection of Google computing resources, which are offered via services. Data engineering services include Compute, Storage, Big Data, and Machine Learning.
The 4 ways to interact with GCP include the console, command-line-interface (CLI), API, and mobile app.
The GCP resource hierarchy is organized as follows. All resources (VMs, storage buckets, etc.) are organized into projects. These projects may be organized into folders, which can contain other folders. All folders and projects can be brought together under an organization node. Folders, projects, and organization nodes are where policies can be defined. Policies are inherited downstream and dictate who can access which resources. Every resource must belong to a project, and every project must have a billing account associated with it.
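The downstream inheritance rule can be sketched in a few lines of Python. The `Node` class and its method names are illustrative only, not any real GCP API:

```python
# Illustrative sketch of GCP-style policy inheritance: a policy defined on an
# organization node or folder flows down to every project and resource under it.
class Node:
    def __init__(self, name, parent=None, policies=None):
        self.name = name
        self.parent = parent
        self.policies = policies or []

    def effective_policies(self):
        # A node's effective policy set is its own plus everything inherited upstream.
        inherited = self.parent.effective_policies() if self.parent else []
        return inherited + self.policies

org = Node("org", policies=["org-admins can view all projects"])
folder = Node("analytics-folder", parent=org, policies=["analysts can query data"])
project = Node("sales-project", parent=folder)

# The project inherits both upstream policies even though it defines none itself.
print(project.effective_policies())
```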
Advantages: Performance (fast solutions), Pricing (sub-hour billing, sustained-use discounts, custom machine types), PaaS Solutions, Robust Infrastructure.
Hadoop
Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised that uses many computers to process the data (distributed). Such a group of machines is called a cluster, and clusters make up server farms. All of these servers have to be coordinated in the following ways: partitioning data, coordinating computing tasks, handling fault tolerance/recovery, and allocating capacity to processes.
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is comprised of 3 main components:
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data by partitioning data across many machines
• YARN: a framework for job scheduling and cluster resource management
• MapReduce: a YARN-based system for parallel processing of large data sets on multiple machines
An HDFS cluster is comprised of 1 master node; the rest are data nodes. The master node manages the overall file system by storing the directory structure and metadata of the files. The data nodes physically store the data. Large files are broken up into blocks and distributed across multiple machines; each block is replicated across 3 machines to provide fault tolerance.
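As a rough illustration of the block placement just described, the following sketch splits a byte string into fixed-size blocks and assigns each block to 3 distinct data nodes. The round-robin placement and tiny block size are simplifications; real HDFS placement is rack-aware and blocks default to 128 MB:

```python
# Illustrative sketch (not real HDFS code) of splitting a file into blocks
# and replicating each block on 3 distinct data nodes for fault tolerance.
BLOCK_SIZE = 4            # bytes per block; real HDFS defaults to 128 MB
REPLICATION = 3
DATA_NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(data: bytes):
    """Return {block_index: (block_bytes, [replica node names])}."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # Round-robin placement across the cluster; real HDFS is rack-aware.
        replicas = [DATA_NODES[(idx + r) % len(DATA_NODES)]
                    for r in range(REPLICATION)]
        placement[idx] = (block, replicas)
    return placement

layout = place_blocks(b"hello big data!")
```

Reassembling the blocks in index order recovers the original file, and losing any one node still leaves 2 copies of every block.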
MapReduce
Parallel programming paradigm that allows for processing of huge amounts of data by running processes on multiple machines. Defining a MapReduce job requires two stages: map and reduce.
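The two stages can be sketched as a single-process word count; a real framework distributes the map, shuffle, and reduce steps across many machines, but the logic is the same:

```python
# Minimal single-process sketch of the MapReduce model: map emits
# (key, value) pairs, a shuffle groups them by key, and reduce aggregates
# each group.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Aggregate all values seen for one key.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts["the"])  # → 3
```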
YARN- Yet Another Resource Negotiator
Coordinates tasks running on the cluster and assigns new nodes in case of failure. Comprised of 2 subcomponents: the resource manager and the node manager. The resource manager runs on a single master node and schedules tasks across nodes. The node manager runs on all other nodes and manages tasks on the individual node.
An entire ecosystem of tools has emerged around Hadoop, all based on interacting with HDFS.
Hive: data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs and returns results in the form of tables (not HDFS files).
Pig: high-level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs ETL into the data warehouse, while Hive queries the data warehouse to perform analysis (GCP: Dataflow).
Spark: framework for writing fast, distributed programs for data processing and analysis. Spark solves similar problems as Hadoop MapReduce but with a fast in-memory approach. It is a unified engine that supports SQL queries, streaming data, machine learning, and graph processing. It can operate separately from Hadoop but integrates well with it. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track lineage.
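Those three RDD properties can be illustrated with a toy class. This is not the PySpark API; the names below are made up for the sketch:

```python
# Toy illustration of RDD semantics: transformations are lazy (they only
# record lineage), datasets are immutable (each transformation returns a
# new object), and nothing runs until an action like collect() is called.
class ToyRDD:
    def __init__(self, source, ops=(), lineage=()):
        self.source = source
        self.ops = ops            # deferred transformations, not yet executed
        self.lineage = lineage    # human-readable record of applied steps

    def map(self, fn):
        # No computation here; return a NEW dataset with one more recorded step.
        return ToyRDD(self.source, self.ops + (("map", fn),),
                      self.lineage + ("map",))

    def filter(self, pred):
        return ToyRDD(self.source, self.ops + (("filter", pred),),
                      self.lineage + ("filter",))

    def collect(self):
        # Only an action actually replays the recorded lineage over the data.
        data = list(self.source)
        for kind, fn in self.ops:
            data = ([fn(x) for x in data] if kind == "map"
                    else [x for x in data if fn(x)])
        return data

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.lineage)    # ('map', 'filter') — recorded, but nothing computed yet
print(rdd.collect())  # [0, 4, 16]
```

Because lineage is recorded, a lost partition can be recomputed from the source by replaying the same steps, which is how Spark recovers from node failure without replicating intermediate data.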
HBase: non-relational, NoSQL, column-oriented database management system that runs on top of HDFS. Well suited for sparse data sets (GCP: Bigtable).
Flink/Kafka: stream processing frameworks. Batch processing is for bounded, finite datasets, with periodic updates and delayed processing. Stream processing is for unbounded datasets, with continuous updates and immediate processing. Stream data and stream processing must be decoupled via a message queue. Streaming data can be grouped into windows using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
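Tumbling and sliding window assignment can be sketched as pure functions over an event timestamp. This is illustrative only; real engines such as Flink also handle watermarks and late-arriving data:

```python
# Assign an event (by timestamp, in seconds) to time windows.
def tumbling_window(ts, size):
    # Tumbling: time is partitioned into non-overlapping buckets, so each
    # event belongs to exactly one window [start, start + size).
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    # Sliding: windows of length `size` start every `slide` seconds and
    # overlap, so one event can fall into several windows.
    s_min = ((ts - size) // slide + 1) * slide   # earliest start covering ts
    s_max = (ts // slide) * slide                # latest start covering ts
    return [(s, s + size) for s in range(s_min, s_max + 1, slide)]

print(tumbling_window(7, size=5))             # (5, 10)
print(sliding_windows(7, size=10, slide=5))   # [(0, 10), (5, 15)]
```

A session window, by contrast, has no fixed size: it closes only after a configured gap of inactivity, so its bounds depend on the data rather than the clock.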
Beam: programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. Once built, a pipeline is executed by one of Beam’s distributed processing backends (Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow). Pipelines are modeled as a Directed Acyclic Graph (DAG).
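The build-then-execute idea can be sketched without the Beam SDK: the simplest DAG is a linear chain of transforms, recorded first and only run when the pipeline executes. The names below mimic Beam's style but are not its API:

```python
# Toy pipeline model: construction records a chain of transforms (a linear
# DAG); run() hands the data through each transform in topological order.
class Pipeline:
    def __init__(self, source):
        self.source = source
        self.transforms = []      # DAG edges, kept in topological order

    def apply(self, fn):
        self.transforms.append(fn)
        return self               # enables chaining, like Beam's `|` operator

    def run(self):
        data = list(self.source)
        for fn in self.transforms:
            data = fn(data)
        return data

result = (Pipeline(["3", "1", "2"])
          .apply(lambda xs: [int(x) for x in xs])   # parse
          .apply(lambda xs: [x * 10 for x in xs])   # transform
          .apply(sorted)                            # order
          .run())
print(result)  # [10, 20, 30]
```

Separating pipeline construction from execution is what lets the same Beam pipeline run unchanged on Flink, Spark, or Dataflow: the backend only sees the finished DAG.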
Oozie: workflow scheduler system to manage Hadoop jobs
Sqoop: framework for transferring large amounts of data into HDFS from relational databases (e.g., MySQL)
Are you ready for Part 2? Let me know in the comments below if you liked it.