GCP Professional Data Engineer Certification Workshop

Architecting Data Processing Solutions


Now, let’s take a look at some common architectural solutions on GCP.

Almost everyone has static web content: images, videos, and code files such as CSS, JavaScript, and HTML. This type of content should be delivered from a Cloud Storage bucket. Cloud Storage is cheap, highly available, and massively scalable. You can even turn on versioning to ensure you don't overwrite a file by accident. Use Cloud DNS to map your domain to the bucket, and if you have users everywhere, take advantage of Cloud CDN to cache content at edge locations around the world.
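As a sketch of that setup, the commands below create a website bucket, upload content, turn on versioning, and configure the index and error pages. The bucket name, region, and local path are placeholders; serving a bucket under your own domain also requires domain verification, which is not shown here.

```shell
# Create a bucket whose name matches the domain it will serve
# (www.example.com is a placeholder -- use your own verified domain).
gsutil mb -l us-central1 gs://www.example.com

# Upload the static site and make objects publicly readable.
gsutil -m cp -r ./site/* gs://www.example.com
gsutil iam ch allUsers:objectViewer gs://www.example.com

# Enable object versioning so an accidental overwrite can be undone.
gsutil versioning set on gs://www.example.com

# Set the index and 404 pages used for website serving.
gsutil web set -m index.html -e 404.html gs://www.example.com
```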

You might need a database to host a dynamic web application. One good way to do this is to use Cloud SQL to host a MySQL or PostgreSQL database. For high availability, you can create a failover replica in another zone. This is not the only option: if you prefer a NoSQL database, you could use Firestore instead.
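A minimal sketch of creating such an instance is below. The `REGIONAL` availability type places a standby in a second zone within the region, which is how Cloud SQL provides automatic failover; the instance name, tier, and region are placeholders.

```shell
# Create a MySQL instance with high availability: REGIONAL provisions
# a standby in another zone of the same region for automatic failover.
# Instance name, tier, and region are placeholders.
gcloud sql instances create my-app-db \
    --database-version=MYSQL_8_0 \
    --tier=db-n1-standard-1 \
    --region=us-central1 \
    --availability-type=REGIONAL
```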

Google’s main data warehouse product is BigQuery. Data from many sources can be imported into BigQuery and then easily analyzed using SQL.

Google Cloud Storage is often used as a staging location: upload your data to a bucket first, then load it into BigQuery from there. Loading from Cloud Storage is faster than importing the data directly.
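The staging pattern looks like this with the `gsutil` and `bq` command-line tools. The bucket, dataset, table, and file names are placeholders, and `--autodetect` is just one option; you can also supply an explicit schema.

```shell
# Stage the file in Cloud Storage first...
gsutil cp sales.csv gs://my-staging-bucket/sales.csv

# ...then load it into BigQuery from the bucket (dataset, table,
# and bucket names are placeholders).
bq load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    --autodetect \
    mydataset.sales \
    gs://my-staging-bucket/sales.csv
```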

If your data needs to be transformed before it is loaded into BigQuery, you can use Dataflow to build an ETL pipeline.

Cloud Dataflow is Google's managed service for running data processing pipelines. Pipelines are written with Apache Beam, an open-source data processing framework, and Dataflow executes them at scale.
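You don't always have to write Beam code yourself; Google publishes prebuilt templates you can launch directly. As a sketch, this runs the sample `Word_Count` template, reading a public input file and writing results to a bucket. The job name, region, and output bucket are placeholders.

```shell
# Submit a Dataflow job from Google's prebuilt Word_Count template.
# The pipeline itself is an Apache Beam program; this only launches it.
# Output bucket is a placeholder.
gcloud dataflow jobs run wordcount-demo \
    --gcs-location=gs://dataflow-templates/latest/Word_Count \
    --region=us-central1 \
    --parameters=inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://my-bucket/wordcount/out
```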

Dataflow can do batch processing of static data that is uploaded to Storage or some other data service.

Dataflow can also process streaming data in real time by attaching to Google’s messaging service Pub/Sub. Messages can be sent into Pub/Sub from any application or device and Dataflow can process those messages as they arrive.
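A streaming setup can be sketched with a Pub/Sub topic feeding a prebuilt Dataflow template that writes each message to BigQuery. The topic, project, dataset, and table names are placeholders, and the template expects the BigQuery table schema to match the incoming JSON messages.

```shell
# Create a topic and publish a test message (names are placeholders).
gcloud pubsub topics create sensor-events
gcloud pubsub topics publish sensor-events \
    --message='{"device":"d1","temp":21.5}'

# Launch a streaming Dataflow job from Google's PubSub_to_BigQuery
# template: it reads the topic and writes rows into a BigQuery table.
gcloud dataflow jobs run stream-demo \
    --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --region=us-central1 \
    --parameters=inputTopic=projects/my-project/topics/sensor-events,outputTableSpec=my-project:mydataset.events
```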

At the end of the pipeline, Dataflow can write to many different storage services. BigQuery may be the most common, but depending on how the data is processed, Bigtable might be better, or maybe a relational database like Cloud SQL or Spanner.

Requests made to GCP services are recorded in the logging system provided by Stackdriver (now Cloud Logging). You can also program your applications to send messages to the logs. Logs can contain valuable information about application usage, security, cost, and many other things.

You can have the system export logs to Cloud Storage or Pub/Sub. If you choose Pub/Sub, the logs are streamed in real time, and Dataflow can process the log data and write the results into BigQuery for analytics.
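Exports like this are configured with log sinks. As a sketch, this creates a sink that streams matching entries to a Pub/Sub topic; the sink name, project, topic, and filter are placeholders, and the sink's service account must also be granted publish permission on the topic (not shown).

```shell
# Create a sink that streams matching log entries to a Pub/Sub topic.
# Sink name, project, topic, and filter are placeholders.
gcloud logging sinks create app-logs-sink \
    pubsub.googleapis.com/projects/my-project/topics/log-stream \
    --log-filter='resource.type="gce_instance"'
```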

Using Google’s data storage and processing services, you can aggregate data from many applications and sources and build complex data analysis systems without having to worry about all the plumbing that makes these services work.

Being a competent Google Cloud Professional Data Engineer requires knowledge of these services and where they can fit into your overall solution.

This chapter was a big-picture overview of data engineering problems and of how GCP services fit into architecting solutions.

To finish this module, take the quiz. Also, don’t forget to do any hands-on exercises and explore any links that are provided in this chapter.