Step-by-step transformations on arbitrary text input
When I started out learning how to use Spark to transform data from Kafka streams, I had some difficulty figuring out what I needed to do. …
Does “do-it-yourself” have to be so hard?
The process of installing Arch Linux is quite different from that of other OSes. In fact, installing a standard Linux ISO is closer to installing Windows 10 than it is to installing Arch Linux. And if you have spent time on the computer-savvy side of the internet, you have likely heard of Arch’s notoriously difficult installation.
Why install Arch if it is so difficult? First, the difficulty itself stems from Arch letting you make almost all the decisions. You can choose your Linux kernel, which packages to install, your partitioning scheme, and much more. …
For when you really just don’t want to pay for vSphere
When I was pursuing my undergraduate degree in IT, one of my favorite classes taught me about how cloud computing services worked behind the scenes. The best part was the group project I spent far too much time on. The class was split up into teams and given computers and networking equipment. …
CDC-like data pipeline using MySQL binary logs
In my previous set of tutorials, I explained how to use the Debezium connector to stream database changes from Microsoft SQL Server. However, Debezium has connectors for many other databases. One of the more popular options is MySQL, and in this tutorial we’ll use Debezium to stream changes from it.
NOTE: This tutorial can more or less stand in for Part 1 in my “Creating…
Creating a CDC data pipeline: Part 3
This is the third and final part of a three-part tutorial on creating a Microsoft SQL Server CDC (Change Data Capture) data pipeline. However, it can also stand alone as a tutorial for installing and using Grafana to visualize metrics.
Creating a CDC data pipeline: Part 2
This is the second part of a three-part tutorial on creating a Microsoft SQL Server CDC (Change Data Capture) data pipeline. However, it can also stand alone as a tutorial for installing Apache Spark 2.4.7 …
Creating a CDC data pipeline: Part 1
In this three-part tutorial, we will learn how to set up and configure AWS EC2 instances to capture row-insertion data from Microsoft SQL Server 2019 via Change Data Capture, collect it in Apache Kafka, aggregate it periodically with Apache Spark’s streaming capability, and track the live updates using Grafana.
II. System Requirements
III. Part 1 — Create AWS Instances (~15 minutes)
IV. Part 2 — Setup on Each Node (~15 minutes)
V. Part 3 — Initialize Greenplum Database (~15 minutes)
Greenplum Database uses an MPP (massively parallel processing) database architecture that is able to take advantage of distributed computing to efficiently manage large data workloads.
The basic structure of a Greenplum cluster involves one master node and one or more segment nodes, each running an independent Postgres instance. The master node serves as the entry point for client requests, and the segment nodes each store a portion of…
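The idea behind that distribution can be sketched in a few lines. This is a toy illustration only, not Greenplum’s actual internal hash algorithm, and the key names and segment count below are hypothetical: each row’s distribution key is hashed deterministically, and the result picks which segment stores the row.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical cluster with four segment nodes


def segment_for(distribution_key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a row's distribution key to a segment by hashing.

    A simplified stand-in for Greenplum's internal hash distribution;
    the real algorithm differs, but the principle is the same:
    a deterministic key -> segment mapping.
    """
    digest = hashlib.md5(distribution_key.encode()).hexdigest()
    return int(digest, 16) % num_segments


# Rows with the same key always land on the same segment, which is
# why joins on the distribution key can run segment-locally.
rows = ["customer_1", "customer_2", "customer_3", "customer_1"]
placement = {key: segment_for(key) for key in rows}
```

In Greenplum itself, you declare the distribution key with the `DISTRIBUTED BY` clause when creating a table; the point of the sketch is only that the mapping is deterministic, so equal keys colocate.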