Step-by-step transformations on arbitrary text input

Outline

  • Introduction (including software versions used)
  • Creating AWS Instance
  • Installation (Spark, Kafka)
  • Spark job 1: Output raw data to console
  • Spark job 2: Run custom functions on input and output as new column
  • Spark job 3: Parse JSON and output specific fields
  • Spark job 4: Run SQL functions on streaming dataframe
  • Spark job 5: Use Kafka topic as sink for Apache Spark stream
  • Conclusion

Introduction

When I started out learning how to use Spark to transform data from Kafka streams, I had some difficulty figuring out what I needed to do. …
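
As a preview of where the tutorial ends up, the first job boils down to something like the following minimal sketch: read a Kafka topic as a stream and dump the raw records to the console. The broker address and topic name here are placeholders for your own setup.

```python
# Minimal PySpark Structured Streaming sketch: read a Kafka topic and
# print the raw records to the console. Broker address and topic name
# are placeholders; adjust them for your environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-console").getOrCreate()

# Kafka source: each row carries binary 'key' and 'value' columns.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "my-topic")
       .load())

# Cast the binary payload to a string and print each micro-batch.
query = (raw.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()
```

Note that the Kafka source is not bundled with Spark itself; you would submit this with the matching spark-sql-kafka package on the classpath (via --packages or similar).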


Does “do-it-yourself” have to be so hard?

Introduction

The process of installing Arch Linux is quite different from that of other operating systems. In fact, installing a standard Linux ISO is more similar to installing Windows 10 than it is to installing Arch Linux. And if you have spent any time on the computer-savvy side of the internet, you know of Arch’s notorious installation difficulty.

The icon of joy and frustration

Why install Arch if it is so difficult? First, the difficulty itself stems from Arch letting you make almost all the decisions yourself: you can choose your Linux kernel, which packages to install, your partitioning scheme, and much more. …


For when you really just don’t want to pay for vSphere

Outline

  • Introduction
  • Requirements
  • Part 0: Installing VirtualBox and Extension Pack on the Command Line
  • Part 1: Creating and Deleting a VirtualBox VM using “VBoxManage”
  • Part 2: Enabling RDP access for a VirtualBox VM
  • Part 3: Additional Considerations

Introduction

When I was pursuing my undergraduate degree in IT, one of my favorite classes taught me how cloud computing services work behind the scenes. The best part was the group project I spent far too much time on. The class was split into teams and given computers and networking equipment. …


CDC-like data pipeline using MySQL binary logs

Outline

  • Introduction
  • Creating Security Groups and EC2 Instances (~15 min)
  • Installing MySQL and Configuring to Allow Binary Log Reading (~15 min)
  • Installing/Configuring Kafka and Debezium Connector (~15 min)

Introduction

In my previous set of tutorials, I explained how to use the Debezium connector to stream database changes from Microsoft SQL Server. However, Debezium has connectors for many other databases. One of the more popular choices is Oracle’s MySQL, and in this tutorial we’ll use Debezium to stream changes from it.
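
As a rough preview of the connector step, registering the MySQL connector comes down to POSTing a JSON configuration to Kafka Connect’s REST API. The sketch below is a minimal guess at that configuration for a Debezium 1.x setup; all hostnames, credentials, and names are placeholders, not values from this tutorial.

```python
# Sketch: register a Debezium MySQL connector with Kafka Connect's REST
# API (assumes Connect is listening on its default port, 8083). All
# hostnames, credentials, and names below are placeholders.
import json
import requests

connector = {
    "name": "mysql-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "change-me",
        # Unique numeric ID this connector uses when reading the binlog.
        "database.server.id": "184054",
        # Logical server name; becomes the prefix of the emitted topics.
        "database.server.name": "mysqlserver",
        "database.include.list": "inventory",
        # Debezium keeps the database schema history in its own topic.
        "database.history.kafka.bootstrap.servers": "kafka-host:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
    },
}

resp = requests.post(
    "http://connect-host:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```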

NOTE: This tutorial can more or less stand in for Part 1 in my “Creating…


Creating a CDC data pipeline: Part 3

Outline

  • Introduction
  • Creating Security Groups and EC2 Instances (~5 min)
  • Installing Graphite Carbon, Graphite Web, and StatsD (~15 min)
  • Installing Grafana (~5 min)
  • Configuring StatsD (~5 min)
  • Starting All Pipeline Services (~10 min)
  • Configuring Grafana and Creating a Dashboard (~10 min)
  • Completed Python File
  • Conclusion

Introduction

This is the third part in a three-part tutorial describing how to create a Microsoft SQL Server CDC (Change Data Capture) data pipeline. However, this tutorial can also work as a standalone guide to installing and using Grafana to visualize metrics.
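
For context on where StatsD sits in this pipeline: applications emit metrics as small plaintext UDP packets, StatsD aggregates them, and Graphite/Grafana store and visualize them. The sketch below shows the idea using only the standard library; the host, port, and metric name are placeholders.

```python
# Sketch: emit a StatsD counter over UDP using only the standard
# library. StatsD's plaintext protocol is "<name>:<value>|<type>",
# where "|c" marks a counter. Host, port, and metric name are
# placeholders for your own setup.
import socket

STATSD_HOST = "localhost"
STATSD_PORT = 8125  # StatsD's default UDP port

def incr(metric: str, value: int = 1) -> None:
    """Send a counter increment to StatsD (fire-and-forget UDP)."""
    packet = f"{metric}:{value}|c".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (STATSD_HOST, STATSD_PORT))

# Example: count one processed CDC row.
incr("cdc.rows.inserted")
```

Because the transport is UDP, metric emission never blocks the pipeline, which is one reason StatsD is a common fit for this kind of job.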


Creating a CDC data pipeline: Part 2

Outline

  • Introduction
  • Creating Security Groups and EC2 Instances (~5 min)
  • Installing/Configuring Spark (~5 min)
  • Starting All Pipeline Services (~10 min)
  • Extracting CDC Row Insertion Data Using Pyspark (~15 min)
  • Running Your Own Functions on Output
  • Changing the Spark Job to Filter out Deletes and Updates
  • Completed Python File
  • Addendum

Introduction

This is the second part in a three-part tutorial describing how to create a Microsoft SQL Server CDC (Change Data Capture) data pipeline. However, this tutorial can also work as a standalone tutorial to install Apache Spark 2.4.7 …
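
As a taste of the filtering step near the end: Debezium change events carry an op field (“c” for create/insert, “u” for update, “d” for delete), so keeping only insertions is essentially one filter after parsing the JSON payload. A rough sketch follows; the topic name is a placeholder and the schema is heavily simplified compared to a real Debezium envelope.

```python
# Sketch: keep only Debezium insert events ("op" == "c") from a CDC
# topic. The schema below is heavily simplified; real Debezium events
# nest the changed row alongside the "op" field. Names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-inserts").getOrCreate()

# Just the envelope field we need for filtering.
schema = StructType([StructField("op", StringType())])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "server1.dbo.mytable")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e")))

# "c" = create/insert; "u" and "d" (updates/deletes) are dropped.
inserts = events.filter(col("e.op") == "c")

query = inserts.writeStream.format("console").start()
query.awaitTermination()
```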


Creating a CDC data pipeline: Part 1

Outline

  • Introduction
  • Creating Security Groups and EC2 Instances (~15 min)
  • Configuring SQL Server for CDC (~15 min)
  • Installing/Configuring Kafka and Debezium Connector (~15 min)
  • Reading CDC Topic (~5 min)
  • Addendum 1: Important Commands Used
  • Addendum 2: Next Article in the Tutorial

Introduction

In this three-part tutorial, we will learn how to set up and configure AWS EC2 instances to take Change Data Capture row insertion data from Microsoft SQL Server 2019, collect it in Apache Kafka, aggregate it periodically with Apache Spark’s streaming capability, and track the live updates using Grafana.
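
To preview the “Reading CDC Topic” step: once Debezium is running, the change events are ordinary Kafka records, so any consumer can read them. Here is a minimal sketch using the kafka-python package; the broker address and topic name are placeholders for whatever your Debezium setup produces.

```python
# Sketch: read Debezium CDC events off Kafka with the kafka-python
# package (pip install kafka-python). Broker address and topic name
# are placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "server1.dbo.mytable",          # topic Debezium writes to
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the oldest event
)

for record in consumer:
    # Each record value is a JSON-encoded Debezium change event.
    print(record.value.decode("utf-8"))
```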

Part 1 will cover steps…


Source: https://twitter.com/Greenplum

Overview

I. Introduction
II. System Requirements
III. Part 1 — Create AWS Instances (~15 minutes)
IV. Part 2 — Setup on Each Node (~15 minutes)
V. Part 3 — Initialize Greenplum Database (~15 minutes)

Introduction

Greenplum Database uses an MPP (massively parallel processing) architecture that takes advantage of distributed computing to efficiently manage large data workloads.

The basic structure of a Greenplum cluster involves one master node and one or more segment nodes, each running an independent Postgres instance. The master node serves as the entry point for client requests, and segment nodes each store a portion of…
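
One quick way to see that master/segment layout once a cluster is up is to query Greenplum’s gp_segment_configuration catalog table from the master. Below is a sketch using psycopg2; the connection details are placeholders.

```python
# Sketch: list Greenplum's master and segment nodes by querying the
# gp_segment_configuration catalog table on the master. Connection
# details are placeholders; requires psycopg2 (pip install psycopg2-binary).
import psycopg2

conn = psycopg2.connect(
    host="master-host", port=5432, dbname="postgres", user="gpadmin"
)
with conn, conn.cursor() as cur:
    # content = -1 is the master; role 'p' = primary, 'm' = mirror.
    cur.execute(
        "SELECT content, role, hostname, port "
        "FROM gp_segment_configuration ORDER BY content"
    )
    for content, role, hostname, port in cur.fetchall():
        kind = "master" if content == -1 else f"segment {content}"
        print(f"{kind} ({role}) on {hostname}:{port}")
conn.close()
```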

Sandeep Kattepogu

A man with a passion for information technology.
