Greenplum 6.7.1 on AWS
How to install a three-node Greenplum Database cluster with segment mirroring
Overview
I. Introduction
II. System Requirements
III. Part 1 — Create AWS Instances (~15 minutes)
IV. Part 2 — Setup on Each Node (~15 minutes)
V. Part 3 — Initialize Greenplum Database (~15 minutes)
Introduction
Greenplum Database uses an MPP (massively parallel processing) database architecture that is able to take advantage of distributed computing to efficiently manage large data workloads.
The basic structure of a Greenplum cluster involves one master node and one or more segment nodes, each running an independent Postgres instance. The master node serves as the entry point for client requests, and segment nodes each store a portion of the accessible data. Useful information can be found on the Pivotal Docs page for Greenplum here.
System Requirements
In this tutorial, we will be using Amazon Web Services (AWS) to create EC2 instances to serve as the nodes for our 3-node Greenplum cluster. System requirements can be found on this Greenplum Pivotal Docs page.
NOTE: This tutorial assumes basic familiarity with creating and administering AWS instances, using SSH, using SCP, and basic Linux administration using BASH.
Part 1 — Create AWS Instances
Create AWS instances with the following settings. Accept defaults where details are left unspecified.
AWS Instance Details:
- Image type: Ubuntu Server 18.04 LTS (HVM)
- Minimum recommended instance type: r5ad.large
- Number of instances: 3
- Subnet: 172.31.0.0/16 (Use your default subnet; anywhere this subnet is used in the tutorial, substitute your own)
- Inbound Security Rules: SSH from My IP; SSH from subnet; All TCP from subnet; All ICMP IPv4 from subnet
NOTE: The security rules listed above are permissive and not suitable for production. Specific ports to consider include the various ports Greenplum uses to communicate among nodes and the Postgres default port. Specific information about ports can be found on this Pivotal Docs page.
Part 2 — Setup on Each Node
NOTE: The following steps must be carried out for every node created in Part 1 of this tutorial.
- Download the Greenplum 6.7.1 .deb package from GitHub.
- Secure copy the package to each instance.
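A minimal sketch of the copy, assuming your AWS key file is named my-key.pem (a placeholder) and substituting each instance's public IP:
# Run from your local machine; repeat for each of the three instances
scp -i my-key.pem greenplum-db-6.7.1-ubuntu18.04-amd64.deb ubuntu@<instance-public-ip>:~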
- SSH to each instance and change the hostname.
# Master Node
sudo hostnamectl set-hostname master;
sudo reboot;
# Segment Node 1
sudo hostnamectl set-hostname seg1;
sudo reboot;
# Segment Node 2
sudo hostnamectl set-hostname seg2;
sudo reboot;
- Edit the /etc/hosts file on each node with the new hostnames corresponding to their private IPs, as in the example below.
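For illustration, the entries might look like the following, using hypothetical private IPs from the 172.31.0.0/16 subnet (substitute your instances' actual private IPs):
# /etc/hosts: same three entries on every node (IPs shown are hypothetical)
172.31.10.11 master
172.31.10.12 seg1
172.31.10.13 seg2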
- Edit /etc/ssh/sshd_config to allow PasswordAuthentication.
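One way to make this change, assuming the stock AWS Ubuntu 18.04 sshd_config where the line reads "PasswordAuthentication no" (you can also simply edit the file with a text editor):
# Flip PasswordAuthentication to yes, whether or not the line is commented out
sudo sed -i 's/^#\?PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config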
- Restart the sshd service.
sudo service sshd restart
- Create storage areas for Greenplum to use.
# Master Node
sudo mkdir -p /gpdata/master;
sudo mkdir -p /gpdata/mirror
# Segment Nodes
sudo mkdir -p /gpdata/primary;
sudo mkdir -p /gpdata/mirror
- Install the deb package.
sudo dpkg -i greenplum-db-6.7.1-ubuntu18.04-amd64.deb
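If dpkg stops on missing dependencies, the standard apt recovery step below usually resolves them; this is generic dpkg/apt behavior, not specific to Greenplum:
# Install any dependencies dpkg reported as missing, then finish configuration
sudo apt-get install -f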
- Create the gpadmin group and user.
sudo groupadd gpadmin;
sudo useradd gpadmin -r -m -g gpadmin;
sudo chsh -s /bin/bash gpadmin;
sudo passwd gpadmin
- Assign ownership of storage areas to “gpadmin” user.
- Assign ownership of Greenplum files to “gpadmin” user.
sudo chown -R gpadmin:gpadmin /gpdata;
sudo chown -R gpadmin:gpadmin /usr/local/greenplum-db-6.7.1
- Log in as gpadmin and create a key pair.
su gpadmin
ssh-keygen -t rsa -b 4096
# Accept defaults; Do not input a password
- As gpadmin, import Greenplum environment variables.
source /usr/local/greenplum-db-6.7.1/greenplum_path.sh
- (Optional) Add lines to gpadmin’s .bashrc for ease of use and source the edited file.
echo "source /usr/local/greenplum-db-6.7.1/greenplum_path.sh; export MASTER_DATA_DIRECTORY=/gpdata/master/gpseg-1; cd ~" >> ~/.bashrc; source ~/.bashrc
Part 3 — Initialize Greenplum Database
NOTE: The following steps are to be completed only on the master node.
- Log in as gpadmin to the master node and run ssh-copy-id for all three nodes. You will need to input the gpadmin password for each node.
ssh-copy-id master
ssh-copy-id seg1
ssh-copy-id seg2
- Create a file called “hostlist” containing the hostname of each node, one per line.
- Create a file called “hostlist_segonly” containing only the segment hostnames, as shown in the sketch below.
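A minimal sketch of creating both files, using the hostnames set in Part 2:
cat > hostlist <<EOF
master
seg1
seg2
EOF
cat > hostlist_segonly <<EOF
seg1
seg2
EOF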
- Run the “gpssh-exkeys” command that comes with the Greenplum installation.
gpssh-exkeys -f hostlist
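To confirm that passwordless SSH now works across the cluster, you can use the bundled gpssh utility to run a quick command on every node (the -e flag echoes each command as it runs):
gpssh -f hostlist -e hostname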
- Copy the example initialization file to the current directory (you may want to switch to /home/gpadmin).
cp /usr/local/greenplum-db-6.7.1/docs/cli_help/gpconfigs/gpinitsystem_config ./
- Change the following lines in the copied initialization file:
# Original:
declare -a DATA_DIRECTORY=(/data1/primary /data1/primary /data1/primary /data2/primary /data2/primary /data2/primary)
# Changed to:
declare -a DATA_DIRECTORY=(/gpdata/primary /gpdata/primary /gpdata/primary /gpdata/primary)
# The above line forces the creation of four segments per segment node because the location "/gpdata/primary" appears four times.
# Any location or combination of locations can be used here.
# Appropriate segmentation is addressed in the Pivotal Docs here.

# Original:
MASTER_HOSTNAME=mdw
# Changed to:
MASTER_HOSTNAME=master

# Original:
MASTER_DIRECTORY=/data/master
# Changed to:
MASTER_DIRECTORY=/gpdata/master

# Original:
#DATABASE_NAME=name_of_database
# Changed to:
DATABASE_NAME=testdatabase_1

# Original:
#MACHINE_LIST_FILE=/home/gpadmin/gpconfigs/hostfile_gpinitsystem
# Changed to:
MACHINE_LIST_FILE=/home/gpadmin/hostlist_segonly
- Run the Greenplum initialization.
gpinitsystem -c gpinitsystem_config
- Enter “Y” when prompted to “Continue with Greenplum creation”.
- Enable segment mirroring using the “gpaddmirrors” utility. The -p flag supplies a port offset that is added to the primary segment ports to calculate the mirror segment ports. You will need to supply the location “/gpdata/mirror” multiple times.
gpaddmirrors -p 10000
- Supply the location “/gpdata/mirror” at each prompt. The number of prompts depends on the number of segments you specified per host.
- Enter “Y” when prompted to “Continue with add mirrors procedure”.
- Run the “gpstate” command to verify that all segments were created successfully. Because we have 2 segment nodes with 4 primary segments each, and each primary segment has a corresponding mirror, we should see 16 segments.
gpstate -b
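gpstate also accepts a -m flag that reports specifically on mirror segments and their synchronization state:
gpstate -m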
- Access the test database created as part of the initialization.
/usr/local/greenplum-db-6.7.1/bin/psql -h master -d testdatabase_1 -U gpadmin
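Once connected, a quick sanity check is to list every segment from the gp_segment_configuration catalog table; with 4 primaries and 4 mirrors per segment node you should see the 16 segments counted above plus the master entry:
# Lists each segment's content ID, role (p = primary, m = mirror), port, and host
/usr/local/greenplum-db-6.7.1/bin/psql -h master -d testdatabase_1 -U gpadmin -c "SELECT content, role, port, hostname FROM gp_segment_configuration ORDER BY content, role;"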