Amazon Redshift: the data warehouse solution
Hi folks, this is Utpal, and in this article I discuss Amazon Redshift with a demo.
In the same week I wrote another article on CloudEndure (server migration to AWS), which you may want to read later.
A data warehouse is a centralized repository of data, implemented as an OLAP (Online Analytical Processing) database.
Basically, there are two types of data: operational data and strategic data.
Operational data is used for day-to-day purposes such as transactions.
Strategic data (also known as historical data) is used for analysis, reporting, and mining. It is useful for making business decisions, and it is stored in the data warehouse.
For example, suppose we have a company with 15 branches throughout the country. Each branch produces various types of data, which need to be retrieved to get an overall view and to do analysis, reporting, mining, and so on.
Now, it is not a good solution to retrieve data from each of the 15 branches every time we need it. Instead, we should have one large central data store and bring data from each branch into it. There we can retrieve data whenever we need. This central store is called a data warehouse.
All data coming from the branches is integrated into a standard format; hence the warehouse is integrated.
The stored data varies with time, so we can do time-based analysis; hence the warehouse is time variant.
A data warehouse is called non-volatile because the stored data cannot be changed, only deleted, although there is a separate mechanism for loading changes into the warehouse.
A data warehouse consists of small data marts, each holding the details of a particular subject (for example sales, customer info, or product info). Thus, it is subject oriented.
There are 4 stages in a data warehouse:
- Collect or fetch data from sources via the ETL (Extract, Transform, Load) process.
- Store the data in a staging layer.
- Analyze the data to support business decisions.
- Consume and explore the data with tools like Tableau, QuickSight, etc.
Redshift is based on PostgreSQL. AWS claims up to 10 times better performance than other data warehouses, and it can scale to petabytes.
Data is stored by column rather than by row, which reduces overall disk I/O and the amount of data that must be loaded from disk.
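To see why columnar storage helps, consider an aggregate that touches only a couple of columns; the table and column names below are my own illustration, not from the demo.

```sql
-- Hypothetical table: sales(branch_id, sale_date, amount, ...).
-- With columnar storage, Redshift reads only the branch_id and amount
-- columns from disk for this query and skips every other column.
SELECT branch_id, SUM(amount) AS total_sales
FROM sales
GROUP BY branch_id;
```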
We can load data into Redshift from S3, DynamoDB, or DMS, and we can use S3 as a data lake before loading the data into Redshift. Here is a brief description of the flow:
First, data is exported from a source system into an S3 bucket, usually in compressed format. Then, using the COPY command, the data is loaded into a Redshift table. Finally, using a SQL client, we can run queries on the data, and the results are returned to the client.
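As a sketch, the load step might look like this; the table name, bucket path, and IAM role ARN are placeholders, not values from the demo.

```sql
-- Load gzip-compressed, pipe-delimited files from S3 into a Redshift table.
-- The table, bucket prefix, and role ARN below are hypothetical.
COPY sales
FROM 's3://my-bucket/exports/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
DELIMITER '|'
GZIP;
```

Redshift loads all files under the given S3 prefix in parallel across the compute node slices, which is why splitting an export into multiple files speeds up the load.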
A Redshift cluster has 1 to 128 nodes (each with up to 160 GB of storage); the collection of nodes is called a cluster. There is a leader node and compute nodes. The leader node handles query planning and result aggregation across all the compute nodes, while the compute nodes execute the queries. There are two node types: 'Dense Compute (DC2)' with SSD storage and 'Dense Storage (DS2)' with magnetic storage.
Pricing is pay-as-you-go.
With Redshift Enhanced VPC Routing, all COPY and UNLOAD traffic goes through the VPC instead of the public internet, which is more secure.
Redshift takes point-in-time backups (snapshots) of a cluster, which are stored internally in S3 over an SSL connection. Snapshots are incremental and can be restored into a new cluster. Both automated and manual snapshots are available. For disaster recovery, Redshift can be configured to copy snapshots automatically to another region.
With Redshift Spectrum, we can query data in an S3 bucket without loading it into the Redshift cluster. A cluster is still needed to start the queries, which are then submitted to Redshift Spectrum nodes.
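A minimal Spectrum setup could look like the following, assuming an external schema backed by the AWS Glue Data Catalog; all names, paths, and the role ARN are illustrative.

```sql
-- Create an external schema backed by the Glue Data Catalog.
-- (Schema, database, role ARN, and paths below are hypothetical.)
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over pipe-delimited files in S3.
CREATE EXTERNAL TABLE spectrum_demo.sales_ext (
    branch_id INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/spectrum/sales/';

-- Query the S3 data directly; nothing is loaded into the cluster.
SELECT branch_id, SUM(amount)
FROM spectrum_demo.sales_ext
GROUP BY branch_id;
```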
So, let's go to the console.
From the AWS Management Console, I opened Amazon Redshift. From the left navigation pane, I clicked Clusters, then Create cluster.
In the cluster configuration section, I entered the cluster identifier (Production), chose the node type dc2.large (for high performance with fixed local SSD storage), and set the number of nodes to 2.
In the database configuration section, I entered the database name and port, and the master username and password.
In the cluster permissions section, I selected the 'Redshift role' from the drop-down menu and added it.
In the additional configuration section, I deselected 'Use defaults'.
Under network and security, I selected the VPC, security group, and cluster subnet group, and left all other options at their defaults.
Finally, I clicked Create cluster.
After a few minutes, the Redshift cluster was launched. The Actions menu offers several options, and from DASHBOARD in the left navigation pane we can view the cluster overview.
By clicking on the cluster name, we can view its configuration.
To use Redshift, we need a SQL client that provides a UI for typing SQL and supports JDBC or ODBC. I used the Redshift query editor.
From the left navigation pane, I clicked Editor. Then I selected 'Create new connection', chose my cluster name from the drop-down menu, and entered the database name, database user, and password. Finally, I clicked Connect to database.
In the Query-1 window, I created a table using SQL and clicked Run. I also imported pipe-separated data from an S3 bucket into Redshift by entering a COPY command into the Query-1 window and clicking Run, and then ran various queries on the imported data.

I also used a web-based PostgreSQL client ('pgweb') to connect to Redshift. For this I created another Redshift cluster. In the Host field, I entered the endpoint of my Redshift cluster and removed ':5439/lab' from the end. Then I entered the username, password, database, and port, selected SSL required, and clicked Connect.
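The article does not show the exact SQL used in the demo, but a typical create-and-query sequence in the query editor might look like this; the table and columns are my own illustration.

```sql
-- Create a small table in the Redshift query editor.
CREATE TABLE branch_sales (
    branch_id INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
);

-- After loading data, run an example query against it.
SELECT sale_date, COUNT(*) AS num_sales, SUM(amount) AS total
FROM branch_sales
GROUP BY sale_date
ORDER BY sale_date;
```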
Then, in the pgweb console, I clicked the SQL Query tab, entered SQL commands to create a table, and clicked Run Query. I uploaded sample data to an S3 bucket and loaded it into Redshift using a COPY command.
Data files are loaded in parallel from the S3 bucket.

Back in my Redshift cluster, I clicked the Query monitoring tab. By refreshing here, we can view the history of data loads and queries, and by clicking on any query we can view its details. Below, in the Query plan tab, we can view the cluster health status, CPU utilization, used storage capacity, and active database connections.

In the Maintenance and monitoring tab, we can view recommendations, maintenance status, audit logging, and usage limits, and we can create alarms. In the Cluster performance tab, we can view information about the cluster. In the Backup tab, we can create and view snapshots and restore tables. In the Properties tab, we can manage cluster permissions, resize the cluster, edit the database configuration, etc.

So, this is the end of this article. I hope it is helpful. For more details, please see the AWS documentation.