Amazon Redshift is a cloud data warehouse solution that makes it easy to gain new insights from your data. It is a fully managed service, so you will not carry the burden of updating, patching, or backing up your database: AWS takes care of these duties.
Amazon Redshift lets you query files in open formats such as CSV, Parquet, ORC, JSON, and Avro using standard SQL, and integrates with other AWS services and even third-party services. Using Federated Queries, you can query data that lives across one or more Amazon RDS databases effortlessly, without the need to move it. Redshift ML is the perfect tool for data and machine learning engineers to create and train Amazon SageMaker models using data stored in Amazon Redshift, and to use those models directly when querying the database.
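As an illustration, a federated query first mounts an RDS PostgreSQL schema and can then join it with warehouse tables in a single statement, and Redshift ML exposes a trained model as a SQL function. All identifiers below (schemas, tables, roles, secrets, buckets) are hypothetical placeholders, not values from this document:

```sql
-- Federated query sketch: mount a schema from an Amazon RDS PostgreSQL
-- database (endpoint, role, and secret are placeholders).
CREATE EXTERNAL SCHEMA pg_federated
FROM POSTGRES
DATABASE 'crm' SCHEMA 'public'
URI 'crm-instance.abc123.eu-west-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:eu-west-1:123456789012:secret:crm-credentials';

-- Join live RDS data with a local warehouse table, without moving anything.
SELECT o.order_id, c.segment
FROM sales.orders AS o
JOIN pg_federated.customers AS c ON o.customer_id = c.id;

-- Redshift ML sketch: train a SageMaker model from warehouse data and
-- call it as a SQL function (all names hypothetical).
CREATE MODEL customer_churn
FROM (SELECT age, tenure, monthly_spend, churned FROM analytics.customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'redshift-ml-artifacts');
```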
Amazon Redshift evolved from PostgreSQL and is now an industry-leading solution for performance and flexibility. Thanks to AQUA (Advanced Query Accelerator) for Amazon Redshift, a distributed and hardware-accelerated cache, your queries can run up to 10x faster than on other enterprise data warehouse solutions. Materialized views and result caching enable faster query performance when using Business Intelligence (BI) tools or when developing Extract, Transform, and Load (ETL) workflows. The database engine is powered by machine learning algorithms to deliver high throughput even during concurrent activity. Short Query Acceleration (SQA) estimates the computational effort required by queries and prioritizes short-running ones.
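For instance, a materialized view can precompute an aggregation that a BI dashboard would otherwise recompute on every load; the table and column names below are illustrative:

```sql
-- Precompute a daily revenue aggregate (names are examples, not from a real schema).
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT TRUNC(sale_time) AS sale_day, SUM(amount) AS revenue
FROM sales
GROUP BY TRUNC(sale_time);

-- Dashboards read the precomputed result instead of rescanning the base table.
SELECT sale_day, revenue FROM daily_revenue ORDER BY sale_day DESC LIMIT 30;

-- Refresh after new data is loaded (can also be scheduled).
REFRESH MATERIALIZED VIEW daily_revenue;
```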
Amazon Redshift is virtually unlimited. You can set up managed storage to allow the automatic increase of storage when needed, up to 8 PB of (compressed) data. Also, using the console or a few API calls, you can scale the number of nodes in a cluster in or out within minutes. Thanks to Concurrency Scaling, you can run thousands of concurrent queries, regardless of whether a query targets data stored in the warehouse or directly in your Amazon S3 data lake. The data sharing functionality enables you to share live data across different Redshift clusters, giving you and your organization high-performance access to data in any Redshift cluster you may have.
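As a sketch, data sharing is driven by a few SQL statements on the producer and consumer clusters; the datashare, schema, and namespace identifiers below are placeholders:

```sql
-- On the producer cluster: create a datashare and add live objects to it.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales;

-- Grant access to the consumer cluster's namespace (placeholder GUID).
GRANT USAGE ON DATASHARE sales_share
TO NAMESPACE 'b1c2d3e4-0000-1111-2222-333344445555';

-- On the consumer cluster: expose the shared data as a local database.
CREATE DATABASE sales_from_producer
FROM DATASHARE sales_share
OF NAMESPACE 'a1b2c3d4-0000-1111-2222-333344445555';
```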
With a couple of parameters, it is possible to enable data encryption both at rest (AES-256) and in transit (SSL) without worrying about key management, which is handled by Amazon Redshift itself. Once encryption at rest is enabled, both data written to disk and backups are encrypted. Following the least privilege principle, Amazon Redshift lets you set up fine-grained access control policies at the row and column level. The tight integration between Amazon Redshift and AWS CloudTrail allows detailed monitoring of every Redshift API call. All operations executed against the database, such as SQL queries, connection attempts, or changes to the data warehouse, are logged into system tables and can be queried or exported to a secured Amazon S3 bucket.
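For example, the activity recorded in the system tables can be inspected with plain SQL; STL_QUERY and STL_CONNECTION_LOG are standard Redshift system tables:

```sql
-- Last 20 queries executed against the database.
SELECT userid, starttime, endtime, TRIM(querytxt) AS query_text
FROM stl_query
ORDER BY starttime DESC
LIMIT 20;

-- Recent connection attempts, including failed authentications.
SELECT event, recordtime, username, remotehost
FROM stl_connection_log
ORDER BY recordtime DESC
LIMIT 20;
```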
As Storm Reply, we have developed a set of best practices for Amazon Redshift, leveraging our experience in the design, implementation, and maintenance of large data warehouse solutions for our enterprise customers.
Data protection in Amazon Redshift can be enabled both for data in transit and at rest. The service integrates with AWS Certificate Manager to support SSL connections for in-transit encryption, while at-rest encryption can be managed either client-side or server-side. Server-side encryption is fully managed by AWS: your data is encrypted before it is written to disk. This is the advised choice for most use cases because it relieves the user of the burden of key management.
Amazon Redshift continuously backs up your data to Amazon S3 and provides tools to restore snapshots in any AZ in case of failure. In some cases, tighter RPO/RTO targets are required, and for such scenarios you can integrate Amazon Redshift with Amazon Kinesis and Amazon Route 53 to deploy parallel clusters and achieve automatic failover.
Amazon Redshift offers a wide range of hardware configurations for the nodes of your cluster. Choosing the right node size is crucial both for performance and pricing. We carefully analyse your data and use case to choose the node type that best fits your needs. RA3 nodes provide high-speed caching, managed storage (which allows compute to scale separately from storage), and high-bandwidth networking, whereas DC2 nodes are optimised for compute-intensive workloads on the data warehouse. DS2 nodes are storage-optimised and the best solution for workflows with a moderate compute workload and huge amounts of data.
Amazon Redshift uses a compressed columnar data structure that provides high query throughput on its own and can be exploited fully by following a few guidelines. For example, materialized views can significantly boost query performance by precomputing frequent queries used during ELT processes or by BI tools. Concurrency Scaling allows your Amazon Redshift cluster to add capacity dynamically in response to the workload arriving at the cluster, or, if you have spikes on a predictable schedule, you can use the elastic resize scheduler feature. The data inside your nodes can be distributed and sorted by marking columns of your tables as distribution keys or sort keys. Proper distribution keys enable optimized JOIN queries, whereas the right sort keys enhance the performance of SELECT queries.
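A minimal sketch of these table design choices, with hypothetical table and column names:

```sql
-- Distribute on the join column so matching rows land on the same slice,
-- and sort on the time column so range-restricted scans can skip blocks.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_time   TIMESTAMP,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_time);

-- Joins on customer_id are now collocated; filters on sale_time prune blocks.
SELECT c.segment, SUM(s.amount) AS total_amount
FROM sales AS s
JOIN customers AS c ON s.customer_id = c.customer_id
WHERE s.sale_time >= '2023-01-01'
GROUP BY c.segment;
```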
www.storm.reply.com