Heuristic Optimization of Amazon Redshift Table Configurations

Focusing on Distribution Style, Sort Keys and Column Encodings in Amazon Redshift

Master Thesis (2025)
Author(s)

X.L. Hu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Neil Yorke-Smith – Mentor (TU Delft - Algorithmics)

A. Katsifodimos – Mentor (TU Delft - Data-Intensive Systems)

Christoph Lofi – Graduation committee member (TU Delft - Web Information Systems)

Derek van den Broek – Mentor (PostNL B.V.)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2025
Language
English
Graduation Date
17-07-2025
Awarding Institution
Delft University of Technology
Programme
['Computer Science']
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis presents a comprehensive, heuristic cost-driven framework for optimizing database table configuration in Amazon Redshift focusing on distribution styles, sort keys and column encodings. Unlike existing approaches that treat optimization parameters independently, this research develops a sequential optimization methodology that captures complex interdependencies between configuration choices and their performance impacts across different data scales.

The study addresses four research questions examining individual parameter optimization strategies and their integrated effects on system performance. The experimental evaluation employs two datasets: a primary table containing 300 million data records across 23 columns where optimization is performed, and a secondary join table with 117.5 million data records across 12 columns that remains unchanged. Scale-dependent analysis is conducted using subsets of 10 million and 100 million data records selected from the primary dataset to enable controlled comparison across different data volumes.

Key findings demonstrate that table configuration optimization in Amazon Redshift exhibits pronounced scale-dependent performance characteristics across the experimental datasets, with three distinct performance regimes identified: a small-scale regime (10M data records) characterized by query-type dependent optimization effectiveness, a medium-scale regime (100M data records) showing optimization trade-off transitions, and a large-scale regime (300M data records) dominated by I/O and storage optimizations. The research reveals mixed optimization outcomes, with performance improvements ranging from 21 percent CPU reduction at small scales to 62 percent I/O improvement at large scales, while demonstrating that optimization strategies effective at one scale can become counterproductive at another. Overall, the framework shows variable success in parameter selection for distribution style, sort key and encoding selection.

The research identifies fundamental challenges in optimizing Amazon Redshift table configurations where internal algorithms remain opaque and optimization benefits exhibit non-linear scaling patterns across the tested different data volumes. While the framework provides valuable insights into scale-dependent optimization patterns, the mixed results highlight the complexity of achieving consistent performance improvements across different scales and query types. These findings challenge assumptions about uniform optimization benefits and emphasize the need for empirical validation approaches in cloud database optimization, providing practical insights for database administrators and theoretical foundations for developing adaptive optimization systems.

Files

License info not available
warning

File under embargo until 17-07-2027