A Coflow-based Co-optimization Framework for High-performance Data Analytics

Conference Paper (2017)
Author(s)

Long Cheng (Eindhoven University of Technology)

Ying Wang (Chinese Academy of Sciences)

Yulong Pei (Eindhoven University of Technology)

D.H.J. Epema (TU Delft - Data-Intensive Systems)

Research Group
Data-Intensive Systems
Copyright
© 2017 Long Cheng, Ying Wang, Yulong Pei, D.H.J. Epema
DOI related publication
https://doi.org/10.1109/ICPP.2017.48
Publication Year
2017
Language
English
Pages (from-to)
392-401
ISBN (electronic)
978-1-5386-1042-8
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Efficient execution of distributed database operators such as joins and aggregations is critical for the performance of big data analytics. As the compute speed of modern CPUs increases, reducing the network communication time of these operators in large systems is becoming increasingly important, and it also challenges current techniques. Significant performance improvements have been achieved by state-of-the-art methods, such as network traffic reduction designed in the data management domain and data flow scheduling in the data communications domain. However, the techniques proposed in each field treat the other as a black box, and the performance gains available from a co-optimization perspective have not yet been explored. In this paper, based on current research in coflow scheduling, we propose a novel Coflow-based Co-optimization Framework (CCF), which can co-optimize application-level data movement and network-level data communications for distributed operators and consequently improve their performance in large distributed environments. We present the detailed design and implementation of CCF and conduct an experimental evaluation of CCF using large-scale simulations of large data joins. Our results demonstrate that CCF always performs faster than current approaches on network communications in large-scale distributed scenarios.