Searched for: subject:"Apache Arrow"
(1 - 20 of 20)
document
Su, Kexin (author)
General-purpose GPUs, renowned for their exceptional parallel processing capabilities and throughput, hold great promise for enhancing the efficiency of data analytics tasks. At the same time, recent developments in query execution engines have integrated the support of OLAP operations in a way that benefits from the zero serialization overhead...
master thesis 2023
document
Groet, Philip (author)
The rise of the new interconnect standard CXL, and previously OpenCAPI, has brought a great many possibilities to step away from the classical approach where CPUs are in charge of moving data between external devices and local memory. Specifically, OpenCAPI allows for attached devices to directly interface with the host memory bus in a near...
master thesis 2023
document
Ahmad, T. (author)
The ever increasing pace of advancements in sequencing technologies has enabled rapid DNA/genome sequencing to become much more accessible. In particular, next (second) and third generation sequencing technologies offer high throughput, massively parallel and cost effective sequencing solutions. Individual sample sequencing data volumes as well...
doctoral thesis 2022
document
Ahmad, T. (author), Al-Ars, Z. (author), Hofstee, H.P. (author)
Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. Most of the time, more than 80% of the total time spent accessing data elapses in the serialization/de-serialization step. Columnar data formats are gaining popularity in both analytics and transactional...
conference paper 2022
document
Abrahamse, Robin (author), Hadnagy, A. (author), Al-Ars, Z. (author)
The concept of memory disaggregation has recently been gaining traction in research. With memory disaggregation, data center compute nodes can directly access memory on adjacent nodes and are therefore able to overcome local memory restrictions, introducing a new data management paradigm for distributed computing. This paper proposes and...
conference paper 2022
document
Ahmad, T. (author), Ma, Chengxin (author), Al-Ars, Z. (author), Hofstee, H.P. (author)
Current cluster scaled genomics data processing solutions rely on big data frameworks like Apache Spark, Hadoop and HDFS for data scheduling, processing and storage. These frameworks come with additional computation and memory overheads by default. It has been observed that scaling genomics dataset processing beyond 32 nodes is not efficient on...
conference paper 2022
document
Luppes, Bob (author)
The increasing volume and latency requirements of big data impose challenges on the processing capacity of existing computing systems. FPGA accelerators can be leveraged to overcome these challenges, but questions remain as to how these accelerators are best deployed to accelerate big data frameworks. This work investigates how future big data...
master thesis 2021
document
Ahmad, T. (author), Al-Ars, Z. (author), Hofstee, H.P. (author)
Background: Recently, many new deep learning–based variant-calling methods like DeepVariant have emerged as more accurate compared with conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational costs. Therefore, there is a need for more scalable and higher performance workflows of...
review 2021
document
Peltenburg, J.W. (author), Hadnagy, A. (author), Brobbel, M. (author), Morrow, Robert (author), Al-Ars, Z. (author)
JSON is a popular data interchange format for many web, cloud, and IoT systems due to its simplicity, human readability, and widespread support. However, applications must first parse and convert the data to a native in-memory format before being able to perform useful computations. Many big data applications with high performance requirements...
conference paper 2021
document
Peltenburg, J.W. (author), van Straten, J. (author), Brobbel, M. (author), Al-Ars, Z. (author), Hofstee, H.P. (author)
As big data analytics systems are squeezing out the last bits of performance of CPUs and GPUs, the next near-term and widely available alternative industry is considering for higher performance in the data center and cloud is the FPGA accelerator. We discuss several challenges a developer has to face when designing and integrating FPGA...
journal article 2021
document
Peltenburg, J.W. (author), Van Leeuwen, Lars T.J. (author), Hoozemans, J.J. (author), Fang, J. (author), Al-Ars, Z. (author), Hofstee, H.P. (author)
In the domain of big data analytics, the bottleneck of converting storage-focused file formats to in-memory data structures has shifted from the bandwidth of storage to the performance of decoding and decompression software. Two widely used formats for big data storage and in-memory data are Apache Parquet and Apache Arrow, respectively. In...
conference paper 2021
document
Hadnagy, A. (author)
Recent trends in large-scale computing demonstrate continuous growth in the need for raw processing performance. At the same time, the slowdown of vertical scaling pushes the industry towards more energy-efficient heterogeneous architectures. With the appearance of FPGAs in the cloud and data centers, a new architecture is offered for offloading...
master thesis 2020
document
Nonnenmacher, F.M. (author)
Through new digital business models, the importance of big data analytics continuously grows. Initially, data analytics clusters were mainly bounded by the throughput of network links and the performance of I/O operations. With current hardware development, this has changed, and often the performance of CPUs and memory access became the new...
master thesis 2020
document
Ma, Chengxin (author)
Variant calling is a classic example of DNA data analysis, in which variants from DNA sequence data are identified. In recent years, a number of frameworks have been developed for implementing the variant calling pipeline. As a result, deeper scientific insights have been gained into genomics. From the computational point of view, however, these...
master thesis 2020
document
Ahmad, T. (author), Ahmed, N. (author), Peltenburg, J.W. (author), Al-Ars, Z. (author)
The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing....
conference paper 2020
document
Ahmad, T. (author), Ahmed, N. (author), Al-Ars, Z. (author), Hofstee, H.P. (author)
Background: Immense improvements in sequencing technologies enable producing large amounts of high throughput and cost effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need these large amounts of data closer to the processor (with low latency) for...
journal article 2020
document
Metaxas, Konstantinos (author)
As the digitisation of the world progresses at an accelerating pace, an overwhelming quantity of data from a variety of sources, of different types, organised in a multitude of forms or not at all, are subjected to diverse analytic processes for specific kinds of value to be extracted out of them. The aggregation of these analytic processes...
master thesis 2019
document
Peltenburg, J.W. (author), van Straten, J. (author), Wijtemans, L. (author), Van Leeuwen, Lars (author), Al-Ars, Z. (author), Hofstee, H.P. (author)
Modern big data systems are highly heterogeneous. The components found in their many layers of abstraction are often implemented in a wide variety of programming languages and frameworks. Due to language implementation differences, interfaces between these components, including hardware accelerated components, are often burdened by...
conference paper 2019
document
van Leeuwen, Lars (author)
With the advent of high-bandwidth non-volatile storage devices, the classical assumption that database analytics applications are bottlenecked by CPUs having to wait for slow I/O devices is being flipped around. Instead, CPUs are no longer able to decompress and deserialize the data stored in storage-focused file formats fast enough to keep up...
master thesis 2019
document
Peltenburg, J.W. (author), van Straten, J. (author), Brobbel, M. (author), Hofstee, H.P. (author), Al-Ars, Z. (author)
As a columnar in-memory format, Apache Arrow has seen increased interest from the data analytics community. Fletcher is a framework that generates hardware interfaces based on this format, to be used in FPGA accelerators. This allows efficient integration of FPGA accelerators with various high-level software languages, while providing an easy-to...
conference paper 2019