Title Comparing NetCDF and a multidimensional array database on managing and querying large hydrologic datasets: A case study of SciDB
Author Liu, H.
Contributor Van Oosterom, P. (mentor); Tijssen, T. (mentor); Commandeur, T. (mentor); Lindenbergh, R. (mentor)
Faculty Architecture and The Built Environment
Department Geomatics
Programme Geomatics
Date 2014-10-28

Abstract
Like many ICT-related domains, hydrology is entering the era of big data, and managing large volumes of data is a potential issue facing hydrologists. At present, however, hydrologic data research is mostly concerned with data collection, interpretation, modelling and visualization; the management and querying of large datasets draw little interest. The motivation for this research originates from a specific data management problem reported by Hydrologic Research B.V.: time series extraction takes an intolerably long time when a large multidimensional dataset is stored in the NetCDF classic or 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by these formats. In this research, therefore, the NetCDF-4 format and a multidimensional array database, both of which apply a chunked storage structure, are benchmarked to learn whether and how chunked storage can benefit the queries executed by hydrologists. To achieve a convincing and representative benchmark result, experts were consulted to collect the queries and sample datasets frequently handled by water experts. From the raw consultation records, 5 classes of query were summarized and specific queries for benchmarking were devised. After this, 9 criteria were established to assess which multidimensional array database is most suitable for benchmarking, and SciDB was finally chosen.
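The contrast between contiguous and chunked storage described above can be illustrated with a small sketch. It counts how many separate contiguous runs a query needs in a row-major contiguous layout, and how many chunks the same query touches in a chunked layout. The array shape, chunk shape and function names are hypothetical choices for illustration; they are not taken from the thesis, and the counts are a rough proxy for I/O cost rather than a measured result.

```python
def contiguous_runs(shape, count):
    """Separate contiguous runs a hyperslab read needs in a row-major layout.

    `shape` is the full array shape, `count` the number of elements
    selected along each dimension.
    """
    runs = 1
    merged = True  # True while every dimension seen so far was read in full
    for dim in reversed(range(len(shape))):
        if merged:
            # This dimension's selection still lies inside a single run.
            if count[dim] != shape[dim]:
                merged = False
        else:
            # Every selected index here starts a new run.
            runs *= count[dim]
    return runs


def chunks_touched(start, count, chunk):
    """Number of chunks intersected by a hyperslab in a chunked layout."""
    total = 1
    for s, c, size in zip(start, count, chunk):
        first = s // size
        last = (s + c - 1) // size
        total *= last - first + 1
    return total


# Hypothetical dataset: hourly values for one year on a 500 x 500 grid,
# dimension order (time, y, x), with an assumed chunk shape.
shape = (8760, 500, 500)
chunk = (876, 10, 10)

# Time series at one grid cell: 8760 runs when contiguous, 10 chunks when chunked.
ts_start, ts_count = (0, 250, 250), (8760, 1, 1)
print(contiguous_runs(shape, ts_count), chunks_touched(ts_start, ts_count, chunk))

# Full spatial slice at one time step: 1 run when contiguous, 2500 chunks when chunked.
sl_start, sl_count = (0, 0, 0), (1, 500, 500)
print(contiguous_runs(shape, sl_count), chunks_touched(sl_start, sl_count, chunk))
```

Under these assumed shapes, the time series query is far cheaper in the chunked layout while the spatial slice is cheaper in the contiguous one, which is why chunk size becomes a parameter worth benchmarking.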
To establish a fair benchmark test environment, the HydroNET-4 system was used, and adapters for NetCDF files and SciDB were developed to manage and query data. The final benchmark tests investigated the influence of data compression on queries and the scalability of the data solutions, i.e. the 64-bit offset, NetCDF-4 and SciDB solutions. In addition, the effects of chunk size and dimension order of SciDB arrays on query performance were explored. It turns out that the NetCDF-4 solution with a recommended chunk size has the best overall management and query performance among all solutions. SciDB arrays using small chunk sizes show favorable performance. However, with the current implementation of SciDB, a large number of small chunks causes a huge overload of main memory, which constrains SciDB's scalability. For SciDB, DEFLATE compression has either a negative effect or no effect on query performance. In the time series extraction test, the compression effect is found to be correlated with chunk size, and the negative impact of compression on queries decreases as chunk size is reduced. It is deduced that with hypercubic and modest chunk sizes, the internal data structure of chunks in SciDB has no significant influence on query performance. The research demonstrates that for large data management and query, a file-based solution can be a better choice than a database utilizing smart caching and indexing strategies. However, due to the limited scope of the research (for instance, no parallel query processing was tested), more work needs to be conducted in the future.

Subject NetCDF; SciDB; chunked storage structure; benchmark test; hydrologic datasets; multidimensional array database
To reference this document use: http://resolver.tudelft.nl/uuid:e55765a9-dfcb-4a6d-b9fb-f9279137c50c
Part of collection Student theses
Document type master thesis
Rights (c) 2014 Liu, H.
Files
PDF ThesisHydroDataBenchmark.pdf 3.98 MB
PDF P5Slides.pdf 917.06 KB