Hopkins Storage Systems Lab

Storage and Database Systems for Science and Engineering


Processing Large Scientific Workloads

The sciences face a data avalanche: improvements in physical instruments and data pipelines are driving exponential growth in data size. Federated and clustered databases are attractive solutions for exploring the resulting massive, widely distributed data. Examples include the SkyQuery federation of Astronomy databases and the JHU Turbulence Database Cluster. To ensure high job throughput and prevent starvation of traditional workloads, we propose LifeRaft, a query processing discipline that eliminates redundant I/O for scan-based queries. Sites in SkyQuery service millions of queries each month, and individual queries may execute for several hours or an entire day. We reduce redundant disk I/O requests from concurrent queries by relaxing in-order scheduling. Specifically, rather than executing queries in arrival order, we interleave I/O requests from multiple queries based on contention for shared data. While this approach increases the response time of certain queries, we observe a more than twofold improvement in system throughput for both Astronomy and Turbulence workloads.
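The idea of interleaving I/O by contention can be illustrated with a minimal Python sketch. This is not LifeRaft's implementation: the bucket model (each query names the data partitions it must scan), the function names, and the `age_weight` knob used to keep old queries from starving are all assumptions made for illustration.

```python
from collections import defaultdict


def arrival_order_reads(queries):
    """Bucket reads when every query scans its own buckets independently."""
    return sum(len(buckets) for buckets in queries.values())


def contention_order(queries, arrival, age_weight=0.0):
    """Order in which buckets are read when each bucket is read only once.

    queries:    dict query_id -> set of bucket ids the query must scan
    arrival:    dict query_id -> arrival time (smaller means older)
    age_weight: hypothetical knob; 0 ranks buckets purely by contention,
                larger values favour buckets that old queries wait on.
    """
    # Invert the workload: which queries are waiting on each bucket?
    pending = defaultdict(set)
    for qid, buckets in queries.items():
        for bucket in buckets:
            pending[bucket].add(qid)

    now = max(arrival.values(), default=0)

    def score(bucket):
        waiting = pending[bucket]
        oldest_wait = max(now - arrival[q] for q in waiting)
        return len(waiting) + age_weight * oldest_wait

    order = []
    while pending:
        # Sorting first makes ties resolve to the smallest bucket id.
        best = max(sorted(pending), key=score)
        order.append(best)   # one read serves every query waiting on it
        del pending[best]
    return order


queries = {"q1": {1, 2, 3}, "q2": {2, 3, 4}, "q3": {3}}
arrival = {"q1": 0, "q2": 1, "q3": 2}
print(arrival_order_reads(queries))        # 7 reads in arrival order
print(contention_order(queries, arrival))  # [3, 2, 1, 4], i.e. 4 reads
```

In this toy workload, sharing reads cuts seven bucket scans to four, but q1 no longer sees its buckets in its own preferred order, which mirrors the throughput versus response-time trade-off described above.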

Data size and geography in SkyQuery mean that transmitting data takes a long time and has a profound impact on query performance. Each site may produce hundreds of megabytes of data that must be sent to other sites and joined before results are delivered to the scientists. We devised algorithms that incorporate network structure, such as the throughput of paths and clusters of sites, and that account for data access requirements in query scheduling. By exploiting excess capacity in the network and avoiding transfers over long geographic distances, we achieve orders-of-magnitude improvements for queries that join ten or more databases.
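To make the role of path throughput concrete, here is a small Python sketch that picks the site at which to join intermediate results. It is a simplification under stated assumptions, not the scheduling algorithm used in SkyQuery: the site names, sizes, and throughput figures are invented, transfers are assumed to run one after another, and site clustering is ignored.

```python
def transfer_seconds(size_mb, throughput_mbps):
    """Time to ship size_mb megabytes over a path of throughput_mbps Mbit/s."""
    return size_mb * 8 / throughput_mbps


def shipping_cost(dst, result_mb, throughput_mbps):
    """Total seconds to send every other site's results to dst."""
    return sum(
        transfer_seconds(result_mb[src], throughput_mbps[(src, dst)])
        for src in result_mb
        if src != dst
    )


def best_join_site(result_mb, throughput_mbps):
    """Site that minimises total shipping time (ties go to the first name)."""
    return min(
        sorted(result_mb),
        key=lambda dst: shipping_cost(dst, result_mb, throughput_mbps),
    )


# Hypothetical three-site federation: A holds the most data,
# but the path between A and C is slow.
result_mb = {"A": 400, "B": 100, "C": 100}
throughput_mbps = {
    ("A", "B"): 100, ("B", "A"): 100,
    ("A", "C"): 10,  ("C", "A"): 10,
    ("B", "C"): 100, ("C", "B"): 100,
}
print(best_join_site(result_mb, throughput_mbps))
```

With these made-up numbers, joining at A (the site holding the most data) costs 88 seconds of transfer because C's results must cross the slow path, while joining at B costs 40 seconds. Data size alone is therefore a poor guide to where a join should run.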

Finally, we devise automated physical schema design tools for the management of large-scale scientific databases. Many current tools are offline and require DBAs to identify a representative workload for tuning, decide when tuning is required, and estimate the relative benefits of changing the schema design. We are currently developing AdaptPD, a tool that continuously monitors the workload and adapts the schema design to suit it. AdaptPD models schema design as a metrical task system and includes query and transition cost estimation modules to keep the tuning process lightweight.
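The metrical task system view can be sketched in a few lines of Python: states are candidate schema designs, each query has an execution cost in each state, and moving between states has a transition cost. The sketch below computes the best achievable cost with full hindsight by dynamic programming. It illustrates the cost model only and is not AdaptPD's online algorithm; the configuration names and all cost values are hypothetical.

```python
def offline_optimal_cost(configs, start, query_costs, transition):
    """Minimum total cost of serving a known workload.

    configs:     list of schema design names (the states)
    start:       design in place before the first query
    query_costs: list of dicts, one per query: design -> execution cost
    transition:  dict (from_design, to_design) -> cost of changing design
    """
    def move_cost(src, dst):
        return 0.0 if src == dst else transition[(src, dst)]

    inf = float("inf")
    # best[c] = cheapest way to have served the queries so far, ending in c
    best = {c: (0.0 if c == start else inf) for c in configs}
    for costs in query_costs:
        # Optionally change design before the query, then pay to run it.
        best = {
            c: min(best[p] + move_cost(p, c) for p in configs) + costs[c]
            for c in configs
        }
    return min(best.values())


configs = ["heap", "indexed"]
transition = {("heap", "indexed"): 10.0, ("indexed", "heap"): 10.0}
# Four queries that each cost 5 on the heap layout and 1 on the indexed one.
workload = [{"heap": 5.0, "indexed": 1.0}] * 4
print(offline_optimal_cost(configs, "heap", workload, transition))  # 14.0
```

Staying on the heap layout would cost 20, whereas paying 10 once to switch and then 1 per query costs 14. An online tool has to make the same trade-off without knowing the future workload, which is why cheap and accurate query and transition cost estimates matter.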

Publications


Last Updated on Saturday, 05 December 2009 01:58