Item Details

Automated Data Management in Cloud Computing

Ruiz Alvarez, Arkaitz
Format
Thesis/Dissertation; Online
Author
Ruiz Alvarez, Arkaitz
Advisor
Humphrey, Marty
Abstract
Scientists increasingly rely on computational resources, both compute and storage, to expand scientific knowledge. For example, the data deluge is quickly overwhelming the capacity of storage systems, and the increasing use of simulation requires large compute capabilities. Thus, scientists need to expand their local resources with highly available and scalable systems. We consider cloud computing to be the solution that provides scientific applications with the computational resources needed. However, the services offered by cloud providers leave several important issues unaddressed: how to meet the data requirements with the storage systems available, and how to optimize cost and other performance metrics. The variety of storage and compute choices with different characteristics and prices, the growth of the stored data in both size and number, and the data management requirements make these tasks overwhelmingly complex for individual users. To address these challenges, we focus on four key elements of data management: the analysis of current storage services, the expression of data requirements and storage capabilities, data management algorithms, and data-aware scheduling algorithms. We combine the information from our analysis of the storage services with their capabilities in a machine-readable format that can be processed by our implementation of the user's data requirements. Thus, we can obtain within a few milliseconds a list of storage services per application dataset that meet the user's requirements, and provide cost and performance estimates. Our approach to data management uses this list to generate an integer linear programming problem. The solution to this problem is an optimal assignment of the application's data to cloud services. Our implementation can provide optimal solutions for our use cases in less than one second.
We have also created new scheduling algorithms for two types of cloud applications (MapReduce and watershed model calibration) that balance cost and execution time. The scheduling decisions are Pareto optimal and, therefore, superior to other strategies. We believe that these four elements can provide the users with a comprehensive solution to the data management problem, and allow them to take advantage of the new opportunities that cloud computing offers.
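The two-step idea in the abstract (first list the storage services that satisfy each dataset's requirements, then pick the cheapest overall assignment) can be illustrated with a minimal sketch. This is not the thesis's implementation: every service name, capability field, price, and dataset below is hypothetical, and the assignment step uses brute-force enumeration in place of an integer linear programming solver, which is only workable for a handful of datasets.

```python
from itertools import product

# Hypothetical storage services; capabilities and prices are invented values.
SERVICES = {
    "blob_standard": {"durability_nines": 11, "latency_ms": 100,
                      "usd_per_gb": 0.10, "capacity_gb": 1000},
    "blob_reduced": {"durability_nines": 4, "latency_ms": 100,
                     "usd_per_gb": 0.06, "capacity_gb": 1000},
    "local_cache": {"durability_nines": 2, "latency_ms": 5,
                    "usd_per_gb": 0.25, "capacity_gb": 50},
}

# Hypothetical application datasets and their requirements.
DATASETS = {
    "raw_input": {"size_gb": 400, "min_durability_nines": 9,
                  "max_latency_ms": 200},
    "intermediate": {"size_gb": 40, "min_durability_nines": 2,
                     "max_latency_ms": 10},
    "results": {"size_gb": 20, "min_durability_nines": 4,
                "max_latency_ms": 200},
}


def candidates(req):
    """Step 1: names of services whose capabilities meet one dataset's requirements."""
    return [
        name for name, cap in SERVICES.items()
        if cap["durability_nines"] >= req["min_durability_nines"]
        and cap["latency_ms"] <= req["max_latency_ms"]
    ]


def cheapest_assignment(datasets):
    """Step 2: enumerate every feasible assignment and keep the cheapest.

    Stands in for the integer linear program: one service per dataset,
    per-service capacity limits, minimum total storage cost.
    Returns (None, inf) if no feasible assignment exists.
    """
    names = list(datasets)
    options = [candidates(datasets[n]) for n in names]
    best, best_cost = None, float("inf")
    for choice in product(*options):
        # Check the capacity constraint for each service used.
        used = {}
        for n, s in zip(names, choice):
            used[s] = used.get(s, 0) + datasets[n]["size_gb"]
        if any(gb > SERVICES[s]["capacity_gb"] for s, gb in used.items()):
            continue
        cost = sum(datasets[n]["size_gb"] * SERVICES[s]["usd_per_gb"]
                   for n, s in zip(names, choice))
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best, best_cost


if __name__ == "__main__":
    assignment, cost = cheapest_assignment(DATASETS)
    print(assignment, round(cost, 2))
```

Enumeration grows exponentially with the number of datasets, which is one reason to formulate the problem for a dedicated solver instead, as the abstract describes.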
Language
English
Date Received
2012-04-30
Published
University of Virginia, Department of Computer Science, PHD (Doctor of Philosophy), 2012
Published Date
2012-04-27
Degree
PHD (Doctor of Philosophy)
Collection
Libra ETD Repository
In Copyright
