UAB Condor Pilot
The UAB Condor Pilot explored how useful aggregated unused compute cycles, harvested from many computers, are to research applications. The pilot established a demonstration spare-cycle compute fabric using the Condor scheduler and compared the performance of a molecular docking workflow on this fabric and on several larger production Condor fabrics with the performance of the same workflow running on our campus compute cluster, Cheaha.
The UAB Condor Pilot successfully demonstrated the value of harvested unused compute cycles to the molecular docking workflow. The results suggest that similar applications, especially those that can scale by repeating the same task on distinct data sets, will likewise benefit from the abundant compute resources that can be harvested via a Condor compute fabric. The UAB Condor Pilot ended in May 2012 with the presentation of our results (pdf) at Condor Week 2012.
Condor is a resource allocation and management system designed to simplify harvesting idle compute cycles from under-utilized computers. Condor is a production-quality software system developed by researchers at the University of Wisconsin. It is deployed in a wide range of environments, from lab or departmental compute pools with tens of processors to global compute fabrics such as the Open Science Grid (OSG) harnessing between 80,000 and 100,000 processors.
There is an active user and developer community around Condor. Condor is supported on Linux, Mac, and Windows, ensuring utility to a broad spectrum of applications, from scientific computations to large-scale statistical analyses. There are personal instances to support individuals migrating or developing their own workflows. The loosely coupled nature of the resource collections also makes it straightforward to dynamically scale out on popular cloud computing fabrics such as Amazon's EC2.
Molecular docking is a process for discovering an ideal orientation between two molecules, the receptor and the ligand. There are a number of approaches that can be taken to explore how a ligand (drug) can bind to a receptor (protein). The approach used in this pilot was a conformational space search using genetic algorithms, which evolve the orientation of the molecules to find the most likely orientation for docking to occur, as implemented by the AutoDock application from The Scripps Research Institute.
This virtual screening of protein-drug interactions is computationally intense and can incorporate large databases of chemical compounds. This makes it an ideal candidate for finding as many compute resources as possible to leverage during the screening process. Because this AutoDock workflow analyzes each receptor-ligand pair independently, it is also an ideal candidate for leveraging the loosely coupled collection of computers made available through the Condor scheduler.
Molecular docking is the computational part of a much larger workflow that involves discovery of the protein structure via X-ray crystallography. This process is nicely described in this video about X-ray crystallography at the Institute of Molecular and Cell Biology in Strasbourg, France. Similar facilities at Argonne National Laboratory are used by researchers at UAB's Center for Biophysical Sciences and Engineering to explore the structure of proteins.
Components of Pilot
The pilot leveraged three representative Condor implementations to assess the performance of the molecular docking workflow. The first testbed was a UAB campus Condor pool established as part of the pilot itself. This pool included approximately 40 64-bit Linux workstations from labs in the Department of Computer and Information Sciences and 40 32-bit Windows desktops from UAB IT Desktop Support. These systems are representative of the computers available on campus that typically have long idle periods and could potentially offer those cycles to compute tasks that need them.
The second testbed was the University of Wisconsin campus Condor pool. This is a production Condor pool that has over 1,000 64-bit Linux workstations. This pool is operated by the Center for High-Throughput Computing (CHTC) at the University of Wisconsin and is part of the production compute fabric available to researchers there. It was made available to us as part of our pilot through the generous support of CHTC and Dr. Miron Livny.
The third testbed was the dynamically provisioned Condor pool made available by the Engage VO (Virtual Organization), which leverages compute cycles provided by the Open Science Grid (OSG). The Engage VO was established to encourage exploration of OSG by new users. It operates a dynamically provisioned Condor pool using a technology called glideinWMS to affiliate available compute cycles in OSG with a virtual Condor pool. This is the essence of the on-demand resource allocation of cloud computing. The Engage VO is operated by RENCI (RENaissance Computing Institute), a multi-institutional research resource for North Carolina. Engage VO access and support was provided by John McGee, PI of the grant that established the Engage VO.
The pilot migrated a representative molecular docking workflow that runs on UAB's 888-core campus compute cluster using the SGE batch scheduler to a workflow that could be executed in the Condor scheduling and resource allocation environment. This workflow is built on the AutoDock molecular docking application from The Scripps Research Institute. Standard AutoDock workflows treat individual receptor-ligand docking searches as independent compute jobs. A large database of ligands and receptors is broken down into individual docking pairs, each of which is processed by an independent instance of the AutoDock application. This independence between work units makes an AutoDock-based molecular docking workflow an inherently good candidate for migration to the distributed and dynamic compute fabrics typical of Condor environments.
The primary consideration in the migration of this workflow to Condor was to re-package the data sets so they could be effectively distributed to many unrelated compute nodes. In traditional compute clusters, all nodes are in close physical proximity and operated as a common administrative unit. This makes it possible to provide access to a shared file system across the cluster and hide data distribution to compute nodes behind a global file name space and high-speed networks. In a Condor environment, compute nodes are typically spread across many physical locations, and their administrative independence makes it difficult to provide a global file name space. The potential variability in network speeds to compute nodes also necessitates disciplined data transfer. It is detrimental to distribute the full data set to all compute nodes when each node may only work on a small portion of the data. Given the three Condor fabrics targeted in the pilot, our focus was on a simple, static packaging that would treat one receptor and five ligands as a single job. That is, each job would compute 5 molecular pairings. Furthermore, each job would only require that the 1 target receptor and the 5 candidate ligands be staged to the compute node, providing an adequate balance between data set size (job staging transfer time) and computational time.
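The static packaging described above can be sketched in a few lines of Python. This is only an illustration of the one-receptor, five-ligand grouping, not the pilot's actual packaging code; the function name `make_jobs` and the file names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

LIGANDS_PER_JOB = 5  # batch size described in the text


def make_jobs(
    receptor: str,
    ligands: Iterable[str],
    per_job: int = LIGANDS_PER_JOB,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (receptor, ligand_batch) pairs with at most per_job ligands each.

    Each pair lists the only files that job needs staged to its compute node.
    """
    remaining = iter(ligands)
    while True:
        batch = list(islice(remaining, per_job))
        if not batch:
            return
        yield receptor, batch


if __name__ == "__main__":
    # Hypothetical file names, used only to show the grouping.
    ligand_files = [f"ligand_{i:03d}.pdbqt" for i in range(12)]
    for job_id, (rec, batch) in enumerate(make_jobs("receptor.pdbqt", ligand_files)):
        print(job_id, rec, batch)
```

With 12 ligands, the sketch produces two full jobs of five ligands and a final job holding the remaining two, so no job ever stages more than six input files.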
A second consideration in the workflow migration was availability of tools on the compute nodes. The standard AutoDock workflow consists of two parts, data preparation and molecular docking. The data preparation component could potentially be distributed, but it has many application dependencies that would have to be resolved on each compute node. Because data preparation is computationally light in comparison to the docking search effort and because it is independent of the actual docking, we chose to run this step once locally and produce a ready-to-run data set for molecular docking. The molecular docking step is completely managed by the AutoDock executable, which is self-contained and easily distributed along with the data sets.
Finally, there are syntactic differences between SGE and Condor, requiring conversion of the job submit code to support Condor in lieu of SGE.
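To give a sense of the target syntax, the sketch below renders a Condor submit description for one packaged job. Where an SGE job is typically a shell script carrying scheduler directives and submitted with qsub, Condor reads a separate file of key = value commands ending in a queue statement. The pilot's actual submit files are not reproduced here: the helper `condor_submit_description`, the wrapper script `run_docking.sh`, and all file names are assumptions made for illustration. Only the submit commands themselves (universe, executable, arguments, should_transfer_files, when_to_transfer_output, transfer_input_files, output, error, log, queue) are standard Condor.

```python
from typing import List


def condor_submit_description(
    executable: str,
    arguments: str,
    input_files: List[str],
    job_name: str,
) -> str:
    """Return the text of a Condor submit description for a single job.

    Input files are listed explicitly because the nodes are assumed
    to have no shared file system.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"transfer_input_files = {','.join(input_files)}",
        f"output = {job_name}.out",
        f"error = {job_name}.err",
        f"log = {job_name}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical wrapper script and data files for one 1-receptor, 5-ligand job.
    files = ["receptor.pdbqt"] + [f"ligand_{i:03d}.pdbqt" for i in range(5)]
    print(condor_submit_description("run_docking.sh", "job_0000", files, "job_0000"))
```

The generated text could be written to a file and handed to condor_submit. Listing every input file explicitly mirrors the disciplined data transfer discussed earlier: only the files a job needs travel to its node.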