Data Management Framework: Difference between revisions

From Cheaha
(Create draft of framework from email thread)

Latest revision as of 17:40, 12 December 2011

The Research Computing Platform supports defining and curating data sets for use by researchers directly or by reference in data analysis packages.

Philosophy of Data Management Framework

Data sets should be treated much like apps. An app is a work product whose outcome is a curated application install, and each app can have its own admin/owner group. Apps go in /share/apps/<apptag>, and the permissions are defined by the group maintaining the app.

Similarly, a data set should be considered a work product whose outcome is a curated data set. As with applications, no single group will manage all data sets. Data sets should be organized in /lustre/projects/public-datasets/<datasettag> (or better, /lustre/data/<datasettag>). Permissions on /lustre/data/<datasettag> should be based on the people who agree to maintain that specific <datasettag>. Some users will be admins on multiple data sets; some groups may bundle several data sets under one datasettag, while others may prefer a strict separation dictated by upstream sources or orgs. (Think GitHub here.)
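The layout described above can be sketched in Python. This is a minimal illustration, not the site's provisioning procedure: the helper names `dataset_path` and `create_dataset_dir`, the mode `0o2775` (group-writable with setgid), and the tag `example-genomes` are all assumptions made for the example.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Assumed mode: owner and group may write, others may read; the setgid bit
# keeps new files in the maintaining group. Illustrative, not site policy.
DATASET_MODE = 0o2775


def dataset_path(root, datasettag):
    """Return <root>/<datasettag>, rejecting tags that would escape root."""
    if not datasettag or "/" in datasettag or datasettag in (".", ".."):
        raise ValueError(f"invalid datasettag: {datasettag!r}")
    return Path(root) / datasettag


def create_dataset_dir(root, datasettag, group=None):
    """Create the directory for one curated data set and set its permissions."""
    path = dataset_path(root, datasettag)
    path.mkdir(parents=True, exist_ok=True)
    if group is not None:
        # Requires an existing group and sufficient privileges.
        shutil.chown(path, group=group)
    # chmod comes last because changing ownership can clear the setgid bit.
    os.chmod(path, DATASET_MODE)
    return path


if __name__ == "__main__":
    # A temporary directory stands in for /lustre/data in this demo.
    with tempfile.TemporaryDirectory() as root:
        print(create_dataset_dir(root, "example-genomes"))
```

Passing `group="<maintainers>"` would hand the directory to the people who agreed to maintain that datasettag, in the same way /share/apps/<apptag> is owned by the group maintaining the app.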

Galaxy Example

Considering the Galaxy application, the current /lustre/project/galaxy/public-datasets fits into the above model if you think of it as a curated data set for Galaxy whose dataset admins have chosen to treat a number of distinct data sets as part of a single collection. This also facilitates developing datasets with additional artifacts that support inclusion in select tools, e.g. a "galaxy public data set". It also supports layering dataset products, so that one data set might consist only of the metadata needed to hook another data set into specific tools.
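The layering idea can be made concrete with a small sketch. Everything here is hypothetical: the `DatasetEntry` record, its `base` and `tools` fields, and the tags `example-genomes` and `example-genomes-galaxy` are invented for illustration and do not describe an existing registry or Galaxy configuration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetEntry:
    """One curated data set product (hypothetical record layout)."""
    tag: str
    path: str
    base: Optional[str] = None  # tag of the data set this one layers on
    tools: List[str] = field(default_factory=list)


def resolve_data_path(registry: Dict[str, DatasetEntry], tag: str) -> str:
    """Follow base links to the data set that actually holds the files."""
    seen = set()
    entry = registry[tag]
    while entry.base is not None:
        if entry.tag in seen:
            raise ValueError(f"layering cycle at {entry.tag!r}")
        seen.add(entry.tag)
        entry = registry[entry.base]
    return entry.path


registry = {
    # The underlying data, maintained by one group.
    "example-genomes": DatasetEntry(
        tag="example-genomes",
        path="/lustre/data/example-genomes",
    ),
    # A metadata-only layer that hooks the data above into a tool.
    "example-genomes-galaxy": DatasetEntry(
        tag="example-genomes-galaxy",
        path="/lustre/data/example-genomes-galaxy",
        base="example-genomes",
        tools=["galaxy"],
    ),
}

print(resolve_data_path(registry, "example-genomes-galaxy"))
```

Keeping the tool-specific metadata in its own datasettag means each layer can have a different maintaining group, which matches the per-datasettag permission model.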

This organization of apps and datasets helps us treat them as similar abstractions with similar management/curation/oversight demands. It also lets us map Galaxy's needs more clearly into an environment that is consistent across tools.