Data Management Framework

Revision as of 17:40, 12 December 2011 by Jpr@uab.edu (talk | contribs) (Create draft of framework from email thread)



The Research Computing Platform supports defining and curating data sets for use by researchers directly or by reference in data analysis packages.

Philosophy of Data Management Framework

Data sets should be treated in a manner akin to apps. Different apps can have different admin/owner groups, organized by app. An app is a work product whose outcome is a curated application install. These apps go in /share/apps/<apptag> and the permissions are defined based on the group maintaining the app.

Similarly, data sets should be considered work products whose outcome is a curated data set. As with applications, no single group will manage all data sets. Data sets should be organized in /lustre/projects/public-datasets/<datasettag> (or better, /lustre/data/<datasettag>). Permissions on /lustre/data/<datasettag> should be based on the people who agree to maintain that specific <datasettag>. Some users will be admins on multiple data sets; some groups may bundle several data sets under one datasettag, while others may prefer a strict separation dictated by upstream sources or organizations. (Think GitHub here.)
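
The per-tag layout and group-based permissions described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the platform's actual tooling: the helper names dataset_dir and create_dataset_dir, the tag-validation rule, and the setgid mode 2775 are all hypothetical choices made for the example. Only the /lustre/data/<datasettag> convention comes from the text.

```python
import os
import re
from pathlib import Path

# Root taken from the /lustre/data/<datasettag> convention in the text.
DATA_ROOT = Path("/lustre/data")

# Hypothetical rule (an assumption): a tag starts with a lowercase letter or
# digit, followed by lowercase letters, digits, '.', '_' or '-'.
_TAG_RE = re.compile(r"[a-z0-9][a-z0-9._-]*")


def dataset_dir(tag, root=DATA_ROOT):
    """Return the directory for a dataset tag, rejecting unsafe tags."""
    if not _TAG_RE.fullmatch(tag):
        raise ValueError(f"invalid dataset tag: {tag!r}")
    return Path(root) / tag


def create_dataset_dir(tag, gid=None, root=DATA_ROOT):
    """Create the dataset directory and make it group-writable.

    gid is the numeric id of the group that agreed to maintain the data set;
    None keeps whatever group the new directory received by default.
    """
    path = dataset_dir(tag, root)
    path.mkdir(parents=True, exist_ok=True)
    if gid is not None:
        os.chown(path, -1, gid)  # -1 leaves the owning user unchanged
    # Request rwxrwsr-x: maintainers may write, new files inherit the
    # directory's group (setgid), everyone else may read.
    os.chmod(path, 0o2775)
    return path
```

Passing a different root, for example a scratch directory, lets the same helpers be tried without touching the shared file system.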

Galaxy Example

Considering the Galaxy application, the current /lustre/project/galaxy/public-datasets fits into the above model if you think of this as a curated data set for Galaxy where the dataset admins have chosen to treat a number of distinct data sets as part of a single collection. This also facilitates developing datasets with additional artifacts that support inclusion in select tools, e.g. a "galaxy public data set". It also supports layering dataset products so that one data set might just be the metadata associated with hooking another data set into specific tools.
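
The layering idea can be made concrete with a small data structure. The sketch below is only one possible reading of it: the Dataset class, its base and tool_metadata fields, the resolve_layers helper, and the example tags are all hypothetical and do not describe an existing Galaxy or platform API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dataset:
    tag: str                     # directory name of the data set
    base: Optional[str] = None   # tag of the data set this one layers on
    # Metadata that hooks the base data set into a specific tool.
    tool_metadata: Dict[str, str] = field(default_factory=dict)


def resolve_layers(tag: str, registry: Dict[str, Dataset]) -> List[str]:
    """Return the tags from the given data set down to its base data set.

    Raises KeyError for an unknown tag and ValueError if layers form a cycle.
    """
    chain: List[str] = []
    seen = set()
    current: Optional[str] = tag
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in dataset layers at {current!r}")
        seen.add(current)
        chain.append(current)
        current = registry[current].base
    return chain


if __name__ == "__main__":
    # Hypothetical tags: a raw data set and a metadata-only layer for Galaxy.
    registry = {
        "example-genome": Dataset("example-genome"),
        "example-genome-galaxy": Dataset(
            "example-genome-galaxy",
            base="example-genome",
            tool_metadata={"tool": "galaxy"},
        ),
    }
    print(resolve_layers("example-genome-galaxy", registry))
```

Keeping the tool-specific layer as its own data set means the underlying data can be curated once and hooked into several tools by separate maintainers.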

This organization of apps and datasets helps us treat them as similar abstractions with similar management, curation, and oversight demands. It also lets us map Galaxy's needs more clearly onto an environment that is consistent across tools.