Git Primer

From Cheaha


Attention: Research Computing Documentation has Moved
https://docs.rc.uab.edu/


Please use the new documentation url https://docs.rc.uab.edu/ for all Research Computing documentation needs.


As a result of this move, we have deprecated use of this wiki for documentation. We are providing read-only access to the content to facilitate migration of bookmarks and to serve as an historical record. All content updates should be made at the new documentation site. The original wiki will not receive further updates.

Thank you,

The Research Computing Team

Git is a tool to help you keep track of your content. Your data. The information that drives your world.

Git is very popular with people who curate large collections of instructions for machines and humans. In fact, Git is used to keep track of all of the instructions that make this machine work; that make this machine a website.

Because these people, "coders," are the dominant community of users for Git, many of the features built into Git are tuned to support the effort involved in managing improvements to the steps of a process. A lot of the documentation on the web focuses on that use case. If you aren't familiar with this domain or with what Git is doing, you'll get lost.

If you don't understand this context, working with Git will appear to be magic, and scary magic at that.

It's only magic if you don't understand it. And like all magic, if you don't understand it, it can have unintended consequences. Imagine using a lawn mower to dust the dirt off the floors in your house. That's not what it's for, and if you use it like that, don't be surprised if you break stuff or cut off your toes.

Git isn't magic and it isn't hard. There are explicit steps with predictable outcomes. The outcomes are completely deterministic. Like any powerful tool, there is more than one way to use Git. Understanding Git's flexibility is the key to mastering your craft.

This is a getting started guide for Git to help clarify what it does and how to use it to manage your content.

You're likely here because you are documenting a process. If you keep your mind open to that task, you'll find a friend and trusted partner in Git in no time.

Getting Started

Git is used to store content (files) in a container (repository) and to maintain a history of the changes you make to the content in that container. It is designed to help you copy containers and merge the history of changes made in one container with the changes made in other containers.

This makes Git a very powerful tool for collaborating with other people. You can share your work and keep track of the history of changes to your container, so you can understand how you got to where you are now and change course if you find yourself at a dead end.

So the workflow for Git includes maintaining a history of changes to content and sharing this history through a process known as merging. Combined, these two actions implement a data structure known as a directed acyclic graph. That sounds scary, but it doesn't have to be.

Data structures are devices we have contrived to help us organize and manage collections of data, i.e., information. You're likely already familiar with the most common data structures, like lists (ordered collections of data) or arrays (ordered collections of data with a numeric index). You may also be familiar with two-dimensional arrays (collections of data with a row and column index), e.g., spreadsheets and matrices. You're likely also familiar with queues (collections of data with a front and a rear, where data is added to the rear and removed from the front); think grocery store checkout lines.
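To make these familiar structures concrete, here is a minimal sketch in Python. The variable names and values are made up for illustration; nothing here is specific to Git.

```python
from collections import deque

# A list: an ordered collection of data.
steps = ["edit", "review", "publish"]

# Python lists also carry a numeric index, like an array.
first_step = steps[0]

# A two-dimensional array: each value has a row index and a column index.
grid = [
    [1, 2, 3],
    [4, 5, 6],
]
cell = grid[1][2]  # second row, third column

# A queue: data is added to the rear and removed from the front,
# like a grocery store checkout line.
checkout_line = deque()
checkout_line.append("Ann")       # joins at the rear
checkout_line.append("Bob")       # joins behind Ann
served = checkout_line.popleft()  # the person at the front leaves first

print(first_step, cell, served)
```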

There is a whole branch of mathematical theory around graphs, and of computer science theory around implementing those graphs as data structures. While it's fun to know this stuff, we only need to know a little of it to become effective users of Git.

A directed acyclic graph, or DAG for short, describes a pattern that is common to collaboration. As Wikipedia puts it, "DAGs may be used to model processes in which information flows in a consistent direction through a network of processors." While not very personable, this is the basis of collaboration. Collaborators are the "network of processors" and the data that we are sharing is the "information flow".

So Git is really a tool that helps you manage a process central to collaboration: tracking the actions of many independent actors working on (i.e., modifying) a shared collection of data.

Git is most commonly used by developers to work on code in parallel and to merge the results of those independent efforts. We'll take this perspective in this primer because that's the use we're putting it to.

Git has two sides: the data set you are tracking and the history of modifications to that data set. Git doesn't have too much to say about the data you are tracking. It does assume that all the data you are managing in your collection exists in a single directory tree of your file system. It also assumes that you are responsible for defining the relationships between your file and directory objects. That is, it doesn't dictate any structure for your data. It will happily merge two wholly unrelated data collections if that's what you tell it to do.
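One way to picture the first side is as a snapshot of a directory tree that records which files exist and what they contain, without caring how the files relate to each other. The sketch below is a toy illustration only, not Git's actual storage format. The helper names `snapshot` and `changed_paths` and the sample files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(root):
    """Map each file path (relative to root) to a digest of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def changed_paths(before, after):
    """List paths that were added, removed, or modified between two snapshots."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))

# Build a small throwaway directory tree with an arbitrary layout.
with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp)
    (tree / "notes").mkdir()
    (tree / "notes" / "todo.txt").write_text("write primer\n")
    (tree / "README").write_text("hello\n")
    first = snapshot(tree)

    # Modify one file and add another, then snapshot again.
    (tree / "README").write_text("hello, world\n")
    (tree / "data.csv").write_text("a,b\n")
    second = snapshot(tree)

print(changed_paths(first, second))
```

The sketch never inspects file types or directory names. Like Git, it treats whatever structure you give the tree as your business.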

The second side of Git is the history of modifications. This is the part of Git that builds, maintains, and lets you review the history of changes that have occurred to the directory tree. Every recorded file change and every merge with work from another copy of the data set builds up the history for this copy of the data set. It is this history that forms the directed acyclic graph.
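A history like this can be sketched as a small graph in which every recorded change points back at the change or changes it was built on. The model below is a simplified assumption for illustration: the single-letter names and the `ancestors` and `is_acyclic` helpers are hypothetical, and real Git histories carry much more information per entry.

```python
# Toy history: each recorded change lists its parent change(s).
history = {
    "A": [],          # the first recorded change has no parent
    "B": ["A"],       # a change made on top of A
    "C": ["A"],       # a change made in another copy, also on top of A
    "D": ["B", "C"],  # a merge combines two lines of work, so it has two parents
}

def ancestors(change, graph):
    """Return every change reachable from `change` by following parent links."""
    seen = set()
    stack = [change]
    while stack:
        current = stack.pop()
        for parent in graph[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_acyclic(graph):
    """True if no change is its own ancestor, i.e. the links never loop back."""
    return all(name not in ancestors(name, graph) for name in graph)

print(sorted(ancestors("D", history)))  # everything the merge D builds on
print(is_acyclic(history))
```

The links are directed because each one points from a change to what came before it, and the graph is acyclic because a change can never be built on top of itself.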