For PHY7518 • April 14, 2025 • Uni Bonn
Sagnik Ghosh
@smutch.github.io ( "... and yes this comic is under version control" )
xkcd/1597
Version control: Idea 1
my_folder_1
my_file_1
my_file_2
my_img_2
my_folder_1a
my_file_1
my_file_2a
my_img_2
folder_1a
folder_1b
folder_2a
folder_1.1b
folder_2c
folder_2d
folder_1a_new
folder_1a_new_new
folder_1arrrrgghhhhh!
But not all is lost:
Pointers, data, and B-Trees
In 1970, Rudolf Bayer and Edward M. McCreight, while at Boeing, introduced B-trees, a self-balancing tree data structure. B-trees allow efficient data insertion, deletion, and search operations, making them ideal for databases and file systems.
Git uses a related but more general structure: a Directed Acyclic Graph (DAG) of objects.
Git tracks content using four types of objects:
Blob – A file's contents.
Tree – Like a directory: it maps filenames to blobs or other trees (yes, this is the tree in git tree!).
Commit – Points to a tree and has metadata (author, message, parents).
Tag – A label pointing to a specific commit or object.
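To make the graph idea concrete, here is a minimal Python sketch of a commit DAG. It is a toy model, not Git's implementation: the commit names (c1, m1, ...) and the helper `ancestors` are hypothetical, and each commit is reduced to the list of its parents.

```python
# Toy model of a commit graph: each commit maps to the list of its parents.
# Commit names and the helper below are made up for illustration.
parents = {
    "c1": [],            # initial commit: no parent
    "c2": ["c1"],        # regular commit: one parent
    "c3": ["c2"],        # work on one branch
    "c4": ["c2"],        # work on another branch
    "m1": ["c3", "c4"],  # merge commit: two parents
}

def ancestors(commit, parents):
    """Return every commit reachable from `commit`, including itself."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another path through the graph
        seen.add(current)
        stack.extend(parents[current])
    return seen

history = ancestors("m1", parents)
print(sorted(history))
```

Because commits only ever point backwards to their parents, the graph can never contain a cycle, which is what makes it a DAG.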
commit <object_size>
tree <tree_hash>
parent <parent_hash> # (optional, omitted in initial commit)
author <name> <email> <timestamp> <timezone>
committer <name> <email> <timestamp> <timezone>
<commit message>
A Git commit contains the following fields; references to other objects are SHA-1 (or SHA-256) hashes:
tree <tree_hash>
Points to a tree object that represents the directory structure and file contents at the time of the commit.
Think of it as a snapshot of the project’s state.
parent <commit_hash>
References the immediate predecessor commit(s).
Regular commits have 1 parent.
Merge commits have 2 or more parents.
The first commit has no parent.
author <name> <email> <timestamp> <timezone>
This is who originally wrote the code and when.
You can set this with git config --global user.name and user.email.
committer <name> <email> <timestamp> <timezone>
This is who added the commit to the repository.
Often the same as the author, but not always (e.g., someone else rebases or cherry-picks your work).
<commit message>
The message describing what the commit does.
This is what you write after git commit -m "your message".
The commit is stored in .git/objects/ as a zlib-compressed object.
Its SHA-1 (or SHA-256) hash uniquely identifies it and is calculated from the commit's entire content.
Changing anything (e.g., author name, message, tree) changes the commit hash.
You can check all this using:
git cat-file -p <commit_hash>
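How the hash is derived can be sketched in a few lines of Python. This is a simplified illustration for SHA-1 repositories, assuming the loose-object layout (a header `<type> <size>` plus a NUL byte, followed by the content). The helper names `git_object_id` and `loose_object_path` are hypothetical, and the commit text uses placeholder author values.

```python
import hashlib
import zlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL byte + content (loose-object layout)."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(object_id: str) -> str:
    """Loose objects live under .git/objects/<first 2 hex chars>/<remaining chars>."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"

# The empty blob has the same ID in every SHA-1 repository.
empty_blob = git_object_id("blob", b"")
print(empty_blob, loose_object_path(empty_blob))

# Placeholder commit body (made-up author); the tree is the empty tree.
commit_a = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1700000000 +0100\n"
    b"committer A U Thor <author@example.com> 1700000000 +0100\n"
    b"\n"
    b"first message\n"
)
# Changing only the message yields a different commit hash.
commit_b = commit_a.replace(b"first message", b"second message")
print(git_object_id("commit", commit_a) != git_object_id("commit", commit_b))

# On disk, header + content are zlib-compressed; decompression restores them.
stored = zlib.compress(b"blob 0\x00")
print(zlib.decompress(stored))
```

Since the parent hash is part of the commit content, changing one commit changes the hash of every commit that comes after it.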
But what about big data?
Services like GitHub and GitLab normally limit both individual file sizes and the size of the commits being pushed.
Enter Git LFS (Large File Storage).
It has to be set up separately:
git lfs install
Then track file patterns (this writes an entry to .gitattributes):
git lfs track "*.psd"
Make sure .gitattributes is tracked:
git add .gitattributes
But ultimately this is a bad idea. Best practice is to store code and data separately. Alternatives include sftp.uni-bonn.de. You can also initialise a separate Git repository just for the data!
Version tracked data can also be publicly shared (modulo some restrictions)
Zenodo 🚀 is an open-access repository developed by CERN 🏛️ through the OpenAIRE initiative 🇪🇺. It lets researchers share, preserve, and cite all kinds of research outputs—papers 📄, datasets 📊, software 💻, presentations 📽️, and more. Every upload gets a DOI 🔗, making it easy to cite and find. Zenodo supports all disciplines 🌍 and is free to use 💸, promoting open science and long-term preservation 📦.
So far so good. So how do you use Git?
Initialise repo locally
git init
or clone
git clone <repo-url>
to check status
git status
after change, stage a specific file or all changes
git add <file>
git add .
commit the staged changes
git commit -m "Your commit message"
finally to push to github/gitlab
git push
git push <remote> <branch>
to view current remotes
git remote -v
you can also add more remotes to the repo
git remote add <name> <url>
.gitignore 🚫📁 is a special file that tells Git which untracked files or folders to ignore, so they are never staged or committed by accident. It keeps your repo clean by ignoring things like logs 📝, temp files 🧹, and secrets 🔐: stuff you don’t want in version control.
# Ignore Python cache
__pycache__/
# Ignore log files
*.log
# Ignore environment files
.env
# Ignore OS-specific files
.DS_Store
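How such patterns select files can be illustrated with a rough Python sketch. This is not Git's matcher: real .gitignore rules also support negation with "!", anchoring with "/", and "**" wildcards, all of which are left out here. The helper `is_ignored` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns taken from the example above.
PATTERNS = ["__pycache__/", "*.log", ".env", ".DS_Store"]

def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Simplified check: does any path component match an ignore pattern?"""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: only compare against directory components,
            # i.e. every component except the final file name.
            name = pattern.rstrip("/")
            if any(fnmatchcase(part, name) for part in parts[:-1]):
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            return True
    return False

for candidate in ["src/__pycache__/mod.pyc", "run/output.log", ".env", "src/main.py"]:
    print(candidate, is_ignored(candidate))
```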
GitHub is the world’s leading platform for hosting and collaborating on code. Built around Git 🧠, it lets developers share projects 🌐, track issues 🐞, and work together through pull requests 🔀. With features like Actions ⚙️ for CI/CD, Pages 🌍 for documentation, and Copilot 🤖 for AI-assisted coding, GitHub powers open-source 🔓 and enterprise development alike. Owned by Microsoft 🪟, it’s the go-to hub for coders around the world 🌎.
GitLab is a full DevOps platform that brings your entire software development lifecycle into a single application. From version control 📂 and issue tracking 🧾 to CI/CD pipelines 🚀, security scans 🔒, and deployment tools ☁️—GitLab covers it all. Built on Git, it’s available both as a cloud service ☁️ and self-hosted solution 🏠, making it perfect for teams that want speed ⚡, control 🔧, and powerful collaboration 👥. Open core and proudly community-driven 💬!
Since January 2024, Uni Bonn offers its own GitLab instance.
GitHub Organizations are shared accounts where teams can collaborate on projects more efficiently. They provide structured access control, central management, and powerful tools for scaling development across multiple repositories.
Repository Management: Create, manage, and share repositories with team members.
Access Control: Configure permissions for repositories, ensuring secure and efficient collaboration.
Security & Compliance: Utilize features like SAML SSO and audit logs for enhanced security and compliance.
Project Integration: Seamlessly integrate with GitHub Actions, Issues, and other GitHub tools for streamlined workflows.
Visibility & Insights: Get detailed insights into contributions, activity, and project progress.
Custom Workflows: Leverage GitHub Apps and integrations to tailor your development process.
GitHub Classroom is a free tool by GitHub that helps teachers manage coding assignments with ease. It lets instructors create, distribute, and grade programming tasks using GitHub repositories—perfect for computer science classes, bootcamps, and workshops.
".. through a beautiful distributed graph theory tree model"
Trees can branch :)
Of course every tree already has at least one branch. To view:
git branch -v
>> master 8e1f3aa added points
This prints the branch names and the current head of each branch.
there are tools to visualise this:
from:
...
note: GitHub, GitLab and git can use different naming conventions for the default branch
GitHub used to call it "master", now mostly calls it "main"
GitLab (including the Bonn instance) calls it "main"
This actually depends on two things:
If the repository is created by a local installation, it depends on the Git version and its configuration.
GitHub switched its default branch name to "main" on October 1, 2020.
For GitLab instances, it depends on the version running in the backend.
The Bonn GitLab instance only started in January 2024 :)
Anyhow, starting with Git 2.28 the default branch name is globally configurable to whatever you prefer. The command is:
git config --global init.defaultBranch main
And of course you can rename any branch too. If you are currently on the branch you want to rename,
git branch -m <new-branch-name>
If you are on a different branch,
git branch -m <old-branch-name> <new-branch-name>
note: these changes are of course local, so you have to apply them separately on GitHub
a safe workflow is:
1. Push the newly named branch to remote:
git push origin <new-branch-name>
2. Delete the old branch from the remote:
git push origin --delete <old-branch-name>
3. Update the upstream tracking (optional but recommended):
git push --set-upstream origin <new-branch-name>
To create a branch, the syntax is:
git branch <branch-name>
But this does not switch you to the new branch. You have to check it out, or create and switch in one step:
git checkout -b <branch-name>
# or (recommended in newer Git versions)
git switch -c <branch-name>
If your Git version is 2.23 or newer, creating and switching can be done with the single command switch:
git switch -c <branch-name>
"-c" indicates create.
Recall: a commit hash depends on the tree, the parent hash, the commit message, the author and the committer.
So, what happens if two authors change two different files in the same branch, then commit and push without pulling first?
The second push is rejected until that author pulls. On pull, Git automatically merges the two diverged histories into a merge commit on main (which has its own hash).
Things of course do not go this well if the changes are on the same line of the same file. The push is rejected as before:
! [rejected] main -> main (fetch first)
error: failed to push some refs to 'https://github.com/your/repo.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
If you now try git pull, the merge fails:
Auto-merging filename.txt
CONFLICT (content): Merge conflict in filename.txt
Automatic merge failed; fix conflicts and then commit the result.
This happens because git pull is shorthand for,
git fetch
git merge origin/your-branch
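Why edits to different lines merge cleanly while edits to the same line conflict can be illustrated with a toy three-way merge in Python. This is only a sketch under strong assumptions: all three versions have the same number of lines and are compared position by position, whereas Git's real merge works on diff hunks. The function `merge_lines` and the sample data are hypothetical.

```python
def merge_lines(base, ours, theirs):
    """Toy three-way merge of files with the same number of lines.

    For each line: if both sides agree, take it; if only one side differs
    from the base, take that side; otherwise record a conflict.
    """
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:      # both sides agree (same change, or no change)
            merged.append(o)
        elif o == b:    # only "theirs" changed this line
            merged.append(t)
        elif t == b:    # only "ours" changed this line
            merged.append(o)
        else:           # both changed the same line differently
            merged.append(f"<<<<<<< ours\n{o}\n=======\n{t}\n>>>>>>> theirs")
            conflicts.append(number)
    return merged, conflicts

base = ["alpha", "beta", "gamma"]
ours = ["ALPHA", "beta", "gamma"]    # we edited line 1
theirs = ["alpha", "beta", "GAMMA"]  # they edited line 3
print(merge_lines(base, ours, theirs))

theirs_clash = ["Alpha!", "beta", "gamma"]  # they also edited line 1
print(merge_lines(base, ours, theirs_clash))
```

The first call merges without conflicts; the second reports line 1 as conflicting, which is the situation where Git stops and asks you to resolve the file by hand.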
One workaround is to use git pull --rebase, which replays your local commits on top of the fetched remote branch and moves HEAD to the new tip (conflicts on the same lines still have to be resolved by hand). This is also shorthand for
git fetch
git rebase origin/your-branch
But note that these fixes are handled locally and cannot be applied at the level of the host. So the best practice is always:
"pull" before "push"
GitHub Issues are a powerful tool for tracking tasks, bugs, enhancements, and feature requests within a project. Integrated directly into GitHub repositories, they help teams stay organized, prioritize work, and collaborate more effectively. Each issue can be labeled, assigned, commented on, and linked to code changes, making it easier to manage a project's development lifecycle.
Key Benefits of GitHub Issues:
✅ Task Management: Clearly define and track to-dos, bugs, and features in one place.
🏷️ Organization & Prioritization: Use labels, milestones, and assignees to categorize and schedule work.
🔄 Integration with Code: Link issues to commits, pull requests, and branches for seamless traceability.
💬 Team Collaboration: Enable discussions, feedback, and updates in a centralized, transparent space.
Milestones in GitHub provide a way to group issues and pull requests under a common goal or deadline—like a version release, sprint, or project phase. They help visualize progress and ensure that work is aligned with broader project objectives.
When you assign an issue to a milestone, it becomes part of that goal’s progress tracker. GitHub automatically updates the milestone’s completion percentage based on the number of open vs. closed issues, giving you a clear view of what's left to do.
Using Milestones to Track Issues:
📅 Define Goals: Create milestones for key project phases (e.g., v1.0, Sprint 3, "MVP").
🔗 Link Issues: Assign related issues or PRs to a milestone to group them together.
📊 Track Progress: View how many issues are completed vs. remaining at a glance.
🚦 Plan & Prioritize: Use milestones to focus your team’s efforts and stay on schedule.
GitHub Projects offer a Kanban-style board or spreadsheet-like interface to organize and manage work across issues, pull requests, and milestones. They help teams break down goals into manageable tasks while keeping everything tied to the source code.
When you close an issue that's part of a milestone and also added to a GitHub Project, it updates both progress bars—providing real-time insight into how close you are to completing a milestone or project.
Why Closing Issues Helps Track Progress:
✅ Progress Automation: Closed issues auto-update milestone and project progress bars.
📦 Milestone Tracking: Each closed issue brings the milestone closer to completion, making it a visible indicator of progress.
📋 Project Board Sync: Items in GitHub Projects can auto-move (e.g., from “In Progress” to “Done”) when the issue is closed.
📈 Reporting & Focus: Helps managers and contributors instantly see what's done, what’s pending, and if the team is on track.
GitHub Wikis are a powerful way to maintain shared, living documentation right alongside your codebase. Designed for collaboration, a wiki is stored in a separate Git repository, meaning it can be cloned, edited, and version-controlled just like your main project—but without cluttering your core code.
Wikis are written in simple Markdown, making them easy to read and write, even for non-developers. Plus, they support LaTeX-style math syntax, which is perfect for technical documentation involving formulas or equations.
Why GitHub Wikis Are Great for Shared Docs:
🧑🤝🧑 Collaborative & Centralized: Keep team knowledge, guides, and notes in one accessible place.
🧾 Clean, Easy Formatting: Uses Markdown for fast, readable formatting—with support for headings, lists, code blocks, etc.
🧮 Math-Friendly: Supports LaTeX-style expressions for math-heavy projects (e.g., $E=mc^2$ renders beautifully).
🗂️ Standalone Git Repo: Clone or edit the wiki independently using git clone https://github.com/user/repo.wiki.git.
README.md is the front door to your project: it's the first thing most people see when they visit your GitHub repository. Written in Markdown, it gives users and collaborators a quick overview of what the project is about, how to use it, and how to contribute.
A good README helps others pick up the project quickly, reducing onboarding time and confusion. It should include all essential information someone needs to get started or decide if they want to contribute.
Why a Good README.md Matters:
🚀 Quick Onboarding: Gives users an instant understanding of the project’s purpose and usage.
🛠️ Setup Instructions: Helps new contributors get the environment up and running without digging through code.
📖 Essential Documentation: Acts as a one-stop reference for key info—features, dependencies, usage examples, etc.
🤝 Collaboration Ready: Should list contribution guidelines, licensing, contact info, and any project-specific conventions.
GitHub Pages lets you turn your GitHub repository into a fully hosted website, directly from your code—no external server or hosting service needed. It’s perfect for project documentation, personal portfolios, blogs, or landing pages.
You can create GitHub Pages using plain HTML, Markdown, or Jekyll (a static site generator), and they’re served right from your repo, either from a special gh-pages branch or your docs/ folder.
Why GitHub Pages Are Useful:
🌐 Instant Websites: Host project documentation or personal pages with just a few clicks.
🧾 Great for Docs: Turn your README, wiki, or Markdown files into clean, readable websites.
🔧 Customizable: Supports Jekyll themes, custom CSS, and even custom domains.
🚫 No Extra Hosting Needed: Free and seamlessly integrated with your GitHub repo
HDF5 is a powerful, flexible file format designed to store and organize large, complex datasets. Used widely in science, engineering, and machine learning, it supports high-performance I/O, hierarchical structures, and cross-platform compatibility—making it ideal for big data applications.
Hierarchical Storage 🌲 – Organize data in a file like folders and files (groups and datasets).
Supports Large Data 🗃️ – Handles terabytes of data efficiently.
Self-Describing Format 🧠 – Data is stored with metadata, so it’s easy to understand and parse.
Cross-Platform 🖥️📱 – Works consistently across operating systems and programming languages.
Compression Support 🗜️ – Built-in compression saves disk space.
Partial I/O 📦 – Load just the parts of the dataset you need—great for big files.
Multi-language Support 🧪 – Compatible with Python (via h5py), C/C++, Fortran, Julia, MATLAB, and more.
This is a strongly recommended, industry-standard way to store your data.
Since you can attach attributes to the data itself, the file largely documents itself. You can also automate this: whenever a file is created, it stores metadata such as the machine, number of cores, date/time, compiler version, list of packages used, and even the code itself, which makes results far easier to reproduce. The sky is the limit!
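As a rough illustration of the automation idea, the Python sketch below gathers some run metadata using only the standard library. The function name `collect_run_metadata` and the chosen keys are assumptions made for this example. Writing the entries into an HDF5 file is not shown, since it needs an external library; with h5py one would typically copy each entry into the `attrs` mapping of a file, group, or dataset.

```python
import os
import platform
import sys
from datetime import datetime, timezone

def collect_run_metadata() -> dict:
    """Gather basic provenance information using only the standard library."""
    return {
        "machine": platform.node(),                  # host name
        "system": platform.platform(),               # OS and version string
        "cpu_count": os.cpu_count() or 0,            # 0 if it cannot be determined
        "python_version": platform.python_version(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "command": " ".join(sys.argv),               # how the script was invoked
    }

metadata = collect_run_metadata()
for key, value in metadata.items():
    print(f"{key}: {value}")
```

All values are plain strings or integers, so they can be stored as simple attributes without further conversion.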
Here's a tutorial to get you started:
Artwork by: