Getting Started with Bioinformatics on Tufts HPC

Part I: Introduction to the Linux/Unix Command Line

Author: Yucheng Zhang, yucheng.zhang@tufts.edu

Date: 10/02/2025

What is Linux?

Linux is a free, open-source, and Unix-like operating system kernel that was originally created by Linus Torvalds in 1991. Over time, Linux has grown into a full-fledged operating system used worldwide across various types of devices, from servers and desktop computers to smartphones and embedded systems.

GNU/Linux

by Richard Stallman
Many computer users run a modified version of the GNU system every day, without realizing it. Through a peculiar turn of events, the version of GNU which is widely used today is often called “Linux,” and many of its users are not aware that it is basically the GNU system, developed by the GNU Project.

Ubuntu: A user-friendly distribution popular for desktop and server use, based on Debian.

Fedora: A cutting-edge distribution often used by developers and those who want the latest features.

Debian: Known for its stability and extensive software repositories, often used in server environments.

CentOS/AlmaLinux/Rocky Linux: Enterprise-grade distributions derived from Red Hat Enterprise Linux (RHEL).

Arch Linux: A rolling release distribution known for its simplicity and customization, aimed at advanced users.

Kali Linux: A distribution designed for penetration testing and security research.

Our clusters’ OS

OS upgrade

Feature | Red Hat Enterprise Linux (RHEL) | Rocky Linux
Origin | Developed and maintained by Red Hat (IBM-owned) | Community-driven rebuild
License | Commercial, subscription required for updates & support | Free and open-source, no subscription required
Support | Paid enterprise support from Red Hat | Community support
Security | Certified security patches | Security patches synced from RHEL sources
Ecosystem | Widely certified by software/hardware vendors | Not officially certified, but works with same ecosystem
Cost | Paid subscription | Free; optional paid support available

Files and File System

Everything is a file

A file is an addressable location that contains some data which can take many forms.

  • Text data
  • Binary/Image data

Files have associated metadata

  • Owner
  • Group
  • Timestamps
  • Permission:
    1. Read r
    2. Write w
    3. Execute x
    4. No permission -
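
The metadata listed above can be inspected with a long listing. The following is a minimal sketch run in a temporary directory; the file name sample.txt is hypothetical.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create an empty file and give it a known permission pattern.
touch sample.txt
chmod u=rw,g=r,o= sample.txt

# Long listing: permissions, link count, owner, group, size, timestamp, name.
ls -l sample.txt

# The first 10 characters are the file type followed by three rwx triplets
# (user, group, other); "-" means that permission is not granted.
perms=$(ls -l sample.txt | cut -c1-10)
echo "$perms"    # -rw-r-----
```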

File organization

Everything is mounted to the root directory

Files are referred to by their location, called a path.


Must-Know Linux/Unix Tools

man: manual pages

When working on Linux, you don’t need to Google every command: the manual pages (man pages) are built right into the system. The man command shows documentation for most Linux commands and tools.

Usage
man <command>
Example
man ls

File and Directory Management

Linux provides powerful tools for managing files and file systems. Here we will introduce a few essential commands.

pwd: print the current working directory
Usage
$ pwd
/cluster/home/yzhang85
$ cd /cluster/tufts/rt/yzhang85/
$ pwd
/cluster/tufts/rt/yzhang85
cd: change directory
Usage
cd [directory]

If a directory is not supplied as an argument, it will default to your home directory.

$ pwd
/cluster/tufts/rt/yzhang85
$ cd ..
$ pwd
/cluster/tufts/rt
$ cd
$ pwd
/cluster/home/yzhang85
Shortcuts
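A few path shortcuts are commonly used with cd. The sketch below tries them in a small directory tree created in a temporary location; the directory names are hypothetical.

```shell
# Build a small hypothetical directory tree in a temporary location.
base=$(mktemp -d)
mkdir -p "$base/project/data"

cd "$base/project/data"
cd ..               # ".." is the parent directory
parent=$(pwd)       # now in .../project

cd ./data           # "." is the current directory
cd - > /dev/null    # "-" returns to the previous directory (.../project)
previous=$(pwd)

cd ~                # "~" is your home directory (same as plain "cd")
pwd
```

Note that cd - prints the directory it switches to, which is why its output is discarded here.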
ls: list the files in a given directory
Usage
ls [options] [directory]
Common options:
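As an illustration, here are some ls options that are used often, tried on a few hypothetical files in a temporary directory. This is a sketch, not a complete list; man ls documents the rest.

```shell
# Work in a temporary directory with a few hypothetical files.
workdir=$(mktemp -d)
cd "$workdir"
touch reads.fastq .hidden_config
mkdir results

ls        # names only; hidden files (names starting with ".") are omitted
ls -a     # also show hidden files, plus "." and ".."
ls -l     # long format: permissions, owner, group, size, timestamp
ls -lh    # long format with human-readable sizes (K, M, G)
ls -lt    # long format sorted by modification time, newest first

visible=$(ls | wc -l)     # 2 entries: reads.fastq, results
all=$(ls -a | wc -l)      # 5 entries: ., .., .hidden_config, reads.fastq, results
```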
chmod: manage file permissions
Symbolic Notation
Examples
$ chmod g+w filename ## Give the group write permission
$ chmod u+x filename ## Give user execute permission
$ chmod a+r filename ## Give all users read access
$ chmod u=rw,g=r,o=r filename ## Give user read and write permission, group and other only read permission.
Recursively updating permissions with -R

To apply permissions recursively to all files and subdirectories within a directory, use the -R option:

$ chmod -R g+rx /path/to/directory
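
To see the effect of symbolic notation, the following sketch changes the permissions of a throwaway file and prints the permission string after each step. The file name script.sh is hypothetical.

```shell
# Demonstrate symbolic chmod on a throwaway file.
workdir=$(mktemp -d)
cd "$workdir"
touch script.sh

# Symbolic notation is WHO OPERATOR PERMISSION:
#   who:        u (user/owner), g (group), o (others), a (all)
#   operator:   + (add), - (remove), = (set exactly)
#   permission: r (read), w (write), x (execute)
chmod u=rw,g=r,o=r script.sh
ls -l script.sh | cut -c1-10     # -rw-r--r--

chmod u+x script.sh              # add execute for the owner
chmod o-r script.sh              # remove read for others
after=$(ls -l script.sh | cut -c1-10)
echo "$after"                    # -rwxr-----
```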
touch: create new files or update timestamps

touch is used to create new files or to update the timestamps (access and modification times) of existing files.

Create new file
$ touch newfile.txt
Update timestamps of existing files
$ touch existingfile.txt
mkdir: create new directory
Usage
mkdir [options] dir_name
Common option
$ mkdir -p rnaseq/output

This creates the output folder, as well as its parent folder rnaseq if it doesn’t already exist.

mv: move a file/directory to a new location or rename it
Usage
mv [options] source destination
Common option
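A short sketch of options often used with mv, run in a temporary directory. The file and directory names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "draft" > draft.txt
echo "notes" > notes.txt
mkdir archive

mv draft.txt final.txt               # rename within the same directory
mv final.txt archive/                # move into an existing directory
mv -n notes.txt archive/final.txt    # -n: never overwrite an existing file
mv -v notes.txt archive/             # -v: report what was moved
```

In an interactive session, mv -i asks for confirmation before overwriting; it is left out here because it waits for keyboard input.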
cp: copy a file/directory
Usage
cp [options] source destination
Common option
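A short sketch of options often used with cp, run in a temporary directory. The file and directory names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p rnaseq/output
echo "sample1" > rnaseq/samples.txt

cp rnaseq/samples.txt samples_backup.txt   # copy a single file
cp -r rnaseq rnaseq_copy                   # -r: copy a directory and its contents
cp -p rnaseq/samples.txt preserved.txt     # -p: keep timestamps and permissions
```

Without -r, cp refuses to copy a directory.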
rm: remove files/directories
Usage
rm [options] file/directory
Common option
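A short sketch of options often used with rm, run in a temporary directory so that only throwaway files are deleted. The names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p scratch/tmp
touch scratch/tmp/a.log scratch/b.log single.txt

rm single.txt      # remove one file
rm -f missing.txt  # -f: do not complain if the file does not exist
rm -r scratch      # -r: remove a directory and everything inside it
```

rm does not move files to a trash folder, so double-check the path before using -r. In an interactive session, rm -i asks for confirmation before each removal.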

Storage management

ncdu: disk usage analyzer

When your storage space starts running low on an HPC or Linux system, it’s important to figure out which files and folders are using the most space.

ncdu stands for NCurses Disk Usage, and it provides an interactive, text-based interface for exploring disk usage.

Usage
ncdu [directory]
Example
$ ncdu ~
$ ncdu /cluster/tufts/mylab

df: check disk space

When working on Linux (especially on shared HPC systems), it’s important to know how much disk space is available on different filesystems. The df command (disk free) shows this information.

Usage
$ df -h /cluster/tufts/mylab
$ df -h /cluster/tufts/yzhang85
Filesystem Size Used Avail Use% Mounted on
10.246.194.77:/projects/yzhang85 1.1T 961G 64G 94% /cluster/tufts/yzhang85

Text processing

Linux command-line tools are invaluable for bioinformatics text processing due to their efficiency and flexibility. They allow for rapid manipulation and analysis of large biological datasets, such as DNA sequences, protein structures, and gene expression data. Commands like grep, sed, awk, and cut are essential for filtering, extracting, and reformatting text-based biological information.

cat: concatenate files (join their contents)
Usage
cat [options] file1 file2 …
Common option
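One option that is often useful is -n, which numbers the output lines. A minimal sketch with two hypothetical files:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'ACGT\nTTGA\n' > file1
printf 'GGCC\n' > file2

cat file1 file2       # print both files, one after the other
cat -n file1 file2    # -n: number every output line
combined=$(cat file1 file2 | wc -l)   # 3 lines in total
```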
head/tail: display the beginning/end of a file
Usage
head/tail [options] file
Common option
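The option used most often with both commands is -n, which sets how many lines to show. A minimal sketch on a hypothetical 20-line file:

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical 20-line file: line1 ... line20
i=1
while [ "$i" -le 20 ]; do echo "line$i"; i=$((i + 1)); done > numbers.txt

head numbers.txt                  # first 10 lines by default
tail numbers.txt                  # last 10 lines by default
first=$(head -n 3 numbers.txt)    # -n N: first N lines
last=$(tail -n 1 numbers.txt)     # -n N: last N lines
tail -n +19 numbers.txt           # -n +N: from line N to the end
```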
less/more: view the content of a file page by page
Usage
$ less largefile.txt
$ more largefile.txt
grep: extract lines matching (or not matching) a pattern
Usage
grep [options] PATTERN file
Common option
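A sketch of frequently used grep options, applied to a tiny hypothetical FASTA file created in a temporary directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"
# A tiny hypothetical FASTA file with three records.
printf '>seq1 human\nACGTACGT\n>seq2 mouse\nTTGACCA\n>seq3 Human\nGGGCCC\n' > seqs.fasta

grep '>' seqs.fasta           # header lines
grep -v '>' seqs.fasta        # -v: lines NOT matching (the sequences)
grep -i 'human' seqs.fasta    # -i: ignore case
grep -n 'mouse' seqs.fasta    # -n: prefix each match with its line number
n_seqs=$(grep -c '>' seqs.fasta)           # -c: count matching lines
n_human=$(grep -c -i 'human' seqs.fasta)
```

Always quote the pattern. An unquoted > would be interpreted by the shell as output redirection and overwrite the file.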
sed: Stream editor for modifying file content

sed (short for stream editor) is a standard Unix text-processing tool that parses and transforms text in files or streams. It is commonly used for basic text manipulations such as search and replace, inserting and deleting lines, and applying regular expressions to text data.

Substitution (Search and Replace)

Replace the first occurrence of old with new in each line:

sed 's/old/new/' filename.txt

Replace all occurrences of old with new in each line:

sed 's/old/new/g' filename.txt
In-place substitution
sed -i 's/old/new/g' filename.txt

Warning: Use this command with caution, as it directly modifies the original file. To keep a backup of the original (saved as filename.txt.bak), use -i.bak:

sed -i.bak 's/old/new/g' filename.txt
Delete lines
sed '/pattern/d' filename.txt
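
The sed commands above can be compared side by side on a small file. This is a minimal sketch; regions.txt and its contents are hypothetical, and none of these commands modify the file because -i is not used.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'chr1 chr1 100\n# comment\nchr2 chr2 200\n' > regions.txt

sed 's/chr/Chr/' regions.txt     # replaces only the first match on each line
sed 's/chr/Chr/g' regions.txt    # replaces every match on each line
sed '/^#/d' regions.txt          # deletes lines that start with "#"

first_only=$(sed 's/chr/Chr/' regions.txt | head -n 1)    # Chr1 chr1 100
all_matches=$(sed 's/chr/Chr/g' regions.txt | head -n 1)  # Chr1 Chr1 100
kept=$(sed '/^#/d' regions.txt | wc -l)                   # 2 lines remain
```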

Data Compression and Archiving

When working with files on Linux, it is common to compress them to save space and to bundle multiple files into a single archive. The commands gzip, gunzip, and tar are the essential tools for file compression and archiving.

tar: Archive multiple files into one or extract them.

tar is used to create, extract, and manipulate archive files. By itself, tar does not compress files; it only archives them by combining multiple files and directories into a single file, which usually has a .tar extension. To compress the archive, tar can be combined with a compression utility such as gzip or bzip2.

Create a .tar archive without compression
tar -cvf archive.tar my_folder
Extract a tar file
tar -xvf archive.tar
Creating a compressed archive (.tar.gz)
tar -cvzf archive.tar.gz my_folder
Extracting a compressed archive (.tar.gz)
tar -xvzf archive.tar.gz
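
As a round-trip check, the sketch below archives a small folder, lists the archive, and extracts it into a separate directory. It also shows gzip and gunzip on a single file. The folder and file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir my_folder
echo "ACGT" > my_folder/a.txt
echo "TTGA" > my_folder/b.txt

tar -czf archive.tar.gz my_folder     # c: create, z: gzip, f: archive file name
tar -tzf archive.tar.gz               # t: list contents without extracting

mkdir restored
tar -xzf archive.tar.gz -C restored   # x: extract, -C: into this directory

# gzip and gunzip work on single files and replace them in place.
gzip my_folder/a.txt                  # produces my_folder/a.txt.gz
gunzip my_folder/a.txt.gz             # restores my_folder/a.txt
```

Listing with -t before extracting is a cheap way to see where the files will land.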

Other useful tools

Environment variables
Define variables
VARIABLE=value ## No space around =
Variable reference
$VARIABLE ## echo $VARIABLE
Commonly used environment variables
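The sketch below defines a variable, exports it so that child processes can see it, and prints a few variables that are usually set at login. PROJECT is a hypothetical name, and the values printed for the login variables depend on your system.

```shell
# Define a shell variable (no spaces around "=") and reference it with "$".
PROJECT=rnaseq
echo "Project: $PROJECT"
echo "Output dir: ${PROJECT}_output"   # braces mark where the name ends

# "export" makes the variable visible to programs started from this shell.
export PROJECT
child_sees=$(sh -c 'echo "$PROJECT"')

# A few variables that are usually set for you at login:
echo "HOME:  $HOME"     # your home directory
echo "PWD:   $PWD"      # current working directory
echo "PATH:  $PATH"     # directories searched for commands
echo "SHELL: $SHELL"    # your login shell
echo "USER:  $USER"     # your username
```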
Redirection: >, >>, <
$ cat file1 file2 > files
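The three redirection operators behave differently: one overwrites, one appends, and one reads from a file. A minimal sketch with a hypothetical log.txt in a temporary directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "first"  > log.txt       # ">" creates the file or overwrites it
echo "second" > log.txt       # log.txt now holds only "second"
echo "third" >> log.txt       # ">>" appends to the end of the file
n_lines=$(wc -l < log.txt)    # "<" feeds the file to the command's stdin
cat log.txt
```

Because > silently overwrites, prefer >> when adding to a file you want to keep.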
Pipe: |

Pipes in Linux are a powerful feature that allows you to connect the output of one command directly as the input to another command. This is a key concept in Unix/Linux philosophy, which promotes the use of small, modular tools that can be combined to perform complex tasks.

A pipe is represented by the | symbol. When you place a pipe between two commands, the standard output (stdout) of the command on the left of the pipe becomes the standard input (stdin) for the command on the right.

Usage
command1 | command2
Example
$ sort file.txt | uniq
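Pipes can be chained. The sketch below extends the example to count how often each value occurs; the contents of file.txt are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'chr2\nchr1\nchr2\nchr3\nchr2\nchr1\n' > file.txt

sort file.txt | uniq                  # distinct values
sort file.txt | uniq -c | sort -nr    # count each value, most frequent first
n_unique=$(sort file.txt | uniq | wc -l)
top=$(sort file.txt | uniq -c | sort -nr | head -n 1 | awk '{print $2}')
echo "$top"                           # chr2
```

uniq only collapses adjacent duplicate lines, which is why the input is sorted first.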
Wildcards: selecting multiple files/directories based on patterns
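A sketch of the most common wildcard patterns, tried on a few hypothetical files in a temporary directory. The shell expands each pattern into the matching file names before the command runs.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch sample1.fastq sample2.fastq sample10.fastq sampleA.fastq notes.txt

ls *.fastq               # "*" matches any number of characters (including none)
ls sample?.fastq         # "?" matches exactly one character
ls sample[0-9].fastq     # "[...]" matches one character from the set
ls sample[!0-9].fastq    # "[!...]" matches one character NOT in the set

n_fastq=$(ls *.fastq | wc -l)              # 4 files
n_single=$(ls sample?.fastq | wc -l)       # 3 files: sample1, sample2, sampleA
n_digit=$(ls sample[0-9].fastq | wc -l)    # 2 files: sample1, sample2
```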

Bandit Wargame: learning Linux commands by playing games

If you’d like extra practice with the Linux command line beyond today’s workshop, I recommend trying the Bandit wargame from OverTheWire.