Running ColabFold

Some instructions for running ColabFold on epyc. I presented some of these instructions and a general overview of AlphaFold2 and ColabFold during an SBDR seminar (slides).

These are basic instructions for connecting to epyc and running a ColabFold job from the command line. It is also possible to connect with Microsoft's VSCode if you want a more comfortable editing experience.

Basic command line instructions

Note

These instructions assume you have a remote shell open on epyc, which is outfitted with two NVIDIA A100 80 GB GPUs.

Activate the colabfold conda environment

colabfold_batch is the command line tool you will be using. It is installed into a preconfigured conda Python environment named colabfold. If your default shell is configured properly you should be able to activate the colabfold conda environment with this command:

conda activate colabfold

For most users with the default bash shell, the above command should just work. If you are using tcsh, or if you have modified your shell configuration files in the past, you may get a warning that conda can't be found. This means your shell is not yet configured to use conda. You can try initializing conda with this command:

/usr/local/anaconda3/bin/conda init

Then log out and log back in to ensure the changes are applied to your shell.

You can view the available conda environments:

conda env list

and activate the colabfold environment.

conda activate colabfold

If this worked your shell prompt should look something like this with the name of the active conda environment in parentheses at the beginning of your prompt:

(colabfold) [16:58]username@epyc:~$

Attention

If you are struggling to get the colabfold conda environment activated, or if you run into other problems, please contact Scott.

Create a working directory

You will want to keep your ColabFold data organized, so make a top-level directory:

mkdir colabfold_data

and make a dedicated directory for your protein/system of interest.

cd colabfold_data
mkdir my_prot
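The two steps above can also be done in one go; `mkdir -p` creates the parent directory if it does not already exist:

```shell
# Create the data directory and the per-project directory in one step,
# then move into the project directory
mkdir -p colabfold_data/my_prot
cd colabfold_data/my_prot
```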

Create your fasta sequence file

This is quite simple if you have a single chain. For example, create a file named my_prot.fasta (you can of course name it whatever you want):

An example of a fasta file
>1RDR_1|Chain A|POLIOVIRUS 3D POLYMERASE|Human poliovirus 1 (12081)
GEIQWMRPSKEVGYPIINAPSKTKLEPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINTEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTKVEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRWTKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPIGRALLLPEYSTLYRRWLDSF

To fold a single chain this is all you will need in your my_prot directory.
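If you prefer to create the file from the command line, a minimal sketch looks like this (the header name and the short sequence here are placeholders; substitute your real sequence):

```shell
# Write a minimal single-chain fasta file
cat > my_prot.fasta <<'EOF'
>my_prot
GEIQWMRPSKEVGYPIINAPSKTKLEPSAFHYVFEGVKEP
EOF

# Sanity check: the file should contain exactly one header line
grep -c '^>' my_prot.fasta
```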

Run ColabFold on a Monomer

There are many options available when running colabfold_batch which you can see with the --help flag.

colabfold_batch --help

If you just want to use the default settings it’s as simple as:

colabfold_batch my_prot.fasta output_dir

This will read your fasta sequence, calculate an MSA using MMseqs2, perform AlphaFold2 inference, and write all results to the output_dir directory.

If you want to use Amber to relax the models produced by AF2, and to use the A100 GPUs to make relaxation even faster, provide the --amber and --use-gpu-relax command line options:

Warning

With the latest colabfold_batch the --amber and --use-gpu-relax command line options are not working.

colabfold_batch --amber --use-gpu-relax --model-type auto my_prot.fasta output_dir

Run ColabFold on a Multimer

Under the hood, ColabFold uses the inference models from AlphaFold2 to predict a 3D structure from your sequence. Four model types are available: alphafold2_ptm, alphafold2_multimer_v1, alphafold2_multimer_v2, and alphafold2_multimer_v3. The default is auto, which uses alphafold2_ptm for monomers and alphafold2_multimer_v3 for complexes.

If you are predicting a multimer, there are some gotchas when preparing the fasta file. Talk to me if you run into errors. Essentially, you need to create your fasta file like this, with a : after each chain except the last:

An example of a multimer.fasta file to predict a homo hexamer.
>1BJP_homohexamer
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR
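Rather than pasting the chain six times by hand, a fasta file like the one above can be generated with a short shell loop (a sketch, using the header and sequence from the example above):

```shell
# Build a homo-hexamer fasta: six copies of the chain,
# with ':' after every chain except the last
SEQ=PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR
{
  echo ">1BJP_homohexamer"
  for i in 1 2 3 4 5; do
    echo "${SEQ}:"
  done
  echo "$SEQ"
} > multimer.fasta
```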

And then fire off your colabfold_batch job:

colabfold_batch --amber --use-gpu-relax --model-type alphafold2_multimer_v3 multimer.fasta output_dir_for_multimer
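If you have several systems to fold, colabfold_batch can also take a directory of fasta files as input (check colabfold_batch --help for your installed version). A sketch, using a hypothetical batch_inputs directory and the example sequence above:

```shell
# Prepare a directory with one fasta file per system
mkdir -p batch_inputs
for name in protA protB; do
  printf '>%s\nPIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR\n' \
    "$name" > "batch_inputs/${name}.fasta"
done

# Then run the whole directory as one job:
#   colabfold_batch batch_inputs batch_outputs
```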

Monitoring the GPU status

You can use gpustat to see the status of our two A100s which should output something like this:

(colabfold) [17:14]username@epyc:~$gpustat
epyc Thu Jul 20 17:26:13 2023  535.54.03
[0] NVIDIA A100 80GB PCIe | 35'C,   0 % |  1007 / 81920 MB | gdm(63M) gdm(47M)
[1] NVIDIA A100 80GB PCIe | 35'C,   0 % |   874 / 81920 MB |

By default colabfold_batch uses GPU 0, so if multiple jobs pile up on the first GPU while the second one (1) sits idle, capacity is wasted. You can choose which GPU a job uses by setting the CUDA_VISIBLE_DEVICES environment variable in your shell just before submitting the job.

export CUDA_VISIBLE_DEVICES=1

This would make the second GPU the target for jobs launched from that shell.
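You can also scope the variable to a single command instead of exporting it for the whole shell, and wrap long runs in nohup so they keep going after you log out. A sketch (the first line is a stand-in that just prints the value to demonstrate inline assignment; the commented lines show a real invocation):

```shell
# Inline assignment sets the variable for one command only,
# demonstrated here with a stand-in that prints the value:
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "GPU selected: $CUDA_VISIBLE_DEVICES"'

# A real invocation, kept alive after logout with nohup:
#   CUDA_VISIBLE_DEVICES=1 nohup colabfold_batch my_prot.fasta output_dir \
#     > colabfold.log 2>&1 &
```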

Note

0 = first GPU, 1 = second GPU

Using Microsoft Visual Studio Code

The benefit of using VSCode is that you have a nice environment for editing files (rather than using vim in a terminal).

I’ll write these instructions up later.