Cartesian_ddG计算蛋白复合物单点突变自由能

　　本文记录使用 Cartesian_ddG 计算蛋白-配体复合物的单点突变自由能的过程，软件版本为 rosetta_bin_linux_2020.28.61328_bundle。

1.蛋白-配体复合物预处理

1.1 分离蛋白和配体分子

根据实际情况进行分离，一般可使用如下指令进行分离。

grep ^ATOM complex.pdb > pro.pdb
grep ^HETATM complex.pdb > lig.pdb

　　若上面指令不能达到所需效果，另行修改即可，本文在提取分离时所用的指令如下，只分离出 A 链和 A 链所对应的配体。

$ROSETTA/main/tools/protein_tools/scripts/clean_pdb.py pro.pdb A
cat 3v4i.pdb | grep AZT | awk '($4=="A")' > lig.pdb

1.2 配体分子加氢

　　使用 babel 对配体分子加氢，并转化为下一步所需的 mol2 格式文件，下方指令根据实际情况选择其一即可。

babel -h lig.pdb lig.mol2
obabel lig.pdb -O lig.mol2 -h

1.3 生成小分子拓扑参数文件

使用 lig.mol2 生成 Rosetta 所需要小分子拓扑参数文件。

$ROSETTA/main/source/scripts/python/public/molfile_to_params.py -n LIG --extra_torsion_output lig.mol2

1.3 生成新的复合物

cat pro.pdb LIG_0001.pdb > complex.pdb

2.Relax预处理

　　进行两步 relax 对初始结构进行优化，根据 nstruct 设置的个数判断时长，数目越多所需时间越长。

2.1 第一次relax

relax.static.linuxgccrelease \

  -s complex.pdb \

  -extra_res_fa LIG.params \

  -relax:constrain_relax_to_start_coords \

  -ramp_constraints false \

  -relax:coord_constrain_sidechains \

  -nstruct 10 \

  -ex1 \

  -ex2 \

  -use_input_sc \

  -flip_HNQ \

  -no_optH false

将 score.sc 文件中 total_score 列最小的 pdb 作为下一步 relax 的输入。

# 将第一次的能量最低的pdb文件名存储在relax.log中
cat score.sc | tail -n +3 | awk 'BEGIN {min = 1000} {if ($2 < min) {min=$2 ;pdb=$23} } END {print "min_total_score_pdb: "pdb".pdb ("min")"}' > relax.log

# 将第一次的能量最低的pdb文件名存储在变量lowest_energy中，以供下步使用
lowest_energy=`cat score.sc | tail -n +3 | awk 'BEGIN {min = 1000} {if ($2 < min) {min=$2 ;pdb=$23} } END {print pdb".pdb"}'`

2.1 第二次relax

relax.static.linuxgccrelease \

  -s ${lowest_energy} \

  -extra_res_fa LIG.params \

  -score:extra_improper_file LIG.tors \

  -relax:constrain_relax_to_start_coords \

  -ramp_constraints true \

  -relax:coord_constrain_sidechains \

  -nstruct 10 \

  -ex1 \

  -ex2 \

  -use_input_sc \

  -flip_HNQ \

  -no_optH false \

  -relax:cartesian \

  -score:weights ref2015_cart \

  -crystal_refine

继续将 score.sc 文件中 total_score 列最小的 pdb 作为下一步计算 ∆∆G 的输入。

# 将第二次的能量最低的pdb文件名追加在relax.log中
cat score.sc | tail -n +3 | awk 'BEGIN {min = 1000} {if ($2 < min) {min=$2 ;pdb=$24} } END {print "min_total_score_pdb: "pdb".pdb ("min")"}' >> relax.log

# 将第二次的能量最低的pdb文件名存储在变量lowest_energy中，以供下步使用
lowest_energy=`cat score.sc | tail -n +3 | awk 'BEGIN {min = 1000} {if ($2 < min) {min=$2 ;pdb=$24} } END {print pdb".pdb"}'`

注意：两步 relax 产生的结构打分信息均存储于 score.sc 中，但是结构名所在的列数是不同的。

3.计算蛋白-配体复合物∆∆G

3.1 准备mutfile文件

　　官方给出的 mutfile 文件示例及说明如下，比较详细。在此稍对第二行的 2 进行说明，此处表示当前轮的突变有 2 个，也就说计算出的 ∆∆G 是这 2 个突变叠加后的结果，其他以此类推。

total 3 #this is the total number of mutations being made.
2 # the number of mutations made
G 1 A # the wild-type aa, the residue number, and the mutant aa
W 6 Y # the wild-type aa, the residue number, and the mutant aa
1 # the number of mutations
F 10 Y # the wild-type aa, the residue number, and the mutant aa

本次准备四个 mutfile 文件，其中一个 S163T.mutfile 内容如下。

total 1
1
S 163 T

将所有 mutfile 路径放入 mutfile.list 文件中。

ls mutfiles/*.mutfile > mutfile.list

3.2 运行cartesian_ddg

　　使用参考资料2 中的 run_mutations_ddg.py 脚本生成每个突变的 cartesian_ddg 指令，并存储在 job.list 中。

run_mutations_ddg.py \

  ${lowest_energy} mutfile.list \

  [cartesian_ddg.static.linuxgccrelease] \

  [-extra_res_fa LIG.params] \

  [-score:extra_improper_file LIG.tors] > job.list

并行运行 job.list 中的脚本。

mut_number=`cat job.list | wc -l`
cat job.list | parallel -j ${mut_number}

3.3 结果处理

　　使用参考资料2 中的 parse_data_ddg.py 处理 cartesian_ddg 的结果，并存储在 results.txt 中。

# 将所有突变的ddg文件路径存放在list中
ls *ddg > list

# 统计每个突变的结果，并格式化输出
parse_data_ddg.py | awk 'BEGIN {print "Mutation\tΔG_fold\tΔG_bound\tΔΔG\tG_bound\tG_unbound"}{print $1,$2,$3,$4,$5,$6}' OFS='\t' > results.txt

　　最终结果如下，其中第四列数据的 ΔΔG 就是我们所需要的信息，蛋白与小分子之间的结合能变化。

Mutation	ΔG_fold	ΔG_bound	ΔΔG	G_bound	G_unbound
I136L	0.62	0.63	0.00	-2631.221	-2630.595
I179V	0.50	0.49	-0.02	-2637.268	-2636.78
S163T	-2.68	-2.76	-0.08	-2634.173	-2636.934
K174A	0.11	0.10	-0.01	-2632.656	-2632.557

4.脚本修改说明

　　主要使用参考资料2中 run_mutations_ddg.py 和 parse_data_ddg.py 这两个脚本，很不幸的是本次测试时均报错，我的 python 版本为 3.7，下面记录对应修改的地方。

① run_mutations_ddg.py 脚本只需将 19 行修改为 print(cmd)，即 print 后加上括号，当然也可以根据需求对 18 行的 cartesian_ddg 中参数进行修改。
② parse_data_ddg.py 脚本中除了将 39 行 print 后加上括号外，还需要分别将 22 行的BEFORE_JUMP 和 27 行的 AFTER_JUMP 分别改为 COMPLEX 和 OPT_APART 即可。

parse_data_ddg.py 进行上述修改是基于参考资料4 和该脚本中的注释信息来确定的。
脚本中可获得如下对应信息：
WT – mono WT – AFTER_JUMP
MUT – mono MUT – AFTER_JUMP
B_WT – bind WT –BEFORE_JUMP
B_MUT – bind MUT – BEFORE_JUMP

参考资料4 中可获得如下信息：
ddG_bind = avrg(COMPLEX MUT totalscore) - avrg(COMPLEX WT totalscore)
ddG_mono = avrg(OPT_APART MUT totalscore) - avrg(OPT_APART WT totalscore)

由此可知，BEFORE_JUMP 对应 COMPLEX，AFTER_JUMP 对应 OPT_APART。

5.相关报错处理

　　第三步的 cartesian_ddg 中出现如下报错信息，这是由于 mutfile 文件中记录的野生型碱基同输入蛋白的结构中对应位置的碱基不一致造成的。

ERROR: Assertion `pose.residue(resnum).name1() == wt` failed.

官方链接 (https://www.rosettacommons.org/node/3443) 中给出了如下解释和解决方案。

Your test.pdb has zero occupancies for chain A residue 1 atoms. By default, Rosetta discards atoms with zero occupancies. The loaded pose will therefore not have a chain A residue 1 atom. This will cause the resfile reader to throw an error when you try to specify that residue, and other portions of the code will have issues if there are sequence mismatches.
There’s two ways of fixing it. The first is to edit the PDB file to place a value of 1.00 in the occupacy column. The second is to pass the flag “-ignore_zero_occupancy false” to the commandline, telling Rosetta that it shouldn’t ignore atoms with zero occupancy.

6.整体流程总结

　　整体流程可以分为两部分。第一部分为文件准备，包括蛋白-配体复合物预处理和准备 mutfile 文件；第二部分为 Relax 预处理和运行cartesian_ddg。其中前一部分较为灵活需要修改，而第二部分只需根据需求修改相应参数即可。

参考资料：
1.Cartesian_ddG: 更快更准确的单点突变自由能预测方法
 2.Computer Aided and Genetically Encoded PROXimal decaging (CAGE-Prox)
3.Calculate Protein Protein ΔΔG
4.cartesian_ddg application
5.Rosetta研习社
 6.Rosetta Tutorials, Demos, and Protocol Captures
7.Getting started
8.Preparing Ligands
9.Ligand Docking
10.How to prepare structures for use in Rosetta
11.ddg_monomer application