Blogpost #2: Weeks 3 and 4: Because Modeller programmers are using it to make their coffee
During the last two weeks, I spent a lot of time learning how to use a piece of software named Modeller. We use Modeller to model protein structures by homology. That simply means that if you have two protein sequences from the same family and you know the structure of one of them, you can try to guess the structure of the other. The technique is based on the hypothesis that residues at the same position in the aligned sequences will also be at the same place in the structure.
In our case, we take the multiple alignment of a lot of pLGIC sequences and extract two sequences from it: one from a protein whose structure we know (there are very few of them) and one whose structure we don’t know. Then we improve this pairwise alignment, but not by rerunning an alignment algorithm, because we would then lose the information from the multiple alignment. Indeed, in the multiple alignment, the residues that end up aligned are more likely to be the ones conserved across all the sequences, whereas with only two sequences you don’t have access to this information. That matters because some parts of the sequences are highly variable, and so probably less important for the molecule: aligning the residues of those parts correctly is not critical, whereas aligning the residues of the conserved parts correctly is very important. We do two things: we remove the positions where both sequences have a gap, and we prefer aligning two different residues over having two residues each facing a gap. For example, seq1 = ‘CT-F’ and seq2 = ‘C-WF’ become seq1 = ‘CTF’ and seq2 = ‘CWF’. I am also currently investigating the impact of this improvement of the pairwise alignment on the final structure.
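A minimal sketch of these two clean-up steps in Python could look like the block below. It is only an illustration, not the actual code used for this project: the function names are made up, and it assumes that "two residues each facing a gap" means two directly neighbouring columns, handled in a single left-to-right pass.

```python
GAP = "-"


def drop_shared_gaps(seq1, seq2):
    """Remove the columns where both aligned sequences have a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    kept = [(a, b) for a, b in zip(seq1, seq2) if not (a == GAP and b == GAP)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def merge_staggered_gaps(seq1, seq2):
    """Merge a residue/gap column with a neighbouring gap/residue column.

    Assumption: only directly adjacent columns are merged, in one pass.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    out1, out2 = [], []
    i, n = 0, len(seq1)
    while i < n:
        a, b = seq1[i], seq2[i]
        if i + 1 < n:
            c, d = seq1[i + 1], seq2[i + 1]
            # residue over gap, followed by gap over residue
            if a != GAP and b == GAP and c == GAP and d != GAP:
                out1.append(a)
                out2.append(d)
                i += 2
                continue
            # gap over residue, followed by residue over gap
            if a == GAP and b != GAP and c != GAP and d == GAP:
                out1.append(c)
                out2.append(b)
                i += 2
                continue
        out1.append(a)
        out2.append(b)
        i += 1
    return "".join(out1), "".join(out2)


def simplify_pair_alignment(seq1, seq2):
    """Apply both steps: drop shared gaps, then merge staggered gaps."""
    return merge_staggered_gaps(*drop_shared_gaps(seq1, seq2))


print(simplify_pair_alignment("CT-F", "C-WF"))  # ('CTF', 'CWF')
```

Dropping the shared gaps first matters: once the all-gap columns are gone, staggered gaps that were separated by them become neighbours and can be merged.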
The problem (or the good point, depending on your point of view) is that Modeller can do more than just homology modeling. It can also search databases for proteins whose sequence is similar to the one you are trying to build a structure for, it can align the sequence of the most similar protein with yours, and so on. So if you do everything with Modeller, all is fine. But if you only want to do the modeling, it’s slightly more difficult, because the input file Modeller wants is very specific. So I spent time writing a Python script that takes as input a FASTA file containing the alignment of two proteins and outputs a file in the format supported by Modeller.
This picture shows the format that Modeller wants: “>P1;” signals a new sequence, and the second line gives information about whether it’s a sequence we want the structure of or a protein whose structure we know. Then comes the actual sequence, with ‘-’ for gaps in the alignment and a ‘*’ to mark the end of the sequence.
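To give an idea of what such a conversion involves, here is a small, self-contained Python sketch. It is not the actual script mentioned above: the function names are hypothetical, and it assumes the second line of each record has ten colon-separated fields, starting with the record type (`sequence` or `structureX`) and the protein code. The residue range and chain fields are left blank here and would need to be filled in for a real template structure.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # keep only the first word of the header as the record name
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def pir_record(code, sequence, kind="sequence",
               first_res="", first_chain="", last_res="", last_chain="",
               width=75):
    """Format one record: '>P1;code', a 10-field line, the sequence, then '*'."""
    fields = [kind, code, first_res, first_chain, last_res, last_chain,
              "", "", "", ""]
    body = sequence + "*"
    wrapped = [body[i:i + width] for i in range(0, len(body), width)]
    return ">P1;{}\n{}\n{}\n".format(code, ":".join(fields), "\n".join(wrapped))


def fasta_pair_to_pir(fasta_text, template_name):
    """Convert a two-sequence FASTA alignment into PIR text.

    The record named `template_name` is written as a known structure,
    the other one as the sequence to model.
    """
    records = parse_fasta(fasta_text)
    if len(records) != 2:
        raise ValueError("expected exactly two aligned sequences")
    parts = []
    for name, seq in records:
        kind = "structureX" if name == template_name else "sequence"
        parts.append(pir_record(name, seq, kind))
    return "\n".join(parts)


if __name__ == "__main__":
    example = ">target\nCTF\n>4HFI\nCWF\n"  # toy alignment, not real data
    print(fasta_pair_to_pir(example, template_name="4HFI"))
```

Keeping the parsing and the formatting in separate functions makes it easy to check each half on a toy alignment before running it on real files.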
I’ve also coded a script that applies this modeling procedure to each sequence in a big alignment of pLGICs, producing 10 possible structures for each sequence and keeping the best one. We produce 10 structures each time because the modeling process is not exact: it produces possible structures, and launching it several times leads to several different structures. So a good way to reduce that variation is to model multiple times (here 10, because more would take way too much time to compute) and keep the best one. Modeller can compute a score for each structure produced and thus determine the best one. Today, I finished the code that launches the previous one (build 10 models and keep the best one) for all the sequences of a pLGIC database. I launched it this afternoon; hopefully it won’t crash during the weekend and I will have the results on Monday. After that, I will do structural analysis of the structures I just produced; for example, I will compute the diameter of the pore at different places in the protein.
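The "build 10 models and keep the best one" idea can be sketched like this. Several things here are assumptions rather than a description of the actual script: the DOPE score is used as the quality score (lower is better), the class names `Environ` and `AutoModel` are the ones from recent Modeller releases (older releases use lowercase names), and `model_all` with its `alnfile_for` argument is a hypothetical helper. Modeller is only imported inside `build_models`, so the selection logic can be tried without it installed.

```python
def best_model(outputs, key="DOPE score"):
    """Pick the successfully built model with the lowest score.

    `outputs` is a list of dicts shaped like the entries Modeller stores
    after a run: each has 'name', 'failure' and the requested scores.
    """
    ok = [m for m in outputs if m.get("failure") is None]
    if not ok:
        raise RuntimeError("no model was built successfully")
    return min(ok, key=lambda m: m[key])


def build_models(alnfile, known, target, n_models=10):
    """Build n_models candidate structures for `target` (requires Modeller)."""
    from modeller import Environ
    from modeller.automodel import AutoModel, assess

    env = Environ()
    a = AutoModel(env, alnfile=alnfile, knowns=known, sequence=target,
                  assess_methods=(assess.DOPE,))
    a.starting_model = 1
    a.ending_model = n_models
    a.make()
    return list(a.outputs)


def model_all(targets, alnfile_for, known, n_models=10):
    """Model every target; one failure should not stop the whole batch.

    `alnfile_for` is a hypothetical callable mapping a target name to the
    path of its alignment file.
    """
    results = {}
    for target in targets:
        try:
            outputs = build_models(alnfile_for(target), known, target, n_models)
            results[target] = best_model(outputs)["name"]
        except Exception as exc:  # keep going over the weekend
            print("modeling failed for {}: {}".format(target, exc))
            results[target] = None
    return results
```

Catching errors per target is a design choice for long unattended runs: a single problematic sequence then costs one missing result instead of the whole batch.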
In this picture, you can see a hypothetical structure for the first sequence in the database (the best model among the 10 generated), which was computed while I was finishing this post. In the top part of the molecule you can see the general structure of a pLGIC, and in the bottom part, long chains that make the molecule look like an octopus. This is the intracellular domain of the protein. We don’t know much about it, and the known structure we used to compute this model didn’t have one (you can see that structure in the very last picture), so the shape of that intracellular domain does not give us any biological information.
A representation of the 4HFI structure of the GLIC protein, a bacterial pLGIC
Again, if you have any questions or comments, do not hesitate to post them below 🙂