Getting chemical structure information into Python

How do you get protein database (PDB) information into Python?

Have you ever tried to get PDB coordinate data into Python to either manipulate the atoms or perform chemical calculations? There are a few methods of importing. There is the ProDy (http://prody.csb.pitt.edu/) project. This project parses PDB files for use within the ProDy system so using the data outside of ProDy requires some effort. Another is BioPython (https://biopython.org/wiki/Reading_large_PDB_files). Again this parser works with the BioPython suite of programs and requires programming effort to use the data outside of BioPython.

The project I was working on required just a straightforward parsing of a PDB file putting the data into Python lists. Reading the current specifications for PDB files shows the data are structured precisely. That is there are specific data in specific columns in a text file. This easily lends itself to a FORTRAN format. Luckily there is a Python FORTRAN formatter library: fortranformat (https://bitbucket.org/brendanarnold/py-fortranformat). This is a straightforward application of FORTRAN formats.

Typically Python code:

Import fortranformat
# FormatStr = ‘typical format such as I3, 1F2, A10’ For PDB files it would be:
FormatStr = ‘A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,A2,A2’
Line = FortranRecordReader(FormatStr)

The manipulation of the data is straightforward and in this case the data is imported using the following code:


def DataIn(fn):
 Atom_line=ff.FortranRecordReader('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,A2,A2')
    Hetatm_line=ff.FortranRecordReader('A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,A2,A2')
     n = -1
     j = 0
     f = open(fn,'r')
     line = f.readline()
     while line != "":
         if line[:4] == 'ATOM':
             n = n + 1
             VarIn =Atom_line.read(line)
             Ser_no.append(VarIn[1])
             Type.append(VarIn[2])
             Alt.append(VarIn[3])
             Residue.append(VarIn[4])
             Chain.append(VarIn[5])
             Res_seq.append(VarIn[6] + j)
             X.append(VarIn[8])
             Y.append(VarIn[9])
             Z.append(VarIn[10])
             Occ.append(VarIn[11])
             TempF.append(VarIn[12])
 #            Segment.append(VarIn[13])
             Element.append(VarIn[13])
             Charge.append(VarIn[14])
             ID.append(n)
             line = f.readline()
         elif line[:6] == 'HETATM':
             n = n + 1
             VarIn =Hetatm_line.read(line)
             Ser_no.append(VarIn[1])
             Type.append(VarIn[2])
             Alt.append(VarIn[3])
             Residue.append(VarIn[4])
             Chain.append(VarIn[5])
             Res_seq.append(VarIn[6] + j)
             X.append(VarIn[8])
             Y.append(VarIn[9])
             Z.append(VarIn[10])
             Occ.append(VarIn[11])
             TempF.append(VarIn[12])
 #            Segment.append(VarIn[13])
             Element.append(VarIn[13])
             Charge.append(VarIn[14])
             ID.append(n)
             line = f.readline()
         elif line[:3] == 'TER':
             j = max(Res_seq)
             line = f.readline()
         else:
             Header.append(line)
             line = f.readline()
     f.close()                                                                                                   
     return

After the data are read into you list just manipulate them as you would any other list. The line is read into the VarIn list with the columns as the PDB columns. Those are then appended to the specific lists needed for your use. The example above is for the “ATOM” or “HETATM” codes of the PDB file. Others could be read in as well.

With the information parsed in this fashion it is straightforward to write the information as a PDB file for use in any program that can read PDB data files.