Python training exercise 7
Contents
Introduction
More often than not the data you need for your program will come from somewhere else - either from user input or a file. Especially for more complex data it becomes essential to be able to read in data files, do something with the data, and write out a new file with modified information or a set of analysis results.
Exercises
Reading a file
To read in a file you have to create a file handle. This is a sort of live connection to the file on disk that you can use to pull data from it. You create a connection to a file by using the open() command.
Before we do this, download [this fake PDB coordinate file for a 5 residue peptide and save it in the directory you are working in. Then create the new program below in the same directory - Python has to know where the file is in order to access it.
# Open the file fileHandle = open("TestFile.pdb") # Read all the lines in the file (as separated by a newline character), and store them in the lines list # Each element in this list corresponds to one line of the file! lines = fileHandle.readlines() # Close the file fileHandle.close() # Print number of lines in the file print(len(lines)) # Loop over the lines, and do some basic string manipulations for line in lines: line = line.strip() # Remove starting and trailing spaces/tabs/newlines # Only do something if it's not an empty line if line: cols = line.split() # Split the line by white spaces; depending on the format this could be commas, ... # Now you can do many other things with the data in the file...
If all is well, the file has 263 lines.
Building on this example, use a for: loop to check all the lines in the file, print out only the lines that contain the string ' C ' (note the use of spaces!), and count them. [click on show more for answer) |
---|
# Open the file fileHandle = open("TestFile.pdb") # Read all the lines in the file (as separated by a newline character), and store them in the lines list # Each element in this list corresponds to one line of the file! lines = fileHandle.readlines() # Close the file fileHandle.close() # Initialise the line counter lineCount = 0 # Loop over the lines for line in lines: if line.count(" C "): # Alternatively, use "if ' C ' in line:" print(line, end='') # Using the 'end' argument in the print because the line already contains a newline at the end # otherwise will get double spacing. lineCount += 1 print("Number of lines with ' C ': {}".format(lineCount)) You should find 75 lines - note that in this case, for those who know the PDB format a bit, you're finding all carbon atoms. |
Writing a file
Writing a file is very similar, except that you have to let Python know you are writing this time. Try this:
# Open the file for writing - by using the extra 'w' argument. # Be careful - if the file exists already it will be overwritten without warning! fileHandle = open("testFile.txt",'w') # Write a header line for the data we will be writing. Don't forget the newline at the end!!! fileHandle.write("LineNumber Value Divided_five Rest_divided_five\n") # Create some data to write out myData = list(range(50,100)) myDivider = 5 for dataIndex in range(len(myData)): myNumber = myData[dataIndex] divided = myNumber / myDivider restDivided = myNumber % myDivider fileHandle.write("{:6d} {:5d} {:5d} {:5d}\n".format(dataIndex+1,myNumber,divided,restDivided)) # Close the file fileHandle.close() # Print number of lines in the file print(len(lines))
The file is written to the directory you're executing the program in - have a look!
Read in the file from the previous example, and write out all lines that contain 'VAL' to a new file. [click on show more for answer) |
---|
# Open the file fileHandle = open("TestFile.pdb") # Read all the lines in the file (as separated by a newline character), and store them in the lines list # Each element in this list corresponds to one line of the file! lines = fileHandle.readlines() # Close the file fileHandle.close() # Track the lines with VAL linesToWrite = [] # Loop over the lines for line in lines: if line.count("VAL"): # Alternatively, use "if ' C ' in line:" linesToWrite.append(line) # Write out the lines fileHandle = open("fileWithVAL.pdb",'w') for line in linesToWrite: fileHandle.write(line) fileHandle.close() |
Advanced file reading and interpretation exercise
Read in the TestFile.pdb atom coordinate file, print out the title of the file, and find all atoms that have coordinates closer than 2 angstrom to the (x,y,z) coordinate (-8.7,-7.7,4.7). Print out the model number, residue number, atom name and atom serial for each; the model is indicated by:
MODEL 1
lines, the atom coordinate information is in:
ATOM 1 N ASP A 1 -10.341 -9.922 9.398 1.00 0.00 N
lines, where column 1 is always ATOM, column 2 is the atom serial, the column 3 the atom name, column 4 the residue name, column 5 the chain code, column 6 the residue number, followed by the x, y and z coordinates in angstrom in columns 7, 8 and 9.
Note that the distance between two coordinates is calculated as the square root of (x1-x2)²+(y1-y2)²+(z1-z2)².
Click on show for an answer |
---|
# Open the file fileHandle = open("TestFile.pdb") # Read all the lines in the file (as separated by a newline character), and store them in the lines list # Each element in this list corresponds to one line of the file! lines = fileHandle.readlines() # Close the file fileHandle.close() # Initialise some information searchCoordinate = (-8.7,-7.7,4.7) modelNumber = None # Loop over the lines, and do some basic string manipulations for line in lines: line = line.strip() # Remove starting and trailing spaces/tabs/newlines # Only do something if it's not an empty line if line: cols = line.split() # Split the line by white spaces; depending on the format this could be commas, ... # Print off the title if cols[0] == 'TITLE': title = line.replace(cols[0],'') title = title.strip() print("The title is '{}'".format(title)) # Track the model number elif cols[0] == 'MODEL': modelNumber = int(cols[1]) # For atom lines, calculate the distance elif cols[0] == 'ATOM': # Set some clear variable names and convert to the right type atomSerial = int(cols[1]) atomName = cols[2] residueNumber = int(cols[5]) x = float(cols[6]) y = float(cols[7]) z = float(cols[8]) # Calculate the distance distance = ((x - searchCoordinate[0]) ** 2 + (y - searchCoordinate[1]) ** 2 + (z - searchCoordinate[2]) ** 2 ) ** 0.5 if distance < 2.0: print("Model {}, residue {}, atom {} (serial {}) is {:.2f} away from reference.".format(modelNumber,residueNumber,atomName,atomSerial,distance)) |