Summer Research Fellowship Programme of India's Science Academies

Python based tool for processing, analyzing, motif finding in FASTA sequence files

Nandita Kalita

Tezpur University, Napaam, Sonitpur, Assam 784028

Dr. Dinesh Gupta

ICGEB, Aruna Asaf Ali Marg, Jawaharlal Nehru University, New Delhi, Delhi 110067

Abstract

FASTA files are flat files with minimal metadata, used to represent nucleotide and protein sequences. The format is popular because of its simplicity, and a single file can hold multiple sequences. Sequences are written as nucleotide bases or amino acids in single-letter codes. Each sequence in a FASTA file is preceded by a title, a one-line description of the nucleotide or amino acid sequence that starts with a greater-than symbol (>). In this project, we developed a tool to search for the location(s) of a motif or subsequence in the nucleotide or protein sequences of a FASTA-formatted file. The tool also displays the occurrences of user-defined motifs along the sequences as a graphical image. The tool consists of three modules. In the first module, a user uploads a multi-sequence FASTA file and a query expression, given as a string or a regular expression. The tool returns a list of the locations of the matched subsequence along with the titles of the sequences in which it occurs. In the second module, the user enters a protein ID and a query expression, again as a string or a regular expression. The tool retrieves the FASTA file for the given protein ID from the UniProt database (www.uniprot.org) and returns the location of the matched subsequence along with the title of the sequence and a graphical representation of the matches overlaid on the sequence span. In the third module, a user likewise supplies a protein or nucleotide accession ID and a query expression as a string or a regular expression, but here the tool retrieves the FASTA file for the given ID from the NCBI database and returns the location of the matched subsequence along with the title of the sequence. Since Python makes it easy to parse and manipulate FASTA files, we used Django, a Python web framework, for the backend. Since the NCBI database does not allow direct retrieval of the FASTA file, we resorted to web scraping with the Beautiful Soup package for Python and Selenium. The tool has been further developed to identify transcription factor binding sites in given DNA sequences.
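The core search described for the first module can be illustrated with a minimal sketch using only the Python standard library. This is not the tool's actual code: the function names `parse_fasta` and `find_motif`, the case-insensitive matching, and the 1-based inclusive coordinates are all assumptions made for illustration.

```python
import re
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted text.

    Hypothetical helper: a title line starts with '>' and the sequence
    lines that follow it are joined into one string.
    """
    title = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def find_motif(fasta_text: str, pattern: str) -> List[Tuple[str, int, int, str]]:
    """Return (title, start, end, matched_text) for every match.

    Positions are 1-based and inclusive (an assumption; the tool may
    report coordinates differently). re.finditer reports only
    non-overlapping matches.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for title, sequence in parse_fasta(fasta_text):
        for match in regex.finditer(sequence):
            hits.append((title, match.start() + 1, match.end(), match.group()))
    return hits


# Made-up two-record example; the motif in seq1 spans a line break.
EXAMPLE = ">seq1 demo\nATGCGATATAAGG\nCCTATAAA\n>seq2 demo\nGGGCCC\n"

for hit in find_motif(EXAMPLE, "TATA[AT]A"):
    print(hit)
```

A plain string query works with the same function as long as it contains no regular-expression metacharacters; otherwise it could be passed through `re.escape` first.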

Keywords: FASTA file, protein ID, nucleotide accession ID, UniProt database, NCBI database, transcription factor binding sites
