The Central Dogma of Molecular Biology (DNA → RNA → Protein); Nucleotides, Codons, and the Genetic Code as an information encoding scheme; Biological sequences as discrete strings over a finite alphabet (Σ = {A, T, G, C}); FASTA and FASTQ file formats as data structures for sequence storage; Introduction to biological databases: NCBI, UniProt, and the PDB as large-scale structured repositories; Querying and retrieving biological records programmatically.
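Illustrative sketch for this unit: treating a FASTA file as a mapping from record identifiers to strings over the DNA alphabet. The function names `parse_fasta` and `transcribe`, and the two example records, are hypothetical and chosen only for illustration; real work would more likely use an established parser.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first whitespace-delimited token as the ID.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            chunks[current] = []
        elif current is not None:
            # Sequence lines may be wrapped; collect them and join later.
            chunks[current].append(line.upper())
    return {name: "".join(parts) for name, parts in chunks.items()}


def transcribe(dna: str) -> str:
    """Map a DNA coding-strand string to its RNA equivalent (T becomes U)."""
    return dna.replace("T", "U")


# Hypothetical two-record FASTA text, made up for this example.
EXAMPLE = """>seq1 hypothetical example record
ATGGCC
TTTTAA
>seq2
gggccc
"""

records = parse_fasta(EXAMPLE.splitlines())
print(records)
print(transcribe(records["seq1"]))
```

The same function accepts an open file handle, since iterating over a text file yields lines.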
The biological motivation for sequence comparison: finding functional and evolutionary similarity; Pairwise alignment: global alignment (Needleman-Wunsch) and local alignment (Smith-Waterman) as dynamic programming problems; Scoring matrices: PAM and BLOSUM as probabilistic substitution models; Gap penalties and their effect on alignment quality; Heuristic alignment with BLAST: seed-and-extend strategy, E-values, and statistical significance; Applications in identifying homologous genes and annotating unknown sequences.
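Illustrative sketch for this unit: the Needleman-Wunsch recurrence for the optimal global alignment score. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions made for the example; in practice a PAM or BLOSUM matrix and an affine gap model would usually replace them. Only the score is computed, not the traceback.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Return the optimal global alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Boundary: aligning a prefix against the empty string costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Smith-Waterman differs mainly in clamping each cell at zero and taking the maximum over the whole table instead of the bottom-right cell.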
The genome sequencing problem: reads, coverage, and the assembly challenge; Overlap-Layout-Consensus (OLC) paradigm for long-read assembly; De Bruijn graphs as the foundational data structure for short-read assembly: k-mers, nodes, and edges; Eulerian path formulation of the assembly problem; Challenges: repeats, sequencing errors, and heterozygosity; Introduction to reference genomes and read mapping as an alternative to de novo assembly.
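Illustrative sketch for this unit: building a De Bruijn graph from reads and recovering a sequence by walking an Eulerian path (Hierholzer's algorithm). It assumes error-free reads, a non-empty graph that actually contains an Eulerian path, and that each distinct k-mer contributes exactly one edge; the reads and function names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def de_bruijn(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # assumption: collapse duplicate k-mers
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_path(graph: Dict[str, List[str]]) -> List[str]:
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg: Dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(graph) | set(in_deg)
    # Start where out-degree exceeds in-degree by one, else anywhere (a cycle).
    start = next(
        (n for n in sorted(nodes)
         if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit node
    return path[::-1]


def spell(path: List[str]) -> str:
    """Concatenate the first node with the last character of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGC", "GGCGT", "CGTA"]  # hypothetical overlapping reads
graph = de_bruijn(reads, k=3)
assembled = spell(eulerian_path(graph))
print(assembled)
```

With repeats longer than k-1, the graph admits several Eulerian paths, which is one concrete form of the repeat challenge listed above.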
The gene prediction problem: identifying coding regions within a raw genome sequence; Hidden Markov Models (HMMs) as the canonical tool for gene structure modeling: states for exons, introns, and intergenic regions; The Viterbi algorithm for decoding the most likely gene structure; Motif finding: representing regulatory signals (promoters, binding sites) as Position Weight Matrices (PWMs); Searching for motifs using information-theoretic scoring; Introduction to sequence logos as a visualization of motif conservation.
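Illustrative sketch for this unit: Viterbi decoding on a toy two-state HMM. The states ("N" for a non-coding, AT-rich region and "E" for an exon-like, GC-rich region) and all probabilities are invented for the example and are far simpler than a real gene-structure model.

```python
import math

# Toy model; every number here is an assumption made for illustration.
STATES = ("N", "E")
START = {"N": 0.5, "E": 0.5}
TRANS = {"N": {"N": 0.9, "E": 0.1},
         "E": {"N": 0.1, "E": 0.9}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "E": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}


def viterbi(seq: str) -> str:
    """Return the most likely state path for seq, computed in log space."""
    if not seq:
        return ""
    # best[t][s] = log-probability of the best path ending in state s at t.
    best = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]])
             for s in STATES}]
    back = []  # back[t][s] = predecessor of s on that best path
    for symbol in seq[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (best[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][symbol]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAGGGG"))
```

Working in log space avoids numerical underflow on genome-length inputs, where products of many small probabilities would round to zero.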
Phylogenetic trees as data structures representing evolutionary history: leaves, internal nodes, branch lengths; Distance-based tree construction: UPGMA and Neighbor-Joining algorithms; Character-based methods: Maximum Parsimony as a combinatorial optimization problem; Bootstrapping as a statistical validation technique for tree topology; Multiple Sequence Alignment (MSA) with ClustalW and MUSCLE as a prerequisite for phylogenetics; Biological insight: tracing the origin of pathogens and studying gene family evolution.
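Illustrative sketch for this unit: UPGMA clustering on a small distance matrix, returning the tree topology as nested tuples. Branch lengths are omitted for brevity, and the taxon names and distances are made up for the example; in practice the distances would come from a multiple sequence alignment.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def upgma(names: List[str], dist: Dict[FrozenSet, float]):
    """Return the UPGMA topology as nested tuples.

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical pairwise distances among four taxa.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
tree = upgma(names, dist)
print(tree)
```

UPGMA assumes a constant rate of evolution across lineages; Neighbor-Joining relaxes that assumption by correcting each pairwise distance for the average divergence of the two taxa.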