CS 234: “Computational Methods for the Analysis of
Biomolecular Data”
Homework 3
Problem 1: Computing products of probabilities (50 points)
A typical subroutine in several of the methods discussed in class require the computation of the
product of the probabilities of the symbols in a string. Assume that we are dealing with a Bernoulli
source, and therefore that symbols are independently generated. Suppose we are given a text x of
size n over the alphabet Σ = {A, T,C,G} and that we know pA, pT , pC , pG in advance.
Q (20pts): Design an algorithm to compute
p(y) =
|y|∏
i=1
py[i]
in O(1) for any of the substrings y of the text x. A substring y is identified by a pair (i, j) where
i is the start of the substring, and j is the end of the substring. You are allowed to spend O(n)
time in a preprocessing phase, just once. After that, your algorithm should be able to return p(y)
in constant time for any substring y in the text.
Q (10pts): How would you avoid cancellation problems induced by multiplying long sequences of
very small real numbers?
Q (20pts): Discuss how to extend your algorithm for a Markov model of order h > 0. What is
the time complexity of your algorithm? Is it possible to obtain the same time complexity as for
the Bernoulli case, that is O(n) preprocessing once and then constant time for any substring y of
the text x?
Problem 2: Maximum likelihood (30 points)
We have seen in class that the maximum likelihood estimate for the probabilities of the Bernoulli
model corresponds to the intuitive idea of computing the proportion of the occurrences in the text
for each symbol of the alphabet.
Formally, the ML estimate for a Bernoulli model is
pMLA =
f(A)
f(A) + f(C) + f(G) + f(T )
and similarly for pMLC , p
ML
G and p
ML
T . Recall we denote by f(y) the number of occurrence of y.
Q: Prove that P (x|pML) > P (x|p) for any other choice of p ̸= pML.
Hint: show that log(P (x|pML)/P (x|p)) > 0 and use the fact that the relative entropy is always
positive.
Problem 3: Profile HMM (20 points)
Build a profile HMM on the alignment below. Use one state for each column 1,2,3,7,8,9 and one
insert state for columns 4,5,6. You will need to use a total of 7 states.
123456789
1 ACAA–ACC
2 AGAAGTCTC
3 ACA—ATG
4 TCAC–ATC
Comments are closed.