Please refer to the HMMER documentation for
a complete explanation of how bit scores and E-values are calculated. The following is
excerpted directly from the HMMER 2.3.2 User's Guide:
Executive summary
- The best criterion of statistical significance is the E-value. The
E-value is calculated from the bit
score. It tells you how many false positives you would have expected
to see at or above this bit score.
Therefore a low E-value is best; an E-value of 0.1, for instance,
means that there's only a 10% chance
that you would've seen a hit this good in a search of nonhomologous
sequences. Typically, I trust
the results of HMMER searches at about E=0.1 and below, and I examine
the hits manually down to
E=10 or so.
- HMMER bit scores are a stricter criterion: they reflect whether the
sequence is a better match to the
profile model (positive score) or to the null model of nonhomologous
sequences (negative score). A
HMMER bit score above log2 of the number of sequences in the target
database is likely to be a true
homologue. For current NR databases, this rule-of-thumb number is on
the order of 20 bits. Whereas
the E-value measures how statistically significant the bit score is,
the bit score itself is telling you how
well the sequence matches your HMM. Because these things should be
strongly correlated, usually,
true homologues will have both a good bit score and a good E-value.
However, sometimes (and
these are the interesting cases), you will find remote homologues
which do not match the model well
(and so do not have good bit scores - possibly even negative), but
which nonetheless have significant
E-values, indicating that the bit score, though "bad", is still better
than you would've expected by
chance, so it is suggestive of homology.
- What does it mean when I have a negative bit score, but a good
E-value? The negative bit score
means that the sequence is not a good match to the model. The good
E-value means that it's still a better
score than you would've expected from a random sequence. The usual
interpretation is that the sequence
is homologous to the sequence family modeled by the HMM, but it's not
"within" the family - it's a
distant homologue of some sort. This happens most often with HMMs
built from "tight" families of
high sequence identity, aligned to remote homologues outside the
family. For example, an actin HMM
aligned to an actin-related protein will show this behavior - the bit
score says the sequence isn't an actin
(correct) but the E-value says it is significantly related to the
actin family (also correct).
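
The rules of thumb in the excerpt above can be sketched as a small triage
function. This is only an illustrative sketch, not part of HMMER: the function
names, the category labels, and the choice to treat the thresholds as hard
cutoffs are assumptions made here. The thresholds themselves (E=0.1, E=10, and
log2 of the number of target sequences) are taken from the excerpt.

```python
import math


def rule_of_thumb_bits(num_sequences: int) -> float:
    """Return log2 of the number of sequences in the target database.

    The excerpt suggests that a bit score above this value is likely
    to indicate a true homologue.
    """
    if num_sequences < 1:
        raise ValueError("num_sequences must be at least 1")
    return math.log2(num_sequences)


def triage_hit(bit_score: float, e_value: float, num_sequences: int,
               trust_e: float = 0.1, inspect_e: float = 10.0) -> str:
    """Classify one hit using the excerpt's rules of thumb.

    The labels returned here are hypothetical, chosen for illustration.
    """
    if e_value > inspect_e:
        return "not significant"
    if e_value > trust_e:
        return "inspect manually"
    # From here on the E-value is at or below the "trusted" threshold.
    if bit_score > rule_of_thumb_bits(num_sequences):
        return "likely homologue"
    if bit_score < 0:
        # Negative bit score but significant E-value: the case the
        # excerpt describes as a possible distant homologue.
        return "possible distant homologue"
    return "significant E-value, modest bit score"


if __name__ == "__main__":
    db_size = 2 ** 20  # hypothetical database of about a million sequences
    print(rule_of_thumb_bits(db_size))         # 20.0
    print(triage_hit(250.0, 1e-50, db_size))   # likely homologue
    print(triage_hit(-12.0, 0.003, db_size))   # possible distant homologue
    print(triage_hit(5.0, 3.0, db_size))       # inspect manually
```

With about a million target sequences the bit-score threshold comes out at 20
bits, matching the order of magnitude quoted for NR databases. The cutoffs are
guidelines for deciding which hits to look at, so borderline hits still
deserve manual inspection.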