Any alignment of biological sequences depends on the substitution matrix and gap penalties. The most common approach to construct a substitution matrix uses empirical frequencies of substitutions in large sets of multiple sequence alignments. However, in many applications, the "ground truth" multiple sequence alignments are either not available or require extensive curatorial work.
We present a general statistical framework and an algorithm for estimation of optimal substitution matrices and gap penalties from unaligned pairs of biological sequences with binary labels. The algorithm finds parameters that provide an optimal separation between the positive and negative pairs. We present some preliminary results about the statistical error of estimation.