Orthologs are genomic strings evolutionarily derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting since it indirectly informs us about protein divergence through evolution, and, in our particular study, has potentially important agricultural applications.
We describe how we arrived at particular fixed-size attribute vectors to represent genomic string data and how we frame ortholog detection as an analogy problem (the context will be explained in the talk). The attributes are based both on the standard string similarity measures from bioinformatics and on a large number of differential measures or metrics, many of them new to bioinformatics. Many of these differential metrics rest on evolutionary considerations, both theoretical and empirically observed, in some cases by the authors themselves.
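As a rough illustration of what a fixed-size attribute vector for a pair of sequences might look like, the sketch below combines one similarity measure with two differential measures. The specific attributes (k-mer Jaccard similarity, length difference, GC-content difference) and the function names are assumptions chosen for illustration only; they are not the attributes used in the study.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide string."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def kmer_counts(seq, k=3):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_jaccard(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two strings."""
    ka, kb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


def pair_attributes(a, b):
    """Hypothetical fixed-size attribute vector for a sequence pair."""
    return [
        kmer_jaccard(a, b),                   # similarity measure
        abs(len(a) - len(b)),                 # differential: length
        abs(gc_content(a) - gc_content(b)),   # differential: GC content
    ]
```

Whatever the lengths of the two input strings, the vector always has the same number of entries, which is what allows a decision-tree learner to consume it.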
Our study employed Quinlan's machine learning algorithm C5.0. It (1) fits a sequence of decision trees to known classification data by means of an information-theoretic heuristic, where each tree beyond the first concentrates on correcting the errors of its predecessor; and (2) employs so-called boosting to make the final classification decisions by a weighted majority vote of the trees, where trees with fewer errors receive higher voting weights. We applied C5.0 to sets of known ortholog classification data and then tested it on larger sets of such data.
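The two ingredients described above can be sketched in a few lines. This is not C5.0 itself: as a stand-in, it uses AdaBoost-style reweighting with one-split trees (stumps) chosen by weighted entropy, and it assumes labels are coded as +1 and -1. All function names are hypothetical.

```python
import math


def entropy(weight_by_label):
    """Shannon entropy (bits) of a weighted label distribution."""
    total = sum(weight_by_label.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total)
                for w in weight_by_label.values() if w > 0)


def fit_stump(X, y, w):
    """Pick the single split with the lowest weighted entropy."""
    best_score, best_rule = None, None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            sides = {True: {}, False: {}}
            for row, label, wi in zip(X, y, w):
                side = sides[row[f] <= t]
                side[label] = side.get(label, 0.0) + wi
            score = sum(sum(s.values()) * entropy(s) for s in sides.values())
            if best_score is None or score < best_score:
                # Each side predicts its weighted-majority label.
                left = max(sides[True], key=sides[True].get)
                right = (max(sides[False], key=sides[False].get)
                         if sides[False] else left)
                best_score, best_rule = score, (f, t, left, right)
    f, t, left, right = best_rule
    return lambda row: left if row[f] <= t else right


def boost(X, y, rounds=10):
    """Fit a sequence of stumps, reweighting toward earlier mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = sum(wi for row, label, wi in zip(X, y, w) if stump(row) != label)
        if err >= 0.5:          # no better than chance: stop
            break
        clipped = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - clipped) / clipped)  # fewer errors, larger vote
        ensemble.append((alpha, stump))
        if err == 0:            # nothing left to correct
            break
        # Misclassified cases gain weight, so the next stump focuses on them.
        w = [wi * math.exp(-alpha * label * stump(row))
             for row, label, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted majority vote over all stumps."""
    score = sum(alpha * stump(row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

For example, `boost([[0.1], [0.2], [0.8], [0.9]], [-1, -1, 1, 1])` returns an ensemble whose `predict` reproduces the training labels; in an ortholog setting, each row would be the attribute vector of a candidate sequence pair.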
The results we will report are encouraging for complete cDNA strings, and we will describe our hopes for future extensions of this study.
This is joint work with Ming Ouyang and Joan Burnside.