Whole-Genome Phylogeny by virtue of Unic Subwords

VERZOTTO D
2012-01-01

Abstract

Thanks to advances in modern sequencing technologies, a large number of complete genomes are now available. Traditional motif discovery tools cannot handle this massive amount of data, so complete genomes can be compared only with ad hoc methods. In this work we propose a distance function based on subword composition that extends the Average Common Subword (ACS) approach of Ulitsky et al. [17]. ACS is closely related to the cross entropy estimated between two entire genome sequences, and thus to a set of "independent" subwords, namely the irredundant common subwords. We then filter the irredundant common subwords by means of underlying-paired motifs, which relate regions of the two genome sequences to each other. By construction, the resulting set of motifs is linear in the size of the input and free of overlaps; we call the selected motifs underlying-paired irredundant common subwords, or simply "unic" subwords. Preliminary results support the validity of our method and suggest novel computational approaches for analyzing the evolution of genomes.
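To make the baseline concrete, the following is a minimal sketch of the ACS idea that the abstract builds on, not of the unic-subword method itself (the filtering by irredundant and underlying-paired motifs is not shown). It assumes the commonly cited ACS formulation: for each position of one sequence, take the length of the longest subword starting there that also occurs in the other sequence, average these lengths, and turn the average into a symmetric distance. The function names, the naive quadratic matching, and the exact form of the correction term are assumptions made for illustration; a practical implementation would use a suffix tree or suffix array.

```python
import math


def longest_match_lengths(a: str, b: str) -> list[int]:
    """For each position i of a, the length of the longest prefix of a[i:]
    that occurs somewhere in b (naive search, for illustration only)."""
    lengths = []
    for i in range(len(a)):
        k = 0
        # Extend the match one character at a time while it still occurs in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        lengths.append(k)
    return lengths


def acs(a: str, b: str) -> float:
    """Average common subword length of a with respect to b."""
    return sum(longest_match_lengths(a, b)) / len(a)


def acs_distance(a: str, b: str) -> float:
    """Symmetric ACS-style distance (assumed form: a directed term
    log|y| / ACS(x, y) minus a self-comparison correction, then averaged)."""

    def directed(x: str, y: str) -> float:
        avg = acs(x, y)
        if avg == 0:
            return math.inf  # no shared subwords at all
        return math.log(len(y)) / avg - 2 * math.log(len(x)) / len(x)

    return (directed(a, b) + directed(b, a)) / 2


if __name__ == "__main__":
    print(longest_match_lengths("ACGT", "CGTA"))
    print(acs_distance("ACGTACGT", "ACGTTCGA"))
```

Longer shared subwords raise the average and lower the distance, which is why the measure can be computed without aligning the two genomes.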
2012
9781467326216
Alignment-free sequence analysis
Phylogenomics
Motif analysis

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14252/1327