Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) form a genetic immune system of bacteria. CRISPR systems consist of two major parts:
Our work concentrates on finding CRISPR arrays.
Additional information from CRISPR-Cas++:
Many existing tools search bacterial genomes for CRISPR arrays.
However, viruses are underrepresented in reference databases and have high mutation rates. Additionally:
Raw metagenomic data offer the potential to increase the diversity of known CRISPR-Cas systems. Metagenomic data represent a snapshot of a habitat at a defined time point.
Goal adjustment: find CRISPR arrays in metagenomic data.
The schema above (on the left) illustrates two different approaches:
We propose a new method based on the de Bruijn graph.
Definitions:
Let us see what CRISPR arrays look like in a de Bruijn graph. Suppose we have the following CRISPR array sequence: AACTTAAACCGGAAC
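The toy sequence can be read as the repeat AAC separated by TTA and CGG. The sketch below builds a small de Bruijn graph from it and lists the cycles passing through the repeat. It rests on assumptions that are not taken from the poster: nodes are k-mers, an edge joins k-mers that are adjacent in the sequence, k = 3 is chosen only to fit the toy sequence, and the function names are illustrative.

```python
from collections import defaultdict

def build_de_bruijn(sequences, k):
    """Assumed convention: nodes are k-mers, edges join adjacent k-mers."""
    graph = defaultdict(set)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph

def cycles_through(graph, start, max_len):
    """Enumerate simple cycles that begin and end at `start` (bounded DFS)."""
    found = []

    def dfs(node, path):
        for nxt in sorted(graph.get(node, ())):
            if nxt == start:
                found.append(path[:])          # closed a cycle
            elif nxt not in path and len(path) < max_len:
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(start, [start])
    return found

graph = build_de_bruijn(["AACTTAAACCGGAAC"], k=3)   # k=3 is a toy value
cycles = cycles_through(graph, "AAC", max_len=10)
for cycle in cycles:
    print(" -> ".join(cycle + [cycle[0]]))
# Two cycles of 6 nodes each, both passing through the repeat k-mer AAC.
```

Under these assumptions each spacer contributes one cycle, and all cycles share the repeat node, which is the multicycle shape discussed here.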
From the above we see that CRISPR arrays form multicycles with the following parameters:
Find cycles of 46 to 94 nodes in a de Bruijn graph constructed from metagenomic datasets, with k-mer size 22.
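A length-bounded cycle search of this kind might be sketched as follows. Only the k-mer size (22) and the node bounds (46 to 94) come from the text. The synthetic array (a 30-nt repeat with three 35-nt spacers), the function names, the k-mers-as-nodes convention, and the plain depth-first enumeration are assumptions made for illustration, not the method used in this work. Exhaustive enumeration would likely be too slow on real metagenomic graphs.

```python
import random
from collections import defaultdict

K = 22                          # k-mer size stated in the text
MIN_NODES, MAX_NODES = 46, 94   # cycle-length bounds stated in the text

def build_graph(reads, k=K):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

def bounded_cycles(graph, start, lo=MIN_NODES, hi=MAX_NODES):
    """Simple cycles through `start` whose node count lies in [lo, hi]."""
    cycles, path, on_path = [], [start], {start}
    stack = [iter(sorted(graph.get(start, ())))]
    while stack:
        nxt = next(stack[-1], None)
        if nxt is None:                     # successors exhausted: backtrack
            stack.pop()
            on_path.discard(path.pop())
        elif nxt == start:
            if lo <= len(path) <= hi:       # keep only cycles within bounds
                cycles.append(list(path))
        elif nxt not in on_path and len(path) < hi:
            path.append(nxt)
            on_path.add(nxt)
            stack.append(iter(sorted(graph.get(nxt, ()))))
    return cycles

# Hypothetical test data: one synthetic array, repeat/spacer sizes are assumed.
rng = random.Random(0)

def rand_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))

repeat = rand_dna(30)
spacers = [rand_dna(35) for _ in range(3)]
array = repeat + "".join(s + repeat for s in spacers)

graph = build_graph([array])
cycles = bounded_cycles(graph, repeat[:K])
print(len(cycles), sorted({len(c) for c in cycles}))
```

In this toy setting every cycle has as many nodes as one repeat plus one spacer has bases, so the node bounds act as a filter on the repeat-plus-spacer period.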
The rough structure of the algorithm is as follows:
We applied our approach to genomes with known CRISPR systems. Here are the characteristics of the genomes: