Graph Algorithms for CRISPR

Graph Based Strategies to Finding CRISPR Arrays in Metagenomic Data

Fikrat Talibli & Björn Voss

Introduction

Clustered Regularly Interspaced Short Palindromic Repeats or CRISPR are a genetic immune system of bacteria . CRISPR systems consist of two major parts:

CRISPR arrays
CRISPR associated proteins - CAS genes.

The GOAL

Our work concentrates on finding CRISPR arrays.

Additional info from CRISPR-Cas++:

Repeats and Spacers: A CRISPR is a succession of 23-47bp sequences called repeats separated by unique sequences of a similar length (spacers).
A leader sequence: the CRISPR locus is AT-rich leader sequence of 100-350 bp, acting as a promoter for the pre-crRNA synthesis.
Together with a set of genes called cas for “CRISPR-associated”, they constitute an immune system.

Challenges of modern tools

There are many tools that are using bacterial genomes for finding CRISPR Arrays.

However, viruses are underrepresented in reference databases and have high mutation rates. Additionally:

Most organisms cannot be cultivated in labs
A lot of data is lost in assembly

Raw metagenomic data offer the potential to increase the diversity of known CRISPR-Cas systems. Metagenomic data represent a snapshot of habitat at a defined timepoint.

Goal adjustment: find CRISPR arrays in metagenomic data.

Chap3

Motivation

Schema above(on the left) demonstrates two different approaches:

We took a raw metagenomic dataset with 425 mil. Illumina reads;
We applied 2 different approaches

Assembly-first approach:

Assemble metagenome using e.g. MegaHIT
Use one of the tools for CRISPR analysis: CRT, CRISPRDetect

Analysis approach:

Read in raw metagenomic dataset
Use a tool for CRISPR analysis: CRASS

Venn diagram with predicted CRISPR systems using three tools: CRT,CRISPRDetect and CRASS .

CRISPRDetect predicted 188 CRISPR systems;
CRT predicted 72 CRISPR systems;
CRASS detected 290 CRISPR systems.

The overlap between these three tools is very small - only 5 CRISPR systems.

We propose the new method based on de Bruijn graph.

Chap4

Central idea

Definitions:

k-mers are slices of reads of length k at every index of a read sequence.
De Bruijn Graph is a directed graph, where Edges are k-mers and nodes are k-1-mers. E.g.: edges: {ACTG,CTGT} Nodes: {ACT,CTG}

Let us see how the CRISPR arrays look like on de Bruijn graph. Suppose we have following CRISPR array sequence: AACTTAAACCGGAAC

TEAM OTTO

Repeat: AAC

Spacers: TTA, CGG

CYCLES AND CRISPR ARRAYS

From above we see that CRISPR arrays are forming multicycles with following parameters:

k is less than minimum length of repeat
minimum length of repeat is 23 bp => k=22
length of each cycle:

Sum of spacer and repeat lengths
they vary from 46-94 nodes.

Computationally Formulated Goal

Find cycles limited to 46-94 nodes in de Bruijn graph constructed from metagenomic datasets, with k-mer size 22.

TEAM Wo Ist?

The algorithm

STRUCTURE OF THE ALGORITHM

The rough structure of the algorithm is as follows:

Generate a de Bruijn graph.
Pre-processing: this algorithm masks linear paths in de Bruijn graph to optimize the next two steps.
Use Tarjan's algorithm to isolate all strongly connected components, i.e. the area where any chosen node reaches itself through any taken route.
Use restricted Johnson's algorithm to find cycles 46-94 nodes. Johnson's algorithm finds every circle/cycle in any given graph. The time complexity can be reduced given that the number of nodes in each circle is restricted to certain quantity.
Future outlook: post-processing algorithmto remove false positives.

TEAM Wo Ist?

Results

Proof of concept:

We applied our approach on genomes with known CRISPR systems. Here are the characteristics of the genomes:

35 genomes from crispr-cas-db
The genome length varied anywhere from 1-9 mil. bp.
The quantity of CRISPR arrays varied from 1-110 per genome.
Number of spacers in those CRISPR arrays was 3050.
Number of CRISPR systems in total was 230.

Results:

100% of known systems
~96 % perfect spacer matches

TEAM Wo Ist?

Future Outlook and Take-Home-Message

Future Outlook

Optimization of de Bruijn graph as a data structure
Algorithm for removal of false positives
Detection of imperfect repeats
Graph reduction
Parallel computation

Take-home Message

We introduced a new approach, based on a de Bruijn graph, which is the basic data structure used by many (meta-)genome assemblers.
We proved that CRISPR arrays are multicycles of length 46-94 nodes in a de Bruijn graph with k-mer size 22.
As a proof of concept, we used our approach on known genomic datasets and benchmarked against state-of-the-art methods, where we achieved 96% exact matches in spacer detection.

Graph Based Strategies to Finding CRISPR Arrays in Metagenomic Data

Graph Based Strategies to Finding CRISPR Arrays in Metagenomic Data

Fikrat Talibli & Björn Voss

Introduction

Challenges of modern tools

Chap3

Motivation

Chap4

Central idea

TEAM OTTO

Repeat: AAC Spacers: TTA, CGG

CYCLES AND CRISPR ARRAYS

Computationally Formulated Goal

TEAM Wo Ist?

The algorithm

STRUCTURE OF THE ALGORITHM

TEAM Wo Ist?

Results

Proof of concept:

Results:

TEAM Wo Ist?

Future Outlook and Take-Home-Message

Future Outlook

Take-home Message

Repeat: AAC

Spacers: TTA, CGG