Graph Based Strategies to Finding CRISPR Arrays in Metagenomic Data

Graph Based Strategies to Finding CRISPR Arrays in Metagenomic Data

Fikrat Talibli & Björn Voss


Introduction


Clustered Regularly Interspaced Short Palindromic Repeats or CRISPR are a genetic immune system of bacteria . CRISPR systems consist of two major parts:



The GOAL

Our work concentrates on finding CRISPR arrays.

Additional info from CRISPR-Cas++:

  • Repeats and Spacers: A CRISPR is a succession of 23-47bp sequences called repeats separated by unique sequences of a similar length (spacers).
  • A leader sequence: the CRISPR locus is AT-rich leader sequence of 100-350 bp, acting as a promoter for the pre-crRNA synthesis.
  • Together with a set of genes called cas for “CRISPR-associated”, they constitute an immune system.

Challenges of modern tools


There are many tools that are using bacterial genomes for finding CRISPR Arrays.

However, viruses are underrepresented in reference databases and have high mutation rates. Additionally:

  • Most organisms cannot be cultivated in labs
  • A lot of data is lost in assembly

Raw metagenomic data offer the potential to increase the diversity of known CRISPR-Cas systems. Metagenomic data represent a snapshot of habitat at a defined timepoint.

Goal adjustment: find CRISPR arrays in metagenomic data.

Chap3


Motivation



Schema above(on the left) demonstrates two different approaches:

  • We took a raw metagenomic dataset with 425 mil. Illumina reads;
  • We applied 2 different approaches
    • Assembly-first approach:
      1. Assemble metagenome using e.g. MegaHIT
      2. Use one of the tools for CRISPR analysis: CRT, CRISPRDetect
    • Analysis approach:
      1. Read in raw metagenomic dataset
      2. Use a tool for CRISPR analysis: CRASS
  • Venn diagram with predicted CRISPR systems using three tools: CRT,CRISPRDetect and CRASS .
    • CRISPRDetect predicted 188 CRISPR systems;
    • CRT predicted 72 CRISPR systems;
    • CRASS detected 290 CRISPR systems.
  • The overlap between these three tools is very small - only 5 CRISPR systems.

We propose the new method based on de Bruijn graph.

Chap4


Central idea




Definitions:

Let us see how the CRISPR arrays look like on de Bruijn graph. Suppose we have following CRISPR array sequence: AACTTAAACCGGAAC

TEAM OTTO
  • Repeat: AAC
  • Spacers: TTA, CGG
CYCLES AND CRISPR ARRAYS

From above we see that CRISPR arrays are forming multicycles with following parameters:

  • k is less than minimum length of repeat
  • minimum length of repeat is 23 bp => k=22
  • length of each cycle:
    • Sum of spacer and repeat lengths
    • they vary from 46-94 nodes.


Computationally Formulated Goal

Find cycles limited to 46-94 nodes in de Bruijn graph constructed from metagenomic datasets, with k-mer size 22.


TEAM Wo Ist?


The algorithm


STRUCTURE OF THE ALGORITHM

The rough structure of the algorithm is as follows:

  • Generate a de Bruijn graph.
  • Pre-processing: this algorithm masks linear paths in de Bruijn graph to optimize the next two steps.
  • Use Tarjan's algorithm to isolate all strongly connected components, i.e. the area where any chosen node reaches itself through any taken route.
  • Use restricted Johnson's algorithm to find cycles 46-94 nodes. Johnson's algorithm finds every circle/cycle in any given graph. The time complexity can be reduced given that the number of nodes in each circle is restricted to certain quantity.
  • Future outlook: post-processing algorithmto remove false positives.

TEAM Wo Ist?


Results


Proof of concept:

We applied our approach on genomes with known CRISPR systems. Here are the characteristics of the genomes:

  • 35 genomes from crispr-cas-db
  • The genome length varied anywhere from 1-9 mil. bp.
  • The quantity of CRISPR arrays varied from 1-110 per genome.
  • Number of spacers in those CRISPR arrays was 3050.
  • Number of CRISPR systems in total was 230.


Results:
  • 100% of known systems
  • ~96 % perfect spacer matches

TEAM Wo Ist?


Future Outlook and Take-Home-Message



Future Outlook

  • Optimization of de Bruijn graph as a data structure
  • Algorithm for removal of false positives
  • Detection of imperfect repeats
  • Graph reduction
  • Parallel computation


Take-home Message
  • We introduced a new approach, based on a de Bruijn graph, which is the basic data structure used by many (meta-)genome assemblers.
  • We proved that CRISPR arrays are multicycles of length 46-94 nodes in a de Bruijn graph with k-mer size 22.
  • As a proof of concept, we used our approach on known genomic datasets and benchmarked against state-of-the-art methods, where we achieved 96% exact matches in spacer detection.