Preparing analysis data directory

To run crispr-screens on your own data, first prepare a data directory (where the workflow code is located) as follows:

$ cd my_experiment
$ mkdir reads resources

Copy the fastq files to the reads directory and the resources to the resources directory.

Note

All fastq files must have the extension .fastq.gz or .cram

Copy the the fasta file with sgRNA sequences to the resources directory. If no fasta file is available, provide a csv file with sgRNA sequences and gene names in separate columns. Set the column numbers in config/config.yml (see below).

The final directory structure should look like this:

.
├── config
│   ├── config.yml
│   └── stats.csv
├── reads
│   ├── HT_1.fastq.gz
│   ├── HT_2.fastq.gz
│   ├── noHT_1.fastq.gz
│   └── noHT_2.fastq.gz
├── resources
│   └── bassik.csv
└── workflow
    ├── envs
    │   └── stats.yaml
    ├── report
    │   ├── alignment-rates.rst
    │   ├── bagel2_plots.rst
    │   ├── drugz.rst
    │   ├── gini-index.rst
    │   ├── lfc_neg.rst
    │   ├── lfc_pos.rst
    │   ├── mageck.rst
    │   ├── missed-rgrnas.rst
    │   ├── multiqc.rst
    │   ├── pathway_analysis.rst
    │   ├── plot-coverage.rst
    │   ├── sample-correlation.rst
    │   ├── sgrank.rst
    │   └── workflow.rst
    ├── rules
    │   ├── bagel2.smk
    │   ├── count.smk
    │   ├── drugz.smk
    │   ├── mageck.smk
    │   ├── qc.smk
    │   └── trim.smk
    ├── schemas
    │   ├── config.schema.yaml
    │   └── stats.schema.yaml
    ├── scripts
    │   ├── aggregate_counts.py
    │   ├── bagel2bf.py
    │   ├── bagel2pr.py
    │   ├── cnv_cell_lines.txt
    │   ├── count.sh
    │   ├── crisprcleaner.R
    │   ├── csv_to_fasta.py
    │   ├── general_functions.smk
    │   ├── gprofiler.R
    │   ├── mageck.py
    │   ├── plot_alignment_rate.R
    │   ├── plot_bf.R
    │   ├── plot_coverage.R
    │   ├── plot_drugz_results.R
    │   ├── plot_gini_index.R
    │   ├── plot_lfc.R
    │   ├── plot_missed_sgrnas.R
    │   ├── plot_pr.R
    │   └── plot_sgrank.R
    └── Snakefile

9 directories, 50 files

Experiment meta data

Experiment meta data is described in config/config.yml:

  lib_info:
      library_file: resources/bassik.csv # Path to library file with sgRNA sequences and gene names

      cutadapt:
          g: "" # 5' adapter sequence to trim
          a: "" # 3' adapter sequence to trim
          u: 0 # trim u bases (before a/g trimming)
          l: 20 # shorten reads to l bases
          extra: "" # Extra arguments for cutadapt

  species: human

  csv:
      # 0-based column numbers
      name_column: 0 # Column number with sgRNA names
      gene_column: 1 # Column number with gene names
      sequence_column: 2 # Column number with sgRNA sequences

  mismatch: 0 # Mismatches allowed during alignment

  stats:
  crisprcleanr:
      # For BAGEL2, crisprcleanr is always run as it allows for combining replicates better
      # It is optional for MAGeCK and DrugZ
      # Path to library file for crisprcleanr with sgRNA annotations
      # With column names: GENE,seq,CODE,CHRM,STARTpos,ENDpos,EXONE(optional),STRAND
      # If library name is one of the following, the library info is loaded from the crisprcleanr library database:
      # AVANA_Library (https://doi.org/10.1038/ng.3984)
      # Brunello_Library (https://doi.org/10.1038/nbt.3437)
      # GeCKO_Library_v2 (https://doi.org/10.1038/nmeth.3047)
      # KY_Library_v1.0 (https://doi.org/10.1016/j.celrep.2016.09.079)
      # KY_Library_v1.1 (https://doi.org/10.1016/j.celrep.2016.09.079)
      # MiniLibCas9_Library (https://doi.org/10.1186/s13059-021-02268-4)
      # Whitehead_Library (https://doi.org/10.1126/science.aac7041)
      library_name: TKOv3
      min_reads: 30 # Keep sgRNAs with at least this many reads in control sample

  bagel2:
      run: True # Perform bagel2 analysis
      custom_gene_lists:
      # Paths to custom gene lists for bagel2 analysis
      # Use "none" to use BAGEL2 default gene lists
          essential_genes: none
          non_essential_genes: none
      extra_args: # Extra arguments for bagel2 subcommands
      bf: ""
      pr: ""

  mageck:
      run: True # Perform mageck analysis
      command: test # test or mle
      mle:
          design_matrix: ["config/matrix.txt"] # Design matrix for mageck mle
      # It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
      apply_crisprcleanr: False
      extra_mageck_arguments: ""
      mageck_control_genes: all # All or file with control genes
      apply_CNV_correction: False # Apply CNV correction to mageck results
      cell_line: K562_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE # Cell line for CNV correction

drugz:
  run: True # Perform drugZ analysis
  # It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
  apply_crisprcleanr: False
  extra: "" # Extra arguments for drugZ

pathway_analysis:
  run: False # Perform pathway analysis on mageck results
  data: both # enriched, depleted, or both
  fdr: 0.25 # FDR threshold for significant genes
  top_genes: 50 # Number of top genes to consider for pathway analysis (overrides fdr, use 0 to disable)

FASTQ trimming

The workflow uses cutadapt to trim the reads. The trimming parameters can be set in the config/config.yml file. The default parameters are:

cutadapt:
    g: "" # 5' adapter sequence to trim
    a: "" # 3' adapter sequence to trim
    u: 0 # trim u bases (before a/g trimming)
    l: 20 # shorten reads to l bases
    extra: "" # Extra arguments for cutadapt

The trimming parameters are passed to cutadapt as command line arguments. The default parameters are:

  • g: Sequence of an adapter ligated to the 5’ end. The adapter and any preceding bases are trimmed. Partial matches at the 5’ end are allowed. If a ‘^’ character is prepended (‘anchoring’), the adapter is only found if it is a prefix of the read.

  • a: Sequence of an adapter ligated to the 3’ end. The adapter and subsequent bases are trimmed. If a ‘$’ character is appended (‘anchoring’), the adapter is only found if it is a suffix of the read.

  • u: Remove LEN bases from each read (or R1 if paired; use -U option for R2). If LEN is positive, remove bases from the beginning. If LEN is negative, remove bases from the end. Can be used twice if LENs have different signs. Applied before adapter trimming.

  • l: Shorten reads to LENGTH. Positive values remove bases at the end while negative ones remove bases at the beginning. This and the following modifications are applied after adapter trimming.

  • extra: Extra arguments for cutadapt

BAGEL2 analysis

BAGEL2 can be used to perform gene essentiality analysis. It requires a file containing the pair-wise comparisons of the samples. The file should be in the config directory and have the following format:

Example stats.csv file

test

control

HT_1

noHT_1

HT_2

noHT_2

HT_1;HT_2

noHT_1;noHT_2

Note

Replicate samples can be used by separating the sample names with a semicolon (see above). The sample names must match the fastq file names in the reads directory.

CRISPRcleanR is used to create a count table as input for BAGEL2.

BAGEL2 can be customized as follows:

bagel2:
    run: True # Perform bagel2 analysis
    custom_gene_lists:
    # Paths to custom gene lists for bagel2 analysis
    # Use "none" to use BAGEL2 default gene lists
        essential_genes: none
        non_essential_genes: none
    extra_args: # Extra arguments for bagel2 subcommands
        bf: ""
        pr: ""

Custom essential and non-essential gene lists that BAGEL2 requires for Bayes Factor calculation can be provided in the resources directory. The files should be in the following format:

Example essential gene list

A1BG

A1CF

A2M

A3GALT2

A4GALT

A4GNT

A4GNTL1

A4GNTL2

If no custom gene lists are provided, the default gene lists from BAGEL2 will be used. This can be done by setting the essential_genes and non_essential_genes parameters to none.

Extra arguments for the BAGEL2 bf and pr subcommands can be provided in the extra_args section.

MAGeCK analysis

MAGeCK can be used to perform multiple types of analyses: RRA and MLE.

RRA

To perform pair-wise analyses, MAGeCK RRA can be run with the test command. When the test command is used, a file (config/stats.csv) with these pairwise comparisons of the samples must be provided. The file format is described in the BAGEL2 section.

MLE

For more complex designs, the mle command can be used. A design matrix must be provided in the config directory. An example of a design matrix (tab-delimited) is shown below. :

Example design matrix

Samples

baseline

noHT_vs_HT

HT_1

1

0

HT_2

1

0

noHT_1

1

1

noHT_2

1

1

The MAGeCK analysis can be customized as follows:

mageck:
    run: True # Perform mageck analysis
    command: test # test or mle
    mle:
        design_matrix: ["config/matrix.txt"] # Design matrix for mageck mle
    # It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
    apply_crisprcleanr: False
    extra_mageck_arguments: ""
    mageck_control_genes: all # All or file with control genes
    apply_CNV_correction: False # Apply CNV correction to mageck results
    cell_line: K562_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE # Cell line for CNV correction

CRISPRcleanR can be used to create a normalised count table as input for MAGeCK. This will disable normalisation by MAGeCK.

A list of control genes can be provided in the resources directory. The file should be in the following format:

Example control gene list

A1BG

A1CF

A2M

A3GALT2

A4GALT

A4GNT

A4GNTL1

A4GNTL2

These genes will then be used for normalisation and for generating the null distribution of RRA.

Extra arguments for the MAGeCK test and mle commands can be provided in the extra_mageck_arguments section.

drugZ analysis

drugZ can be used to perform chemogenetic analysis of CRISPR screens. It requires a file containing the pair-wise comparisons of the samples. The file format is described in the BAGEL2 section.

The drugZ analysis can be customized as follows:

drugz:
    run: True # Perform drugZ analysis
    # It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
    apply_crisprcleanr: False
    extra: "" # Extra arguments for drugZ

CRISPRcleanR can be used to create a normalised count table as input for MAGeCK.

Extra arguments for the drugZ command can be provided in the extra section.

CRISPRcleanR

When CRISPRcleanR is applied, the library csv file should be formatted as follows (exon column is optional):

Example library csv file for use with CRISPRcleanR

CODE

GENES

seq

CHRM

STARTpos

ENDpos

EXONE

STRAND

chr19:58864777-58864796_A1BG_+

A1BG

CAAGAGAAAGACCACGAGCA

chr19

58864777

58864796

ex1

+

chr19:58864319-58864338_A1BG_+

A1BG

GCTCAGCTGGGTCCATCCTG

chr19

58864319

58864338

ex3

+

chr19:58863885-58863904_A1BG_+

A1BG

ACTGGCGCCATCGAGAGCCA

chr19

58863885

58863904

ex4

+

chr19:58862759-58862778_A1BG_-

A1BG

GTCGAGCTGATTCTGAGCGA

chr19

58862759

58862778

ex5

-

When CRISPRcleanR is not applied, the library csv file can be less extensive and should only contain columns for the gene names and sgRNA sequences and names, e.g.:

Example library csv file

sgRNA

Gene

sequence

ENSG00000121410_A1BG_PROT_195964.1

A1BG

GCTGACGGGTGACACCCA

ENSG00000121410_A1BG_PROT_195965.2

A1BG

GACTTCCAGCTGTTCAAGAA

ENSG00000121410_A1BG_PROT_195966.3

A1BG

GCAGGTGAGTCAAGGTGCAC

ENSG00000121410_A1BG_PROT_195967.4

A1BG

GCCGCTCGGGCTTGTCCAC

In both cases, the column numbers (0-based) for the gene names, sgRNA sequences, and sgRNA names must be set in config/config.yml (under the csv section).

Note

If library_name matches an internal CRISPRcleanR library, the information will be auto-loaded.

Setup global Snakemake profile

To setup a profile for custom command line arguments, create a new profile (config.yaml) in $HOME/.config/snakemake/standard/:

cores: 40
latency-wait: 10
use-conda: True # Recommended
rerun-incomplete: True
printshellcmds: True
show-failed-logs: True
use-apptainer: True # Recommended
apptainer-args: "--bind '/parent_dir/of/analysis'" # If analysis is not in /home/$USER
# For execution on a SLURM cluster add:
executor: slurm
jobs: 100 # Maximum number of jobs to run in parallel
local-cores: 2 # Limit core usage for local rules
default-resources:
    slurm_partition: icelake
    slurm_account: <ACCOUNT>

With the above settings, Snakemake will download a Docker image from Docker Hub and convert it to an Apptainer image. In this image all required Conda environments are pre-installed. The –bind option is required when the analysis directory is not in /home/$USER. If the analysis directory is in /home/$USER, this option can be omitted.

Dry-run on experimental data

To test if the workflow is working with your data and settings, run a dry-run:

$ snakemake -np

Execution of workflow

To execute the workflow:

$ snakemake --profile $HOME/.config/snakemake/standard/

Report of analysis

When the workflow has finished, a report of the results can be generated (HTML format):

$ snakemake --report report.html