Preparing analysis data directory
To run crispr-screens on your own data, first prepare a data directory (where the workflow code is located) as follows:
$ cd my_experiment
$ mkdir reads resources
Copy the fastq files to the reads directory and the resources to the resources directory.
Note
All fastq files must have the extension .fastq.gz or .cram
Copy the the fasta file with sgRNA sequences to the resources directory. If no fasta file is available, provide a csv file with sgRNA sequences and gene names in separate columns. Set the column numbers in config/config.yml (see below).
The final directory structure should look like this:
.
├── config
│ ├── config.yml
│ └── stats.csv
├── reads
│ ├── HT_1.fastq.gz
│ ├── HT_2.fastq.gz
│ ├── noHT_1.fastq.gz
│ └── noHT_2.fastq.gz
├── resources
│ └── bassik.csv
└── workflow
├── envs
│ └── stats.yaml
├── report
│ ├── alignment-rates.rst
│ ├── bagel2_plots.rst
│ ├── drugz.rst
│ ├── gini-index.rst
│ ├── lfc_neg.rst
│ ├── lfc_pos.rst
│ ├── mageck.rst
│ ├── missed-rgrnas.rst
│ ├── multiqc.rst
│ ├── pathway_analysis.rst
│ ├── plot-coverage.rst
│ ├── sample-correlation.rst
│ ├── sgrank.rst
│ └── workflow.rst
├── rules
│ ├── bagel2.smk
│ ├── count.smk
│ ├── drugz.smk
│ ├── mageck.smk
│ ├── qc.smk
│ └── trim.smk
├── schemas
│ ├── config.schema.yaml
│ └── stats.schema.yaml
├── scripts
│ ├── aggregate_counts.py
│ ├── bagel2bf.py
│ ├── bagel2pr.py
│ ├── cnv_cell_lines.txt
│ ├── count.sh
│ ├── crisprcleaner.R
│ ├── csv_to_fasta.py
│ ├── general_functions.smk
│ ├── gprofiler.R
│ ├── mageck.py
│ ├── plot_alignment_rate.R
│ ├── plot_bf.R
│ ├── plot_coverage.R
│ ├── plot_drugz_results.R
│ ├── plot_gini_index.R
│ ├── plot_lfc.R
│ ├── plot_missed_sgrnas.R
│ ├── plot_pr.R
│ └── plot_sgrank.R
└── Snakefile
9 directories, 50 files
Experiment meta data
Experiment meta data is described in config/config.yml:
lib_info:
library_file: resources/bassik.csv # Path to library file with sgRNA sequences and gene names
cutadapt:
g: "" # 5' adapter sequence to trim
a: "" # 3' adapter sequence to trim
u: 0 # trim u bases (before a/g trimming)
l: 20 # shorten reads to l bases
extra: "" # Extra arguments for cutadapt
species: human
csv:
# 0-based column numbers
name_column: 0 # Column number with sgRNA names
gene_column: 1 # Column number with gene names
sequence_column: 2 # Column number with sgRNA sequences
mismatch: 0 # Mismatches allowed during alignment
stats:
crisprcleanr:
# For BAGEL2, crisprcleanr is always run as it allows for combining replicates better
# It is optional for MAGeCK and DrugZ
# Path to library file for crisprcleanr with sgRNA annotations
# With column names: GENE,seq,CODE,CHRM,STARTpos,ENDpos,EXONE(optional),STRAND
# If library name is one of the following, the library info is loaded from the crisprcleanr library database:
# AVANA_Library (https://doi.org/10.1038/ng.3984)
# Brunello_Library (https://doi.org/10.1038/nbt.3437)
# GeCKO_Library_v2 (https://doi.org/10.1038/nmeth.3047)
# KY_Library_v1.0 (https://doi.org/10.1016/j.celrep.2016.09.079)
# KY_Library_v1.1 (https://doi.org/10.1016/j.celrep.2016.09.079)
# MiniLibCas9_Library (https://doi.org/10.1186/s13059-021-02268-4)
# Whitehead_Library (https://doi.org/10.1126/science.aac7041)
library_name: TKOv3
min_reads: 30 # Keep sgRNAs with at least this many reads in control sample
bagel2:
run: True # Perform bagel2 analysis
custom_gene_lists:
# Paths to custom gene lists for bagel2 analysis
# Use "none" to use BAGEL2 default gene lists
essential_genes: none
non_essential_genes: none
extra_args: # Extra arguments for bagel2 subcommands
bf: ""
pr: ""
mageck:
run: True # Perform mageck analysis
command: test # test or mle
mle:
design_matrix: ["config/matrix.txt"] # Design matrix for mageck mle
# It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
apply_crisprcleanr: False
extra_mageck_arguments: ""
mageck_control_genes: all # All or file with control genes
apply_CNV_correction: False # Apply CNV correction to mageck results
cell_line: K562_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE # Cell line for CNV correction
drugz:
run: True # Perform drugZ analysis
# It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
apply_crisprcleanr: False
extra: "" # Extra arguments for drugZ
pathway_analysis:
run: False # Perform pathway analysis on mageck results
data: both # enriched, depleted, or both
fdr: 0.25 # FDR threshold for significant genes
top_genes: 50 # Number of top genes to consider for pathway analysis (overrides fdr, use 0 to disable)
FASTQ trimming
The workflow uses cutadapt to trim the reads. The trimming parameters can be set in the config/config.yml file. The default parameters are:
cutadapt:
g: "" # 5' adapter sequence to trim
a: "" # 3' adapter sequence to trim
u: 0 # trim u bases (before a/g trimming)
l: 20 # shorten reads to l bases
extra: "" # Extra arguments for cutadapt
The trimming parameters are passed to cutadapt as command line arguments. The default parameters are:
g: Sequence of an adapter ligated to the 5’ end. The adapter and any preceding bases are trimmed. Partial matches at the 5’ end are allowed. If a ‘^’ character is prepended (‘anchoring’), the adapter is only found if it is a prefix of the read.
a: Sequence of an adapter ligated to the 3’ end. The adapter and subsequent bases are trimmed. If a ‘$’ character is appended (‘anchoring’), the adapter is only found if it is a suffix of the read.
u: Remove LEN bases from each read (or R1 if paired; use -U option for R2). If LEN is positive, remove bases from the beginning. If LEN is negative, remove bases from the end. Can be used twice if LENs have different signs. Applied before adapter trimming.
l: Shorten reads to LENGTH. Positive values remove bases at the end while negative ones remove bases at the beginning. This and the following modifications are applied after adapter trimming.
extra: Extra arguments for cutadapt
BAGEL2 analysis
BAGEL2 can be used to perform gene essentiality analysis. It requires a file containing the pair-wise comparisons of the samples. The file should be in the config directory and have the following format:
test |
control |
|---|---|
HT_1 |
noHT_1 |
HT_2 |
noHT_2 |
HT_1;HT_2 |
noHT_1;noHT_2 |
Note
Replicate samples can be used by separating the sample names with a semicolon (see above). The sample names must match the fastq file names in the reads directory.
CRISPRcleanR is used to create a count table as input for BAGEL2.
BAGEL2 can be customized as follows:
bagel2:
run: True # Perform bagel2 analysis
custom_gene_lists:
# Paths to custom gene lists for bagel2 analysis
# Use "none" to use BAGEL2 default gene lists
essential_genes: none
non_essential_genes: none
extra_args: # Extra arguments for bagel2 subcommands
bf: ""
pr: ""
Custom essential and non-essential gene lists that BAGEL2 requires for Bayes Factor calculation can be provided in the resources directory. The files should be in the following format:
A1BG |
A1CF |
A2M |
A3GALT2 |
A4GALT |
A4GNT |
A4GNTL1 |
A4GNTL2 |
If no custom gene lists are provided, the default gene lists from BAGEL2 will be used. This can be done by setting the essential_genes and non_essential_genes parameters to none.
Extra arguments for the BAGEL2 bf and pr subcommands can be provided in the extra_args section.
MAGeCK analysis
MAGeCK can be used to perform multiple types of analyses: RRA and MLE.
RRA
To perform pair-wise analyses, MAGeCK RRA can be run with the test command. When the test command is used, a file (config/stats.csv) with these pairwise comparisons of the samples must be provided. The file format is described in the BAGEL2 section.
MLE
For more complex designs, the mle command can be used. A design matrix must be provided in the config directory. An example of a design matrix (tab-delimited) is shown below. :
Samples |
baseline |
noHT_vs_HT |
|---|---|---|
HT_1 |
1 |
0 |
HT_2 |
1 |
0 |
noHT_1 |
1 |
1 |
noHT_2 |
1 |
1 |
The MAGeCK analysis can be customized as follows:
mageck:
run: True # Perform mageck analysis
command: test # test or mle
mle:
design_matrix: ["config/matrix.txt"] # Design matrix for mageck mle
# It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
apply_crisprcleanr: False
extra_mageck_arguments: ""
mageck_control_genes: all # All or file with control genes
apply_CNV_correction: False # Apply CNV correction to mageck results
cell_line: K562_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE # Cell line for CNV correction
CRISPRcleanR can be used to create a normalised count table as input for MAGeCK. This will disable normalisation by MAGeCK.
A list of control genes can be provided in the resources directory. The file should be in the following format:
A1BG |
A1CF |
A2M |
A3GALT2 |
A4GALT |
A4GNT |
A4GNTL1 |
A4GNTL2 |
These genes will then be used for normalisation and for generating the null distribution of RRA.
Extra arguments for the MAGeCK test and mle commands can be provided in the extra_mageck_arguments section.
drugZ analysis
drugZ can be used to perform chemogenetic analysis of CRISPR screens. It requires a file containing the pair-wise comparisons of the samples. The file format is described in the BAGEL2 section.
The drugZ analysis can be customized as follows:
drugz:
run: True # Perform drugZ analysis
# It is recommended to disable crisprcleanr when using non-genome-wide sgRNA libraries
apply_crisprcleanr: False
extra: "" # Extra arguments for drugZ
CRISPRcleanR can be used to create a normalised count table as input for MAGeCK.
Extra arguments for the drugZ command can be provided in the extra section.
CRISPRcleanR
When CRISPRcleanR is applied, the library csv file should be formatted as follows (exon column is optional):
CODE |
GENES |
seq |
CHRM |
STARTpos |
ENDpos |
EXONE |
STRAND |
|---|---|---|---|---|---|---|---|
chr19:58864777-58864796_A1BG_+ |
A1BG |
CAAGAGAAAGACCACGAGCA |
chr19 |
58864777 |
58864796 |
ex1 |
+ |
chr19:58864319-58864338_A1BG_+ |
A1BG |
GCTCAGCTGGGTCCATCCTG |
chr19 |
58864319 |
58864338 |
ex3 |
+ |
chr19:58863885-58863904_A1BG_+ |
A1BG |
ACTGGCGCCATCGAGAGCCA |
chr19 |
58863885 |
58863904 |
ex4 |
+ |
chr19:58862759-58862778_A1BG_- |
A1BG |
GTCGAGCTGATTCTGAGCGA |
chr19 |
58862759 |
58862778 |
ex5 |
- |
When CRISPRcleanR is not applied, the library csv file can be less extensive and should only contain columns for the gene names and sgRNA sequences and names, e.g.:
sgRNA |
Gene |
sequence |
|---|---|---|
ENSG00000121410_A1BG_PROT_195964.1 |
A1BG |
GCTGACGGGTGACACCCA |
ENSG00000121410_A1BG_PROT_195965.2 |
A1BG |
GACTTCCAGCTGTTCAAGAA |
ENSG00000121410_A1BG_PROT_195966.3 |
A1BG |
GCAGGTGAGTCAAGGTGCAC |
ENSG00000121410_A1BG_PROT_195967.4 |
A1BG |
GCCGCTCGGGCTTGTCCAC |
In both cases, the column numbers (0-based) for the gene names, sgRNA sequences, and sgRNA names must be set in config/config.yml (under the csv section).
Note
If library_name matches an internal CRISPRcleanR library, the information will be auto-loaded.
Setup global Snakemake profile
To setup a profile for custom command line arguments, create a new profile (config.yaml) in $HOME/.config/snakemake/standard/:
cores: 40
latency-wait: 10
use-conda: True # Recommended
rerun-incomplete: True
printshellcmds: True
show-failed-logs: True
use-apptainer: True # Recommended
apptainer-args: "--bind '/parent_dir/of/analysis'" # If analysis is not in /home/$USER
# For execution on a SLURM cluster add:
executor: slurm
jobs: 100 # Maximum number of jobs to run in parallel
local-cores: 2 # Limit core usage for local rules
default-resources:
slurm_partition: icelake
slurm_account: <ACCOUNT>
With the above settings, Snakemake will download a Docker image from Docker Hub and convert it to an Apptainer image. In this image all required Conda environments are pre-installed. The –bind option is required when the analysis directory is not in /home/$USER. If the analysis directory is in /home/$USER, this option can be omitted.
Dry-run on experimental data
To test if the workflow is working with your data and settings, run a dry-run:
$ snakemake -np
Execution of workflow
To execute the workflow:
$ snakemake --profile $HOME/.config/snakemake/standard/
Report of analysis
When the workflow has finished, a report of the results can be generated (HTML format):
$ snakemake --report report.html