Input query sequences are searched against the NCBI’s BLAST API in remote mode. BLAST hits are filtered according to user defined quality thresholds. Each hit is then queried against the NCBI’s Identical Protein Groups (IPG) resource, which, as its name suggests, groups proteins sharing identical amino acid sequence as an anti-redundancy measure. The resulting IPG table contains source genomic coordinates for each hit protein sequence, which cblaster uses to group them by their corresponding organism, scaffold and subject sequences.
In order to run a cblaster-search, you will need to provide input sequences which will be used as queries. These can be provided in two ways:
To run a domain search, you need to specify the search mode as HMM, provide query Pfam domain profile names and select a genus-specific database (consisting of all representative/reference genomes of the selected genus stored at NCBI) to search in. This will extract the specified domain profiles from the Pfam database and search for the sequences in the selected genus database for any domain hits.
The default values for each filter are pretty generous, and may need changing based on your data. The search thresholds should be fairly self explanatory; any hit not meeting them are discarded from the BLAST search results.
The clustering thresholds, however, are a bit more interesting. These determine what conditions a candidate hit cluster must satisfy in order to be detected by cblaster. The most important parameter here is Maximum Intergenic Gap, which determines how far (in base pairs) any two hits in a cluster can be from one another. This parameter could vary wildly based on your data set. For example, in bacterial or fungal secondary metabolite gene clusters where genes are typically found very close together, a low value could be used. Conversely, plant clusters, which may involve a collection of key genes spread out over the entire chromosome, would require a much higher value. The Gene Neighbourhood Estimation module can used to calibrate this parameter based on your results
By default, a complete summary is generated and saved to a file after the search has finished. This reports all clusters, as well as the scores and positions of each gene hit, found during the search, organized by the organisms and genomic scaffolds they belong to.
An easier way to digest all of the information that cblaster will produce is by using the binary table output. This generates a matrix which shows the absence/presence of query sequence (columns) hits in each result cluster (rows). By default, the binary table will only report the total number of hits per query sequence in each cluster. However, you can instead change this to some value calculated from the actual scores of hits in the clusters.
This is controlled by two additional parameters: Hit attribute, which determines which score attribute (‘identity’, ‘coverage’, ‘bitscore’ or ‘evalue’) to use when calculating cell values, and Key function, which determines the function (‘len’, ‘max’, ‘sum’) applied to the score attribute.
Each cell in the matrix refers to multiple hit sequences within each cluster. For every cell, the chosen score attribute is extracted from each hit corresponding to that cell. Then, the key function is applied to the extracted scores. The ‘len’ function calculates the length of each score list - essentially just counting the number of hits in that cell. The ‘max’ and ‘sum’ functions calculate the maximum and sum of each score list, respectively.
Given that searches can take a significant time to run (i.e. as long as any normal batch BLAST job will take), cblaster is capable of saving a search session to file, and loading it back later for further filtering and visualization. In CAGECAT, cblaster search sessions are used to connect jobs. Any subsequent jobs will make cblaster try to load the session file of the precious job, instead of performing a new search.
The default output for cblaster is the cluster heatmap, which shows the absence or presence of your query sequences. While we find this is generally the easiest way to pick up on patterns of cluster conservation, we also like to be able to visualize our results in their own genomic contexts so we can see the differences in gene order, orientation, size and so on.
For this reason, we added integration to the clinker tool (https://github.com/gamcil/clinker), which can generate highly interactive gene cluster comparison plots. However, in a regular cblaster-search, we do not have access to any information about the genes between the BLAST hits shown in the heatmap. This means that if you were to run the Visualize with query clusters module on a previous job, you would produce a figure where most of the clusters are missing genes!
To get around this, you can use the Find intermediate genes parameter when performing a cblaster search. After the search has completed, genomic regions corresponding to the detected gene clusters are retrieved from the NCBI servers, and used to fill in the missing genes.
You can recompute a previous cblaster-search job using new filter thresholds to filter previous results.
In cblaster, the most important parameter when detecting hit clusters is the maximum intergenic-hit gap parameter. This determines how far cblaster will look between any two hits before terminating a given cluster. By default, this parameter is set to 20,000 bp; if no new hit is found within 20,000 bp of the previous hit in a cluster, cblaster will terminate extension of that cluster. Though the 20 kb cutoff has worked quite well for us when looking at fungi or bacteria, where gene density within clusters is quite high, it may not work for all datasets. For example, plant gene clusters may have key biosynthetic genes spread out over large stretches of the chromosome, with many genes in between; this is where the GNE module comes in.
The Neighbourhood estimation module lets you robustly detect an appropriate value to use for this parameter by continually re-running cluster detection on a previous cblaster-search or cblaster-recompute job at different gap values over some interval. It then generates plots of the mean and median cluster sizes (bp), as well as the total number of predicted clusters, at each value.
The Neighbourhood estimation module generates a list of gap values (total number determined by the samples parameter) from 0 to some upper limit (determined by the Max. intergenic distance parameter). These numbers can be chosen in two ways. By default, Neighbourhood estimation will take evenly spaced (i.e. linear) values over the range 0-100,000 bp. Alternatively, you can choose to generate these values via a log scale, which will result in more samples at lower values than at higher ones. This can be specified using the sampling space parameter. As these plots typically resemble logarithmic growth (i.e. rise steeply, then level off), it can make sense to sample more heavily in the more unstable region of the curve.
After a cblaster-search has been performed, it can be useful to retrieve sequences matching a certain query for further analyses (e.g. sequence comparisons for phylogenies). This is easily accomplished using the Extract sequences module.
A preceding job is used as input, and sequences matching any filters you have specified will be extracted If no filters are specified, ALL hit sequences will be extracted. By default, only sequence names are extracted. This is because cblaster stores no actual sequence data for hit sequences during it’s normal search workflow, only their coordinates. However, sequences can automatically be retrieved from the NCBI by specifying the Download sequences parameter. cblaster will then write them to a file in FASTA format.
The Extract sequences module can also filter based on the organism that each hit sequence is on.
Advanced: The organism filter uses regular expression patterns based on organism names, but it is not obligatory to use regular expressions. Multiple patterns can be provided, and are additive (i.e. any organism matching any of the patterns will be saved).
A common next step after a cblaster-search job is to retrieve the identified gene clusters so we can perform additional analysis. cblaster provides the Extract clusters module precisely for this purpose, allowing you to generate GenBank files of specific gene clusters directly from a previous cblaster-search job. Extracting all clusters from a job could take a long time for searches with many results.
By default, the visualisation offered by cblaster shows only a heatmap of query hits per result cluster. While this is very useful for quickly identifying patterns in large datasets, we generally still want to see how these clusters compare in a more biologically relevant way.
The Visualize with query clusters module allows you to do precisely this. Given a previous cblaster-search or cblaster-recompute job and some selected clusters, this module will automatically extract the clusters, then generate an interactive visualisation showing each cluster to-scale using clinker (doi: 10.1093/bioinformatics/btab007, https://github.com/gamcil/clinker).