Hybran is a hybrid reference-based and ab initio genome annotation pipeline for prokaryotic genomes. It uses the Rapid Annotation Transfer Tool (RATT) to transfer as many annotations as possible from your reference genome annotation based on conserved synteny between the nucleotide genome sequences. Hybran then supplements unannotated regions with ab initio predictions from Prokka. Then, all coding sequence annotations are clustered together and additional reference gene names are assigned based on amino acid sequence identity and alignment coverage.
This can be executed on one or many genomes. The more reference annotations included, the more accurate the annotation will be and less ambiguity will exist for the target genomes. Input can be a FASTA, a list of FASTAs (space-separated), a directory containing FASTAs, or a File Of FileNames (FOFN) of FASTAs.
hybran \
    --genomes /dir/to/FASTAs | in.fasta [in2.fasta in3.fasta ...] | fastas.fofn \
    --references /dir/to/reference/annotation(s) \
    --output ./ \
    --organism "Genus species strain" \
    --nproc 2
Running hybran without specifying a subcommand is the same as calling hybran annotate.
The one exception is the help menu: to see it, you must run
hybran annotate --help.
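When the input FASTAs live in one directory, a FOFN can be generated with a few lines of Python. This is a minimal sketch, assuming that a FOFN is a plain-text file with one FASTA path per line and that the listed extensions identify your FASTA files. The function name `write_fofn` and the extension set are illustrative and not part of Hybran.

```python
from pathlib import Path

# Assumption: these extensions identify FASTA files; adjust to your data.
FASTA_EXTENSIONS = {".fasta", ".fa", ".fna"}


def write_fofn(fasta_dir, fofn_path):
    """Write the absolute path of every FASTA in fasta_dir to fofn_path, one per line."""
    fastas = sorted(
        p.resolve()
        for p in Path(fasta_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FASTA_EXTENSIONS
    )
    Path(fofn_path).write_text("".join(f"{p}\n" for p in fastas))
    return fastas
```

The resulting file (for example `fastas.fofn`) can then be passed to `--genomes` as shown in the command above.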
Final annotations are written in GenBank and GFF formats to the output directory. The output directory also contains intermediate files and informative logs:
outdir/
├── sample1/
│   ├── annomerge/
│   │   ├── sample1.gbk
│   │   ├── sample1.gff
│   │   ├── coord_corrections.tsv
│   │   ├── prokka_unused.tsv
│   │   ├── pseudoscan_report.tsv
│   │   └── ratt_unused.tsv
│   ├── ratt/
│   │   └── ...
│   ├── ratt-postprocessed/
│   │   ├── sample1.*.final.gbk
│   │   ├── coord_corrections.tsv
│   │   ├── invalid_features.tsv
│   │   └── pseudoscan_report.tsv
│   ├── prokka/
│   │   └── ...
│   └── prokka-postprocessed/
│       ├── sample1.gbk
│       ├── coord_corrections.tsv
│       ├── invalid_features.tsv
│       └── pseudoscan_report.tsv
├── sampleN/
│   └── ...
│
├── unified-refs/
│   ├── unifications.tsv
│   ├── unique_ref_cdss.faa
│   ├── reference1.gbk
│   ├── ...
│   └── referenceN.gbk
├── clustering/
│   ├── multigene_clusters.txt
│   ├── novelty_report.tsv
│   ├── onlyltag_clusters.txt
│   └── singleton_clusters.txt
│
├── sample1.gbk
├── sample1.gff
├── ...
├── ...
├── sampleN.gbk
└── sampleN.gff
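To pick up only the final annotations and skip the intermediate files, a script can restrict itself to the top level of the output directory. The sketch below assumes the layout shown above, with one `.gbk` and one `.gff` per sample directly under the output directory. The function name `final_annotations` is illustrative.

```python
from pathlib import Path


def final_annotations(outdir):
    """Map each sample name to its top-level GenBank and GFF files in outdir."""
    outdir = Path(outdir)
    results = {}
    # A non-recursive glob ignores the per-sample intermediate directories.
    for gbk in sorted(outdir.glob("*.gbk")):
        gff = gbk.with_suffix(".gff")
        results[gbk.stem] = {"gbk": gbk, "gff": gff if gff.exists() else None}
    return results
```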
Logs and Reports
Hybran generates revised reference annotations in the unified-refs/ directory.
These annotations differ from the originals in that each set of conserved (>=99% amino acid identity and alignment coverage) or duplicated genes is assigned a single name used for all instances.
The original name is retained as a gene_synonym qualifier in the annotation file.
unifications.tsv will list duplicate genes found in the reference annotations and the name they were assigned.
Columns in this file are:
- reference name
- reference locus tag
- reference gene name
- unified name
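The four columns above are enough to look up the unified name for any reference gene. The following is a minimal sketch, assuming a tab-separated file with the columns in the order listed. Whether the file starts with a header row is not stated here, so it is exposed as a parameter. The function name `load_unifications` is illustrative.

```python
import csv


def load_unifications(handle, has_header=True):
    """Return {(reference, locus_tag): (original_gene_name, unified_name)}."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: first row holds column names
    mapping = {}
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        reference, locus_tag, gene_name, unified_name = row[:4]
        mapping[(reference, locus_tag)] = (gene_name, unified_name)
    return mapping
```

Usage would look like `with open("unified-refs/unifications.tsv") as fh: names = load_unifications(fh)`.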
unique_ref_cdss.faa is a multi-FASTA file of the representative amino acid sequences for each unique reference CDS.
Depending on the sequence identity and alignment coverage thresholds used, Hybran will name candidate novel genes. This novelty report allows you to examine whether these genes are truly unique based on how close they came to meeting the thresholds.
- nearest_ref_match: The top hit among the reference or other candidate novel genes.
nearest_ref_match is the top hit according to the metric specified in this column. Its values for all three metrics are shown in the next columns.
- pct_aa_ident : Percent amino acid sequence identity
- pct_sub_covg : Percent subject (reference) alignment coverage
- pct_qry_covg : Percent query alignment coverage
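One way to use the three metrics above is to list candidates that only narrowly missed the thresholds. This is a hedged sketch: it assumes the report is tab-separated with a header row containing the column names given above, and the default thresholds and margin are hypothetical values that should be replaced with the ones used in your run. The function name `near_misses` is illustrative.

```python
import csv


def near_misses(handle, min_ident=95.0, min_covg=95.0, margin=5.0):
    """Return rows whose three metrics all lie within `margin` points of the thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        try:
            ident = float(row["pct_aa_ident"])
            sub = float(row["pct_sub_covg"])
            qry = float(row["pct_qry_covg"])
        except (KeyError, TypeError, ValueError):
            continue  # no usable hit recorded for this candidate
        if (ident >= min_ident - margin
                and sub >= min_covg - margin
                and qry >= min_covg - margin):
            hits.append(row)
    return hits
```

Rows returned by this function are the ones worth inspecting by hand before treating the gene as truly novel.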
- locus_tag: Locus tag of the rejected feature from the source indicated by the file name or parent directory.
- gene_name: Assigned gene name of the rejected feature (lifted over from reference annotation). Same as locus tag if none was assigned.
- rival_locus_tag: Locus tag of the prevailing feature.
- rival_gene_name: Assigned gene name of the prevailing feature (lifted over from reference annotation).
- evidence_codes: Summary of the reason for rejecting the feature.
- remark: A more verbose explanation of the rejection reason.
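A quick way to get an overview of such a report is to count how often each evidence code occurs. The sketch below assumes a tab-separated file with a header row using the column names listed above, and that multiple codes in one cell are semicolon-delimited (stated in this document only for the pseudoscan report). The function name `tally_evidence_codes` is illustrative.

```python
import csv
from collections import Counter


def tally_evidence_codes(handle):
    """Count occurrences of each evidence code in a rejected-features report."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        # Assumption: several codes may share a cell, separated by semicolons.
        for code in (row.get("evidence_codes") or "").split(";"):
            code = code.strip()
            if code:
                counts[code] += 1
    return counts
```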
Evidence Codes Assigned during RATT Postprocessing
- no_coordinates : RATT sometimes outputs malformed feature locations (see, for example, RATT#18 and RATT#19). Hybran intercepts these during parsing of the results and sets an empty location to enable continuity of the pipeline. Since the malformed feature could not be properly parsed, however, there may not be a name to refer to in the log here.
- categorical : Currently, rRNAs and tRNAs are only taken from the ab initio annotation, so these are categorically rejected from RATT.
: When using --filter-ratt, annotations not meeting the blastp thresholds are rejected and have this evidence code applied.
Evidence Codes Assigned by fissionfuser
fissionfuser is only applied during postprocessing of the ab initio annotations.
: This scenario arises as a result of postprocessing ab initio annotations. When a CDS has an internal stop, the ab initio annotation often reports what looks like a tandem duplication. Start coordinate correction by pseudoscan often extends the downstream fragment to overlap with the upstream fragment, and fissionfuser identifies this fission event signature.
Evidence Codes Assigned by fusionfisher
Evidence Codes Assigned by annomerge
- forfeit : When postprocessed RATT and ab initio annotations are equally valid, RATT is preferred since its name assignment derives from synteny.
- unnamed : When an ab initio annotation for which a name could not be assigned using blastp hits conflicts with a RATT annotation, the ab initio annotation is rejected for this reason.
- locus_tag: Locus tag of the annotated feature from the source indicated by the parent directory.
- gene_name: Assigned gene name (lifted over from reference annotation)
- og_start: Original start position
- og_end: Original end position
- new_start: Updated start position
- new_end: Updated end position
- fixed_start_codon: Whether the start codon was corrected ('true' or 'false')
- fixed_stop_codon: Whether the stop codon was corrected ('true' or 'false')
- gene_length_diff: The percent difference in gene length between the original and updated locations
- status: Whether the correction was accepted or rejected
For og_start, og_end, new_start, and new_end, "start" always corresponds to the low number on the genome and "stop" corresponds to the high number, regardless of strand.
new_start and new_end are not necessarily modified from the original coordinates.
fixed_start_codon and fixed_stop_codon indicate whether they have changed, but these correspond to the strand-adjusted start and stop positions, hence the reference to codons.
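The distinction between genome-ordered coordinates and strand-adjusted codon positions can be made concrete with a small helper. This is an illustrative sketch only: it assumes the strand is known from the annotation itself (it is not one of the columns listed above), encoded as 1 or -1, and the function names are hypothetical rather than Hybran's.

```python
def codon_ends(start, end, strand):
    """Return (start-codon side, stop-codon side) for genome-ordered coordinates.

    `start` is the low coordinate and `end` the high one, regardless of strand.
    On the minus strand the start codon sits at the high coordinate.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    return (start, end) if strand == 1 else (end, start)


def codon_changes(og_start, og_end, new_start, new_end, strand):
    """Report which strand-adjusted end of the feature moved."""
    og_begin, og_stop = codon_ends(og_start, og_end, strand)
    new_begin, new_stop = codon_ends(new_start, new_end, strand)
    return {
        "start_codon_moved": og_begin != new_begin,
        "stop_codon_moved": og_stop != new_stop,
    }
```

For a minus-strand gene whose high coordinate changes, this reports a moved start codon even though the low ("start") coordinate is unchanged.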
A summary of the characteristics of "interesting" features found by pseudoscan.
Such features include all genes to which the pseudo tag was applied, as well as non-pseudo genes that had signatures consistent with a pseudo but also had a redeeming attribute.
- locus_tag: Locus tag of the annotated feature. In the hybran pipeline, pseudoscan is run before final locus tags have been determined, so the locus tags in those reports will still correspond to those assigned by RATT and the ab initio annotation.
- gene_name: Assigned gene name (lifted over from reference annotation)
Whether the feature has been called pseudo.
- evidence_codes: Summary of the reason(s) for the pseudo determination. If multiple evidence codes apply to a single feature, they will be semicolon-delimited. See below for a description of the possible evidence codes.
- summary: Semicolon-delimited string of (abbreviated) pseudo attributes and their respective boolean values.
- div_by_3: Whether the feature sequence is divisible by three.
- valid_start_codon: Whether the feature sequence begins with a valid start codon according to the detected genetic code table.
- valid_stop_codon: Whether the feature sequence ends with a valid stop codon according to the detected genetic code table and has no internal stops.
- ref_corr_start: Whether the feature sequence's beginning aligns to the reference sequence's beginning. Start refers to the part of the sequence containing the start codon, even for genes on the minus strand.
- ref_corr_end: Whether the feature sequence's end aligns to the reference sequence's end. End here refers to the part of the sequence containing the stop codon, even for genes on the minus strand.
Whether the feature sequence has a passing blastp hit based on the configured thresholds.
Except for descriptive columns such as evidence_codes, the column values are 0 (false), 1 (true), or . (not determined).
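The sequence-only attributes described above (div_by_3, valid_start_codon, valid_stop_codon) can be illustrated without any alignment. The sketch below is not pseudoscan's implementation. It assumes the input is the coding-strand nucleotide sequence, hard-codes the stop codons of genetic code table 11 instead of detecting the table, and uses a reduced set of common start codons. The function name `basic_checks` is illustrative.

```python
# Assumptions: genetic code table 11 stop codons, and only the most
# common bacterial start codons (table 11 permits additional ones).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}


def basic_checks(seq):
    """Evaluate reading-frame, start-codon, and stop-codon attributes of a CDS."""
    seq = seq.upper()
    div_by_3 = len(seq) % 3 == 0
    # In-frame codons read from the first base; a trailing partial codon is dropped.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # The last in-frame codon is terminal only when the length is a multiple of 3.
    internal = codons[:-1] if div_by_3 else codons
    internal_stop = any(c in STOP_CODONS for c in internal)
    return {
        "div_by_3": div_by_3,
        "valid_start_codon": bool(codons) and codons[0] in START_CODONS,
        "valid_stop_codon": len(seq) >= 3 and seq[-3:] in STOP_CODONS and not internal_stop,
        "internal_stop": internal_stop,
    }
```

The reference-correspondence and blastp attributes need an alignment against the reference gene and are therefore outside the scope of this sketch.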
The reference gene is marked pseudo.
- alt_start: The feature has a valid start codon that does NOT correspond with the reference.
- alt_end: The feature has a valid stop codon that does NOT correspond with the reference.
- noisy_seq: The feature has a reference-corresponding start and stop, but a poor blastp hit.
- no_rcc: The feature does not have a reference-corresponding start and stop. The feature also has a poor blastp hit.
- not_div_by_3: The feature does not have a valid reading frame.
- internal_stop: The feature contains an internal stop codon.
Elghraoui, A.; Gunasekaran, D.; Ramirez-Busby, S. M.; Bishop, E.; Valafar, F. Hybran: Hybrid Reference Transfer and Ab Initio Prokaryotic Genome Annotation. bioRxiv November 10, 2022, p 2022.11.09.515824. doi:10.1101/2022.11.09.515824.