Release Notes¶
Version 1.7.1¶
Bugs fixed¶
- Fixed issue with
onegene
where unnamed reference genes did not get unified to use the locus tag of the primary reference as a gene name, instead unnecessarily assigning a generic name.
Version 1.7¶
Enhancements¶
- New subcommands:
hybran standardize
: remove generic gene names in the final annotations.hybran onegene
: unify names of highly conserved gene copies- New option
-s
/--organism
to set organism name in the final genbank header (#58). - Intercept malformed features in RATT output to allow continuity of the pipeline (#64).
- The
--dedupe-references
option is now deprecated, being made a core part of the pipeline withonegene
since the generic names that it introduces to collapse reference paralogs can be undone byhybran standardize
if desired afterwards. - Reference unification now uses a different generic prefix than that used for unnamed genes to allow differentiating between them.
hybran standardize
handles these as well. - Full support for multiple reference annotations.
- Postprocessed versions of RATT and ab initio annotations, with associated reports, are now saved.
- Substantial speedups due to parallelization of all postprocessing logic.
- Logging and reporting:
- Comprehensive reporting of pseudoscan results.
- Reorganized the invalid and rejected features reports.
- Removed spaces from column names of novelty report.
Bugs fixed¶
- reference unification (formerly activated by the
--dedupe-references
option): - fixed for multiple references (#55)
- enabled checkpointing on this step in the full pipeline.
- pseudoscan: now identifies delayed stop codons when start coordinate correction changes the reading frame without introducing an internal stop.
- coord_check:
- Fixed issue in coord_check's reporting status when not attempting correction.
- Coordinate correction sometimes resulted in genes with no stop codons. coord_check now checks for this and extends the ORF to the next in-frame stop codon to make a proper correction. These address the root cause of the problem for which the temporary measure from v1.6.1 was taken. That temporary measure has been removed.
- Improved detection of gene fusions when one of the components is derived from a reference pseudogene.
- fissionfuser: fixed issue where apparent complementary fragments are combined despite one of the copies being non-pseudo (#66). Although the issue severity was mitigated in that the combined annotation would be rejected in favor of RATT's, the second gene would have been lost.
- annomerge: reject the
source
feature from RATT. Some final annotations contained two of them: one from RATT and one from Prokka. - Fixed issues that occur when sequence IDs contain "|" character (#62).
- Fixed handling of situations where either RATT or Prokka find no annotations.
- Fixed problem with redundant fusion gene name components when detected using both RATT and Prokka.
- Improved detection of gene fusions due to adjustments of alignment internal gap extension penalty and refined delayed-stop calling criteria.
Housekeeping¶
- Changed default generic ORF prefix to "HYBRA" for greater clarity.
- set default output directory to current directory.
- Now ignoring warnings when generating translations for pseudogenes that aren't multiples of three.
- Enabled setting any of RATT's configured transfer types and fixed names for *.global parameter sets.
Version 1.6.1¶
Bugs fixed¶
- Fixed scenario where hybran crashes when coordinate correction matches to an adjacent locus and attempts correction (#61).
- Added a temporary measure to inclusion criteria to penalize CDSs lacking stop codons (inadequately postprocessed) (#60).
Version 1.6¶
Enhancements¶
- Massive streamlining of the pipeline. Reworked components into new subsystems:
pseudoscan
: identification of anomalous copies of reference genes using new criteria independent of alignment coverage. (#50, #56, and #59)fissionfuser
(formerlyprocess_split_genes()
): improved detection and combining of gene fragments (that ab initio annotations tend to produce) into a single record.fusionfisher
: detection of gene fusion events and putative misannotations.thunderdome
: more aggressive conflict resolution between RATT and ab initio annotations.- Output GFF files no longer include the genome sequence.
Bugs fixed¶
- Fixed handling of conflicting annotations that are differently named (#57).
- Reimplemented coordinate correction and applied to ab initio ORFs as part of
pseudoscan
. This resolves many instances of falsepseudo
CDSs ab initio that were due simply to incorrect start coordinate predictions spuriously shortening the genes. - Fixed handling of compound intervals in reference annotations (#46, #47)
- Resolved issues involving reference annotations with multiple contigs/chromosomes (#48)
- Fixed issue with some gene name assignments being dropped later in the pipeline due to some obsolete code (#43).
- More comprehensive tracking of RATT/ab initio overlaps and conflicts (#49).
- Checking in-frame overlaps with pseudo ORFs containing internal stop codons
- Revamped postprocessing of RATT-introduced compound intervals (#44, #45)
- Updated inclusion criteria for special handling of pseudo ORFs (#42)
Version 1.5.2¶
Bugs fixed¶
- Made a consistent non-CDS policy for RATT: Take everything except rRNA and tRNA (#22)
- Clarified some rejection reasons for RATT/ab initio features.
- Fixed representation of blast results for CDSs when there aren't any hits
- Fixed issue with RATT handoff if sample/contig names contain
.
or|
. - Fixed issue with logging merged genes.
Version 1.5.1¶
Bugs fixed¶
- Fixed newly introduced issue with rejecting RATT annotations.
Housekeeping¶
- Removed some unused code.
Version 1.5¶
Bugs fixed¶
- Prevented overlapping RATT-transferred annotations from automatically being handled as conflicts, leading one of the two to be discarded (#39)
- Corrected distinguishing of ab initio vs reference-transferred annotation in final conflict resolution step (#38)
- Added logging of some missed cases of annotation rejections (for {ratt,prokka}_unused.tsv)
- Fixed the reference gene <=> locus_tag mapping dictionaries used in annomerge (#41)
- Made sure to track and process all ab initio annotations that overlap RATT-transferred CDSs (#40).
- Fixed handling of multi-fasta inputs (#35)
Enhancements¶
- Streamlined Prokka workflow (#33)
- Parallelized BLASTing to reference genes in annomerge.
- Set sequence names in the output annotation files.
Housekeeping¶
- Added exit status checks so pipeline fails as early as possible when things go wrong.
- Switched default evalue to Prokka's current setting of 1e-9 (was 1e-6)
- Slightly streamlined the RATT / Prokka comparison workflow
- Added more unit tests
Version 1.4.1¶
Bugs fixed¶
- Genes split into multiple adjacent fragments used to have a single /gene record but multiple CDS records with the same locus tag. For INSDC compliance, they now only have a single CDS record as well.
- Removed /translation fields for /pseudo CDSs.
Version 1.4¶
Enhancements¶
- Generalized for any prokaryote.
- Genetic code and taxonomy ID detected from reference annotation. RATT configuration is now automatically generated based on the detected genetic code, so a configuration file is no longer bundled.
- Now using "ORF" prefix for generic genes rather than "MTB" by default.
Option
--orf-prefix
added for customizability.
- Removed checking for dnaA as the first gene at the first base position.
- Made eggNOG-mapper step optional.
- Gene fragments are now identified using the corresponding reference gene names, but are distinguished with a /pseudo tag.
- RATT and (some) Prokka options are now under user control.
Bugs fixed¶
- Account for translationless CDSs that are labeled with the 'pseudo' qualifier instead of 'pseudogene'
- Allow input fasta files with alternative standard extensions.
- Fixed handling of reference annotations that may not have /gene qualifiers for all annotations.
- Fixed handling of input genome when it's the same as the reference.
- Set proper field from which to draw eggNOG-mapper annotations.
- Uniform locus tags are now assigned for every sample.
- Better identification of reference and unnamed genes in processing of clusters.
Version 1.3.1¶
Enhancements¶
- Hybran version now recorded in the genbank annotation header.
Bugs fixed¶
- Updated the Prokka reference proteome generation format to enable Prokka to set gene names and product fields rather than leaving it to the final clustering step.
- Fixed installation location of resource file.
Version 1.3.0¶
Enhancements¶
- Added
--dedupe-references
option to assign a single generic gene name to duplicate genes in the provided reference annotations. - Sequence identity and alignment coverage thresholds are no longer applied to RATT-transferred annotations by default. (#28)
The original behavior can be restored by passing the new
--filter-ratt
option. - reference annotations can now be passed as individual file names or file of file names, in addition to a directory name (#21)
Bugs fixed¶
- eggnog-mapper step no longer gets skipped (#23)
- alignment query coverage threshold is now applied directly in Prokka (#24)
- Dropped criterion of excluding hypothetical genes from Prokka-no-reference (#30)
- Fixed calculation of query and reference alignment coverage (#27).
- Corrected selection of top blastp hits for the one-to-one and one-to-many searches. When there were multiple hits in these cases, only the last one output by BLAST was being retained, which actually corresponds to the worst hit (by e-value). We now retain only the first hit.
Version 1.2.0¶
Enhancements¶
- Add thorough logging of gene annotations merged, rejected, and newly-named. (#19)
- Better tolerance of directory name inputs (#18)
- New option
-c
/--coverage-threshold
for tuning gene matching. The sequence identity threshold is now taken through-i
/--identity-threshold
. - Incorporate a RATT configuration file for using codons from translation table 11 (#17)
Bugs fixed¶
- use provided identity/coverage thresholds for all instances of BLAST, CD-HIT (#10)
- ensure that the final gff output file gets updated (#15)
- (#14)
Version 1.1.1¶
- Fixed issue with writing of merged_genes.gbk that sometimes caused hybran to crash during a run.
Version 1.1.0¶
- Renamed to Hybran
- Migrated to Python 3
- Removed limitation of 30 references
- Allowed sequence identity threshold to be user-defined
- Proper handling of temporary files
- Fixed issue preventing clustering step from running
Version 1.0¶
Initial release of AnnoTUB