|
Graphical Abstract
|
|
Figure 1. Introduction to the Helitron-like element (HLE) groups. (A) Features of HLEs. Autonomous elements are pictured on top, with their non-autonomous derivatives just below. Non-autonomous HLEs share identical left terminal sequences (LTS) and right terminal sequences (RTS) with their autonomous counterparts. All autonomous HLEs encode the Rep (light blue) and Helicase (orange) domains. HLE1s may also carry the RPA domain (green), and HLE2s may carry the EN domain (blue), the RPA domain, and the OTU domain (grey). HLE1s usually insert between A and T nucleotides, while HLE2s usually insert between T and T nucleotides. The scale of this scheme is relative. (B) Maximum likelihood tree of HLE transposases from Repbase (Log Lk = −153572.880). The clade highlighted in red corresponds to the HLE1 group, and the clade highlighted in orange corresponds to the HLE2 group (including Helitron2 and Helentron). As an outgroup, we used a sequence made by concatenating the geminivirus catalytic Rep protein and the helicase protein of Myroides phaeus. Blue dots on the tree branches indicate bootstrap values > 0.8.
|
|
Figure 2. Workflow of HELIANO. (A) HMM searches for transposases of HLEs; dm denotes the distance between the Rep and Helicase domains. (B) Scan for significantly co-occurring LTS–RTS pairs; w indicates the length of the sequences flanking the RepHel domain; dn denotes the distance between LTS and RTS. (C) Filtering to obtain representative insertions and build their consensus sequences.
|
|
Figure 3. Benchmarking analysis of HELIANO. (A) Schematic representation of benchmarking metrics. TP: true positive; FN: false negative; FP: false positive. (B) Comparison of benchmarking metrics for all tested software. F1 is the harmonic mean of precision and sensitivity (recall). FDR: false discovery rate.
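The metrics compared in panel (B) can be sketched from the TP, FP and FN counts as follows. This is a minimal illustration that assumes the standard definitions (sensitivity = TP/(TP+FN), precision = TP/(TP+FP), FDR = FP/(TP+FP)); the function name `benchmark_metrics` is hypothetical, and how predictions are matched to the reference to obtain the counts is not covered here.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute sensitivity, precision, FDR and F1 from raw counts.

    Assumes TP, FP and FN have already been tallied; standard
    definitions are used, not necessarily the authors' exact procedure.
    """
    predicted = tp + fp          # everything the tool reported
    actual = tp + fn             # everything in the reference set
    sensitivity = tp / actual if actual else 0.0
    precision = tp / predicted if predicted else 0.0
    fdr = fp / predicted if predicted else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity (recall).
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "fdr": fdr,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(benchmark_metrics(tp=80, fp=20, fn=20))
```

Note that FDR is simply 1 − precision whenever at least one prediction is reported, so a tool with a low FDR and a high sensitivity necessarily has a high F1.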
|
|
Figure 4. Multiple alignments of selected HLE1 and HLE2 insertions detected by HELIANO in Xenopus and Oryza sativa genomes. (A) Structure of the X. tropicalis autonomous HelenXT233. Two alternative LTSs were detected: LTS1 and LTS2. (B) Sequence alignment of LTS1 and LTS2 from the autonomous X. tropicalis HelenXT233. (C, D) Multiple alignments of insertions from the HelenXT233 families: (C) HelenXT233-1 and (D) HelenXT233-2. (E) Example of a multiple alignment of X. laevis HLE1 insertions. (F, G) Examples of multiple alignments of HLE1 insertions in the O. sativa genome. The autonomous insertions are labelled ‘auto’ in each multiple alignment; the others are their non-autonomous counterparts. Nucleotides highlighted in purple show the starts and stops predicted by HELIANO. The red down arrows indicate the precise insertion sites based on manual curation, using as a rule that HLE2s insert between T and T nucleotides and HLE1s between A and T nucleotides. Note the precise correspondence between the HELIANO annotation and the manual curation for HLE1 in (E–G) and the differences for HLE2 in (C) and (D). The horizontal black arrows indicate terminal inverted repeats and stem-loop structures.
|
|
Figure 5. Reannotation of HELIANO predictions with known TEs. (A) Proportion of HELIANO predictions according to their copy number. (B) Copy number and size of HELIANO predictions reannotated as ‘HLE’ (pink), ‘OtherTE’ (brown) and ‘unannotated’ (cyan). Groups a and b in red circles indicate ambiguous annotations. Bold numbers inside each plot indicate the total number of copies.
|
|
Figure 6. Distribution of HLEs among 404 eukaryote genomes. (A) A species tree obtained from NCBI indicates the phylogenetic relationships of the sampled genomes. The fraction in each bracket represents the ratio of the number of species with HLEs to the number of all sampled species in a particular class. The heatmap indicates the presence (red) and absence (grey) of HLE groups in each sampled class. (B) The scatter plot shows the number of detected HLEs in each sampled class. Each point represents the number of HLEs of the corresponding group in a species. The fraction in each bracket represents the ratio of the number of species with HLEs to the number of all species in a particular class. The y-axis is log10-transformed. Species icons were created with BioRender.com.
|
|
Figure 7. Distribution of the distance between RepHel and additional protein domains in HLEs. The zero value on the x-axis indicates the position of the RepHel domains; negative values indicate that the corresponding domains are upstream of RepHel, and positive values indicate that they are downstream of RepHel. The y-axis shows the count of HLEs.
|
|
Figure 8. Distribution of HLEs and their captured domains in eukaryote genomes. (A) Maximum likelihood tree of HLE transposases from sampled species (Log Lk = −8062114.874). The HLE2 (light yellow block) and HLE1 (light red block) groups were further classified into subgroups: a–d for HLE2 and e–i for HLE1. Unclassified HLEs are in grey. The annotation below the tree entitled Type indicates the classification and source of HLEs. HLEs from Repbase are marked in red, and those marked in black are HLEs misclassified by HELIANO. The annotation entitled Host represents the species of origin of the HLEs. (B) The heatmap shows the presence or absence of additional domains in each corresponding HLE. Red indicates the presence of the domain, and light blue indicates its absence.
|
|
Supplementary Figure S1. Features of HLEs collected from Repbase. (A) Distribution of HLE lengths. (B) Distribution of the distance between the 3' end of the Rep domain and the 5' end of the Helicase domain. (C) Distribution of the distance between the 5' end of the LTS and the 5' end of the Rep domain. (D) Distribution of the distance between the 3' end of the Helicase domain and the 3' end of the RTS. Red vertical lines indicate the values used as default parameters in HELIANO. Lengths are expressed in bp.
|
|
Supplementary Figure S2. Phylogenetic relationships of HLE Rep domains. The clade highlighted in red encompasses the HLE1 group, and the clade highlighted in orange encompasses the HLE2 group. The Rep catalytic protein sequence of geminivirus (WP_015060107.1) was set as an outgroup. This maximum likelihood tree of Rep domains has a Log Lk = −66062.678. Only bootstrap values greater than 0.8 are shown.
|
|
Supplementary Figure S3. Phylogenetic relationships of HLE Helicase domains. The clade highlighted in red encompasses the HLE1 group, and the clade highlighted in orange encompasses the HLE2 group. The Myroides phaeus helicase protein sequence (WP_090404604.1) and the Candidatus Collierbacteria PIF1 helicase (KKT34677.1) were set as outgroups. This maximum likelihood tree of Helicase domains of HLEs from Repbase has a Log Lk = −108550.922. Only bootstrap values greater than 0.8 are shown.
|
|
Supplementary Figure S4. Maximum likelihood tree of transposases of HLE1 insertions in the Oryza sativa genome (Log Lk = −64380.460). Sequences labelled in red indicate transposases extracted from Repbase and present in the O. sativa genome as complete copies. Sequences labelled in pink indicate transposases extracted from Repbase and absent from the O. sativa genome as complete copies. Sequences labelled in black indicate transposases extracted from the HELIANO annotation. The outgroup (Outgroup_RepHel, in grey) was created by concatenating the geminivirus Rep catalytic protein and the helicase protein of Myroides phaeus.
|
|
Supplementary Figure S5. A schematic diagram showing examples of nested TEs in HLEs. Nested TEs are highlighted in red, and HLEs' ends are shown in blue (known HLEs) or grey (novel prediction). The relative position of nested TEs is scaled.
|
|
Supplementary Figure S6. Impact of selected parameters on the reannotation of HELIANO predictions. (A) Parameter groups and the corresponding full parameters used for HELIANO (v1.2.0). The three parameters represent the insertion site preference ('is0' for unrestricted and 'is1' for restricted), the distance between LTS and RTS ('dn6k' for limiting the distance to less than 6000 bp and 'dn0' for no fixed limit, in which case the distance is deduced automatically), and the pairing strategy ('near' for pairing the closest LTS and RTS and 'far' for pairing the furthest). (B) Proportion of HELIANO predictions categorised as 'HLE', 'OtherTE', and 'unannotated' in the reannotation process. The y-axis indicates the HELIANO parameter group used, named as in (A).
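The 'near' and 'far' pairing strategies and the LTS–RTS distance limit described above can be illustrated with a small sketch. This is not HELIANO's implementation: the function `pair_termini`, its arguments and the coordinate convention (a single strand, LTS upstream of RTS) are assumptions made purely for illustration.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def pair_termini(
    lts_positions: Sequence[int],
    rts_positions: Sequence[int],
    strategy: str = "near",
    max_dist: Optional[int] = None,
) -> Optional[Tuple[int, int]]:
    """Pick one (LTS, RTS) pair among candidate positions (hypothetical helper).

    strategy='near' keeps the pair with the shortest span, 'far' the
    longest; max_dist (e.g. 6000 for 'dn6k') discards spans that are not
    shorter than the limit, and None (as for 'dn0') applies no fixed limit.
    """
    if strategy not in ("near", "far"):
        raise ValueError("strategy must be 'near' or 'far'")
    # Keep only pairs where the RTS lies downstream of the LTS
    # and the span respects the optional distance limit.
    candidates = [
        (lts, rts)
        for lts, rts in product(lts_positions, rts_positions)
        if rts > lts and (max_dist is None or rts - lts < max_dist)
    ]
    if not candidates:
        return None

    def span(pair: Tuple[int, int]) -> int:
        return pair[1] - pair[0]

    return min(candidates, key=span) if strategy == "near" else max(candidates, key=span)


if __name__ == "__main__":
    lts, rts = [100, 500], [3000, 9000]  # made-up coordinates
    print(pair_termini(lts, rts, "near"))
    print(pair_termini(lts, rts, "far"))
    print(pair_termini(lts, rts, "far", max_dist=6000))
```

In this toy setting, 'far' without a limit can span unrelated sequence between two distant termini, whereas adding the 6000 bp limit restricts it to shorter candidates; this is the kind of trade-off the parameter groups in panel (A) explore.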
|
|
Supplementary Figure S7. Variation of genome size (A) and GC content (B) of the 404 sampled eukaryote genomes.
|
|
Supplementary Figure S8. Example of HLE2s in land plants. (A) Multiple alignments of selected HLE2 insertions of HelenSM92 family in Sphagnum magellanicum genome. (B, C, D) Schemes showing the conserved domains for three HLE2 insertions of the HelenSM92 family. The GIY-YIG domains are highlighted in red rectangles, and the HLE Rep-Hel domains are highlighted in orange. The image was obtained from the Conserved Domain Database (CDD) search tool output (Lu et al. 2020).
|
|
Supplementary Figure S9. Domain organization of canonical HLE2 Rep/Hel proteins. (A) Domain organization of an HLE2 Rep/Hel protein containing Herpes_teg_N, DUF6570 and EN domains. (B) Multiple alignment of Herpes_teg_N domain sequences of HLE2s from different species. (C) Domain organization of an HLE2 Rep/Hel protein containing an OTU domain. (D) Multiple alignment of OTU sequences of HLE1s and HLE2s from different species. Asterisks indicate conserved motifs.
|
|
Supplementary Figure S10. The Herpes_teg_N domain in an HLE2 of Dreissena polymorpha. (A) Multiple alignments of selected HLE2 insertions of the HelenUD547 family. (B) Conserved domains of one HLE2 insertion of the HelenUD547 family (CM035916.1:c49143025-49127369). The image was obtained from the Conserved Domain Database (CDD) search tool output (Lu et al. 2020).
|
|
Supplementary Figure S11. The Herpes_teg_N domain in HLE2 of Oreochromis niloticus. (A) Multiple alignments of selected HLE2 insertions of the HelenON151 family. (B) Conserved domains of one HLE2 insertion of the HelenON151 family (CM007482.2:27245403-27261371). The image was obtained from the Conserved Domain Database (CDD) search tool output (Lu et al. 2020).
|