Some (bacilli) like it hot: genomics of Geobacillus species

Biotechnology and Biological Sciences Research Council (BBSRC). Grant Numbers: BB/H016120/1, BB/I024631/1, BB/I025956/1, BB/K003240/2, BB/L012499/1

digesters of lignocellulose, bioremediators of hydrocarbons, producers of bio-fuel, cellular factories for heterologous expression of enzymes and as hosts for directed evolution (Wiegel et al., 1985; Niehaus et al., 1999; Couñago and Shamoo, 2005; Marchant et al., 2006; Cripps et al., 2009; Taylor et al., 2009; Tabachnikov and Shoham, 2013). Industrially important enzymes originating from Geobacillus spp. include lipases (Schmidt-Dannert et al., 1998), glycoside hydrolases (Fridjonsson et al., 1999; Bartosiak-Jentys et al., 2013; Suzuki et al., 2013), N-acylhomoserine lactonase (Seo et al., 2011), DNA polymerase I (Sandalli et al., 2009) and protease (Chen et al., 2004), among others. The advantages of using thermophilic bacteria as whole-cell biocatalysts were recently discussed in this journal (Taylor et al., 2011) and include reduced risk of contamination, acceleration of biochemical processes and easier maintenance of anaerobic conditions. These bacteria also tend to ferment a wide range of substrates, utilizing both cellobiose and pentose sugars. In the context of bioethanol production, there is the additional advantage of reduced cooling costs and easier removal and recovery of the volatile product by sparging or partial vacuum, which also avoids ethanol poisoning of the bacteria. Less positively, Geobacillus spp. are common contaminants in the dairy and food industries (Burgess et al., 2010).

Which genomes have been sequenced?
At the time of writing (28 July 2014), 29 Geobacillus genome sequences are available (Table 1). These cover all the major phylogenetic groups within the genus and include representatives of the species G. thermoleovorans, G. kaustophilus, G. thermocatenulatus, G. thermodenitrificans, G. stearothermophilus, G. caldoxylosilyticus and G. thermoglucosidans (formerly G. thermoglucosidasius), as well as several strains that have not been assigned to named species (Fig. 2). Genome sequences are also available for some other thermophilic members of the Bacillaceae, such as Paenibacillus lautus (Mead et al., 2012) and Bacillus coagulans (Xu et al., 2013) [...] bacteriophage (Marks and Hamilton, 2014), but these will not be discussed here. The team who sequenced the genome of Geobacillus sp. MAS1 described this strain as 'G. thermopakistaniensis', but this is not a validly named species and no justification was provided for its proposal as a new species (Siddiqui et al., 2014). On the basis of its recN sequence, a useful phylogenetic marker for Geobacillus spp. (Zeigler, 2005), strain MAS1 is closely related to the type strains of G. kaustophilus and G. thermoleovorans (Fig. 2). Strain NUB3621 was described as 'G. stearothermophilus' but, as has been previously noted (Studholme et al., 1999; Zeigler, 2005; Blanchard et al., 2014), this strain is phylogenetically distinct from G. stearothermophilus sensu stricto and is more closely related to G. caldoxylosilyticus and, to a lesser extent, G. thermoglucosidans (Fig. 2). For more than half of the sequenced genomes, papers have been published describing and/or announcing the sequence data and usually indicating the particular features of the strain that motivated its sequencing. An insightful discussion of the biological lessons from Geobacillus genomes was published earlier this year, including surveys of genes involved in breakdown of plant-derived lignocellulose (Zeigler, 2014); but at that time, only 10 genome sequences were available. The phylogenetic group within Geobacillus most richly represented by genome sequences is the clade containing G. thermoleovorans, G. kaustophilus and G. thermocatenulatus (see the 'kaustophilus clade' in Fig. 2).
Table 1. Names are given as found in the GenBank sequence database. n.a., not available.
Fig. 2. The circles indicate strains whose genomes have been sequenced, as listed in Table 1. The triangles indicate type strains of the various Geobacillus species; recN sequences from these are taken from a previous phylogenetic analysis by Zeigler (2005). The maximum-likelihood tree was generated using MEGA6 (Tamura et al., 2013).
Based solely on sequences of the recN phylogenetic marker, it is not possible to precisely resolve relationships among sequenced strains within this group (Fig. 2). However, the availability of complete genome sequence data enables phylogenetic analysis based on single-nucleotide variants over the entire core genome, offering much greater resolution (Fig. 3A). According to the core-genome-wide phylogenetic analysis, the two strains assigned as G. kaustophilus do not form a phylogenetically coherent monophyletic clade. On the other hand, the two strains of G. thermoleovorans are closely related and share 99.4% nucleotide sequence identity [based on MUMMER2 alignments (Delcher et al., 2002)]. Strain FW23 also appears to fall within this clade and, subject to phenotypic characterization, can probably be considered a member of this species too. Geobacillus thermocatenulatus GS-1 is much more divergent, sharing only 94% to 95% identity with the other strains in the clade, which is consistent with the recN-based analysis (Fig. 2). Strains Y412MC52 and Y412MC61 appear to be extremely closely related to each other, sharing 99.8% sequence identity and showing no detectable differences in gene content. Nucleotide sequence identities between clades are much lower; between G. kaustophilus and G. thermoglucosidans, there is approximately 84% identity.
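The two quantities used in this comparison, pairwise nucleotide identity and variant single-nucleotide sites in the core genome, can be illustrated with a minimal Python sketch. It assumes the sequences have already been aligned to a common length; the strain names and toy sequences are hypothetical, and the functions are illustrative only: they do not reproduce the MUMMER2 or NEIGHBORNET analyses cited in the text.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns that are absent from either genome
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def variant_sites(alignment: Dict[str, str]) -> List[int]:
    """0-based columns present in every genome (no gaps) where more than one base occurs."""
    seqs = [s.upper() for s in alignment.values()]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("alignment must be non-empty with equal-length sequences")
    sites = []
    for i, column in enumerate(zip(*seqs)):
        if "-" in column:
            continue  # not part of the core shared by all genomes
        if len(set(column)) > 1:
            sites.append(i)
    return sites


# Hypothetical toy alignment (not real Geobacillus data).
toy = {
    "strain_A": "ATGCCGTTAGCA",
    "strain_B": "ATGCCGTTAGCA",
    "strain_C": "ATGACGTTGGC-",
}

print(percent_identity(toy["strain_A"], toy["strain_B"]))
print(percent_identity(toy["strain_A"], toy["strain_C"]))
print(variant_sites(toy))
```

Concatenating the bases at the columns returned by `variant_sites` gives the kind of variant-site matrix that a network or tree-building program takes as input.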
The considerable amount of reticulation in the phylogenetic network (Fig. 3A) suggests significant horizontal genetic transfer within and among these species. This is further illustrated by the extent of variation in the variable component of the genome (Fig. 3B). Out of 3887 genes on the chromosome of G. thermoleovorans CCB US3 UF5, a total of 931 (approximately 24%) are variable (that is, they are absent from at least one of the other sequenced genomes). The global pattern of gene content (Fig. 3B) broadly reflects the phylogenetic relationships (Fig. 3A): according to gene content, the genomes fall into four main clusters, indicated by four different colours of shading in Fig. 3B, which correspond to four zones of the phylogenetic network, shaded with the same colours in Fig. 3A. However, there are numerous genes whose distribution across the genomes is incongruent with core-genome phylogeny, again suggesting extensive horizontal transfer.
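The definition of a variable gene used here (a reference gene absent from at least one other genome) and the presence/absence matrix behind a heat-map such as Fig. 3B can be sketched in a few lines of Python. The gene identifiers and genome names below are hypothetical, and the sketch assumes that orthologue assignment has already been done; it is not the pipeline used to produce the figure.

```python
from typing import Dict, List, Sequence, Set, Tuple


def split_core_variable(
    reference_genes: Sequence[str], other_genomes: Dict[str, Set[str]]
) -> Tuple[List[str], List[str]]:
    """Split reference genes into core (found in every other genome) and variable."""
    core, variable = [], []
    for gene in reference_genes:
        if all(gene in genes for genes in other_genomes.values()):
            core.append(gene)
        else:
            variable.append(gene)
    return core, variable


def presence_matrix(
    genes: Sequence[str], genomes: Dict[str, Set[str]]
) -> List[List[int]]:
    """One row per gene, one column per genome: 1 = present, 0 = absent."""
    return [[1 if gene in genomes[name] else 0 for name in genomes] for gene in genes]


# Hypothetical toy data: five reference genes and two comparison genomes.
reference = ["g1", "g2", "g3", "g4", "g5"]
others = {
    "genome_X": {"g1", "g2", "g3", "g5"},
    "genome_Y": {"g1", "g2", "g4", "g5"},
}

core, variable = split_core_variable(reference, others)
matrix = presence_matrix(variable, others)
print(core, variable, matrix)

# The proportion quoted in the text: 931 variable genes out of 3887.
variable_percent = 100.0 * 931 / 3887
print(round(variable_percent, 1))
```

Rows of `matrix` correspond to variable genes and columns to genomes, which is the layout a clustering heat-map function would expect.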

What benefits has the sequencing of Geobacillus genomes brought?
The availability of complete Geobacillus genome sequences has enabled or accelerated the discovery, cloning and exploitation of natural products. For example, the availability of the NG80-2 genome sequence (Feng et al., 2007) enabled the discovery of thermostable homologues of the lantibiotic nisin in G. thermodenitrificans (Begley et al., 2009; Garg et al., 2012), opening the possibility of replacing nisin as a food preservative and veterinary antibiotic with more-stable alternatives. Lantibiotics appear to be widely distributed among sequenced Geobacillus species. For example, the genome of G. kaustophilus HTA426 contains two lantibiotic-biosynthesis gene clusters (centred on the genes for YP_146139 and YP_146147) that are both conserved in the recently sequenced Geobacillus sp. CAMR12739. The NG80-2 genome sequence also enabled discovery of the first nitrous oxide reductase gene from a Gram-positive bacterium, and a novel thermophilic long-chain alkane monooxygenase (Feng et al., 2007). Furthermore, the genome sequence enabled proteomics-level confirmation of pathways for catabolism of long-chain alkanes (Feng et al., 2007) and aromatics (Li et al., 2012).
Many of the Geobacillus genome sequencing projects reported genes potentially encoding thermostable homologues of useful enzymes. In some cases, the genome sequences have been used to clone and express the genes of interest and characterize the enzyme for biotechnological potential. For example, the genome of G. kaustophilus HTA426 was recently mined for members of the glycoside hydrolase family 1, which have potential uses in synthesizing therapeutic oligosaccharides (Suzuki et al., 2013). The genome sequence of the alkane-utilizing G. thermoleovorans B23 (Boonmak et al., 2013) revealed a cluster of three long-chain alkane monooxygenase genes with homology to that of NG80-2 that showed activity in vivo when heterologously expressed in Pseudomonas fluorescens (Boonmak et al., 2014). Recently, a novel thermostable endo-xylanase was cloned and expressed from Geobacillus sp. WSUCF1 (Bhalla et al., 2014) following the sequencing of its genome (Bhalla et al., 2013).
Genome sequencing has revealed that interesting traits are often encoded on plasmids rather than on the chromosome. For example, the biphenyl-degrading pathway of Geobacillus sp. JF8 (Mukerjee-Dhar et al., 2005; Shintani et al., 2014) and the long-chain alkane monooxygenase of G. thermodenitrificans NG80-2 (Feng et al., 2007) are both located on plasmids. The dynamic loss and gain of such mobile elements presumably explains, in part, the physiological differences between natural isolates of Geobacillus spp., and it also suggests that these bacteria might be engineered to express new traits by introduction of recombinant plasmids. Indeed, progress has been made in developing plasmid shuttle vectors for heterologous expression in Geobacillus spp. (Thompson et al., 2008; Bartosiak-Jentys et al., 2013).
Fig. 3. Relationships among sequenced genomes within the G. kaustophilus clade resolved using whole-genome sequence data. The phylogenetic network in panel A was based on a concatenation of 1722 variant single-nucleotide sites in 1 874 967 nucleotides of the core genome present in all 15 genomes. The network was generated using the NEIGHBORNET algorithm (Bryant and Moulton, 2004) implemented in the SPLITSTREE software package (Huson, 1998). The heat-map in B indicates the presence (dark blue) and absence (light blue) of each of 931 non-core genes from the genome of G. thermoleovorans CCB US3 UF5 across the same 15 genomes appearing in A. The gene-content clusters are shaded in the same colours in both panels. The heat-map was rendered using Raivo Kolde's pheatmap package in R (R Development Core Team, 2013).

The value of genome sequencing goes beyond cataloguing potentially useful enzymes, as exemplified by the recently published genomic study of strain NUB3621 (Blanchard et al., 2014). Some previous attempts to fully exploit the potential of Geobacillus strains as whole-cell catalysts have been frustrated by the paucity of genetic and genomic resources (my own PhD research project in the mid-1990s being a case in point; Studholme, 1998). However, strain NUB3621 is a promising laboratory workhorse strain. It is one of the few Geobacillus strains that have been shown to be readily transformable with plasmid DNA (Wu and Welker, 1989); protocols have been developed for genetic analysis (Chen et al., 1986) and a genetic map has been available for more than two decades (Vallier and Welker, 1990). Strain NUB3621 is a mutant derived from wild-type strain NUB36 that lacks its parent strain's restriction-modification system, and this probably contributes to transformation efficiency. Incidentally, and consistent with this, we observed that transformation efficiency was significantly affected by the methylation status of the plasmid DNA (Thompson et al., 2008).
Being one of the most genetically amenable Geobacillus strains, NUB3621 was obviously a high priority for genome sequencing. But rather than simply announcing and describing its genome sequence, the authors went on to show how the genome sequence could be exploited to further develop the strain as a host for heterologous expression and metabolic engineering (Blanchard et al., 2014). Specifically, they used the genome sequence to clone two promoters and incorporated them into plasmid vectors: one for inducible gene expression and one for constitutive expression. The authors also mention that they tried other promoters that did not work so well; presumably, the availability of the genome sequence allowed them to screen a number of candidates relatively quickly until they found the best ones. The combination of a genome sequence, allowing relatively facile construction of expression and/or knock-out constructs and a global view of metabolism, along with transformability and a wide range of growth temperatures [between 39 and 75°C (Wu and Welker, 1991)], makes NUB3621 a strong candidate as the preferred thermophilic host for rationally designed metabolic engineering.

What's next?
The availability of complete (or nearly complete) genome sequences for nearly 30 Geobacillus strains (Table 1) as well as large-scale proteomic data for at least one (Feng et al., 2007; Li et al., 2012) should certainly accelerate cloning, expression and characterization of novel thermostable and thermo-active enzymes, at least in an academic research context. However, there has been relatively little industrial uptake of enzymes from thermophiles, with much greater use of proteins originating from mesophiles but engineered for thermo-stability (Haki and Rakshit, 2003; Taylor et al., 2011). The convergence of genomic data and transformability, at least for strain NUB3621, should help to remove the barriers to greater exploitation of thermophiles. However, genome sequences are not yet publicly available for the handful of other readily transformable Geobacillus strains such as G. thermodenitrificans K1041 (Narumi et al., 1992), G. stearothermophilus IFO 12550 (Imanaka et al., 1982), NRRL 1174 (Liao et al., 1986) and G. thermoglucosidasius TN (Thompson et al., 2008). Furthermore, although it is possible to predict the metabolic networks of bacteria from complete genome sequences, there is a need for comprehensive testing of these predictions through metabolomics. Only then can we rationally design genetic interventions to predictably manipulate metabolism. And finally, palaeo-genomics of ancient Geobacillus spores, which may be viable after millions of years of dormancy, might shed light on population-genetics and evolutionary processes over timescales that we previously assumed to be intractable (Nicholson, 2003; Zeigler, 2014).