
Biosynthetic Potential of the Global Marine Microbiome

Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored groups of organisms1, this diversity holds rich potential for the discovery of ecologically and biotechnologically relevant enzymes and biochemical compounds2,3. However, studying this diversity to identify the genomic pathways that synthesize such compounds and to assign them to their respective hosts remains a challenge. The biosynthetic potential of microorganisms in the open ocean remains largely unknown owing to limitations in the analysis of genome-resolved data at a global scale. Here, we explore the diversity and novelty of biosynthetic gene clusters in the ocean by integrating approximately 10,000 microbial genomes from cultures and single cells with more than 25,000 newly reconstructed draft genomes from over 1,000 seawater samples. These efforts identified about 40,000 putative, predominantly novel, biosynthetic gene clusters, some of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters (‘Candidatus Eudoremicrobiaceae’) that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. In conclusion, this study demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.
Microbes drive global biogeochemical cycles, sustain food webs and keep plants and animals healthy5. Their enormous phylogenetic, metabolic and functional diversity represents a rich potential for the discovery of new taxa, enzymes and biochemical compounds, including natural products. In ecological communities, these molecules provide microorganisms with a variety of physiological and ecological functions, from communication to competition2,7. Beyond their original functions, these natural products and their genetically encoded production pathways provide examples for biotechnological and therapeutic applications2,3. The identification of such pathways and compounds has been greatly facilitated by the study of cultured microorganisms. However, taxonomic surveys of natural environments show that the vast majority of microbial life has not yet been cultivated. This cultivation bias limits our ability to tap into much of the functional diversity encoded by microorganisms4,9.
To overcome these limitations, technological advances over the past decade have allowed researchers to sequence microbial DNA fragments directly (that is, without prior cultivation) from entire communities (metagenomics) or from individual cells. The ability to assemble these fragments into larger genomic fragments and to reconstruct multiple metagenome-assembled genomes (MAGs) or single amplified genomes (SAGs), respectively, provides the basis for taxon-centric studies of the microbiome in a given environment10,11,12. Indeed, recent studies have greatly expanded the phylogenetic characterization of microbial diversity on Earth1,13 and have shown that much of the functional diversity in different microbiomes was not previously captured by reference genome sequences of cultured microorganisms (REFs)14. The ability to place undiscovered functional diversity in the context of the host genome (that is, genome resolution) is critical for predicting as yet uncharacterized microorganisms that putatively encode novel natural products15,16, or for tracing such compounds back to their original producers. For example, a combined metagenomic and single-cell genomic approach led to the identification of ‘Candidatus Entotheonella’, a group of metabolically rich, sponge-associated bacteria, as producers of several drug candidates. However, despite recent efforts towards genome-resolved surveys of various microbiomes16,19, more than two-thirds of the global metagenomic data for the ocean, the largest ecosystem on Earth, remain unaccounted for16,20. Thus, the biosynthetic potential of the marine microbiome in general, and its potential as a repository for new enzymology and natural products, remain largely underexplored.
To explore the biosynthetic potential of the marine microbiome on a global scale, we first combined marine microbial genomes obtained using culture-dependent and culture-independent methods to create an extensive database of phylogenomics and gene function. Exploring this database, we found a wide variety of biosynthetic gene clusters (BGCs), most of which belong to as yet uncharacterized gene cluster families (GCFs). In addition, we identified an unknown bacterial family that exhibits the highest BGC diversity known to date in the open ocean. We chose two ribosomally synthesized and post-translationally modified peptide (RiPP) pathways for experimental validation because of their genetic divergence from currently known pathways. The functional characterization of these pathways revealed unexpected examples of enzymology as well as structurally unusual compounds with protease inhibitory activity.
First, we aimed to create a global genome-resolved data resource focusing on the bacterial and archaeal components of the ocean microbiome. To this end, we pooled metagenomic data from 1,038 seawater samples collected at 215 globally distributed sampling sites (latitudinal range = 141.6°) and several depth layers (from 1 to 5,600 m in depth, covering the epipelagic, mesopelagic and bathypelagic zones)21,22,23 (Fig. 1a, Extended Data Fig. 1a and Supplementary Table 1). In addition to providing broad geographical coverage, these size-selectively filtered samples enabled us to compare different components of the ocean microbiome, including virus-enriched (<0.2 μm), prokaryote-enriched (0.2–3 μm), particle-enriched (0.8–20 μm) and virus-depleted (>0.2 μm) communities.
a, A total of 1,038 publicly available marine microbial community genomes (metagenomes) were collected from 215 globally distributed sites (62° S to 79° N and 179° W to 179° E). Map tiles © Esri. Sources: GEBCO, NOAA, CHS, OSU, UNH, CSUMB, National Geographic, DeLorme, NAVTEQ and Esri. b, These metagenomes were used to reconstruct MAGs (Methods and Supplementary Information), which differed in quantity and quality (Methods) across datasets (color coded). The reconstructed MAGs were complemented with publicly available (external) genomes, including manually curated MAGs26, SAGs27 and REFs27, to compile the OMD. c, The OMD improved the genomic representation of marine microbial communities (metagenomic read mapping rate; Methods) by a factor of two to three, with more consistent representation across depth and latitude. <0.2, n = 151; 0.2–0.8, n = 67; 0.2–3, n = 180; 0.8–20, n = 30; >0.2, n = 610; <30°, n = 132; 30–60°, n = 73; >60°, n = 42; EPI, n = 174; MES, n = 45; BAT, n = 28. d, Grouping the OMD into species-level clusters (95% average nucleotide identity) identified a total of about 8,300 species, more than half of which had not previously been characterized according to taxonomic annotation using the GTDB (release 89). e, The breakdown of species by genome type revealed a high degree of complementarity between MAGs, SAGs and REFs in reflecting the phylogenetic diversity of the marine microbiome. In particular, 55%, 26% and 11% of the species were specific to MAGs, SAGs and REFs, respectively. BATS, Bermuda Atlantic Time-series Study; GEM, Genomes from Earth’s Microbiomes; GORG, Global Ocean Reference Genomes; HOT, Hawaiian Ocean Time-series.
Using this dataset, we reconstructed a total of 26,293 MAGs that were predominantly bacterial and archaeal (Fig. 1b and Extended Data Fig. 1b). We created these MAGs from assemblies of individual rather than pooled metagenomic samples to prevent the collapse of natural sequence variation between samples from different locations or time points (Methods). In addition, we grouped genomic fragments according to their abundance correlations across a large number of samples (from 58 to 610 samples, depending on the survey; Methods). We found that this time-consuming but important step24, which was skipped in several large-scale MAG reconstruction efforts16,19,25, significantly improved both the quantity (on average, 2.7-fold) and the quality (on average, +20%) of the genomes reconstructed from the marine metagenomes studied here (Extended Data Fig. 2a and Supplementary Information). Overall, these efforts resulted in a 4.5-fold increase in the number of microbial MAGs from seawater (6-fold if only high-quality MAGs are considered) compared with the most comprehensive MAG resource available today16 (Methods). This set of newly created MAGs was then combined with 830 manually curated MAGs26, 5,969 SAGs27 and 1,707 REFs27 of marine bacteria and archaea, yielding a combined collection of 34,799 genomes (Fig. 1b).
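The genome totals quoted above can be checked with a short tally. The sketch below only re-adds the four counts reported in the text; the dictionary keys are illustrative labels, not identifiers used by the authors.

```python
# Genome counts as reported in the text, keyed by illustrative labels.
genome_counts = {
    "new_MAGs": 26_293,      # MAGs reconstructed in this study
    "curated_MAGs": 830,     # manually curated external MAGs
    "SAGs": 5_969,           # single amplified genomes
    "REFs": 1_707,           # reference genomes of cultured isolates
}

total_genomes = sum(genome_counts.values())

# Share of the combined collection that consists of MAGs of either origin.
mag_share = (genome_counts["new_MAGs"] + genome_counts["curated_MAGs"]) / total_genomes

print(f"total genomes: {total_genomes:,}")
print(f"share of MAGs: {mag_share:.1%}")
```

The sum reproduces the 34,799 genomes stated in the text.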
We then assessed the newly created resource for its ability to represent marine microbial communities and evaluated the impact of integrating different genome types. On average, we found that it covers approximately 40–60% of marine metagenomic data (Fig. 1c), which is two to three times the coverage of previous reports based solely on MAGs16 or SAGs20, with more consistent representation across both depth and latitude. In addition, to obtain a systematic assessment of taxonomic diversity in the established collection, we annotated all genomes using the Genome Taxonomy Database (GTDB) toolkit (Methods) and grouped them using a whole-genome average nucleotide identity cutoff of 95%28 to identify 8,304 species-level clusters (species). Two-thirds of these species (including new clades) had not previously appeared in the GTDB, of which 2,790 were discovered using the MAGs reconstructed in this study (Fig. 1d). In addition, we found that the different genome types are highly complementary: 55%, 26% and 11% of species are composed entirely of MAGs, SAGs and REFs, respectively (Fig. 1e). Furthermore, MAGs cover all 49 phyla found in the water column, whereas SAGs and REFs represent only 18 and 11 of them, respectively. However, SAGs better represent the diversity of the most common clades (Extended Data Fig. 3a), such as the order Pelagibacterales (SAR11), for which SAGs cover nearly 1,300 species and MAGs only 390 species. Notably, REFs rarely overlapped with either MAGs or SAGs at the species level and represented >95% of the approximately 1,000 genomes that were not detected in the set of open ocean metagenomes studied here (Methods), mostly owing to representatives that were isolated from other types of marine samples (such as sediments or host-associated samples). To make it widely available to the scientific community, this marine genomic resource, which also includes unbinned fragments (for example, from predicted phages, genomic islands and genomic fragments for which there are not enough data for MAG reconstruction), can be accessed, together with taxonomic and gene functional annotations and contextual environmental parameters, in the Ocean Microbiomics Database (OMD; https://microbiomics.io/ocean/).
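The species-level grouping described above can be illustrated with a minimal sketch. It assumes single-linkage clustering over a table of pairwise ANI values, which is only one simple way to apply a 95% cutoff; the procedure actually used is described in the paper's Methods and may differ. The genome names and ANI values are hypothetical toy data.

```python
from collections import defaultdict

def cluster_by_ani(genomes, pairwise_ani, threshold=95.0):
    """Single-linkage clustering: genomes linked by any ANI >= threshold share a cluster."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), ani in pairwise_ani.items():
        if ani >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(list)
    for g in genomes:
        clusters[find(g)].append(g)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical genomes; the prefix encodes the genome type (MAG, SAG or REF).
genomes = ["MAG_1", "MAG_2", "SAG_1", "REF_1", "REF_2"]
pairwise_ani = {
    ("MAG_1", "MAG_2"): 98.7,
    ("MAG_2", "SAG_1"): 96.1,
    ("MAG_1", "REF_1"): 82.4,
    ("REF_1", "REF_2"): 93.0,
}

species = cluster_by_ani(genomes, pairwise_ani)

# Label each species by whether it is made up of a single genome type.
labels = []
for members in species:
    types = sorted({name.split("_")[0] for name in members})
    labels.append(f"{types[0]}-only" if len(types) == 1 else "mixed")

for members, label in zip(species, labels):
    print(members, label)
```

Counting the `-only` labels per genome type over a real collection would give the kind of complementarity figures quoted in the text (species composed entirely of MAGs, SAGs or REFs).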
We then set out to explore the richness and novelty of the biosynthetic potential in the open ocean microbiome. To this end, we first used antiSMASH on all MAGs, SAGs and REFs detected in the 1,038 marine metagenomes (Methods) to predict a total of 39,055 BGCs. We then grouped them into 6,907 non-redundant GCFs and 151 gene cluster clans (GCCs; Supplementary Table 2 and Methods) to account for the inherent redundancy (that is, the same BGC can be encoded in multiple genomes) and fragmentation of BGCs in metagenomic datasets. Incomplete BGCs did not substantially inflate, if at all (Supplementary Information), the numbers of GCFs and GCCs, which contained at least one complete BGC member in 44% and 86% of cases, respectively.
At the GCC level, we found a wide variety of predicted RiPPs and other natural products (Fig. 2a). Among them, arylpolyenes, carotenoids, tetrahydropyrimidines and siderophores belong to GCCs with a wide phylogenetic distribution and high abundance in marine metagenomes, which may indicate a broad adaptation of microorganisms to the marine environment, including resistance to reactive oxygen species, oxidative and osmotic stress, or iron uptake (Supplementary Information). This functional diversity contrasts with a recent analysis of approximately 1.2 million BGCs from approximately 190,000 genomes stored in the NCBI RefSeq database (BiG-FAM/RefSeq, hereafter RefSeq)29, which was dominated by nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) BGCs (Supplementary Information). We also found that 44 (29%) GCCs were only distantly related to any RefSeq BGCs (\(\bar{d}_{\mathrm{RefSeq}}\) > 0.4; Fig. 2a and Methods), and 53 (35%) GCCs were encoded only in MAGs, highlighting the potential for discovery of previously undescribed chemistry within the OMD. Given that each of these GCCs likely represents a highly diverse array of biosynthetic functions, we further analyzed the data at the GCF level in an effort to provide a more detailed grouping of BGCs that are expected to code for similar natural products. A total of 3,861 (56%) of the identified GCFs did not overlap with RefSeq and >97% of the GCFs were not represented in MIBiG, one of the most extensive databases of experimentally validated BGCs30 (Fig. 2b). Although the discovery of many potentially novel pathways in settings that are poorly represented by reference genomes is not unexpected, our approach of dereplicating BGCs into GCFs before comparative analysis differs from previous reports and allows an unbiased assessment of novelty. Most of the new diversity (3,012 GCFs, or 78%) corresponded to predicted terpenes, RiPPs or other natural products, and a large share (1,815 GCFs, or 47%) was encoded in phyla that are not commonly known for their biosynthetic potential. Unlike PKS and NRPS clusters, these compact BGCs are less likely to be fragmented during metagenomic assembly and are more amenable to the time-consuming and resource-intensive functional characterization of their products.
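The distance-based notion of novelty used above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each BGC is represented by a numeric feature vector, that distance is cosine distance, and that a GCF's \(\bar{d}\) is the mean, over its member BGCs, of the distance to the closest reference BGC. The vectors and GCF names are hypothetical; only the 0.2 cutoff is taken from the text.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_min_distance(gcf_members, reference_bgcs):
    """Average, over a GCF's BGCs, of the distance to the closest reference BGC."""
    closest = [
        min(cosine_distance(member, ref) for ref in reference_bgcs)
        for member in gcf_members
    ]
    return sum(closest) / len(closest)

NOVELTY_THRESHOLD = 0.2  # GCF-level cutoff quoted in the text

# Hypothetical feature vectors (for example, counts of protein domains).
reference_bgcs = [[1, 0, 0, 1], [0, 1, 1, 0]]
gcfs = {
    "GCF_a": [[1, 0, 0, 1], [2, 0, 0, 1]],  # close to the first reference BGC
    "GCF_b": [[0, 0, 1, 1], [0, 0, 1, 2]],  # far from both reference BGCs
}

novel = {
    name: mean_min_distance(members, reference_bgcs) > NOVELTY_THRESHOLD
    for name, members in gcfs.items()
}
print(novel)
```

Because clusters are dereplicated into GCFs before the comparison, each family contributes once to the novelty count regardless of how many genomes encode it.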
A total of 39,055 BGCs were clustered into 6,907 GCFs and 151 GCCs. a, Data layers (inner to outer). Hierarchical clustering of the GCCs based on BGC distances, 53 of which were captured by MAGs alone. GCCs include BGCs from different taxa (ln-transformed phylum frequency) and different BGC classes (circle sizes correspond to their frequencies). For each GCC, the outer layers represent the number of BGCs, the prevalence (percentage of samples) and the distance (minimum cosine distance between BGCs (min(\(d_{\mathrm{MIBiG}}\)))) to BiG-FAM BGCs. GCCs with BGCs closely related to experimentally validated BGCs (MIBiG) are highlighted by arrows. b, Comparing GCFs to computationally predicted (BiG-FAM) and experimentally validated (MIBiG) BGCs uncovered 3,861 new (\(\bar{d}\) > 0.2) GCFs. Most of them (78%) coded for RiPPs, terpenes and other putative natural products. c, All genomes in the OMD detected in the 1,038 marine metagenomes were placed in the GTDB backbone tree to show the phylogenetic coverage of the OMD. Clades without any genomes in the OMD are shown in grey. The number of BGCs corresponds to the maximum predicted number of BGCs per genome in a given clade. For clarity, the last 15% of the nodes are collapsed. The arrows denote BGC-rich clades (>15 BGCs) with the exception of Mycobacteroides, Gordonia (next to Rhodococcus) and Crocosphaera (next to Synechococcus). d, An unknown species of ‘Ca. Eremiobacterota’ showed the highest biosynthetic diversity (Shannon index based on natural product type). Each bar represents the genome with the most BGCs in the species. T1PKS, PKS type I; T2/3PKS, PKS type II and III.
In addition to richness and novelty, we also explored the biogeographic structure of the biosynthetic potential of the marine microbiome. Grouping samples by their average metagenomic GCF copy number distribution (Methods) showed that low-latitude, epipelagic, prokaryote-enriched and virus-depleted communities, mainly from surface or deeper sunlit waters, were enriched in RiPP and terpene BGCs. In contrast, polar, deep-sea, virus-enriched and particle-enriched communities were associated with higher abundances of NRPS and PKS BGCs (Extended Data Fig. 4 and Supplementary Information). Finally, we found that the well-studied tropical and epipelagic communities were the most promising sources of new terpenes (Extended Data Fig. 5a,b), whereas the least studied communities (polar, deep-sea, virus-enriched and particle-enriched) had the greatest potential for the discovery of NRPS, PKS, RiPP and other natural products (Extended Data Fig. 5a).
To complement the study of the biosynthetic potential of the marine microbiome, we aimed to map its phylogenetic distribution and identify new BGC-rich clades. To this end, we placed the genomes of marine microbes into a normalized GTDB13 bacterial and archaeal phylogenetic tree and overlaid the putative biosynthetic pathways they encode (Fig. 2c). We readily detected in ocean water samples (Methods) several BGC-rich clades (representatives with >15 BGCs) that are either well known for their biosynthetic potential, such as Cyanobacteria (Synechococcus) and Proteobacteria (such as Tistrella)32,33, or that have recently garnered attention for their natural products, such as Myxococcota (Sandaracinaceae), Rhodococcus and Planctomycetota34,35,36. Interestingly, we found several previously unexplored lineages in these clades. For example, the species with the richest biosynthetic potential in the phyla Planctomycetota and Myxococcota belong to uncharacterized candidate orders and genera, respectively (Supplementary Table 3). Overall, this suggests that the OMD provides access to previously unknown phylogenomic information, including for microorganisms that may represent new targets for the discovery of enzymes and natural products.
In addition, we characterized BGC-rich clades not only by counting the maximum number of BGCs encoded by their members, but also by assessing the diversity of these BGCs, which accounts for the frequency of the different types of candidate natural products (Fig. 2c and Methods). We found that the most biosynthetically diverse species were represented by bacterial MAGs specifically reconstructed in this study. These bacteria belong to the uncultivated phylum ‘Candidatus Eremiobacterota’, which remains largely unexplored apart from a few genomic studies37,38. Notably, ‘Ca. Eremiobacterota’ species have only been analyzed in terrestrial environments39 and are not known to include any BGC-rich members. Here we initially reconstructed eight MAGs from the same species (with a nucleotide identity of >99%) from deep (between 2,000 m and 4,000 m) and particle-enriched (0.8–20 µm) ocean metagenomes collected by the Malaspina expedition23. We therefore propose to name the species ‘Candidatus Eudoremicrobium malaspinii’, after the nereid (sea nymph) of fine gifts in Greek mythology and after the expedition. According to the phylogenetic annotation, ‘Ca. E. malaspinii’ has no previously known relatives below the order level and thus belongs to a new bacterial family, for which we propose ‘Ca. E. malaspinii’ as the type species and ‘Ca. Eudoremicrobiaceae’ as the official name (Supplementary Information). The short-read metagenomic reconstruction of the ‘Ca. E. malaspinii’ genome was validated by ultra-low-input, long-read metagenomic sequencing and targeted assembly (Methods) of one sample into a single 9.63 Mb linear chromosome, with a 75 kb repeat as the only remaining ambiguity.
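The diversity measure mentioned above (a Shannon index based on natural product type) can be illustrated with a short sketch. It assumes the standard Shannon index with natural logarithms, computed over the BGC type labels of a single genome; the two genomes and their labels are hypothetical, and the exact formulation used by the authors is given in their Methods.

```python
import math
from collections import Counter

def shannon_index(bgc_types):
    """Shannon index H = sum(p_i * ln(1 / p_i)) over the BGC types of one genome."""
    counts = Counter(bgc_types)
    total = sum(counts.values())
    return sum((n / total) * math.log(total / n) for n in counts.values())

# Two hypothetical genomes with the same number of BGCs (ten each).
genome_a = ["terpene"] * 10                                  # many BGCs, one type
genome_b = ["terpene", "RiPP", "NRPS", "T1PKS", "other"] * 2  # five types, evenly spread

for name, types in [("genome_a", genome_a), ("genome_b", genome_b)]:
    print(name, len(types), "BGCs, H =", round(shannon_index(types), 3))
```

Both genomes would rank equally by BGC count alone, but the index separates them: a genome with a single BGC type scores zero, whereas an even spread over five types scores ln 5. This is why the text reports diversity alongside the maximum number of BGCs.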
To establish the phylogenetic context of this species, we searched for closely related species by performing targeted genome reconstructions in additional eukaryote-enriched metagenomic samples from the Tara Oceans expedition. Briefly, we mapped metagenomic reads to genomic fragments associated with ‘Ca. E. malaspinii’ and hypothesized that an increased recruitment rate in a given sample indicates the presence of other relatives (Methods). As a result, we found 10 MAGs, and the combined 19 MAGs represent five species from three genera in the newly defined family (that is, ‘Ca. Eudoremicrobiaceae’). After manual inspection and quality control (Extended Data Fig. 6 and Supplementary Information), we found that ‘Ca. Eudoremicrobiaceae’ species were richer in BGCs than members of other ‘Ca. Eremiobacterota’ clades (up to 7 BGCs) (Fig. 3a–c).
a, Phylogenetic placement of the five ‘Ca. Eudoremicrobiaceae’ species showed a BGC richness specific to the marine lineages identified in this study. The phylogenetic tree includes all ‘Ca. Eremiobacterota’ MAGs available in the GTDB (release 89) and members of other phyla (number of genomes in brackets) for evolutionary context (Methods). The outermost layers represent taxonomic classifications at the family (‘Ca. Eudoremicrobiaceae’ and ‘Ca. Xenobiaceae’) and class (‘Ca. Eremiobacteria’) levels. The five species described in this study are represented by alphanumeric codes and proposed binomial names (Supplementary Information). b, ‘Ca. Eudoremicrobiaceae’ species share a core of seven BGCs. The absence of a BGC in the A2 clade was due to the incompleteness of the representative MAG (Supplementary Table 3). BGCs specific to the genera in clades A and B (including ‘Ca. Amphithomicrobium’) are not shown. c, All BGCs encoded by ‘Ca. Eudoremicrobium taraoceanii’ were found to be expressed across 623 metatranscriptomes from Tara Oceans. Solid circles indicate active transcription. Orange circles indicate log2-transformed fold changes below or above the expression of housekeeping genes (Methods). d, Relative abundance distribution (Methods) showing that ‘Ca. Eudoremicrobiaceae’ species are abundant and widely distributed in most ocean basins and throughout the entire water column (from the surface to a depth of at least 4,000 m). Based on these estimates, we found that ‘Ca. E. malaspinii’ accounts for up to 6% of prokaryotic cells in deep-sea particle-associated communities. If a species was found in any size fraction for a given depth layer, we consider it to be present at the site. IO, Indian Ocean; NAO, North Atlantic Ocean; NPO, North Pacific Ocean; RS, Red Sea; SAO, South Atlantic Ocean; SO, Southern Ocean; SPO, South Pacific Ocean.
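The expression comparison in panel c can be sketched as a log2 fold change relative to a housekeeping baseline. The sketch below is illustrative only: it assumes the baseline is the median expression of a set of housekeeping genes in the same sample, and the function name and all values are hypothetical. The normalization actually used is described in the paper's Methods.

```python
import math
from statistics import median

def log2_fold_change(bgc_expression, housekeeping_expression):
    """log2 ratio of a BGC's expression to the median housekeeping expression.

    Returns None when the BGC has no detectable expression, since the
    logarithm of zero is undefined (no active transcription).
    """
    if bgc_expression <= 0:
        return None
    baseline = median(housekeeping_expression)
    return math.log2(bgc_expression / baseline)

# Hypothetical normalized expression values for one metatranscriptome.
housekeeping = [8.0, 4.0, 2.0]  # median baseline is 4.0

above = log2_fold_change(16.0, housekeeping)   # expressed above the baseline
below = log2_fold_change(1.0, housekeeping)    # expressed below the baseline
silent = log2_fold_change(0.0, housekeeping)   # not transcribed

print(above, below, silent)
```

Positive values correspond to BGCs expressed above the housekeeping level and negative values to those expressed below it, which matches the "below or above" reading of the orange circles in the caption.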
We explored the abundance and distribution of ‘Ca. Eudoremicrobiaceae’ and found that its members are prevalent in most ocean basins and throughout the water column (Fig. 3d). Locally, they account for up to 6% of the marine microbial community, making them an important part of the global marine microbiome. In addition, we found that the relative abundance of ‘Ca. Eudoremicrobiaceae’ species and the expression levels of their BGCs were highest in the eukaryote-enriched fraction (Fig. 3c and Extended Data Fig. 7), indicating a possible interaction with particulate matter, including plankton. This observation, together with some homology of ‘Ca. Eudoremicrobium’ BGCs to known pathways that produce cytotoxic natural products, may indicate predatory behaviour (Supplementary Information and Extended Data Fig. 8), similar to other specialized metabolite-producing predators such as Myxococcus. The detection of ‘Ca. Eudoremicrobiaceae’ in less accessible (deep ocean) samples, or in eukaryote-enriched rather than prokaryote-enriched samples, may explain why these bacteria and their unanticipated BGC diversity have remained obscure in the context of natural product research.
Ultimately, we sought to experimentally validate the promise of our microbiome-based work for discovering new pathways, enzymes and natural products. Among the different classes of BGCs, RiPP pathways are known to encode a rich chemical and functional diversity owing to the various post-translational modifications installed on the core peptide by maturation enzymes. We therefore selected two ‘Ca. Eudoremicrobium’ RiPP BGCs (Figs. 3b and 4a–e) on the basis of their low similarity to any known BGC (\(\bar{d}\)MIBiG and \(\bar{d}\)RefSeq above 0.2).
a–c, Heterologous expression and in vitro enzymatic assays of a novel (\(\bar{d}\)RefSeq = 0.29) RiPP biosynthetic cluster specific to the deep-sea species ‘Ca. E. malaspinii’ led to the production of diphosphorylated products. c, Modifications were identified using high-resolution (HR) MS/MS (fragmentation is indicated by b and y ions on the chemical structure) and NMR (Extended Data Fig. 9). d, This phosphorylated peptide exhibits low-micromolar inhibition of mammalian neutrophil elastase, whereas no such inhibition was found for the control peptide or the dehydrated peptide (dehydration induced by chemical elimination). The experiment was repeated three times with similar results. e–g, Heterologous expression of a second novel (\(\bar{d}\)RefSeq = 0.33) cluster of proteusin biosynthesis revealed the function of four maturation enzymes that modify the 46-amino-acid core peptide. Residues are coloured according to the putative sites of modification determined by HR-MS/MS, isotope labelling and NMR analysis (Supplementary Information). Dashed colouring indicates that the modification occurred at either of the two residues. The figure is a compilation of many heterologous constructs to show the activity of all maturation enzymes on the same core peptide. h, Illustration of the NMR data for N-methylation of backbone amides. Full results are shown in Extended Data Fig. 10. i, The phylogenetic placement of the FkbM maturation enzyme of the proteusin cluster among all FkbM domains found in the MIBiG 2.0 database reveals an enzyme of this family with N-methyltransferase activity (Supplementary Information). Schematic diagrams of the BGCs (a, e), the structures of the precursor peptides (b, f) and the proposed chemical structures of the natural products (c, g) are shown.
The first RiPP pathway (\(\bar{d}\)MIBiG = 0.41, \(\bar{d}\)RefSeq = 0.29) is present only in the deep-sea species ‘Ca. E. malaspinii’ and encodes a precursor peptide that is modified by a single maturation enzyme (Fig. 4a, b). In this enzyme, we identified a single functional domain homologous to the dehydration domain of lanthipeptide synthetases, which normally catalyses phosphorylation and subsequent elimination (Supplementary Information). We therefore predicted that modification of the precursor peptide would involve such a two-step dehydration. However, using tandem mass spectrometry (MS/MS) and nuclear magnetic resonance (NMR) spectroscopy, we identified a polyphosphorylated linear peptide (Fig. 4c). Although unexpected, several lines of evidence support this being the end product: the absence of dehydration in two different heterologous hosts and in in vitro assays; the identification of mutations in the catalytic dehydration site of the maturation enzyme in all reconstructed ‘Ca. E. malaspinii’ genomes (Extended Data Fig. 9 and Supplementary Information); and, finally, the biological activity of the phosphorylated product but not of the chemically synthesized dehydrated form (Fig. 4d). Indeed, we found that it exhibits low-micromolar protease inhibitory activity against neutrophil elastase, in a concentration range comparable to other related natural products (IC50 = 14.3 μM), although the ecological role of this unusual natural product remains to be elucidated. Based on these results, we propose to name this pathway ‘phospeptin’.
The second case is a complex RiPP pathway specific to ‘Ca. Eudoremicrobium’ that is predicted to encode a proteusin natural product (Fig. 4e). Such pathways are of particular biotechnological interest because of the expected density and variety of unusual chemical modifications installed by enzymes encoded in relatively short BGCs. We found that this proteusin differs from previously characterized ones in that it lacks both the core NX5N motif of polytheonamides and the lanthionine rings of landornamides. To overcome the limitations of common heterologous expression hosts, we used them together with a custom Microvirgula aerodenitrificans system to characterize the four maturation enzymes of this pathway (Methods). Using a combination of MS/MS, isotope labelling and NMR, we found that these enzymes modify a core peptide of 46 amino acids (Fig. 4f, g, Extended Data Figs. 10–12 and Supplementary Information). Among the maturation enzymes, we characterized the first occurrence of a member of the FkbM family of O-methyltransferases in a RiPP pathway and, surprisingly, found that this enzyme introduces backbone N-methylation (Fig. 4h, i and Supplementary Information). Although this modification is known in natural NRP products48, enzymatic N-methylation of amide bonds is a complex but biotechnologically significant reaction49 that has so far been restricted to the borosin family of RiPPs50,51. The identification of this activity in other enzyme and RiPP families may open up new applications and expand the functional and chemical diversity of proteusins. Based on the identified modifications and the unusual length of the proposed product structure, we propose to name this pathway ‘pythonamide’.
The discovery of unexpected enzymology in a functionally characterized family of enzymes47 illustrates the promise of environmental genomics for new discoveries, as well as the limited power of functional inference based on sequence homology alone. Thus, together with the report of a non-canonical bioactive polyphosphorylated RiPP, our results demonstrate the resource-intensive but crucial value of synthetic biology efforts to fully uncover the functional richness, diversity and unusual structures of biochemical compounds.
Here we demonstrate the extent of microbially encoded biosynthetic potential and its genomic context in the global marine microbiome, and we facilitate future research by making the resulting resource available to the scientific community (https://microbiomics.io/ocean/). We found that much of its phylogenetic and functional novelty can only be accessed by reconstructing MAGs and SAGs, especially from poorly understood microbial communities, which could guide future bioprospecting efforts. Although we focus here on ‘Ca. Eudoremicrobiaceae’ as a lineage that is especially ‘gifted’ for biosynthesis, many of the BGCs predicted in the remaining undiscovered microbiota probably encode previously undescribed enzymology, yielding compounds with ecologically and/or biotechnologically significant activities.
We included metagenomic datasets from major oceanographic surveys and time-series studies with sufficient sequencing depth to maximize the coverage of global marine microbial communities across ocean basins, depth layers and time. These datasets (Supplementary Table 1 and Fig. 1) include metagenomes from samples collected by Tara Oceans (virus-enriched, n = 190; prokaryote-enriched, n = 180)12,22, the BioGEOTRACES expedition (n = 480), the Hawaii Ocean Time-series (HOT, n = 68), the Bermuda-Atlantic Time-series Study (BATS, n = 62)21 and the Malaspina expedition (n = 58)23. Sequencing reads from all metagenomes were quality filtered using BBMap (v.38.71) by removing sequencing adapters from the reads, removing reads that mapped to quality control sequences (PhiX genome) and discarding low-quality reads using the parameters trimq = 14, maq = 20, maxns = 0 and minlength = 45. Downstream analyses were performed on quality-controlled reads or, where specified, on merged quality-controlled reads (bbmerge.sh minoverlap = 16). Quality-controlled reads were normalized (bbnorm.sh target = 40, mindepth = 0) before assembly with metaSPAdes (v.3.11.1 or v.3.12 if needed). The resulting scaffolded contigs (hereafter, scaffolds) were finally filtered by length (≥1 kb).
The 1,038 metagenomic samples were divided into groups and, for each group of samples, the quality-controlled metagenomic reads from all samples were individually mapped against the scaffolds of each sample, resulting in the following numbers of pairwise read set-to-scaffold mappings: Tara Oceans virus-enriched (190 × 190), Tara Oceans prokaryote-enriched (180 × 180), BioGEOTRACES, HOT and BATS (610 × 610) and Malaspina (58 × 58). Mapping was performed using the Burrows-Wheeler Aligner (BWA) (v.0.7.17-r1188)54, allowing reads to map at secondary sites (using the -a flag). Alignments were filtered to be at least 45 bases long, to have ≥97% identity and to cover ≥80% of the read. The resulting BAM files were processed using the jgi_summarize_bam_contig_depths script of MetaBAT2 (v.2.12.1)55 to provide within- and between-sample coverage for each scaffold. Finally, the scaffolds were binned by running MetaBAT2 on all samples individually with --minContig 2000 and --maxEdges 500 for increased sensitivity. We used MetaBAT2 rather than an ensemble binning approach because it has been shown in independent tests to be the most performant single binner56 and to be 10–50 times faster than other widely used binners57. To test for the effect of abundance correlations, randomly selected subsamples of metagenomes (10 for each of the two Tara Oceans datasets, 10 for BioGEOTRACES, 5 for each time series and 5 for Malaspina) were additionally binned using only within-sample coverage information (Supplementary Information).
Additional (external) genomes were included in subsequent analyses, namely 830 manually curated MAGs from the Tara Oceans dataset26, 5,287 SAGs from the GORG dataset20, and 1,707 isolate REFs and 682 SAGs from the MAR database (MarDB v.4)27. For the MarDB dataset, genomes were selected on the basis of the available metadata if the sample type matched the following regular expression: ‘[S|s]ingle.?[C|c]ell|[C|c]ulture|[I|i]solated’.
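The metadata filter described above can be sketched as follows. This is a minimal illustration, assuming the regular expression from the text with whitespace normalized; the sample-type strings and the function name are hypothetical and are not taken from MarDB. Note that inside a character class `|` is a literal character, so `[S|s]` also matches `|`, which is harmless for values like those shown.

```python
import re

# Regular expression from the text (whitespace normalized).
SAMPLE_TYPE_PATTERN = re.compile(r"[S|s]ingle.?[C|c]ell|[C|c]ulture|[I|i]solated")


def is_selected(sample_type: str) -> bool:
    """Return True if the pattern matches anywhere in the sample type."""
    return SAMPLE_TYPE_PATTERN.search(sample_type) is not None


# Hypothetical metadata values, for illustration only.
examples = ["Single cell", "single-cell genome", "Culture", "Isolated strain", "Metagenome"]
selected = [s for s in examples if is_selected(s)]
print(selected)
```

Using `search` rather than `fullmatch` means the keyword may appear anywhere in the metadata field, which is an assumption about how the match was applied.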
The quality of each metagenomic bin and external genome was assessed using the lineage workflows of CheckM (v.1.0.13) and Anvi’o (v.5.5.0)58,59. Metagenomic bins and external genomes were retained for downstream analyses if either CheckM or Anvi’o reported ≥50% completeness/completion and ≤10% contamination/redundancy. These scores were then combined into mean completeness (mcpl) and mean contamination (mctn) to classify genome quality according to community criteria60 as follows: high quality: mcpl ≥ 90% and mctn ≤ 5%; good quality: mcpl ≥ 70% and mctn ≤ 10%; medium quality: mcpl ≥ 50% and mctn ≤ 10%; fair quality: mcpl ≤ 90% or mctn ≥ 10%. Filtered genomes were further associated with quality scores (Q and Q′) as follows: Q = mcpl − 5 × mctn; Q′ = mcpl − 5 × mctn + mctn × (strain heterogeneity)/100 + 0.5 × log[N50] (as implemented in dRep61).
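The scoring described above can be illustrated with a short sketch. It assumes that the two tools' estimates are averaged arithmetically, that tiers are assigned from strictest to loosest, and that the logarithm in Q′ is base 10; none of these details is stated explicitly in the text, and all function names are hypothetical.

```python
import math


def mean_scores(checkm, anvio):
    """Average (completeness, contamination) pairs from the two tools."""
    mcpl = (checkm[0] + anvio[0]) / 2
    mctn = (checkm[1] + anvio[1]) / 2
    return mcpl, mctn


def quality_tier(mcpl, mctn):
    """Assign a tier, checking the strictest criteria first (assumed precedence)."""
    if mcpl >= 90 and mctn <= 5:
        return "high"
    if mcpl >= 70 and mctn <= 10:
        return "good"
    if mcpl >= 50 and mctn <= 10:
        return "medium"
    return "other"  # remaining genomes (the last tier listed in the text)


def q_score(mcpl, mctn):
    """Q = mcpl - 5 * mctn."""
    return mcpl - 5 * mctn


def q_prime(mcpl, mctn, strain_heterogeneity, n50):
    """Q' adds strain heterogeneity and N50 terms; log base 10 is an assumption."""
    return (mcpl - 5 * mctn
            + mctn * strain_heterogeneity / 100
            + 0.5 * math.log10(n50))


mcpl, mctn = mean_scores((92.0, 3.0), (88.0, 1.0))
print(quality_tier(mcpl, mctn), q_score(mcpl, mctn))
```

The 5× weight on contamination means that one percentage point of contamination costs as much as five points of completeness when ranking genomes.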
To enable comparative analyses across different data sources and genome types (MAGs, SAGs and REFs), genomes were dereplicated using dRep (v.2.5.4)61 with a 95% ANI cutoff28,62 (-comp 0 -con 1000 -sa 0.95 -nc 0.2) and using single-copy marker genes with SpecI63, providing a clustering of the genomes at the species level. For each dRep cluster, the genome with the maximum quality score (Q′) defined above was selected as the representative genome and was considered representative of the species.
To evaluate the mapping rate, BWA (v.0.7.17-r1188, -a) was used to map all 1,038 sets of metagenomic reads to the 34,799 genomes contained in the OMD. Quality-controlled reads were mapped in single-end mode and the resulting alignments were filtered to retain only those ≥45 bp in length and with ≥95% identity. The mapping rate for each sample is the percentage of reads remaining after filtering divided by the total number of quality-controlled reads. Using the same approach, each of the 1,038 metagenomes was downsampled to 5 million inserts (Extended Data Fig. 1c) and mapped to the GORG SAGs in the OMD and to all of GEM16. The number of MAGs recovered from seawater in the GEM catalogue16 was determined by keyword queries of the metagenome source metadata, selecting seawater samples (as opposed to, for example, marine sediments). Specifically, we selected ‘aquatic’ as ‘ecosystem_category’ and ‘marine’ as ‘ecosystem_type’, and filtered ‘habitat’ for ‘deep ocean’, ‘marine’, ‘maritime oceanic’, ‘pelagic marine’, ‘marine water’, ‘Ocean’, ‘Sea Water’, ‘Surface Sea Water’, ‘Surface Sea Water’. This resulted in 5,903 MAGs (734 high quality) distributed over 1,823 OTUs (here, species).
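As a minimal sketch of the mapping-rate calculation described above, the snippet below filters alignments by length and identity and reports the share of reads that retain at least one alignment. Because reads may map to several sites, counting each read once is an assumption; the tuple layout and function name are hypothetical.

```python
def mapping_rate(alignments, total_reads, min_length=45, min_identity=95.0):
    """Percentage of reads with at least one alignment passing both filters.

    alignments: iterable of (read_id, aligned_length_bp, percent_identity).
    total_reads: number of quality-controlled reads in the sample.
    """
    mapped_reads = {
        read_id
        for read_id, length, identity in alignments
        if length >= min_length and identity >= min_identity
    }
    return 100.0 * len(mapped_reads) / total_reads


# Toy input: r1 passes twice (counted once), r2 is too short, r3 has low identity.
toy_alignments = [
    ("r1", 100, 99.0),
    ("r1", 100, 96.0),
    ("r2", 40, 100.0),
    ("r3", 80, 94.0),
    ("r4", 60, 95.0),
]
rate = mapping_rate(toy_alignments, total_reads=8)
print(rate)
```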
Taxonomic annotation of prokaryotic genomes was performed using GTDB-Tk (v.1.0.2)64 with default parameters against GTDB release r8913. Anvi’o was used to identify eukaryotic genomes on the basis of domain prediction, ≥50% completion and ≤10% redundancy. The taxonomic annotation of a species was defined as that of its representative genome. With the exception of eukaryotes (148 MAGs), each genome was functionally annotated by first calling complete genes using prokka (v.1.14.5)65, specifying the Archaea or Bacteria parameter as needed, which also reports non-coding genes and CRISPR regions, among other genomic features. Predicted genes were annotated by identifying universal single-copy marker genes (uscMGs) using fetchMGs (v.1.2)66, by assigning orthologous groups and querying the eggNOG database (v.5.0)68 using emapper (v.2.0.1)67, and by querying the KEGG database (release of 10 February 2020)69. This last step was performed by aligning proteins to the KEGG database using DIAMOND (v.0.9.30)70 with a query and subject coverage of ≥70%. The results were further filtered according to the NCBI Prokaryotic Genome Annotation Pipeline71, requiring a bitscore of ≥50% of the maximum expected bitscore (self-reference). Gene sequences were also used as input for the identification of BGCs in the genomes using antiSMASH (v.5.1.0)72 with default parameters and the different cluster blasts enabled. All genomes and annotations were compiled into the OMD together with contextual metadata, which is available online (https://microbiomics.io/ocean/).
Similar to the methods described previously12,22, we clustered the >56.6 million protein-coding genes from the bacterial and archaeal genomes of the OMD at 95% identity and 90% coverage of the shorter gene using CD-HIT (v.4.8.1)73 into >17.7 million gene clusters. The longest sequence was selected as the representative gene of each gene cluster.
The 1,038 metagenomes were then mapped to the >17.7 million cluster representatives with BWA (-a) and the resulting BAM files were filtered to retain only alignments with a percentage identity of ≥95% and ≥45 bases aligned. Length-normalized gene abundance was calculated by first counting inserts from best unique alignments and then, for ambiguously mapped inserts, adding fractional counts to the respective target genes in proportion to their unique insert abundances.
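The two-step counting scheme described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the input layout and function name are hypothetical, an ambiguous insert is split evenly when none of its candidate genes has unique support (an assumption), and length normalization is taken to mean dividing counts by gene length in base pairs.

```python
from collections import defaultdict


def gene_abundance(unique_hits, ambiguous_hits, gene_lengths):
    """Length-normalized abundance with fractional counts for ambiguous inserts.

    unique_hits: list of gene ids, one entry per uniquely mapped insert.
    ambiguous_hits: list of candidate-gene lists, one per multi-mapped insert.
    gene_lengths: dict mapping gene id to gene length in bp.
    """
    # Step 1: count inserts with a best unique alignment.
    unique = defaultdict(float)
    for gene in unique_hits:
        unique[gene] += 1.0

    # Step 2: distribute each ambiguous insert across its candidate genes.
    counts = defaultdict(float, unique)
    for candidates in ambiguous_hits:
        total_unique = sum(unique[g] for g in candidates)
        for g in candidates:
            if total_unique > 0:
                share = unique[g] / total_unique  # proportional to unique support
            else:
                share = 1.0 / len(candidates)  # even split (assumption)
            counts[g] += share

    # Step 3: normalize by gene length.
    return {g: counts[g] / length for g, length in gene_lengths.items()}


abundance = gene_abundance(
    unique_hits=["A", "A", "A", "B"],
    ambiguous_hits=[["A", "B"], ["C", "D"]],
    gene_lengths={"A": 1000, "B": 500, "C": 1000, "D": 1000},
)
print(abundance)
```

In the toy input, the insert shared by A and B is split 0.75 to 0.25 because A has three unique inserts and B has one.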
The genomes of the extended OMD (with additional MAGs from ‘Ca. Eudoremicrobiaceae’, see below) were added to the database (v.2.5.1) of the metagenomic profiling tool mOTUs74 to create an extended mOTU reference database. Only genomes with at least 6 of the 10 uscMGs present in single copy were retained (23,528 genomes). The database extension yielded 4,494 additional species-level clusters. The 1,038 metagenomes were profiled using the default parameters of mOTUs (v.2). According to the mOTU profiles, a total of 989 genomes (95% REFs, 5% SAGs and 99.9% belonging to MarDB) contained in 644 mOTU clusters were not detected. This reflects the various additional sources of isolation of the MarDB marine genomes (most of the undetected genomes were associated with organisms isolated from, for example, sediments or marine hosts). To maintain the focus on open ocean environments in this study, we excluded them from downstream analyses unless they were detected or were not included in the extended mOTU database created in this study.
All BGCs from the MAGs, SAGs and REFs in the OMD (see above) were combined with the BGCs identified in all metagenomic scaffolds (antiSMASH v.5.0, default parameters) and featurized using BiG-SLICE (v.1.1) (PFAM domains)75. Based on these features, we computed all cosine distances between BGCs and grouped them (average linkage) into GCFs and GCCs using distance thresholds of 0.2 and 0.8, respectively. These thresholds are an adaptation to the cosine distance of the thresholds previously used with the Euclidean distance75, which alleviates some of the bias in the original BiG-SLICE clustering strategy (Supplementary Information).
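The grouping step described above (cosine distances, average linkage, cut at a distance threshold) can be sketched with a naive pure-Python implementation. This is an illustration only, not the pipeline used in the study: the toy feature vectors are hypothetical stand-ins for BiG-SLICE features, and the quadratic search would not scale to tens of thousands of BGCs.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage_clusters(vectors, threshold):
    """Merge clusters while the smallest average inter-cluster distance <= threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def average_distance(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        if average_distance(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical feature vectors for four BGCs.
features = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gcfs = average_linkage_clusters(features, threshold=0.2)  # family-level grouping
gccs = average_linkage_clusters(features, threshold=0.8)  # clan-level grouping
print(gcfs, gccs)
```

With these toy vectors the third BGC stays in its own GCF at the 0.2 threshold but joins the first two in a single GCC at 0.8, which shows how the looser threshold captures more distant relationships.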
The BGCs were then filtered to retain only those encoded on scaffolds ≥5 kb, to reduce the risk of fragmentation as previously described, and to exclude those from MarDB REFs and SAGs that were not detected in the 1,038 metagenomes (see above). This resulted in a total of 39,055 BGCs encoded by OMD genomes, with an additional 14,106 identified on metagenomic fragments (that is, not binned into MAGs). These ‘metagenomic’ BGCs were used to estimate the proportion of the biosynthetic potential of the marine microbiome not captured in the database (Supplementary Information). Each BGC was functionally characterized according to its predicted product type as defined by antiSMASH or according to the coarser product categories defined in BiG-SCAPE76. To prevent sampling bias in quantifications (taxonomic and functional composition of GCCs/GCFs, distance of GCFs and GCCs to reference databases, and metagenomic abundance of GCFs), the 39,055 BGCs were further dereplicated by retaining only the longest BGC per GCF for each species, resulting in 17,689 BGCs.
The novelty of GCCs and GCFs was assessed on the basis of their distance to databases of computationally predicted (RefSeq database in BiG-FAM)29 and experimentally validated (MIBiG 2.0)30 BGCs. For each of the 17,689 representative BGCs, we selected the smallest cosine distance to the respective database. These minimum distances were then averaged (mean) per GCF or GCC, as appropriate. A GCF was considered new if its distance to the database was greater than 0.2, which corresponds to a complete separation between the (average) GCF and the reference. For GCCs, we chose 0.4, which is twice the threshold defined for GCFs, to capture longer-range relationships to the reference.
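The novelty criterion described above reduces to a minimum followed by a mean and a threshold. A minimal sketch, assuming precomputed cosine distances from each representative BGC to every reference BGC (the input layout, identifiers and function name are hypothetical):

```python
from statistics import mean


def family_novelty(distances_to_reference, threshold=0.2):
    """Mean of per-BGC minimum distances for each family, with a novelty flag.

    distances_to_reference: dict mapping a GCF (or GCC) id to a list of rows,
    one row per representative BGC, each row holding that BGC's cosine
    distances to all reference BGCs.
    """
    result = {}
    for family, rows in distances_to_reference.items():
        d_bar = mean(min(row) for row in rows)  # closest reference per BGC, then mean
        result[family] = (d_bar, d_bar > threshold)
    return result


# Toy distances: two families with two representative BGCs each.
toy = {
    "GCF_known": [[0.10, 0.60], [0.90, 0.20]],
    "GCF_new": [[0.30, 0.70], [0.50, 0.80]],
}
novelty = family_novelty(toy)  # use threshold=0.4 for GCCs
print(novelty)
```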
The metagenomic abundance of BGCs was estimated as the median abundance of their biosynthetic genes (as defined by antiSMASH) available from the gene-level profiles. The metagenomic abundance of each GCF or GCC was then calculated as the sum of the abundances of its representative BGCs (out of 17,689). These abundance profiles were then normalized for cellularity using the per-sample mOTU counts, which also accounts for sequencing effort22,74 (Extended Data Fig. 1d). The prevalence of a GCF or GCC was computed as the percentage of samples with an abundance of >0.
Euclidean distances between samples were calculated from the normalized GCF profiles. These distances were reduced in dimensionality using UMAP77 and the resulting embeddings were used for unsupervised density-based clustering with HDBSCAN78. The optimal minimum number of points per cluster (and hence the number of clusters) used by HDBSCAN was determined by maximizing the cumulative probability of cluster membership. The identified clusters (and random balanced subsamples of these clusters, to account for bias in permutational multivariate analysis of variance (PERMANOVA)) were tested for significance against the unreduced Euclidean distances using PERMANOVA. The average genome size of the samples was calculated from the relative abundances of mOTUs and the estimated genome sizes of their member genomes. Specifically, the average genome size of each mOTU was estimated as the mean of the completeness-corrected sizes of its member genomes (for example, a 75% complete genome of 3 Mb has a corrected size of 4 Mb), considering genomes with a mean completeness of ≥70%. The average genome size of each sample was then calculated as the sum of the mOTU genome sizes weighted by relative abundance.
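The genome-size estimate described above can be worked through in a few lines. This sketch assumes that relative abundances are renormalized over the mOTUs for which a size estimate exists, which the text does not specify; all names and toy values are hypothetical.

```python
from statistics import mean


def corrected_size(size_mb, completeness_pct):
    """Scale an observed genome size up to its expected complete size."""
    return size_mb / (completeness_pct / 100.0)


def motu_genome_size(genomes, min_completeness=70.0):
    """Mean corrected size over member genomes with sufficient completeness.

    genomes: list of (size_mb, completeness_pct). Returns None if none qualify.
    """
    sizes = [corrected_size(s, c) for s, c in genomes if c >= min_completeness]
    return mean(sizes) if sizes else None


def sample_average_genome_size(relative_abundance, motu_sizes):
    """Abundance-weighted sum of mOTU genome sizes.

    Abundances are renormalized over mOTUs with a size estimate (assumption).
    """
    usable = {m: a for m, a in relative_abundance.items() if motu_sizes.get(m) is not None}
    total = sum(usable.values())
    return sum(a / total * motu_sizes[m] for m, a in usable.items())


# The worked example from the text: a 75% complete genome of 3 Mb.
print(corrected_size(3.0, 75.0))  # 4.0

# The 50% complete genome is excluded by the completeness filter.
size_m1 = motu_genome_size([(3.0, 75.0), (4.5, 90.0), (1.0, 50.0)])
sample_size = sample_average_genome_size({"m1": 0.25, "m2": 0.75}, {"m1": 4.0, "m2": 2.0})
```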
The filtered set of BGCs encoded by genomes in the OMD (on scaffolds ≥5 kb, excluding those from MarDB REFs and SAGs not detected in the 1,038 metagenomes, see above) and their predicted product classes were visualized according to the GTDB-Tk-based phylogenomic placement of the genomes on the bacterial and archaeal GTDB trees (see above). We first reduced the data by species, using the genome with the most BGCs in that species as the representative. For visualization, representatives were further grouped along the tree and, again, for each grouped clade, the genome containing the most BGCs was selected as the representative. BGC-rich species (at least one genome with >15 BGCs) were further analysed by computing the Shannon diversity index of the product types encoded in these BGCs. Hybrid BGCs were considered to be of the same product type if all of their predicted product types were the same, regardless of their order in the cluster (for example, a proteusin-bacteriocin hybrid versus a bacteriocin-proteusin hybrid).
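A minimal sketch of the diversity calculation described above, assuming the natural logarithm (the base is not stated in the text) and representing each BGC as a tuple of predicted product types so that hybrids can be compared regardless of order. The product-type labels and function names are hypothetical examples.

```python
import math
from collections import Counter


def product_key(product_types):
    """Order-insensitive key, so ('A', 'B') and ('B', 'A') count as one type."""
    return tuple(sorted(product_types))


def shannon_index(bgc_product_types):
    """Shannon diversity H = -sum(p * ln p) over product types (natural log assumed)."""
    counts = Counter(product_key(types) for types in bgc_product_types)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Two hybrids that differ only in order, plus two single-type BGCs.
bgcs = [("NRPS", "T1PKS"), ("T1PKS", "NRPS"), ("terpene",), ("terpene",)]
h = shannon_index(bgcs)
print(round(h, 4))
```

Here the two hybrids collapse into one product type, so the index is that of two equally frequent types, ln 2.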
Leftover DNA (an estimated 6 ng) from the sample Malaspina MP1648, corresponding to the biosample SAMN05421555 and matching the short-read Illumina metagenomic read set SRR3962772, was processed for an ultra-low-input PacBio sequencing protocol to produce a >20 Gb HiFi PacBio metagenome using the PacBio kits SMRTbell gDNA Sample Amplification Kit (100-980-000) and SMRTbell Express Template Prep Kit 2.0 (100-938-900). Briefly, the remaining DNA was sheared using Covaris (g-TUBE, 52104), repaired and purified (ProNex beads). The purified DNA was then subjected to library preparation, amplification, purification (ProNex beads) and size selection (>6 kb, Blue Pippin), followed by a final purification step (ProNex beads) and sequencing on the Sequel II platform.
After the reconstruction of the first two ‘Ca. Eremiobacterota’ MAGs, we identified six additional ones with ANI > 99% (these are included in Fig. 3) that were initially filtered out on the basis of contamination estimates (later identified as gene duplications, see below). We also retrieved bins labelled as ‘Ca. Eremiobacterota’ from a different study23 and used them along with the eight MAGs from our study as a reference for subsampled mapping (5 million reads) of metagenomic reads from 633 eukaryote-enriched (>0.8 μm) samples using BWA (v.0.7.17-r1188, -a flag). Based on enrichment-specific mapping (filtered for 95% alignment identity and 80% read coverage), 10 metagenomes (expected coverage ≥5×) were selected for assembly and an additional 49 metagenomes (expected coverage ≥1×) were selected for abundance correlation.
Using the same parameters as above, these samples were binned and 10 additional ‘Ca. Eremiobacterota’ MAGs were recovered. These 16 MAGs (excluding the two already in the database) bring the total number of genomes in the expanded OMD to 34,815. MAGs were assigned to taxonomic ranks based on their genomic similarity and their placement in the GTDB. The 18 MAGs were dereplicated using dRep into 5 species (within-species ANIs were >99%) and 3 genera (within-genus ANIs ranged between 85% and 94%)79 within the same family. Species representatives were manually selected on the basis of completeness, contamination and N50. The suggested nomenclature is provided in the Supplementary Information.
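The read filter mentioned above (95% alignment identity and 80% read coverage) can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in the study: the `Alignment` record and its fields are hypothetical, identity is assumed to be matches divided by aligned length, and coverage is assumed to be aligned length divided by read length.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical summary of one read alignment (not a BWA or SAM API).
    read_length: int     # total length of the read, in bases
    aligned_length: int  # number of read bases covered by the alignment
    matches: int         # number of identical bases within the alignment


def passes_filter(aln, min_identity=0.95, min_coverage=0.80):
    """Return True if the alignment meets both thresholds."""
    if aln.read_length == 0 or aln.aligned_length == 0:
        return False
    identity = aln.matches / aln.aligned_length      # assumed definition
    coverage = aln.aligned_length / aln.read_length  # assumed definition
    return identity >= min_identity and coverage >= min_coverage


# Example: keep only the alignments that pass both thresholds.
alignments = [Alignment(100, 90, 88), Alignment(100, 70, 70), Alignment(100, 100, 94)]
kept = [a for a in alignments if passes_filter(a)]
```

In this toy input only the first alignment is kept: the second covers too little of the read and the third falls below the identity threshold.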
To assess the completeness and contamination of the ‘Ca. Eremiobacterota’ MAGs, we evaluated the presence of uscMGs in addition to the lineage- and domain-specific single-copy marker gene sets used by CheckM and Anvi’o. The identification of 2 duplicates out of the 40 uscMGs was confirmed by phylogenetic reconstruction (see below) to rule out any potential contamination (this corresponds to 5% based on these 40 marker genes). An additional investigation of the representative MAGs of the five ‘Ca. Eremiobacterota’ species using the Anvi’o interface confirmed the low level of contamination in these reconstructed genomes based on abundance and sequence composition correlations (Supplementary Information).
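The arithmetic behind the 5% figure (2 duplicated markers out of 40) can be reproduced with a short sketch. The definitions used here are assumptions chosen to match that figure: completeness is the share of markers found at least once and contamination is the share of markers found more than once. The marker identifiers are hypothetical.

```python
from collections import Counter


def marker_stats(marker_hits, n_markers=40):
    """Estimate completeness and contamination (in percent) from marker hits.

    marker_hits: list of marker-gene IDs, one entry per detected copy.
    """
    copies = Counter(marker_hits)
    present = len(copies)                                    # markers seen at least once
    duplicated = sum(1 for c in copies.values() if c > 1)    # markers seen more than once
    completeness = 100.0 * present / n_markers
    contamination = 100.0 * duplicated / n_markers
    return completeness, contamination


# Hypothetical genome: all 40 markers present, two of them in two copies.
hits = [f"uscMG_{i}" for i in range(40)] + ["uscMG_0", "uscMG_1"]
completeness, contamination = marker_stats(hits)
```

A duplicated marker inflates the contamination estimate whether it comes from a foreign contig or from a genuine gene duplication, which is why the duplicates were checked by phylogenetic reconstruction.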
For phylogenetic analysis, we selected the five representative ‘Ca. Eudoremicrobiaceae’ MAGs, all ‘Ca. Eremiobacterota’ genomes available in GTDB (r89)13 and members of other phyla, including UBP13, Armatimonadota, Patescibacteria, Dormibacterota, Chloroflexota, Cyanobacteria, Actinobacteria and Planctomycetota. All of these genomes were annotated as previously described for single-copy marker gene extraction and BGC annotation. The GTDB genomes were retained according to the completeness and contamination criteria described above. Phylogenetic analysis was performed using the Anvi’o Phylogenomics workflow. The tree was constructed using IQTREE (v.2.0.3) (default options and -bb 1000)80 on an alignment of 39 concatenated ribosomal proteins identified by Anvi’o (MUSCLE, v.3.8.1551)81. Alignment positions were trimmed to cover at least 50% of the genomes82, and Planctomycetota was used as an outgroup based on the GTDB tree topology. A tree of the 40 uscMGs was built using the same tools and parameters.
We used Traitar (v.1.1.2) with default parameters (phenotype, from nucleotides)83 to predict general microbial traits. We explored a potential predatory lifestyle using a previously developed predatory index84 that depends on the protein-coding gene content of the genome. Specifically, we used DIAMOND to compare the proteins in the genome against the OrthoMCL (v.4)85 database using the options --more-sensitive --id 25 --query-cover 70 --subject-cover 70 --top 20 and counted the genes corresponding to the predatory and non-predatory marker genes. The index is the difference between the number of predatory and non-predatory markers. As an additional control, we also analyzed the genome of ‘Ca. Entotheonella factor’ TSY118 on the basis of its similarity to ‘Ca. Eudoremicrobium’ (large genome size and biosynthetic potential). Next, we tested for potential links between the predatory and non-predatory marker genes and the biosynthetic potential of ‘Ca. Eudoremicrobiaceae’ and found that at most one gene (from either type of marker gene, that is, predatory or non-predatory) overlaps with BGCs, suggesting that BGCs do not confound the predation signal. Additional genomic annotation of the scrambled replicons was performed using TXSSCAN (v.1.0.2) to specifically examine secretion systems, pili and flagella.
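As a minimal sketch of the index described above (number of predatory markers minus number of non-predatory markers), the function below assumes that each gene has already been assigned to a single ortholog group from the filtered DIAMOND hits. That assignment step, the group identifiers and the function name are hypothetical and not taken from the original index implementation.

```python
def predatory_index(gene_to_group, predatory_groups, non_predatory_groups):
    """Return (#genes matching predatory markers) - (#genes matching non-predatory markers).

    gene_to_group: dict mapping gene ID -> assigned ortholog group ID (assumed input).
    predatory_groups / non_predatory_groups: sets of marker ortholog group IDs.
    """
    n_predatory = sum(1 for group in gene_to_group.values() if group in predatory_groups)
    n_non_predatory = sum(1 for group in gene_to_group.values() if group in non_predatory_groups)
    return n_predatory - n_non_predatory


# Hypothetical toy genome with four genes.
assignments = {"gene_1": "OG_A", "gene_2": "OG_A", "gene_3": "OG_B", "gene_4": "OG_X"}
index = predatory_index(assignments, predatory_groups={"OG_A"}, non_predatory_groups={"OG_B"})
```

A positive value indicates more predatory than non-predatory markers; genes that match neither marker set (here `gene_4`) do not contribute.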
We mapped 623 metatranscriptomes from the prokaryote- and eukaryote-enriched fractions of Tara Oceans22,40,87 (using BWA, v.0.7.17-r1188, -a flag) to the five representative ‘Ca. Eudoremicrobiaceae’ genomes. After filtering for 80% read coverage and 95% identity, BAM files were processed using FeatureCounts (v.2.0.1)88 (with the options featureCounts --primary -O --fraction -t CDS,tRNA -F GTF -g ID -p) to count the number of inserts per gene. The resulting profiles were normalized to gene length and mOTU marker gene abundances (median length-normalized insert count of genes with insert count of >0) and log2-transformed22,74 to obtain relative per-cell expression levels of each gene, which also accounts for between-samples differences in sequencing effort.
Such ratios allow for comparative analyses and mitigate compositionality issues when using relative abundance data. Only samples with >5 out of the 10 mOTU marker genes were considered for further analyses, ensuring that a large enough fraction of the genome was detected.
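The normalization described above can be illustrated with a short sketch for a single sample. Several choices here are assumptions rather than details from the study: marker and gene counts are taken from the same profile, genes with zero inserts are left out instead of receiving a pseudocount, and the function and variable names are hypothetical.

```python
import math
from statistics import median


def per_cell_expression(counts, lengths, marker_genes, min_markers=6):
    """Return gene -> log2 relative expression for one sample, or None if too few markers.

    counts: dict gene ID -> insert count in this sample.
    lengths: dict gene ID -> gene length in bp.
    marker_genes: IDs of the 10 mOTU marker genes of the genome.
    min_markers: ">5 out of 10" is interpreted as at least 6 detected markers.
    """
    detected = [counts[g] / lengths[g] for g in marker_genes if counts.get(g, 0) > 0]
    if len(detected) < min_markers:
        return None  # sample excluded from further analysis
    marker_abundance = median(detected)  # median length-normalized count of detected markers
    expression = {}
    for gene, count in counts.items():
        if count > 0:  # zero counts are skipped (assumption: no pseudocount)
            expression[gene] = math.log2((count / lengths[gene]) / marker_abundance)
    return expression
```

Dividing by the marker gene abundance turns each value into a ratio within the same sample, which is what makes the profiles comparable across samples with different sequencing depth.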
The dimensionality of the normalized transcriptomic profiles of ‘Ca. E. taraoceanii’ was reduced using UMAP and the resulting representation was used for unsupervised clustering using HDBSCAN (see above) to identify expression states. The significance of the differences between the identified clusters was tested by PERMANOVA in the original (non-reduced) distance space. Differential expression between these states was tested across the 201 KEGG pathways identified in the genome (see above) and 6 functional groups, namely: BGCs, secretion system and flagellar genes from TXSSCAN, degradative enzymes (proteases and peptidases) from Prokka, and the predatory and non-predatory markers of the predatory index. For each sample, we calculated the median normalized expression for each category (note that BGC expression itself was calculated as the median expression of the biosynthetic genes for that BGC) and tested for significant differences across states (Kruskal-Wallis test, adjusted for FDR).
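The per-sample summary step (median normalized expression per category) and an FDR adjustment can be sketched as below. The text does not state which FDR procedure was used, so the Benjamini-Hochberg method here is an assumption, and the Kruskal-Wallis test itself is not reimplemented. All names are hypothetical.

```python
from statistics import median


def category_medians(expression, categories):
    """Median normalized expression per category for one sample.

    expression: dict gene ID -> normalized expression in this sample.
    categories: dict category name -> list of gene IDs.
    Categories without any expressed gene are omitted.
    """
    result = {}
    for name, genes in categories.items():
        values = [expression[g] for g in genes if g in expression]
        if values:
            result[name] = median(values)
    return result


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order (assumed FDR method)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

In such a setup, one p-value per pathway or functional group would be computed across expression states and the whole list would then be passed through `benjamini_hochberg`.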
Synthetic genes were purchased from GenScript and PCR primers were purchased from Microsynth. Phusion polymerase from Thermo Fisher Scientific was used for DNA amplification. NucleoSpin Plasmid and NucleoSpin Gel and PCR purification kits from Macherey-Nagel were used for DNA purification. Restriction enzymes and T4 DNA ligase were purchased from New England Biolabs. Chemicals were purchased from Sigma-Aldrich, except for isopropyl-β-d-1-thiogalactopyranoside (IPTG; Biosynth) and 1,4-dithiothreitol (DTT; AppliChem), and were used without further purification. The antibiotics chloramphenicol (Cm), spectinomycin dihydrochloride (Sm), ampicillin (Amp), gentamicin (Gt) and carbenicillin (Cbn) were purchased from AppliChem. The media components Bacto Tryptone and Bacto Yeast Extract were purchased from BD Biosciences. Sequencing-grade trypsin was purchased from Promega.
Gene sequences were extracted from the antiSMASH-predicted BGC 75.1 of ‘Ca. E. malaspinii’ (Supplementary Information).
The embA (locus, MALA_SAMN05422137_METAG-scaffold_127-gene_5), embM (locus, MALA_SAMN05422137_METAG-scaffold_127-gene_4) and embAM (including intergenic regions) genes were ordered as synthetic constructs in pUC57(AmpR) for expression in E. coli. The embA gene was subcloned into the first multiple cloning site (MCS1) of pACYCDuet-1(CmR) and pCDFDuet-1(SmR) with BamHI and HindIII cleavage sites. The embM and embMopt (codon-optimized) genes were subcloned into MCS1 of pCDFDuet-1(SmR) with BamHI and HindIII and into the second multiple cloning site (MCS2) of pCDFDuet-1(SmR) and pRSFDuet-1(KanR) with NdeI/XhoI. The embAM cassette was subcloned into pCDFDuet-1(SmR) with BamHI and HindIII cleavage sites. The orf3/embI gene (locus, MALA_SAMN05422137_METAG-scaffold_127-gene_3) was constructed by overlap extension PCR using the primers EmBI_OE_F_NdeI and EmBI_OE_R_XhoI, digested with NdeI/XhoI and ligated into pCDFDuet-1-EmbM(MCS1) using the same restriction enzymes (Supplementary Table 6). Restriction enzyme digestion and ligation were performed according to the manufacturer’s protocol (New England Biolabs).
All constructs created above were introduced into chemically competent E. coli DH5α and plated on LB agar with the appropriate antibiotic selection. Plasmids were purified from single colonies and sequenced using sequencing primers to check for correct gene insertion (Supplementary Table 6). The embA and embAM genes were additionally subcloned into a modified pLMB509m(GtR) vector for expression in M. aerodenitrificans by Gibson assembly, including an N-terminal His6 purification tag in the final EmbA protein product. The Gibson assembly primers EmbA_F_plmb, EmbA_R_plmb, Plmb_F_EmbA and Plmb_R_EmbA are listed in Supplementary Table 6. After transformation of E. coli DH5α, plasmid isolation and sequence verification of correct clones, NHis6-EmbA-pLMB509m and NHis6-EmbAM-pLMB509m were introduced into E. coli SM10 for conjugation.


Post time: Oct-14-2022