Host genetic regulation of human intestine microbial structural variation

Cohort description


The DMP consists of 8,719 people and is a part of the Lifelines research, a multidisciplinary potential population-based cohort research that makes use of a singular three-generation design to look at well being and health-related behaviours in 167,729 individuals residing within the northern Netherlands. Lifelines makes use of a broad vary of investigative procedures to evaluate the biomedical, socio-demographic, behavioural, bodily and psychological components that contribute to well being and illness, with a particular concentrate on multi-morbidity and complicated genetics48.

Microbiome knowledge technology for the DMP was described elsewhere8. In short, fresh-frozen faecal samples have been collected from contributors of the DMP research. Microbial DNA was remoted utilizing the QIAamp Quick DNA Stool Mini Equipment (Qiagen) by the QIAcube automated pattern preparation system (Qiagen). Metagenomic sequencing was carried out at Novogene, China utilizing the Illumina HiSeq 2000 sequencer. After filtering, 8,534 DMP samples have been used for SV calling.

DMP genotype knowledge technology was described beforehand2. In short, genotyping was carried out utilizing the Infinium International Screening Array MultiEthnic Illnesses model. Lacking genotypes have been imputed utilizing Haplotype Reference Consortium (HRC) panel v.1.1 (ref. 49). Solely bi-allelic SNPs with imputation high quality >0.4, minor allele frequency (MAF) > 0.05, name price >0.95 and Hardy–Weinberg equilibrium P-value > 10−6 have been retained. A complete of seven,738 samples had each metagenomic and genotype knowledge after high quality management (QC)2. We additional eliminated 349 samples overlapping with the LLD cohort. This resulted in phenotype, metagenomic and genotype knowledge being obtainable for 7,389 DMP samples.


The LLD cohort is one other a part of the Lifelines cohort consisting of 1,539 people. Microbiome knowledge technology for LLD was described elsewhere25. Contemporary-frozen faecal samples have been collected, and DNA was remoted with the AllPrep DNA/RNA Mini Equipment (Qiagen, catalogue quantity 80204). Sequencing was carried out utilizing the Illumina HiSeq platform on the Broad Institute, Boston. A complete of 1,135 metagenomic samples handed QC.

Genotyping was carried out utilizing the CytoSNP and ImmunoChip assays, as beforehand described50, and lacking genotypes have been imputed utilizing the HRC v.1.1 reference panel49. A complete of 984 samples had phenotype, metagenomic and genotype knowledge.


The 500FG cohort is a part of the Dutch Human Useful Genomics Venture (DHFGP) and consists of 534 people. The metagenomic knowledge technology was described beforehand26,51. Briefly, DNA was remoted from faecal samples with the AllPrep DNA/RNA Mini Equipment, and libraries have been sequenced on the Illumina HiSeq 2000 platform. A complete of 450 metagenomic samples handed QC and have been included in SV calling.

500FG genotype knowledge technology was described beforehand52. Briefly, genotyping was carried out utilizing the Illumina HumanOmniExpressExome-8 v.1.0 SNP chip. Lacking genotypes have been imputed utilizing the Genome of the Netherlands as a reference panel53. After QC, 396 samples had phenotype, metagenomic and genotype knowledge.


300OB can be a part of the DHFGP and consists of 302 people with physique mass index > 27 kg m−2. Metagenomic knowledge technology was described beforehand26,54 and was carried out utilizing an identical protocol and evaluation pipeline to these of LLD. A complete of 302 samples had metagenomic knowledge obtainable for SV calling.

300OB genotype knowledge technology was described beforehand55. In short, samples have been genotyped on the Illumina HumanCoreExome-24 BeadChip Equipment or the Illumina Infinium Omni-express chip. Imputation was carried out utilizing the HRC v.1.1 reference panel49. After genotype QC, 274 samples had phenotype, genotype and metagenomic knowledge obtainable.


For replication in non-European people, we included 300TZFG, a inhabitants cohort of 323 people from each rural and concrete areas of Tanzania. This research is a part of the DHFGP. Metagenomic knowledge technology has been described beforehand28. Briefly, bacterial DNA was remoted utilizing the AllPrep 96 PowerFecal DNA/RNA package (Qiagen), and libraries have been sequenced on the Illumina NovaSeq 6000 platform. A complete of 320 samples handed QC and have been obtainable for SV calling.

Host genotype knowledge technology was described beforehand56. In short, samples have been genotyped on the International Screening Array SNP chip, and genotype imputation was carried out utilizing Minimac4 with the HRC v.1.1 reference panel. After genotype QC, phenotype, genotype and metagenomic knowledge have been obtainable for 279 samples.

QC of metagenomic sequencing knowledge

We eliminated host-genome-contaminated reads and low-quality reads from the uncooked metagenomic sequencing knowledge utilizing KneadData (v.0.7.4), Bowtie2 (v. and Trimmomatic (v.0.39)58. In short, the data-cleaning process included two most important steps: uncooked reads mapped to the human reference genome GRCh37 (hg19) have been filtered out; and adapter sequences and low-quality reads have been filtered out utilizing Trimmomatic with default settings (SLIDINGWINDOW:4:20 MINLEN:50).

Taxonomic abundance

We estimated the relative abundance of intestine microbial species from the cleaned metagenomic reads utilizing Kraken2 (v.2.1.2)59 together with Bracken (v.2.6.2)60 primarily based on the identical reference genomes included within the database of SGV-Finder, and MetaPhlAn 3 (ref. 61) primarily based on the MetaPhlAn database of clade-specific marker genes (mpa_v30). The primary of those was used within the GWAS evaluation to take away the confounding impact of species abundance, and the final of those was used for the intestine microbiome variety and richness calculation.

Metagenomic SV detection

SVs are extremely variable genomic segments inside bacterial genomes that may be absent from the metagenomes of some people and current with variable abundance in different people. On the premise of the cleaned metagenomic reads, we detected microbial SVs utilizing SGV-Finder with default parameters. SGV-Finder (v.1) was developed and described beforehand20 and may detect two sorts of SV—vSVs and dSVs.

In short, the SV-calling process consists of two most important steps: resolving ambiguous reads with a number of alignments based on the mapping high quality and genomic protection utilizing the iterative-coverage-based learn task algorithm and reassigning ambiguous reads to the most probably reference with excessive accuracy; and splitting the reference genomes of every microbial species into genomic bins and inspecting the protection of genomic bins throughout all samples. For the dedication of dSVs inside every species, the genomic bins are categorized as deleted (protection near 0) or retained (protection near median protection of the genome) bins in every pattern, and people which are deleted in 25–75% of samples are stored within the evaluation as uncooked dSVs. The uncooked dSVs which are extremely correlated in co-occurrence are additional merged into bigger SV areas to supply the ultimate dSV profile. For the dedication of vSVs inside every species, the protection of genomic bins inside every pattern is standardized utilizing the Z-score method. Every bin is then assessed throughout all samples, and people which are extremely variable on the premise of a β′ distribution are stored as uncooked vSVs. The uncooked vSVs which are extremely correlated in standardized protection are additional merged into massive SV areas to supply the ultimate vSV profile.

To outline the genes that belong to the SV area, we expanded the genomic coordinates of SVs 1 kb upstream and downstream, with the genes that overlap with the expanded genomic area thought-about genes that belong to the corresponding SV.

To establish extremely variable genomic segments and detect SVs, we used the reference database supplied by SGV-Finder, which is predicated on the proGenomes database ( We referred to as SVs utilizing default parameters in a bigger panel of 13,195 samples from 10 datasets: 7 inhabitants cohorts (HMP1 (ref. 63), HMP2 (refs. 64,65), DMP8, LLD baseline25,48, LLD follow-up22, 500FG66 and 300TZFG28) and three illness cohorts (300OB67, IBD68 and HIV69). This resulted in 10,265 dSVs and three,931 vSVs. All bacterial species with SV calling have been current in no less than 75 samples. For the present research, we targeted on the 4 Dutch cohorts for which host genetic knowledge have been additionally obtainable: DMP, LLD baseline, 500FG and 300OB. We eliminated samples with <5% of SVs referred to as. After pattern removing, SV and genotype knowledge have been obtainable for 9,015 samples from the 4 cohorts: DMP (n = 7,372), LLD baseline (n = 981), 500FG (n = 396) and 300OB (n = 266).

SV filtering and normalization

First, we carried out filtering per cohort. Solely SVs that have been referred to as in >10% of samples have been used within the analyses. As well as, we eliminated dSVs with a MAF (frequency of both deletion or its absence) <5% and with each reference and different allele depend ≤80 (this quantity was decided on the premise of the advice that the variety of circumstances and controls is >10× the variety of predictors within the generalized linear mannequin affiliation take a look at70; see beneath). Subsequent, we stored solely SVs that have been current in no less than two cohorts. vSV knowledge have been normalized utilizing inverse regular rank transformation for the heritability and affiliation analyses.

Heritability estimation

We estimated SV heritability utilizing the GREML software program from the GCTA toolbox (v.1.94.1). We utilized the family-based method71 applied in GREML on the SV knowledge from the DMP cohort as a result of this cohort has the biggest pattern measurement and comprises family members. A complete of seven,389 samples with genotype and microbiome knowledge have been used for the evaluation. To estimate heritability, we used default settings correcting for age, intercourse, whole metagenomic sequencing learn quantity and species abundance. Heritability estimates for species abundance and the corresponding confidence intervals have been obtained from ref. 8, which estimated heritability on the premise of household relations in the identical DMP cohort.

GWAS and meta-analysis

The manipulation of human genotype datasets was performed utilizing PLINK (model alpha 2.1). Affiliation evaluation was carried out utilizing fastGWA from the GCTA toolbox (v.1.94.1)72, per cohort per SV. For dSVs, we used the generalized linear combined model-based model of the device (–fastGWA-mlm-binary)73. Within the affiliation analyses, we used a sparse genetic relationship matrix (GRM) created from the total GRM constructed on genotyped (non-imputed) SNPs with MAF > 5% utilizing GCTA with default choices (–make-grm and –make-bK-sparse 0.05). The next covariates have been added to the mannequin: age, intercourse, whole metagenomic sequencing learn quantity and centred log ratio (CLR)-transformed species abundance. The overall learn depend was standardized to have a imply of zero and a variance of 1. Meta-analysis was carried out utilizing the Metallic software program (model 2020-05-05)74 with default choices (weighting cohort-based P values based on pattern measurement). To regulate for a number of testing, we utilized the Bonferroni-corrected genome-wide significance threshold (5 × 10−8/SV quantity) and thought of affiliation outcomes with P values beneath this threshold as statistically important. For dSVs, the P-value threshold was 5 × 10−8/1,666 = 3.00 × 10−11. For vSVs, it was 5 × 10−8/1,886 = 2.65 × 10−11.

Affiliation with ABO blood group

We used two approaches to find out the ABO blood group. Within the DMP cohort, we decided the blood group on the premise of three variants (rs8176719, rs41302905 and rs8176746), as described beforehand2. For LLD and 500FG, by which a few of these variants weren’t genotyped, we used a much less delicate method primarily based on two SNPs, rs8176693 (T allele determines blood group B) and rs505922 (T allele determines blood group O), as reported in beforehand revealed papers75,76. Affiliation of blood teams with F. prausnitzii SVs was carried out in R (v.4.1.0) utilizing (generalized) linear combined fashions utilizing the R package deal lme4qtl (v.0.2.2). This package deal permits a kinship matrix to be included as a random impact to account for pattern relatedness. For every cohort, we created a kinship matrix primarily based on a GRM constructed by GCTA utilizing the operate kinship from the R package deal kinship2 (v.1.9.6). We corrected for a similar covariates as within the GWAS as described above. Meta-analysis was carried out utilizing Metallic74.

Inhabitants genetic construction of F. prausnitzii

We calculated an SV-based between-sample microbial genetic dissimilarity primarily based on Canberra distance for every microbial species individually utilizing the vegdist() operate of the R package deal vegan (v.2.6-2) to generate species-specific genetic distance matrices (MSV). We then carried out a principal coordinate evaluation primarily based on MSV utilizing the pcoa() operate of the R package deal ape (v.5.6-2), with the detrimental eigenvalues corrected with Cailliez’s methodology53.

Phylogenetic tree building

For the F. prausnitzii strains with SVs containing the GalNAc utilization gene cluster, we first constructed a phylogenetic tree utilizing the RAxML method primarily based on 81 precisely chosen single-copy marker genes77. We then constructed one other phylogenetic tree utilizing RAxML (v.8) primarily based on the GalNAc utilization genes positioned within the SV area78. The phylogenetic timber have been transformed to between-strain cophenetic distances utilizing the cophenetic() operate from the R package deal stats (v.4.3.0).

The phylogenetic tree proven in Fig. 3c was constructed utilizing CSI Phylogeny 1.4 on the premise of SNPs of whole-genome sequences of the 12 isolates79 and was visualized utilizing the R packages ggtree (v.3.2.1) and gggenomes (v.

Cohousing and SV sharing

Cohousing data on the time of faecal sampling is understood for 8,880 people from the DMP cohort. For this cohort, we eliminated people not cohousing with every other participant and people with no microbial or genetic data. For two,631 contributors, we assessed whether or not any particular person cohousing with them on the time of sampling had F. prausnitzii 577–579. We then used a logistic regression utilizing the presence or absence of 577–579 as a dependent variable and the secretion of A-antigens and the presence of family SV as impartial variables to estimate the impact of the presence of SV within the family on SV presence in a person. We additionally assessed the potential achieve or lack of F. prausnitzii in 338 people whose intestine microbiome was profiled once more after 4 years22. For 119 people, F. prausnitzii SV profiles have been generated at each time factors.

Genomic island prediction

Genomic islands have been predicted by SIGI-HMM81 and IslandPath-DIMOB82 as built-in into IslandViewer 4, a computational device that integrates a number of genomic island prediction strategies83. Each SIGI-HMM and IslandPath-DIMOB have been proven to have excessive general accuracy, with IslandPath-DIMOB having a barely larger recall and SIGI-HMM having a barely larger precision.

Microbial gene annotation

The genes of F. prausnitzii strains and reference genomes used for intestine microbial SV calling have been annotated utilizing MicrobeAnnotator (v.2.0.5)84 and Bakta (v.1.8.1)85. For the annotation of genes encoding glycoside hydrolase household 109 (GH109) in F. prausnitzii and C. aerofaciens strains, we first obtained 2,113 GH109 protein sequences from CAZy ( after which performed a homologue search of GH109 genes within the genomes of F. prausnitzii and C. aerofaciens strains utilizing tblastn (v.2.5.0+)87 with the next parameters: -outfmt 7 -evalue 1e-10.

Homologue search in genes concerned within the GalNAc pathway

We downloaded 10,487 assembled genomes of ABO-associated species from the Unified Human Gastrointestinal Genome assortment33, together with 1,103 assemblies of C. aerofaciens, 484 of F. lactaris, 1,109 of B. bifidum and seven,791 of F. prausnitzii. We then used the sequences of genes positioned in SV 577–579 as queries and carried out a homologue search within the assemblies utilizing tblastn (v.2.5.0+)87 with the next parameters: -outfmt 7 -evalue 1e-10.

Protein household search and profiling with shortBRED

We searched the metagenomes for 27 bacterial proteins recognized within the SV phase of F. prausnitzii (excluding dinB and HTF-238_02530, which have been used as SV area markers and usually are not positioned throughout the SV), together with the genes identified to be concerned in GalNAc metabolism, utilizing the shortBRED toolkit (v.0.9.5)88. We extracted the genes positioned within the SV and transformed the gene sequences to protein sequences, as required by shortBRED. We used the shortBRED device (v.0.9.5) to establish distinctive markers for the question genes, utilizing the UniRef90 database (downloaded on 1 November 2021) as a detrimental management.

Subsequent, the device (v.0.9.5) was used to quantify these markers in metagenomes. First, we assessed the affiliation of those gene abundances with the ABO blood group. We log-transformed the RPKM values supplied by shortBRED and carried out a linear combined mannequin evaluation utilizing shortBRED gene abundances as outcomes and ABO A or AB blood group as a predictor accounting for pattern relatedness utilizing random results within the lme4qtl package deal. We additionally included different covariates as predictors, together with age, intercourse, whole metagenomic sequencing learn quantity and CLR-transformed F. prausnitzii abundance, along with 4 F. prausnitzii dSVs and one vSV discovered to be related to ABO within the major GWAS evaluation.

Subsequent, we estimated the affiliation of gene abundance with the α-diversity (Shannon index and richness) of the intestine microbiome in DMP utilizing linear regression utilizing the next system:

αvariety = SV 577–579 + F. prausnitzii taxonomic abundance + C. aerofaciens taxonomic abundance + gene abundance.

Bacterial strains and progress

The Faecalibacterium and Collinsella strains used on this research have been from tradition collections (ATCC and DSMZ) and our native pressure assortment (Division of Medical Microbiology, College Medical Middle Groningen, Groningen, the Netherlands). On the premise of the presence or absence of SVs, the next Faecalibacterium strains have been chosen: F. prausnitzii A2-165 (DSM 17677), F. prausnitzii ATCC 27768, F. prausnitzii HTF-F (DSM 26943), F. prausnitzii HTF-112, F. prausnitzii HTF-495, F. prausnitzii HTF-238, F. prausnitzii HTF-383, F. prausnitzii 60C2, F. prausnitzii HTF-121, F. prausnitzii HTF-133, F. prausnitzii HTF−441 and F. prausnitzii FM4. Two strains of C. aerofaciens have been chosen on the premise of the presence of the GalNAc genes: C. aerofaciens 4PBA and C. aerofaciens HTF-129.

Strains have been cultured in a modified YCFA medium supplemented with totally different carbohydrates (glucose, galactose, GalNAc, mannose, lactose, fructose, N-acetylglucosamine, 2-fucosyllactose and N-acetylneuraminic acid). YCFA medium was ready as for YCFA–glucose (YCFAG) medium described earlier than89 with out the addition of glucose. YCFA medium was composed of (g l−1) 10 casitone, 2.5 yeast extract, 4 sodium bicarbonate, 0.45 dipotassium hydrogen phosphate, 0.45 potassium dihydrogen phosphate, 0.9 sodium chloride, 0.09 magnesium (II) sulfate heptahydrate, 0.12 calcium chloride dihydrate, 2.7 sodium acetate, 1 cysteine, 5 ml 0.02% resazurin and 0.2% haemin, 1 ml pink vitamin combination and yellow vitamin combination, and the liquid medium. The pink vitamin combination (per 100 ml) comprises 1 mg biotin, 1 mg cobalamin, 3 mg p-aminobenzoic acid, 5 mg folic acid and 15 mg pyridoxamine. The yellow vitamin combination (per 100 ml) comprises 5 mg thiamine and 5 mg riboflavin. The liquid medium consists of 600 µl l−1 propionate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isobutyrate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isovalerate (≥99% purity, Sigma-Aldrich) and 100 µl l−1 valerate (≥99% purity, Sigma-Aldrich). The medium is adjusted to a closing pH of 6.5.

Development experiments have been carried out in a Bactron 600 anaerobic incubator (Kentron Microbiome BV) utilizing a 24-well flat-bottom-plate with whole quantity of 1 ml per effectively YCFA medium supplemented with 4.5 g l−1 of the specified carbohydrate supply. Cultures have been began at an preliminary OD600nm vary of 0.10–0.15 by the addition of an in a single day glucose-grown pre-culture, and progress was monitored anaerobically at 600 nm over 24 h at 37 °C. Readings have been taken each 2 h, after 10 s shaking, utilizing Epoch 2 (Agilent BioTek Devices), and progress curves have been generated utilizing Gen5 software program. Every progress situation was carried out in triplicate utilizing three impartial pre-cultures. Information of progress curves are reported as means ± s.d.

Gene expression evaluation of GalNAc induction

Pattern assortment

The F. prausnitzii strains HTF-495, HTF-441 and ATCC 27768 have been chosen to check the mRNA expression degree of genes on the premise of the shortest distance throughout the phylogenetic tree. The F. prausnitzii strains have been pre-cultured individually in YCFAG medium in a single day anaerobically at 37 °C in triplicate. To get sufficient biomass, these pre-cultures have been used to inoculate recent triplicates of every pressure in a ratio of 1:20 (20 ml) and incubated for twenty-four h anaerobically at 37 °C in YCFAG medium. Every tradition was then cut up into two tubes (10 ml per tube) and centrifuged at 3,000 r.p.m. for 10 min. The supernatants have been eliminated and resuspended with 10 ml YCFAG or YCFA-GalNAc, individually for every tradition, in a complete of 18 samples. After 6 h of incubation, a 1:1 ratio (10 ml) of ice-cold killing buffer (20 mM Tris-HCl pH 7.5, 5 mM MgCl2, 20 mM NaN3) was added to the cultures. Samples have been centrifuged at 3,000 r.p.m. for 10 min at 4 °C, and the supernatants have been eliminated. The pellets have been resuspended in 1 ml TRIzol (Invitrogen) and saved at −80 °C till additional RNA isolation.

RNA isolation and cDNA synthesis

For RNA isolation, 200 µl of RNAse-free chloroform was added to every pattern and incubated at room temperature for five min. After incubation, the samples have been centrifuged at 12,000g at 4 °C, and the aqueous part was recovered into a brand new tube. To precipitate RNA, 500 µl of RNAse-free isopropanol was added to every pattern and combined briefly. Samples have been incubated for 10 min at room temperature and centrifuged for 10 min at 12,000g and 4 °C. The supernatant was eliminated, and the pellets have been washed in 1 ml of 75% RNAse-free ethanol, vortexed briefly and centrifuged for five min at 7,500g at 4 °C. The supernatant was eliminated, and the pellets have been air-dried at room temperature for 10 min. Afterward, the samples have been resuspended with RNAse-free water.

Lastly, DNA contamination was faraway from 10 µg of the pattern utilizing TURBO DNA-free Equipment (Invitrogen). cDNA was generated utilizing the TaqMan Reverse Transcription Reagents (Invitrogen) with random hexamers.

Quantitative PCR

Samples have been diluted to working focus and used as a template for quantitative PCR (qPCR) amplification of the goal genes (for primers, see Supplementary Desk 20). Every response contained 10 μl of GoTaq qPCR Grasp Combine (Promega), 9 μl of DNA template (10 ng) and two instances 0.5 μl primer answer (20 µM) in a complete response quantity of 20 μl. The amplification was carried out in a 7500 Actual-Time PCR System (Utilized Biosystems). The amplification program comprised two phases: an preliminary denaturation step at 95 °C for two min, adopted by 40 two-step cycles at 95 °C for 15 s and at 60 °C for 1 min. On the finish of the run, a melting curve evaluation was carried out. The cycle threshold (Ct) worth was first decided utilizing the 7500 Actual-Time PCR System detection system after which adjusted manually to set the edge throughout the exponential part of the curves. All qPCR reactions have been carried out in triplicate. TheΔCt values of the genes of curiosity have been obtained by correction for the Ct worth of rpoA because the housekeeping gene. Afterward, the totally different ({2}^{-Delta {C}_{{rm{t}}}}) values of every pressure have been calculated per situation. These values have been used to find out the relative fold change expression of the genes after GalNAc induction in comparison with progress in glucose.

Moral approval

The Lifelines research was accredited by the ethics committee of the College Medical Middle Groningen (METc2007/152). All contributors signed an knowledgeable consent type earlier than enrolment. Extra written consents have been signed by the DMP contributors or authorized representatives for youngsters aged underneath 18 years. The LLD research was accredited by the Institutional Ethics Evaluation Board of the College Medical Middle Groningen (ref. M12.113965), the Netherlands. The 300OB research was accredited by the IRB CMO Regio Arnhem-Nijmegen (quantity 46846.091.13). The 500FG research was accredited by the Moral Committee of Radboud College Nijmegen (NL42561.091.12, 2012/550). The inclusion of volunteers and experiments was performed based on the rules expressed within the Declaration of Helsinki. All volunteers gave written knowledgeable consent earlier than any materials was taken. The 300FGTZ research was accredited by the Moral Committees of the Kilimanjaro Christian Medical College Faculty (CRERC; quantity 936) and the Nationwide Institute for Medical Analysis (NIMR/HQ/R.8a/Vol. IX/2290) in Tanzania. The Tanzanian cohort supplied consent for the usage of their knowledge for the needs of this evaluation.

Reporting abstract

Additional data on analysis design is offered within the Nature Portfolio Reporting Abstract linked to this text.

Latest articles

Related articles

Leave a reply

Please enter your comment!
Please enter your name here