Analysis of somatic mutations identifies signs of selection during in vitro aging of primary dermal fibroblasts

Abstract Somatic mutations are critical for cancer development and may play a role in age‐related functional decline. Here, we used deep sequencing to analyze the prevalence of somatic mutations during in vitro cell aging. Primary dermal fibroblasts from healthy subjects of young and advanced age, from Hutchinson–Gilford progeria syndrome and from xeroderma pigmentosum complementation groups A and C, were first restricted in number and then expanded in vitro. DNA was obtained from cells pre‐ and post‐expansion and sequenced at high depth (1656× mean coverage), over a cumulative 290 kb target region, including the exons of 44 aging‐related genes. Allele frequencies of 58 somatic mutations differed between the pre‐ and post‐cell culture expansion passages. Mathematical modeling revealed that the frequency change of three of the 58 mutations was unlikely to be explained by genetic drift alone, indicative of positive selection. Two of these three mutations, CDKN2A c.53C>T (T18M) and ERCC8 c.*772T>A, were identified in cells from a patient with XPA. The allele frequency of the CDKN2A mutation increased from 0% to 55.3% with increasing cell culture passage. The third mutation, BRCA2 c.6222C>T (H2074H), was identified in a sample from a healthy individual of advanced age. However, further validation of the three mutations suggests that other unmeasured variants probably provide the selective advantage in these cells. Our results reinforce the notions that somatic mutations occur during aging and that some are under positive selection, supporting the model of increased tissue heterogeneity with increased age.


| INTRODUC TI ON
The aging process is characterized by a decline in tissue homeostasis, likely due in part to somatic mutations that arise and spread into future cellular generations as a consequence of mitosis (Failla, 1958;Szilard, 1959). Contrary to the assumption that all cells from a healthy individual have exactly the same genomic contents, each tissue of a given organism is actually a mosaic of cells with different mutational loads. Somatic mutation burden increases during postnatal tissue aging (Alexandrov, Nik-Zainal, Wedge, Campbell, & Stratton, 2013;Blokzijl et al., 2016;Forsberg et al., 2012;Jacobs et al., 2012;Laurie et al., 2012;Lo Sardo et al., 2016;Martincorena et al., 2015;Milholland, Auton, Suh, & Vijg, 2015) as a result of normal and abnormal DNA replication, defective DNA repair, and exogenous and endogenous exposures that cause DNA damage. While somatic mutations are well studied in the context of cancer, the consequences of mosaicism for age-related degeneration of generally healthy tissues and for etiology of other disease processes are less well understood (Jacobs et al., 2012;Laurie et al., 2012;Milholland et al., 2015;Vijg & Suh, 2013).
Efforts to characterize somatic mosaicism have been hindered by technical limitations of DNA sequencing assays to detect lowfrequency genetic variation within a tissue sample. In addition, low-frequency somatic mutations have often been considered of limited biological significance, as long as the tissue seems to be functioning properly (Vijg & Suh, 2013). However, studies enabled by advances in deep sequencing have improved our understanding of the extent and effects of intra-individual somatic genetic variation (Biesecker & Spinner, 2013;Forsberg et al., 2012;Jacobs et al., 2012;Jamuar et al., 2014;Laurie et al., 2012;O'Huallachain, Karczewski, Weissman, Urban, & Snyder, 2012;Roberts et al., 2015). Age-related structural genetic changes and an increased fraction of cells with somatic mosaicism of large structural variants with increased age following serial sampling from the same individual argue for somatic variation as a contributor to tissue aging (Forsberg et al., 2012;Schick et al., 2013;Bonnefond et al., 2013;Machiela et al., 2015).

Hutchinson-Gilford progeria syndrome (HGPS, OMIM 176670)
is a very rare disorder with several clinical features suggestive of premature aging. Xeroderma pigmentosum, complementation groups A and C (XPA and XPC, OMIM 278700 and 278720) are characterized by increased sensitivity to ultraviolet light and high incidence of skin cancer. Impaired DNA repair has been found in HGPS, XPA, and XPC and is believed to contribute to disease pathogenesis (Musich & Zou, 2009).
In this study, we investigate whether the frequency spectrum of genetic variation changes during in vitro aging, and whether that is affected by donor age, the presence of a germline mutation that promotes accelerated aging or by the presence of germline mutations that cause DNA repair deficiency. We used a deep sequencing approach to analyze intra-individual genetic variation at the single nucleotide level in 44 age-related candidate genes. Although deep whole-genome sequencing provides high resolution and the ability to resolve rearrangements to the base pair, identifying lower level somatic changes from mosaic samples still remains a challenge. To circumvent this problem and to allow for an increased depth of analysis and sensitivity, we used target enrichment of genomic regions in combination with DNA extracted from both pre-and post-cell culture expansion (early and late passage) primary cells from unaffected subjects of young and advanced age, and subjects with HGPS, XPA, and XPC. We identified a strongly enriched mutation in CDKN2A in an XPA patient and revealed a wide-spread accumulation of low-level somatic mutations during the in vitro cell aging process.

| Limited cell culture expansion and DNA preparation
For the limited primary fibroblasts expansion, the cells were plated in individual wells of 200 cells per well and expanded in culture for 13-17 cell divisions (Figure 1a). The early (pre-cell culture expansion) and late (post-cell culture expansion) passage populations were defined as follows: early passage DNA was collected from cells at the same time point as when cells were harvested for plating for cell culture expansion. Late passage DNA was collected from the cells that had been expanded in culture for 13-17 cell divisions (Figure 1a). DNA was extracted using the Gentra Pure Gene Cell kit (Qiagen) according to the manufacturer's recommendations, but without the 65°C incubation step.

| Target enrichment
NimbleGen SeqCap EZ Choice Library (Roche, Inc.) was used to capture a genomic target region of 290 kb. The target gene panel was made up of the coding region (exons and exon-intron boundaries) of 44 genes including genes involved in cell cycle regulation, DNA repair, telomere maintenance, and the nuclear lamina (Table 2; Table S1).

| Library preparation and sequencing of captured DNA
The library preparation was performed as per the manufacturer's pro-

| Sequence analysis and somatic variant calling
We aligned the sequence reads of each library to the GRCh37 ref- We took an empirical approach to discover somatic mutations that have gone through variant frequency change from one passage to the other. First, we used "lofreq somatic" (Wilm et al., 2012) to identify candidate somatic mutations that were detected in one passage, but not in the other of an individual (late passage compared to early passage or vice versa) above the level of sequencing TA B L E 2 Exons from the genes listed and the intron 11 of LMNA gene were targeted for sequencing using a hybridization selection array with quality score <30 were excluded from any coverage counts.
Finally, each SM-V was manually checked in the sequence reads using "samtools tview" to ensure no excess of mutations were observed in a 50 bp neighborhood. ANOVAR (Wang, Li, & Hakonarson, 2010) was used to annotate the functional significance of the variants based on the hg19 UCSC known gene database.
The Bayesian based most probable genotype calling algorithm implemented in mpg (Teer et al., 2010) was used to call diploid genotypes of each library from the sequence reads with mapping score ≥30 and bases with quality score of ≥30. Genotypes with mpg quality score ≥10 were used for further analyses. Genotypes of early and late passage libraries of all individuals were compared pairwise to ensure no sample swap or DNA mishandling occurred.

| Assessing limit of detection for mosaicism
To evaluate the sensitivity of our approach in calling low-level somatic mutations, we generated two mosaic control libraries by diluting DNA from one individual (host DNA, AG10579) with DNA from a second subject (doped-in DNA, AG07091; Table S3). These two control libraries were prepared and sequenced to a comparable depth to the other experimental samples described above.  ). We excluded SNPs with probe alignment problems or with known variants within 7 bp of the 3′ end of probes.
Based on the mutations identified in the mosaic DNAs and diploid genotypes of the two samples involved, we calculated sensitivity (true-positive rates, proportion of sites with variant alleles out of the expected ones) and specificity (true-negative rates, proportion of sites with no mutation out of the sites where no variant allele is expected).

| Mathematical modeling of genetic drift
In order to assess whether the observed change of variant frequency between the early and late passage cells per individual/patient at a given site was due to cell growth expected under the random process of genetic drift, we carried out an in silico experimental process of cell growth. We calculated the probability, given the early passage

| Validation of somatic mutations
Digital droplet PCR (ddPCR, Bio-Rad) was used to validate rare somatic mutations identified during deep sequencing (Hindson et al., 2011). Primers and probes for ddPCR were designed by the technical support at Bio-Rad (Table S4). ddPCR was performed on the same batch of genomic DNA that was used for the library preparation using 900 nM forward and reverse primers, and 250 nM mutant and wild-type genomic probes. DNA fragmentation by restriction enzyme digest in the ddPCR was achieved by the addition of HindIII (5 U/reaction; New England Biolabs) and incubation at RT for 5 min.
The PCR was performed with annealing/extension temperatures of 52.4-56.2°C for 40 cycles. The QX200 Droplet Digital PCR Systems (Bio-Rad) were used for droplet generation and analysis.
For two mutations, we were unable to use functional ddPCR assays due to the nucleotide composition of the surrounding sequence (Table S5). The CDKN2A c.53C>T mutation was thus validated using Sanger sequencing. PCR primers used to amplify the region containing the mutation included M13 tags and were as followed: forward 5′-GTA A A ACGACGGCCAGTCCGTA ACTATTCGGTGCGTTG-3′ and reverse 5′-GGAAACAGCTATGACCATGTAATAGCACCTCCTCC GAGC-3′. TERT c.121C>A mutation was validated by restriction fragment length polymorphism analysis (PCR-RFLP) using BsrBI restriction enzyme (5 U/reaction; New England Biolabs). The region containing the mutation was amplified prior to digestion using the following primers: forward 5′-GAGCCACCAGCACAAAGAG-3′ and reverse 5′-GCGCGAGTTTCAGGCA-3′.

| Deep sequencing
In this study, we developed a model to study the extent to which somatic mutations occur in exons of a selection of genes across the genome during in vitro aging of human skin fibroblasts (Table 2,   Table S1). Primary dermal fibroblasts from two different age-groups of healthy individuals (11-13 and 85-94 years of age) were included (Table 1). In addition, patient samples from genetic disorders with accelerated aging and/or genomic instability, including HGPS, XPC, and XPA, were included, as positive controls, anticipating the possibility of increased somatic mutation events in these patients (Table 1; Lodato et al., 2018;Mukherjee & Costello, 1998) To limit the initial cellular somatic genotypes and to explore the possibility of selection, we introduced an artificial bottleneck: two hundred early passage cells were sub-cultured and expanded in vitro for 13-17 cell divisions (Figure 1a). In initial attempts, smaller starting populations of cells were tested, but impaired growth of patient cells showed that 200 cells were the minimum number of cells needed to be able to obtain sufficient amounts of late passage cells for all the samples.
Target gene capture and high-throughput sequencing resulted in an average of 34.33 million mapped reads per sample (Table S2).
After removal of PCR duplicates and reads derived from outside our targeted region, there were 6.6 million mean unique reads within the region of interest across all the samples. The mean read depth across all samples was 1656× (in a range of 661-2856×; Figure 1b).

| Assessment of limits of mutation detection
In our two mosaic DNA libraries with 0.2% or 0.4% of doped-in second DNAs, we obtained sensitivity ≥74% in detecting variant alleles (

| Detection of variable frequency somatic mutations (SM-Vs)
In the two passages of 11 subjects, "lofreq somatic" identified 57 somatic mutations that were absent in one passage from an individual, but present in the other passage. In addition, one mutation that was present in both passages of an advanced age healthy subject (AG13077), but had a significant difference in allele frequency, was retained by our manual check (BRCA2, chr13:32914714, c.6222C>T, H2074H). These 58 SM-Vs are summarized in Table S6. Further evaluation of the relevant sequence reads did not find anomalies in the reads and bases involving these mutations.
Eleven SM-Vs with varying allele frequencies (ranging from 0.66% to 55.3%) were selected among the variants that differed in allele frequency between early and late passage for validation by ddPCR (Table S5). DNA used for the library preparation was also used for the validation to test for artifactual variants included during the library preparation. Eight of the mutations were validated by ddPCR. The fractional abundances obtained were in a similar range to that seen with deep sequencing (Table S5). One of the mutations, ERCC3 c.505G>T, p.V169F, was not detected by ddPCR (Table S5).
Two of the mutations failed ddPCR due to technical limitations with the assays and were instead validated by Sanger sequencing and RFLP analysis. The results from the Sanger sequencing and RFLP analysis confirmed the presence of the mutations (Table S5). Hence, all but one of eleven selected mutations was validated by an alternative method.  (Blokzijl et al., 2016;Franco et al., 2018;Lodato et al., 2018;Osorio et al., 2018). But while cells from aged individuals may have accumulated more mutations, there is conflicting evidence whether the rate of acquiring new mutations goes up or not (Milholland et al., 2015;Podolskiy, Lobanov, Kryukov, & Gladyshev, 2016). It is entirely possible that in our study, the number of cell divisions that occurred during in vitro aging was too low to induce enough somatic mutations to reliably distinguish between HGPS, young age healthy, and advanced age healthy samples, which are expected to have small differences in genomic integrity. In addition, our study only targeted a small region of the genome.  (Table S6).
The dinucleotide mutations were equally distributed between CC>TT (3/6) and GG>AA (3/6). The CC>TT mutations were only found in XP cells, which is in agreement with previous data that these tandem mutations are induced by UV radiation exposure or oxidative stress, and occur at a higher frequency in XP patients that have deficient nucleotide excision repair (Marteijn, Lans, Vermeulen, & Hoeijmakers, 2014). Five of the mutations had previously been found in different types of tumors, as reported in the COSMIC database (Table S6).

| Modeling of bottlenecks and genetic drift
For each of the 58 SM-Vs identified by lofreq, we assessed the evidence against a null model of genetic drift. Twenty of the mutations showed nominally significant departure from this null hypothesis, but only three of them showed significant evidence after correcting for multiple testing (p-value < 1.7 × 10 -7 ; Table 3, Table   S6). All three of these significant mutations increased in frequency from early to late passage. Two of the three mutations were not detected in the initial sequenced sample, thus likely having a very low initial allele frequency. The most significant p-value was for mutation chr9:21974774:A (CDKN2A c.53C>T), with a variant allele frequency changes from 0% to 55.3% (Table S6). While the remaining 55 SM-Vs did not pass a Bonferroni-corrected significance level, 36 SM-Vs had decreased variant allele frequency during cell growth, and the remaining ones had higher frequency in the late passage cells.
For the three SM-Vs, we estimated the maximum likelihood of the selection coefficient. All three of the variants have estimates of strong positive selection (s > 0.25) ( Table 3). The variant with highest significance value has the strongest selection coefficient estimate of 0.5. For these variants, we also calculated the 95% confidence intervals of the selection coefficient.

| Functional significance of detected mutations
Following mathematical modeling, three mutations remained that had allele frequencies that differed and could not be explained solely by genetic drift (Table 3). The most striking differences in allele frequencies between early and late passage (0% and 55.3%) were observed for the CDKN2A c.53C>T, p.T18M mutation during in vitro aging of dermal fibroblasts from an XPA patient (Tables S5 and S6). The mutation is a C>T substitution in a CpG dinucleotide, located in a CpG island (ENCODE Consortium, 2012, https ://genome.ucsc.edu). CpG islands are considered mutational hotspots (15 times higher mutation rate than the genome-wide average) resulting from spontaneous deamination of 5-mC that resolves as a C>T transition mutation (Kim, Elango, Warden, Vigoda, & Yi, 2006;Kondrashov, 2003). This mutation is located in the part of the CDKN2A gene that is not highly mutated in cancer (https ://cancer. sanger.ac.uk/cosmic), although it has been identified in one liver carcinoma case and suggested it might lead to CDKN2A inactivation (Anzola, Cuevas, Lopez-Martinez, Martinez de Pancorbo, & Burgos, 2004).
Missense mutation prediction algorithms (Polyphen-2, http://genet ics. bwh.harva rd.edu/pph2/; SIFT, http://sift.bii.a-star.edu.sg), however, point toward a minimal effect on protein function. While these predictions do not definitively exclude the possibility of the variant affecting protein function, that seems unlikely. Further bioinformatic analysis of the genetic region indicates that the mutation is located in a DNase hypersensitive site (DHS) (ENCODE), which suggests the presence of a regulatory element. The mutation also lies within the binding site of several transcription factors, including EZH2. EZH2 is a known inhibitor of the INK4A-ARF pathway (it has the highest rank for this particular position), and its binding has also been confirmed in several fibroblast cell lines (Bracken et al., 2007;ENCODE Consortium, 2012). So even though the CDKN2A c.53C>T, p.T18M mutation changes the amino acid sequence, we considered that its functional consequence might also relate to an effect on transcription. With that in mind, we selectively evaluated P16INK4A transcript expression in early and late passage samples from this XPA patient and compared it to P16INK4A expression in one young age healthy and one advanced age healthy individual The CDKN2A mutation was frequently coincident in the same patient (GM02990) with another mutation, ERCC8 c.*772T>A, located in the 3′UTR. The ERCC8 mutation differed in allele frequencies between early and late passage (0% and 15.7%, respectively), obtained from the fractional abundances during ddPCR validation (Table S5).
BRCA2 c.6222C>T, p.H2074H was the third SM-V that was indicated by its change in allele frequencies to provide a selective advantage over the growth (1.59% and 21.3%, respectively, obtained from the fractional abundances during ddPCR validation, Table S5).
This mutation was found in a sample from a healthy subject of ad-

| CON CLUS IONS
In this work, we designed a method based on deep targeted sequencing to detect rare variants that occur during in vitro aging. An important part of the work was to develop a population genetics model for tracking somatic variants in cell culture. We used this to estimate the statistical significance of somatic mutations with major differences in allele frequency between early and late cell passages.
Not unexpectedly, modeling indicated that most mutations (55/58 SM-Vs) are indistinguishable from random events based on genetic F I G U R E 2 CDKN2A gene and its expression in the presence or absence of the c.53C>T mutation. (a) Schematic representation of the genomic structure and transcripts of the CDKN2A gene locus with c.53C>T mutation and primer locations. The exons are shown as rectangles and primers as arrows. Primers 1 and 5 were used for RFLP analysis (see figure c), primers 3 and 4 were used for ddPCR analysis (see figure b), and primers 2 and 5 were used for additional ddPCR analysis (see Figure S1). An EagI restriction site is present in a wildtype carrier, but absent in the individual with the c.53C>T mutation. The size of the locus, transcripts, exons, and introns is not shown to scale. Of note, the size of the intron between exon 1a and exon 2 is approximately 3.5 kb. (b) Comparison of CDKN2A transcript copies after normalization to GAPDH in early and late passage primary fibroblasts from young and advanced age healthy subjects without the c.53C>T mutation and in early and late passage primary fibroblasts from the XPA patient where the c.53C>T mutation is present at an allele frequency of 55.3% in the late passage. Primers 3 and 4 were used for this ddPCR analysis. (c) RFLP analysis of CDKN2A transcript in early and late passage primary fibroblasts from young age healthy and advanced age healthy subjects without the c.53C>T mutation and in early and late passage primary fibroblasts from XPA patient where the c.53C>T mutation is present at an allele frequency of 55.3% in the late passage only. Primers 1 and 5 were first used to amplify the cDNA followed by digestion with restriction enzyme EagI (indicated by +). Part of the amplified sample was used as an undigested control (indicated by −). Undigested PCR product is 658 bp in size, while digested fragments are 274 bp and 384 bp in size drift; nevertheless, we identified three mutations with significant allele frequency changes that might represent a signature of positive selection. It is not surprising that two out of these three de novo mutations were identified in primary fibroblasts with inherited DNA repair deficiency. The high allele frequencies in these lines represent a signature of a clonal event, but in each instance, it seems most likely that another mutational event, outside the target area of our genome sequencing, represents the actual driver. Future studies are needed to further define the significance of somatic mutations in cells of various lineages for a wide range of disorders and aging.
This work was limited to a selected set of genes associated with aging and to in vitro induced aging that spanned relatively few cell divisions. Subsequent studies will need to take advantage of the increased availability and low cost of DNA sequencing to look at these issues across the entire genome of multiple different cell types. would also like to thank the National Intramural Sequencing Center Comparative Sequencing Program for sequence generation.

CO N FLI C T O F I NTE R E S T
Authors declare that there is no conflict of interest.

DATA ACCE SS I B I LIT Y
BAM files of DNA sequence reads for all samples used in the paper have been deposited in NCBI dbGaP database with the accession code phs001867.v1.p1 and are available via the repository's access request procedures.
F I G U R E 3 A model for the reoccurrence of somatic mutations in the same cell from an XPA patient. Both the CDKN2A c.53C>T, p.T18M and ERCC8 c.*772T>A mutations had changes in allele frequencies (0%-55.3% and 0%-15.7%, respectively) that suggest they occurred in a cell with a strong competitive advantage. Based on the allele frequencies, we propose that CDKN2A mutation occurred at the very beginning of the in vitro aging while the ERCC8 mutation occurred later on during cell culturing, in a cell that already encountered another unknown driver event