Hidden endemism, deep polyphyly, and repeated dispersal across the Isthmus of Tehuantepec: Diversification of the White‐collared Seedeater complex (Thraupidae: Sporophila torqueola)

Abstract Phenotypic and genetic variation are present in all species, but lineages differ in how variation is partitioned among populations. Examining phenotypic clustering and genetic structure within a phylogeographic framework can clarify which biological processes have contributed to extant biodiversity in a given lineage. Here, we investigate genetic and phenotypic variation among populations and subspecies within a Neotropical songbird complex, the White‐collared Seedeater (Sporophila torqueola) of Central America and Mexico. We combine measurements of morphology and plumage patterning with thousands of nuclear loci derived from ultraconserved elements (UCEs) and mitochondrial DNA to evaluate population differentiation. We find deep levels of molecular divergence between two S. torqueola lineages that are phenotypically diagnosable: One corresponds to S. t. torqueola along the Pacific coast of Mexico, and the other includes S. t. morelleti and S. t. sharpei from the Gulf Coast of Mexico and Central America. Surprisingly, these two lineages are strongly differentiated in both nuclear and mitochondrial markers, and each is more closely related to other Sporophila species than to one another. We infer low levels of gene flow between these two groups based on demographic models, suggesting multiple independent evolutionary lineages within S. torqueola have been obscured by coarse‐scale similarity in plumage patterning. These findings improve our understanding of the biogeographic history of this lineage, which includes multiple dispersal events out of South America and across the Isthmus of Tehuantepec into Mesoamerica. Finally, the phenotypic and genetic distinctiveness of the range‐restricted S. t. torqueola highlights the Pacific Coast of Mexico as an important region of endemism and conservation priority.


| INTRODUCTION
Variation is ubiquitous in nature. Yet lineages differ in how phenotypic and genetic diversity are partitioned among individuals, populations, and species. This complexity motivates long-standing efforts to identify the evolutionary and ecological forces that generate and maintain biodiversity across temporal and spatial scales. Phylogeography quantifies genetic and phenotypic partitioning within and among species to provide insight into the patterns and processes that underlie population differentiation, speciation, and diversification (Avise, Arnold, Ball, & Bermingham, 1987;Brito & Edwards, 2008;Edwards, Potter, Schmitt, Bragg, & Moritz, 2016). Examining concordance (or lack thereof) between genetic and phenotypic partitions in phylogeographic data sets can reveal the contributions of historical demographic events and selective pressures toward extant patterns of biodiversity (García-Moreno, Navarro-Sigüenza, Peterson, & Sánchez-González, 2004;Zamudio, Bell, & Mason, 2016), while large genomic data sets continue to provide new insights into the origins of biodiversity (Ekblom & Galindo, 2010;McCormack, Hird, Zellmer, Carstens, & Brumfield, 2013;Toews, Campagna, et al., 2016).
Most taxonomic authorities have long recognized between three and five geographically restricted subspecies within S. torqueola, which differ primarily in male plumage patterning and coloration (Clements et al., 2016;Gill & Donsker, 2017 (Figure 1). We also use a large panel of molecular markers to examine how genetic variation is partitioned among individuals and populations within this complex. Finally, we consider the biogeographic history of the S. torqueola complex in the broader phylogenetic context of related taxa in the Neotropical subfamily Sporophilinae.

| Morphological measurements
We measured standard phenotypic characters to examine geographic and taxonomic variation in morphology and plumage. We recorded a set of standard measurements (culmen length, bill length from the gonys, depth of the bill at the nostril, width of the bill at the nostril, wing chord length, tarsus length, hallux length, and length of the central rectrix). We also assessed multiple presence-absence plumage characters, including a partial eye ring, primary wing bars, and white edging on the secondaries and tertials of the wing feathers, and a white spot at the base of the primaries. Measurements and plumage scores of 663 specimens of S. torqueola, including at least 25 individuals of both sexes from all three groups (Appendix S1), were all conducted by the same observer (Olvera-Vital). We performed a principal component analysis (PCA) and examined the dispersion of measured specimens corresponding to the three groups for both males and females. We also conducted a multinomial logistic regression to evaluate the diagnosability of each group using both the continuous morphological measurements and presence-absence plumage characters.

| Sampling, DNA extraction, target capture, and sequencing
We sampled 68 vouchered specimens for molecular analyses, including representatives of all three groups within S. torqueola (Figure 1; Table S1). Our sampling also included four S. minuta individuals as an outgroup (Mason & Burns, 2013). We extracted genomic DNA from tissues or museum specimen toe pads using a Qiagen DNeasy Blood and Tissue kit. We quantified each extraction using a QuBit fluorometer (Life Technologies, Inc.), diluted or concentrated each sample to approximately 20 ng/μl, and visualized genomic DNA on an agarose gel to assess potential degradation.
We performed sequence capture of ultraconserved elements (UCEs) to acquire a reduced-representation genomic data set (Mamanova et al., 2010). This method captures DNA fragments from thousands of orthologous loci using synthetic probes that were generated from alignments of amniote genomes, including chicken and Anolis . Once orthologous loci have been successfully captured and recovered, single nucleotide polymorphisms and indels that are present in regions adjacent to the UCEs are called for use in downstream analyses. While UCEs were first used to resolve deeper evolutionary relationships McCormack et al., 2012;McCormack, Harvey, et al., 2013), various studies have demonstrated their potential to infer evolutionary processes within more recent evolutionary time scales (Giarla & Esselstyn, 2015;Harvey, Smith, Glenn, Faircloth, & Brumfield, 2016;Smith, Harvey, Faircloth, Glenn, & Brumfield, 2014).
We processed 24 of our samples using published, open-source protocols for library preparation and UCE enrichment http://ultraconserved.org). In brief, we sheared 100 μl of each DNA extraction after normalizing concentrations to a distribution of 200-400 bp using a Bioruptor sonicator (Diagenode).
We did not shear degraded samples, including extractions from toepads of museum specimens. We constructed genomic libraries for each sample using KAPA LTP library preparation kits with custom PCR barcodes . We pooled indexed libraries into groups of eight and enriched each pool using a set of 5,472 probes that targeted 5,060 UCE loci (Tetrapods-UCE-K5v1 probe set; http://ultraconserved.org). We performed an 18-cycle PCR-recovery and combined libraries in equimolar ratios to achieve a final concentration of 10 μM. We send the pooled libraries to the UCLA Genoseq Facility for 250 bp paired-end sequencing on one lane of an Illumina MiSeq. For the remaining 44 samples, we sent DNA extractions to RAPiD Genomics (http://www.rapid-genomics. com) for enrichment of the same probe set to be run on 25% of a shared, single lane of Illumina HiSeq2000.
Next, we assembled reads into contigs using Velvet (Zerbino, 2002) in combination with VelvetOptimiser (Zerbino & Birney, 2008), which tested the assembly performance of hash lengths that varied from 67 to 75. Following contig assembly, we aligned reads to probes and their corresponding contigs using BWA ), created indexed bam files with SAMtools , and called phased SNPs and indels via GATK (DePristo et al., 2011;McKenna et al., 2010). We retained only high quality SNPs (Phred Score ≥ Q30) and prepared input files for downstream population genetic analyses.
Mitochondrial DNA is often present among postcapture libraries due to off-target capture (Mamanova et al., 2010). We assembled mtDNA sequences using the program MITObim (Hahn, Bachmann, & Chevreux, 2013), which uses an iterative baiting method to generate mtDNA contigs. We used a published sequence of cytochrome b (cyt b) as our initial reference (GenBank accession JN810151). To examine the evolutionary relationships of the samples included this study with respect to other species in the Sporophila genus, we also downloaded existing cyt b sequences for other species (Burns et al., 2014;Mason & Burns, 2013). We aligned our cyt b assemblies with MAFFT (Katoh & Standley, 2013) for downstream analyses.

| Phylogenetic and population genetic analyses
We first conducted phylogenetic analyses on all individuals using the UCE data set. For the phylogenetic analyses, we extracted the 1,000 UCE loci with the highest number of informative sites using AMAS (Borowiec, 2016) and custom R scripts. We used RAxML (Stamatakis, 2006) with the "-f a" run option that finds the best tree and performs rapid bootstrapping in the same run with a separate partition for each UCE locus. We performed the same RAxML analyses on the cyt b F I G U R E 1 Map showing sampling localities of samples used in this study. Sampling points have been jittered to help visualize the density of samples at localities with more than one sample. Sample numbers are shown next to each species' portrait. See Table S1 for more detailed information on sampling localities used in genetic analyses alignments of mtDNA. We ran these analyses on an alignment matrix that included outgroup individuals and one that omitted outgroup individuals.
We then performed a Discriminant Analysis of Principal Components (DAPC; Jombart, Devillard, & Balloux, 2010) on the data set with outgroups as well as with outgroups removed using the adegenet package in R (Jombart & Ahmed, 2011). For the data set with outgroups, we removed individuals with more than 85% missing data and loci that had more than 25% missing data to generate a data matrix of 54 individuals and 1,358 loci. For the data set without outgroups, we removed individuals with more than 80% missing data and loci with more than 75% missing data to generate a matrix of 50 individuals and 4,067 loci. For both data sets, we retained principal component axes that together accounted for at least 70% of the total variation and constructed a scatterplot of the first two principal component axes to examine population structure.
We ran STRUCTURE (Pritchard, Stephens, & Donnelly, 2000) on the full data matrix to evaluate different population assignment models. We ran STRUCTURE with different K values (i.e., number of populations) from 1 to 6 with 10 replicate runs for each K value. Each run included 500,000 generations preceded by 10,000 burn-in steps in the MCMC chain. We identified the optimal K value by examining changes in log-likelihood scores and variance in log-likelihoods among runs (Evanno, Regnaut, & Goudet, 2005). After identifying the optimal K value for the full data matrix, we ran STRUCTURE with the same settings on the data matrix with outgroups removed and separately on two prominent clusters identified in our data to examine possible population substructuring that might be obscured by deeper, hierarchical splits. For these nested STRUCTURE analyses, we removed one misidentified female from the S. t. moreletti group (see RAxML and full STRUCTURE results below); female Sporophila are often difficult to identify (Mason & Burns 2013;de Schauensee, 1952) and collectors occasionally misidentify females, such as this individual. We also removed loci that were not polymorphic or had more than 50% missing data within either population cluster. This generated a panel of 12510 loci and 44 individuals for the S. t. morelleti cluster and 12,912 loci and 19 individuals for the S. t. torqueola cluster.
We subsampled the five individuals with the least amount of missing data from S. t. torqueola and S. t. morelleti in combination with the three S. t. sharpei and four outgroup samples. We restricted our data set to biallelic loci with a minor allele frequency >0.1 and <10% missing data; we then extracted 500 random loci from this subset.
To acquire marginal likelihood scores, we performed path sampling with 48 steps. Each step included 100,000 generations in its MCMC chain with 10,000 pre-burn-in steps and 10% of subsequent generations removed as burn-in; these settings generated MCMC chains that attained stationarity upon examining the posterior with Tracer (Rambaut, Suchard, Xie, & Drummond, 2014).
Finally, we estimated demographic parameters, including effective population size and migration rates, between subspecies groups in the S. torqueola species complex with the Bayesian program G-PhoCS (Gronau, Hubisz, Gulko, Danko, & Siepel, 2011). Specifically, we extracted 1,000 randomly selected loci and included all individuals except for a single female that we suspect was misidentified (see RAxML and STRUCTURE results). We ran 2,000,000 iterations, remove 10% as burn-in, and examined the posterior with Tracer (Rambaut et al., 2014).
We ran 10 separate chains with different starting seeds to examine consistency among MCMC runs. We assumed an exponential distribution with a mean of 0.0001 for mutation-scaled divergence time estimates (τ), an exponential distribution with mean of 0.01 for population size parameters (θ), and a Gamma priors (α = 0.002, β = 0.00001) for migration parameters (m). We calculated the effective population size (N e ) in terms of θ = 4N e μ We used a mutation rate of 2.2 × 10 −9 that is commonly applied to nuclear loci among vertebrate taxa (Kumar & Subramanian, 2002) to provide a rough estimate of N e for populations in our study. While mutation rates vary substantially among regions of the genome and different taxa, we can compare relative differences in demographic parameters among populations considered in this study.
We also calculated the proportion of individuals in population x that

| Biogeographic analyses
After generating alignments of mtDNA sequences of all available cyt b sequences of species in the genus Sporophila, we identified the optimal codon-based partitioning scheme and corresponding models of nucleotide evolution for our mtDNA alignment using PartitionFinder (Guindon et al., 2010;Lanfear, Calcott, Ho, & Guindon, 2012;Lanfear, Frandsen, Wright, Senfeld, & Calcott, 2016). The best-fit model included a separate partition for each codon position. Under the best-fit model, the first codon position evolved under HKY + I + Γ, the second codon position evolved under GTR + I + Γ, and the third codon position evolved under a separate GTR + I + Γ model. We used BEAST 2  to infer an ultrametric phylogeny with our partitioned mtDNA alignment. We implemented a relaxed log-normal clock with a substation rate of 2.1% per million years linked across partitions (Weir & Schluter, 2008). We ran three replicate BEAST 2 analyses for a total of 100 × 10 6 generations and discarded the first 10 × 10 6 generations as burn-in. We assessed convergence across BEAST runs by ensuring that effective sample sizes surpassed 300 and that independent runs were congruent in topology. We combined the three runs and inferred a maximum clade credibility tree from the combined posterior distributions for biogeographic analyses.
We designated eight biogeographic regions to reconstruct the biogeographic history of Sporophila (from northwest to southeast; Basin forest clearings and scrub, (6) southern savannas (which includes the Chaco and Caatinga regions), and (7) Atlantic Forest clearings and scrub. To designate these regions, we compared shape files of species' range maps (downloaded from http://birdlife.org) to calculate the amount of overlap between each species' distribution shapefile and vectors corresponding to terrestrial ecoregions (Olson et al., 2001). We also included written descriptions of species' preferred habitats and geographic distributions that we used to further refine the boundaries of our biogeographic regions and determine which regions each extant species occupies.
We subsequently modeled the evolution of geographic distributions within Sporophila using the BioGeoBEARS methodological framework (Matzke, 2014). In brief, BioGeoBEARS reconstructs biogeographic history by inferring ancestral geographic ranges using biogeographic areas delimited by users, a phylogeny, and an array of discrete Markov models with varying free parameters and assumptions. BioGeoBEARS users can alter settings to initialize and optimize Bayesian chains or maximum likelihood searches that explore different parameter space and subsequently compare the performance of models that differ in how species' ranges evolve, such as the Dispersal-Extinction-Cladogenesis model (DEC; Ree & Smith, 2008) and a modified DEC model that includes an extra parameter to account for founder-event speciation (DEC + J; Matzke, 2014).
Users can then select the best-performing model(s) via AICc scores and make inferences about the geographic ranges of ancestral nodes as well as the quantity and timing of different biogeographic events. Here, we used BioGeoBEARS to estimate ancestral ranges within Sporophila and determine how many times certain dispersal events-such as dispersal across the Isthmus of Tehuantepec-have occurred. We confirmed that our phylogeny had no extremely short branches, which are problematic for BioGeoBEARS, and set our initial parameter estimates to d = 0.0615, e = 0.0342, and j = 0.0001.
All other settings were left at their default values. With eight biogeographic areas, we had a total of 128 possible states for each ancestral node as a node or species can occupy multiple areas.
Following the maximum likelihood search, we displayed the state with the highest marginal probability for each node and the corresponding marginal probability.

| Morphological differentiation within Sporophila torqueola
We found appreciable clustering corresponding to different subspecies groups within S. torqueola by comparing morphological measurements of over 600 specimens (Figure 2; Figures S1 and S2). After performing a principal component analysis using the seven morphological characters we measured, we found that PC1 and PC2 accounted for approximately 30% and 20% of the total variation, respectively (see Table S2

| UCE sequencing and population genetics
We recovered 4,376 UCE loci, which were 320 bp long on average ( Figure S3). A total of 788 loci were invariant, while the remaining loci F I G U R E 3 Phylogenetic and population genetic analyses of S. torequola with outgroup removed. (a) RAxML phylogeny built with 1,000 loci with highest number of parsimonious sites. Built by searching for the best tree and performing rapid bootstrapping in the same run (-f a setting). Nodes with white circles have bootstrap support above 70. Tip colors correspond to taxa in the lower right corner. (b) RAxML phylogeny of cyt b mtDNA sequences. Built by searching for the best tree and performing rapid bootstrapping in the same run (-f a setting). Nodes with white circles have bootstrap support above 70. Tip colors correspond to taxa in the lower right corner. (c) PCA plot constructed by filtering data set to include individuals with <85% missing data (n = 50) and loci with <75% missing data (n = 4,067). Dot colors correspond to taxa in the lower right corner. (d) STRUCTURE plot with optimal K value (2) determined by the Evanno method. Individuals are sorted according to taxa and by decreasing latitude within taxa, with rectangular boundary colors corresponding to taxa in the lower right corner. (e) STRUCTURE output for hierarchical analyses (optimal K = 2 for both) performed by subsetting the data set for each of the clusters identified and removing mismatched individuals in panel (d) contained between 1 and 28 SNPs for a total of 17,130 SNPs ( Figure   S3). Variable loci contained between 0 and 25 parsimony informative sites (mean = 2.67 ± 3.21 SD per locus).
Our phylogeny of UCE loci constructed with RAxML recovered two deeply divergent lineages: one corresponding to vouchered specimens identified as S. t. torqueola and another that contained samples of (Figure 3a). However, one individual female was mismatched; this sample was originally identified as S. t. morelleti, but was part of the clade that contained mostly S. t. torqueola samples. We recovered a similar topology and level of divergence in our phylogeny based on cyt b mitochondrial sequences (Figure 3b).

S. t. morelleti and S. t. sharpei
Thus, phylogenetic reconstructions were highly congruent between nuclear and mitochondrial loci. The PCA plot similarly separated two groups along the first PCA axis, which accounted for 54.37% of the total variation (Figure 3c). Again, one individual was mismatched between their a priori subspecies designation and the population cluster that they were assigned to. We identified K = 2 as the optimal value for the number of STRUCTURE population clusters with outgroup sequences removed (Table 1) Figure 3d). When we ran STRUCTURE on each of these clusters separately, we identified K = 2 as the optimal number of clusters for both subsets, with individuals assigned to either cluster based largely on latitude, suggesting a pattern of isolation by distance within both lineages (Figure 3e; Tables S4 and S5). Analyses that included outgroup sequences were congruent with analyses that did not include outgroup samples ( Figure S4 and

| Biogeography of Sporophila torqueola
We inferred a mitochondrial phylogeny that was concordant with previous phylogenies of the group (Burns et al., 2014;Mason & Burns, 2013   Effective population sizes (N e ) were calculated assuming a mutation rate of 2.2 × 10 −9 . Migration bands are shown in effective number of migrants per generation. The 95% lower and higher highest probability density (HPD) values represent the predictive distribution of parameter estimates and are analogous to confidence intervals and are shown in parentheses next to each mean value.  García-Trejo & Navarro-Sigüenza, 2004;Peterson & Navarro-Sigüenza, 2000). Below, we discuss the implications of these results toward the broader conceptual goal of understanding the evolutionary processes that underlie avian biodiversity in the Neotropics.

| Diversification of the Sporophila torqueola Complex
Our analysis of phenotypic variation within the S. torqueola complex revealed largely diagnosable clusters of morphological variation and plumage characters that correspond to currently recognized subspecies (Figure 2). For the purpose of this discussion and clarity, we hereafter refer to these lineages as separate species: S. torqueola and S. morelleti. Given that multiple plumage and morphological characters differ between the western S. torqueola and the eastern S. morelleti, there are presumably multiple regions of the genome underlying phenotypic differentiation in this system (Mackay, Stone, & Ayroles, 2009;Stranger, Stahl, & Raj, 2011). Our genetic analyses corroborate this assertion: The phenotypically differentiated S. torqueola expresses strong genetic differentiation from S. morelleti among many unlinked loci (Figure 4). Thus, genetic differentiation is prevalent throughout the genome, which differs from a suite of recent studies that have found patterns of widespread genomic homogeneity with a small number of loci likely underlying phenotypic differentiation in other avian taxa (Mason & Taylor, 2015;Poelstra et al., 2014;. Indeed, the deep evolutionary split that we documented between S. torqueola and S. morelleti is in stark contrast to the shallow evolutionary history that characterizes the southern capuchino radiation in the same genus, in which rapid and recent phenotypic differentiation and speciation are driven by a few loci of large effect (Campagna et al., 2017).
We recovered reciprocal monophyly between S. torqueola and S. morelleti with the exception of a mismatched individual (Figure 3a,b).
This individual was a female (voucher UWBM 112783) identified as S. morelleti based on where it was collected in the Campeche state of the Yucatan peninsula (Table S2), yet our phylogenetic analyses reveal this individual was likely misidentified. Given the proximity of the Campeche region to the putative contact zone between S. torqueola and S. morelleti, this female may either represent a recent, natural dispersal event or could be an escaped caged bird (Alves & de Faria Lopes, 2016;Carrete & Tella, 2008;CONABIO 1996;do Nascimento & Czaban, 2015;Aguilar Rodriguez, 1992). We do not see any evidence of introgression in either lineage, however, as indicated by the lack of admixture for this individual and all other samples included in the STRUCTURE analysis ( Figure 3e). This finding highlights the challenges of identifying female Sporophila, which are very similar in overall appearance (de Schauensee, 1952), and warrants caution in interpreting species identifications of female Sporophila based on phenotypes and presumed geographic distributions alone.
The deep molecular divergence and polyphyly that we recovered between S. torqueola and S. morelleti have been obscured by broadscale similarity in plumage patterning and geographic distributions.
The genus Sporophila has a remarkably high speciation rate that is similar to Darwin's Finches (Burns et al., 2014;Campagna et al., 2017).
Our study joins others in describing heretofore unrecognized specieslevel diversity in Sporophila (but see Rising, 2017), such as the newly recognized species S. beltoni (Repenning & Fontana, 2013

| Biogeography of Neotropical seedeaters
The deep molecular divergence and phenotypic differentiation that we recovered within the S. torqueola complex changes our biogeographic perspective of diversification in Neotropical seedeaters (Figure 4).
Our BioGeoBEARS analysis suggested that many of the deeper nodes occupied clearings and scrub in the Amazon basin, although there is a lot of uncertainty in which exact states the deepest ancestral nodes in Sporophila occupied ( Figure S6). Extinction raises further uncertainty in inferring ancestral states deep in evolutionary time with confidence (Crisp, Trewick, & Cook, 2011). Thus, while the exact ancestral range of the most recent common ancestor of all Sporophila remains unknown, our analysis suggests a biogeographic origin in South America.
The striking ecological gradients that characterize elevational rise in the Sierra Madre Occidental combined with periodic isolation from the Atlantic coastal plains via rises in sea level and submersion of the Isthmus of Tehuantepec may have contributed to ecological specialization, biogeographic isolation, and endemism in the region. Elevating S. t. torqueola to species status, which we recommend based on our findings here, further contributes to understanding endemism in the Pacific coastal plains of Mexico, highlighting this area as a biodiversity hotspot worthy of conservation efforts in Mesoamerica (García-Trejo & Navarro-Sigüenza, 2004;Peterson & Navarro-Sigüenza, 2000).

| CONCLUSION
In summary, our study documents deep molecular divergence between phenotypically differentiated subspecies that have heretofore been considered a single species of Neotropical bird. The polyphyly that we recover within the S. torqueola species complex emphasizes the capacity of phenotypic convergence and geographic affinities to obscure true evolutionary relationships. Our improved phylogeny of the genus Sporophila transforms our understanding of the biogeographic history of this lineage by suggesting there have been multiple, repeated dispersal events out of South America into Mesoamerica.
Finally, our study suggests a taxonomic split to recognize S. morelleti and S. torqueola as separate species, which exhibit pronounced phenotypic and genetic differentiation and are not sister species, highlighting the Pacific coast and lowlands of Mexico as a region that harbors high, yet underrecognized, levels of endemism.