Genomic analysis defines clonal relationships of ductal carcinoma in situ and recurrent invasive breast cancer

This research complies with all relevant ethical regulations: The Sloane project, United Kingdom National Health Service Breast Screening Programme, was approved by the UK Health Research Authority (Ethical approval REF 08/S0703/147, 19/LO/0648); The Dutch DCIS cohort study, a Netherlands Cancer Registry (NCR; reference no 12.281), nationwide network and registry of histology and cytopathology in the Netherlands (PALGA; reference no. LZV990) approved by the Central Committee on Research Involving Human Subjects in the Netherlands and the Institutional Review Board (IRB) of the Netherlands Cancer Institute (CFMPB166, CFMPB393 and CFMPB688); The Duke Hospital cohort approved by Duke University Health System IRB, USA (Pro00054877, Pro00068646).

Statistics and reproducibility

No statistical method was used to predetermine sample size. The research question we addressed (‘What percentage of primary DCIS is clonally related to invasive recurrences?’) has never been addressed on a large scale before; therefore, we collected all available paired samples from three large cohorts.

Samples were excluded due to the following reasons:

  • Only one sample of a pair was available,

  • DNA quantity was not sufficient,

  • QC analyses for either WES, CN or panel seq failed.

The experiments were not randomized. The investigators were blinded to the outcome of the clonality score of the different technologies used. Data distribution was not assumed to be normal and appropriate nonparametric statistical tests were used.


Cases of pure primary DCIS that, after treatment, had subsequently developed recurrent disease were identified from:

  1. (1)

    the Sloane project, a national audit of women with noninvasive neoplasia within the UK National Health Service Breast Screening Programme (REF 08/S0703/147, 19/LO/0648), median follow up 5.3 years26.

  2. (2)

    the Dutch DCIS cohort study, a nationwide, population-based patient cohort derived from the Netherlands Cancer Registry (NCR), in which all women diagnosed with primary DCIS between 1989 and 2004 were included, with a median follow up time of 12 years27. This cohort was linked to the nationwide network and registry of histology and cytopathology in the Netherlands (PALGA). The study was approved by the review boards of the NCR (reference no. 12.281) and PALGA (reference no. LZV990) and the IRB of the Netherlands Cancer Institute under numbers CFMPB166, CFMPB393 and CFMPB688.

  3. (3)

    the Duke Hospital cohort, a hospital-based study of women (age 40–75 years) diagnosed with DCIS between 1998 and 2016, with a median follow up of 7.9 years (IRB approvals: Pro00054877 and Pro00068646).

Formalin-fixed paraffin-embedded (FFPE) tissue specimens of patient-matched DCIS and subsequent recurrence were retrieved and reviewed by specialist breast pathologists to confirm the diagnosis and exclude confounding features (such as microinvasion).

In total, 129 DCIS recurrence pairs were included in this study, 95 had developed an ipsilateral invasive recurrence and 34 had an ipsilateral DCIS recurrence. In addition, 34 synchronous DCIS-IBC lesions identified from the Duke DCIS cohort and 14 with a subsequent invasive recurrence in the contralateral breast from the Sloane cohort were also included for testing of the Breakclone algorithm. Details of the cohorts can be found in Supplementary Data Table 1. Associations between clinical variables and clonality were assessed by Fisher’s exact test.

Person-year analysis for invasive breast cancer risk

We performed person-year analyses to compare the risks of ipsilateral and contralateral invasive breast cancer in the full Dutch cohort from which the Dutch cases and controls were derived with those in the general Dutch female population, overall and for the different treatment groups, allowing one of both or both ipsilateral invasive breast cancer (iiBC) and contralateral invasive breast cancer (ciBC). The time at risk started at date of diagnoses and ended at the date of ipsilateral or contralateral invasive breast cancer, the end of follow up (31 December 2010) or date of death, whichever occurred first. The full Dutch cohort comprised 10,090 primary DCIS patients diagnosed in the period 1989 until 2005 and followed until 1 January 2011. To enable direct comparison of the cumulative incidence rates of iiBC and ciBC, separately and combined, in the full Dutch cohort with death as competing risk, we calculated the expected cumulative incidence in the Dutch general female population.

DNA isolation

For DNA isolation, either macrodissection using a light microscope or laser microdissection (LMD) was performed. Sections (8 µm) were stained using nuclear fast red (macrodissection) or toluidine blue (LMD) and DCIS or invasive disease were separated from the normal tissue. Tumor DNA was extracted using the AllPrep DNA/RNA FFPE Kit (Qiagen).

Exome sequencing

WES of the paired DCIS with subsequent recurrence together with matched normal tissue was performed at the Department of Genomic Medicine, MD Anderson Cancer Center. Genomic DNA (18–300 ng) was used to generate sequencing libraries using the SureSelectXT Low Input library kit. Libraries were sequenced on NovaSeq 6000, multiplexing 16 tumor samples per lane.

WES of the Duke Hospital Cohort of 34 synchronous paired DCIS and invasive disease and matched normal tissue was performed at the McDonnell Genome Institute at Washington University School of Medicine. Genomic DNA (30–150 ng) was sheared to a mean fragment length of 250 bp and Illumina sequencing libraries were generated as dual-indexed, with unique barcode identifiers, using the Swift Biosciences library kit. Libraries were sequenced on Illumina HiSeq2500 1T instrument by multiplexing nine tumor samples per lane.

Data were converted to a FASTQ format and then aligned to the hg19 reference genome using the Burroughs-Wheeler Aligner (BWA). The aligned BAM files were subjected to mark duplication, realignment and recalibration using Picard v.2.21.9 and GATK v. The BAM files were then analyzed by MuTect and Pindel against the matched normal sample to detect somatic single nucleotide variants and insertions/deletions, respectively.

Individuals with normal sample median target coverage (MTC) >40x and tumor sample MTC >80x were included for further investigation. Variants were filtered by the following criteria: (1) Log odds score ≥10; (2) exonic variants; (3) tumor sample coverage at this site ≥15; (4) normal sample coverage at this site ≥10; (5) allele fraction in tumor sample ≥0.02; (6) allele fraction in normal sample <0.01; (7) population frequency <0.01 in ExAC, ESP6500 and 1000G database; (8) hotspot mutations in PIK3Ca and TP53 were added back to the dataset, if they did not pass these criteria and (9) for nonclonal pairs, private mutations were checked manually in the Integrative Genomics Viewer in both the primary and recurrence to ensure that they were indeed private and not filtered out by QC criteria.

We identified potential sample mismatches using an inhouse script for computing SNP matching index. Indexed BAM files from both tumor-normal pairs were used as an input to the variant caller Platypus v.0.5.2 to identify germline variants. For any pair of Platypus vcfs (two samples), we removed the SNPs from random chromosomes as well as SNPs with coverage <10, and calculated the number (nAB) of overlapping SNPs (by position), and the number (nGAB) of the same alleles within the overlapping SNPs. The score (match-index %) = nGAB × 100/nAB. Using this index, we removed all mismatches with score <90%. The details of all mutations detected can be found in Supplementary Data File 1.

Copy number analysis

Somatic copy number aberrations (SCNAs) were ascertained using the HumanCytoSNP-12 BeadChip Kit (Illumina) in cases with 100–250 ng of DNA available from the Sloane Project. DNA was restored with the Infinium HD FFPE DNA Restore Kit (Illumina). Raw SNP array data was processed with GenomeStudio 2.0 software (Illumina) and subsequently with the ASCAT (allele-specific copy number analysis of tumors) software algorithm (implemented in R), to estimate allele-specific copy number, the aberrant cell fraction and tumor ploidy28. Copy number profiles with number of segments higher than 500 and a log R ratio (LRR) noise higher than 0.16 were removed from the analysis. PLINK v.1.07 was used to estimate the pairwise relatedness using the raw SNP genotyping data to exclude sample mismatches between paired primary DCIS and recurrences.

In cases with limited DNA, copy number was ascertained using low-pass whole-genome analysis. The UK cohort used NEBNext Ultra II DNA Library Prep Kit for Illumina as per manufacturer’s instructions and the Dutch cohort used the KAPA hyper prep kit (KAPA Biosystems), protocol KR0961-v.5.16. Agilent S5XT-2 (1–96) adapters with Illumina P5 and P7 sequences were used, containing 8 bp Agilent indices.

Libraries were pooled and sequenced single-end on a HiSeq2500 sequencer (Illumina). After demultiplexing, FASTQ files were aligned to the human reference genome GRCh38 (hg38) using BWA v.0.7.17 aligner and converted to BAM files with SAMtools v.1.9. Duplicate reads were marked with Picard v.2.18.3 and removed, together with reads with mapping qualities lower than 37 using SAMtools v.1.9. Samples were sequenced at an average of 0.2× and minimum 0.03× for the Dutch cohort and average of 0.3× and minimum of 0.04× for the UK cohort. Relative copy number profiles were obtained with QDNAseq v.1.22.0 after setting a 100 kb fixed bin size. A bin size of 100 kb was used as this gave the best balance between sensitivity and noise in our data experience. We filtered out copy number profiles with a number of segments >400 and observed/expected noise ratio >50. We used CGHcall v.2.48.0 for relative copy number calling. We filtered out profiles that did not show any copy number aberrations. Despite these QC criteria, there were still a small subset of samples with poor quality copy number profiles, which could lead to incorrect copy number calls and the potential to erroneously call a pair independent if one pair of a sample was of poorer quality than the other. All copy number plots were therefore assessed visually and independently by two experts (L.F.A.W. and E.H.L.) and those that were considered by both to be of poor quality such that copy number calling may not be accurate were excluded. Overall, 14 primaries and four recurrences out of 159 pairs failed the visual inspection.

For detecting differential copy number variation between groups, absolute copy number calls and Fisher’s exact test were used. In the samples processed by SNP genotyping, absolute copy number calls were determined relative to tumor ploidy. If the copy number of a segment was more than 0.6 above tumor ploidy, it was called as a gain. If the copy number of a segment was more than 0.6 below tumor ploidy, the SNP was called as a loss. Due to the noisy LRR profiles, we filtered out calls with no BAF signal. In low-pass whole-genome sequenced samples, absolute copy number calls were obtained after tumor cell fraction adjustment with ACE v.1.4.0. The copy number profiles for all pairs can be found in Supplementary Fig. 1.

Targeted sequencing

For the UK cohort, sequencing of all exons of a custom 121 breast cancer-associated gene panel (Supplementary Data Table 10) was performed using the SureSelectXT low input Target Enrichment System (Agilent Technologies); 100 bp read paired-end sequencing was performed on the HiSeq2500 platform. The sequencing output was aligned to the reference genome hg19 using the BWA-MEM (maximal exact match). Variants were called using MuTect2 from the Genome Analysis Toolkit (v., using the matched normal tissue to exclude germline variants. Variants with an allele frequency <5%, coverage <30× were excluded. Sequencing reads of tumor and normal pairs were visualized on the Integrative Genomics Viewer to exclude germline variants and also potential sequencing artefacts.

The Dutch cohort was sequenced using an IonTorrent AmpliSeq custom 53-gene panel (Supplementary Data Table 11) and were processed according to the Ion AmpliSeq Library Kit Plus protocol (ThermoFisher Scientific). Reads were aligned to the reference genome GRCh37 (hg19) using the Torrent Mapping Alignment Program, and variant calling was performed using Torrent Variant Caller (TVC) v.5.6. Variant data in VCF format was first translated to GRCh38 and annotated using bedtools, Picard (, SAMtools, bcftools and VEP, and further analyzed in R, employing vcfR and tidy verse. True somatic variants, identified via filtering during which low quality variants VAF <10%, coverage <100× and a quality (QUAL) of <1,000), artifacts (found in >90% of samples), and germline variants (more than five cases in GNOMAD and GoNL) were removed. Details regarding the amplicon panel design, performance and filtering QC are provided in Supplementary Note 1.

Single-cell sequencing

FFPE samples were deparaffinized using the FFPE Tissue Dissociation Kit from MACS (catalog no. 130-118-052). Nuclear suspensions were prepared from the recovered cell suspensions using a 4,6-diamidino-2-phenylindole (DAPI)-NST lysis buffer (800 ml of NST (146 mM NaCl, 10 mM Tris base at pH 7.8, 1 mM CaCl2, 21 mM MgCl2, 0.05% BSA, 0.2% Nonidet P-40)), 200 ml of 106 mM MgCl2, 10 mg of DAPI). The nuclear suspensions were filtered through a 35 mm mesh and single nuclei were flow sorted (BD FACSMelody) into individual wells of 384-well plates from the aneuploid peak (Supplementary Note 2). After sorting single nuclei, direct tagmentation chemistry was performed following the acoustic cell tagmentation (ACT) protocol10. Briefly, nuclei were lysed and tagmentation was performed using TN5 to add dual barcode adapters to the DNA, followed by 12 cycles of PCR. The resulting libraries were QCed for concentration >10 ng µl–1 and pooled for sequencing on the HiSeq4000 (Illumina) instrument at 76 cycles.

To calculate single-cell copy number profiles, we demultiplexed sequencing data from each cell into FASTQ files, allowing one mismatch of the 8 bp barcode. FASTQ files were aligned to hg19 (NCBI Build 37) using bowtie2 (v.2.1.0)29 and converted from SAM to BAM files with SAMtools (v.0.1.16)30. PCR duplicates were removed based on start and end positions. Copy number profiles were calculated at 220 kb resolution using the variable binning method31. The preprocessing steps to compute DNA copy number profiles have been described in detail previously9. Single cells with <10 median reads per bin were excluded for downstream copy number analysis. GC-normalized read counts were binned into bins of variable size, averaging 200 kb, followed by population segmentation with the multipcf32 (gamma = 10) method from the R Bioconductor multipcf package. The log2 copy number ratio was calculated and used for subsequent analysis. We filtered out noisy single cells with mean nine-nearest neighbor correlation less than 0.85. The mean nine-nearest neighbor correlation is calculated as the average of the Pearson correlation coefficients between any single cell and its nine-nearest neighbors. This step removed single cells with poor whole-genome amplification from the subsequent data analysis. Single-cell ratio data was embedded into two dimensions using UMAP33, R package ‘uwot’ (v.0.1.8, seed = 31, min dist = 0.2, n_neighbors = 30, distance = ‘manhattan’). The resulting embedding was used to create an SNN graph with R Bioconductor package scran (v.1.14.6)34. Subclones were identified with R package ‘dbscan’ (v.1.1-5, k_minor = 0.02 × no. of cells)35. Heatmaps were plotted with R package ComplexHeatmap (v.2.2.0)36.

Clonal relatedness calculation using Breakclone

Breakclone is an inhouse package to assess clonal relatedness. Unlike other packages12,14, it incorporates both population frequency and allele frequency when using mutation data for determining clonal relatedness. When using copy number data, it uses the position of the individual copy number aberration breakpoints rather than aberration events at the chromosome arm level to determine clonal relatedness correcting for the frequency of the event within the cohort. These are harder to compare across cohorts analyzed with different techniques but, we believe, provide much stronger evidence of clonal relatedness when shared between lesions37. A reference distribution of concordance scores is calculated by randomly permuting all possible pairs from different patients, the number of permutations empirically determined as necessary for the distribution to converge and is used to calculate P values for the concordance score of each tumor pair. The threshold for determining clonal relatedness is set as P < 0.01. Clonality scores between 0.05 and 0.01 were called ambiguous. All values above 0.05 were considered as nonclonal. We considered a sample pair as clonally related if at least one of the different methods (WES, copy number, panelseq) gave a clonal score, with P102 as an exception, as only the copy number data showed borderline relatedness.

Copy number data

Each breakpoint shared between two tumors is interpreted as evidence of their relatedness, while each breakpoint unique to one tumor is interpreted as evidence of independence. However, given the generally stochastic process of genomic instability, a common aberration provides stronger evidence than an independent once; therefore, the effect of the unique aberrations in the score calculations is weighted down by one-half.

The concordance score range starts at zero for samples that share no aberrations and approaches the theoretical limit of 1 as the samples become more similar—the score for any two identical samples will be slightly below 1 due to the population frequency corrections.

Each SCNA breakpoint was compared between the pairs of tumors from the same individual. Concordant breakpoints were defined as the same type of aberration, present in the same location ± (5 mathrm{x} ;averageinterprobelength) to account for technical variation. This figure was determined empirically as the number that captured the most likely concordant breakpoints without compromising their uniqueness—larger values led to the same breakpoints being included in calculations twice. Each concordant breakpoint was adjusted for its frequency in the entire cohort (fb), producing an adjusted breakpoint concordance score (sb) based on the equation:

Samples bearing whole-genome duplications (WGD) present a particular challenge in assessing clonal relatedness. Relatedness would be especially underestimated in the case of clonally related samples, only one of which has undergone WGD, as this single event would be interpreted by the algorithm as a large number of gains and amplifications across the entire genome, obscuring the true common events between the two samples. To that end, a correction is applied, by inferring the most likely copy number state that would have existed before the WGD event; for example, an allelically balanced tetraploid region, which would be normally interpreted as a gain, is likely to have originated from a pre-WGD region of normal copy number, and will be corrected accordingly. All relevant corrections applied are presented in the Supplementary Data Table 12. This correction is applied only in the case of SNP array data, which enables the detection of WGD.

The final sample concordance score (s) was then calculated between the pairs using all of the SCNAs in the samples and taking into account the total number of breakpoints in both samples (nb), using the following formula:

$$s_mathrm{s} = frac{{{sum} {s_mathrm{b}} }}{{{sum} {s_b} + frac{1}{2} times left( {n_mathrm{b} – 2 ast {sum} {s_mathrm{b}} } right)}}$$

A reference distribution of concordance scores was calculated using all possible tumor pairs from different patients and was used to calculate P values for the concordance score of each tumor pair.

Somatic mutation data

The allele frequency is weighted according to the population frequency: the lower the population frequency, the higher the weight of the allele frequency. In the calculations, the square root of the population frequency is used to normalize the range of possible values. The range of values for this score is between 0 for samples with no shared mutations and 1 for samples with identical mutation profiles.

Mutation data from each sample was compared and common variants were assigned a score, based on both their allele frequency in each sample (A1 and A2), and their frequency in the population (Pc). A higher allele frequency is interpreted as a stronger indicator of clonal relatedness, while a higher population frequency is interpreted as diminishing the predictive value of the variant. The TCGA Pan-Cancer Atlas breast cancer mutation calls were used for this adjustment, in addition to the mutations found in our cohort. The concordance score (ss) was subsequently calculated, taking into account the private variants in both tumor samples and their allele (Ap) and population (Pp) frequencies, using the following formula:

$$s_mathrm{s} = frac{{{sum} {frac{{A_1 + A_2}}{{sqrt {P_c} }}} }}{{{sum} {frac{{A_1 + A_2}}{{sqrt {P_c} }}} + 0.5 times {sum} {frac{{A_p}}{{sqrt {P_p} }}} }}$$

A reference distribution of concordance scores was calculated using all possible tumor pairs from different patients and was used to calculate P values for the concordance score of each tumor pair.

An implementation of the method is available as an R package at

Validation of Breakclone in:

(1) Synchronous DCIS and invasive disease

Results from WES on 34 synchronous DCIS-INV pairs were run through breakclone and, as expected, confirmed that most were clonally related (31/34, 91%). The three pairs that were considered unrelated did not share any mutations and also had the least number of mutations overall, which may suggest that methodological limitations were preventing us from detecting their common origin (Supplementary Data Table 13).

(2) Pure DCIS with contralateral recurrence

A total of 14 pure DCIS and paired contralateral invasive recurrences were analyzed by SNP array and the clonality score calculated by Breakclone. The results showed, as expected, that none of the contralateral invasive recurrences were related to their primary DCIS (Supplementary Data Table 14).

Comparison of the different clonality algorithms

In order to assess the added usefulness of our method, we applied the relevant copy number and mutation-based functions in the previously published Clonality package13,14 to our samples where copy number was ascertained by SNP array (Supplementary Note 3). We observed with the estimate of the number of clonally related pairs was lower for the clonality package compared with our proposed method, as well as suggesting a number of the contralateral recurrences were clonally related. Visual inspection of the contralateral samples that were called clonal showed that they were generally genomically stable samples, with few main aberrations. When those samples do share aberrations on the same chromosomal arms, they are considered clonally related by the Clonality package which, by design, relies on fewer events, but not by our method, which relies on the presence of multiple copy number events.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.