Statistical methods for evaluating the fine needle aspiration cytology procedure in breast cancer diagnosis | BMC Medical Research Methodology

Via Peters

A total of 1 740 women with 1 820 breast tumors were included between April 2004 and March 2007. In addition to the FNAC, subjects’ imaging findings (mammography and ultrasound) were evaluated for breast cancer diagnosis, with the risk of breast cancer classified according to the American College of Radiology’s (ACR) guidelines. Cytopathologic and histopathologic results were extracted from the hospital’s computerized medical records.

Table 1 summarizes the results of the FNAC test and of the two standards used to assess the presence (D+) or absence (D−) of breast cancer. Note that these figures refer to tumor samples, not to patients. Indeed, in agreement with the study oncologist, we analyzed the 1 820 breast tumors together, assuming independence between the observations of subjects with more than one tumor. It is also noteworthy that some exceptions to the diagnostic strategy were observed (38 patients had positive FNAC tests that were not verified by histology).

Table 1 Data presenting the results of FNAC test compared to histology and follow-up gold standards

Cytologic diagnoses were classified into four categories: benign, suspect, malignant, and insufficient. Suspect results were defined as neither positive nor negative, that is, results for which the cytologist could neither affirm nor refute malignancy, although malignancy was strongly suspected [5, 15]. Insufficient results were those obtained from insufficient material. Indeed, owing to technical sampling issues, the FNAC test may have yielded insufficient cellularity. Since the material obtained was insufficient to be tested, no definitive diagnosis could be made, resulting in missing data. According to experts in the field, such data could not be combined with suspect results, but could be considered missing completely at random (MCAR). Therefore, the 53 samples with insufficient material were excluded from further analyses.

The data presented in Table 1 raise several statistical issues that should be taken into account in the analysis.

Handling suspect diagnostic test results

The first issue concerns the recorded responses of the FNAC test. While a diagnostic test usually yields a binary response, the FNAC test has a three-valued outcome that includes suspect results in addition to positive and negative results. These suspect outcomes (n=154) can be defined and treated as non-positive, non-negative results [16].

For the gold standards, we first ignored their source, pooling results from histology and follow-up and excluding missing data (lost to follow-up). Accordingly, the data can be described using a 3×2 decision matrix (Table 2).

Table 2 Decision matrix for handling suspect results

Several strategies were used.

Estimates of performance measures based on a 2×2 cell matrix

The simplest approach consisted in summarizing the data in a 2×2 matrix, which allows the usual estimators of diagnostic measures to be applied directly. This required combining the suspect results with either the positive or the negative results of the FNAC. Four approaches were considered.

The “conventional” strategy consisted in excluding suspect results from the calculation [16]. In the “worst case”, the suspect results were combined with negative results in diseased patients and with positive results in non-diseased participants [16]. In the “best case”, conversely, the suspect test results were considered as positive in diseased participants and as negative in non-diseased participants [5, 16]. Finally, we applied Multivariate Imputation by Chained Equations (MICE) to impute the suspect results, assuming a missing at random (MAR) mechanism [17, 18]. Given the rate of such missing data, M = 10 complete datasets were imputed. The imputation model included all the factors possibly affecting the presence of the disease (patient’s age, lesion location within the breast, tumor size, side of the breast tumor, and ACR), together with the results of FNAC, histology, and follow-up. Then, from each of these tables, estimates (except for LR) of the diagnostic performance of the cytology test, with their intra-imputation variance, were pooled using Rubin’s rule [17]. We then calculated the corresponding 95% confidence interval of each estimate [18].
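The three non-imputation strategies can be illustrated with a minimal Python sketch. The cell labels a to f are an assumption taken from the formulas later in this section (rows T+, T−, suspect; columns D+, D−), and the counts are made up for illustration; they are not the study data.

```python
from typing import NamedTuple, Tuple


class Table3x2(NamedTuple):
    """Counts of a 3x2 decision matrix (assumed cell labelling)."""
    a: int  # T+, D+
    b: int  # T+, D-
    c: int  # T-, D+
    d: int  # T-, D-
    e: int  # suspect, D+
    f: int  # suspect, D-


def collapse(t: Table3x2, strategy: str) -> Tuple[int, int, int, int]:
    """Return 2x2 counts (TP, FP, FN, TN) under the chosen strategy."""
    if strategy == "conventional":  # suspect results are dropped
        return t.a, t.b, t.c, t.d
    if strategy == "worst":  # suspect = negative if D+, positive if D-
        return t.a, t.b + t.f, t.c + t.e, t.d
    if strategy == "best":  # suspect = positive if D+, negative if D-
        return t.a + t.e, t.b, t.c, t.d + t.f
    raise ValueError(f"unknown strategy: {strategy}")


def se_sp(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Usual sensitivity and specificity from a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


toy = Table3x2(a=80, b=5, c=10, d=90, e=10, f=5)  # hypothetical counts
for name in ("conventional", "worst", "best"):
    se, sp = se_sp(*collapse(toy, name))
    print(f"{name:12s} Se={se:.3f} Sp={sp:.3f}")
```

The worst-case and best-case values bracket the conventional estimate, which is what makes them useful as bounds.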

Estimates of performance measures based on a 3×2 cell matrix

In contrast with the previous approaches, we next sought to preserve the 3×2 structure of the data.

Simel et al. [16] proposed conditional definitions of diagnostic performance measures, conditioned on the test result being positive or negative, together with the so-called positive and negative “test yields” (Y+, Y−):

$$Y^{+} = P(T^{+} \cup T^{-}|D^{+})= \frac{a+c}{a+c+e} $$


$$Y^{-} = P(T^{+} \cup T^{-}|D^{-})= \frac{b+d}{b+d+f} $$

Conditional measures of sensitivity (Se^c) and specificity (Spe^c) were defined, resulting in the same estimators as those of the “conventional strategy” described above [16]:

$$Se^{c}=\frac{P(T^{+}|D^{+})}{P(T^{+} \cup T^{-}|D^{+})}=\frac{a}{a+c} $$

$$Spe^{c}=\frac{P(T^{-}|D^{-})}{P(T^{+} \cup T^{-}|D^{-})}=\frac{d}{b+d} $$

Simel et al. [16] and Eusebi et al. [19] also defined the conditional LR of suspect results (LR±), the overall test yield, and the test accuracy, as follows:

$$ LR\pm = \frac{P(T\pm |D^{+})}{P(T\pm |D^{-})} = \frac{1-Y^{+}}{1-Y^{-}} $$


$$ \textrm{Overall test yield} = \frac{a+b+c+d}{a+b+c+d+e+f} $$


$$ Accuracy= \frac{a+d}{a+b+c+d+e+f} $$
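As an illustration, the conditional measures above can be computed directly from the six cell counts. The following is a minimal Python sketch; the function name and the toy counts are hypothetical and are not taken from the study data.

```python
def simel_measures(a, b, c, d, e, f):
    """Conditional measures for a 3x2 table.

    Assumed layout: rows T+, T-, suspect; columns D+, D-.
    a, b = T+ row; c, d = T- row; e, f = suspect row.
    """
    n = a + b + c + d + e + f
    yield_pos = (a + c) / (a + c + e)  # Y+: non-suspect result given D+
    yield_neg = (b + d) / (b + d + f)  # Y-: non-suspect result given D-
    return {
        "Y+": yield_pos,
        "Y-": yield_neg,
        "Se_c": a / (a + c),           # conditional sensitivity
        "Spe_c": d / (b + d),          # conditional specificity
        "LR_suspect": (1 - yield_pos) / (1 - yield_neg),
        "overall_yield": (a + b + c + d) / n,
        "accuracy": (a + d) / n,
    }


m = simel_measures(a=80, b=5, c=10, d=90, e=10, f=5)  # hypothetical counts
for key, value in m.items():
    print(f"{key:14s} {value:.3f}")
```

Note that the accuracy keeps the suspect results in the denominator, so it is penalized by a low test yield, whereas the conditional sensitivity and specificity are not.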


Exact 95% confidence intervals (95% CI) of Se, Sp, test yields, and accuracy were estimated. We used the Simel et al. 95% CI formula for the positive and negative LR (LR+ and LR−) [16, 20].
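The text does not say which exact method was used. Assuming it is the Clopper-Pearson interval, which is the usual meaning of an “exact” binomial interval, a standard-library Python sketch could look as follows. The function names are hypothetical, and the bounds are found by bisection rather than by a Beta quantile function.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1 (log-space terms)."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return total


def _root_of_decreasing(g, iterations=60):
    """Bisection on (0, 1) for a function that decreases through zero."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided exact interval for a proportion x / n."""
    # Lower bound solves P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else _root_of_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    # Upper bound solves P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else _root_of_decreasing(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


low, high = clopper_pearson(80, 90)  # hypothetical: 80 true positives of 90
print(f"Se = {80 / 90:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```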

Handling verification bias in gold standard

In the previous sections, we ignored the different sources of the gold standard, that is, we assumed that disease status was measured in the same way, and at the same time as the FNAC, for all subjects. The estimates from the 2×2 matrix are referred to as naive estimates in the analyses below, since they do not account for verification bias. In fact, disease status was not always measured by histology, but only when the “triple test” provided positive findings. Otherwise, diagnosis was based on follow-up imaging of the breast. Moreover, there were missing data in the verification procedure (lost to follow-up, n=191). We thus applied methods to handle this verification bias.

Partial verification bias

First, we considered partial verification bias, that is, we treated histology as the only gold standard for diagnosis, so that patients not verified by histology (with or without follow-up) had a missing disease status.

Begg and Greenes method

Begg and Greenes proposed a method for inferring the probabilities of test results (T) given the disease status (D) when disease status is missing for some patients, that is, when it has been ascertained only in a subset of patients (V=1).

Let X be the vector capturing all the other information likely to influence the selection for verification (V). In our setting, it represents the imaging and clinical information. Although the disease process affects both T and X, it affects selection (V) only through its influence on T and X. Thus, given the conditional independence between the verification status V and D given T and X, the probability of T given D and X is:

$$P(T|D,X)=\frac{P(T,X)\, P(D|T,X,V=1)}{\sum_{T} P(T,X)\, P(D|T,X,V=1)} $$

They proposed to estimate the disease status of non-verified patients by inverse weighting: the observed proportions of diseased and non-diseased patients among those verified by histology (V=1) are used to calculate the expected numbers of diseased and non-diseased patients among non-verified patients (follow-up or lost to follow-up), as reported in Table 3. Accuracy measures were then computed as if all disease statuses had been measured by histology [11].

Table 3 Begg and Greenes correction method

We applied the method to the “conventional strategy” described above. We combined the verified and non-verified patients as if all of them had been verified by histology [11], and computed the adjusted Se = (a + a′)/(a + a′ + c + c′) and the adjusted Sp = (d + d′)/(d + d′ + b + b′), where the primed letters denote the expected counts among non-verified patients (Table 3). We then derived the LR+ and LR−.
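The correction can be sketched in Python for a binary test without covariates, which is a simplifying assumption: the formula above also conditions on X. The function name, the dictionary layout, and the counts are hypothetical.

```python
def begg_greenes(verified, totals):
    """Begg and Greenes corrected Se and Sp for a binary test, no covariates.

    verified: {"T+": (diseased, non_diseased), "T-": (...)} among V = 1
    totals:   {"T+": n, "T-": n} for all tested patients, verified or not
    """
    expected_d, expected_nd = {}, {}
    for t in ("T+", "T-"):
        diseased, non_diseased = verified[t]
        p_d = diseased / (diseased + non_diseased)  # P(D+ | T, V=1)
        # Apply the verified proportions to everyone with this test result.
        expected_d[t] = totals[t] * p_d
        expected_nd[t] = totals[t] * (1 - p_d)
    se = expected_d["T+"] / (expected_d["T+"] + expected_d["T-"])
    sp = expected_nd["T-"] / (expected_nd["T-"] + expected_nd["T+"])
    return se, sp


# Hypothetical counts: test-negative patients are verified far less often.
verified = {"T+": (90, 10), "T-": (5, 45)}
totals = {"T+": 120, "T-": 400}
se, sp = begg_greenes(verified, totals)
print(f"naive     Se={90 / 95:.3f} Sp={45 / 55:.3f}")
print(f"corrected Se={se:.3f} Sp={sp:.3f}")
```

With these toy numbers the naive sensitivity is overestimated and the naive specificity underestimated, the typical direction of partial verification bias when positive results are verified preferentially.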


Given that verification by histology depends on the patients’ observed data, the missing gold standard could be considered missing at random (MAR). Thus, multiple imputation by chained equations (MICE) was applied [11, 21, 22] and compared with the Begg and Greenes method. It was applied to the conventional strategy of naive estimates. The missing disease status of unverified patients (with or without follow-up) was imputed, generating M = 38 complete tables given the percentage of missing data in this sample. The imputation model included all the factors likely to affect the presence of the disease (patient’s age, lesion location within the breast, size of the tumor, side of the breast tumor, and ACR), together with the results of FNAC and histology. Estimates of Se and Sp from each of the M analyses were then combined using Rubin’s rule to produce an estimate and confidence interval that incorporate both between- and within-imputation variability [23]. We could then estimate the LR+ and LR−.
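The pooling step can be sketched as follows. This is a minimal Python illustration of Rubin's rules on the raw scale; whether the authors pooled on a transformed scale (for example the logit) is not stated, and the interval below uses a normal approximation instead of Rubin's t reference distribution. The function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Pool M point estimates and their within-imputation variances."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical sensitivities from M = 3 imputed tables.
pooled, total_var = rubin_pool([0.90, 0.92, 0.94], [0.0004, 0.0004, 0.0004])
half_width = 1.96 * sqrt(total_var)  # normal approximation (assumption)
print(f"Se = {pooled:.3f} "
      f"[{pooled - half_width:.3f}, {pooled + half_width:.3f}]")
```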

Differential verification bias

Second, we corrected for the differential verification bias, considering “follow-up” as an alternative gold standard to histology.

Because follow-up is an imperfect reference, the estimated Se and Sp may be biased [10]. A Bayesian correction approach [12] was applied to the conventional strategy. In a first analysis, patients lost to follow-up were excluded; in a second, their disease status was imputed using MICE [21]. The information from the observed data was summarized in a likelihood function, defined as a product of four independent binomial density functions, each corresponding to the probability of a positive result on a gold standard (D+) conditional on the index test (FNAC) result (T) [12]:

$$ P(D^{+}|T^{+})\times (1-P(D^{+}|T^{+}))\times P(D^{+}|T^{-})\times (1-P(D^{+}|T^{-})) $$



$$ \begin{aligned} P(D^{+}|T^{+}) &= sD\,\frac{prev\times sT}{(prev \times sT) +(1-prev)(1-cT)}\\ &\quad+(1-cD)\,\frac{(1-prev)(1-cT)}{(prev\times sT) +(1-prev)(1-cT)} \end{aligned} $$



$$ \begin{aligned} P(D^{+}|T^{-}) &= sD\,\frac{prev\times(1-sT)}{(prev\times(1-sT))+(1-prev)\,cT}\\ &\quad +(1-cD)\,\frac{(1-prev)\,cT}{(prev\times(1-sT))+(1-prev)\,cT} \end{aligned} $$



  1. sT,cT: sensitivity, specificity of FNAC,

  2. sD,cD: sensitivity, specificity of histology or follow-up,

  3. prev: prevalence of the disease.
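To make the two conditional probabilities concrete, here is a minimal Python sketch that evaluates them for given parameter values. The function name and the numeric inputs are hypothetical; in the Bayesian analysis these parameters are random variables rather than fixed numbers.

```python
def p_reference_positive(prev, sT, cT, sD, cD):
    """Return (P(D+ | T+), P(D+ | T-)) for an imperfect reference standard."""
    # Probability of true disease given each index test result.
    ppv = prev * sT / (prev * sT + (1 - prev) * (1 - cT))
    p_dis_neg = prev * (1 - sT) / (prev * (1 - sT) + (1 - prev) * cT)
    # The reference is positive for truly diseased cases with probability sD
    # and for non-diseased cases with probability 1 - cD.
    p_pos = sD * ppv + (1 - cD) * (1 - ppv)
    p_neg = sD * p_dis_neg + (1 - cD) * (1 - p_dis_neg)
    return p_pos, p_neg


# Perfect reference (sD = cD = 1): reduces to the true predictive values.
print(p_reference_positive(prev=0.5, sT=0.9, cT=0.8, sD=1.0, cD=1.0))
# Imperfect reference, for example follow-up with sD = cD = 0.85.
print(p_reference_positive(prev=0.5, sT=0.9, cT=0.8, sD=0.85, cD=0.85))
```

When the reference is imperfect, both probabilities are pulled away from the true predictive values, which is the source of the differential verification bias being corrected.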

These formulas were applied for each of the histology and follow-up gold standards. Bayesian inference was applied, where sT, cT, sD, cD, and prev were considered as random variables with prior distributions. Following de Groot et al., we used independent Beta(α, β) prior distributions [12]. Given that histology is considered a perfect gold standard for breast cancer diagnosis, its Se and Sp were set at 1 [24, 25]. We used an informative Beta(172.55, 30.45) prior distribution for both the Se and Sp of imaging follow-up, corresponding to a density centered at 0.85 with a standard deviation set to one quarter of the range 0.80–0.90 [12]. We used non-informative Beta(1, 1) priors for sT, cT, and prev, to limit the incorporation of any subjective prior opinion [12].
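The informative prior parameters can be recovered by moment matching, as this short Python sketch shows. It assumes the prior was specified through its mean and standard deviation; the function name is hypothetical.

```python
def beta_from_mean_sd(mean, sd):
    """Beta(alpha, beta) parameters matching a given mean and SD."""
    # For a Beta distribution, var = mean * (1 - mean) / (alpha + beta + 1).
    k = mean * (1 - mean) / sd ** 2 - 1  # k = alpha + beta
    return mean * k, (1 - mean) * k


sd = (0.90 - 0.80) / 4  # one quarter of the range, i.e. 0.025
alpha, beta = beta_from_mean_sd(0.85, sd)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

With a mean of 0.85 and a standard deviation of 0.025, the matching gives α + β = 203, hence α = 172.55 and β = 30.45, in agreement with the values quoted in the text.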

Using the JAGS software, the likelihood function was combined with the priors through Bayes’ theorem to derive the posterior distribution. We ran a total of 20 000 iterations, of which we dropped the first 2 000 as a burn-in period. The convergence of the Markov chain Monte Carlo was checked, and summary statistics (posterior mean, 2.5% and 97.5% quantiles) of the parameters of interest were computed.

We checked the effect of the priors chosen by a sensitivity analysis (see Additional file 1).

Handling both suspect test results and verification bias

Finally, we aimed to handle both statistical issues (suspect results and verification bias) in evaluating the performance of the FNAC test.

We proposed to apply the Begg and Greenes method to the 3×2 matrix, which estimates test characteristics conditionally on the suspect results. Disease status was based on histology only, and all other patients (followed up or lost to follow-up) were considered as non-verified. We extended the formulas used to estimate the results of non-verified patients in order to estimate their suspect results (e and f), as reported in Table 4.

Table 4 Begg and Greenes correction method for the 3×2 matrix

We estimated the adjusted Se and Sp from the combined results of verified and non-verified patients, and derived Y+, Y−, and LR± by applying the conditional measures defined in the section “Handling suspect diagnostic test results”.


For data description, continuous variables were presented as mean (standard deviation), and categorical variables as frequency (percentage). The diagnostic performance measures of the FNAC were presented by the point estimate with 95% confidence interval, or by the posterior mean with 95% credible intervals when the Bayesian approach was applied.

Analyses were performed using the statistical software R, version 4.0.4 (
