A total of 1 740 women with 1 820 breast tumors were included between April 2004 and March 2007. In addition to the FNAC, subjects’ imaging findings (mammography and ultrasound) were evaluated for breast cancer diagnosis, and the risk of breast cancer was classified according to the American College of Radiology (ACR) guidelines. Cytopathologic and histopathologic results were extracted from the hospital’s computerized medical records.
Table 1 summarizes the results of the FNAC test and of the two standards used to assess the presence (D^{+}) or absence (D^{−}) of breast cancer. Note that these figures refer to tumor samples, not to patients. Indeed, according to the study oncologist, we analyzed the 1 820 breast tumors together, assuming independence between the observations of subjects with more than one tumor. It is also noteworthy that some exceptions to the diagnostic strategy were observed (38 patients had positive FNAC tests that were not verified by histology).
Cytologic diagnoses were classified into four categories: benign, suspect, malignant, and insufficient. Suspect results were defined as neither positive nor negative, that is, results for which the cytologist could neither affirm nor refute malignancy, although malignancy was highly suspected [5, 15]. Insufficient results were those obtained from insufficient material. Indeed, owing to technical sampling issues, the FNAC test may yield insufficient cellularity. Since the material obtained was insufficient to be tested, no definitive diagnosis could be made, resulting in missing data. According to experts in the field, such data could not be combined with suspect results but could be considered missing completely at random (MCAR). Therefore, the 53 samples with insufficient FNAC material were excluded from further analyses.
The data presented in Table 1 raise several statistical issues that should be taken into consideration in the analysis.
Handling suspect diagnostic test results
The first issue concerns the recorded responses of the FNAC test. While a diagnostic test usually yields binary responses, the FNAC test is a 3-valued outcome measure that includes suspect results in addition to positive and negative test results. These suspect outcomes (n=154) could be defined and treated as non-positive, non-negative results [16].
For the gold standard, we first ignored its source, pooling results from histology and follow-up and excluding missing data (lost to follow-up). Accordingly, the data can be described using a 3×2 decision matrix (Table 2).
Several strategies were used.
Estimates of performance measures based on a 2×2 cell matrix
The simplest approach consisted in summarizing the data in a 2×2 cell matrix, which allows the usual estimators of diagnostic measures to be applied directly. This required combining the suspect results with either the positive or the negative values of the FNAC. Four approaches were considered.
The “conventional” strategy consisted in excluding suspect results from the calculation [16]. In the “worst case” strategy, the suspect results were combined with the negative results in diseased patients and with the positive results in non-diseased participants [16]. Conversely, in the “best case” strategy, the suspect test results were considered as positive in diseased participants and as negative in non-diseased participants [5, 16]. Finally, we applied Multivariate Imputation by Chained Equations (MICE) to impute the suspect results, assuming a missing at random (MAR) mechanism [17, 18]. Given the rate of such missing data, M = 10 complete datasets were imputed; the imputation model included all the factors possibly related to the presence of the disease (patient’s age, lesion location within the breast, tumor size, side of the breast tumor, and ACR), together with the results of FNAC, histology, and follow-up. Then, the estimates of the diagnostic performance of the cytology test obtained from each of these tables (except for LR), with their intra-imputation variance, were pooled using Rubin’s rule [17]. We then calculated the corresponding 95% confidence interval of each estimate [18].
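The three non-imputation strategies can be illustrated with a minimal Python sketch. The cell labels follow the 3×2 layout assumed here (a, b: positive FNAC in diseased and non-diseased; c, d: negative FNAC; e, f: suspect FNAC); the function names and the counts are hypothetical and are not the study data.

```python
def collapse_to_2x2(a, b, c, d, e, f, strategy):
    """Collapse a 3x2 table into 2x2 counts (tp, fp, fn, tn).

    Assumed cell layout: a, b = positive test in D+, D-;
    c, d = negative test in D+, D-; e, f = suspect test in D+, D-.
    """
    if strategy == "conventional":
        # Suspect results are simply dropped.
        return a, b, c, d
    if strategy == "worst":
        # Suspect counts as negative in D+ and as positive in D-.
        return a, b + f, c + e, d
    if strategy == "best":
        # Suspect counts as positive in D+ and as negative in D-.
        return a + e, b, c, d + f
    raise ValueError(f"unknown strategy: {strategy}")


def sensitivity_specificity(tp, fp, fn, tn):
    """Usual 2x2 estimators of sensitivity and specificity."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts, for illustration only.
counts = dict(a=80, b=5, c=10, d=90, e=10, f=5)
for name in ("conventional", "worst", "best"):
    se, sp = sensitivity_specificity(*collapse_to_2x2(**counts, strategy=name))
    print(f"{name:>12}: Se = {se:.3f}, Sp = {sp:.3f}")
```

By construction, the worst-case and best-case estimates bracket the conventional ones, which gives a quick sense of how much the suspect results can move Se and Sp.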
Estimates of performance measures based on a 3×2 cell matrix
In contrast with the previous approaches, we next sought to respect the 3×2 structure of the data.
Simel et al. [16] proposed conditional definitions of diagnostic performance measures, conditioned on the test result being positive or negative, the so-called positive and negative “test yields” (Y^{+}, Y^{−}):
$$Y^{+} = P(T^{+} \cup T^{-} \mid D^{+}) = \frac{a+c}{a+e+c} $$
and
$$Y^{-} = P(T^{+} \cup T^{-} \mid D^{-}) = \frac{b+d}{b+f+d} $$
Conditional measures of sensitivity (Se^{c}) and specificity (Spe^{c}) were defined, resulting in similar estimators as those of the “conventional strategy” described above [16]:
$$Se^{c}=\frac{P(T^{+} \mid D^{+})}{P(T^{+} \cup T^{-} \mid D^{+})}=\frac{a}{a+c} $$
$$Spe^{c}=\frac{P(T^{-} \mid D^{-})}{P(T^{+} \cup T^{-} \mid D^{-})}=\frac{d}{b+d} $$
Simel et al. [16] and Eusebi et al. [19] also defined the conditional LR of suspect results (LR±), the overall test yield, and the accuracy of the test, as follows:
$$ LR\pm = \frac{P(T\pm \mid D^{+})}{P(T\pm \mid D^{-})} = \frac{1-(Y^{+})}{1-(Y^{-})} \tag{1} $$
$$ \textrm{Overall test yield} = \frac{a+b+c+d}{a+b+c+d+e+f} \tag{2} $$
$$ Accuracy = \frac{a+d}{a+b+c+d+e+f} \tag{3} $$
Exact 95% confidence intervals (95% CI) were estimated for Se, Sp, test yields, and accuracy. For the positive and negative LR (LR^{+} and LR^{−}), we used the 95% CI formula of Simel et al. [16, 20].
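As an illustration, the conditional measures defined above can be computed directly from the six cells of the 3×2 matrix. The sketch below gives point estimates only (no confidence intervals); the function name and the counts are hypothetical.

```python
def three_by_two_measures(a, b, c, d, e, f):
    """Point estimates of the conditional measures from a 3x2 table.

    Assumed cell layout: a, b = positive test in D+, D-;
    c, d = negative test in D+, D-; e, f = suspect test in D+, D-.
    """
    n_diseased = a + c + e
    n_non_diseased = b + d + f
    n_total = n_diseased + n_non_diseased
    yield_pos = (a + c) / n_diseased        # Y+
    yield_neg = (b + d) / n_non_diseased    # Y-
    return {
        "yield_pos": yield_pos,
        "yield_neg": yield_neg,
        "se_c": a / (a + c),                # conditional sensitivity
        "sp_c": d / (b + d),                # conditional specificity
        # LR of suspect results; undefined when no suspect result in D-.
        "lr_suspect": (1 - yield_pos) / (1 - yield_neg),
        "overall_yield": (a + b + c + d) / n_total,
        "accuracy": (a + d) / n_total,
    }


# Hypothetical counts, for illustration only.
for name, value in three_by_two_measures(80, 5, 10, 90, 10, 5).items():
    print(f"{name}: {value:.3f}")
```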
Handling verification bias in gold standard
In the previous sections, we ignored the different sources of the gold standard, that is, we assumed that disease status was measured in the same way and at the same time as the FNAC for all subjects. The estimates from the 2×2 matrix are considered as naive estimates in the further analyses, since they do not take into account the presence of verification bias. However, disease status was not always measured by histology, but only when the “triple test” provided positive findings. Otherwise, diagnosis was based on follow-up imaging of the breast. Moreover, there were missing data in the verification procedure (lost to follow-up, n=191). We thus applied methods to handle this verification bias.
Partial verification bias
First, we considered partial verification bias, that is, we treated histology as the only gold standard for diagnosis, so that patients not verified by histology (either with or without follow-up) had missing disease status.
Begg and Greenes method
Begg and Greenes proposed to make inference on the probabilities of the test results (T) given the disease status (D) in the presence of missing disease status, that is, when disease status has been ascertained only in a subset of patients (V=1).
Let X be the vector capturing all the other information likely to influence the selection of V. In our setting, it represents the imaging and clinical information. Although the disease process affects both T and X, it only affects selection (V) through its influence on T and X. Thus, given this conditional independence between the verification status V and D, the probability of T given D and X is defined by:
$$P(T \mid D,X)=\frac{P(T,X)\, P(D \mid T,X,V=1)}{\sum_{T} P(T,X)\, P(D \mid T,X,V=1)} $$
They proposed to estimate the disease status of the non-verified patients by inverse weighting, using the observed proportions of diseased and non-diseased among the patients verified by histology (V=1) to calculate the expected numbers of diseased and non-diseased patients among the non-verified patients (follow-up or lost to follow-up), as reported in Table 3. Accuracy measures were then computed as if all disease statuses had been measured by histology [11].
We applied the method to the “conventional strategy” described above. We combined the verified and non-verified patients as if all of them had been verified by histology [11], computed the adjusted Se = (a+a^{′})/(a+a^{′}+c+c^{′}) and the adjusted Sp = (d+d^{′})/(d+d^{′}+b+b^{′}), and derived the LR^{+} and LR^{−}.
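A minimal sketch of this correction for a 2×2 table is given below. It assumes that, within each test result, the non-verified patients are split between diseased and non-diseased in the proportions observed among the histology-verified patients, with no further stratification on X. The function name and the counts are hypothetical.

```python
def begg_greenes_2x2(verified_pos, verified_neg, unverified_pos, unverified_neg):
    """Begg and Greenes style correction (sketch, no covariate strata).

    verified_pos / verified_neg: (diseased, non_diseased) counts among
    verified patients with a positive / negative test.
    unverified_pos / unverified_neg: number of non-verified patients
    with a positive / negative test.
    """
    a, b = verified_pos
    c, d = verified_neg
    # Expected counts among the non-verified (a', b', c', d').
    a_exp = unverified_pos * a / (a + b)
    b_exp = unverified_pos * b / (a + b)
    c_exp = unverified_neg * c / (c + d)
    d_exp = unverified_neg * d / (c + d)
    se = (a + a_exp) / (a + a_exp + c + c_exp)
    sp = (d + d_exp) / (d + d_exp + b + b_exp)
    lr_pos = se / (1 - sp)
    lr_neg = (1 - se) / sp
    return se, sp, lr_pos, lr_neg


# Hypothetical counts: most test-negative patients are not verified.
se, sp, lr_pos, lr_neg = begg_greenes_2x2((80, 20), (10, 40), 10, 150)
print(f"adjusted Se = {se:.3f}, adjusted Sp = {sp:.3f}")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

In this made-up example the naive sensitivity among verified patients would be 80/90, whereas the adjusted value is lower because many test-negative patients were never verified.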
MICE
Given that verification by histology depends on patients’ observed data, the missing gold standard could be considered as missing at random (MAR). Thus, multiple imputation by chained equations (MICE) was applied [11, 21, 22] and compared with the Begg and Greenes method. It was applied to the conventional strategy of naive estimates. The missing disease status of unverified patients (with follow-up or not) was imputed with M=38 complete tables, given the percentage of missing data in this sample. The imputation model included all the factors likely to be related to the presence of the disease (patient’s age, lesion location within the breast, size of the tumor, side of the breast tumor, and ACR), together with the results of FNAC and histology. The estimates of Se and Sp from each of the M analyses were then combined using Rubin’s rule to produce an estimate and a confidence interval that incorporate between- and within-imputation variability [23]. We could then estimate the LR^{+} and LR^{−}.
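The pooling step can be sketched as follows. This is a simplified illustration of Rubin's rules on the raw proportion scale; it uses a normal approximation for the interval, whereas Rubin's rules refer to a t distribution with adjusted degrees of freedom. The function name and the input values are hypothetical.

```python
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, within_variances, level=0.95):
    """Pool M per-imputation estimates with Rubin's rules (sketch).

    Returns the pooled estimate, the total variance, and an
    approximate confidence interval (normal approximation).
    """
    m = len(estimates)
    pooled = mean(estimates)              # average of the M estimates
    within = mean(within_variances)       # within-imputation variance
    between = variance(estimates)         # between-imputation variance
    total = within + (1 + 1 / m) * between
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * total ** 0.5
    return pooled, total, (pooled - half_width, pooled + half_width)


# Hypothetical sensitivities from three imputed tables.
pooled, total, ci = pool_rubin([0.90, 0.92, 0.94], [0.0004, 0.0004, 0.0004])
print(f"pooled Se = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```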
Differential verification bias
Second, we corrected for differential verification bias, considering follow-up as an alternative gold standard to histology.
Because follow-up is an imperfect reference, the estimated Se and Sp may be incorrect [10]. A Bayesian correction approach [12] was applied to the conventional strategy. First, patients lost to follow-up were excluded. Second, they were instead imputed by applying MICE [21]. The information from the observed data was summarized into a likelihood function, defined as a product of four independent binomial density functions, each corresponding to the probability of a positive result on a gold standard (D^{+}) conditional on the index test (FNAC) result (T) [12]:
$$ P(D^{+} \mid T^{+}) \times \left(1-P(D^{+} \mid T^{+})\right) \times P(D^{+} \mid T^{-}) \times \left(1-P(D^{+} \mid T^{-})\right) \tag{4} $$
with
$$ \begin{aligned} P(D^{+} \mid T^{+}) &= sD\,\frac{prev \times sT}{(prev \times sT) + (1-prev)(1-cT)} \\ &\quad + (1-cD)\,\frac{(1-prev)(1-cT)}{(prev \times sT) + (1-prev)(1-cT)} \end{aligned} \tag{5} $$
and
$$ \begin{aligned} P(D^{+} \mid T^{-}) &= sD\,\frac{prev \times (1-sT)}{(prev \times (1-sT)) + (1-prev)\,cT} \\ &\quad + (1-cD)\,\frac{(1-prev)\,cT}{(prev \times (1-sT)) + (1-prev)\,cT} \end{aligned} \tag{6} $$
where:
– sT, cT: sensitivity, specificity of FNAC,
– sD, cD: sensitivity, specificity of histology or follow-up,
– prev: prevalence of the disease.
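To make Eqs. (5) and (6) concrete, the sketch below evaluates both probabilities for given parameter values. The function name and the numeric values are hypothetical; with a perfect reference (sD = cD = 1), the first probability reduces to the positive predictive value of the index test and the second to one minus its negative predictive value.

```python
def prob_reference_positive(prev, sT, cT, sD, cD):
    """Return P(D+ | T+) and P(D+ | T-) for an imperfect reference.

    prev: disease prevalence; sT, cT: sensitivity and specificity of
    the index test; sD, cD: sensitivity and specificity of the reference.
    """
    p_test_pos = prev * sT + (1 - prev) * (1 - cT)    # P(T+)
    p_test_neg = prev * (1 - sT) + (1 - prev) * cT    # P(T-)
    p_ref_pos_given_test_pos = (
        sD * prev * sT / p_test_pos
        + (1 - cD) * (1 - prev) * (1 - cT) / p_test_pos
    )
    p_ref_pos_given_test_neg = (
        sD * prev * (1 - sT) / p_test_neg
        + (1 - cD) * (1 - prev) * cT / p_test_neg
    )
    return p_ref_pos_given_test_pos, p_ref_pos_given_test_neg


# Hypothetical values: perfect reference, then an imperfect one.
print(prob_reference_positive(0.5, 0.9, 0.8, 1.0, 1.0))
print(prob_reference_positive(0.5, 0.9, 0.8, 0.85, 0.85))
```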
These formulas were applied to each of the histology and follow-up gold standards. Bayesian inference was applied, where sT, cT, sD, cD, and prev were considered as random variables with prior distributions. Following de Groot et al., we used independent Beta(α,β) prior distributions [12]. Given that histology is the perfect gold standard for breast cancer diagnosis, its Se and Sp were set at 1 [24, 25]. We used an informative Beta(172.55, 30.45) prior distribution for both the Se and Sp of imaging follow-up, corresponding to a density centered at 0.85 with a standard deviation taken as 1/4 of the range 0.80 to 0.90 [12]. We used non-informative Beta(1,1) priors for sT, cT, and prev, to limit the incorporation of any subjective prior opinion [12].
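The informative prior parameters can be recovered by matching the mean and standard deviation of a Beta distribution (method of moments). The short sketch below assumes that this is how the values were obtained; the function name is hypothetical.

```python
def beta_from_mean_sd(mean, sd):
    """Beta(alpha, beta) parameters with the given mean and SD.

    Uses mean = alpha / (alpha + beta) and
    variance = mean * (1 - mean) / (alpha + beta + 1).
    """
    concentration = mean * (1 - mean) / sd ** 2 - 1   # alpha + beta
    return mean * concentration, (1 - mean) * concentration


# Mean 0.85, SD = one quarter of the range 0.80 to 0.90.
alpha, beta = beta_from_mean_sd(0.85, (0.90 - 0.80) / 4)
print(f"Beta({alpha:.2f}, {beta:.2f})")
```

With a mean of 0.85 and a standard deviation of 0.025, this gives approximately Beta(172.55, 30.45), in agreement with the prior stated above.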
Using the JAGS software, the likelihood function was combined with the priors using Bayes’ theorem to derive the posterior distribution. We ran a total of 20 000 iterations, of which we dropped the first 2 000 as a burn-in period. The convergence of the Markov chain Monte Carlo was checked, and summary statistics (posterior mean, 2.5% and 97.5% quantiles) of the parameters of interest were computed.
We checked the effect of the priors chosen by a sensitivity analysis (see Additional file 1).
Handling both suspect test results and verification bias
Finally, we aimed to handle both statistical issues (suspect results and verification bias) in evaluating the performance of the FNAC test.
We proposed to apply the Begg and Greenes method to the 3×2 matrix, estimating test characteristics conditionally on the suspect results. Disease status was based only on histology, and all other patients (followed up or lost to follow-up) were considered as non-verified. We extended the formulas used to estimate the results of non-verified patients in order to estimate their suspect results (e^{′} and f^{′}), as reported in Table 4.
We estimated the adjusted Se and Sp from the combined results of verified and non-verified patients, and derived Y^{+}, Y^{−}, and LR± by applying the conditional measures defined in the section “Handling suspect diagnostic test results”.
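One possible reading of this extension is sketched below. It assumes that the suspect row is handled exactly like the positive and negative rows, that is, non-verified patients are allocated in the proportions observed among histology-verified patients with the same test result. The formulas of Table 4 are not reproduced here, so this allocation, the function name, and the counts are all assumptions.

```python
def allocate_non_verified(verified, non_verified):
    """Add expected diseased / non-diseased counts for non-verified patients.

    verified: {result: (diseased, non_diseased)} among verified patients.
    non_verified: {result: number of non-verified patients}.
    """
    adjusted = {}
    for result, (diseased, non_diseased) in verified.items():
        share = diseased / (diseased + non_diseased)
        extra = non_verified.get(result, 0)
        adjusted[result] = (
            diseased + extra * share,
            non_diseased + extra * (1 - share),
        )
    return adjusted


# Hypothetical counts, for illustration only.
verified = {"pos": (80, 20), "neg": (10, 40), "suspect": (30, 10)}
non_verified = {"pos": 10, "neg": 150, "suspect": 40}
adj = allocate_non_verified(verified, non_verified)

(a, b), (c, d), (e, f) = adj["pos"], adj["neg"], adj["suspect"]
yield_pos = (a + c) / (a + c + e)              # adjusted Y+
yield_neg = (b + d) / (b + d + f)              # adjusted Y-
se_c, sp_c = a / (a + c), d / (b + d)          # adjusted conditional Se, Sp
lr_suspect = (1 - yield_pos) / (1 - yield_neg)
print(f"Y+ = {yield_pos:.3f}, Y- = {yield_neg:.3f}, LR± = {lr_suspect:.2f}")
```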
Computation
For data description, continuous variables were presented as mean (standard deviation), and categorical variables as frequency (percentage). The diagnostic performance measures of the FNAC were presented by the point estimate with 95% confidence interval, or by the posterior mean with 95% credible intervals when the Bayesian approach was applied.
Analyses were performed using the statistical software R, version 4.0.4 (https://cran.r-project.org/).
https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-022-01506-y