Algorithm Details: Digit Preference: Within-study Comparison
Depending on which label field the user specifies, the algorithm compares leading or trailing digits between user-defined groups in a clinical trial, for example between sites or between randomization arms (to check whether randomization was performed correctly).
The algorithm:
- Calculate a contingency table for the frequency distribution of the digits of interest (1 or 2 leading or trailing digits). Each column of the table corresponds to one digit (or sequence of digits) and contains 2 values:
1) the count of occurrences of that digit (or sequence) for a given label value (for example, if labels are site IDs, the count for a specific site),
2) the count of that digit (sequence) for all other label values (if labels are site IDs, the count for all sites except the one under consideration).
- Calculate standardized midrank scores [1][2] for columns of the table.
- Perform the Cochran-Mantel-Haenszel “Row Mean Scores Differ” test [2], which effectively compares the rows of the contingency table with each other using a Chi-squared statistic [4] calculated from the standardized midrank scores.
- All tests executed within the scope of a CSM Request, whether for trailing or leading digits, are treated as parts of a single experiment, so an adjustment for multiple comparisons is required. It is done using the Benjamini-Yekutieli method for controlling the false discovery rate [3], which yields adjusted p-values.
- Scores are calculated from the p-values using the formula -log10(p-value), which gives values on a more convenient and useful scale than the p-values themselves. A score is highlighted in the report if its p-value is lower than a user-defined significance level, which means that the data is suspicious and the case should be investigated. If the number of observations is small (50 or fewer), the cell is highlighted with a lighter color.
- If the number of highlighted cells is too large, the highlights are additionally filtered: outliers of the distribution of test results are identified, and only those are highlighted.
- In the report, the user can also examine the distribution of digits in the data and, for any digit, see the difference (in percent) between the actual and expected counts (the expected counts are taken from the distribution of digits for all other label values).
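The statistical steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation. It assumes a single trailing digit, a single stratum (one 2 x k table per label value), digits treated as ordered categories when computing midranks, and a Chi-squared p-value with 1 degree of freedom obtained through the identity P(X > q) = erfc(sqrt(q / 2)). All function names and the sample data are hypothetical.

```python
import math
from collections import Counter

def trailing_digit(value):
    """Last decimal digit of an integer measurement."""
    return abs(int(value)) % 10

def contingency_table(digits_by_label, label, categories):
    """2 x k table: row 0 = counts for `label`, row 1 = counts for all other labels."""
    own = Counter(digits_by_label[label])
    rest = Counter(d for other, ds in digits_by_label.items()
                   if other != label for d in ds)
    return [[own[c] for c in categories], [rest[c] for c in categories]]

def standardized_midranks(table):
    """Column scores: midrank of each column divided by (N + 1)."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    scores, below = [], 0
    for total in col_totals:
        scores.append((below + (total + 1) / 2) / (n + 1))
        below += total
    return scores

def row_mean_scores_test(table):
    """Return (Q_S, p-value) for one 2 x k table; Q_S is Chi-squared with 1 df."""
    scores = standardized_midranks(table)
    col_totals = [sum(col) for col in zip(*table)]
    row_totals = [sum(row) for row in table]
    n = sum(col_totals)
    if n < 2 or min(row_totals) == 0:
        return 0.0, 1.0  # assumption: treat a degenerate table as "no evidence"
    mean = sum(a * c for a, c in zip(scores, col_totals)) / n
    var = sum((a - mean) ** 2 * c for a, c in zip(scores, col_totals)) / n
    if var == 0:
        return 0.0, 1.0  # only one digit observed, nothing to compare
    q = 0.0
    for row, row_total in zip(table, row_totals):
        row_mean = sum(a * x for a, x in zip(scores, row)) / row_total
        q += row_total * (row_mean - mean) ** 2
    q *= (n - 1) / (n * var)
    return q, math.erfc(math.sqrt(q / 2))

def benjamini_yekutieli(p_values):
    """Adjusted p-values controlling the FDR under arbitrary dependence."""
    m = len(p_values)
    c_m = sum(1 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

def score(p_value):
    """-log10(p-value); infinite when the p-value underflows to zero."""
    return math.inf if p_value == 0 else -math.log10(p_value)

# Hypothetical measurements grouped by site ID (the label field).
readings = {
    "site_A": [121, 134, 118, 127, 142, 139, 115, 126, 133, 120],
    "site_B": [117, 129, 131, 124, 138, 122, 116, 143, 135, 128],
    "site_C": [120, 130, 120, 140, 130, 120, 110, 130, 140, 120],
}
digits_by_label = {s: [trailing_digit(v) for v in vals] for s, vals in readings.items()}
categories = list(range(10))
p_values = [row_mean_scores_test(contingency_table(digits_by_label, s, categories))[1]
            for s in readings]
for site, p_adj in zip(readings, benjamini_yekutieli(p_values)):
    print(site, round(score(p_adj), 2))
```

In this sketch each site is tested against the pooled counts of all other sites, and the adjustment is applied across all resulting p-values at once, mirroring the idea that every test in a request belongs to a single experiment.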
Literature:
[1] John H. McDonald, Handbook of Biological Statistics. http://www.biostathandbook.com/cmh.html
[2] Maura E. Stokes, Charles S. Davis, Gary G. Koch, Categorical Data Analysis Using SAS®, Third Edition, chapter 4.3 “Sets of 2 × r Tables”.