CSM Algorithms

Algorithms Results
IQR box plot

For the IQR algorithm, the Portal shows a box plot chart and a details table for each selected column. Besides the box plot itself, the plot shows the collection of inliers and outliers and the predicted mean as a line. The box plot shows outliers and inliers with labels; a label column can be selected as an option for the IQR algorithm. The details table contains the four quartiles and the median value:
  •        Q1
  •        Q2
  •        Median
  •        Q3
  •        Q4
  •        Amount

Amount is calculated as the total number of inliers and outliers plus the number of filtered-out inliers and outliers.
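As an illustration, the quartiles, fences, and counts behind such a box plot might be computed as in the sketch below. The function name is hypothetical, and the conventional 1.5 × IQR fences are an assumption; the product may use a different fence rule.

```python
import statistics

def iqr_summary(values):
    """Illustrative sketch: quartiles, 1.5*IQR fences, and inlier/outlier counts."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # assumed 1.5*IQR fences
    outliers = [v for v in values if v < lower or v > upper]
    inliers = [v for v in values if lower <= v <= upper]
    return {"Q1": q1, "Median": median, "Q3": q3,
            "Outliers": len(outliers), "Inliers": len(inliers),
            "Amount": len(outliers) + len(inliers)}

result = iqr_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Here the value 100 falls above the upper fence and would be listed as an outlier, while Amount counts every point that entered the computation.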

Correlation matrix

For the Correlation Matrix algorithm, a single heatmap image of the correlation between the selected columns' data is shown. For demonstration purposes, a "crosshair" appears when hovering over this image with the mouse. The cells of the matrix contain ellipses (unless 50 or more variables are selected, in which case the cells are simply colored). The stronger the correlation, the more elongated the ellipse. Positive correlations are displayed in blue and negative correlations in red; color intensity is also proportional to the correlation value.
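The values behind such a heatmap are pairwise correlation coefficients between the selected columns. A minimal pure-Python sketch is shown below; the use of the Pearson coefficient is an assumption, and the function names are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Symmetric matrix of pairwise correlations, one cell per variable pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

data = {"height": [150, 160, 170, 180], "weight": [50, 60, 65, 80]}
m = correlation_matrix(data)
```

Each diagonal cell is exactly 1, and the matrix is symmetric, which is why the rendered heatmap mirrors across its diagonal.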
Histogram of all numeric values

For the Histogram algorithm, a histogram chart and a details table are shown for each selected column. The details in the table are, in order: Minimum (abbreviated to Min), Maximum (abbreviated to Max), Range, Mean, Median, 3rd Quartile (abbreviated to 3rd Qu.), 1st Quartile (abbreviated to 1st Qu.), Total Number, Variance, Standard Deviation (abbreviated to Std. Dev.), and the Number of Missing Values (abbreviated to Missing). As with the IQR algorithm, each histogram chart is labeled with the name of the column it corresponds to. The charts on the CSM Results page show an additional graph: a black spline displaying the kernel density estimate, overlaid on top of the histogram columns. An additional Y-axis for its values is shown on the right side of the chart.

A tooltip for this chart displays both the count (based on the bin being hovered over) and the density estimate (based either on the specific point of the spline being hovered over or, if the user is hovering over a histogram column, on the spline point closest to the midpoint of that column/bin). It also shows the current bin as a range of values and, if the user is hovering over a specific point on the density graph, the specific value on the X-axis.
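The two quantities combined in this chart, per-bin counts and a kernel density estimate, can be sketched as follows. Silverman's rule of thumb for the bandwidth and the Gaussian kernel are assumptions; the product may use a different estimator.

```python
import math
import statistics

def histogram(values, bins=5):
    """Counts per equal-width bin, with (low, high) edges for each bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[i] += 1
    edges = [(lo + i * width, lo + (i + 1) * width) for i in range(bins)]
    return edges, counts

def kde(values, x):
    """Gaussian kernel density estimate at x, Silverman's rule-of-thumb bandwidth."""
    n = len(values)
    h = 1.06 * statistics.stdev(values) * n ** -0.2
    return sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) \
        / (n * h * math.sqrt(2 * math.pi))

edges, counts = histogram(list(range(1, 11)), bins=5)
density = kde(list(range(1, 11)), 5.5)
```

The counts drive the left Y-axis of the chart, while the density values drive the additional Y-axis on the right.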

Scatterplot

You see a scatter plot: a two-dimensional data visualization that uses dots to represent the values obtained for two different variables, one plotted along the x-axis and the other plotted along the y-axis. You will also see:
  •      a linear regression line reflecting the dependency between the variables,
  •      a kernel density estimate curve reflecting the density of points for each axis,
  •      colored ticks near the coordinate axes reflecting the density of points for every label,
  •      a list of outlier points under the plot.
If many variables are selected, you see a table listing all possible pairs of variables; click a pair to see the scatter plot for that combination. Below the table you will find the scatter plot and, under it, a collapsed table with the outliers for that pair of variables (if any).
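The regression line drawn over the scatter plot can be sketched with ordinary least squares, as below. The function name is hypothetical, and OLS is an assumption about how the line is fitted.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a + b*x for the line over the scatter plot."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope: covariance of x and y over the variance of x
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # intercept passes through the mean point
    return a, b

a, b = linear_fit([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

A positive slope corresponds to the upward-sloping line seen when the two variables are positively dependent.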
Digit Preference - within-study comparison

Depending on the label field specified by the user, the algorithm compares leading or trailing digits between user-defined groups in a clinical trial, for example between sites or between randomization arms (to check whether randomization is correct).

The algorithm:

Calculates a contingency table for the frequency distribution of the digits of interest (either 1 or 2 leading or trailing digits). That table is defined as follows: every column contains two values, 1) the count of occurrences of each digit (or sequence of digits) for one label value (for example, if labels are site IDs, the count for a specific site), and 2) the count of that digit (or sequence) for all other label values (if labels are site IDs, the count for all other sites except the one we are looking at).

Calculates standardized midrank scores for columns of the table.

Performs the Cochran-Mantel-Haenszel "Row Mean Scores Differ" test, which effectively compares the rows of the contingency table with each other using a Chi-squared test calculated from the standardized midrank scores.
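The first two steps above can be sketched as follows. The function names are hypothetical, and the exact standardized-midrank convention (midrank divided by the grand total) is an assumption; the product may standardize differently.

```python
def contingency_rows(digits, labels, site):
    """Two-row table: digit counts for one label value vs. all other label values."""
    cols = sorted(set(digits))
    row_site = [sum(1 for d, l in zip(digits, labels) if d == c and l == site)
                for c in cols]
    row_rest = [sum(1 for d, l in zip(digits, labels) if d == c and l != site)
                for c in cols]
    return cols, row_site, row_rest

def midrank_scores(row_site, row_rest):
    """Assumed convention: midrank of each column's pooled observations,
    R_j = (count before column j) + (n_j + 1)/2, divided by the grand total."""
    totals = [a + b for a, b in zip(row_site, row_rest)]
    n = sum(totals)
    scores, before = [], 0
    for t in totals:
        scores.append((before + (t + 1) / 2) / n)
        before += t
    return scores

cols, row_site, row_rest = contingency_rows(
    [1, 1, 2, 2, 2, 3], ["A", "A", "A", "B", "B", "B"], "A")
scores = midrank_scores(row_site, row_rest)
```

These column scores are what the CMH "Row Mean Scores Differ" statistic is then computed from.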

Because all tests executed in the scope of a CSM Request, whether for trailing or leading digits, are parts of a single experiment, an adjustment for multiple comparisons is needed. It is done using the Benjamini-Yekutieli method for control of the false discovery rate; adjusted p-values are calculated with that method.

Scores are calculated from the p-values using the formula -log10(p-value), which gives values on a more convenient and useful scale than the p-values themselves. A score is highlighted in the report if its p-value is lower than the user-defined significance level, which means that the data is suspicious and the case should be investigated. If the number of observations is small (50 or fewer), the cell is highlighted with a lighter color.
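The adjustment and the score transformation described above can be sketched in pure Python as follows. This is an illustrative step-up implementation of Benjamini-Yekutieli, not the product's code.

```python
import math

def benjamini_yekutieli(pvals):
    """Benjamini-Yekutieli step-up adjusted p-values (FDR control under
    arbitrary dependence between the tests)."""
    m = len(pvals)
    c_m = sum(1.0 / i for i in range(1, m + 1))        # extra harmonic factor vs. BH
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                       # enforce monotonicity, top down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

def score(p):
    """-log10(p), the scale used for scores in the report."""
    return -math.log10(p)

adj = benjamini_yekutieli([0.001, 0.02, 0.04, 0.5])
```

Note that the adjusted p-values are never smaller than the raw ones, so a result that is not significant before adjustment cannot become significant after it.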

If the number of highlighted cells is too large, the highlights are additionally filtered: outliers of the distribution of test results are found, and only those are highlighted.

In the report, the user can also examine the distribution of digits in the data and find, for any digit, the difference (in percent) between the actual and expected counts (the expected counts are taken from the distribution of digits for all other label values).

Digit Preference - Benford's law

The algorithm detects fraud by checking that the first significant digit (FSD) frequency distribution complies with Benford's Law, which holds for many clinical data variables. The first step in applying this algorithm is learning whether the variable of interest follows the law in general. The algorithm leaves a warning in the warnings list if some of the properties of data complying with Benford's Law are not observed. When deciding whether to use the algorithm, consider the origin of the data: the data should have values covering several orders of magnitude (this is also checked by the algorithm).

Data which is distributed normally typically does not follow Benford’s Law.

The law is likely followed by numbers that result from a mathematical combination of other numbers, first of all a multiplicative combination.

The law is not followed by numbers assigned sequentially; by numeric distributions constrained by some threshold in a way that heavily affects the histogram of the data; or by numbers distributed very unevenly across their range.

The algorithm:

Finds frequencies of values of one or two leading digits in data.

Finds second-order frequencies of two leading digits. For that, it first removes duplicates from the set of numbers, sorts them in ascending order, and finds the differences between neighboring numbers in the sequence. It then runs a test on the sequence of differences.
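The two frequency-finding steps above can be sketched as follows. The string-based digit extraction and the function names are illustrative assumptions; only the Benford probability formula log10(1 + 1/d) is standard.

```python
import math

def leading_digits(values, k=1):
    """First k significant digits of each value (sign and decimal point ignored)."""
    out = []
    for v in values:
        s = "".join(ch for ch in str(abs(v)) if ch.isdigit()).lstrip("0")
        if len(s) >= k:
            out.append(int(s[:k]))
    return out

def benford_expected(k=1):
    """Benford probability for each k-digit leading value d: log10(1 + 1/d)."""
    lo, hi = 10 ** (k - 1), 10 ** k
    return {d: math.log10(1 + 1 / d) for d in range(lo, hi)}

def second_order_input(values):
    """Differences of the sorted, de-duplicated values, fed to the second-order test."""
    s = sorted(set(values))
    return [b - a for a, b in zip(s, s[1:])]

exp = benford_expected()
```

For one leading digit the expected proportions run from about 30.1% for digit 1 down to about 4.6% for digit 9, and they sum to 1.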

Compares these frequencies with the expectation under Benford's Law. The algorithm performs different statistical tests to compare the FSD distribution with Benford's Law. In every case, the statistical test is selected based on the number of analyzed rows, either in the whole dataset if no Label column is selected, or within each group otherwise (which is especially important if some groups are small); every test is the most sensitive and unbiased within its range.

The algorithm achieves the best results on big datasets (500 rows or more). For them, the Excess MAD test is used, which answers the question of how similar the actual distribution is to the expected one, so its result is an effect-size statistic. This result is compared with the marginal conformity and nonconformity thresholds for MAD specified in [3], but is also corrected for the sample size using the method from [4]. Effect size is useful because, besides Benford sets of numbers, which follow Benford's Law with high confidence, there are almost-Benford sets, and a certain effect size can be observed for certain data. Even if the threshold is not crossed, issues can be found by comparing an effect size with an expectation derived from a similar dataset that has been proven correct.
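The MAD statistic at the core of this test can be sketched as below. The sample-size correction from [4] and the thresholds from [3] are omitted here; this computes only the plain mean absolute deviation between observed and expected digit proportions.

```python
import math

def mad_statistic(observed_counts, expected_probs):
    """Mean absolute deviation between observed and Benford-expected proportions.
    (The Excess MAD test additionally corrects this for sample size.)"""
    n = sum(observed_counts.values())
    return sum(abs(observed_counts.get(d, 0) / n - p)
               for d, p in expected_probs.items()) / len(expected_probs)

# Benford expectation for one leading digit, and a near-perfect Benford sample
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
counts = {d: round(1000 * p) for d, p in expected.items()}

mad = mad_statistic(counts, expected)
```

For data that conforms closely to Benford's Law, the statistic is near zero; larger values indicate a larger effect size.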

For small datasets, it is difficult to estimate the effect size precisely. The other two tests produce a p-value and answer a different question: whether the distribution follows Benford's Law at a given significance level.

The p-values produced by the statistical tests are adjusted for multiple comparisons by the Benjamini-Yekutieli method and compared with the significance level specified by the user. For very small datasets the tests are even less powerful; for that reason, p-value correction for the G-test and the d* test is performed separately, which increases the probability of a single error but helps to preserve statistical power.

Scores are calculated from the p-values using the formula -log10(p-value), which gives values on a more convenient and useful scale than the p-values themselves. A score is highlighted in the report if its p-value is lower than the user-defined significance level, which means that the data is suspicious and the case should be investigated. If the number of observations is small (50 or fewer), the cell is highlighted with a lighter color.

In the report, the user can also examine the distribution of digits in the data and find, for any digit, the difference (in percent) between the actual and expected counts.

The effective number of values processed by the algorithm is reduced by NA values in the data and, specifically for the second-order test, also by duplicated numbers. So if the data contains a large number of duplicates, the second-order test is not very useful.

Digit Preference - check for uniformity

The algorithm detects data-collection errors and fraud by checking that the frequency distribution of trailing digits is uniform. The following should be considered when deciding whether a set of data is eligible for this test. For many clinical variables it can be expected that their trailing digits are distributed uniformly, meaning that all digits occur with equal probability. But we need to be careful: for some measurements this is not the case, and it is sometimes not yet known why. On the other hand, if the leading digits follow Benford's Law and the data consists of numbers with more than 3 digits, then the trailing digits can be expected to have a uniform distribution, because that is what Benford's Law specifies for higher decimal places. See the manual page Digit preference: Benford's law to learn more about the applicability of Benford's Law.

The most common use case of the test is data given with a specific precision (a fixed number of digits after the decimal separator).

If the rounding of numbers is abnormal, that will also be detected by the test.

“The last-two digit test is generally run on data tables where we are looking for signs of number invention...” [1], so this test should not be run on data whose last digits are normally affected by psychological thresholds or rounding.

The algorithm:

Finds frequencies of values of one or two trailing digits in data.

Compares these frequencies with the expectation of uniform distribution.
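Since the G-test is one of the tests applied here, the two steps above can be sketched with a G (log-likelihood ratio) statistic against a uniform expectation. The string-based digit extraction and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def trailing_digit(value, k=1):
    """Last k digits of the value's decimal representation (decimal point ignored)."""
    s = "".join(ch for ch in str(value) if ch.isdigit())
    return s[-k:]

def g_statistic(values, k=1):
    """G (log-likelihood ratio) statistic of trailing-digit counts vs. a
    uniform distribution over the 10**k possible digit patterns."""
    digits = [trailing_digit(v, k) for v in values]
    counts = Counter(digits)
    n = len(digits)
    expected = n / 10 ** k            # uniform expectation per digit pattern
    return 2 * sum(c * math.log(c / expected) for c in counts.values())
```

A statistic near zero means the observed trailing digits match the uniform expectation; large values point to digit preference.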

The p-values produced by the statistical tests are adjusted for multiple comparisons by the Benjamini-Yekutieli method and compared with the significance level specified by the user. For very small datasets the tests are less powerful; for that reason, p-value correction for the G-test and the KS test is performed separately, which increases the probability of a single error but helps to preserve statistical power.

Scores -log10(p-value) are calculated, which are on a more convenient scale than the p-values themselves. A score is highlighted in the report if its p-value is less than the significance level, which means that the data is suspicious and the case should be investigated. If the number of observations is small (50 or fewer), the cell is highlighted with a lighter color.
