Digit preference: check for uniformity

The algorithm looks for data collection errors and fraud by checking whether the frequency distribution of trailing digits is uniform.

Consider the following when deciding whether a data set is eligible for this test.

  • For many clinical variables, the trailing digits can be expected to be uniformly distributed, meaning that every digit is equally likely to occur. Be careful, though: for some measurements this is not the case, and the reason is sometimes not yet known (for example, see [2]).
  • On the other hand, if the leading digits follow Benford's law and the data consists of numbers with more than 3 digits, the trailing digits can be expected to be uniformly distributed, because that is what Benford's law specifies for higher decimal places. See the manual page for Digit preference: Benford's law to learn more about the applicability of Benford's law.
  • The most common use case of the test is data recorded with a specific precision (number of digits after the decimal separator).
  • However, abnormal rounding of numbers will also be detected by the test.
  • “The last-two digit test is generally run on data tables where we are looking for signs of number invention...” [1], so this test shouldn't be run on data whose last digits are normally affected by psychological thresholds or rounding.
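The claim that Benford's law implies nearly uniform digits at higher positions can be checked numerically. The sketch below is illustrative only and is not part of the algorithm described on this page; the function name benford_digit_prob is hypothetical. It uses the standard generalization of Benford's law, in which the probability of digit d at position n (n ≥ 2) is the sum of log10(1 + 1/(10k + d)) over all k from 10^(n-2) to 10^(n-1) - 1.

```python
import math


def benford_digit_prob(digit: int, position: int) -> float:
    """Probability that the digit at `position` (1 = leading digit)
    equals `digit`, assuming the data follows Benford's law."""
    if position == 1:
        # A leading digit is never zero.
        return 0.0 if digit == 0 else math.log10(1 + 1 / digit)
    start = 10 ** (position - 2)
    stop = 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(start, stop))


# The gap between the most and least likely digit shrinks with each position.
for position in (1, 2, 3, 4):
    probs = [benford_digit_prob(d, position) for d in range(10)]
    print(position, max(probs) - min(probs))
```

By the fourth position every digit has a probability within about 0.0002 of 0.1, which is consistent with the "more than 3 digits" rule of thumb above.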

The algorithm:

  • Finds frequencies of values of one or two trailing digits in data. 
  • Compares these frequencies with those expected under a uniform distribution. The statistical test is selected according to the sample size, so that each test is applied in the range where it works best (see the table below).
    Number of rows    Method of control
    1...49            G-test
    50 and more       Kolmogorov-Smirnov (KS) test
  • The p-values produced by the statistical tests are adjusted for multiple comparisons by the Benjamini-Yekutieli method and compared with the significance level specified by the user. Tests are less powerful on very small data sets. For that reason, the p-value correction is performed separately for the G-test and the KS test; this increases the probability of a single error but helps to preserve statistical power.
  • Scores, defined as -log10(p-value), are calculated because they are on a more convenient scale than the p-value itself. A score is highlighted in the report if its p-value is below the significance level, which means that the data is suspicious and the case should be investigated. If the number of observations is small (50 or less), the cell is highlighted with a lighter color.
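The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's implementation: it covers only a single trailing digit and only the G-test branch (the KS branch for larger samples is omitted), values are passed as text so that the recorded precision is preserved, and all function names and sample data are hypothetical. The chi-square tail probability uses the closed form that holds for 9 degrees of freedom (10 digit categories minus 1).

```python
import math


def trailing_digit_counts(values):
    """Count the last recorded digit (0-9) of each value given as text."""
    counts = [0] * 10
    for text in values:
        digits = [ch for ch in text if ch in "0123456789"]
        if digits:
            counts[int(digits[-1])] += 1
    return counts


def chi2_sf_df9(x):
    """Upper tail probability of the chi-square distribution with 9 df."""
    if x <= 0:
        return 1.0
    series = 1 + x / 3 + x ** 2 / 15 + x ** 3 / 105
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * series
    return math.erfc(math.sqrt(x / 2)) + tail


def g_test_uniform(counts):
    """G-test p-value for 10 digit counts against a uniform expectation."""
    expected = sum(counts) / len(counts)
    g = 2 * sum(o * math.log(o / expected) for o in counts if o > 0)
    return chi2_sf_df9(g)


def benjamini_yekutieli(p_values):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(p_values)
    c_m = sum(1 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


def score(p_value):
    """-log10(p-value), floored to avoid taking the log of zero."""
    return max(0.0, -math.log10(max(p_value, 1e-300)))


# Hypothetical samples of 40 rows each: one heaped on 0 and 5, one even.
heaped = [f"{120 + i}.{d}" for i, d in enumerate([0, 5] * 20)]
even = [f"{120 + i}.{i % 10}" for i in range(40)]

raw_p = [g_test_uniform(trailing_digit_counts(s)) for s in (heaped, even)]
adjusted_p = benjamini_yekutieli(raw_p)
significance_level = 0.05  # assumed user setting
for name, p in zip(("heaped", "even"), adjusted_p):
    print(name, score(p), "suspicious" if p < significance_level else "ok")
```

In this sketch the heaped sample is flagged and the even sample is not. A real implementation would also run the two-digit variant and keep the G-test and KS p-values in separate correction groups, as described above.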

Literature:

[1] Mark J. Nigrini (2012). Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection.

[2] S. J. Hayes (2008). Terminal digit preference occurs in pathology reporting irrespective of patient management implication. Journal of Clinical Pathology, 61(9).
