Digit preference: Benford's law

The algorithm detects potential fraud by checking whether the frequency distribution of the first significant digit (FSD) complies with Benford's Law, which holds for many clinical data variables. A good explanation of the law can be found in [1]; there is also a good Wikipedia article [2].

The first step in applying this algorithm is to establish whether the variable of interest can be expected to follow the law at all.

The algorithm adds a warning to the warnings list if some of the properties expected of data that complies with Benford's Law are not observed. When deciding whether to use the algorithm, consider the origin of the data:

  • Data should have values covering several orders of magnitude (this is also checked by the algorithm). 
  • Normally distributed data typically does not follow Benford's Law.
  • The law is likely to hold for numbers that result from a mathematical combination of other numbers, above all a multiplicative one.
  • The law does not hold for numbers assigned sequentially, for distributions constrained by a threshold that heavily shapes the histogram of the data, or for numbers distributed very unevenly across their range.
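
The first criterion in the list above can be checked quickly before running anything else. The following is a minimal sketch, not the check the algorithm itself performs (the source does not describe it); the function name `orders_of_magnitude` and the sample values are illustrative assumptions.

```python
import math

def orders_of_magnitude(values):
    """Return log10(max|x| / min|x|) over the usable values.

    Assumption: missing, non-finite, and zero values are ignored,
    because they have no first significant digit.
    """
    magnitudes = [abs(v) for v in values
                  if v is not None and math.isfinite(v) and v != 0]
    if not magnitudes:
        return 0.0
    return math.log10(max(magnitudes) / min(magnitudes))

# Hypothetical lab values spanning a wide range.
wide = [0.8, 3.5, 12.0, 47.0, 310.0, 2900.0]
# Hypothetical values clustered around one magnitude.
narrow = [95, 100, 105]

print(orders_of_magnitude(wide))    # several orders of magnitude
print(orders_of_magnitude(narrow))  # well below one order of magnitude
```

A span of well under one order of magnitude, as in the second example, suggests that the variable is a poor candidate for a Benford test.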

Examples of datasets that can be checked for fraud using this algorithm:

  • Incidence rates, for example, the number of events divided by the number of patients.
  • Lab data in some cases, for example, insulin levels.
  • The combined data of all laboratory measurements in a clinical trial. It is more likely to follow the law than individual lab variables, so tests on it are useful for fraud detection.
  • Values derived from lab measurements, for example, dose area product (the product of the radiation dose and the area exposed).
  • Count data, for example, RNA count data in genetics.

Tests performed by the algorithm:

  • The leading digit test is high-level: it detects only obvious deviations and sets the direction for further investigation. Fraudsters inventing numbers tend to overuse certain digit patterns, and the test exposes that. Once such a pattern is caught, it can be examined in more detail with the two leading digits test.
  • The two leading digits test makes it possible to spot suspicious leading digit pairs and to select small samples of numbers beginning with these digits for auditing.
  • The second-order test uses the differences between ordered values to test for anomalies and errors, and it applies to a broader family of distributions than the tests above. “If the data is made up of non-discrete random variables drawn from any continuous distribution with a smooth density function (e.g., the uniform, triangular, normal, or gamma distributions), the digit patterns of the N – 1 difference between the ordered elements will be almost Benford. ‘Almost Benford’ means that the digit patterns will conform closely, but not exactly, to Benford’s Law.” [3] Most of the “data from the natural sciences will be either from a continuous or a discrete distribution satisfying the assumptions of the second-order test, and the expectation is that we will see almost Benford results most of the time. There is only a small difference between the Benford and Almost Benford probabilities, and these differences depend only slightly on the distribution of the original data.” [3]
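
All of the tests above compare observed digit frequencies with the Benford expectation, under which a leading digit (or leading two-digit number) d occurs with probability log10(1 + 1/d). The sketch below shows one way to compute both sides of that comparison. The function names are illustrative assumptions, and the digit extraction through `Decimal(str(...))` is one possible approach rather than the algorithm's own.

```python
import math
from collections import Counter
from decimal import Decimal

def benford_expected(n_digits=1):
    """Benford probability of each leading n-digit number d: log10(1 + 1/d)."""
    start, stop = 10 ** (n_digits - 1), 10 ** n_digits
    return {d: math.log10(1 + 1 / d) for d in range(start, stop)}

def leading_digits(value, n_digits=1):
    """First n significant digits of a nonzero finite number, as an int."""
    # The decimal coefficient holds the significant digits without
    # leading zeros, whatever the scale of the number.
    digits = list(Decimal(str(abs(value))).as_tuple().digits)
    digits += [0] * n_digits  # pad short numbers such as 0.8 -> 80
    return int("".join(str(d) for d in digits[:n_digits]))

def digit_frequencies(values, n_digits=1):
    """Observed relative frequency of every possible leading-digit value."""
    usable = [v for v in values
              if v is not None and math.isfinite(v) and v != 0]
    counts = Counter(leading_digits(v, n_digits) for v in usable)
    total = len(usable)
    return {d: (counts.get(d, 0) / total if total else 0.0)
            for d in benford_expected(n_digits)}

# Powers of two are a classic sequence that follows Benford's Law closely.
observed = digit_frequencies([2 ** k for k in range(100)])
expected = benford_expected(1)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

Calling the same functions with `n_digits=2` gives the 90 frequencies (10 to 99) used by the two leading digits test.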

The algorithm:

  • Finds the frequencies of the values of one or two leading digits in the data.
  • Finds second-order frequencies of two leading digits. To do so, it first removes duplicates from the set of numbers, sorts them in ascending order, and computes the differences between neighboring numbers in the sequence. It then runs the test on the sequence of differences.
  • Compares these frequencies with those expected under Benford's Law, using different statistical tests. In every case the test is selected based on the number of analyzed rows: in the whole dataset if a Label column is not selected, or in each group otherwise (which is especially important if some of the groups are small). Each test is the most sensitive and unbiased in its range (see the table below).
| Number of rows | Method of control | Result | Threshold | P-value correction for multiple comparisons |
|---|---|---|---|---|
| 1...49 | G-test | p-value | Significance level | Benjamini-Yekutieli method |
| 50...499 | Cho-Gaines d* test (Euclidean distance) [5] | p-value | Significance level | Benjamini-Yekutieli method |
| 500 and more | Excess Mean Absolute Deviation (MAD) test | Excess MAD | Thresholds from Nigrini [3] | Not required |
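
The test selection in the table, and the statistics behind the three tests, can be sketched for the first-digit case as follows. This is an illustration under stated assumptions, not the algorithm's implementation: the function names are hypothetical, the d statistic uses the form scaled by the square root of N (its p-value is usually obtained from simulation or tabulated critical values and is not computed here), and `expected_mad`, the sample-size correction from [4], is left as a caller-supplied value.

```python
import math

# Benford probabilities for first digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def select_test(n_rows):
    """Pick the test by row count, following the ranges in the table."""
    if n_rows < 50:
        return "g_test"
    if n_rows < 500:
        return "cho_gaines_d"
    return "excess_mad"

def g_test(counts):
    """G statistic and p-value for nine first-digit counts."""
    n = sum(counts)
    g = 2 * sum(o * math.log(o / (n * p))
                for o, p in zip(counts, BENFORD) if o > 0)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function in closed form; valid here because
    # the 8 degrees of freedom (9 digits - 1) are an even number.
    half = g / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i)
                                    for i in range(4))
    return g, min(p_value, 1.0)

def cho_gaines_d(counts):
    """Euclidean distance between observed and Benford proportions,
    scaled by sqrt(N)."""
    n = sum(counts)
    return math.sqrt(n * sum((o / n - p) ** 2
                             for o, p in zip(counts, BENFORD)))

def mad(counts):
    """Mean absolute deviation between observed and Benford proportions."""
    n = sum(counts)
    return sum(abs(o / n - p) for o, p in zip(counts, BENFORD)) / len(BENFORD)

def excess_mad(counts, expected_mad):
    """MAD minus the MAD expected for a true Benford sample of this size.

    Assumption: expected_mad is supplied by the caller, following [4].
    """
    return mad(counts) - expected_mad

# Hypothetical counts of first digits 1..9.
small_group = [0, 0, 0, 0, 0, 0, 0, 0, 30]              # every value starts with 9
large_group = [301, 176, 125, 97, 79, 67, 58, 51, 46]   # close to Benford

print(select_test(sum(small_group)), g_test(small_group))
print(select_test(sum(large_group)), mad(large_group))
```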
  • The algorithm achieves the best results on big datasets (500 rows or more). For these, the Excess MAD test is used; it answers the question of how similar the actual distribution is to the expected one, so its result is an effect size statistic. The result is compared with the marginal conformity and nonconformity thresholds for MAD specified in [3], and is also corrected for sample size using the method from [4]. Effect size is useful because, besides Benford sets of numbers, which follow Benford's Law with high confidence, there are almost Benford sets, and a certain effect size can be typical for certain data. Even if the threshold is not crossed, issues can be found by comparing the effect size with the one expected from a similar dataset that is known to be correct.
  • For small datasets it is difficult to estimate effect size precisely. The other two tests produce a p-value and answer a different question: whether a distribution follows Benford's Law at a given significance level.
  • The p-values produced by the statistical tests are adjusted for multiple comparisons by the Benjamini-Yekutieli method and compared with the significance level specified by the user. For very small datasets the tests are even less powerful; for that reason, the p-value correction is performed separately for the G-test and the d* test, which increases the probability of a single error but helps to preserve statistical power.
  • Scores are calculated from the p-values as -log10(p-value), which gives values on a more convenient scale than the p-values themselves. A score is highlighted in the report if the p-value is lower than the user-defined significance level, which means that the data is suspicious and the case should be investigated. If the number of observations is small (50 or less), the cell is highlighted with a lighter color.
  • In the report, the user can also examine the distribution of digits in the data and find, for any digit, the difference (in percent) between the actual and expected counts (the expected counts are taken from the distribution of digits for all other label values).
  • The effective number of values processed by the algorithm is reduced by NA values in the data and, for the second-order test specifically, by duplicated numbers. If the data contains many duplicates, the second-order test is therefore not very useful.
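
Three of the steps described in the list above are easy to illustrate: preparing the differences for the second-order test, adjusting p-values with the Benjamini-Yekutieli method, and turning p-values into scores. The sketch below assumes the standard step-up form of the Benjamini-Yekutieli adjustment; the function names and the floor applied to very small p-values are illustrative assumptions, not details taken from the algorithm.

```python
import math

def second_order_differences(values):
    """Drop NA values and duplicates, sort ascending, and return the
    differences between neighboring numbers."""
    usable = {v for v in values if v is not None and math.isfinite(v)}
    ordered = sorted(usable)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def benjamini_yekutieli(p_values):
    """Adjusted p-values, returned in the order of the input."""
    m = len(p_values)
    c_m = sum(1 / k for k in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

def score(p_value):
    """-log10(p-value); the floor avoiding log10(0) is an assumption."""
    return -math.log10(max(p_value, 1e-300))

# Duplicates and NA values shrink the effective sample:
# seven inputs leave four distinct numbers and three differences.
print(second_order_differences([5, 1, None, 3, 3, float("nan"), 10]))

# Hypothetical p-values for three groups and a significance level of 0.05.
adjusted = benjamini_yekutieli([0.001, 0.04, 0.5])
for p in adjusted:
    print(round(p, 4), round(score(p), 2), p < 0.05)
```

In the second example only the first group stays below the significance level after adjustment, although the second group's raw p-value of 0.04 was below it.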

Literature:

[1] Fewster, R.M. (2009) A Simple Explanation of Benford's Law. The American Statistician. 63(1).

[2] https://en.wikipedia.org/wiki/Benford%27s_law

[3] Nigrini, M.J. (2012) Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection.

[4] Barney, B.J. and Schulzke, K.S. (2016) Moderating "Cry Wolf" Events with Excess MAD in Benford's Law Research and Practice.

[5] Cho, W.K.T. and Gaines, B.J. (2007) Breaking the (Benford) Law: Statistical Fraud Detection in Campaign Finance. The American Statistician. 61, 218–223.

[6] Morrow, J. (2010) Benford's Law, Families of Distributions and a Test Basis.