I often write about the use of big data in a best practices compliance program. However, care must be taken in the use and interpretation of big data. One of the clearest treatments of this topic is the recent book by Cathy O’Neil, entitled “Weapons of Math Destruction,” where she looks at issues surrounding big data and how it increases inequality and threatens democracy. In a chapter entitled “Civilian Casualties – Justice in the Age of Big Data,” she considers the crime-predictive software made by PredPol, which essentially positions police officers “where crimes are most likely to appear” according to a wide variety of data, including historical records of crime in a neighborhood or area.

The major problem that O’Neil sees is that although the software is designed to be race- and ethnicity-neutral, if you consider nuisance crime, or “anti-social behaviors,” the data tends to be skewed toward lower-income areas. The more data available to indicate crimes could occur, the more patrolling occurs, leading to more arrests, which in turn generate more data points. This is the theoretical basis of the “broken-windows” philosophy of policing. As O’Neil notes, “This creates a pernicious feedback loop.”

I thought about O’Neil’s thesis and her work when reading a recent article in the MIT Sloan Management Review, entitled “Why Big Data Isn’t Enough,” by Sen Chai and Willy Shih. The authors posit, “There is a growing belief that sophisticated algorithms can explore huge databases and find relationships independent of any preconceived hypotheses. But in businesses that involve scientific research and technological innovation, the authors argue, this approach is misguided and potentially risky.” The authors reject the approach of simply letting “the data ‘tell the story,’ rather than having to develop a hypothesis and go through the painstaking steps to prove it.”

The authors’ research led to three general points that I found useful for the compliance professional. The first is a set of “guidelines for using big data effectively: how to extract meaning from open-ended searches; how to determine appropriate sample sizes; and how to avoid systematic biases.” They “also identified several opportunities in which the use of large datasets can complement traditional hypothesis generation and testing, and have reaffirmed the importance of theory-based models.”

  1. Beware of spurious correlations in open-ended searches. Here the authors come very close to what Ben Locwin called “white noise” in data, noting, “A pitfall in studying large datasets with billions of observational data points is that large deviations are often more attributable to the noise than to the signal itself; searches of large datasets inevitably turn up coincidental patterns that have no predictive power.” Their point was that correlation, standing alone, is not causation.
  2. Be conscious of sample sizes and sample variation when mining for correlations. Here the authors note, “As the number of dimensions expands, the need for sufficient variation in the sample size is also important. Statistically, low variation can lead to biased estimates and limit predictive power — especially at the tails of the distribution. With the drastic growth in dimensionality, researchers must be mindful of both sample size and sample variation.” The bottom line is that the number of data points must increase as the number of variables increases, or you risk obtaining a false correlation from the data.
  3. Beware of systematic biases in data collection. This was one of the key points from O’Neil’s book. The authors state, “with open-ended searches, researchers need to understand potential measurement biases and pay close attention to how experiments are designed.” This translates into the need for using “standardized normalization techniques to remove the distortions.”
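The first pitfall is easy to demonstrate. The following is a minimal sketch (not from the article, and using entirely made-up random data) showing why an open-ended search over thousands of candidate variables will almost always turn up a strong-looking correlation, even when every variable is pure noise with no predictive power at all:

```python
import numpy as np

rng = np.random.default_rng(42)

# A "metric of interest": 50 observations of pure noise.
target = rng.normal(size=50)

# An open-ended search over 10,000 unrelated candidate variables,
# each also pure noise -- no real relationship exists anywhere.
candidates = rng.normal(size=(10_000, 50))

# Pearson correlation of each candidate with the target.
corrs = np.array([np.corrcoef(c, target)[0, 1] for c in candidates])

# The "best" correlation looks impressive, yet it is pure coincidence.
best = np.abs(corrs).max()
print(f"Strongest correlation found by chance: {best:.2f}")
```

With a sample of only 50 observations, searching 10,000 candidates reliably produces correlations well above 0.4 by chance alone, which is exactly why the authors insist on hypothesis-driven testing rather than letting the data “tell the story.”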

The authors report on successful uses of big data to improve existing models; provided the analysis is combined with traditional approaches, it can lead to significant insights. They pointed to two different examples. The first involved strengthening a weather forecasting model, where “Increasing the scale and scope of data observations has led to the discovery of unexpected phenomena that have helped to improve underlying physics models.” Forecasters were able to review data from a weather event known only to occur at certain latitudes in the Indian Ocean and determine its effects in the United States. The authors quoted Peter Neilley, senior vice president of The Weather Co.’s global forecasting services unit, who related that the weather events “may start in the southwest Pacific, but [they] can have an influence on the weather in Boston 30 days later.”

In the creation of new models and products, the authors said, “What began as the application of basic principles of physics and continuum mechanics has evolved to include sophisticated numerical methods based on the idea that large complex objects can be broken down and modeled as sets of individual elements. The simple equations for the individual elements are then assembled into a larger system that models the entire problem.”

The bottom line for the compliance professional, and one of the reasons why the profession need not be afraid of big data or Artificial Intelligence (AI), is that data-driven information should always be viewed as a “supplement to existing methods”; in other words, as “a way to expand dimensionality, discover potentially new relationships, and refine theory. Clearly, data-intensive methods are important complements to experimentation, theoretical models, computer modeling, and simulation because they take us into a realm beyond what such methods are capable of today.”

Running these large data sets around corporate sales; gifts, travel and entertainment; commissions paid to third parties or costs to Supply Chain vendors is a useful exercise for any compliance professional. However, it is only a step in the analysis of whether there is a compliance violation or even something as legally significant as a Foreign Corrupt Practices Act (FCPA) violation. Even if you have the straight-line view that all compliance professionals desire, you are still going to have to interpret that information. Simply because you have a huge sales spike in a far-off province in a country where you previously did not have significant sales does not mean there is a legal problem. There might be, but you will have to investigate and perform an analysis to make a final determination.
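To make the “only a step” point concrete, here is a minimal sketch, with entirely hypothetical region names and sales figures, of how a sales spike might be flagged. Note that the output is a lead requiring investigation, not a finding of a violation:

```python
import statistics

# Hypothetical quarterly sales by region, in $MM (illustrative numbers only).
sales = {
    "Region A": 1.2, "Region B": 0.9, "Region C": 1.1, "Region D": 1.0,
    "Region E": 0.8, "Region F": 1.3, "Region G": 1.0, "Region H": 1.1,
    "Region I": 0.9, "Region J": 1.2, "Region K": 1.0,
    "Region L": 6.5,  # sudden spike in a region with no prior significant sales
}

mean = statistics.mean(sales.values())
stdev = statistics.stdev(sales.values())

# Flag regions more than two standard deviations above the mean.
# A flag is a reason to investigate further -- not, by itself,
# evidence of an FCPA or other compliance violation.
flagged = [r for r, v in sales.items() if (v - mean) / stdev > 2]
print(flagged)
```

The threshold and the statistic are choices the compliance professional must justify; whatever the screen surfaces still has to be interpreted against the business context before any final determination is made.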

The bottom line is that having the data is useful, but you must not only understand the source of the data to determine whether there is bias; you must also do the work of analyzing it.


This publication contains general information only and is based on the experiences and research of the author. The author is not, by means of this publication, rendering business, legal advice, or other professional advice or services. This publication is not a substitute for such legal advice or services, nor should it be used as a basis for any decision or action that may affect your business. Before making any decision or taking any action that may affect your business, you should consult a qualified legal advisor. The author, his affiliates, and related entities shall not be responsible for any loss sustained by any person or entity that relies on this publication. The Author gives his permission to link, post, distribute, or reference this article for any lawful purpose, provided attribution is made to the author. The author can be reached at tfox@tfoxlaw.com.

© Thomas R. Fox, 2017