Raising Artificial Intelligence

The statistical sampling and bias problems we touched on above can be serious. Consider an example. A facial recognition system designed to recognize felons was applied to the members of the US House of Representatives, and it identified the majority of the House’s dark-skinned members as felons. Alternatively, consider the autonomous vehicle system in Phoenix that did not recognize a homeless person as a person, but rather treated her as insignificant debris, and the vehicle struck and killed her. If such a facial recognition system is employed by a CSP to control access to sensitive facilities, these kinds of problems can be very significant.

Supervised training can overcome some of these problems. But supervised training is expensive and slow, and there are not enough people who can do it effectively to support all the applications desired. These facts bring us to unsupervised training, the machine learning discussed above.

Cycle Time?

Beyond the statistical sampling and bias problems, training itself takes time. Even with an advanced multi-processor TPU system and very experienced engineers, it can take 15 to 20 days to train the solution on the training data set. In many application areas, the underlying problems (questions) change faster than the training period. In dynamic fields such as security threat hunting, for example, threats evolve at an alarming rate. So, with a 15-day training period, the basis of the trained solution is invalidated before it even goes live.

It is important to point out that the actual run time of training is not the only factor in cycle time. Significant time goes into deciding and planning. More time goes into developing and acquiring the training data set. Finally, staff and machine resources are not infinite, so a specific training run has to be scheduled and may have to wait its turn. Even when a satisfactory training data set is readily available, planning and development can double or triple the cycle time. If there is a problem obtaining an unbiased training data set, or there is a staff or machine bottleneck, those delays can increase the cycle time dramatically as well.
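To make the cycle time comparison concrete, here is a minimal sketch of the arithmetic. The class name, field names, and every number passed in are hypothetical illustrations, not figures from a real project; only the 15-day training run comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class TrainingCycle:
    """Durations in days. All values used below are illustrative assumptions."""
    planning: float          # deciding and planning
    data_preparation: float  # developing and acquiring the training data set
    queue_wait: float        # waiting for staff and machine resources
    training_run: float      # actual run time of training

    def total_days(self) -> float:
        """Full cycle time: overhead plus the training run itself."""
        return (self.planning + self.data_preparation
                + self.queue_wait + self.training_run)


def is_stale_at_launch(cycle: TrainingCycle, problem_change_days: float) -> bool:
    """True when the underlying problem changes faster than one full cycle."""
    return cycle.total_days() > problem_change_days


# Hypothetical case: a 15-day training run whose overhead doubles the cycle.
cycle = TrainingCycle(planning=5, data_preparation=8, queue_wait=2, training_run=15)
print("cycle time (days):", cycle.total_days())
print("stale at launch:", is_stale_at_launch(cycle, problem_change_days=10))
```

The point of the comparison is that the number to set against the problem's rate of change is the full cycle, not the training run alone.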

Data Scale?

Data scale is important because when data sets grow beyond critical points, processing time grows dramatically. For example, a typical security service within a CSP can collect 20+ BEPD (billions of events per day). A month’s worth of data therefore exceeds 600 billion events. A single query against a conventional database of this size could take over 10 days. Thus, the data scale problem impacts both training and run time.
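The figures above can be checked with a few lines of arithmetic. This sketch assumes a 30-day month, takes "20+" and "over 10 days" at their lower bounds, and assumes the query is a simple full scan; the function name `query_days` is a hypothetical helper for illustration.

```python
EVENTS_PER_DAY = 20e9      # "20+ BEPD" from the text, taken at its lower bound
DAYS_IN_MONTH = 30         # assumption: a 30-day month
QUERY_DAYS = 10            # "over 10 days" from the text, taken at its lower bound
SECONDS_PER_DAY = 86_400

# One month of events at 20 billion per day.
events_per_month = EVENTS_PER_DAY * DAYS_IN_MONTH

# Scan throughput implied by a 10-day query over that month of data.
implied_scan_rate = events_per_month / (QUERY_DAYS * SECONDS_PER_DAY)


def query_days(total_events: float, events_per_second: float) -> float:
    """Days needed to scan total_events at a constant events_per_second."""
    return total_events / events_per_second / SECONDS_PER_DAY


print(f"events per month: {events_per_month:.3g}")
print(f"implied scan rate: {implied_scan_rate:,.0f} events/second")
print(f"days to scan two months: {query_days(2 * events_per_month, implied_scan_rate):.1f}")
```

Under these assumptions, a 10-day query corresponds to a scan rate of roughly 700,000 events per second, and doubling the data doubles the query time unless throughput grows with it.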

Statistical theory tells us that as we increase the size of a representative sample, we decrease sampling error and, with it, the chance that the sample misrepresents the population. Therefore, one way to address the training data set bias problem is to use a larger training data set. Another approach is to use different training data sets for different contexts. For example, using one training data set for a midwestern US population, a different one for a Chinese population, and a third for a Nigerian population makes it possible to match each training set to the characteristics of its population. Each training set can also be refreshed according to how quickly the problem changes in that population. Both approaches add to the data scale problem. Similarly, having larger problem data sets to work with increases the probability of achieving correct results. Again, this increases the time it takes to achieve a result.
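The link between sample size and sampling error can be sketched with the standard error of a sample proportion. This illustration assumes independent random draws, and the 5% subgroup share is a hypothetical value. Note that a larger sample shrinks random error only; it does not by itself correct a data set that was collected in a systematically skewed way.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion from n independent random draws."""
    return math.sqrt(p * (1 - p) / n)


def expected_subgroup_count(p: float, n: int) -> float:
    """Expected number of samples from a subgroup with population share p."""
    return p * n


SUBGROUP_SHARE = 0.05  # hypothetical: a subgroup that is 5% of the population

for n in (100, 10_000, 1_000_000):
    se = standard_error(SUBGROUP_SHARE, n)
    count = expected_subgroup_count(SUBGROUP_SHARE, n)
    print(f"n={n:>9,}  standard error={se:.5f}  expected subgroup samples={count:,.0f}")
```

Because the error falls with the square root of the sample size, a 100-fold larger training set cuts the sampling error by only a factor of 10, which is one reason this remedy feeds directly into the data scale problem.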

Work to solve the data scale problem is ongoing and will be discussed in a future article. For now, going back to the beginning of this article, just remember to ask the questions.


AI is going through a growth spurt. This spurt is having a very significant positive impact in many areas, but it also has limitations and dangers. Understanding both the promise and the cautions is key to making good decisions around AI. So, if someone tells you that they have the ultimate AI that will solve all your problems, remember to ask them the three key questions we have discussed: Where are you in AI’s evolutionary path? What are the cycle time and bias of training relative to the cycle time of the underlying problems? How do you handle the data scale problem?

