Lately I've been noticing a lot of chatter about automating Data Science.  Some companies are moving into this space (bigML, dataHero), recognizing that there is a void (and therefore a market) driven by the shortage of qualified Data Scientists.  While I have nothing against automation, I think businesses that are considering automation as a solution need to think carefully about the fundamentals of their business processes before proceeding.

All businesses make bets to survive (or not survive).  A bet in this sense is a decision to spend scarce resources (like capital) on an uncertain outcome.  Some bets are explicit (meaning the business owners are very cognizant of the bet they are making):

  • Purchasing ad keywords
  • Hiring more personnel

Others are implicit (meaning the business owners are not fully aware of the implications of the bet they are making):

  • Moving to the cloud
  • Switching accountants

Businesses are using Data Science and predictive models to help choose which bets to make, using historical data to drive new bets (decisions).  The SIZE of the bets a business makes should drive how much it relies on Data Science automation.

Data Science automation tools are not foolproof, they are not bug free, and they do not explore every facet of the data.  In short, the models coming out of any Data Science automation are uncertain (how central reasoning about uncertainty is to Data Science in practice is another blog post), and the degree to which they are uncertain is not always well quantified.  This is no knock against the algorithms the Data Science automation tools are running or the people who implement them; it is just a fact of life that no algorithm and no company is perfect.

Given the last paragraph, it becomes clear why using Data Science automation is unwise when a business is making big bets.  A business making big bets against an uncertain model can afford far fewer of those bets before the costs spiral out of control.

In general this is really just basic utility theory.  In the binary case (an event happens or it doesn't happen) utility boils down to:

Utility = Probability the Event Happens x Payoff if the Event Happens + Probability the Event Does Not Happen x Payoff if the Event Does Not Happen
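
To make the formula concrete, here is a minimal sketch of it in Python (the function name and signature are mine, purely for illustration; payoffs are signed, so gains are positive and costs are negative):

    def expected_utility(p_event, payoff_event, payoff_no_event):
        """Expected utility of a binary bet with signed payoffs."""
        return p_event * payoff_event + (1 - p_event) * payoff_no_event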

Let's work through a specific example.  Say a business is selling a truck:

  • The profit on selling a truck is $5,000
  • The customer acquisition cost is $10,000
  • An automated Data Science predictive model says that if you spend the $10,000 on customer acquisition you have a 75% chance of making a sale to any given qualified customer

The utility then turns into:

Utility = 0.75 * $5,000 + 0.25 * -$10,000 = $3,750 - $2,500 = $1,250

So it makes sense to go ahead with this business model and spend the $10,000 on customer acquisition.  However, what if the model is wrong and the $10,000 spent on customer acquisition only gives the business a 60% chance of converting the sale?  Then the utility becomes:

Utility = 0.6 * $5,000 + 0.4 * -$10,000 = $3,000 - $4,000 = -$1,000

So obviously the business would not want to spend the $10,000 on customer acquisition in this case (or it would want to raise the price of the truck, or generally rethink the business model).
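
Plugging the truck numbers into the expected_utility sketch from above shows how the decision flips (the two printed values are just the utilities computed by hand above):

    # Truck example: a sale nets $5,000; a miss burns the $10,000 acquisition spend.
    print(expected_utility(0.75, 5000, -10000))  # about $1,250  -> take the bet
    print(expected_utility(0.60, 5000, -10000))  # about -$1,000 -> walk away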

So in this case a 15 percentage point change in the predictive model made a huge difference to our fictional company's bottom line.  Our fictional business would (hopefully) eventually notice the problem with its predictive model, but since we are talking about big bets (thousands of dollars each), the damage to the company could be extreme by then.

Take the other extreme though and consider small bets.  Let's take the example above, divide everything by $10,000, and say we are selling widgets instead of trucks.  Now we get:

  • The profit on selling a widget is $0.50
  • The customer acquisition cost is $1
  • An automated Data Science predictive model says that if you spend the $1 on customer acquisition you have a 75% chance of making a sale to any given qualified customer

The utility then turns into:

Utility = 0.75 * $0.50 + 0.25 * -$1 = $0.375 - $0.25 = $0.125

and if the model is inaccurate as above then:

Utility = 0.6 * $0.50 + 0.4 * -$1 = $0.30 - $0.40 = -$0.10

In this case with small bets the potential loss due to the inaccurate predictive model is small.

The difference between the two cases is the number of times the inaccurate model can be used before it is hugely detrimental to the company.  It takes a sufficient sample to tell whether a model is working in the wild; let's say at least 100 runs in the real world.  In the first case, at the true 60% conversion rate, those 100 runs would cost the business an expected $100,000 (an expected loss of $1,000 per run), while in the second case the expected cost is only about $10 (an expected loss of $0.10 per run).
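
As a rough illustration of that point, here is a small simulation sketch (the 100-run threshold is the arbitrary one from above, and the true 60% conversion rate versus the claimed 75% is my assumption for the example, not output from any real tool):

    import random

    def simulate_runs(n_runs, true_p, payoff_sale, payoff_miss, seed=0):
        """Total realized profit/loss over n_runs bets when the TRUE conversion rate is true_p."""
        rng = random.Random(seed)
        return sum(payoff_sale if rng.random() < true_p else payoff_miss
                   for _ in range(n_runs))

    # The model claims a 75% conversion rate, but reality is 60%.
    print(simulate_runs(100, 0.60, 5000, -10000))  # trucks: expect to lose around $100,000
    print(simulate_runs(100, 0.60, 0.50, -1.00))   # widgets: expect to lose around $10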

The conclusion then is that if you are making small bets, you can afford to try an automated solution, because you will be able to tell it isn't performing before it costs you a significant amount of money.  If you are making large bets, an automated solution is not a good choice; instead, investing in a Data Science professional to seriously analyze your data and build a well tested, business specific model is the better path.

As a caveat, I have really only discussed random error in the automated predictive models.  If an automated predictive model (or any model) has a systematic error, things can get even worse, because the business could be systematically excluding potential customers and/or including bad prospects.

To summarize in 3 bullets:

  • Big Bets - go to a Data Scientist
  • Small Bets - carefully try automation
  • Always continuously measure and validate all models!