There is a lot of discussion about Big Data and Data Science revolving around processing massive amounts of information (terabytes of data, which must be trillions of data points). While I am not sure of all the varied problems that people are trying to solve by processing this data, I do have my doubts about the necessity of the effort. Let's think about the major reasons one wants to process data: analytics and learning.

Now let's remember the basics of statistics, namely that we can compute statistics on a representative random sample of any dataset and that they should reflect the statistics of the entire dataset. When computing the mean of the data, we don't need to examine every data point to get a reasonable approximation (see http://en.wikipedia.org/wiki/Convergence_of_random_variables and http://en.wikipedia.org/wiki/Central_limit_theorem). The same holds true for the variance and other analytical metrics. For most analytics needs, analyzing trillions of data points is completely unnecessary.
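
To make the sampling argument concrete, here is a minimal sketch using only the Python standard library. The dataset is synthetic (a million Gaussian draws with a made-up mean and spread), and the 1% sample size is an arbitrary choice for illustration, so treat the numbers as a demonstration of the idea rather than a benchmark.

```python
import random
import statistics

random.seed(0)

# Hypothetical "full" dataset: one million synthetic measurements.
# The mean (100) and standard deviation (15) are illustrative assumptions.
population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

# Simple random sample of 10,000 points, i.e. 1% of the data.
sample = random.sample(population, 10_000)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# The standard error of the sample mean shrinks like sigma / sqrt(n),
# which is why a modest sample already pins the mean down tightly.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"mean of all points: {full_mean:.3f}")
print(f"mean of 1% sample:  {sample_mean:.3f}")
print(f"standard error:     {standard_error:.3f}")
```

With a spread of about 15 and 10,000 sampled points, the standard error works out to roughly 15 / 100 = 0.15, so the sample mean should land within a fraction of a unit of the full mean while touching only 1% of the data.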

We also use machine learning to learn models from our data. Here, once again, we should realize that it does not take trillions of points to create a model. If the model we are building is stable across our data, then we will get diminishing returns as we add more data to the learning process. Again, this is because we are usually approximating a distribution of some form or another at some point, and the links above apply.
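
The diminishing returns can be sketched with a toy experiment. The example below is an assumption-laden stand-in, not any particular production model: it fits a straight line to synthetic noisy data by ordinary least squares and measures how far the learned slope is from the true one as the training set grows. The names `fit_slope` and `mean_abs_error`, the true slope, and the noise level are all made up for illustration.

```python
import random

random.seed(0)

# Illustrative ground truth for the synthetic data: y = 2x + 1 + noise.
TRUE_SLOPE, TRUE_INTERCEPT, NOISE = 2.0, 1.0, 1.0


def fit_slope(n):
    """Fit y = a*x + b by ordinary least squares on n synthetic points; return a."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + random.gauss(0.0, NOISE) for x in xs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


def mean_abs_error(n, trials=100):
    """Average absolute slope error over repeated fits on fresh data."""
    return sum(abs(fit_slope(n) - TRUE_SLOPE) for _ in range(trials)) / trials


errors = {n: mean_abs_error(n) for n in (100, 1_000, 10_000)}
for n, err in errors.items():
    print(f"n = {n:>6}: mean slope error = {err:.4f}")

# Improvement bought by each tenfold increase in training data.
gain_first = errors[100] - errors[1_000]
gain_second = errors[1_000] - errors[10_000]
print(f"gain from 100 -> 1,000:    {gain_first:.4f}")
print(f"gain from 1,000 -> 10,000: {gain_second:.4f}")
```

Because the error falls roughly like 1 / sqrt(n), each tenfold increase in data buys a smaller absolute improvement than the one before it. Going from 1,000 to 10,000 points costs nine times as much data as going from 100 to 1,000, yet improves the fit by less.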

Now, if you need to look for a needle in a haystack (outlier detection), then I can see processing the entire dataset, but that is probably a fairly rare use case. The more probable case for terabytes of data is a temporal model, but even then, older data will eventually cease to be useful for the modeling.

The real cases that I can see are repeated modeling, like a recommendation engine with one model per user. Even in these cases, a general model could be augmented by the relatively "small" data that an individual user brings to the table. However, I can see how millions of users would require significant infrastructure and data processing when computing millions of models.

So what am I missing? Let me know in the comments.