Dorothy article draft

Published on January 1, 1970


newbreed


DOROTHY, YOU’RE NOT IN KAGGLE ANYMORE

This post tries to avoid comparing Microprediction.Org to Kaggle.Com, and ultimately fails.

Sometimes the human brain acts like a nearest neighbor algorithm, in the sense that we have a strong urge to understand the new based on something proximal. There's nothing wrong with that and, one speculates, evolution has discovered some good features of the lowly nearest neighbor algorithm. Despite being the simplest of all statistical algorithms it is asymptotically quite amazing.

The efficacy of nearest neighbor in concept space is reduced, however, when our anchoring to existing experience makes it harder to traverse the short distance from concept A to concept B than it would be to arrive at B from an arbitrary starting point. I'm beginning to notice this when data scientists accustomed to Kaggle interact with Microprediction.Org.

Outline

As I cannot not give an answer to "Why is Microprediction.Org different to Kaggle.Com", I offer several strategies to data scientists looking to climb leaderboards at Microprediction.Org:

  • Just grok Microprediction.Org – the quickest, simplest solution.
  • If you insist on comparing Microprediction.Org to X, pick an activity X that is closer. I offer X=algorithmic trading.
  • View it from the perspective of the recipient of your talents, not yourself.
  • If you insist on Kaggle as your anchor point, be aware of thirteen key differences that might trip you up.
  • Just get really mad at the state of data science and it will all be good.

None of these strategies for comparing Microprediction.Org to Kaggle suggest there is anything inherently wrong with Kaggle, by the way, which may be the best thing to come out of Australia since Kylie Minogue.

Strategy 1) Just grok Microprediction.Org

Here's what's up:

  • Anyone can send a number to Microprediction.Org using an API, or Python client (more ways coming). For example a bakery might send the number of pastries they sold in the last fifteen minutes, and do this every fifteen minutes throughout the business day.
  • If the bakery does this for a few days they need only drop by Microprediction.Org to see distributional real-time estimates of sales in the next fifteen minutes. They will also find community z-scores which are a measure of how surprising each data point is.
  • Over time community predictions get better and better, involving more sophisticated algorithms and relevant data. On the other hand if the bakery stops providing the data, their stream will be garbage collected sooner or later.
  • The reason the bakery's predictions get better is that anyone in the world can provide a distributional prediction of the future of the bakery's data stream. A distributional prediction comprises a collection of 100 numbers and a choice of horizon (say 15 minutes in our example). These numbers can be interpreted as knot points in a CDF, or as support points for a kernel estimate of density, or perhaps quasi-Monte Carlo samples.
  • When a new data point arrives, the system selects a delay (again, let's say 15 minutes) and looks at all distributional predictions from everyone who designated the 15 minute horizon. It rewards some who were close with a tiny credit. It is a zero sum game however, so some lose credits. If three of your hundred scenarios are close to the truth and the average Joe has 1.7 scenarios close, your balance goes up.
  • Performance over time might be rewarded monetarily (by companies such as Intech Investments) or it might not. Monetary rewards might be made in part based on performance and also with other criteria in mind, such as ease of reuse. Part of the ambition at Microprediction.Org is a cross-subsidy from commercial prediction applications sponsored by large companies. Companies who wish to recycle their internally built prediction models for civic uses, for example, now have an extremely convenient way to do this.
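To make the community z-score idea a little more concrete, here is a minimal sketch. It assumes the z-score is obtained by locating the arriving value within the pooled community scenarios and mapping that percentile through the inverse normal CDF. The function name `community_zscore` and the mid-rank convention are illustrative assumptions, not the site's actual calculation.

```python
from statistics import NormalDist
from typing import Sequence


def community_zscore(scenarios: Sequence[float], value: float) -> float:
    """Hypothetical z-score: how surprising is `value` given pooled scenarios?"""
    n = len(scenarios)
    below = sum(1 for s in scenarios if s < value)
    ties = sum(1 for s in scenarios if s == value)
    # Mid-rank percentile, kept strictly inside (0, 1) so inv_cdf is finite
    percentile = (below + 0.5 * ties + 0.5) / (n + 1)
    return NormalDist().inv_cdf(percentile)


pooled = [float(k) for k in range(1, 100)]   # 99 made-up community scenarios
print(community_zscore(pooled, 50.0))        # value in the middle: z near zero
print(community_zscore(pooled, 1000.0))      # value far above every scenario: large z
```

A value that lands in the middle of the community's scenarios scores near zero, while one that lands beyond all of them scores well above two.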

There is more to the story, but that is the gist. For fancy uses like soliciting Copula based distributional predictions using space filling Z-curves, you should consult Microprediction.Org. But at heart Microprediction.Org is simple. It is just a database where distributional predictions of live data are quarantined for short time periods and then subsequently rewarded according to out of sample accuracy.
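The reward step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the site's actual settlement rule: I assume "close" means within a fixed `window` of the truth, and that each participant's credit change is their count of close scenarios minus the average count. The names `settle`, `submissions` and `window` are hypothetical.

```python
from typing import Dict, List


def settle(submissions: Dict[str, List[float]], truth: float, window: float) -> Dict[str, float]:
    """Zero-sum settlement sketch: credit change = own hits minus average hits."""
    hits = {
        name: sum(1 for x in scenarios if abs(x - truth) <= window)
        for name, scenarios in submissions.items()
    }
    mean_hits = sum(hits.values()) / len(hits)
    # Subtracting the mean guarantees the credit changes sum to zero
    return {name: h - mean_hits for name, h in hits.items()}


payouts = settle(
    submissions={
        "you": [9.8, 10.1, 10.4, 20.0],   # three scenarios near the truth
        "joe": [10.2, 30.0, 40.0, 50.0],  # one scenario near the truth
    },
    truth=10.0,
    window=0.5,
)
print(payouts)  # {'you': 1.0, 'joe': -1.0}
```

Because the average is subtracted from everyone, whatever one participant gains another loses, which is the zero-sum property described above.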

Strategy 2) Compare Microprediction.Org to algorithmic trading

If you are someone with a cool time series algorithm then most likely you will create a crawler using the microprediction library on PyPI and run it. Your algorithm will sooner or later find time series that it likes, hopefully, and credits earned there will trend up over time. Insofar as you maintain, and hopefully improve, a prediction algorithm over time, this is similar to real world work such as:

  • Working in a front office trading job, and tinkering with algorithms that place trades.
  • Maintaining live operational systems that extract signal from noise (in astronomy, geophysics, transport, e-commerce, manufacturing or military applications for example).
  • Maintaining programs that spot outliers and anomalous patterns in streaming data.

The emphasis is one of maintaining algorithms and ensuring their ongoing relevance and performance. Issues that arise include:

  • State management
  • Online, sequential processing
  • The endless search for relevant exogenous data
  • Things going wrong

There are pragmatic issues associated with maintaining something that runs continuously that can only be learned by doing. These include sensible practices for changing your code, managing your own human time, automating everything that can be automated, and acting defensively in various ways. Only by painful experience do we learn what really costs time and what saves it.
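The first two issues in the list, state management and online sequential processing, are easiest to appreciate with a toy example. The sketch below uses Welford's recursive update for mean and variance, so each arriving data point is processed once and only a tiny state object needs to survive between invocations. The class name `RunningStats` is my own and is not part of the microprediction library.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Small piece of state that can be saved and reloaded between runs."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: one pass, no need to keep the history
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation, zero until two points have arrived
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


stats = RunningStats()
for sale_count in [1.0, 2.0, 3.0, 4.0]:   # made-up arrivals from a stream
    stats.update(sale_count)
print(stats.mean, stats.std)
```

Anything fancier follows the same pattern: decide what state you carry, update it when a data point arrives, and make sure it can be persisted when your process inevitably falls over.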

Strategy 3) The bakery's perspective

Perhaps this perspective does not matter as much if you are just looking for some data to whack with your latest creation. However since we are playing the "what is Microprediction.Org closest to" game, it is worth looking at this from the other side of the coin as well, which is to say the bakery who sends a count of sales every 15 minutes and wants to know if they are just about to be swamped with customers, or the opposite.

What is Microprediction most similar to as far as the bakery in our example is concerned? Certainly not Kaggle, as it may cost the bakery tens or hundreds of thousands of dollars to run a contest there, then more cost for the bakery's non-existent data science team to integrate the results into their operations. Due to the cost constraint, it is also unlikely that the bakery will hire someone to build a specially tailored model.

More likely the time series might be attacked by some more general purpose time series prediction tool that doesn't know anything about baked goods or when the baseball game across the street is likely to finish. So the nearest neighbor to Microprediction.Org is:

  • Using an automated machine learning product.

There are many of these but some pretty big differences compared to Microprediction.Org:

  • The vendor products can cost a lot, say $100,000 a year versus $10 in compute at Microprediction.Org
  • The money might buy model tweaking only, not relevant exogenous data.
  • It may represent a one way street. The bakery won't be credited for the fact that bakery sales are also a source of data. Perhaps sales help predict precipitation with slightly lower latency than commercial weather products. Who knows?

Despite these rather pronounced differences, an automated prediction product might be the closest thing to turnkey prediction of the sort offered by Microprediction.Org, insofar as one puts data in and predictions come out and there is a clear separation of concerns – the tool does the management work.

An even closer neighbor, in principle, is open source prediction libraries including automated learning libraries – though somebody at the bakery would have to weed through and smooth over some corners, and to some extent, automated machine learning products are trying to perform that very service. As a data scientist on the other side of the API at Microprediction.Org you are able to use the same open source library at a lower marginal cost than the bakery (perhaps you already use it for another data stream so the human cost is zero – your crawler finds the bakery stream on its own). Thus, open source automated machine learning isn't considered a foil or alternative to Microprediction.Org because the latter is a vehicle for the former.

Rather, because companies creating Automated Machine Learning products try to add value over and above the value already provided by a wealth of excellent open source contributions, they are the nearest neighbor to Microprediction.Org and the difference is therefore more important than the differences between Microprediction.Org and, say, Kaggle (which we will come to). To be dramatic, that difference is the difference between a free market and a centrally planned economy.

The economic opinion expressed at Microprediction.Org is that market forces are ultimately more powerful than centralized management when it comes to arranging the means of production. So consider Microprediction.Org to be "Hayek meets Machine Learning". Or maybe "Coase meets Machine Learning" since the intent is the reduction of all manner of economic friction related to the production of nowcasts.

In contrast, companies that make Automated Machine Learning products are arranged as classic corporate human pyramids. Despite the intellect of those leading them – some I know personally and admire – that may not prove to be a sufficiently powerful orchestration principle when it comes to supplying general purpose bespoke prediction with arbitrarily high performance and arbitrarily low cost.

We can argue about the asymptotic state of this industry, but when you participate at Microprediction.Org you might be contributing to a narrative something like the following, assuming I am vaguely right:

  • Open source automated online machine learning improves rapidly
  • A "microprediction network" gradually eats away at a tragedy of the commons as it relates to nowcasts and real-time control applications: namely data and algorithm reuse.
  • An arbitrary number of self-interested people and algorithms conspire to create a very high quality supply chain. Algorithms tolerate economic surpluses many, many orders of magnitude lower than humans. They are miniature firms buying and selling data.
  • Slowly, the benefits catch on leading to increased use of the network for monetary reward. Nowcasting networks of this sort spread inside companies, but also reach between them.
  • Eventually this live feature space, jointly hosted by individuals the same way the internet is jointly hosted, becomes indispensable for realtime operations.

Now that might be really useful for the bakery – though benefits accrue to its operations today, not at the end of that particular rainbow. There are already quite a few smart algorithms living in the nascent microprediction network behind Microprediction.Org that the bakery can tap right now with a few lines of Python.

Strategy 4) Contrast Microprediction to Kaggle

Let's get one thing straight: Kaggle is more populous than Microprediction.Org. If you want to prove that you are better at tweaking a model than 100,000 other people handed the same data on a platter, Kaggle is seemingly the world's most competitive venue – the Olympics of data science. So in comparing Microprediction.Org to Kaggle we have a tradeoff:

  • Diversity. Microprediction.Org is the antithesis of the Olympics in this sense: someone can establish the 101m dash and the 1507m freestyle, should they wish. Is that just daft? Yes, from the perspective of diluting the achievement of elite athletes. But no, it isn't necessarily daft if someone needs analytical insight specific to them … and that may well be analogous to arranging a three legged 2724m steeplechase event, even if, initially, it does not attract the world's best performers. The point is to meet bespoke prediction needs.
  • Competition. There might be a hundred thousand registered Kaggle users for every one live contest on Kaggle. Clearly Kaggle is more competitive. To turn that ratio around, one good researcher can create a crawling model that eventually searches thousands of live streams at Microprediction.Org. So a small number of talented people can add a lot of value at Microprediction.Org in a different way whereas almost by definition, the direct marginal impact at Kaggle for those not on the podium is negligible (which doesn't mean you can't learn a lot – keep Kaggling).

If this makes it sound like I'm arguing that one is inherently better than the other, that isn't the intent. They are simply different. Maybe it is not for me to say but Kaggle appears to be spiritually rooted in the Common Task Framework (an academic phrase from Mark Liberman for standardized data sets and impartial scoring) whereas I can say that Microprediction.Org is more market inspired and obsessed with the distribution and production en masse of realtime intelligence including the feedback effects and falling marginal costs that arise from similar time series. The predicate is that a reasonable prediction is better than none at all for many use cases.

Now that said, I know a few cunning foxes and you might not find it quite so easy to walk all over the intelligent fauna at Microprediction.Org on your way to a magnificent victory. If you do, there are all sorts of leaderboards, custom performance calculations and bragging opportunities – and of course that's where the similarity to Kaggle comes in, at least in the broad category of statistical blood-sport. I would say that a fair measure of Kaggle's tremendous success is the strength of the association "data science + leaderboard = Kaggle". Just don't be confused: not everything with a leaderboard is Kaggle.

Indeed leaderboards are so ubiquitous they might not be the most useful attribute for our nearest neighbor algorithm that is trying to find us things close to Microprediction.Org. To be trite, Augusta National Golf Club sports a charming leaderboard too but the similarities to Kaggle probably end there. On the other hand Augusta hosts an interesting real-time competition once a year where players solve a fascinating sequential decision problem under uncertainty as they plot their way around the azaleas.

It won't be long before caddies start using Microprediction.Org for club-conditional value function predictions on the golf course, and feed this back into their advice to players in real time. (Those of you with the shotlink data reading this, shoot me a note). That's what Microprediction.Org is about – delivering turnkey intelligence without the usual baggage that comes with it (people, technology stacks and powerpoint fluffery).

My point: Microprediction.Org is for immediate real-time use and there is almost no barrier to entry for those looking to solicit turnkey predictions (the cost is just a few hours of CPU time at time of writing – as we don't want the system overloaded with spurious streams). It is dirty, messy, real and visceral. It directly impacts operations. You join a fleet of algorithms directly engaged in a realtime problem. It is more like Ender's Game for statistics than a pristine contest.

That does not preclude the merit of holding a small number of prestigious contests at venues like Kaggle. Companies have, it would seem, benefitted from insights gained in this fashion – though that still makes several assumptions:

  • There is someone to take the competition insights into production.
  • There is a budget for contests on the order of tens or hundreds of thousands.
  • It is possible to avoid a morass of data leakage issues.
  • The data will still be relevant by the time it is used in anger.

If you are a competitor on Kaggle, or Chalearn, or TopCoder or DrivenData or the data-sciency contesty place of your choosing, and you like to sharpen your skills that way, what harm is there in that? These sites have literally helped cure cancer and advanced civic use cases. But … if you also want to try something that is arguably closer to the demands of actual employment, that also pits your wits against others but potentially demands of you a wider set of activities, then try Microprediction.Org.

You will find that Microprediction.Org treats you more symmetrically. You can create streams as well as predict them, or even host a version of the site if you want. But in exchange for treating you like an adult we ask that you keep some crucial differences in mind when you arrive at Microprediction.Org from Kaggle (be that a literal journey or a conceptual one).

  • Competition to predict streams is continuous, not declared finished at a defined time. Streams may go away, or they may persist for years. Assessment is stateless and incremental.
  • Finding relevant exogenous data is not considered cheating. Finding relevant data is noble and may be just as important as finding a good algorithm, if not more so.
  • Management of data is mostly up to you. The system provides you with the last 10,000 lagged data points. Store more if you need. There are some features coming for exogenous data linkages but for now, you'll typically need to find cross-sectional data yourself – should that be relevant.
  • Finding lower latency data is not cheating. In fact a great use of Microprediction.Org is turning delayed data into more up to date data. If you can find a source that prints the answer before it is revealed, good for you.
  • Arbitrage is not cheating. Say you transform an incoming data stream into another that is easier to work with, then source predictions for that, then use other people's predictions to win the original contest. That's cool. Better than cool.
  • Streams are not philosophical ground truths. You have no right to complain that a series is not "accurate" or "meaningful". Let's suppose someone creates a stream counting the number of times that the words "New York" and "COVID-19" appear in the same sentence in news print. That is not synonymous with any medical ground truth, it is just a number to be predicted. And there are very legitimate reasons why someone might want to create such a stream – notably the fact that YOU might find it beneficial to discover something better approximating ground truth in order to better predict something that is only weakly correlated with ground truth. See the second and third points above.
  • You supply distributional estimates in the form of a vector of many scenarios that a single yet to be revealed scalar quantity might take. You do not supply point estimates so don't look for a list of scoring rules like least square error. You have no right to complain about not being able to supply a single number, because you can promote a point estimate to a distributional one as easily as we can … say by using it to offset some other sampling procedure, or by using your empirical errors … or whatever you choose. There are plenty of examples you can fork. Besides, point estimates are difficult to interpret. Say the time series is the lateral position of where a ball lands on Federer's side of the tennis court as it is smashed at him by Nadal from one point to the next. What do you want us to do with a single number estimate of ball position, given that the truth is obviously bimodal?
  • You are encouraged to create a crawler. A crawler is an algorithm plus a chauffeur to drive it from one stream to another, so that the effort you spend on an algorithm for one purpose (perhaps one with a prize attached) is automatically reused elsewhere. Maybe the bakery can't afford to offer prizes, but why not help them if it costs you little on the margin? What goes around comes around.
  • You might spend a little more time thinking about sequential time series data and online (recursive) algorithms versus offline analysis of tabular data – though of course you are welcome to do whatever you wish as regards sporadic fitting of models.
  • If Kaggle is the Home of Data Science, Microprediction.Org is the Away of Data Science. You are free. Enter the garden, leave the garden. But you are encouraged to take the time to learn existing powerful tools for open source development. Use git and continuous integration testing. Package on PyPI (if using Python). Write papers with live, updating model comparisons to address the reproducibility crisis, and otherwise contribute to the existing means of reuse and sharing of work. It is worth noting that clean, tested, packaged (PyPI) open source algorithms are separately rewarded to encourage advances and reuse, particularly in the area of online time series algorithms.
  • You can use the network recursively. For example you can use the Microprediction.Org APIs or Python client to solicit predictions for residuals, or for performance of your algorithm, or for ingredients or in some other ingenious manner you devise. You will need a key of difficulty 12 or more for this, as with any stream creator.
  • You don't need to supply us with an email. You can mine your own secret identity in the form of a Memorable Unique Identifier (MUID). You can have multiple identities. Each identity is just a key. However the first few characters of each key's SHA-256 hash look suspiciously like an animal description. There's a video about MUIDs at Microprediction.Org if you care deeply about this, though you don't really need to.
  • You don't need to be a human. Many market-like mechanisms are designed for human eyes. But at Microprediction.Org there is nothing on the site that isn't exposed via the API. We are equal opportunity. That said, if you are an artificially intelligent life form looking to be compensated you might need to befriend a human with a paypal account.
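On the point about promoting a point estimate to a distributional one using your empirical errors, here is a minimal sketch of one way it might be done. It assumes you have kept a list of your past forecast errors (truth minus forecast) and that 100 scenarios are wanted, as in the earlier description. The function name `promote` is hypothetical, and this is only one of many reasonable recipes.

```python
import statistics
from typing import List, Sequence


def promote(point_estimate: float, past_errors: Sequence[float], num_scenarios: int = 100) -> List[float]:
    """Shift evenly spaced quantiles of past errors by the point estimate."""
    # quantiles(..., n=k) returns k - 1 cut points, so ask for one more than needed
    error_quantiles = statistics.quantiles(past_errors, n=num_scenarios + 1)
    return [point_estimate + e for e in error_quantiles]


past_errors = [-2.0, -1.0, 0.0, 1.0, 2.0]   # made-up historical errors
scenarios = promote(10.0, past_errors)
print(len(scenarios))                        # 100
print(scenarios[0], scenarios[-1])           # lowest and highest scenario
```

If your errors happen to be bimodal, as in the tennis example, the scenarios inherit that shape automatically, which is rather the point.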

One might go on, but I hope you can now see why comparing Microprediction.Org to Kaggle, or to MNIST, might actually set you back on your path to success at Microprediction.Org. There are informal online gatherings and other ways to ask questions. Welcome!

Strategy 5) Start a fight with your inner fraud

We have been discussing things that are close to Microprediction.Org, or not so close as it turns out. However I realize that sometimes people are motivated to participate in X not because it is similar to Y but because it is different to Z. Well, there is one thing that is very far from Microprediction.Org's community style of turnkey bespoke prediction, and it is called data science! By that I refer to the industry of humans comprising:

  • AI consulting companies
  • In-house data scientists

There's a good chance that you, like me, participate, or have participated, or will participate in some variety of this status quo. Here's your chance to rage against that machine. Perhaps it goes without saying that data science as it stands today isn't dreadfully compelling for a bakery from a cost perspective. Independent bakeries can't afford teams of data scientists, or even half a data scientist. It all comes down to cost.

And while we are on that, one really has to wonder if data science as it exists today is cost effective for anyone, save for a small number of very high impact applications in the billion dollar revenue and above company bracket. Even here some embarrassing questions can be asked such as: how many models do you actually have in production with live ongoing performance analysis? Hmmm. How many companies have more live models than full time employees working on them (crickets). And how good are they? We know it is very hard to keep human centric data science efforts intellectually honest. Here's a simple test:

  • Have you sent your model residuals to Microprediction.Org?

If the answer is no, then what's your reason?

  • Prioritization. Ah huh? It takes five minutes to pip install microprediction, create a MicroPoll class and run it on a source of live data such as model residuals.
  • Disorganization. You – ahem – don't actually have ongoing performance analysis so model residuals aren't quite as convenient as they might be, are they?
  • Pride? Do you really, truly want to know that someone else might have a better algorithm, or be able to find exogenous data that you cannot?

The deeper fear: like the master craftsmen of the pre-industrial era, you and I don't want to imagine even for a moment that our way of plying a living may ultimately not make economic sense. If there's enough data for Machine Learning to work, there's probably enough data to manage the production of prediction in automated fashion, albeit leaning where necessary on the decentralized optimization of a supply chain. That is definitely possible on a sub-domain of problems I've termed microprediction. There, human management of model selection at high cost is not intellectually defensible, and that's how I lost the battle with my inner quantitative fraud.

Have your own wrestle, and perhaps we'll see you too at Microprediction.Org as a producer or consumer of collective microprediction, or ideally both.