How Clean ML Is Shattering Data Science’s Glass Ceiling
By: Matthew Karasick, Chief Product Officer, Habu
In the days before the Internet, brands had access to drastically less data about... well, just about everything. Intelligence around customer intent and behavioral insights typically required self-reporting from surveys and focus groups conducted and analyzed by marketing analysts.
The web exponentially increased the amount of data available to organizations and marketers. In 2008, D.J. Patil and Jeff Hammerbacher were leading the intersection of data and analytics at LinkedIn and Facebook when they coined a new term to describe what they were doing on a daily basis: data science.
Within a few years, everyone, across every sector and company function, was talking about Big Data, and its volume, velocity and potential. Industry pundits rightly convinced marketers that they were sitting on a vast pool of consumer data that, if tapped, could readily power better outcomes and even new business models through better analytics and data-driven execution.
Corporations set to work building teams of data scientists who worked across the organization to add data-driven intelligence to Sales, Marketing, HR, Operations, and beyond.
Brands, and the technology and data providers who make up their “stacks”, have steadily increased their investment in analytics and data science disciplines and are markedly more sophisticated than they were just a handful of years ago.
With access to the wide-open pipes of the ad tech ecosystem, the results are impressive: data is captured, parsed and activated within milliseconds of its creation. It is widely understood that a click on a product page anywhere will likely shape what you see, starting within milliseconds of the click. If you’re like me, you have friends who swear that they see ads for things based on words they have spoken (“my devices are listening…”).
Big Data Hits a Data Ceiling
With changes to browsers, regulations, and beyond, access to second- and third-party data has recently become far less ubiquitous. Data collaboration now requires far more intentionality, and clean room software has emerged as a viable path to data collaboration at scale by giving data owners fine-grained control over how their data is accessed and used.
Using clean rooms, endemic publishers can let strategic advertising partners use their data for measurement without fear of their audience leaking and being activated without their involvement. Retailers are able to allow their CPG partners to utilize transaction data to inform optimization opportunities. Auto OEMs and dealer groups are pushing past friction that has existed in their three-tier system for decades.
With the momentum of data collaboration that clean rooms have paved the way for, savvy teams quickly landed on a question: if we can find ways to join distributed data, can we also find a way to put my machine learning model together with your dataset to run inference or predictions, without either of us ever having to share or ship our respective assets to the other?
CleanML is the natural evolution of clean rooms, whereby two (or more) parties can each bring distributed raw materials for machine learning/AI, such as a model, model-training code, or dataset(s), with each partner’s assets remaining safe and protected in its own clean room. CleanML then creates a temporary, neutral compute environment in which the assets are joined to produce output, which is then written to one or both of the partners’ mutually agreed-upon clean room(s). The compute environment then quickly evaporates, and the only remaining artifact is the generated output.
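To make that flow concrete, here is a minimal, purely illustrative sketch in Python. It is not Habu’s API or implementation: the names `CleanRoom`, `ephemeral_compute` and `run_clean_ml` are hypothetical, everything lives in memory, and a real system would add isolation, encryption, policy checks and auditing that are only hinted at in comments here. The point is simply the shape of the idea: assets are lent into a short-lived environment, only the output is written to the agreed clean room(s), and the environment is then wiped.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class CleanRoom:
    """Hypothetical stand-in for one partner's clean room."""
    owner: str
    assets: Dict[str, Any] = field(default_factory=dict)   # models, datasets
    outputs: Dict[str, Any] = field(default_factory=dict)  # results written back

    def lend(self, name: str) -> Any:
        # Assumption: a real clean room would enforce the owner's
        # access policy here instead of handing the asset over directly.
        return self.assets[name]


@contextmanager
def ephemeral_compute() -> Iterator[Dict[str, Any]]:
    """Scratch space that exists only for the duration of one job."""
    scratch: Dict[str, Any] = {}
    try:
        yield scratch
    finally:
        scratch.clear()  # the environment "evaporates" when the job ends


def run_clean_ml(
    model_room: CleanRoom,
    model_name: str,
    data_room: CleanRoom,
    data_name: str,
    destinations: List[CleanRoom],
    output_name: str,
) -> None:
    """Join one partner's model with another partner's data, keep only the output."""
    with ephemeral_compute() as env:
        env["model"] = model_room.lend(model_name)
        env["rows"] = data_room.lend(data_name)
        predictions = [env["model"](row) for row in env["rows"]]
        # Only the generated output reaches the agreed-upon clean room(s).
        for room in destinations:
            room.outputs[output_name] = list(predictions)


# Toy usage: a brand's scoring model runs on a retailer's rows.
score: Callable[[Dict[str, int]], int] = lambda row: min(row["visits"], 5) * 20

brand = CleanRoom("brand", assets={"propensity_model": score})
retailer = CleanRoom("retailer", assets={"transactions": [{"visits": 1}, {"visits": 4}]})

run_clean_ml(brand, "propensity_model", retailer, "transactions",
             destinations=[brand], output_name="propensity_scores")
```

In this toy run the brand ends up with the scores in its own `outputs`, the retailer receives nothing it did not agree to receive, and neither partner’s asset is copied into the other’s clean room.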
Using CleanML, data science teams are now the proverbial kids in the candy store. Ask 100 data scientists what they’d rather have, better data or better algorithms, and somewhere between 98 and 100 of them will give the same answer: “better data”. With CleanML, these smart teams are now realizing that they can quickly leave their own four walls and start thinking about who has potentially valuable data (or models) and would be a candidate for collaboration.
CleanML Use Cases
CleanML is now being used across a number of verticals and use cases. CPG companies and their retail partners are building new propensity models, in which CPG models run on retail partners’ data, to drive both advertising and distribution. Brands are able to utilize data enrichment vendors without their data ever leaving its home. R&D departments are now using confidential product data from distribution partners to inform new product development. Partners in highly regulated industries such as Financial Services and Healthcare are shattering ceilings which seemed immovable due to regulation and trust.
CleanML opens the door not just to an incremental new tactic but to a whole path of innovation for data science teams and their partners alike. Roadmaps can and should now imagine working with datasets or models which could come from anywhere. And while it is still early on this arc of innovation, where models (or other types of executables) and data can be joined, something tells me that we can hardly even imagine how smart enterprises will use this technology to do amazing things.