7 Machine Learning Best Practices

Netflix’s famous algorithm challenge awarded a million dollars to the best algorithm for predicting user ratings for films. But did you know that the winning algorithm was never implemented into a functional model?

Netflix reported that the results of the algorithm just didn’t seem to justify the engineering effort needed to bring them to a production environment. That’s one of the big problems with machine learning.

At your company, you can create the most elegant machine learning model anyone has ever seen. It just won’t matter if you never deploy and operationalize it. That’s no easy feat, which is why we’re presenting you with seven machine learning best practices.

Download your free ebook, “Demystifying Machine Learning

At the most recent Data and Analytics Summit, we caught up with Charlie Berger, Senior Director of Product Management for Data Mining and Advanced Analytics to find out more. This is article is based on what he had to say. 

Putting your model into practice might longer than you think. A TDWI report found that 28% of respondents took three to five months to put their model into operational use. And almost 15% needed longer than nine months.

Graph on Machine Learning Operational Use

So what can you do to start deploying your machine learning faster?

We’ve laid out our tips here:

1. Don’t Forget to Actually Get Started

In the following points, we’re going to give you a list of different ways to ensure your machine learning models are used in the best way. But we’re starting out with the most important point of all.

The truth is that at this point in machine learning, many people never get started at all. This happens for many reasons. The technology is complicated, the buy-in perhaps isn’t there, or people are just trying too hard to get everything e-x-a-c-t-l-y right. So here’s Charlie’s recommendation:

Get started, even if you know that you’ll have to rebuild the model once a month. The learning you gain from this will be invaluable.

2. Start with a Business Problem Statement and Establish the Right Success Metrics

Starting with a business problem is a common machine learning best practice. But it’s common precisely because it’s so essential and yet many people de-prioritize it.

Think about this quote, “If I had an hour to solve a problem, I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”

Now be sure that you’re applying it to your machine learning scenarios. Below, we have a list of poorly defined problem statements and examples of ways to define them in a more specific way.

Machine Learning Problem Statements

Think about what your definition of profitability is. For example, we recently talked to a nation-wide chain of fast-casual restaurants that wanted to look at increasing their soft drinks sales. In that case, we had to consider carefully the implications of defining the basket. Is the transaction a single meal, or six meals for a family? This matters because it affects how you will display the results. You’ll have to think about how to approach the problem and ultimately operationalize it.

Beyond establishing success metrics, you need to establish the right ones. Metrics will help you establish progress, but does improving the metric actually improve the end user experience? For example, your traditional accuracy measures might encompass precision and square error. But if you’re trying to create a model that measures price optimization for airlines, that doesn’t matter if your cost per purchase and overall purchases isn’t going up.

3. Don’t Move Your Data – Move the Algorithms

The Achilles heel in predictive modeling is that it’s a 2-step process. First you build the model, generally on sample data that can run in numbers ranging from the hundreds to the millions. And then, once the predictive model is built, data scientists have to apply it. However, much of that data resides in a database somewhere.

Let’s say you want data on all of the people in the US. There are 360 million people in the US—where does that data reside? Probably in a database somewhere.

Where does your predictive model reside?

What usually happens is that people will take all of their data out of database so they can run their equations with their model. Then they’ll have to import the results back into the database to make those predictions. And that process takes hours and hours and days and days, thus reducing the efficacy of the models you’ve built.

However, growing your equations from inside the database has significant advantages. Running the equations through the kernel of the database takes a few seconds, versus the hours it would take to export your data. Then, the database can do all of your math too and build it inside the database. This means one world for the data scientist and the database administrator.

By keeping your data within your database and Hadoop or object storage, you can build models and score within the database, and use R packages with data-parallel invocations. This allows you to eliminate data duplications and separate analytical servers (by not moving data) and allows you to to score models, embed data prep, build models, and prepare data in just hours.

4. Assemble the Right Data

As James Taylor with Neil Raden wrote in Smart Enough Systems, cataloging everything you have and deciding what data is important is the wrong way to go about things. The right way is to work backward from the solution, define the problem explicitly, and map out the data needed to populate the investigation and models.

And then, it’s time for some collaboration with other teams.

Machine Learning Collaboration Teams

Here’s where you can potentially start to get bogged down. So we will refer to point number 1, which says, “Don’t forget to actually get started.” At the same time, assembling the right data is very important to your success.

For you to figure out the right data to use to populate your investigation and models, you will want to talk to people in the three major areas of business domain, information technology, and data analysts.

Business domain—these are the people who know the business.

  • Marketing and sales
  • Customer service
  • Operations

Information technology—the people who have access to data.

  • Database administrators

Data Analysts—people who know the business.

  • Statisticians
  • Data miners
  • Data scientists

You need the active participation. Without it, you’ll get comments like:

  • These leads are no good
  • That data is old
  • This model isn’t accurate enough
  • Why didn’t you use this data?

You’ve heard it all before.

5. Create New Derived Variables

You may think, I have all this data already at my fingertips. What more do I need?

But creating new derived variables can help you gain much more insightful information. For example, you might be trying to predict the amount of newspapers and magazines sold the next day. Here’s the information you already have:

  • Brick-and-mortar store or kiosk
  • Sell lottery tickets?
  • Amount of the current lottery prize

Sure, you can make a guess based off that information. But if you’re able to first compare the amount of the current lottery prize versus the typical prize amounts, and then compare that derived variable against the variables you already have, you’ll have a much more accurate answer.

6. Consider the Issues and Test Before Launch

Ideally, you should be able to A/B test with two or more models when you start out. Not only will you know how you’re doing it right, but you’ll also be able to feel more confident knowing that you’re doing it right.

But going further than thorough testing, you should also have a plan in place for when things go wrong. For example, your metrics start dropping. There are several things that will go into this. You’ll need an alert of some sort to ensure that this can be looked into ASAP. And when a VP comes into your office wanting to know what happened, you’re going to have to explain what happened to someone who likely doesn’t have an engineering background.

Then of course, there are the issues you need to plan for before launch. Complying with regulations is one of them. For example, let’s say you’re applying for an auto loan and are denied credit. Under the new regulations of GDPR, you have the right to know why. Of course, one of the problems with machine learning is that it can seem like a black box and even the engineers/data scientists can’t say why certain decisions have been made. However, certain companies will help you by ensuring your algorithms will give a prediction detail.

7. Deploy and Automate Enterprise-Wide

Once you deploy, it’s best to go beyond the data analyst or data scientist.

What we mean by that is, always, always think about how you can distribute predictions and actionable insights throughout the enterprise. It’s where the data is and when it’s available that makes it valuable; not the fact that it exists. You don’t want to be the one sitting in the ivory tower, occasionally sprinkling insights. You want to be everywhere, with everyone asking for more insights—in short, you want to make sure you’re indispensable and extremely valuable.

Given that we all only have so much time, it’s easiest if you can automate this. Create dashboards. Incorporate these insights into enterprise applications. See if you can become a part of customer touch points, like an ATM recognizing that a customer regularly withdraws $100 every Friday night and likes $500 after every payday.

Conclusion

Here are the core ingredients of good machine learning. You need good data, or you’re nowhere. You need to put it somewhere like a database or object storage. You need deep knowledge of the data and what to do with it, whether it’s creating new derived variables or the right algorithms to make use of them. Then you need to actually put them to work and get great insights and spread them across the information.

The hardest part of this is launching your machine learning project. We hope that by creating this article, we’ve helped you out with the steps to success. If you have any other questions or you’d like to see our machine learning software, feel free to contact us.

You can also refer back to some of the articles we’ve created on machine learning best practices and challenges concerning that. Or, download your free ebook, “Demystifying Machine Learning.”

 

How Blockchain Will Disrupt the Insurance Industry

The insurance industry relies heavily on the notion of trust among transacting parties. For example, when you go to buy car insurance you get asked for things like your zip code, name, age, daily mileage, and make & model of your car. Other than, maybe, the make & model of your car you can pretty much falsify other information about yourself for a better insurance quote. Underwriters trust that you are providing the correct information, which is one of the many risks in the underwriting business.

Enterprise blockchain platforms such as one from Oracle essentially enables trust-as-a-service in such interactions. Participants (insurer and insured) need to come together to do business, but they do not necessarily trust each other. Blockchain provides a scalable mechanism to securely and easily enable trust in such scenarios. There are 4 key properties of Blockchain that enable trust-as-a-service:

  1. Transparency of digital events and transactions it manages,
  2. Immutability of records stored on the blockchain. through append-only time-stamped and hashed records,
  3. Security and assurance that records stored on blockchain aren’t compromised through built-in consensus and encryption mechanisms,
  4. Privacy through cryptography

Blockchain can be a good solution for a number of insurance use cases such as:

  • Reducing frauds in underwriting and claims by validating data from customers and suppliers in the value chain
  • Reducing claims by offering tokenized incentives to promote safer driving behavior by capturing data from insured entities like motor vehicles
  • Enabling pay-per-mile billing for insurance by keeping verifiable records of miles traveled
  • And, in the not so distant future, using blockchain to determine liability in case of an accident between two autonomous vehicles by using blockchain to manage timestamped immutable records of decisions made by deep-learning models from both autonomous vehicles right before the accident.

Besides these use cases, blockchain has potential to eliminate intermediaries, improve transparency of records, eliminate manual paperwork, and error-prone processes, which together can deliver orders of magnitude improvement in operational efficiency for businesses. Of course, there are other types of insurance such as healthcare, reinsurance, catastrophic events insurance, property and casualty insurance, which would have some unique flavor of use cases but they would similarly benefit from blockchain to reduce risk and improve business efficiency.

There is no question that blockchain can, potentially, be a disruptive force in the insurance industry. It would have to overcome legal and regulatory barriers before we see mass adoption of blockchain among the industry participants. 

If you are working on an interesting project related to the use of Blockchain for insurance industry feel free to get in touch by leaving a comment or contact us through social media or Oracle sales rep. We’d be glad to help you connect with our subject matter experts and with your industry peers who may be working on similar use cases with Oracle. For more information on Oracle Blockchain, please visit Oracle Blockchain home pages here, and here

 

 

Intro to Use Case Types

Analyzing or designing the various features and functions of a software system can be daunting, especially when there are multiple actors and other interfacing systems involved. Thankfully, analysts can turn to use cases to make this process much easier. At the very minimum, an effective use case should: define how stakeholders interact with a system define how a system […]

CX For Gov IT: Driving Federal IT Performance Using CX Principles

The reality is that most Federal IT leaders are under increased scrutiny to deliver increased value and productivity from technology investments at the lowest cost. Simply performing a top-to-bottom technology rationalization effort will no longer make…

A Pattern for Sizing ArchiMate Diagrams

In my recent blog series, I highlighted the importance of communication for strategic transformations. This affects several functions and various roles in your organization by asking different questions, such as:
* What capabilities do I need to suppo…

Steering towards success

Too often enterprises (businesses, governments, service organisations, …) constrain change to only being responsive to outside influences. Rather than being proactive and having control over their future they are reactive, consequently abrogating their control and the benefits it brings. Ideally … Continue reading

All Brands Need A Full-Fledged Digital Commerce Strategy

The internet has enabled brands to sell more of their products to an ever-expanding universe of shoppers, but digital distribution has also added complexity to the relationship between brands and retailers and other channel partners. All brands need to…