The Five Predictive Learning Models

By Ravi Evani | Filed under Machine Learning | 12 Comments

Data Science and Learning Models

Data science and learning models are rapidly creating a gateway to a different reality. Today, you can get 92% accuracy with voice recognition. Learning models can recognize images, which means they can look for sunrises and find you pictures that contain them. A learning model can watch videos and find specific words spoken within them. Learning models within Siri, Alexa and Google Assistant can have natural conversations with you. They can translate over 100 different languages for you. A learning model can play you songs you've never heard before, but ones you will really like. A learning model on your watch can recognize what exercise you are doing. A learning model can anticipate what time you need to wake up to minimize traffic on your route to work that morning, and a learning model can recommend your perfect match. All of these learning models are algorithms. Not the algorithms that most of us grew up writing, but algorithms that learn and predict. The algorithms of data science and machine learning.

At its core, data science is about making predictions using data. And these predictions arise in a variety of contexts. For example, predicting whether a piece of content is relevant to a person is a different context from predicting how much a driverless car should slow down on a bumpy road. Because the success of prediction depends on knowing as much as possible, in data science we often deal with large amounts of data, popularly known as Big Data. Today, predictions using Big Data have become so successful that many subfields of artificial intelligence leverage these techniques. AI was originally about writing rules that mimicked the human brain, an approach that ran into the limits of combinatorial explosion. The reason AI has seen a resurgence is that machine learning models with big data are creating the kinds of outcomes originally envisioned for AI.

Data science is an emerging, interdisciplinary field, so the term can seem ambiguous and change based on the context in which it's applied. There are many names linked with "data science", such as AI, machine learning, pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, adaptive systems, self-organizing systems, and so on. All these terms lead to the same thing: predictions using data. Going forward, in this article I will use the term learning models as the vehicle used to make predictions in all these areas.

Our society is changing, one learning model at a time. Learning models are remaking science, technology, business, politics and even war. Satellites, DNA sequencers, and particle accelerators explore nature in deep detail, and learning models turn the deluge of data into new scientific knowledge. Driverless vehicles pilot themselves across land, sea and air. The stock market is no longer a human endeavor; it's mostly learning models that taught themselves to trade stocks with other learning models. As a result, the floor of the stock exchange isn't filled with traders doing their day jobs anymore; it's largely a TV set.

With such a large impact, there is a need for all technologists to understand the concepts behind learning models: how they work and how they are built. However, the thought of getting into the details of learning models can seem a little daunting. At least it was to me when I first started this journey. But just like any other technology paradigm, this field is made up of a few building blocks. Understanding those building blocks can help us wrap our heads around the basics, and I have tried to pull them together from three excellent books:

  • The Master Algorithm, by Pedro Domingos. Dr. Pedro Domingos is a co-founder of the International Machine Learning Society and the author or co-author of over 200 technical publications in machine learning, data mining, and other areas. He is the winner of the SIGKDD Innovation Award, the highest honor in data science.
  • Data Science for Business, by Foster Provost. Professor Provost recently retired as Editor-in-Chief of the journal Machine Learning after six years. He is a member of the editorial boards of the Journal of Machine Learning Research (JMLR) and the journal Data Mining and Knowledge Discovery. He is also a co-founder of the International Machine Learning Society.
  • Python Machine Learning, by Sebastian Raschka. Sebastian Raschka is a PhD student at Michigan State University, where he develops new computational methods in the field of computational biology. He has been ranked as the most influential data scientist on GitHub by Analytics Vidhya.

My goal is for you to gain a general understanding of learning models, and with it the core of data science, in the next 60 minutes. Hang tight and keep reading!

"Most of the knowledge in the world in the future is going to be extracted by machines and will reside in machines." (Yann LeCun, Director of AI Research, Facebook)

Where does knowledge come from? Until recently it came from three sources:

  • Evolution: the knowledge that's encoded in your DNA
  • Experience: the knowledge that's encoded in your neurons
  • Culture: the knowledge you acquire by talking with other people, reading books, and from the internet

Everything that we do comes from these three sources of knowledge.

Only recently has a fourth source of knowledge appeared on the planet: computers. More and more knowledge now comes from computers, discovered by computers.

Notice also that each of these forms of knowledge discovery is orders of magnitude faster than the previous one and discovers orders of magnitude more knowledge. The same is true of computers: they can discover knowledge orders of magnitude faster than the sources that came before them and co-exist with them, and orders of magnitude more knowledge in the same amount of time. So how do computers discover new knowledge?

Pedro Domingos has classified five ways in which computers discover knowledge. We will cover each of these and their applications in more detail:

  1. The first way computers discover knowledge is by filling gaps in existing knowledge. Pretty much the same way that scientists work, right? You make observations, you hypothesize theories to explain them, and then you see where they fall short. And then you adapt them, or throw them away and try new ones, and so on. This is based on logic and is the closest to the algorithms that many of us grew up writing. The core algorithms used in this form of knowledge discovery are related to inverse deduction.
  2. The second way computers discover knowledge is by emulating the brain, the greatest learning machine on earth. Through the understanding of the brain from neuroscience and reverse engineering it. The core algorithms used in this form of knowledge discovery are related to backpropagation.
  3. The third one is by simulating evolution. Evolution, by some standards, is actually an even greater learning algorithm than your brain is, because, first of all, it made your brain. It also made your body. And it also made every other life form on Earth. So this is based on evolutionary biology. The core algorithms used in this form of knowledge discovery are related to genetic programming.
  4. The fourth one is based on the fact that all knowledge is uncertain. When something is induced from data, you're never quite sure about it. So the way to learn is to quantify that uncertainty using probability. And then as you see more evidence, the probability of different hypotheses evolves. The core algorithms used in this form of knowledge discovery are related to the Bayes Theorem.
  5. The last approach is through reasoning by analogy. Humans do this all the time. You're faced with a new situation, you try to find a matching situation in your experience, and then you transfer the solution from the situation you already know to the new situation you're faced with. The core algorithms used in this form of knowledge discovery are related to support vector machines.
#1 Learning Models based on Logic

Deduction means going from general rules to specific facts. The opposite, hypothesizing general rules from specific facts, is called induction, and it is the essence of learning models based on logic.

Let's take an example with six data points:

  1. Star Wars is a science fiction movie
  2. You like Star Wars
  3. Star Trek is a science fiction movie
  4. You like Star Trek
  5. Avatar is a science fiction movie
  6. You like Avatar

The generic hypothesis derived from these facts (i.e. induction) is: You like science fiction movies.

Once rules like this are derived from data, they can be combined in all sorts of different ways to answer questions that have never been thought of. Now, these examples are in English. To induce rules from structured and unstructured data, computers use first-order logic (predicate calculus) to represent facts and rules. The gaps in knowledge are replaced by general principles, which are then validated with data and readjusted. As an example, suppose there were these new data points:

  1. Aliens is a science fiction movie
  2. You don’t like Aliens

Here the "gap" is something about the movie Aliens (maybe because it's also in the horror genre). So the generic hypothesis You like science fiction movies is refined. This continues until as many gaps as possible are filled based on new instances of data, arriving at the most accurate hypothesis possible for predicting a particular outcome.

One of the amazing examples of inverse deduction is a robot scientist, a completely automated biologist. It starts out with basic knowledge of molecular biology: DNA, proteins, RNA, and all of that. It formulates hypotheses using inverse deduction, designs experiments to test those hypotheses using things like DNA sequences, and physically carries out the experiments with no human help. Then, given the results, it refines the hypotheses or comes up with new ones, and so on. There's only one of these robots in the world today; its name is Eve, and in 2014 Eve discovered a malaria drug. The thing that's amazing about this is that once you've made one robot scientist like this, there's nothing stopping you from making a million. And if we have a million scientists working on a problem that only a few were working on before, it could really speed up the progress of science.

Some algorithms based on this concept are decision trees, random forests, boosted trees, rotation forests, etc.
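To make the idea concrete, here is a minimal sketch of inducing a rule from the movie examples above with a decision tree, assuming scikit-learn is available; the feature encoding (science fiction, horror) and the extra "drama" row are hypothetical additions for illustration.

```python
# Minimal decision-tree sketch of inducing "you like sci-fi, unless it's also horror".
# Assumes scikit-learn is installed; the features and the extra "drama" row are made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features per movie: [is_science_fiction, is_horror]
X = [
    [1, 0],  # Star Wars
    [1, 0],  # Star Trek
    [1, 0],  # Avatar
    [1, 1],  # Aliens (science fiction, but also horror)
    [0, 0],  # some drama you didn't care for
]
y = [1, 1, 1, 0, 0]  # 1 = you liked it, 0 = you didn't

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["science_fiction", "horror"]))
```

The printed rules roughly read: if science_fiction and not horror, predict "like", which is the refined hypothesis after the Aliens counter-example.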

While learning models based on logic work well to solve some problems, in many cases they might be too abstract and rigid. If you take an example such as image recognition, there could be billions of variations of the same (let’s say a cat’s) image and a rigid method that uses induction based on logic will likely fail in those predictions.

Refer to this post for a deeper dive into logic-based learning models.

#2 Learning Models based on reverse engineering the brain

Learning models such as neural networks and deep learning rely on the paradigm of reverse engineering the brain. So how does this work?

The brain is made of neurons, and in order to reverse engineer it, we're going to build a mathematical model of how a single neuron works. We're going to make it as simple as we can, provided it's enough to learn and to do the inferences we need it to do. Then we're going to put these models of neurons together into big networks, and we're going to train those networks. At the end of the day, what we have is, in some sense, a miniature brain that's much simpler than the real one, but hopefully with some of the same properties.

Now, a neuron is a very interesting kind of cell. It's a cell that actually looks like a tree (see figure A). There's the cell body, the trunk of the tree is called the axon, and the branches are called dendrites. But where neurons get very different from trees is that the branches of one neuron actually connect back with the roots of many others. And that's how you get a big network of neurons.

Where the axon of one neuron joins the dendrites of another, that's called a synapse (see figure C). And to the best of our knowledge, everything humans learn is encoded in the strengths of the synapses between our neurons. Neurons communicate via electric discharges down their axons, literally an electric discharge called an action potential. If the total charge coming into a neuron through its various synapses exceeds a certain threshold, then that neuron itself fires, and of course it then sends current to the neurons downstream. The synaptic process itself involves chemistry and whatnot, but those details are not important for us. So learning basically happens when a neuron helps to make another neuron fire, and then the strength of the connection goes up. This is how all our knowledge is encoded: in how strong the synapses are. If two neurons fire together a lot, the synapse between them becomes stronger and it becomes easier for the first neuron to fire the second one.

Now what we have to do is, first of all, turn this into a model. Our model neuron is going to compute a weighted combination of its inputs. Let's suppose, for example, that figure B represents a neuron in a retina, and that each of the inputs x1, x2, ... is a pixel. Each input gets multiplied by a weight that corresponds to the strength of the synapse to the neuron. If that weighted sum exceeds a threshold, computed by f(x), we get 1 as the output; otherwise we get 0. So if, for example, this neuron is trying to recognize a cat and the image really is a cat, then hopefully the weighted sum will be high enough that the neuron fires and says: yes, this is a cat.
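Here is a minimal sketch of that single-neuron model in Python; the pixel values, weights and threshold are made-up numbers chosen only to illustrate the weighted-sum-and-threshold idea.

```python
import numpy as np

def neuron(x, w, threshold=0.5):
    """A single model neuron: weighted sum of the inputs, then a hard
    threshold standing in for the activation function f(x)."""
    weighted_sum = np.dot(w, x)
    return 1 if weighted_sum > threshold else 0

# Hypothetical "pixel" inputs and synaptic weights
pixels = np.array([0.9, 0.2, 0.7])
weights = np.array([0.6, 0.1, 0.4])
print(neuron(pixels, weights))  # prints 1: the sum 0.84 exceeds the threshold, so the neuron fires
```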

Now, a single neuron cannot perform a complex task like recognizing a cat; what we need is a network of neurons. So how do you train a whole network of these neurons (figure D)? The inputs go to one set of neurons, each of which computes a function; those outputs go to another layer, and so on through many, many layers, until finally you get the output.

But what happens if there was an error? A neuron should have been firing, but wasn’t. The key question is, what do I change in that whole big, messy network to try to make it give the right answer tomorrow?
There is no obvious answer to this question. Think of one neuron somewhere in the middle of the network: how is it responsible for the error at the output? The error at the output could have come from an infinitude of different places. This is called the credit assignment problem, and it is the problem that backpropagation solves.

How does backpropagation work? Well, let's think of the difference between our actual output and our desired output. Let's say the picture was of a cat and the neural net identified it as a dog. We can call the numeric difference between a cat (1) and a dog (0) the delta; this is the error. The output should have been 1 (cat), but let's say it was 0.2, so it needs to go up. What can we tweak in the weights to make the value go up from 0.2 to 1? At that last layer, the neurons with the highest weights are the ones most responsible for the result. If a neuron feeding the output is firing and the output should be higher, the weight on that connection needs to go up; if it's firing but pushing the output in the wrong direction, its weight needs to go down. So what we do is compute the derivative of the error with respect to the weights in the last layer.
So now we have an error signal for the neurons at this layer, and we can keep doing the same thing all the way back to the input. This is why it's called backpropagation: we are propagating the errors backwards and then updating the weights, changing them in order to make that error as small as possible. This is the backpropagation algorithm. It's what's at the heart of deep learning, and these days it is used for just about everything on Earth.
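Below is a minimal backpropagation sketch for a tiny two-layer network written with plain NumPy; the toy dataset, network size and training loop are all illustrative assumptions, not a production recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset (hypothetical): 4 examples, 3 input features, binary label
X = np.array([[0., 0., 1.],
              [0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 1))   # hidden -> output weights

for _ in range(10_000):
    # Forward pass: each layer computes a weighted sum and "fires" via the sigmoid
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)

    # Backward pass: compute the error (delta) at the output,
    # then propagate it back to get an error signal for the hidden layer
    output_delta = (y - output) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)

    # Update the weights to shrink the error
    W2 += hidden.T @ output_delta
    W1 += X.T @ hidden_delta

print(np.round(output, 2))  # should end up close to the targets 0, 1, 1, 0
```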

Very early on, people used backpropagation to do things like predict the stock market. These days it's used for search, ad placement, video recognition, speech recognition, and simultaneous translation, and new applications of neural nets are being introduced every day.

Some algorithms based on this concept are static neural nets, dynamic neural nets, recurrent neural nets, memory networks, deep belief networks, etc.

While neural nets can make many types of predictions, what about things like commonsense reasoning? That involves combining pieces of information that have never been seen together. If there's a pothole, should a car drive over it? If a self-driving car were trained only with backpropagation, it would only be able to respond to situations it had encountered before. If it was trained on a highway, it would run over Manhattan potholes all the time!

#3 Learning Models based on evolution

So where did that brain come from? That brain was produced by evolution. So maybe a powerful learning model can be created by simulating evolution. These are called genetic algorithms. How do these work?

With evolution you have a population of individuals, each of which is described by its genome. Each of these individuals gets to go out in the world and be evaluated at the task it's supposed to be doing. The individuals that do better have a higher fitness and therefore a higher chance of being parents of the next generation. You take two very fit parents and cross over their genomes, so the child genome is partly the genome of one parent and partly the genome of the other. You also have random mutation: some bits just get randomly flipped because of copying errors and the like. Together these create a new population.
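Here is a minimal genetic-algorithm sketch of that loop (evaluation, selection, crossover, mutation); the all-ones target genome, population size and mutation rate are arbitrary choices for illustration.

```python
import random

TARGET = [1] * 20                              # hypothetical goal: an all-ones genome
POP_SIZE, GENERATIONS, MUTATION_RATE = 30, 50, 0.02

def fitness(genome):
    """How well an individual performs its task: here, bits matching the target."""
    return sum(g == t for g, t in zip(genome, TARGET))

def crossover(mom, dad):
    """The child genome is partly one parent's and partly the other's."""
    cut = random.randrange(1, len(mom))
    return mom[:cut] + dad[cut:]

def mutate(genome):
    """Random copying errors flip an occasional bit."""
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)  # fitter individuals...
    parents = population[:10]                   # ...are more likely to become parents
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POP_SIZE)]

print(max(fitness(g) for g in population))      # best fitness after evolving
```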

We could actually start out with a population of simple “components” that is essentially random. And when these components randomly try to work with other components to produce a particular outcome, they could have various degrees of success or failure. After some number of generations of this, we actually have things that are doing a lot of non-trivial functions. For example, we can evolve circuits that start completely random and end up with radio receivers and amps and things like that in just this way. And they often work better than the ones that are designed by humans. In fact there are a whole bunch of patents where the designs were actually invented by genetic algorithms and not by humans.

So let's try to apply this to computer programs. A program is really a tree of subroutine calls, all the way down to simple operations like additions, multiplications, ANDs and ORs. So let's represent programs as trees.
Then, in order to cross over two programs (figure a), we randomly pick a node in each of the trees and swap the sub-trees at those nodes. We can then evaluate the "fitness" of the new program at producing a result. Techniques like this can be applied repeatedly, re-evaluating fitness, keeping the combinations that are most fit and discarding the others, and further mutating or crossing over, to ultimately arrive at the program best suited to produce a result.
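As a sketch of this genetic-programming idea, here is a toy crossover of two program trees represented as nested lists; the two arithmetic programs and the swap-at-a-root-branch rule are simplifying assumptions.

```python
import copy
import random

# Two toy programs as trees: [operator, left operand, right operand]
prog_a = ["+", ["*", "x", "x"], 3]    # computes x*x + 3
prog_b = ["*", ["+", "x", 1], "x"]    # computes (x + 1) * x

def crossover(parent_a, parent_b):
    """Graft a randomly chosen sub-tree of parent_b onto a copy of parent_a."""
    child = copy.deepcopy(parent_a)
    i = random.randrange(1, len(child))      # which branch of the child to replace
    j = random.randrange(1, len(parent_b))   # which branch of the other parent to take
    child[i] = copy.deepcopy(parent_b[j])
    return child

def evaluate(tree, x):
    """Run a program tree on input x (how we would measure its fitness)."""
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    a, b = evaluate(left, x), evaluate(right, x)
    return a + b if op == "+" else a * b

child = crossover(prog_a, prog_b)
print(child, "->", evaluate(child, x=2))
```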

In fact, these days, what the genetic folks are having a lot of fun with is something that's exciting and scary at the same time. They're not just doing this with software anymore, as a simulation inside the computer; they're doing it out in the physical, real world with robots. You literally start with robots that are random piles of parts (in software). Once those robots are good enough, they actually get printed by a 3-D printer, and then they start crawling, and walking, and doing things in the real world: seeing how fast they can crawl, trying to recover from injury, and so on. Then, in each generation, the fittest robots get to program the 3-D printer to produce the next generation of robots!

While learning models based on evolution can be very powerful, they have an inherent problem: they're too slow. If you think about biological evolution, it took billions of years, but learning algorithms should be able to learn in minutes or seconds. Many of the products of evolution also have obvious faults. For example, the brain has many constraints that computers don't, like limited short-term memory, and there is no reason to stay within those constraints.

#4 Learning Models based on uncertainty

The basic idea here is that everything that we learn is uncertain. So what we have to do is compute the probability of each one of our hypotheses and then update it as new evidence comes in. And the way we do that is with Bayes’ theorem. So how does Bayes’ theorem work?

The idea of Bayes' theorem is this. Suppose you have all your hypotheses; you define your space of hypotheses in some way: it could be a neural net, a decision tree, or any other mechanism. The first thing you have is the prior probability of each hypothesis: how much you believe in that hypothesis before you've even seen any data. Then, as the evidence comes in, you update the probability of each hypothesis. A hypothesis that is consistent with the data sees its probability go up; a hypothesis that is inconsistent with the data sees its probability go down.
The consistency of the hypothesis with the data is measured by what's called the likelihood function, which is the probability of seeing the data if the hypothesis is true. The theorem says that if your hypothesis makes what you're seeing likely, then conversely, what you're seeing makes your hypothesis likely; Bayesians capture that in the likelihood. The product of the likelihood and the prior gives the posterior, which is how much you believe the hypothesis after you've seen the evidence. So as you see more evidence, the probabilities evolve, and hopefully, at the end of the day, one hypothesis comes out as clearly better than the others. There's also the marginal, which is just something you divide by to make sure the probabilities add up to 1.
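In code, the update is one line; here is a tiny sketch with made-up numbers just to show how the prior, likelihood and marginal combine.

```python
# Bayes' theorem on made-up numbers: posterior = likelihood * prior / marginal
prior = 0.01        # belief in the hypothesis before any evidence
likelihood = 0.9    # P(evidence | hypothesis)
marginal = 0.05     # P(evidence) overall, the normalizer

posterior = likelihood * prior / marginal
print(posterior)    # 0.18: belief in the hypothesis after seeing the evidence
```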

A lot of great things have been done with Bayesian learning. For example, self-driving cars have Bayesian learning in their brains.

One application of Bayesian learning is medical diagnosis, and the way it works there is the following. The hypotheses, before any evidence is seen, are that the disease is the flu or that it is not the flu. The prior probability is your prior probability of a disease being the flu in flu season: 90%, 99%, or anything else, take your pick. The evidence is the actual symptoms. For example, if one symptom is a headache, that probably makes it more likely to be the flu. If another symptom is weakness, that makes it even more likely to be the flu. On the other hand, if there is no fever, that actually makes it less likely to be the flu. What the naive Bayes classifier does is incorporate that evidence and, at the end of the day, compute a probability that the disease is the flu or not, taking all of it into account. Based on that probability, you can decide whether or not to get further medical tests done.
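Here is a hand-rolled naive Bayes sketch of that flu example; all the priors and symptom likelihoods are illustrative assumptions, not medical data.

```python
# Naive Bayes on the flu example; every probability below is made up for illustration.
prior_flu = 0.9                  # P(flu) during flu season
prior_not_flu = 1 - prior_flu

# (P(symptom | flu), P(symptom | not flu)), assumed independent given the diagnosis
likelihoods = {
    "headache": (0.8, 0.3),
    "weakness": (0.7, 0.2),
    "no fever": (0.2, 0.6),
}
observed = ["headache", "weakness", "no fever"]

p_flu, p_not = prior_flu, prior_not_flu
for symptom in observed:
    l_flu, l_not = likelihoods[symptom]
    p_flu *= l_flu               # multiply in each piece of evidence
    p_not *= l_not

posterior_flu = p_flu / (p_flu + p_not)   # divide by the marginal so it sums to 1
print(f"P(flu | symptoms) = {posterior_flu:.2f}")   # about 0.97 with these numbers
```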

Some algorithms based on this concept are Naive Bayes, Gaussian Bayes, Independence Bayes, etc.

With all the great applications of Bayesian learning, it's not without challenges. For one, because it's all based on probability, Bayesians are never 100% certain. Also, Bayesian learning works on a single table of data, where each column represents a variable; if we have more than one table, Bayesian learning is stuck. As an example, Bayesian learning cannot directly combine the variables of a car with the variables of the environment, both of which need to be learned to operate a self-driving car in certain situations. So it needs to be augmented with other machine learning techniques to solve such problems.

#5 Learning Models based on Reasoning by Analogy

The basic idea here is that everything that we do, everything that we learn is reasoning by analogy. It’s looking at similarities between the new situation that we need to make a decision in and the situations that we’re already familiar with.

So let's see how this works with a small puzzle.
Here's a map of two of the seven kingdoms of Westeros from the show Game of Thrones: the Westerlands, whose capital is Casterly Rock, and The Reach, whose capital is Highgarden. We don't know where the frontier between the two kingdoms is; we just know where the major cities are. Let's say the major cities of the Westerlands are represented by a plus (+) and the major cities of The Reach are represented by a minus (-). If this is all we know, how can we determine the border between the two kingdoms?

Now, the nearest neighbor algorithm has a very simple answer to this question. It says: I'm going to assume that a point on the map is in the Westerlands if it's closer to a city marked with + than to a city marked with -. The effect this has is to divide the map into the neighborhood of each city, where the neighborhood of a city is the set of points that are closer to it than to any other city. The Westerlands is then just the union of the neighborhoods of the + cities, and as a result you get a jagged, straight-line frontier.
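A minimal nearest-neighbor sketch of that idea is below; the city coordinates are invented purely for illustration.

```python
import math

# Made-up map coordinates for a few cities and their kingdoms
cities = {
    (1.0, 5.0): "Westerlands",   # e.g. Casterly Rock (+)
    (2.0, 4.0): "Westerlands",   # e.g. Lannisport (+)
    (4.0, 1.0): "The Reach",     # e.g. Highgarden (-)
    (5.0, 2.0): "The Reach",     # another Reach city (-)
}

def classify(point):
    """Assign a point on the map to the kingdom of its nearest known city."""
    nearest_city = min(cities, key=lambda city: math.dist(city, point))
    return cities[nearest_city]

print(classify((1.5, 4.5)))   # Westerlands
print(classify((4.5, 1.5)))   # The Reach
```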

As you can see, the nearest neighbor algorithm is very simple, yet with enough data points it can form very complicated frontiers, converging toward the best possible hypothesis.

If you look at this map closely, you could actually throw away some of the cities, like Lannisport or the Shield Islands, and the frontier wouldn't change at all. The only cities you need to keep are the ones defining the frontier; these are called support vectors, because they support the frontier where it is. So often you can throw away the great majority of your data and it doesn't change anything.

You now have a method of classification: if you identify a new city in Game of Thrones, it can be slotted into either the Westerlands or The Reach. The same mechanism can be applied to a completely different application: recommender systems. Let's say I want to figure out what movies to recommend to you. We can do that with collaborative filtering: find people who have similar tastes to yours (as in, they rated a movie 5 stars and you rated the same movie 5 stars). Now, if they give 5 stars to a movie you haven't seen, I'm going to hypothesize that, by analogy, you are "within the same frontier" and will also like that movie, so I will recommend it to you.
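Here is a tiny collaborative-filtering sketch in the same spirit: find the rater most similar to you and recommend what they liked; the names, movies and ratings are all hypothetical.

```python
# Toy collaborative filtering: all raters, movies and ratings are made up.
ratings = {
    "you":   {"Star Wars": 5, "Star Trek": 5, "Avatar": 4},
    "alice": {"Star Wars": 5, "Star Trek": 5, "Avatar": 4, "Arrival": 5},
    "bob":   {"Star Wars": 1, "Star Trek": 2, "Arrival": 2},
}

def similarity(a, b):
    """Higher (closer to 0) when two people rate their shared movies alike."""
    shared = set(a) & set(b)
    if not shared:
        return float("-inf")
    return -sum(abs(a[m] - b[m]) for m in shared) / len(shared)

you = ratings["you"]
most_similar = max((name for name in ratings if name != "you"),
                   key=lambda name: similarity(you, ratings[name]))
recommendations = [movie for movie, stars in ratings[most_similar].items()
                   if movie not in you and stars >= 4]
print(most_similar, recommendations)   # alice ['Arrival']
```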

Some algorithms based on this concept are linear regression, logistic regression, k-means clustering, and support vector machines.

Conclusion

These are the five main types of learning models. There are hundreds of learning models and data mining techniques, and most of them fall into one of these types, or a combination of these types.
Is there a problem that these learning models can solve in your area of interest? If so, which of these models do you think can help solve that problem? Please post in the comments below.

12 Comments on “The Five Predictive Learning Models”

  1. Hi Cohorts!
    Please post your thoughts on whether there is a problem in your area of interest that machine learning models can solve. If so, which of these learning model types do you think can help solve that problem?

  2. Ravi,
     
    You did a great job outlining the basics of this topic and how to approach it through analogies that are easy to understand. Coming to your question about a problem in an area of interest that can be solved by machine learning and data science, I have some thoughts.

    My MTP topic (area of interest) is the Internet of Things, and there are so many obvious use cases about how much data we can collect from these devices to process into insights about the world around us. However, I would like to explore one very unique idea in the space of IoT as it relates to data science and machine learning.

    The second method of learning models based on the brain has inspired this answer. Every one of us humans has some kind of inner monologue. We think and talk in our brains constantly. In many ways, what we think is what we address in our monologue. The brain is a fascinating creation. Did you know that when you engage in an inner monologue (talking to yourself) your brain actually triggers subtle movements in your larynx? It’s the same as if you were speaking aloud, but more subtle. Imagine an IoT device that can detect the vibrations in your neck (a necklace wearable and/or a highly sensitive camera or other vision type of sensor). Modern science can currently detect these vibrations and translate them into the monologue with a very low success rate.

    Reference article here: https://www.newscientist.com/article/mg18124401-300-not-mind-reading-but-it-comes-close/.

    In today's research-based testing, scientists are able to get about 5% to 25% accuracy on this detection. Other researchers have claimed slightly higher matching rates of 12% to 35%. While the trend looks to improve, there are many complications involved. A person's body weight and the prominence of the Adam's apple also contribute to how easily the vibrations are detected.

    Now, back to the subject of learning models. This entire problem of matching the vibrations in the larynx to words (of thought) is very similar to the logic model described in your article. It is a problem of data, data prediction, and unknown variables (i.e. body weight). It’s the perfect recipe for a learning model to tackle.

    If we could manufacture the detection devices at scale, and put humans in these devices (voluntarily, I hope!) then we would be able to learn from the model outcomes and improve the pattern recognition. Now imagine the power if we could get the match rate above 50%…or even 70%. We would not be able to hide from our thoughts. Criminal (or terrorist) interrogations would become so much easier. Marketers would “read your thoughts” and market products to you one-to-one. You would never be able to lie to your friend or spouse.

    In many ways it’s just like the natural language processing advancement we have seen in recent years. Using advanced camera and vision sensors, we would not even require the wearable IoT device and instead could have a network of sensors similar to the huge closed-circuit camera networks found in cities today. We would have a recorded log of consciousness to feed back through the learning model. We could learn so much more about what the population really thinks, or even use it to predict future events. The cycle could continue until we have a very different type of society and world.

    Well done on this article, and the ability to provoke thought. I enjoyed this read, as well as sharing my thoughts on my area of interest.

  3. @Ravi, this is a great summary of learning models. Very well summarized and very well written. Kudos for making a dry topic so interesting for the cohort.

    As far as my MTP topic goes, learning is front and center of the greatness of cloud computing. Infrastructure is just one of the advantages of the cloud, but there is no power if we cannot utilize that scale for learning and deductions.

    The other thing I am most interested in is Commerce. As Chris points out in his session, "found" is the new search that will drive Commerce, and that is heavily dependent on learning to work correctly.

    To that end, I feel that the #2 learning model will help drive the right search. You have to analyze all customer interaction points, and decisions need to be made based on each of those points adding up to find the right recommendations for the customers. It cannot be hypothesis based, nor can it be slow and evolution based. Analogy can help, but really, the biggest gain is using the multiple decisions a customer makes to "flash" the recommendation.

    @Dan – I like your idea too. Although, I am not sure I want people to know my inner monologues, and I am sure that criminal interrogation will make this voluntary, just like they do with DNA testing (-:

    Jokes apart, I think the biggest challenge with your idea is the data it will need, even more than the technology. Maybe I am an anomaly, but when I have my monologues, I would be hard pressed to remember them exactly a minute after I have them. So getting me to provide valid data for this to work will be the hardest part. It is a very interesting concept though.

  4. Hey Ravi,

    Excellent job. I prefer learning models based on evolution; it is the most natural approach for me. A trial-and-error approach has brought us as far as today, so it should work for machines as well. I just heard that in Frankfurt they are now testing a new approach for landing planes at the airport with less noise by giving the pilots guidance on when to use brakes, acceleration, etc. A machine could learn this, even depending on wind and weather, by an evolutionary approach. This might even bring up solutions and steps which a human brain would not think of in the first place.

    Thx
    Arne

  5. Hi Ravi,

    Great article. I really learned a lot. My MTP topic is about Performance Media and in there I implemented a prototype of an intelligent platform based on Amazon Alexa. I have been really amazed by how easy things were to program and how well Alexa is able to identify intents from sentences a user is speaking (even people that are non-native speakers).

    I am not sure what is behind Alexa, but I assume that there are different learning models in play. I definitely see brain emulation as one of the key approaches being used to do voice-to-text and interpret what people say. In addition to this, I believe that other learning models are also in use, such as Bayesian learning to resolve unclear speaking or different pronunciations.

    This is a very fascinating space and I am truly looking forward to learning more about it once the CMTOu completion madness is over.

    Regards,
    Olivier.

  6. Ravi,

    We have discussed in the past how our topics are so close to each other. Great post, and in this article you touched upon Pedro Domingos talking about the following:

    "The fourth one is based on the fact that all knowledge is uncertain. When something is induced from data, you're never quite sure about it. So the way to learn is to quantify that uncertainty using probability. And then as you see more evidence, the probability of different hypotheses evolves. The core algorithms used in this form of knowledge discovery are related to the Bayes Theorem."

    I have attempted to use probabilistic thinking using services but the core concept of my MTP Submission – https://cognitivemerchandiser.herokuapp.com/ is trying to use Reinforcement learning on top of Linguistic Inquiry in understanding how

    Reinforcement learning on its own is based on #2 in your article, which is what I am trying to practice as the end game of my MTP submission, which I plan to complete this year.

  7. I would like to incorporate machine learning into my PoC built for my MTP.

    There are a couple of areas that spring to mind:

    1. Detecting that a specific person is speaking.

    Thinking: I could use a number of inputs to detect whether my wife or I (or both) are speaking. A clustering algorithm or classifier might work, but I like the idea of an evolving algorithm that picks up on new data to tune the parameters of the model.

    I will look into this. Similarly;

    2. Sentence boundary detection. Start the model with a prior based on existing research, creating a function. Use NLP to test whether the sentence the prior identified is well formed, e.g. less than complete, complete, or more than one sentence. Feed that back into the model per person to train it and augment the prior. Could use Bayesian logic here or simple regression. Not sure…

    At a higher level, my goal is to draw conclusions about an individual's propensity to undertake an action in relation to a brand.

    This feels like a probabilistic view of the world and again may benefit from a Bayesian approach.

    I am going to have to get my hands dirty on this to really understand it; however, I've enjoyed spending a few hours reading your content, that of others, and watching a few Andrew Ng videos…

    I most enjoyed the description of the brain and how that is applied in neural nets!! I think that’s sinking in!!!

    Looking forward to getting going.

    Regards

    Chris

    1. Hey Chris,

      You could use Bayesian learning on top of any other model. So, for sentence boundary detection, it need not be an either/or for linear regression vs Bayesian. You could come up with a boundary "score" of sorts with your linear regression model and then use Naive Bayes to determine the posterior probability of how accurate that score is likely to be.

      On a side note, something as complicated as sentence boundary detection would likely need a combination of supervised, semi-supervised and unsupervised techniques. And you could potentially combine those results using ensemble learning, which means you could take the result of one model and feed it into another. Today, the most powerful learners combine models in different ways using ensemble learning techniques such as boosting, stacking or bagging.

      -Ravi

      1. That sounds fabulous Ravi… I will read that link and ping you for more guidance. I intend to update my code to try and do this in next few weeks.

  8. Hi Ravi,
    An excellent and well-written knowledge share; you have demonstrated deep insight into your topic and used interesting examples like Game of Thrones to keep us (at least some of us) engaged.

    In my MTP I focused on what brands can do with chatbots, and one issue I continue to see in this domain is the lack of capability to develop generative chatbots. The industry has more or less become one-sided and everyone is developing retrieval-based chatbots instead of generative ones. This is due to lack of support for the generative model, and I think machine learning techniques can bridge this gap. The most obvious predictive learning model is "#2 Learning Models based on reverse engineering the brain". The most suitable deep learning architecture I could find was sequence to sequence (https://arxiv.org/abs/1409.3215), and it's well suited for generating text. Do you have any examples you can share?

    By the way, I also have another (naughty!) idea for applying these techniques, and if you are up for it maybe we can discuss that one offline :-)

    – Karan

  9. Hi Ravi,

    This was a great post for taking the many areas of Data Science and boiling them down to the key tenets – a framework for attacking predictions, if you will. And I really like the real-world examples you provided to make the topic approachable.

    One area of interest that I have is in combining some work that @nuttyket has been doing with emotion detection via camera sensors with the work I’ve been doing in the area of integrating Programmatic Advertising with Digital Out of Home (DOOH) signage.

    I think that it would be interesting to create a taxonomy (or even just level-1 categories) of emotional states (happy, sad, upset, angry, elated, pensive, etc.) and make predictions of how well those emotions map to the likelihood that someone would be interested in a given consumer product or service, and then, via RTB/Programmatic, serve the appropriate ad (or it could be non-ad content as well), based upon the probability we assign to the likelihood of a consumer buying X product, given Y current emotion being exhibited on his/her face.

    For instance, for an in-store venue of a Pharmacy, if we had a camera (and software) that detects your emotion with a 90% accuracy, that kiosk can display ads for perhaps:
    a) Vitamins, if you seem worn down
    b) Lexapro (requires prescription) or melatonin or St. John’s Wort (supposed natural remedies) if you seem depressed
    c) Vitamin D, if you seem too pale (not enough sun has a high correlation with Vitamin D deficiency)

    Or if we have a similar kiosk in the middle of a shopping mall, we might target content/ads based upon variables other than your emotion, such as:
    a) “NEED HELP FINDING A STORE?”, if you are looking around like you don’t know where you’re going
    b) Foot Locker ad/way-finding, if you’re wearing athletic shoes and have a certain physique (possibly a boosting support vector)
    c) NOT targeting a restaurant ad to you, if you already seem “fat and happy”, as if you’ve already eaten (posture, walking pace, etc. might be helpful to analyze)

    For the first scenario I mentioned, based on emotion detection, we could start with the decision tree, but then strengthen it by adding some Bayesian modeling on top of it, so that our ad-to-emotion mappings are strengthened based upon experience. To do this, I think we would need the kiosk visitor to interact by perhaps scanning a QR Code on the kiosk (perhaps a 5% coupon when redeemed at the pharmacy register, by tracking purchases of the advertised product, etc.), because we’d want to be able to track our ad all the way through to a conversion/purchase in order to refine the model.

    What are your thoughts? Would this be a good use of logical and Bayesian models?

  10. Hi Ravi,

    Very well drafted post that not only details the different nuances of the learning models but also their applicability to the examples we tend to see and experience on a regular basis. Coming to the question about a problem in my area of interest that can be solved by machine learning and data science, I have some thoughts around learning models based on evolution (3).

    As new technologies are maturing, every business is becoming digital, which means every customer is a digital customer and with the growth in Intelligent interfaces, they will allow for making informed decisions “on the edge”

    Accenture launched the concept of Customer Genome (DNA of what every business should know about digital customers).

    The focus is to convert abundant information into insights on customer behavior to help understand and predict buying patterns, and to create new experiences within product discovery, time-based behavioral events and social connections. Their aim is to position the brand as an extension of customer identity and weave the business seamlessly into customer lifestyles.
    Customers, through a variety of devices, are increasingly using social channels and alternate community forums not only to connect, but to share insights and feedback on shopping experiences.

    According to me, enabling the decision to be made on the edge is very challenging because it needs to be fast, and it needs to factor in not only the relevant data but a combination of relevant and recent transaction information, which is the key. In order to achieve this refinement of data, creating the 360-degree view of the customer for targeting across multiple channels (web, mobile, programmatic ad space, etc.), a combination of learning models 3 (evolution), 4 (uncertainty, likelihood calculation) and 5 (reasoning by analogy) needs to be applied to achieve the right outcome.
