Machine learning and artificial intelligence (AI) are all the rage these days, but with all the buzzwords swirling around them, it's easy to lose sight of the difference between hype and reality. For example, just because an algorithm is used to crunch data doesn't mean the label "machine learning" or "artificial intelligence" applies.
Before we can even define AI or machine learning, though, I want to take a step back and define a concept that is at the core of both: the algorithm.
An algorithm is a set of rules to be followed when solving a problem. In machine learning, algorithms take in data and perform calculations to find an answer. Those calculations can be very simple or quite complex, but either way, the algorithm should deliver the correct answer as efficiently as possible. What good is an algorithm if it takes longer than a human would to analyze the data? What good is it if it provides incorrect information?
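To make the term concrete, here is a minimal sketch of an algorithm in the plain, non-learning sense: a fixed set of hand-written rules applied to data to produce an answer. The sales scenario and the numbers are purely illustrative.

```python
def average_daily_sales(daily_totals):
    """A plain algorithm: fixed rules, no learning involved."""
    recent = daily_totals[-30:]        # rule 1: look only at the last 30 days
    return sum(recent) / len(recent)   # rule 2: average those totals

# Illustrative numbers only.
print(average_daily_sales([120, 95, 143, 130, 110] * 8))
```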
Algorithms need to be trained to learn how to classify and process information. The efficiency and accuracy of an algorithm depend on how well it was trained. But using an algorithm to calculate something does not automatically mean machine learning or AI is involved. All squares are rectangles, but not all rectangles are squares.
Unfortunately, the machine learning and AI buzzwords are often thrown around today simply to indicate that an algorithm was used to analyze data and make a prediction. Using an algorithm to predict the outcome of an event is not machine learning. Using the outcome of your prediction to improve future predictions is.
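The sketch below illustrates that distinction with made-up numbers: `predict_fixed` applies a hand-written rule and never changes, while `OnlinePredictor` feeds each observed outcome back into its parameters so that future predictions improve. The learning rate and the temperature/sales figures are arbitrary illustrative choices, not recommendations.

```python
def predict_fixed(temperature):
    """A static rule: it never changes, no matter how wrong it turns out to be."""
    return 100 + 2 * temperature

class OnlinePredictor:
    """The same rule, but its parameters are nudged by every observed outcome."""

    def __init__(self, base=100.0, slope=2.0, learning_rate=0.001):
        self.base, self.slope, self.lr = base, slope, learning_rate

    def predict(self, temperature):
        return self.base + self.slope * temperature

    def update(self, temperature, actual):
        # Feedback step: move the parameters toward what actually happened.
        error = actual - self.predict(temperature)
        self.base += self.lr * error
        self.slope += self.lr * error * temperature

model = OnlinePredictor()
for temp, actual in [(20, 150), (25, 170), (30, 195)]:   # made-up observations
    print(f"predicted {model.predict(temp):.1f}, observed {actual}")
    model.update(temp, actual)   # this feedback is what makes it "learning"
```

The fixed rule keeps making the same mistake forever; the updating version closes the gap a little more with every observation it sees.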
AI and machine learning are often used interchangeably, especially in the realm of big data. But they aren't the same thing, and it is important to understand how they differ and where each applies.
Artificial intelligence is the broader concept: it refers to the use of computers to mimic the cognitive functions of humans. When machines carry out tasks based on algorithms in an "intelligent" manner, that is AI. Machine learning is a subset of AI that focuses on the ability of machines to receive a set of data and learn for themselves, changing their algorithms as they learn more about the information they are processing.
Training computers to think like humans is achieved partly through the use of neural networks, a series of algorithms modeled after the human brain. The brain is constantly trying to make sense of the information it is processing; to do this, it labels items and assigns them to categories, and when we encounter something new, we compare it to things we already know in order to make sense of it. Neural networks perform the same kind of pattern recognition, categorization, and classification for computers, enabling them to:
- Extract meaning from complicated data
- Detect trends and identify patterns too complex for humans to notice
- Learn by example
- Operate at speeds humans cannot match
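Here is a minimal sketch of a neural network acting as a pattern classifier, using scikit-learn (my choice of library; the article does not name one). `make_moons` generates a synthetic two-class pattern that a straight line cannot separate, and the network learns the boundary purely from the labeled examples it is shown.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic "pattern" data: two interleaving half-moons, one label per class.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small neural network with one hidden layer of 16 units.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)                       # learning by example
print("test accuracy:", net.score(X_test, y_test))
```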
Deep learning goes yet another level deeper and can be considered a subset of machine learning. It is sometimes just referred to as "deep neural networks," a name that points to the many layers involved. A neural network may have only a single layer, while a deep neural network has two or more. The layers can be seen as a nested hierarchy of related concepts or decision trees: the answer to one question leads to a set of deeper, related questions.
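Continuing the scikit-learn sketch above, the only structural difference between a "shallow" network and a "deeper" one here is how many hidden layers are stacked; the layer sizes are arbitrary illustrative choices, and real deep learning systems typically use far more layers than this.

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer: a shallow network.
shallow = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)

# Three stacked hidden layers: each layer builds on what the previous one learned.
deeper = MLPClassifier(hidden_layer_sizes=(32, 32, 32), max_iter=2000, random_state=0)
```

Both are trained the same way with `fit`, but the deeper model can represent a hierarchy of intermediate concepts between the raw inputs and the final answer.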
Deep learning networks need to see large quantities of items in order to be trained. Instead of being programmed with the criteria that define items, such as the edges of a shape, they learn those features for themselves through exposure to millions of data points. An early example of this is Google Brain learning to recognize cats after being shown over ten million images.
Whether you are using an algorithm, artificial intelligence, or machine learning, one thing is certain: if the data being used is flawed, then the insights and information extracted will be flawed. This is where data cleansing comes in. What is data cleansing?
“The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect or irrelevant parts of the data and then replacing, modifying or deleting the dirty or coarse data.”
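As a concrete illustration of that definition, here is a minimal cleansing sketch using pandas (my choice of library; the article does not prescribe one). The column names, the validity rules, and the fill strategy are all hypothetical examples of the "detect, then replace, modify, or delete" steps described above.

```python
import pandas as pd

# A tiny, made-up record set with typical problems: a duplicate row,
# an impossible value, a missing value, and an invalid placeholder.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, -5, -5, None, 51],
    "country":     ["US", "US", "US", "n/a", "DE"],
})

df = df.drop_duplicates()                               # delete duplicate records
df = df[df["age"].between(0, 120) | df["age"].isna()]   # delete clearly incorrect ages
df["age"] = df["age"].fillna(df["age"].median())        # replace missing values (one possible choice)
df["country"] = df["country"].replace({"n/a": pd.NA})   # modify an invalid placeholder
print(df)
```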
And according to the CrowdFlower Data Science report, data scientists spend the majority of their time cleansing data; surprisingly, it is also their least favorite part of the job. Despite this, it is the most important part, as the output can't be trusted if the data hasn't been cleansed.
For AI and machine learning to continue to advance, the data driving the algorithms and decisions needs to be high-quality. If the data can't be trusted, how can the insights from the data be trusted?