Making Sense of Data Science, Data Mining, Machine Learning (ML), and Artificial Intelligence (AI)
Data science is a set of tools and methods for understanding and making sense of data and information. Data mining sifts existing data to highlight patterns, and serves as a foundation for artificial intelligence (AI) and machine learning (ML). AI is a broad term that, in current development trends, often means using data intelligently to solve existing problems; ML is a subset of AI in which machines learn to process information automatically.
AI is further characterized as either weak (also called narrow) or strong: a machine can either simulate a focused human behavior (weak/narrow) or actually think through a task as a human being would (strong). Some experts in the field break AI down into three levels - narrow, general, and super - while others go even further, to four types - reactive machines, limited memory, theory of mind, and self-awareness - reflecting the progressively greater capability of the machine.
90% of the data on the internet has been created since 2016, according to an IBM Marketing Cloud study.
Given the buzzword soup encountered when discussing data and how it can be used to inform and automate business decisions, below are definitions, common examples, and several takeaways to make sense of it all.
Definitions (source: Wikipedia)
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
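To make pattern discovery concrete, here is a minimal Python sketch of one classic mining task: counting which items co-occur across a set of transactions. The baskets and the support threshold are invented for illustration, not drawn from any real data set.

```python
from collections import Counter
from itertools import combinations

# Toy transaction data (invented for illustration)
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs", "coffee"},
    {"bread", "milk", "coffee"},
    {"bread", "eggs"},
]

# Count how often each pair of items appears in the same basket
pair_counts = Counter()
for basket in transactions:
    pair_counts.update(combinations(sorted(basket), 2))

# Report pairs that co-occur in at least 3 of the 5 baskets
min_support = 3
for (a, b), count in pair_counts.most_common():
    if count >= min_support:
        print(f"{a} + {b}: {count} of {len(transactions)} baskets")
```

Real data mining systems scale this same idea - counting and thresholding candidate patterns - to millions of records, with algorithms such as Apriori.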
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data," in order to make predictions or decisions without being explicitly programmed to perform the task. Example: email spam filtering.
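Since spam filtering is the example cited above, a minimal sketch can show the "training data" idea in practice. The library choice (scikit-learn) and the toy emails are assumptions for illustration; the point is that the model infers word/label patterns from labeled examples rather than following hand-written rules.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny labeled training set (invented for illustration)
emails = [
    "win a free prize now", "limited offer click here",
    "meeting moved to 3pm", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Represent each email as word counts, then fit a Naive Bayes model:
# no explicit filtering rules are written anywhere
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X_train, labels)

# Predict labels for messages the model has never seen
tests = ["free prize offer", "report for the meeting"]
print(model.predict(vectorizer.transform(tests)))  # expected: ['spam' 'ham']
```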
Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study that tries to make computers "smart" - able to work on their own without being explicitly programmed with commands. Weak AI examples: Siri or Alexa, speech recognition, or machine translation. Strong AI - a machine that would genuinely think through a task as a human does - remains hypothetical.
John McCarthy came up with the term "artificial intelligence" in 1955.
A common data mining project follows the seven steps of Knowledge Discovery in Databases (KDD), a multi-step process for converting raw data into useful information; a minimal code sketch of these steps follows the list.
Data selection
Data collection
Data cleaning / cleansing / preparation (pre-processing)
Data transformation / construction of the new data set / integration with existing data set(s) / formatting to give the right size and structure
Pattern searching (data mining)
Result interpretation (finding presentation, interpretation and evaluation)
Results reporting
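As noted above, here is a minimal pandas sketch of the KDD sequence. The library, file name, and column names are assumptions made for illustration; KDD prescribes the steps, not the tools.

```python
import pandas as pd

# Steps 1-2. Selection and collection: load the data relevant to the question
# (the file and its columns are hypothetical)
df = pd.read_csv("sales.csv")  # e.g. columns: region, product, units, price

# Step 3. Cleaning / preparation: drop duplicates and rows missing key fields
df = df.drop_duplicates().dropna(subset=["region", "units", "price"])

# Step 4. Transformation: construct a new field and shape the data for mining
df["revenue"] = df["units"] * df["price"]
by_region = df.groupby("region")["revenue"].sum()

# Step 5. Pattern searching (data mining): here a simple ranking; real projects
# would apply clustering, classification, association rules, etc.
top = by_region.sort_values(ascending=False)

# Steps 6-7. Interpretation and reporting: present the finding
print("Revenue by region, highest first:")
print(top)
```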
Common applications of KDD include discovering insights to inform marketing, improving processes in areas like manufacturing or telecommunications, and avoiding risk, as in fraud detection. With the immense volume of data being collected, businesses have found it essential to automate the data discovery and analysis process. Approaches continue to be developed to uncover hidden patterns and generate hypotheses, work that has helped form a part of artificial intelligence.
Interest in the KDD process has exploded over the past decade. It now encompasses many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems, and information theory. The ultimate goal is to extract high-level knowledge from low-level data (Techopedia).
Over 2.5 quintillion bytes of data are created every single day, and it's only going to grow from there. By 2020, it's estimated that 1.7MB of data will be created every second for every person on earth. Source: DOMO
Three Key Takeaways
More than ever before, knowledge is power, and ethics are paramount. The ability to create machines that can think, act, and learn independently of human intervention has fueled serious discussion of what is right, what is enough, and what is too much. The likes of Microsoft, Google, Apple, and others have developed manifestos meant to ensure that transparent, principled, and ethical commitments to human dignity, rights, freedoms, and cultural diversity are upheld. Where the line eventually gets drawn is a responsibility each of us, and all of us, share.
Data is a key differentiator. Used correctly, data that informs decisions and actions will separate winners from losers in corporate, medical, education, government, security, and truly all domains. As important as learning and mastering data science and its related disciplines is, ensuring that data becomes an equalizer for all populations matters just as much.
The data domain is accelerating. Given the sheer amount of data being generated, learning how to harness the information available to us is an imperative, and upskilling in the data domain is essential to avoid being left behind.