Let’s Talk About Machine Learning

Paras Mamgain
10 min read · May 27, 2020

At a very high level, machine learning is the process of teaching a computer system how to make accurate predictions when fed data.

In this write-up we will cover some of the basic concepts, terminology and use cases of machine learning, look at how machine learning is helping us in different sectors, and introduce some popular machine learning algorithms.

Let’s start with some examples. In the health sector, artificial intelligence can help determine whether there is a risk of developing a disease by observing the growth of cells. Bankers can use AI to approve or reject loan applications based on parameters such as the borrower’s credit score. Recommender systems on YouTube, Netflix and similar services use the same kind of magic when recommending your next favourite programme.

What is Machine Learning?
Machine Learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using it to make predictions or decisions, rather than following only explicitly programmed instructions.

Traditionally, computer programming takes data and a program as input and produces an output. That output is the result of applying the program’s logic to the data set.
In machine learning, by contrast, we give the computer the data and, possibly, the output (a previously known result or outcome), and we expect the two to help us train a model that can later be used to make predictions in other scenarios.
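To make the contrast concrete, here is a minimal sketch in plain Python. The loan-approval setting, the hand-written threshold of 650 and the past records are all made up for illustration, and `learn_threshold` is a hypothetical helper, not a standard algorithm from any library.

```python
# Traditional programming: a human writes the rule.
# The threshold 650 is a made-up value for illustration.
def approve_by_rule(credit_score):
    return credit_score >= 650


# Machine learning: the rule (here a single threshold) is derived
# from data plus previously known outcomes.
def learn_threshold(scores, outcomes):
    """Return the candidate threshold that misclassifies the fewest examples."""
    best_threshold, best_errors = None, len(scores) + 1
    for candidate in sorted(set(scores)):
        errors = sum((s >= candidate) != o for s, o in zip(scores, outcomes))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Hypothetical past loan records: credit score and whether the loan was repaid.
scores = [520, 580, 610, 640, 690, 720, 750, 800]
repaid = [False, False, False, False, True, True, True, True]

threshold = learn_threshold(scores, repaid)   # the "trained model"
new_applicant_approved = 710 >= threshold     # prediction for unseen data
print(threshold, new_applicant_approved)
```

In the first function the logic comes from the programmer; in the second it comes from the data, which is the shift described above.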
One of the best-known definitions of machine learning, from Tom Mitchell’s 1997 textbook Machine Learning, says:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Before we begin with a machine learning problem, we must identify the following:
* the class of tasks (T)
* the measure of performance to be improved (P)
* the source of experience (E)
Some examples of learning problems are learning to play checkers, handwriting recognition and robot driving.

Why and where is it required?
- When human expertise does not exist (e.g. navigating on Mars), or when humans can’t explain their expertise (e.g. recognising speech).
- When models must be customised (e.g. personalised medicine), or must be built from huge amounts of data (e.g. genomics).
- For tasks such as recognising spoken words, detecting fraudulent use of credit cards, or driving a vehicle autonomously, i.e. the famous Tesla cars ;)

  • Some more applications are:
    - Medical diagnosis
    - Fraud detection in online applications and transactions
    - Recommendation systems on online newspapers, social networks and video streaming websites
    - Recognition of spoken words, handwritten letters and other patterns
    - Financial prediction, e.g. for investments

Types of Learning

Let’s now discuss the types of machine learning. Four types of learning are most commonly discussed:

  1. Supervised (Inductive) Learning: learning a function that maps an input to an output based on example input-output pairs, i.e. learning a function f(x) to predict y given x. There are two types of supervised learning techniques: classification and regression.
  2. Unsupervised Learning: in contrast to supervised learning, this involves inferring patterns within datasets without any prior reference to known outcomes or labels. It finds hidden structure and unknown commonalities in the data. We let the model work on its own to discover information that may not be visible to the human eye. An unsupervised algorithm trains on the dataset and draws conclusions from UNLABELLED data.
    e.g. dimensionality reduction/feature selection, market basket analysis, clustering, market segmentation.
  3. Semi-Supervised Learning: this falls between supervised and unsupervised learning. The approach combines a small amount of labelled data with a large amount of unlabelled data during training.
  4. Reinforcement Learning: the training of machine learning models to make a sequence of decisions. Given a sequence of states and actions with (delayed) rewards, the output is a policy: a mapping from states to actions that tells the agent what to do in a given state.
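To make the idea of a policy in item 4 concrete, here is a minimal sketch of tabular Q-learning. Q-learning is one common reinforcement learning algorithm, chosen here purely for illustration. The five-cell corridor, the reward of 1 for reaching the last cell, and all parameter values are assumptions made up for this example.

```python
import random

N_STATES = 5                  # corridor cells 0..4; reaching cell 4 ends an episode
ACTIONS = ("left", "right")


def step(state, action):
    """Hypothetical environment: move one cell, reward 1.0 on reaching the last cell."""
    next_state = max(state - 1, 0) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)          # explore with purely random actions
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate towards reward + discounted value of the next state.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    # The policy maps each non-terminal state to its highest-valued action.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}


policy = q_learning()
print(policy)
```

The returned dictionary is the policy: for each state it names the action to take. The reward is delayed, because only the final move into the last cell is rewarded.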

Now let’s talk about the steps involved in designing a learning system:

  1. Describe the problem.
  2. Choose the training experience.
  3. Choose the target function, i.e. exactly what is to be learned.
  4. Choose how to represent the target function.
  5. Choose a function approximation algorithm, i.e. a learning algorithm that infers the target function from the experience.
  6. Produce the final design.

Frequently used terminology:

You will surely come across lots of jazzy, fancy words in the field of Artificial Intelligence and Machine Learning. Here is a very short description of each; we will revisit them in later sections:

  • Regression/Estimation is used to predict a continuous value.
  • Classification is used to predict a class or category.
  • Clustering is used to group similar cases, e.g. for customer segmentation.
  • Association is used to find events that often co-occur.
  • Anomaly detection is used to discover abnormal or unusual cases.
  • Sequence mining is used to predict the next event.
  • Dimension reduction is used to reduce the size of the data.
  • Recommendation systems associate people’s preferences with those of others who have similar tastes.

We will talk more about these in the later sections of the article, but first let’s differentiate between the famous buzzwords that you have probably heard quite often.

  • Artificial Intelligence is a general field with a broad scope, including computer vision, language processing, creativity, summarisation etc.
  • Machine Learning is the branch of AI that covers the statistical part of artificial intelligence. It teaches the computer to solve problems by looking at hundreds or thousands of examples and learning from them to solve the same problems in new situations.
  • Deep Learning is a specialised field of ML that uses neural networks with many layers to learn representations directly from data. It involves a deeper and more complex level of automation than other ML algorithms.

Useful Python packages:

  1. NumPy: a math library for working with n-dimensional arrays.
  2. SciPy: a library for scientific and high-performance computation. It is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimisation, statistics and much more.
  3. Matplotlib: a very popular plotting package that provides 2D as well as 3D plotting.
  4. Pandas: a high-level Python library that provides high-performance, easy-to-use data structures. It has many functions for importing, manipulating and analysing data.
  5. Scikit-learn: a collection of algorithms and tools for machine learning. It has most of the classification, regression and clustering algorithms and is designed to work with NumPy and SciPy. Most of the tasks in an ML pipeline are already implemented in scikit-learn, such as pre-processing of data, feature selection, feature extraction, train/test splitting, defining the algorithms, fitting models, tuning parameters, prediction, evaluation and exporting the models.

Regression

Regression is one of the most important and widely used machine learning and statistical tools. The term does not refer to linear regression only; it applies any time we are trying to predict a numerical value, such as a stock price or box office revenue. It allows you to make predictions from data by learning the relationship between the features of your data and some observed, continuous-valued response. The following are some regression models:

  • Simple Linear Regression (one input feature)
  • Multiple Regression (more than one input feature)
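As a rough sketch of simple linear regression, the snippet below fits a line y = slope * x + intercept with the ordinary least squares formulas in plain Python. The data points are made up and lie exactly on y = 2x + 1, and the function name is chosen for this example only.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
prediction = slope * 6.0 + intercept   # predict the response for an unseen x
print(slope, intercept, prediction)
```

With real, noisy data the points would not fall exactly on the line, and the fitted line would be the one that minimises the squared errors.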

Classification

In this section we will briefly cover KNN, decision trees, SVMs and logistic regression.
In ML, classification is a supervised learning approach for categorising unknown items into a discrete set of classes. Given a set of training data along with target labels, classification determines the class label for an unlabelled test case.
For example, banks use a credit score that is generated from an individual’s previous loan and instalment records. They then use these records to predict whether a person applying for a loan is likely to be a defaulter or a non-defaulter.
Classification is also used in email spam filtering, handwriting recognition, document classification, speech recognition etc.
Let’s cover some classification algorithms:

1. KNN (K-Nearest Neighbours algorithm):
KNN takes a bunch of labelled points and uses them to learn how to label a new, unknown data point. It is a method for classifying cases based on their similarity to other cases. Cases that are near each other are said to be “neighbours”. The distance between two cases is a measure of their dissimilarity. A new point is given the most common class among its k nearest neighbours.
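A minimal sketch of KNN in plain Python follows. It assumes Euclidean distance and a simple majority vote. The 2D points and the labels "A" and "B" are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` with the most common label among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labelled points forming two groups.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2, 2)))   # near the first group
print(knn_predict(points, labels, (6, 5)))   # near the second group
```

The choice of k matters: a very small k is sensitive to noise, while a very large k blurs the boundary between classes.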

Evaluation metrics in Classification
Evaluation metrics explain the performance of a model, i.e. how accurate its predicted values are. To measure this, we compare the actual labels with the predicted labels.
There are many model evaluation metrics, but we will list only the popular ones for now:
1.1. Jaccard Index: a statistic for measuring the similarity between sample sets. Here it is the size of the intersection of the actual and predicted label sets divided by the size of their union.
1.2. F1-Score: the harmonic mean of precision and recall, both of which are calculated from the confusion matrix. The F1-score varies from 0.00 (worst case) to 1.00 (best case).
1.3. Log loss: measures the performance of a classifier whose predicted output is a probability value between 0 and 1. A lower log loss means a better classifier.
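The three metrics can be sketched for a binary problem as follows. This is an illustration under assumptions: labels are 0 and 1, the Jaccard index is computed for the positive class, and the example labels and probabilities are made up. The functions do not guard against division by zero, which would occur if no positives were present.

```python
import math


def confusion_counts(actual, predicted):
    """Return (true positives, false positives, false negatives) for 0/1 labels."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn


def jaccard_index(actual, predicted):
    tp, fp, fn = confusion_counts(actual, predicted)
    return tp / (tp + fp + fn)           # intersection over union of positives


def f1_score(actual, predicted):
    tp, fp, fn = confusion_counts(actual, predicted)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def log_loss(actual, probabilities):
    """Average negative log-likelihood of the true labels."""
    total = sum(
        a * math.log(p) + (1 - a) * math.log(1 - p)
        for a, p in zip(actual, probabilities)
    )
    return -total / len(actual)


# Made-up labels, hard predictions and predicted probabilities of class 1.
actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]
probabilities = [0.9, 0.8, 0.4, 0.2, 0.6, 0.7]

print(jaccard_index(actual, predicted))   # 3 / (3 + 1 + 1)
print(f1_score(actual, predicted))        # precision and recall are both 3/4
print(log_loss(actual, probabilities))
```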

2. Decision Trees:
For a binary classification problem, for example, you can use the training part of the dataset to build a decision tree and then use the tree to predict the class of an unknown data point.
Decision trees are built by splitting the training set into distinct nodes, where one node contains all or most of one category of the data.
Each internal node represents a test, each branch represents a result of that test, and each leaf node assigns a class to the unknown data point.
Decision trees are built using recursive partitioning to classify the data. At each step the algorithm chooses the most predictive feature to split the data on. The important question when building a decision tree is therefore which attribute is the best, or most predictive, one to split on. To decide this we use entropy and information gain: entropy measures how mixed the classes in a node are, and information gain is the reduction in entropy achieved by a split.
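Entropy and information gain can be sketched in a few lines of plain Python. The labels and the two candidate splits below are made up for illustration. A full decision tree would repeat this comparison recursively for every node.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent_labels, child_groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent_labels)
    weighted = sum(len(group) / total * entropy(group) for group in child_groups)
    return entropy(parent_labels) - weighted


# Hypothetical node with two "yes" and two "no" cases.
parent = ["yes", "yes", "no", "no"]

# Split on a hypothetical feature A: each child is pure.
gain_a = information_gain(parent, [["yes", "yes"], ["no", "no"]])
# Split on a hypothetical feature B: each child is still evenly mixed.
gain_b = information_gain(parent, [["yes", "no"], ["yes", "no"]])

print(gain_a, gain_b)
```

Feature A has the higher information gain, so a tree-building algorithm using this criterion would split on A first.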

3. Logistic Regression:
Logistic regression is a statistical and machine learning technique for classifying the records of a dataset based on the values of the input fields. We use one or more independent variables to predict an outcome, which we call the dependent variable.
Logistic regression is analogous to linear regression, but it tries to predict a categorical or discrete target field instead of a numeric one. It can be used for binary as well as multi-class classification.
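Below is a minimal sketch of binary logistic regression with one input feature, trained by plain gradient descent on the log loss. The data (hours studied versus pass/fail), the learning rate and the number of epochs are all assumptions chosen for this example. Library implementations typically use more sophisticated solvers and add regularisation.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight and bias by gradient descent on the average log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Made-up data: hours studied and whether the exam was passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

weight, bias = fit_logistic_regression(hours, passed)
p_short = sigmoid(weight * 1.0 + bias)   # predicted probability after 1 hour
p_long = sigmoid(weight * 4.0 + bias)    # predicted probability after 4 hours
print(p_short, p_long)
```

The model outputs a probability; turning it into a class is a matter of choosing a cut-off, commonly 0.5.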

4. Support Vector Machines (SVM):
SVM is a supervised learning algorithm that classifies cases by finding a separator. It works by first mapping the data to a high-dimensional feature space so that the data points can be categorised even when they are not otherwise linearly separable; a separator is then estimated for the data. The data should be transformed in such a way that the separator can be drawn as a hyperplane.
It is useful for image analysis, text mining, spam detection, sentiment analysis and gene expression classification.
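The snippet below illustrates only the mapping idea behind SVMs; it does not train an SVM. One-dimensional points that cannot be split by a single threshold become linearly separable after the made-up feature map x -> (x, x²). The hyperplane weights and bias are picked by hand for this toy data, whereas a real SVM would learn them by maximising the margin.

```python
def feature_map(x):
    """Map a 1D point into a 2D feature space (an assumption for this example)."""
    return (x, x * x)


def separator(point, weights=(0.0, 1.0), bias=-2.5):
    """Classify by the side of the hyperplane w . z + b = 0.

    The weights and bias are hand-picked for this toy data, not learned.
    """
    score = sum(w * z for w, z in zip(weights, point)) + bias
    return 1 if score > 0 else -1


# Made-up data: the outer points are class 1, the inner points are class -1.
# No single threshold on x separates them.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

predictions = [separator(feature_map(x)) for x in xs]
print(predictions == labels)
```

In the mapped space the separator is the horizontal line x² = 2.5, which is a hyperplane in two dimensions.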

Clustering:

Clustering is the process of dividing data points into a number of groups such that the points in one group are more similar to each other than to the points in other groups. It is an unsupervised technique: a cluster is a group of objects that are similar to the other objects in the cluster and dissimilar to the data points in other clusters.
Clustering and its different algorithms are widely used for customer segmentation, and more generally in retail marketing, banking, insurance, publication, medicine, biology etc.

It can also be used for exploratory data analysis, summary generation, outlier detection, finding duplicates, or as a pre-processing step.

Clustering algorithms:
- Partition-based clustering: relatively efficient, e.g. K-means
- Hierarchical clustering: produces trees of clusters
- Density-based clustering: produces arbitrarily shaped clusters, e.g. DBSCAN

(i) K-means algorithm: it divides the data into non-overlapping subsets (clusters) without any cluster-internal structure.
It measures the similarity and dissimilarity between data points and forms the clusters based on that.
The accuracy of the clusters can be measured using two approaches:
- External approach: compare the clusters with the ground truth, if it is available
- Internal approach: measure how compact the clusters are, e.g. the average distance between data points within a cluster
K-means is a partition-based clustering algorithm, which:
a) is relatively efficient on medium and large sized datasets;
b) produces sphere-like clusters, because the clusters are shaped around the centroids;
c) has the drawback that we must pre-specify the number of clusters, which is not an easy task.
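A minimal sketch of the K-means loop in plain Python follows. It assumes Euclidean distance, a fixed number of iterations, and initial centroids supplied by the caller. Real implementations usually choose the initial centroids randomly or with a smarter scheme and stop when the assignments no longer change. The points are made up.

```python
import math


def k_means(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up 2D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

Note that the number of clusters is fixed by the number of initial centroids, which is the drawback mentioned in point c).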

(ii) Hierarchical Clustering:
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters, i.e. a tree-like structure that is usually drawn as a dendrogram.
There are 2 strategies for this:
1. top-down (divisive): start with one big cluster and divide it into smaller clusters
2. bottom-up (agglomerative): start with each data point as its own cluster and merge clusters to form bigger ones
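The bottom-up strategy can be sketched as follows. This version assumes one-dimensional points and single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible linkage criteria. The data values are made up.

```python
def agglomerative_clustering(points, target_clusters):
    """Repeatedly merge the two closest clusters until `target_clusters` remain."""
    clusters = [[p] for p in points]      # start with every point on its own
    merges = []                           # record of merges, i.e. the hierarchy
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Made-up 1D data with three visible groups.
clusters, merges = agglomerative_clustering([1.0, 1.5, 8.0, 8.4, 20.0], 3)
print(sorted(sorted(c) for c in clusters))
print(merges)
```

The list of merges, read in order, is the information a dendrogram displays.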

(iii) DBSCAN:
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and is appropriate when examining spatial data. It groups points that lie in dense regions and treats points in sparse regions as noise. Because the clusters can have arbitrary shapes, it is used to find associations and structures in data that are hard to find otherwise.
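Here is a compact sketch of the DBSCAN idea in plain Python. It assumes Euclidean distance and counts a point as one of its own neighbours. The parameter names `eps` and `min_points` and the sample points are chosen for this example. The neighbour search is brute force, so this is only suitable for tiny datasets.

```python
import math

NOISE = -1


def dbscan(points, eps, min_points):
    """Return one cluster id per point; NOISE (-1) marks points in sparse regions."""
    labels = [None] * len(points)         # None means "not visited yet"

    def neighbours(index):
        return [
            j for j, other in enumerate(points)
            if math.dist(points[index], other) <= eps
        ]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE             # may later be claimed as a border point
            continue
        labels[i] = cluster_id            # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)   # j is also a core point: keep expanding
        cluster_id += 1
    return labels


# Made-up points: two dense groups and one isolated point.
points = [(1, 1), (1.2, 1.1), (0.9, 1.3), (5, 5), (5.1, 5.2), (4.9, 5.1), (9, 1)]
labels = dbscan(points, eps=0.5, min_points=3)
print(labels)
```

Unlike K-means, the number of clusters is not given in advance; it follows from the density parameters.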

Recommender System

Even though people have different tastes and habits, there are similarities in the things they tend to like. In other words, people unconsciously follow patterns when they like or dislike things in the same category or things that share the same characteristics.
A recommender system tries to capture these patterns and similar behaviours to help predict what else you might like.
Recommender systems are applied on e-commerce websites and for movies or OTT content, news etc.
In terms of implementation, there are 2 types of recommender systems:
- Memory-based and
- Model-based.

  • In memory-based approaches, we use the entire user-item dataset to generate recommendations. Statistical techniques are used to find users or items that are similar to each other. Examples of these techniques include Pearson correlation, cosine similarity and Euclidean distance, among others.
  • In model-based approaches, a model of users is developed in an attempt to learn their preferences. Models can be created using machine learning techniques like regression, clustering, classification and so on.
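The three similarity measures named for memory-based approaches can be sketched as below. The users and their ratings of four hypothetical items are made up. Each list holds one user’s ratings in the same item order.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Cosine similarity of the mean-centred vectors, ranging from -1 to 1."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    return cosine_similarity(centred_a, centred_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.dist(a, b)


# Made-up ratings (1 to 5) of the same four items by three users.
alice = [5, 4, 1, 1]
bob = [4, 5, 1, 2]
carol = [1, 1, 5, 4]

print(cosine_similarity(alice, bob), cosine_similarity(alice, carol))
print(pearson_correlation(alice, bob), pearson_correlation(alice, carol))
print(euclidean_distance(alice, bob), euclidean_distance(alice, carol))
```

All three measures agree here that the first two users have more similar tastes than the first and third.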

Content-based recommender system: tries to recommend items to users based on their profiles. A user profile revolves around that user’s preferences and tastes.

Collaborative filtering: based on the fact that relationships exist between products and people’s interests. It uses two approaches:
- User-based and
- Item-based approach.
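A minimal sketch of the user-based approach follows. It predicts a user’s rating for an unseen item as the similarity-weighted average of other users’ ratings for that item. It assumes cosine similarity over the items both users have rated; other similarity measures and weighting schemes are equally possible. The users, items and ratings are made up.

```python
import math

# Hypothetical ratings: user -> {item: rating}. The first user has not rated "Movie D".
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob": {"Movie A": 4, "Movie B": 5, "Movie C": 1, "Movie D": 5},
    "carol": {"Movie A": 1, "Movie B": 1, "Movie C": 5, "Movie D": 2},
}


def similarity(user_a, user_b):
    """Cosine similarity over the items that both users have rated."""
    shared = [item for item in ratings[user_a] if item in ratings[user_b]]
    a = [ratings[user_a][item] for item in shared]
    b = [ratings[user_b][item] for item in shared]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_rating(user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    weighted_sum, total_weight = 0.0, 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        weight = similarity(user, other)
        weighted_sum += weight * ratings[other][item]
        total_weight += weight
    return weighted_sum / total_weight


predicted = predict_rating("alice", "Movie D")
print(predicted)
```

The prediction leans towards the rating given by the more similar user. The item-based approach works the same way but compares items by the users who rated them.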

Summary

Machine learning enables the analysis of massive quantities of data. It generally delivers accurate results that help identify profitable opportunities or dangerous risks, but it may also require additional time and resources to train properly. Combining machine learning with other AI techniques can make it even more effective.
