A detailed description of the parameters of Support Vector Machines

Let’s discuss the SVM algorithm.

SVM algorithm.

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms. It can be used for both classification and regression problems, but in machine learning it is primarily used for classification.

The main goal of an SVM is to divide a dataset into classes by finding a maximum marginal hyperplane (MMH). Conceptually, an SVM considers the hyperplanes that separate the classes and chooses the one that segregates the classes best.

The optimal hyperplane is the one with the largest margin, because the goal of an SVM is not only to classify the existing dataset but also to predict the class of unseen data.

Some important terms in SVM are as follows:

Support Vectors − The data points that are closest to the hyperplane. Support vectors help in deciding the separating line.

Hyperplane − The decision plane or space that divides a set of objects belonging to different classes.

Margin − The gap between the two lines that pass through the closest data points of different classes.

NOTE: For SVMs it is important to remember that the input data needs to be normalized, so that the features are on the same scale and comparable.
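The note above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the toy data and the choice of StandardScaler are assumptions made for the example, since the note only says that features should be on the same scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data (an assumption for illustration): the second feature is on a
# scale roughly 1000 times larger than the first.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

# The scaler is fitted inside the pipeline, so the same transformation
# is applied at training time and at prediction time.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)

# After scaling, every feature has mean 0 and standard deviation 1.
X_scaled = model.named_steps["standardscaler"].transform(X)
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))
```

Putting the scaler in a pipeline is one possible design; it avoids fitting the scaler on data the model should not see during cross-validation.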

Let’s discuss the different parameters:

The SVM algorithm, like gradient boosting, is very popular, very effective and provides a large number of hyperparameters to tune.

(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

Above are the different parameters that we can tune to get better accuracy. Let’s discuss them in detail.
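As a rough sketch of where such a parameter list comes from, the snippet below creates a classifier and prints a few of its parameters. It assumes the parameters belong to scikit-learn's SVC class; the exact defaults printed depend on the installed version (for example, the 'auto_deprecated' value for gamma only appears in older releases).

```python
from sklearn.svm import SVC

# Only the values we want to change need to be passed; everything else
# keeps the default of the installed scikit-learn version.
clf = SVC(kernel="linear", C=1.0, random_state=0)

# get_params() returns all hyperparameters as a dictionary.
params = clf.get_params()
for name in ("C", "kernel", "degree", "gamma", "decision_function_shape"):
    print(name, "=", params[name])
```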

1. degree:

It is the degree of the polynomial kernel (“poly”) and is ignored by all other kernels. The default value is 3. If the polynomial kernel works well for your data, it is a good idea to tune the ‘degree’ hyperparameter.

It is the degree of the polynomial used to find the hyperplane that splits the data. The degree parameter controls the flexibility of the decision boundary: higher-degree kernels yield a more flexible decision boundary.

The lowest-degree polynomial kernel (degree 1) corresponds to the linear kernel, which is not sufficient when a nonlinear relationship between features exists. Increasing this parameter also leads to longer training times.
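The effect of the degree can be sketched on a small example. The XOR-like toy data, C=10 and coef0=1 are assumptions chosen for illustration (coef0=1 makes a higher-degree kernel also contain the lower-degree terms); they are not recommendations.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like toy data (an assumption for illustration): the label depends on
# the product of the two features, so no straight line separates the classes.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

train_accuracy = {}
for degree in (1, 2, 3):
    clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=10.0).fit(X, y)
    train_accuracy[degree] = clf.score(X, y)  # accuracy on the training data

print(train_accuracy)
```

With degree 1 the boundary is a straight line and cannot follow the data, while degree 2 and above can bend around it.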

2. decision_function_shape:

Here we have two options, “ovo” and “ovr”. With this parameter we decide whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes), like all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm, which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one (‘ovo’) is always used internally as the multi-class strategy.

In its simplest form, an SVM is applied to binary classification, assigning each data point to class 1 or class 0. The parameter is ignored for binary classification.

A multi-class problem is broken down into multiple binary classification cases. In the one-vs-one approach, one binary classifier is built for every pair of classes. In scikit-learn, the one-vs-one decision function is not the default and needs to be selected explicitly; the one-vs-rest shape is the default. One-vs-rest divides the data points into class x and the rest, so that each class in turn is distinguished from all other classes.

The number of classifiers necessary for one-vs-one multi-class classification can be calculated with the following formula (with n being the number of classes):

n * (n - 1) / 2

In the one-vs-one approach, each classifier separates points of two different classes, and combining all one-vs-one classifiers leads to a multi-class classifier.
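The two decision function shapes described in this section can be checked with a short sketch. The four-class toy data from make_blobs is an assumption made for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

n_classes = 4
X, y = make_blobs(n_samples=200, centers=n_classes, random_state=0)

# Number of pairwise classifiers needed by one-vs-one: n * (n - 1) / 2.
n_pairs = n_classes * (n_classes - 1) // 2

ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

print(n_pairs)                         # 6
print(ovr.decision_function(X).shape)  # (200, 4)
print(ovo.decision_function(X).shape)  # (200, 6)
```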

3. Kernel:

SVMs are also called kernelized SVMs because of the kernel, which transforms the input data space into a higher-dimensional space.

Kernel parameters also have a significant effect on the decision boundary.

The kernel parameter selects the type of hyperplane used to separate the data. Using ‘linear’ will use a linear hyperplane (a line in the case of 2D data). ‘rbf’, ‘sigmoid’ and ‘poly’ use a nonlinear hyperplane.
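The difference between a linear and a nonlinear kernel can be sketched on data that no straight line can separate. The concentric-ring toy data from make_circles is an assumption chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings (toy data, an assumption for illustration):
# no straight line can separate the inner ring from the outer one.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    scores[kernel] = clf.score(X, y)  # accuracy on the training data

print(scores)
```

On data like this, the ‘rbf’ kernel should separate the rings far better than the ‘linear’ kernel.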

It specifies the kernel function that implicitly maps the input space into a higher-dimensional space, so that the data points do not each have to be mapped explicitly.

Let’s look at kernel functions in more detail. The most popular kernel functions, which are also available in scikit-learn, are linear, polynomial, radial basis function and sigmoid. The default value of kernel is the radial basis function (‘rbf’).

A kernel function is a method that takes data as input and transforms it into the form required for processing.

The name “kernel” is used because the set of mathematical functions used in an SVM provides a window through which to manipulate the data.

So a kernel function generally transforms the training data so that a nonlinear decision surface becomes a linear equation in a higher-dimensional space.

Basically, it returns the inner product between two points in a suitable feature space.
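The idea that a kernel returns an inner product without mapping the points explicitly can be sketched with a small worked example. The homogeneous degree-2 polynomial kernel and the helper names explicit_map and poly2_kernel are assumptions chosen for illustration.

```python
import numpy as np

def explicit_map(v):
    """Explicit degree-2 feature map for a 2D point (x1, x2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b) ** 2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

# Inner product computed in the 3D feature space ...
via_map = np.dot(explicit_map(a), explicit_map(b))
# ... and the same value computed directly in the original 2D space.
via_kernel = poly2_kernel(a, b)

print(via_map, via_kernel)
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors, which is what makes very high-dimensional feature spaces affordable.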

Below you can see what these four kernel functions look like:

A. Linear kernel: Used when data is linearly separable.

Formula for Linear function: K(x, y) = x · y

B. Polynomial kernel: It represents the similarity of vectors in the training data in a feature space over polynomials of the original variables. In other words, it generates new features from polynomial combinations of all the existing features.

Formula for Polynomial function: K(x, y) = (gamma * (x · y) + coef0)^degree

C. Radial Basis kernel (RBF): It is used to transform the data when there is no prior knowledge about it. In other words, it generates new features by calculating the distance between a specific point and all other points.

Formula for RBF function: K(x, y) = exp(-gamma * ||x - y||^2)

D. Sigmoid kernel: This function is equivalent to a two-layer perceptron model of a neural network, where it is used as the activation function for artificial neurons.

Formula for Sigmoid function: K(x, y) = tanh(gamma * (x · y) + coef0)
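The four kernels can be sketched in a few lines of NumPy. This assumes the forms given in the scikit-learn documentation, with gamma, coef0 and degree playing the roles of the parameters of the same names; the function names and the two sample points are made up for the example.

```python
import numpy as np

def linear_kernel(x, y):
    # Plain inner product of the two points.
    return np.dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    # Inner product, scaled and shifted, raised to the given degree.
    return (gamma * np.dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # Decays with the squared Euclidean distance between the points.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # Hyperbolic tangent of the scaled and shifted inner product.
    return np.tanh(gamma * np.dot(x, y) + coef0)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

print(linear_kernel(x, y))
print(polynomial_kernel(x, y, coef0=1.0, degree=2))
print(rbf_kernel(x, y))
print(sigmoid_kernel(x, y))
```

Note that the RBF kernel of a point with itself is always 1, and the value shrinks toward 0 as the two points move apart.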

4. gamma:

It is the kernel coefficient for the ‘rbf’, ‘poly’ and ‘sigmoid’ kernels. If gamma is ‘auto’, then 1/n_features is used. The higher the value of gamma, the more closely the model tries to fit the training data exactly, which increases the generalization error and causes over-fitting.

Low values of gamma indicate a large similarity radius, which results in more points being grouped together. For high values of gamma, the points need to be very close to each other in order to be considered in the same group (or class). Therefore, models with very large gamma values tend to overfit.
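The overfitting effect of a very large gamma can be sketched by comparing training and test accuracy. The noisy two-moons toy data and the two gamma values are assumptions chosen to make the contrast visible.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy toy data (an assumption for illustration), split in half.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

results = {}
for gamma in (0.1, 1000.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    results[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for gamma, (train_acc, test_acc) in results.items():
    print(f"gamma={gamma}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy for the large gamma is the sign of overfitting described above.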

Here we have two options, ‘scale’ and ‘auto’. If gamma='scale' (the default) is passed, then 1 / (n_features * X.var()) is used as the value of gamma; if ‘auto’, 1 / n_features is used.
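The two formulas can be worked through by hand with NumPy. The random feature matrix is an assumption made for the example; the snippet only reproduces the arithmetic and does not call scikit-learn.

```python
import numpy as np

# Toy feature matrix (an assumption): 100 samples, 4 features,
# drawn with a standard deviation of 2, so the variance is about 4.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=2.0, size=(100, 4))
n_features = X.shape[1]

# gamma='scale': depends on the number of features and the data variance.
gamma_scale = 1.0 / (n_features * X.var())

# gamma='auto': depends only on the number of features.
gamma_auto = 1.0 / n_features

print(gamma_scale, gamma_auto)
```

Because ‘scale’ divides by the variance of the data, it adapts to the spread of the features, while ‘auto’ stays the same however the data is scaled.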

5. C parameter:

C is the regularization parameter. The strength of the regularization is inversely proportional to C. It must be strictly positive. The penalty is a squared l2 penalty.

It is the penalty parameter of the error term. It also controls the trade-off between a smooth decision boundary and classifying the training points correctly.

The penalty term passed as a hyperparameter to the SVM, for both linearly separable and nonlinear problems, is denoted ‘C’ and is also called the degree of tolerance. The larger the value of C, the more the SVM is penalized when it makes a misclassification. The decision boundary then depends on a narrower margin and fewer support vectors.

Increasing C values may lead to overfitting the training data.
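The link between C and the number of support vectors can be sketched as follows. The overlapping two-class toy data and the three C values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (toy data, an assumption for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors for each class.
    n_support[C] = int(clf.n_support_.sum())

print(n_support)
```

A small C tolerates more margin violations, so the margin is wide and many points become support vectors; a large C narrows the margin and relies on fewer of them.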

Gamma vs C parameter

For a linear kernel, we only need to optimize the C parameter. However, if we want to use an RBF kernel, both the C and gamma parameters need to be optimized simultaneously. If gamma is large, the effect of C becomes negligible. If gamma is small, C affects the model just like it affects a linear model. Typical values for C and gamma are as follows.

These are the important parameters to tune for better accuracy. Thanks for reading.
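Tuning C and gamma together can be sketched with a grid search. The log-spaced candidate values below are illustrative assumptions, not the article's own list of typical values, and the toy data and the scaling pipeline are likewise choices made for the example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# Log-spaced candidate values (illustrative assumptions). The "svc__"
# prefix routes each parameter to the SVC step of the pipeline.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}

pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Every combination of C and gamma is scored with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Searching both parameters in one grid reflects the point above that, for an RBF kernel, the best C depends on the chosen gamma and vice versa.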