Machine Learning: Predictions on Iris Flowers


1. Preparing a Machine Learning Environment in Python

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries:

  • scipy
  • numpy
  • matplotlib
  • pandas
  • sklearn
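
Once installed, a quick way to confirm that the environment is ready is to import each library and print its version (the exact version numbers on your machine will differ):

# check the versions of the key libraries
import scipy
print('scipy: %s' % scipy.__version__)
import numpy
print('numpy: %s' % numpy.__version__)
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
import pandas
print('pandas: %s' % pandas.__version__)
import sklearn
print('sklearn: %s' % sklearn.__version__)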

2. Load Data

We are going to use the iris flowers dataset. The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

2.1 Import Libraries and Load Data

To load a CSV-format data file, we need the read_csv function from the pandas module:

# Load dataset
from pandas import read_csv
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

If you have network problems, you can download the iris.csv file into your working directory and load it the same way, changing the URL to the local file name.
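
For example, assuming the file has been saved as iris.csv in the working directory:

# Load the dataset from a local file instead
dataset = read_csv('iris.csv', names=names)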

3. Summarize the Dataset

Now it’s time to take a look at the data in a few different ways, as follows:

  • Dimensions of the dataset.
  • Peek at the data itself.
  • Statistical summary of all attributes.
  • Breakdown of the data by the class variable.

3.1 Dimensions of Dataset

# shape
print(dataset.shape)
(150, 5)

3.2 Peek at the Data

It is also a good idea to actually eyeball your data.

# head
print(dataset.head(20))


3.3 Statistical Summary

Now we can take a look at a summary of each attribute, which includes the count, mean, standard deviation, min and max, as well as some percentiles.

print(dataset.describe())


3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

# class distribution
print(dataset.groupby('class').size())

Each of the three classes has 50 instances, so the dataset is perfectly balanced.

4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

  1. Univariate plots, to better understand each attribute.
  2. Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

Given that the input variables are numeric, we can create box and whisker plots of each.

# box and whisker plots
from matplotlib import pyplot
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

We can also create a histogram of each input variable to get an idea of the distribution.

# histograms
dataset.hist()
pyplot.show()


It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

4.2 Multivariate Plots

Now we can look at the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

# scatter plot matrix
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
pyplot.show()


5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.
The steps are as follows:

  1. Separate out a test dataset.
  2. Set up the test harness to use 10-fold cross-validation.
  3. Build multiple different models to predict species from flower measurements.
  4. Select the best model.

5.1 Create a Test Dataset

We need to know that the model we created is good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two:

  • 80% of which we will use to train, evaluate and select among our models;
  • 20% that we will hold back as a test dataset, used to estimate the accuracy of the selected model on unseen data.

# Split-out validation dataset
from sklearn.model_selection import train_test_split
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later to make predictions.
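
As a quick sanity check, we can confirm the sizes of the splits; with 150 rows and a 20% test split, we expect 120 training rows and 30 held-back rows:

# confirm the sizes of the splits
print(X_train.shape, Y_train.shape)            # expect (120, 4) (120,)
print(X_validation.shape, Y_validation.shape)  # expect (30, 4) (30,)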

5.2 Test Harness

We will use stratified 10-fold cross validation to estimate model accuracy in order to select the best model.

This will split our training dataset into 10 parts, train on 9 and test on 1, and repeat so that each part is used as the test split exactly once.

Stratified means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.
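
To see what stratification does, here is a small sketch (using the X_train and Y_train arrays from section 5.1) that counts the classes held out in each fold; with a balanced training set, every fold should contain roughly equal numbers of each species:

# count the classes in each stratified test fold
import numpy as np
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
for fold, (train_ix, test_ix) in enumerate(kfold.split(X_train, Y_train)):
	classes, counts = np.unique(Y_train[test_ix], return_counts=True)
	print('fold %d: %s' % (fold, dict(zip(classes, counts))))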

We set the random seed via the random_state argument to a fixed number to ensure that each algorithm is evaluated on the same splits of the training dataset.

The specific random seed does not matter.

We will use the 'accuracy' metric to evaluate models. This is the ratio of the number of correctly predicted instances to the total number of instances, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use this scoring variable when we build and evaluate each model next.
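
For instance, a model that correctly predicts 29 of 30 instances scores 29 / 30 ≈ 96.7%. A tiny illustration of the same calculation with made-up labels:

# accuracy = correct predictions / total predictions
from sklearn.metrics import accuracy_score
y_true = ['setosa', 'setosa', 'virginica', 'versicolor']
y_pred = ['setosa', 'setosa', 'virginica', 'virginica']
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75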

5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use.

Let’s test 6 different algorithms:

  • Logistic Regression (LR)
  • Linear Discriminant Analysis (LDA)
  • K-Nearest Neighbors (KNN)
  • Classification and Regression Trees (CART)
  • Gaussian Naive Bayes (NB)
  • Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.

# Spot Check Algorithms
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate. Running the loop above prints the mean and standard deviation of the cross-validation accuracy for each model. In this case, we can see that Support Vector Machines (SVM) has the largest estimated accuracy score, at about 0.98 or 98%.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross-validation).

A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.

# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()

We can see that the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy, and some pushing down into the high 80% accuracies.

5.5 Complete Example

# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
	
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()

6. Make Predictions

We must choose an algorithm to use to make predictions. The results in the previous section suggest that the SVM was perhaps the most accurate model, so we will use it as our final model.

6.1 Make Predictions

# make predictions on test dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

You might also like to make predictions for single rows of data.
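
A minimal sketch, with illustrative measurements given in the same order as the feature columns (sepal-length, sepal-width, petal-length, petal-width):

# predict the species for a single new flower
row = [[5.1, 3.5, 1.4, 0.2]]  # illustrative measurements
print(model.predict(row))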

You might also like to save the model to a file and load it later to make predictions on new data.
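
One common approach is the joblib package, which is installed alongside scikit-learn; a minimal sketch, with an arbitrary file name:

# save the fitted model to disk and load it back later
from joblib import dump, load
dump(model, 'svm_iris.joblib')  # the file name is arbitrary
loaded_model = load('svm_iris.joblib')
print(loaded_model.predict(X_validation))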

6.2 Evaluate Predictions

We can evaluate the predictions by comparing them to the expected results in the test set, then calculate classification accuracy, as well as a confusion matrix and a classification report.

# Evaluate predictions
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

  • The confusion matrix provides an indication of the errors made.
  • The classification report provides a breakdown of each class by precision, recall, f1-score and support, showing excellent results (granted, the validation dataset was small).

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

Source: https://blog.csdn.net/m0_46231504/article/details/110120131
