Assessment
Analysis of Pima Indians Diabetes Data Using the WEKA Machine Learning Software Tool
The main objective of this paper is to examine the practical aspects of machine learning using the WEKA tool.
Introduction
The data was collected from the UCI Machine Learning Repository. All of the patients met the selection criterion: Pima Indian women at least 21 years old, living near Phoenix, Arizona. The data set contains 768 instances and nine attributes, the last of which is the binary class. 268 cases were labeled as one (1), while the remaining 500 patients were labeled as zero (0). The relatively large number of participants supports the accuracy of the results.
Number | Name | Type
1 | Number of times pregnant | Numeric value
2 | Plasma glucose concentration at 2 hours in an oral glucose tolerance test | Numeric value
3 | Diastolic blood pressure (mm Hg) | Numeric value
4 | Triceps skin fold thickness (mm) | Numeric value
5 | 2-hour serum insulin (mu U/ml) | Numeric value
6 | Body mass index (weight in kg / (height in m)^2) | Numeric value
7 | Diabetes pedigree function | Numeric value
8 | Age (years) | Numeric value
9 | Class variable (zero or one) | Numeric value
Processing the Data Set
The data set was converted to WEKA 3's native format, "ARFF", before running the classification trials on it. The attribute values were first analyzed before classification; preprocessing was also performed on the values in order to obtain a better outcome rate and to raise the classification precision. The attribute ranges were carefully assessed, and the statistics, including the minimum and maximum value of each attribute, were acquired from WEKA's GUI. A minimal loading sketch is given below.
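As an illustration, the ARFF file can also be loaded programmatically through WEKA's Java API. This is a minimal sketch, assuming the converted file is named diabetes.arff (the paper does not state the file name):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDiabetes {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; the file name is an assumption.
        DataSource source = new DataSource("diabetes.arff");
        Instances data = source.getDataSet();
        // The class variable is the last attribute (attribute 9).
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Instances: " + data.numInstances());   // expect 768
        System.out.println("Attributes: " + data.numAttributes()); // expect 9
    }
}
```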
Data set attribute graphs: (A) Number of times pregnant, (B) Plasma glucose concentration at 2 hours in an oral glucose tolerance test, (C) Diastolic blood pressure (mm Hg), (D) Triceps skin fold thickness (mm), (E) 2-hour serum insulin (mu U/ml), (F) Body mass index (weight in kg / (height in m)^2), (G) Diabetes pedigree function, (H) Age (years), (I) Class variable (zero or one).
With the evaluation of the chosen attributes, every value was accounted for across all the features, so the results were acquired automatically and with precision (Witten, n.d.). Since the features are numeric, the WEKA tool also reports their statistical properties. These values are shown in the table below, followed by a programmatic sketch.
Statistical Properties of the First Eight Attributes
Attribute number | Mean value | Standard deviation
1 | 3.8 | 3.4
2 | 120.9 | 32.0
3 | 69.1 | 19.4
4 | 20.5 | 16.0
5 | 79.8 | 115.2
6 | 32.0 | 7.9
7 | 0.5 | 0.3
8 | 33.2 | 11.8
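The same statistics can be reproduced outside the GUI with WEKA's AttributeStats API. A minimal sketch, again assuming a diabetes.arff file:

```java
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSummary {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Print min, max, mean and standard deviation for the eight numeric inputs.
        for (int i = 0; i < data.numAttributes() - 1; i++) {
            AttributeStats stats = data.attributeStats(i);
            System.out.printf("%d %-45s min=%.1f max=%.1f mean=%.1f stdDev=%.1f%n",
                    i + 1, data.attribute(i).name(),
                    stats.numericStats.min, stats.numericStats.max,
                    stats.numericStats.mean, stats.numericStats.stdDev);
        }
    }
}
```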
The preprocessing of the data set involved cleaning, integration, reduction and normalization. The attribute scales vary widely (compare, for example, attributes 4 and 7 above). Such variation may bias the outcome, so normalization was applied to rescale every attribute to the interval from zero (0) to one (1). This limits the effect of attributes with very large or very small ranges. The normalized statistics are shown in the table below; a filtering sketch follows it.
Attribute number | Mean value | Standard deviation
1 | 0.230 | 0.200
2 | 0.610 | 0.160
3 | 0.570 | 0.160
4 | 0.210 | 0.160
5 | 0.094 | 0.140
6 | 0.480 | 0.180
7 | 0.170 | 0.140
8 | 0.200 | 0.200
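WEKA ships an unsupervised Normalize filter that performs exactly this [0, 1] rescaling. A minimal sketch of applying it (file name assumed as before):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeData {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Rescale every numeric attribute to [0, 1]; the class attribute,
        // once set, is left untouched by this filter.
        Normalize norm = new Normalize();
        norm.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized.instance(0));
    }
}
```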
Attribute reduction is vital, as it limits the impact of uninformative features on the results. The WEKA attribute selector was applied to score each attribute by the number (and percentage) of cross-validation folds in which it was selected. The following table shows the output of WEKA's attribute selector with its filter and search settings; a selection sketch follows the table.
Folds (%) | Attribute #
0 (0%) | 1
10 (100%) | 2
0 (0%) | 3
0 (0%) | 4
1 (10%) | 5
10 (100%) | 6
8 (80%) | 7
10 (100%) | 8
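The paper does not name the exact evaluator and search method, so the sketch below assumes WEKA's common CfsSubsetEval evaluator with a GreedyStepwise search, run in the selector's 10-fold cross-validation mode so that the output reports, for each attribute, the number of folds in which it was selected, matching the table above:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Evaluator and search are assumptions; the paper does not specify them.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new GreedyStepwise());
        // Repeat the selection inside a 10-fold cross-validation so the
        // report shows how many folds chose each attribute.
        selector.setXval(true);
        selector.setFolds(10);
        selector.setSeed(1);
        selector.SelectAttributes(data);
        System.out.println(selector.CVResultsString());
    }
}
```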
Classification Outcomes
Naïve Bayes Classifier
The Naïve Bayes classifier was trained on the data set, starting from the preprocessed data (Stutz, 1996). A ten-fold cross-validation was then applied, and this classifier proved more precise than the others. The outcomes are shown in the table in the next section.
The other classifiers used were the Multilayer Perceptron, the decision-tree-based classifier (J48) and the K Star instance-based classifier. A sketch of the shared evaluation loop follows.
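A minimal sketch of that evaluation loop, assuming WEKA's stock implementations of the four classifiers and a fixed random seed (the paper does not state its seed):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.lazy.KStar;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] classifiers = {
            new NaiveBayes(), new MultilayerPerceptron(), new J48(), new KStar()
        };
        for (Classifier c : classifiers) {
            // 10-fold cross-validation, as used in the paper; seed is assumed.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-22s accuracy=%.3f%% AUC(class 1)=%.3f%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(),
                    eval.areaUnderROC(1));
        }
    }
}
```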
Performance Evaluation Metrics
The metrics applied comprised accuracy, sensitivity, specificity and ROC curves. The ROC curve data were exported to the MATLAB environment to plot and compare the classifiers.
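For reference, with TP, TN, FP and FN denoting the true/false positives and negatives for a given class, the metrics in the table follow the standard definitions. Note that the sensitivity of one class equals the specificity of the other, which explains the mirrored columns below:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad
\mathrm{Specificity} = \frac{TN}{TN + FP}
```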
Classifier | Accuracy (%) | Sensitivity (Class 0) | Sensitivity (Class 1) | Specificity (Class 0) | Specificity (Class 1)
Naïve Bayes | 77.995 | 0.860 | 0.630 | 0.630 | 0.860
ML Perceptron | 76.563 | 0.856 | 0.600 | 0.600 | 0.856
J48 | 74.219 | 0.838 | 0.560 | 0.560 | 0.838
K Star | 69.400 | 0.816 | 0.450 | 0.470 | 0.816
ROC curves for all the tested classifiers
Discussion
Among the classifiers, Naïve Bayes achieved the most accurate outcome (Tsumoto, 1997). The ROC curves confirmed that Naïve Bayes was the most effective. The WEKA tool was used to assess which classifier performs better under cross-validation training, and the outcome also showed that Naïve Bayes was the fastest at returning results.
The Multilayer Perceptron performed well but was slow. This makes it ineffective with larger feature sets, a cost that traces back to the back-propagation rule it uses for training.
The J48 model was intended as an alternative to the Multilayer Perceptron. It is quicker, while K Star shows the poorest accuracy (Witten, n.d.). K Star may appear quicker to train because, as an instance-based (lazy) learner, it defers most of the work of extracting information from the data to classification time, unlike the others.
Bibliography
Stutz, J., & Cheeseman, P. (1996). Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining. Massachusetts: AAAI/MIT Press.
Tsumoto, S. (1997). "Automated Discovery of Plausible Rules Based on Rough Sets and Rough Inclusion." Proceedings of the Third Pacific-Asia Conference (PAKDD). Beijing, China, pp. 210-219.
Witten, I. H., & Frank, E. (n.d.). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann Publishers.