Assessment
Analysis of Pima Indians Diabetes Data Using the WEKA Machine Learning Software Tool
The main objective of this paper is to examine the practical aspects of machine learning using the WEKA tool.
Introduction
The data was collected from the UCI Machine Learning Repository. All of the patients met the selection criterion: Pima Indian women at least 21 years old, living near Phoenix, Arizona. The data set contains 768 instances and nine attributes, the last of which is the binary class. 268 cases were labeled as one (1), while the remaining 500 patients were labeled as zero (0). The relatively large number of participants supports the accuracy of the results.
Number | Name | Type
1 | Number of times pregnant | Numeric value
2 | Plasma glucose concentration at 2 hours in an oral glucose tolerance test | Numeric value
3 | Diastolic blood pressure (mm Hg) | Numeric value
4 | Triceps skin fold thickness (mm) | Numeric value
5 | 2-hour serum insulin (mu U/ml) | Numeric value
6 | Body mass index (weight in kg / (height in m)^2) | Numeric value
7 | Diabetes pedigree function | Numeric value
8 | Age (years) | Numeric value
9 | Class variable (zero or one) | Numeric value
Processing the Data Set
The data set was converted to WEKA 3's native format, "ARFF", before running the classification trials on it. The attribute values were first analyzed before classification; preprocessing was also performed on the values in order to obtain a better outcome rate and to raise the classification precision. The attribute ranges were carefully assessed, and the statistics, including the minimum and maximum value of each attribute, were acquired from WEKA's GUI. A minimal loading sketch is given below.
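As an illustration, the ARFF file can also be loaded programmatically through WEKA's Java API. This is a minimal sketch, assuming the converted file is named diabetes.arff (the paper does not state the file name):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDiabetes {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; the file name is an assumption.
        DataSource source = new DataSource("diabetes.arff");
        Instances data = source.getDataSet();
        // The class variable is the last attribute (attribute 9).
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Instances: " + data.numInstances());   // expect 768
        System.out.println("Attributes: " + data.numAttributes()); // expect 9
    }
}
```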
Data set attribute graphs: (A) Number of times pregnant, (B) Plasma glucose concentration at 2 hours in an oral glucose tolerance test, (C) Diastolic blood pressure (mm Hg), (D) Triceps skin fold thickness (mm), (E) 2-hour serum insulin (mu U/ml), (F) Body mass index (weight in kg / (height in m)^2), (G) Diabetes pedigree function, (H) Age (years), (I) Class variable (zero or one).
With the evaluation of the chosen attributes, every value was accounted for across all the features, so the results were acquired automatically and with precision (Witten, n.d.). Since the features are numeric, the WEKA tool also reports their statistical properties. These values are shown in the table below, followed by a programmatic sketch.
Statistical Properties of the First Eight Attributes
Attribute number | Mean value | Standard deviation
1 | 3.8 | 3.4
2 | 120.9 | 32.0
3 | 69.1 | 19.4
4 | 20.5 | 16.0
5 | 79.8 | 115.2
6 | 32.0 | 7.9
7 | 0.5 | 0.3
8 | 33.2 | 11.8
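The same statistics can be reproduced outside the GUI with WEKA's AttributeStats API. A minimal sketch, again assuming a diabetes.arff file:

```java
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSummary {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Print min, max, mean and standard deviation for the eight numeric inputs.
        for (int i = 0; i < data.numAttributes() - 1; i++) {
            AttributeStats stats = data.attributeStats(i);
            System.out.printf("%d %-45s min=%.1f max=%.1f mean=%.1f stdDev=%.1f%n",
                    i + 1, data.attribute(i).name(),
                    stats.numericStats.min, stats.numericStats.max,
                    stats.numericStats.mean, stats.numericStats.stdDev);
        }
    }
}
```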
The preprocessing of the data set involved cleaning, integration, reduction and normalization. The attribute scales vary widely (compare, for example, attributes 4 and 7 above). Such variation may bias the outcome, so normalization was applied to rescale every attribute to the interval from zero (0) to one (1). This limits the effect of attributes with very large or very small ranges. The normalized statistics are shown in the table below; a filtering sketch follows it.
Attribute number | Mean value | Standard deviation
1 | 0.230 | 0.200
2 | 0.610 | 0.160
3 | 0.570 | 0.160
4 | 0.210 | 0.160
5 | 0.094 | 0.140
6 | 0.480 | 0.180
7 | 0.170 | 0.140
8 | 0.200 | 0.200
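WEKA ships an unsupervised Normalize filter that performs exactly this [0, 1] rescaling. A minimal sketch of applying it (file name assumed as before):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeData {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Rescale every numeric attribute to [0, 1]; the class attribute,
        // once set, is left untouched by this filter.
        Normalize norm = new Normalize();
        norm.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized.instance(0));
    }
}
```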
Attribute reduction is vital, as it limits the impact of uninformative features on the results. The WEKA attribute selector was applied to score each attribute by the number (and percentage) of cross-validation folds in which it was selected. The following table shows the output of WEKA's attribute selector with its filter and search settings; a selection sketch follows the table.
Folds (%) | Attribute #
0 (0%) | 1
10 (100%) | 2
0 (0%) | 3
0 (0%) | 4
1 (10%) | 5
10 (100%) | 6
8 (80%) | 7
10 (100%) | 8
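The paper does not name the exact evaluator and search method, so the sketch below assumes WEKA's common CfsSubsetEval evaluator with a GreedyStepwise search, run in the selector's 10-fold cross-validation mode so that the output reports, for each attribute, the number of folds in which it was selected, matching the table above:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Evaluator and search are assumptions; the paper does not specify them.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new GreedyStepwise());
        // Repeat the selection inside a 10-fold cross-validation so the
        // report shows how many folds chose each attribute.
        selector.setXval(true);
        selector.setFolds(10);
        selector.setSeed(1);
        selector.SelectAttributes(data);
        System.out.println(selector.CVResultsString());
    }
}
```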
Classification Outcomes
Naïve Bayes Classifier
The Naïve Bayes classifier was trained on the data set, starting from the preprocessed data (Stutz, 1996). A ten-fold cross-validation was then applied, and this classifier proved more precise than the others. The outcomes are shown in the table in the next section.
The other classifiers used were the Multilayer Perceptron, the decision-tree-based classifier (J48) and the K Star instance-based classifier. A sketch of the shared evaluation loop follows.
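A minimal sketch of that evaluation loop, assuming WEKA's stock implementations of the four classifiers and a fixed random seed (the paper does not state its seed):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.lazy.KStar;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] classifiers = {
            new NaiveBayes(), new MultilayerPerceptron(), new J48(), new KStar()
        };
        for (Classifier c : classifiers) {
            // 10-fold cross-validation, as used in the paper; seed is assumed.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-22s accuracy=%.3f%% AUC(class 1)=%.3f%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(),
                    eval.areaUnderROC(1));
        }
    }
}
```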
Performance Evaluation Metrics
The metrics applied comprised accuracy, sensitivity, specificity and ROC curves. The ROC curve data were exported to the MATLAB environment to plot and compare the classifiers.
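For reference, with TP, TN, FP and FN denoting the true/false positives and negatives for a given class, the metrics in the table follow the standard definitions. Note that the sensitivity of one class equals the specificity of the other, which explains the mirrored columns below:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad
\mathrm{Specificity} = \frac{TN}{TN + FP}
```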
Classifier | Accuracy (%) | Sensitivity (Class 0) | Sensitivity (Class 1) | Specificity (Class 0) | Specificity (Class 1)
Naïve Bayes | 77.995 | 0.860 | 0.630 | 0.630 | 0.860
ML Perceptron | 76.563 | 0.856 | 0.600 | 0.600 | 0.856
J48 | 74.219 | 0.838 | 0.560 | 0.560 | 0.838
K Star | 69.400 | 0.816 | 0.450 | 0.470 | 0.816
ROC curves for all the tested classifiers
Discussion
Among the classifiers, Naïve Bayes achieved the most accurate outcome (Tsumoto, 1997). The ROC curves confirmed that Naïve Bayes was the most effective. The WEKA tool was used to assess which classifier performs better under cross-validation training, and the outcome also showed that Naïve Bayes was the fastest at returning results.
The Multilayer Perceptron performed well but was slow. This makes it ineffective with larger feature sets, a cost that traces back to the back-propagation rule it uses for training.
The J48 model was intended as an alternative to the Multilayer Perceptron. It is quicker, while K Star shows the poorest accuracy (Witten, n.d.). K Star may appear quicker to train because, as an instance-based (lazy) learner, it defers most of the work of extracting information from the data to classification time, unlike the others.
Bibliography
Stutz, J., & Cheeseman, P. (1996). Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining. Massachusetts: AAAI/MIT Press.
Tsumoto, S. (1997). "Automated Discovery of Plausible Rules Based on Rough Sets and Rough Inclusion." Proceedings of the Third Pacific-Asia Conference (PAKDD). Beijing, China, pp. 210-219.
Witten, I. H., & Frank, E. (n.d.). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann Publishers.