Wednesday, June 5, 2019
K Means Clustering With Decision Tree Computer Science Essay
The K-means clustering data mining algorithm is commonly used to find clusters because it is simple to implement and fast to execute. After applying the K-means clustering algorithm to a dataset, it is difficult to interpret the clusters and to extract the required results from them unless another data mining algorithm is used. The decision tree (ID3) is used for the interpretation of the clusters of the K-means algorithm because ID3 is faster to use, generates understandable rules more easily and is simpler to explain. In this research paper we integrate the K-means clustering algorithm with the decision tree (ID3) algorithm into one algorithm using an intelligent agent, called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of performing the classification and interpretation of the given dataset. For the visualization of the clusters, 2D scattered graphs are drawn.

Keywords: Classification, LIAgent, Interpretation, Visualization

1. Introduction

Data mining algorithms are used to discover hidden, new patterns and relations in complex datasets. The use of intelligent mobile agents in data mining algorithms further boosts their study. The term intelligent mobile agent combines two different disciplines: the agent comes from Artificial Intelligence, and code mobility comes from distributed systems. An agent is an object which has an independent thread of control and can be initiated. The first step is the agent initialization. The agent will then start to operate and may stop and start again depending upon the environment and the tasks that it tries to accomplish. After the agent has finished all the required tasks, it ends in its complete state. Table 1 elaborates the different states of an agent [1][2][3][4].

Table 1. 
States of an agent
Name of Step | Description
Initialize | Performs one-time setup activity.
Start | Starts its job or task.
Stop | Stops its jobs or tasks after saving intermediate results.
Complete | Performs completion or termination activity.

There is a link between Artificial Intelligence (AI) and Intelligent Agents (IA). Data mining is known as Machine Learning in Artificial Intelligence. Machine Learning deals with the development of techniques which allow the computer to learn. It is a method of creating computer programs by the analysis of datasets. The agents must be able to learn to do classification, clustering and prediction using learning algorithms [5][6][7][8].

The remainder of this paper is organized as follows: Section 2 reviews the relevant data mining algorithms, namely the K-means clustering and the decision tree (ID3). Section 3 describes the methodology, a hybrid integration of the data mining algorithms. Section 4 presents the results and discussion. Finally, Section 5 presents the conclusion.

2. Overview of Data Mining Algorithms

The K-means clustering data mining algorithm is used for the classification of a dataset by producing the clusters of that dataset. The K-means clustering algorithm is a kind of unsupervised machine learning. The decision tree (ID3) data mining algorithm is used to interpret these clusters by producing decision rules in if-then-else form. The decision tree (ID3) algorithm is a type of supervised machine learning. Both algorithms are combined into one algorithm through intelligent agents, called the Learning Intelligent Agent (LIAgent). In this section we discuss both algorithms.

2.1. 
K-means clustering Algorithm

The following steps explain the K-means clustering algorithm:

Step 1: Enter the number of clusters and the number of iterations, which are the required and basic inputs of the K-means clustering algorithm.

Step 2: Calculate the initial centroids by using the Range Method shown in equations 1 and 2.

(1)
(2)

The initial centroid is C(ci, cj), where max X, max Y, min X and min Y denote the maximum and minimum values of the X and Y attributes respectively. k represents the number of clusters, and i, j and n vary from 1 to k, where k is an integer. In this way we can calculate the initial centroids; this is the starting point of the algorithm. The value (max X - min X) gives the range of the X attribute; similarly, the value (max Y - min Y) gives the range of the Y attribute. The value of n varies from 1 to k. The number of iterations should be small, otherwise the time and space complexity will be very high, and the value of the initial centroids will also become very high and may fall outside the range of the given dataset. This is a major drawback of the K-means clustering algorithm.

Step 3: Calculate the distance using the Euclidean distance formula in equation 3. On the basis of the distances, generate the partition by assigning each sample to the closest cluster.

Euclidean Distance Formula:

$d(x_i, x_j) = \sqrt{\sum_{m=1}^{N} (x_{im} - x_{jm})^2}$    (3)

Here d(x_i, x_j) is the distance between objects x_i and x_j, x_{im} and x_{jm} are the values of their m-th attribute, and N is the total number of attributes of a given object. i, j, m and N are integers.

Step 4: Compute the new cluster centers as the centroids of the clusters, then compute the distances again and generate the partition. Repeat this until the cluster memberships stabilize [9][10].

The strengths and weaknesses of the K-means clustering algorithm are discussed in table 2.

Table 2. Strengths and Weaknesses of the K-means clustering Algorithm
Strengths | Weaknesses
Time complexity is O(nkl) for n samples, k clusters and l iterations: linear time complexity in the size of the dataset. | It is easy to implement, but it has the drawback of depending on the initial centre provided.
Space complexity is O(k + n). | If a distance measure does not exist, especially in multidimensional spaces, the distance must first be defined, which is not always easy.
It is an order-independent algorithm. It generates the same partition of the data regardless of the order of the samples. | The results obtained from this clustering algorithm can be interpreted in different ways.
Not applicable | No clustering technique addresses all the requirements adequately and concurrently.

The following are areas, among others, where the K-means clustering algorithm can be applied:

Marketing: Finding groups of customers with similar behavior given a large database of customers containing their profiles and past records.
Biology: Classification of plants and animals given their features.
Libraries: Book ordering.
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.
City-planning: Identifying groups of houses according to their house type, value and geographical location.
Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones.
WWW: Document classification; clustering web log data to discover groups of similar access patterns.
Medical Sciences: Classification of medicines; patient records according to their doses etc. [11][12].

2.2. Decision Tree (ID3) Algorithm

The decision tree (ID3) produces decision rules as its output. The decision rules obtained from ID3 are in if-then-else form and can be used for decision support systems, classification and prediction. The decision rules help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The function of the decision tree (ID3) is shown in figure 1.

Figure 1. 
The Function of Decision Tree (ID3) algorithm

The cluster is the input data for the decision tree (ID3) algorithm, which produces the decision rules for the cluster.

The following steps explain the Decision Tree (ID3) algorithm:

Step 1: Let S be a training set. If all instances in S are positive, then create a YES node and halt. If all instances in S are negative, create a NO node and halt. Otherwise select a feature F with values v1, ..., vn and create a decision node.

Step 2: Partition the training instances in S into subsets S1, S2, ..., Sn according to the values of V.

Step 3: Apply the algorithm recursively to each of the sets Si [13][14].

Table 3 shows the strengths and weaknesses of the ID3 algorithm.

Table 3. Strengths and Weaknesses of Decision Tree (ID3) Algorithm
Strengths | Weaknesses
It generates understandable rules. | It is less appropriate for a continuous attribute.
It performs classification without requiring much computation. | It does not perform well in problems with many classes and a small number of training examples.
It is suitable to handle both continuous and categorical variables. | The growing of a decision tree is expensive in terms of computation because it sorts each node before finding the best split.
It provides an indication for prediction or classification. | It is suitable for a single field and does not work well on non-rectangular regions.

3. Methodology

We combine two different data mining algorithms, namely the K-means clustering and the Decision tree (ID3), into one algorithm using an intelligent agent called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of clustering and interpreting the given dataset. The clusters can also be visualized by using 2D scattered graphs. The architecture of this agent system is shown in figure 2.

Figure 2. The Architecture of LIAgent System

The LIAgent is a combination of two data mining algorithms: the first is the K-means clustering algorithm and the second is the Decision tree (ID3) algorithm. 
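The ID3 steps of Section 2.2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the steps only say "select a feature F", so the sketch assumes the usual ID3 criterion of highest information gain, generalizes the YES/NO nodes to arbitrary class labels, and uses a hypothetical toy table whose attribute names (`bmi`, `tsft`) and banded values are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the records on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def id3(rows, labels, features):
    """Return a class label (leaf) or a (feature, {value: subtree}) decision node."""
    if len(set(labels)) == 1:            # Step 1: every instance has the same class
        return labels[0]
    if not features:                     # assumption: majority class when no feature is left
        return Counter(labels).most_common(1)[0][0]
    # Assumption: choose F by information gain (the standard ID3 criterion).
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):   # Step 2: partition S by value of F
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        branches[value] = id3(sub_rows, sub_labels, remaining)   # Step 3: recurse on Si
    return (best, branches)

def classify(tree, row):
    """Follow the branches matching the record until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[row[feature]]    # raises KeyError for a value never seen in training
    return tree

# Hypothetical toy records (not taken from the Diabetes dataset).
rows = [
    {"bmi": "high", "tsft": "thick"},
    {"bmi": "high", "tsft": "thin"},
    {"bmi": "high", "tsft": "thin"},
    {"bmi": "low", "tsft": "thick"},
    {"bmi": "low", "tsft": "thin"},
    {"bmi": "low", "tsft": "thin"},
]
labels = ["Cat2", "Cat2", "Cat2", "Cat2", "Cat1", "Cat1"]
tree = id3(rows, labels, ["bmi", "tsft"])
print(tree)
```

Each path from the root to a leaf of the returned tree corresponds to one if-then-else rule of the kind the LIAgent reports per cluster.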
The K-means clustering algorithm produces the clusters of the given dataset, which is the classification of that dataset, and the Decision tree (ID3) produces the decision rules for each cluster, which are useful for the interpretation of these clusters. The user can access both the clusters and the decision rules from the LIAgent. The LIAgent is used for the classification and the interpretation of the given dataset. The clusters of the LIAgent are further used for visualization using 2D scattered graphs. The Decision tree (ID3) is faster to use, generates understandable rules more easily and is simpler to explain, since any decision that is made can be understood by following the path of the decision. The rules also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in if-then-else form and can be used for decision support systems, classification and prediction.

A medical dataset, Diabetes, is used in this research paper. This is a dataset/testbed of 790 records. The data of the Diabetes dataset is pre-processed, a step called data standardization. The interval-scaled data is properly cleansed. The attributes of the dataset/testbed Diabetes are:

Number of times pregnant (NTP) (min. age = 21, max. age = 81)
Plasma glucose concentration at 2 hours in an oral glucose tolerance test (PGC)
Diastolic blood pressure (mm Hg) (DBP)
Triceps skin fold thickness (mm) (TSFT)
2-Hour serum insulin (mu U/ml) (2HSHI)
Body mass index (weight in kg/(height in m)^2) (BMI)
Diabetes pedigree function (DPF)
Age
Class (whether diabetes is cat 1 or cat 2) [15].

We create four vertical partitions of the Diabetes dataset by selecting the proper number of attributes. This is illustrated in tables 4 to 7.

Table 4. 1st Vertical partition of Diabetes Dataset
NTP | DPF | Class
4 | 0.627 | -ive
2 | 0.351 | +ive
2 | 2.288 | -ive

Table 5. 2nd Vertical partition of Diabetes Dataset
DBP | AGE | Class
72 | 50 | -ive
66 | 31 | +ive
64 | 33 | -ive

Table 6. 
3rd Vertical partition of Diabetes Dataset
TSFT | BMI | Class
35 | 33.6 | -ive
29 | 28.1 | +ive
0 | 43.1 | -ive

Table 7. 4th Vertical partition of Diabetes Dataset
PGC | 2HIS | Class
148 | 0 | -ive
85 | 94 | +ive
185 | 168 | -ive

Each partitioned table is a dataset of 790 records; only 3 records are shown as examples in each table. For the LIAgent, the number of clusters k is 4 and the number of iterations n in each case is 50, i.e. k = 4 and n = 50. The decision rules of each cluster are obtained. For the visualization of the results of these clusters, 2D scattered graphs are also drawn.

4. Results and Discussion

The results of the LIAgent are discussed in this section. The LIAgent produces two outputs, namely the clusters and the decision rules for the given dataset. A total of sixteen clusters are obtained for all four partitions, four clusters per partition. Not all the clusters are good for the classification; only the required and useful clusters are discussed further. The sixteen decision rules are also generated by the LIAgent. We present the decision rules of three different clusters. The number of decision rules varies from cluster to cluster; it depends upon the number of records in the cluster.

The Decision Rules of the 4th partition of the dataset Diabetes

Rule 1: if PGC = 165 then Class = Cat2 else
Rule 2: if PGC = 153 then Class = Cat2 else
Rule 3: if PGC = 157 then Class = Cat2 else
Rule 4: if PGC = 139 then Class = Cat2 else
Rule 5: if HIS = 545 then Class = Cat2 else
Rule 6: if HIS = 744 then Class = Cat2 else Class = Cat1

There are only six decision rules for the 4th partition of the dataset. 
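Read as a program, the rule chain of the 4th partition is an ordered list of equality tests with a default class. The sketch below is only an illustration of that reading: the attribute names `PGC` and `HIS` and the tested values are copied from the rules as printed, while the function name `apply_rules` and the record layout are assumptions made for the example.

```python
# Ordered (attribute, value, class) triples copied from Rules 1-6 of the 4th partition.
RULES_PARTITION_4 = [
    ("PGC", 165, "Cat2"),
    ("PGC", 153, "Cat2"),
    ("PGC", 157, "Cat2"),
    ("PGC", 139, "Cat2"),
    ("HIS", 545, "Cat2"),
    ("HIS", 744, "Cat2"),
]
DEFAULT_CLASS = "Cat1"   # the final "else Class = Cat1" branch

def apply_rules(record, rules=RULES_PARTITION_4, default=DEFAULT_CLASS):
    """Return the class of the first rule whose equality test matches the record."""
    for attribute, value, label in rules:
        if record.get(attribute) == value:
            return label
    return default

# Hypothetical records, used only to exercise the chain.
print(apply_rules({"PGC": 165, "HIS": 0}))
print(apply_rules({"PGC": 148, "HIS": 0}))
```

Because every test is an exact match on a single value, a record that differs only slightly from a listed value falls through to the default class.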
It is easy for anyone to take the decision and interpret the results of this cluster.

The Decision Rules of the 1st partition of the dataset Diabetes

Rule 1: if DPF = 1.32 then Class = Cat1 else
Rule 2: if DPF = 2.29 then Class = Cat1 else
Rule 3: if NTP = 2 then Class = Cat2 else
Rule 4: if DPF = 2.42 then Class = Cat1 else
Rule 5: if DPF = 2.14 then Class = Cat1 else
Rule 6: if DPF = 1.39 then Class = Cat1 else
Rule 7: if DPF = 1.29 then Class = Cat1 else
Rule 8: if DPF = 1.26 then Class = Cat1 else Class = Cat2

There are eight decision rules for the 1st partition of the dataset. The decision rules make the cluster easy to interpret and also help to take the decision.

The Decision Rules of the 3rd partition of the dataset Diabetes

Rule 1: if BMI = 29.9 then Class = Cat1 else
Rule 2: if BMI = 32.9 then Class = Cat1 else
Rule 3: if TSFK = 23 then
    Rule 4: if BMI = 25.5 then Class = Cat1 else
    Rule 5: if BMI = 30.1 then Class = Cat1 else
    Rule 6: if BMI = 28.4 then Class = Cat1 else Class = Cat2
else
Rule 7: if BMI = 22.9 then Class = Cat1 else
Rule 8: if BMI = 27.6 then Class = Cat1 else
Rule 9: if BMI = 29.7 then Class = Cat1 else
Rule 10: if BMI = 27.1 then Class = Cat1 else
Rule 11: if BMI = 25.8 then Class = Cat1 else
Rule 12: if BMI = 28.9 then Class = Cat1 else
Rule 13: if BMI = 23.4 then Class = Cat1 else
Rule 14: if BMI = 30.5 then
    Rule 15: if TSFK = 18 then Class = Cat2 else Class = Cat1
else
Rule 16: if BMI = 26.6 then
    Rule 17: if TSFK = 18 then Class = Cat2 else Class = Cat1
else
Rule 18: if BMI = 32 then
    Rule 19: if TSFK = 15 then Class = Cat2 else Class = Cat1
else
Rule 20: if BMI = 31.6 then Class = Cat2 , Cat1 else Class = Cat2

There are twenty decision rules for the 3rd partition of the dataset. The number of rules for this cluster is higher than for the other two clusters discussed.

Visualization is an important tool which provides a better understanding of the data and illustrates the relationship among the attributes of the data. For the visualization of the clusters, 2D scattered graphs are drawn for all the clusters. 
We present four 2D scattered graphs of four different clusters from different partitions.

Figure 3. 2D Scattered Graph between NTP and DPF attributes of Diabetes dataset

The distance between the NTP and DPF attributes of the Diabetes dataset varies at the beginning of the graph, but after some interval the distance becomes constant.

Figure 4. 2D Scattered Graph between DBP and AGE attributes of Diabetes dataset

There is a variable distance between the DBP and AGE attributes of the dataset. It remains variable throughout this graph.

Figure 5. 2D Scattered Graph between TSFT and BMI attributes of Diabetes dataset

The graph shows an almost constant distance between the TSFT and BMI attributes of the dataset. It remains constant throughout the graph.

Figure 6. 2D Scattered Graph between PGC and 2HIS attributes of Diabetes dataset

There is a variable distance between the PGC and 2HIS attributes of the dataset, but in the middle of this graph there is some constant distance between these attributes. The structure of this graph is similar to the graph of figure 5.

5. Conclusion

It is not simple for all users to interpret the clusters and extract the required results from them unless some other data mining algorithms or tools are used. In this research paper we have tried to address the issue by integrating the K-means clustering algorithm with the Decision tree (ID3) algorithm. ID3 was chosen because its output is a set of decision rules in if-then-else form, which are easy to understand and help to take the decision. The result is a hybrid combination of supervised and unsupervised machine learning, using an intelligent agent, called the LIAgent. The LIAgent is helpful in the classification and prediction of the given dataset. Furthermore, 2D scattered graphs of the clusters are drawn for the visualization.
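For readers who want to experiment with the clustering half of the LIAgent, the K-means steps of Section 2.1 can be sketched in plain Python. This is a sketch under stated assumptions, not the authors' code: the Range Method of equations 1 and 2 is not reproduced here, so the initial centroids are simply the first k points; the function names are invented; and the example runs on the three (DBP, AGE) sample rows of Table 5 with k = 2, whereas the paper uses k = 4 and 50 iterations on the full dataset.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples (equation 3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iterations):
    """Return (centroids, assignment) after at most max_iterations passes."""
    # ASSUMPTION: seed with the first k points instead of the paper's Range Method.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None   # stays None if max_iterations is 0
    for _ in range(max_iterations):
        # Step 3: assign every sample to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:    # cluster memberships have stabilized
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                     # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# (DBP, AGE) pairs from the three sample rows of Table 5.
sample_points = [(72, 50), (66, 31), (64, 33)]
centroids, assignment = kmeans(sample_points, k=2, max_iterations=50)
print(centroids, assignment)
```

The records that share a cluster index in `assignment` form one cluster, which is the input the Decision tree (ID3) step then turns into decision rules.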