An Analysis of Machine Learning Algorithms for Condensing Reverse Engineered Class Diagrams
Hafeez Osman, Michel R.V. Chaudron and Peter van der Putten
Leiden University, Leiden, the Netherlands
Chalmers University of Technology and Goteborg University, Gothenburg, Sweden
Luiz Paulo Coelho Ferreira

Introduction
• Up-to-date design documentation is important.
• UML models created during design are often poorly kept up to date during development and maintenance.
• For legacy software, up-to-date designs are valuable for maintaining the system but are hard to find.
• This paper is partially motivated by a scenario where new programmers want to join a development team.

Research Problem
• This paper specifically aims at providing suitable classification algorithms to decide which classes should be included in a class diagram.
• The authors seek an automated approach to classify the key classes in a class diagram.

Contribution
• They explore 9 classification algorithms for predicting key classes that should be included in a class diagram.
• They evaluate the algorithms on 9 open source systems, ranging from 59 to 903 classes.

Research Questions
• RQ1: Which individual predictors are influential for the classification?
• RQ2: How robust is the classification to the inclusion of categories of predictors?
• RQ3: What are suitable classification algorithms in classifying key classes?

Machine Learning
• Univariate Analysis
  • Identifies which predictors have the most influence
• Machine Learning Classification Algorithms:
  • J48 Decision Tree, k-Nearest Neighbor, Logistic Regression, Naive Bayes, Decision Tables, Decision Stumps, Radial Basis Function Networks, Random Forests and Random Trees.

Machine Learning
• Evaluation Method:
  • The univariate analysis used the InfoGain Attribute Evaluator (InfoGain).
  • The classification algorithms were evaluated by the Area Under the ROC Curve (AUC).
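
The two evaluation measures named above can be illustrated with a minimal sketch. It is not the WEKA implementation used in the study: it assumes the predictor has already been discretized into categories, and the helper names (`entropy`, `info_gain`, `auc`) and the toy data are hypothetical, chosen only to show what each measure computes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(values, labels):
    """Label entropy minus the entropy left after splitting on a discrete predictor."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (key class) or 0.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration, not taken from the study):
# 1 = key class, 0 = non-key class.
is_key = [1, 1, 0, 0]
coupling_level = ["high", "high", "low", "low"]  # hypothetical discretized predictor
classifier_scores = [0.9, 0.4, 0.5, 0.1]         # hypothetical classifier outputs

print(info_gain(coupling_level, is_key))  # 1.0
print(auc(classifier_scores, is_key))     # 0.75
```

On the toy data, a predictor that separates key from non-key classes perfectly reaches the maximum gain of 1 bit, and a classifier that ranks three of the four positive/negative pairs correctly reaches an AUC of 0.75. An AUC of 0.5 corresponds to random ranking.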

Approach
• Examined Predictors and Tools
• Case Studies
• Process

Predictors and Tools
• Reverse Engineering:
  • MagicDraw
• Software Metrics:
  • SDMetrics
• Data Mining:
  • WEKA

Case Studies
• Criteria:
  • Open source project
  • Must have a forward design class diagram
  • 50+ classes

Process

Evaluation
• RQ1: Which individual predictors are influential for the classification?

Evaluation
• RQ2: How robust is the classification to the inclusion of categories of predictors?

Evaluation
• RQ3: What are suitable classification algorithms in classifying key classes?

Discussion and Future Work
• Export Coupling Parameter (EC Par), Dependency In (Dep In) and Number of Operations (NumOps) were the most influential predictors.
• k-NN(5) and Random Forest were the best algorithms, and they can be combined to find better solutions.
• The approach was not able to produce high AUC values.
• Different metrics could be used.
• The "ground truth" could evolve to be iterative or be derived from version control mining.

Threats to Validity
• This study assumed that all the classes that existed in the forward designs were the important classes.
• The input of this study depends on the MagicDraw CASE tool.
• The study covers only 9 open source case studies.

Conclusion
• They propose an approach for condensing reverse engineered class diagrams by selecting the key classes in them.
• The study evaluates the influential predictors in classifying key classes and compares various machine learning classification algorithms on 9 case studies.
• Export Coupling Parameter, Dependency In and Number of Operations are the most influential predictors for predicting key classes.
• On these predictor sets, Random Forest and k-Nearest Neighbor provided the best results.

Questions?