Universiti Teknologi Malaysia Institutional Repository

Big data processing on educational data mining using pyspark with jupyter notebook

Ravichandran, Vinitha (2018) Big data processing on educational data mining using pyspark with jupyter notebook. Masters thesis, Universiti Teknologi Malaysia.

Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

The rapid advancement of information technology brings new challenges and puts new demands on our education system. The process of teaching and learning has moved from the classroom to Computer Aided Learning (CAL) systems. Big data technology and machine learning play an important role in CAL systems because of the massive amount of data these systems generate. This has led to the rapid development of data mining in education, known as Educational Data Mining (EDM). The abundance of data collected by such a system can be used to analyse, predict and solve many societal issues in the education field, such as improving the quality of education and predicting and monitoring educational outcomes. Effectively analysing or predicting the future growth of students’ performance can make a CAL system a better platform for learning than traditional learning. Machine learning techniques were used to obtain reliable and accurate predictions of students’ performance. Apache Hadoop was the backbone of big data technology until the emergence of Apache Spark. However, only a few studies have applied Apache Spark to EDM. In this dissertation, PySpark was integrated with Jupyter Notebook to perform EDM on the Educational Process Mining (EPM) data set. Spark MLlib was used to compare four classification algorithms, namely Logistic Regression, Naïve Bayes, Decision Tree and Random Forest, on the EPM data set. In this study, the Random Forest classifier outperformed the other classifiers in Accuracy, Area Under the Precision-Recall (PR) curve and Area Under the Receiver Operating Characteristic (ROC) curve, although with a slightly slower Execution Time. The Random Forest classifier is therefore the best of the compared classifiers when dealing with EDM.
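
The abstract compares classifiers by Accuracy, Area Under the PR curve and Area Under the ROC curve. As a rough illustration of what these three measures compute, the sketch below evaluates them in plain Python for a binary classifier. It is not the thesis code and not Spark MLlib's implementation: the function names, the trapezoidal integration and the choice to anchor the PR curve at recall 0 with the first precision value are all assumptions made for this example, and the labels and scores are made-up toy values.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def _cumulative_counts(labels, scores):
    """Sweep the decision threshold from the highest score to the lowest.

    Returns one (true_positives, false_positives) pair per distinct score,
    counting every example whose score is at or above that threshold.
    """
    pairs = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Tied scores are consumed together so they form a single curve point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp, fp))
    return points


def _trapezoid(xs, ys):
    """Area under the piecewise-linear curve through the points (xs, ys)."""
    return sum(
        (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
        for i in range(1, len(xs))
    )


def area_under_roc(labels, scores):
    """Area under the ROC curve; assumes both classes occur in labels."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = _cumulative_counts(labels, scores)
    fpr = [0.0] + [fp / neg for _, fp in points]
    tpr = [0.0] + [tp / pos for tp, _ in points]
    return _trapezoid(fpr, tpr)


def area_under_pr(labels, scores):
    """Area under the PR curve; assumes at least one positive label."""
    pos = sum(labels)
    points = _cumulative_counts(labels, scores)
    recall = [tp / pos for tp, _ in points]
    precision = [tp / (tp + fp) for tp, fp in points]
    # Assumption: start the curve at recall 0 with the first precision value.
    return _trapezoid([0.0] + recall, [precision[0]] + precision)


if __name__ == "__main__":
    # Hypothetical toy data: 1 = positive class, 0 = negative class.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    predictions = [1 if s >= 0.5 else 0 for s in scores]
    print("accuracy:", accuracy(labels, predictions))
    print("area under ROC:", area_under_roc(labels, scores))
    print("area under PR:", area_under_pr(labels, scores))
```

Accuracy depends on a single decision threshold (0.5 in the toy example), whereas the two area measures summarise ranking quality across all thresholds, which is one reason a comparison may report all three.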

Item Type: Thesis (Masters)
Uncontrolled Keywords: apache spark, precision-recall, educational process mining
Divisions: Computing
ID Code: 81375
Deposited By: Narimah Nawil
Deposited On: 23 Aug 2019 04:06
Last Modified: 23 Aug 2019 04:06
