|Keywords:||k-clique; Apache Spark; Enumeration; Distributed; Graph Mining|
|Full text PDF:||http://hdl.handle.net/1805/12282|
Indiana University-Purdue University Indianapolis (IUPUI) Network analysis is an important research task which explains the relationshipsamong various entities in a given domain. Most of the existing approaches of networkanalysis compute global properties of a network, such as transitivity, diameter, andall-pair shortest paths. They also study various non-random properties of a network,such as graph densifi cation with shrinking diameter, small diameter, and scale-freeness.Such approaches enable us to understand real-life networks with global properties.However, the discovery of the local topological building blocks within a networkis an important task, and examples include clique enumeration, graphlet counting,and motif counting. In this paper, my focus is to fi nd an efficient solution of k-cliqueenumeration problem. A clique is a small, connected, and complete induced subgraphover a large network. However, enumerating cliques using sequential technologies isvery time-consuming. Another promising direction that is being adopted is a solutionthat runs on distributed clusters of machines using the Hadoop mapreduceframework. However, the solution suffers from a general limitation of the framework,as Hadoop's mapreduce performs substantial amounts of reading and writing to disk.Thus, the running times of Hadoop-based approaches suffer enormously. To avoidthese problems, we propose an e cient, scalable, and distributed solution, kc-spark, for enumerating cliques in real-life networks using the Apache Spark in-memory clustercomputing framework. Experiment results show that kc-spark can enumeratek-cliques from very large real-life networks, whereas a single commodity machine cannotproduce the same desired result in a feasible amount of time. We also comparedkc-spark with Hadoop mapreduce solutions and found the algorithm to be 80-100percent faster in terms of running times. On the other hand, we compared with thetriangle enumeration with Hadoop mapreduce and results shown that kc-spark is8-10 times faster than mapreduce implementation with the same cluster setup. Furthermore,the overall performance of kc-spark is improved by using Spark's inbuiltcaching and broadcast transformations.Advisors/Committee Members: Al Hasan, MOHAMMAD.