Proposal

Summary

We are going to develop a fast parallel closeness centrality implementation for large weighted graphs using OpenMP, and apply it to graphs mined via the Facebook Graph API. We will generate the graphs from real Facebook data and weight their edges by applying a friendship ranking metric.

Background

Closeness centrality is a metric that expresses the average social distance from each individual to every other individual in the network. If we divide 1 by the average shortest-path distance from an individual to all other individuals in the network, we have calculated their closeness centrality. In this way, an individual with a direct tie to everyone else ends up with a closeness score of 1. One property of closeness centrality is that it tends to give high scores to individuals who are near the center of local clusters (also known as network communities) in an overall larger network. Individuals with high closeness centrality tend to be important influencers within their local network community.
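In formula form (the standard normalized definition, consistent with the description above, where d(v, u) is the shortest-path distance between v and u, and n is the number of individuals):

```latex
C(v) = \frac{n - 1}{\sum_{u \neq v} d(v, u)}
```

With unit-weight ties, an individual directly connected to all n - 1 others has every d(v, u) = 1, giving C(v) = 1, which matches the example above.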

The Challenge

Finding shortest paths from a node to all other nodes is a challenging task in a large weighted graph, because of both space and time constraints. There is no single best algorithm for it, and the conventional algorithms are sequential. In-memory space is limited, so we need to do many disk reads to compute over the entire graph. We need to hide the latency of these reads efficiently, and also improve the compute time such that, even with the additional latency, we still see a speedup.
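As a starting point, the simplest decomposition parallelizes over sources: each node's single-source shortest paths are independent of every other node's, so the outer loop can be split across threads. Below is a minimal in-memory sketch of this idea, assuming an adjacency-list graph and a sequential Dijkstra per source; all names are illustrative, and our real implementation will need the disk-streaming and latency-hiding machinery described above.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Weighted adjacency list: adj[v] holds (neighbor, edge weight) pairs.
using Graph = std::vector<std::vector<std::pair<int, double>>>;

// Plain sequential Dijkstra from one source; returns distances to all nodes.
std::vector<double> dijkstra(const Graph& adj, int src) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(adj.size(), INF);
    using Entry = std::pair<double, int>;  // (distance, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    dist[src] = 0.0;
    pq.push({0.0, src});
    while (!pq.empty()) {
        auto [d, v] = pq.top();
        pq.pop();
        if (d > dist[v]) continue;  // stale queue entry, already improved
        for (auto [u, w] : adj[v]) {
            if (dist[v] + w < dist[u]) {
                dist[u] = dist[v] + w;
                pq.push({dist[u], u});
            }
        }
    }
    return dist;
}

// Closeness for every node. The outer loop over sources is embarrassingly
// parallel: each iteration only writes its own slot of the result vector.
std::vector<double> closeness_all(const Graph& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<double> c(n, 0.0);
    #pragma omp parallel for schedule(dynamic)
    for (int v = 0; v < n; ++v) {
        const std::vector<double> dist = dijkstra(adj, v);
        double sum = 0.0;
        for (int u = 0; u < n; ++u)
            if (u != v) sum += dist[u];
        // Unreachable nodes make sum infinite, which yields closeness 0.
        c[v] = (sum > 0.0) ? (n - 1) / sum : 0.0;
    }
    return c;
}
```

Compiling with g++ -fopenmp enables the pragma; schedule(dynamic) is a reasonable choice here because the per-source work varies with graph structure.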

Mining graphs from Facebook is not something that is available off-the-shelf, so we will need to develop a custom application for our needs. In addition, since we want weighted graphs, we will need to add a metric that ranks friendships.

Resources

There is some information available on parallelizing shortest path algorithms, such as this presentation on Dijkstra's algorithm. For mining Facebook graphs, we will use the Facebook Graph API. We will start off with a scripting language like Python and save the graphs to disk, and later we may port the code to C++ to pipeline mining into the computation. For weighting the graphs, we plan to base the metric on this algorithm.

We plan to use the GHC machines for development and testing. We will program in C++ and use OpenMP for parallelization. We will also try the C++ Boost library's graph facilities, and good old pthreads.

Goals and Deliverables

Plan to achieve:

A correct sequential closeness centrality baseline, a parallel C++/OpenMP implementation that shows clear speedup over it on large weighted graphs, and a mining application that generates weighted graphs from real Facebook data.

Hope to achieve:

Porting the mining code to C++ so that mining can be pipelined directly into the computation, and scaling the implementation to the largest graphs we can mine.

Demo:

Speedup graphs comparing the parallel implementation against the sequential baseline on mined Facebook graphs of varying sizes.

Platform Choice

Parallelizing the algorithm on a multicore CPU should provide speedup over the sequential version because the many per-source shortest path calculations are independent and can run concurrently. The GHC cluster has machines with high core counts, and we will use C++, OpenMP, and pthreads to keep all of those cores busy.
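If OpenMP's scheduling turns out to be a poor fit, the same source-parallel decomposition is straightforward to express with pthreads directly. Here is a minimal sketch reusing the Graph type and dijkstra routine from the earlier example; the static partition of sources is an illustrative choice, not a final design.

```cpp
#include <pthread.h>
#include <vector>

// Read-only input plus a half-open range of sources for one thread. Each
// thread writes a disjoint slice of the result, so no locking is needed.
struct WorkerArgs {
    const Graph* adj;              // Graph and dijkstra() from the sketch above
    std::vector<double>* result;
    int begin;
    int end;
};

void* closeness_worker(void* p) {
    auto* a = static_cast<WorkerArgs*>(p);
    const int n = static_cast<int>(a->adj->size());
    for (int v = a->begin; v < a->end; ++v) {
        const std::vector<double> dist = dijkstra(*a->adj, v);
        double sum = 0.0;
        for (int u = 0; u < n; ++u)
            if (u != v) sum += dist[u];
        (*a->result)[v] = (sum > 0.0) ? (n - 1) / sum : 0.0;
    }
    return nullptr;
}

std::vector<double> closeness_pthreads(const Graph& adj, int num_threads) {
    const int n = static_cast<int>(adj.size());
    std::vector<double> result(n, 0.0);
    std::vector<pthread_t> tids(num_threads);
    std::vector<WorkerArgs> args(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        // Static block partition of the sources, purely for illustration.
        args[t] = {&adj, &result, t * n / num_threads, (t + 1) * n / num_threads};
        pthread_create(&tids[t], nullptr, closeness_worker, &args[t]);
    }
    for (pthread_t& tid : tids) pthread_join(tid, nullptr);
    return result;
}
```

A static partition keeps the sketch short; in practice, a shared work queue would balance the uneven per-source cost better, which is one of the trade-offs we intend to measure.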

Schedule

Week of  | Things to do
April 10 | Research papers/web on possible parallelization approaches.
April 17 | Implement sequential algorithm, start working on parallel version and the mining application, complete checkpoint writeup.
April 24 | Have a working parallel implementation, test on mined graphs, iterate.
May 1    | Generate large graphs, produce final parallel implementation, collect speedup data.
May 8    | Wrap up, stretch goals, final report.