Making Sense out of Large Graphs:

Bridging HCI with Data Mining

 

Christos Faloutsos (PI), Aniket Kittur (co-PI)

 

Funding Agency: NSF Award #IIS-1217559

http://www.nsf.gov/awardsearch/showAward?AWD_ID=1217559

 

Our long-term goal for this work is to develop new techniques that combine information visualization and data mining to support the interactive analysis of large-scale graphs on the order of hundreds of thousands of nodes and edges. Such large-scale graphs are becoming increasingly common in domains ranging from science to government to the enterprise. Examples include online social networks (who is connected to whom), network traffic (what computers are communicating with each other), intelligence analysis (who is communicating with whom), and online auctions (who is buying from whom).

 

Data mining techniques for large-scale graphs have seen significant success in recent years; however, there have been few efforts to use scalable and efficient machine learning algorithms to help end users interactively understand information. Conversely, principles of human-computer interaction devised to promote user insight (such as the well-known “overview, zoom & filter, details-on-demand” pattern in visualization, and reliance on pre-attentive perceptual skills) have trouble scaling to enormously rich information environments with millions of items, which is far more than the number of pixels available on many displays and more than can be easily handled at interactive speed today.

 

From a scientific standpoint, we seek to bridge the gap between the fields of information visualization and data mining. While both of these fields seek to develop techniques for analyzing and understanding large data sets, there has historically been little interaction between them. As Shneiderman notes, “these tools have been developed by largely separate communities with different philosophies”. By augmenting the cognitive capacity of individuals with novel visualization methods and large-scale machine learning techniques, we aim to combine the “best of both worlds” to promote learning, understanding, insight, and discovery.

Interactive Visualization

Figure 1. The CrowdScape interface. (A) is a scatter plot of aggregate behavioral features. Brush on the plot to filter behavioral traces. (B) shows the distribution of each aggregate feature. Brush on the distribution to filter traces based on a range of values. (C) shows behavioral traces for each worker/output pair. Mouse over a trace to explore a particular worker’s products. (D) encodes the range of worker outputs. Brush on each axis to select a subset of entries. Next to (B) is a control panel where users can switch between parallel coordinates and a textual view of worker outputs (left buttons), put workers into groups (colored right buttons), or find points similar to their colored groups (two middle button sets).

Multitouch visualization

With highly multivariate data, desktop-based systems make it difficult to examine more than two dimensions at a time because of their structured approach. Tablets provide a different set of affordances from desktops and might allow us to interact with data in new ways. We have combined multitouch gestures on a tablet with physics-driven models to create new interaction techniques that help users explore more dimensions and make deeper analyses. Figures 1 and 2 below demonstrate some of the utility of these techniques. This research was published as an extended abstract at CHI 2013, where it was also shown as a demo and as part of the video showcase.

Graph mining algorithms

We are also working on automatically determining important structures in a graph. The idea is to use the Minimum Description Length (MDL) principle to describe ‘important’ subgraphs. A subgraph is ‘important’ if we can compress it easily: for example, a clique of n nodes can be easily compressed, and so can a chain or a star. Ms. Danai Koutra is working on the topic, developing scalable heuristics to solve the combinatorial problem of subgraph selection. Once such subgraphs are chosen, we plan to show them to the user in a compressed form, for example using a ‘box’ glyph to represent a clique or a ‘star’ glyph to represent a star.
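To make the compression idea concrete, the following is a minimal sketch in Python. It is not the project's actual encoding: the cost model is a simplifying assumption in which naming one node costs log2(N) bits for a graph of N nodes, an explicit edge costs two node names, and a structure is described by naming its members plus any edges that deviate from the ideal shape. The function names (`raw_cost`, `clique_cost`, `star_cost`, `best_description`) are hypothetical.

```python
import math
from itertools import combinations

# Assumption: a deliberately simplified cost model, not the project's encoding.
# Naming one node costs log2(n_total) bits; listing one edge costs two names.

def node_bits(n_total):
    """Bits needed to name one node out of n_total."""
    return math.log2(n_total)

def raw_cost(edges, n_total):
    """Bits to list every edge explicitly as a pair of node names."""
    return len(edges) * 2 * node_bits(n_total)

def clique_cost(nodes, edges, n_total):
    """Bits to name the members, plus edges that deviate from a full clique."""
    ideal = {frozenset(pair) for pair in combinations(nodes, 2)}
    errors = ideal ^ edges  # missing or extra edges
    return len(nodes) * node_bits(n_total) + raw_cost(errors, n_total)

def star_cost(nodes, edges, n_total):
    """Bits to name hub and spokes, plus deviations, for the best hub choice."""
    best = math.inf
    for hub in nodes:
        ideal = {frozenset((hub, v)) for v in nodes if v != hub}
        errors = ideal ^ edges
        cost = len(nodes) * node_bits(n_total) + raw_cost(errors, n_total)
        best = min(best, cost)
    return best

def best_description(nodes, edges, n_total):
    """Return (name, bits) of the cheapest description of the subgraph."""
    costs = {
        "raw": raw_cost(edges, n_total),
        "clique": clique_cost(nodes, edges, n_total),
        "star": star_cost(nodes, edges, n_total),
    }
    name = min(costs, key=costs.get)
    return name, costs[name]

if __name__ == "__main__":
    members = [0, 1, 2, 3, 4]
    full = {frozenset(pair) for pair in combinations(members, 2)}
    print(best_description(members, full, n_total=1024))
```

Under this toy model, a fully connected set of nodes is cheapest to describe as a clique, because naming its members replaces listing every edge. A subgraph that matches no template well falls back to the raw edge list and would not be flagged as ‘important’.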

Figure 3. Wikipedia editors in an "edit war" on the spelling of "Kiev".

Publications

 

Danai Koutra, Yu Gong, Sephira Ryman, Rex Jung, Joshua Vogelstein, Christos Faloutsos. Are all brains wired equally? OHBM 2013, Seattle, WA, June 2013.

 

Danai Koutra, U Kang, Jilles Vreeken, Christos Faloutsos. Under double blind review.

 

Rzeszotarski, J., Kittur, A. (2012). CrowdScape: Interactively visualizing user behavior and output. UIST 2012: Proceedings of the ACM Symposium on User Interface Software and Technology. New York: ACM Press.

 

Rzeszotarski, J., Kittur, A. (2013). TouchViz: (Multi)Touching multivariate data. CHI 2013 Video showcase.

 

Rzeszotarski, J., Kittur, A. (2013). TouchViz: (Multi)Touching multivariate data. CHI 2013 Demo.

 

Rzeszotarski, J., Kittur, A. (2013). TouchViz: (Multi)Touching multivariate data. CHI 2013 Extended Abstracts.

 

One area in which it is challenging to make sense of large data is understanding how people interact with an interface: each mouse movement, click, scroll, or keypress generates potentially useful information, but the volume can quickly become overwhelming. Mining such data could be especially useful for the growing field of crowdsourcing, in which employers have little control over or visibility into how crowd workers accomplish tasks and often face quality control issues. We developed a novel system (CrowdScape) that helps researchers visualize the behavioral traces of crowd workers and interactively group them into clusters using machine learning. Each worker’s behavior (clicks, scrolls, typing, delays, etc.) is summarized in a compact row, allowing many workers to be easily compared and made sense of, with dynamic queries providing an interactive overview and filtering mechanism. Furthermore, once a worker’s behavior has been identified as high or low quality, CrowdScape uses machine learning models to propagate these labels to similar work. This research won the Best Paper award at UIST 2012.
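The label propagation step can be illustrated with a small Python sketch. The description above does not say which machine learning model CrowdScape uses, so this example substitutes a plain k-nearest-neighbor vote as an assumption. The feature vectors, worker ids, and the function name `propagate_labels` are hypothetical.

```python
import math
from collections import Counter

def propagate_labels(features, labeled, k=3):
    """Assign each unlabeled worker the majority label of its k nearest
    labeled neighbors in feature space (Euclidean distance).

    features: dict of worker id -> tuple of numeric behavioral features
    labeled:  dict of worker id -> label, e.g. "high" or "low"
    Assumption: k-NN stands in for whatever model the real system uses.
    """
    result = dict(labeled)
    for worker, vec in features.items():
        if worker in labeled:
            continue
        # Rank the labeled workers by distance to this worker.
        nearest = sorted(
            labeled, key=lambda other: math.dist(vec, features[other])
        )[:k]
        votes = Counter(labeled[other] for other in nearest)
        result[worker] = votes.most_common(1)[0][0]
    return result

if __name__ == "__main__":
    # Hypothetical features: (minutes on task, keypresses in hundreds)
    features = {
        "w1": (1.0, 1.0), "w2": (1.2, 0.9),
        "w3": (9.0, 9.0), "w4": (9.1, 8.8),
        "w5": (1.1, 1.0), "w6": (8.9, 9.2),
    }
    labeled = {"w1": "low", "w2": "low", "w3": "high", "w4": "high"}
    print(propagate_labels(features, labeled, k=3))
```

In practice the features would need to be scaled to comparable ranges before computing distances, and an analyst would review the propagated labels in the visualization rather than accept them blindly.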