January 13, 2016

Speaker: Dr. Ken Pu, Ontario Tech University

Title: Towards Internet Scale Open Data Integration

Abstract: We consider the problem of searching open data repositories. We abstract a dataset as a collection of attributes each with a domain of data values. Attributes correspond to columns in tables, and tree paths in semi-structured data. We are interested in scenarios in which queries themselves are at- tributes, and search results are candidate attributes whose domains overlap with the query’s. The relevance measure of an attribute X given a query Q is the containment, defined as |X ∩ Q|/|Q|. The challenges of open data search are incomplete and poor quality schema information, massive numbers of attributes, distributed storage, and high network latency. Furthermore, based on our empirical observation, in real-life open datasets, attributes are very skewed in domain size, consisting of the very small and very large attributes.

To address these challenges while maintaining high quality and very fast search, our method is based on locality sensitive hashing of minhash signatures, which are small fixed-sized summaries of attributes. Traditional LSH can be used to find attributes using a resemblance threshold, however resemblance performs badly on skewed domains. We present a novel partitioning strategy that permits LSH to be used with a containment threshold (a metric that performs well over skewed data and large query sizes). We show that given a query size, there is an optimal partitioning of a given cardinality distribution that minimizes the false positives returned by the index, and hence maximize the search performance. When the distribution follows the power law and most query sizes are not near the maximum attribute size, we provide a partitioning that approximates the optimal partitioning and is query independent. LSH Ensembles build on a result in LSH Forests showing that the LSH parameters can be dynamically tuned. We extend this result by providing optimal parameters given an attribute cardinality distribution and query threshold, for the containment search problem. We use this to create an optimally tuned dynamic LSH for each partition.

This work is done in collaboration with Eric Zhu, Fatemeh Nargesian and Renee Miller at University of Toronto.