Skip to main content
Ontario Tech acknowledges the lands and people of the Mississaugas of Scugog Island First Nation.

We are thankful to be welcome on these lands in friendship. The lands we are situated on are covered by the Williams Treaties and are the traditional territory of the Mississaugas, a branch of the greater Anishinaabeg Nation, including Algonquin, Ojibway, Odawa and Pottawatomi. These lands remain home to many Indigenous nations and peoples.

We acknowledge this land out of respect for the Indigenous nations who have cared for Turtle Island, also called North America, from before the arrival of settler peoples until this day. Most importantly, we acknowledge that the history of these lands has been tainted by poor treatment and a lack of friendship with the First Nations who call them home.

This history is something we are all affected by because we are all treaty people in Canada. We all have a shared history to reflect on, and each of us is affected by this history in different ways. Our past defines our present, but if we move forward as friends and allies, then it does not have to define our future.

Learn more about Indigenous Education and Cultural Services

January 13, 2016

Speaker: Dr. Ken Pu, Ontario Tech University

Title: Towards Internet Scale Open Data Integration

Abstract: We consider the problem of searching open data repositories. We abstract a dataset as a collection of attributes each with a domain of data values. Attributes correspond to columns in tables, and tree paths in semi-structured data. We are interested in scenarios in which queries themselves are at- tributes, and search results are candidate attributes whose domains overlap with the query’s. The relevance measure of an attribute X given a query Q is the containment, defined as |X ∩ Q|/|Q|. The challenges of open data search are incomplete and poor quality schema information, massive numbers of attributes, distributed storage, and high network latency. Furthermore, based on our empirical observation, in real-life open datasets, attributes are very skewed in domain size, consisting of the very small and very large attributes.

To address these challenges while maintaining high quality and very fast search, our method is based on locality sensitive hashing of minhash signatures, which are small fixed-sized summaries of attributes. Traditional LSH can be used to find attributes using a resemblance threshold, however resemblance performs badly on skewed domains. We present a novel partitioning strategy that permits LSH to be used with a containment threshold (a metric that performs well over skewed data and large query sizes). We show that given a query size, there is an optimal partitioning of a given cardinality distribution that minimizes the false positives returned by the index, and hence maximize the search performance. When the distribution follows the power law and most query sizes are not near the maximum attribute size, we provide a partitioning that approximates the optimal partitioning and is query independent. LSH Ensembles build on a result in LSH Forests showing that the LSH parameters can be dynamically tuned. We extend this result by providing optimal parameters given an attribute cardinality distribution and query threshold, for the containment search problem. We use this to create an optimally tuned dynamic LSH for each partition. 

This work is done in collaboration with Eric Zhu, Fatemeh Nargesian and Renee Miller at University of Toronto.