Publications

On Relationships-centric Views of Semantics: A Brief Research Review

"The process of tying two items together is the important thing," wrote Vannevar Bush in his seminal article "As We May Think" (The Atlantic Monthly, July 1945). He pointed out the inadequacy of contemporary indexing systems at mimicking the "natural" way we as humans seek out information. He stressed that the human mind works by association. "It [the brain] operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association..." he added.

Two decades of our research on the semantics of information have been influenced by the importance of relationships, as underlined by Dr. Bush and several other visionaries (e.g., William Woods' "What's in a Link" [Woods1975]). We have recognized that relationships are at the heart of semantics [Sheth2002], observed the changing focus from documents to entities and on to relationships [Sheth2003], and investigated a broad variety of issues related to modeling, validating, discovering and exploiting various types of relationships between entities in content [Sheth+2003].

Our earliest focus on relationships was in terms of mappings to deal with semantic heterogeneity and so achieve semantic interoperability and schema integration [Sheth-Larson1990, Sheth+1988, BERDI]. Over a decade ago, we introduced a comprehensive definition of semantic proximity [Sheth-Kashyap 92, Kashyap-Sheth 96] to address the difficulty of modeling a notion that is also termed semantic similarity or semantic distance. Semantic metadata is key both to the Semantic Web and to techniques that support semantic relationships. An early work on domain-specific metadata annotation and search, which led to the earliest commercial product of this type (from Bellcore in 1995), came from our InfoHarness system [Shklar+1995] [Shah-Sheth99].

The VisualHarness system extended that work beyond textual data to include image data [Sheth+1999]. In 1996, we introduced the concept of MREF (Metadata REFerence link) for associating semantic metadata with physical links represented using HREF [Sheth-Kashyap1996, Shah-Sheth1998]. This proposal seeks to realize a second generation of semantic relationships. As we realized that a single ontology would not be able to capture the domain/conceptual models necessary to capture the semantics and agreements, we investigated inter-ontological relationships in our information brokering work [Kashyap-Sheth1994] and multi-ontology query processing in the OBSERVER system [Mena+1996] [Mena+2000]. In the InfoQuilt system, we defined IScapes, a paradigm for validating complex relationships defined using multiple ontologies over heterogeneous Web-based data [Sheth+2002a]. Our most recent work on complex relationships is called Semantic Associations [Anyanwu-Sheth03].

In this work, we create a large RDF graph of metadata of heterogeneous data, and discover semantic associations. The work covers main-memory RDF query processing, algorithms for finding paths, and algorithms for ranking semantic associations [Anyanwu+05] [Aleman-Meza+2005]. Compared to document ranking, ranking relationships is significantly more challenging. The concept of Active Semantic Documents (ASD) makes documents actionable. ASDs involve automatic lexical and semantic annotation of documents and the automatic evaluation of rules applicable to the annotations, with the ability to take appropriate actions (see, e.g., the application to electronic medical records).
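As a rough illustration of the path-finding idea only, the sketch below treats a handful of RDF-style triples as a graph and enumerates bounded-length paths between two entities. It is not the algorithm from the cited papers: the triples, the names `build_adjacency` and `find_associations`, the `^-1` suffix for inverse edges, and the hop limit are all assumptions made for this example, and ranking is not addressed.

```python
from collections import defaultdict, deque

# Hypothetical toy triples (subject, predicate, object); not from any real dataset.
TRIPLES = [
    ("alice", "authorOf", "paper1"),
    ("bob", "authorOf", "paper1"),
    ("bob", "memberOf", "labX"),
    ("carol", "memberOf", "labX"),
]


def build_adjacency(triples):
    """Index triples in both directions so paths may follow edges either way."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
        adj[o].append((p + "^-1", s))  # assumed notation for an inverse edge
    return adj


def find_associations(triples, source, target, max_hops=4):
    """Return all simple paths from source to target with at most max_hops edges.

    Each path is a list alternating entities and predicates:
    [entity, predicate, entity, predicate, entity, ...].
    Breadth-first search means shorter paths are returned first.
    """
    adj = build_adjacency(triples)
    results = []
    queue = deque([(source, [source])])
    while queue:
        node, path = queue.popleft()
        if node == target and len(path) > 1:
            results.append(path)
            continue
        hops = (len(path) - 1) // 2
        if hops >= max_hops:
            continue
        visited = set(path[::2])  # entities already on this path
        for pred, nxt in adj[node]:
            if nxt not in visited:
                queue.append((nxt, path + [pred, nxt]))
    return results


if __name__ == "__main__":
    for path in find_associations(TRIPLES, "alice", "carol"):
        print(" ".join(path))
```

In this toy graph the only connection between `alice` and `carol` runs through a shared paper and a shared lab. Exhaustive enumeration like this grows quickly with graph size and hop limit, which is one reason deciding which paths are worth showing is a separate and harder problem.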

Our research on semantics, in which relationships play a critical role, has led to commercial products for semantic search [Townley2000] and an enterprise semantic application platform [Sheth+2002], including automatic semantic metadata extraction (also termed metadata annotation) technology [Hammond+2002]. The latter could process one million Web pages per hour per traditional server while performing deep ontology-driven metadata extraction, whereas IBM's WebFountain-related technology has demonstrated more scalable but shallower metadata extraction, resulting in the annotation of over 4 billion pages [Brill+2003]. Many research projects as well as deployed enterprise applications have resulted from these technologies, some of which are reported in [Aleman-Meza+2005b, Sheth+2005a, Sheth2005].

In most of the examples in his article, Dr. Bush describes what we term implicit relationships. We believe that there is a lot of merit in the use of explicit, named relationships for the purpose of resource organization. Extracting such relationships from documents has received considerable interest in the fields of computational linguistics and information retrieval. Despite a few decades of research, this remains a very hard problem to solve. One outcome of our research thus far is our ability to create large instance bases for ontologies from multiple trusted sources [Sheth2004]. We have created many populated ontologies in academic and commercial settings, in which schemas are populated with corresponding knowledge bases containing millions of entity instances and relationship instances linking these entities (some of these are being made publicly available, e.g., SWETO, GlycO, ProPreO, while technologies such as Semagix Freedom have been used to create focused domain ontologies with as many as 14 million entity instances). Such populated ontologies significantly enhance our ability to identify potential relationships from unstructured data as well as from distributed information resources.

Several voices have questioned the viability, reliance and adequacy of formal ontologies. One perspective, represented by Google's Peter Norvig (see AOblog), questions the viability of the Semantic Web because creating domain ontologies is considered impractical; we have rebutted this view (see ShethBlog). An insightful view, on the (in)adequacy of crisp logics for question-answering systems (which demand far more than Web search engines), is that of Lotfi Zadeh [Zadeh2003]. More recently, we have explored the dimension of implicit, formal and powerful semantic representations [Sheth+2005b], and investigated uses of formal and semi-formal ontologies [Sheth-Ramakrishnan2003]. When focusing on deep domain semantics in biology (e.g., GlycO and ProPreO), we explored the need for more expressive probabilistic modeling of relationships, semantic annotation of scientific experiment data (in addition to textual data), and integrated discovery over scientific literature and scientific experiment results.

Amit Sheth, September 2005

Address
Professor Amit P. Sheth, Artificial Intelligence Institute, Department of Computer Science & Engineering, University of South Carolina