ParaSite: Mining the Structural Information on the World-Wide Web

Ellen Spertus

PhD Thesis, Department of EECS, MIT, February 1998.



The World-Wide Web is potentially the world's largest knowledge base, but only if new information retrieval techniques are developed to take advantage of its unique characteristics, particularly the semi-structured information within pages, across pages, and in page names. Because these types of structure are represented in such different ways, a large number of specialized tools have been required to gather structural information. I provide a relational database interface to the Web called Squeal, which encapsulates these different types of structure in a uniform manner, allowing the user to query the Web in Structured Query Language (SQL) as if it were a database. A novel "just-in-time" interpreter automatically retrieves information from the Web as implicitly demanded by user queries, a technique that could be applied not just to the Internet but also to other sources of data too large to be precomputed into a database. The level of abstraction provided by Squeal allows the user to easily create agents that make full use of the previously untapped information on the Web. One such "ParaSite" is a simple structure-based recommender system that compares favorably to the best text-based system.
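
The "just-in-time" idea described above can be illustrated with a minimal sketch. This is not Squeal's actual implementation or schema: the `link` table, the `JustInTimeLinks` class, and the `FAKE_WEB` dictionary standing in for the Web are all hypothetical names chosen for illustration. The sketch only shows the general pattern of filling a relational table lazily, at the moment a query demands the rows, instead of precomputing the whole table.

```python
import sqlite3

# Hypothetical stand-in for the Web: URL -> outgoing link targets.
# A real system would download and parse pages instead.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


class JustInTimeLinks:
    """Lazily fills a link(source, destination) table on demand (illustrative only)."""

    def __init__(self, fetch_links):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE link (source TEXT, destination TEXT)")
        self.fetch_links = fetch_links  # callable: url -> list of link targets
        self.fetched = set()            # sources whose rows are already stored

    def _ensure(self, source):
        # Retrieve the rows for `source` only the first time they are needed.
        if source in self.fetched:
            return
        rows = [(source, dest) for dest in self.fetch_links(source)]
        self.db.executemany("INSERT INTO link VALUES (?, ?)", rows)
        self.fetched.add(source)

    def links_from(self, source):
        # The query implicitly demands data, so populate the table first.
        self._ensure(source)
        cur = self.db.execute(
            "SELECT destination FROM link WHERE source = ? ORDER BY destination",
            (source,),
        )
        return [row[0] for row in cur]


fetch_log = []


def fake_fetch(url):
    # Records each retrieval so that repeated queries can be seen to hit the cache.
    fetch_log.append(url)
    return FAKE_WEB.get(url, [])


jit = JustInTimeLinks(fake_fetch)
first = jit.links_from("http://a.example/")
second = jit.links_from("http://a.example/")  # answered from the table, no new fetch
```

In this sketch the second query is answered entirely from the local table, so `fetch_log` contains a single entry. A full interpreter would infer which rows to retrieve from the SQL query itself; here that step is hard-coded in `links_from` to keep the example short.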


Related Documents

"A Hyperlink-Based Recommender System Written in Squeal", CIKM'98 Workshop on Web Information and Data Management (WIDM'98), November 6, 1998, with Lynn Andrea Stein.

"Just-In-Time Databases and the World-Wide Web", Seventh International ACM Conference on Information and Knowledge Management, November 1998, with Lynn Andrea Stein.

"Squeal: SQL Access to Information on the Web", AAAI-98 Workshop on AI and Information Integration.

"Mining the Web's Hyperlinks for Recommendations", AAAI-98 Workshop on Recommender Systems, with Lynn Andrea Stein.

"ParaSite: Mining Structural Information on the Web", The Sixth International World Wide Web Conference, April 1997. Also appears in Computer Networks and ISDN Systems: The International Journal of Computer and Telecommunications Networking 29 (1997) 1205-1215.