One of the difficulties in building an SQL-like query language for the Web is the absence of a database (131) for this huge, heterogeneous repository of information. However, if we are interested in HTML documents only, we can construct a virtual schema from the implicit structure of these files. Thus, at the highest level of (132), every such document is identified by its Uniform Resource Locator (URL) and has a (133) and a text. Also, Web servers provide some additional information, such as the type, length, and last modification date of a document. So, for data mining purposes, we can consider the set of all HTML documents as a relation:

Document(url, title, text, type, length, modif)

where all the (134) are character strings. In this framework, an individual document is identified with a (135) in this relation. Of course, if some optional information is missing from the HTML document, the associated fields will be left blank, but this is not uncommon in any database.
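
As a rough illustration of the virtual schema described in the passage, the following Python sketch models the Document relation as a set of records whose fields are all character strings, with missing optional information left blank. The helper make_document, the example URLs, and the mapping from the HTTP headers Content-Type, Content-Length, and Last-Modified to the type, length, and modif fields are assumptions made for this example; the passage itself does not say how those values are obtained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One entry of the virtual Document relation; every field is a string."""
    url: str
    title: str = ""
    text: str = ""
    type: str = ""
    length: str = ""
    modif: str = ""


def make_document(url, title="", text="", headers=None):
    """Hypothetical helper: build an entry, leaving missing fields blank.

    Assumption: the server-provided values come from HTTP response headers.
    """
    headers = headers or {}
    return Document(
        url=url,
        title=title,
        text=text,
        type=headers.get("Content-Type", ""),
        length=headers.get("Content-Length", ""),
        modif=headers.get("Last-Modified", ""),
    )


# The relation is modeled as a set of entries (example data, not real pages).
documents = {
    make_document(
        "http://example.org/a.html",
        title="Page A",
        text="Hello",
        headers={"Content-Type": "text/html", "Content-Length": "5"},
    ),
    make_document("http://example.org/b.html"),  # optional fields missing
}

# A simple selection over the relation: URLs of documents with a blank title.
untitled = sorted(d.url for d in documents if d.title == "")
```

Using a frozen dataclass keeps each entry immutable and hashable, so the whole collection can be stored in a set, which matches the idea of a relation containing no duplicate entries.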






