We aim to develop three classes of technologies that facilitate the search for effective data representations. By providing analysts with powerful tools for transforming complex data sets, we hope to enable dramatic improvements in the effectiveness of knowledge discovery. Collectively, these technologies will enable transformative pattern learning. Specifically, we will develop (1) a database architecture that facilitates rapid transformation of structured data sets, (2) powerful transformation tools that can alter both the schema and content of structured data sets, and (3) unique learning algorithms that guide analysts in their search for new, paradigm-changing data representations.
When used in concert, these three classes of tools will address a central challenge of knowledge discovery: its "chicken and egg" character. The wrong data representation can make it almost impossible to learn important knowledge in a given domain, but constructing the right data representation requires strong domain knowledge. Domain knowledge is used to decide which data are collected, how they are stored, and how they are presented to algorithms and users. Incorrect assumptions about the domain can lead to data representations that defeat even the most sophisticated learning algorithms. Thus, analysts need a good data representation to discover knowledge, but they need that knowledge to select a good data representation.
The resolution to this challenge, in both natural and artificial systems, is a coevolutionary process of pattern learning and data transformation. Data are used to learn new knowledge within a given data representation, and that knowledge transforms which data are collected and how those data are represented in future analyses. This process of "mutual bootstrapping" can lead to dramatic transformations in our understanding of a given domain. Philosophers of science (Popper 1972; Campbell 1974) identify this process as the core of an evolutionary epistemology: the coevolution of learned knowledge and data representation that forms the foundation of all knowledge discovery processes.
Unfortunately, prior research in knowledge discovery has focused on only half of this process: learning from data. Prior work focuses on how to construct useful models given a particular data representation (e.g., learning a decision tree or neural network from a given set of propositional data). However, current technologies provide almost no support for the other half of the coevolutionary process, the search for useful data representations, even though data selection and transformation are widely recognized as essential to successful application of knowledge discovery algorithms (Brodley and Smyth 1997; Fayyad et al. 1996).
One reason for this focus was an early standardization on propositional data, a data representation consisting of structurally homogeneous and statistically independent instances. In propositional data, the possible types of data transformations are extremely limited, and the data themselves provide little basis for reasoning about such transformations in the absence of large amounts of background knowledge. Thus, work in data transformation has not been particularly productive, and the lack of alternative learning algorithms has made it difficult to pursue research on non-propositional data. This has produced a form of scholarly lock-in (Kuhn 1964; Arthur 1990) that appears to have held back progress in data transformation.
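The contrast between propositional and non-propositional data can be illustrated with a minimal sketch. It is not part of the proposed system: the toy papers, the citation links, and the function name `propositionalize` are all hypothetical, chosen only to show one simple transformation from linked data to a flat table.

```python
from collections import defaultdict

# Hypothetical toy data (an assumption for illustration only):
# papers are objects with one attribute, citations are links between them.
papers = {
    "p1": {"topic": "learning"},
    "p2": {"topic": "learning"},
    "p3": {"topic": "databases"},
}
citations = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]  # (citing, cited)


def propositionalize(papers, citations):
    """Flatten the linked data into one fixed-length row per paper.

    Each row is a propositional instance: the same attributes for every
    paper, with the link structure summarized by aggregate counts.
    """
    cites = defaultdict(list)
    for citing, cited in citations:
        cites[citing].append(cited)

    rows = []
    for pid in sorted(papers):
        neighbors = cites.get(pid, [])
        same_topic = sum(
            papers[n]["topic"] == papers[pid]["topic"] for n in neighbors
        )
        rows.append(
            {
                "id": pid,
                "topic": papers[pid]["topic"],
                "n_cited": len(neighbors),
                "n_cited_same_topic": same_topic,
            }
        )
    return rows


rows = propositionalize(papers, citations)
for row in rows:
    print(row)
```

The choice of which aggregates to compute (here, two counts) is itself a representation decision. The linked form leaves many such choices open, while the flat table fixes one of them in advance.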
We believe that recent technical developments offer a unique opportunity to break out of the prevailing paradigm of knowledge discovery and produce revolutionary new developments in technologies for data transformation. First, a wide range of new techniques for analyzing non-propositional data have recently been developed (see Jensen and Goldberg 1998; De Raedt and Kramer 2000; Getoor and Jensen 2000). Second, technologies developed in our own laboratory, including the Proximity database architecture, the QGraph query language, and techniques for iterative classification, provide an excellent technical foundation for developing new approaches to data transformation. Third, explosive growth in structured data from the Web, evidence extraction technologies, and large public databases makes it possible to evaluate technologies for transformative pattern learning. New technologies for data representation, data transformation, and pattern learning will complete the coevolutionary cycle and enable dramatic improvements in knowledge discovery for a new range of applications.
References
Arthur, W. (1990). Positive feedbacks in the economy. Scientific American. February. 92-99.
Brodley, C., and P. Smyth (1997). Applying classification algorithms in practice. Statistics and Computing 7:45-56.
Campbell, D. (1974). Evolutionary epistemology. In P. A. Schilpp (Ed.), The philosophy of Karl R. Popper (pp. 412-463). LaSalle, IL: Open Court.
De Raedt, L., and S. Kramer (Eds.) (2000). Proceedings of the Workshop on Attribute Value Learning and Relational Learning: Crossing the Boundaries, Workshop held at the 17th International Conference on Machine Learning, Stanford.
Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth (1996). From data mining to knowledge discovery in databases. AI Magazine. Fall. 37-54.
Getoor, L., and D. Jensen (2000). Learning Statistical Models from Relational Data: Papers from the AAAI 2000 Workshop. Menlo Park: AAAI Press. Technical Report WS-00-006.
Jensen, D., and H. Goldberg (Eds.) (1998). Artificial Intelligence and Link Analysis. Papers from the 1998 AAAI Fall Symposium. Technical Report FS-98-01. Menlo Park: AAAI Press.
Kuhn, T. (1964). The Structure of Scientific Revolutions. University of Chicago Press.
Popper, K. (1972). Objective Knowledge: An Evolutionary Approach. Oxford: The Clarendon Press.