
Parallel classification

Classification is an important problem in data mining. SPRINT [3] is a well-known classifier for large data sets. Its compute-intensive part recursively partitions the input data set until each subset belongs predominantly to a single class. We have parallelized SPRINT using Adlib. Our parallelization strategy uses the following novel two-phase approach:

  1. The higher-level nodes of the tree are constructed using all the processors. This uses parallel processing effectively, because the amount of data represented by the higher-level nodes is large.
  2. The lower-level nodes are divided among the processors, so that each processor works sequentially through its own subset of the nodes while the processors run in parallel with one another.
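The two phases above can be sketched as a simple scheduling plan. This is a minimal illustration in plain Python, not the Adlib implementation: the `Node` record, the `switch_depth` cut-off between the phases, and the greedy largest-first assignment of lower-level nodes are all assumptions made for the example, since the report does not say how the switch point or the node-to-processor mapping is chosen.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical description of one decision tree node."""
    name: str
    depth: int      # distance from the root (root has depth 0)
    records: int    # number of training records reaching this node


def plan_two_phase(nodes, num_procs, switch_depth):
    """Split tree nodes into the two phases.

    Phase 1: nodes with depth < switch_depth are built by all processors.
    Phase 2: deeper nodes are each assigned to exactly one processor.
    """
    shared = [n.name for n in nodes if n.depth < switch_depth]

    load = [0] * num_procs                      # records assigned per processor
    assigned = [[] for _ in range(num_procs)]   # node names per processor

    # Assumed heuristic: take the largest node first and give it to the
    # processor that currently has the least work.
    lower = [n for n in nodes if n.depth >= switch_depth]
    for n in sorted(lower, key=lambda n: -n.records):
        p = load.index(min(load))
        assigned[p].append(n.name)
        load[p] += n.records

    return shared, assigned, load
```

Each processor would then build the nodes in its own list one after another, with no communication needed until the subtrees are collected.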
The work is still in progress [51]. Our experience shows that the distributed array structures and high-level intrinsic functions available in Adlib allow the code writer to concentrate on higher-level issues. We were able to express the above strategy easily in Adlib. Further, when developing code for a realistic application, the right distribution (no distribution, block distribution, etc.) for each data structure is not always clear in advance. The ability to quickly modify the code to change a structure from distributed to replicated, and vice versa, proved very beneficial.
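To make the distribution choice concrete, the sketch below computes which index range each processor would hold under a block distribution and under full replication. It is plain Python rather than Adlib code, the function names are hypothetical, and the ceiling-sized blocks are only one common convention for block distribution.

```python
def block_ranges(n, num_procs):
    """Half-open index range [lo, hi) owned by each processor when
    n elements are block-distributed (block size = ceil(n / num_procs))."""
    block = -(-n // num_procs)  # ceiling division
    return [(min(p * block, n), min((p + 1) * block, n))
            for p in range(num_procs)]


def replicated_ranges(n, num_procs):
    """With no distribution, every processor holds the whole array."""
    return [(0, n)] * num_procs
```

If the rest of the code only ever asks "which range do I own?", switching a structure between the two layouts means swapping one function for the other, which is the kind of quick change described above.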

Additionally, we completed our work on classification of decision trees for large datasets [5].


Bryan Carpenter 2002-07-12