Classification is an important problem in the field of data mining.
SPRINT [3] is a well-known classifier for large data sets. The
compute-intensive part of SPRINT recursively partitions the input data
set until the records in each subset predominantly belong to a single
class. We have parallelized SPRINT using Adlib. Our parallelization
strategy uses the following novel two-phase approach:
- The higher-level nodes of the tree use all the processors during
tree construction. This uses parallel processing effectively
because the data represented by the higher-level nodes is
large.
- The lower-level nodes are divided among the processors so that
each processor works through its own subset of the nodes
sequentially, in parallel with the other processors.
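The two-phase strategy above can be sketched roughly as follows. This is an illustrative Python simulation, not the Adlib code and not the authors' implementation: tree nodes are represented only by their record counts, `split_sizes` is a hypothetical stand-in for a real SPRINT split, and both the size `threshold` that separates the phases and the greedy largest-first assignment in phase 2 are assumptions made for the example.

```python
import heapq


def split_sizes(n):
    """Hypothetical stand-in for a SPRINT split: halve the record count."""
    left = n // 2
    return left, n - left


def two_phase_plan(root_size, num_procs, threshold, leaf_size=1):
    """Return (shared_nodes, assignment) for a simulated tree build.

    shared_nodes: sizes of nodes built by all processors together (phase 1).
    assignment:   processor rank -> sizes of the subtrees it builds
                  sequentially on its own (phase 2).
    """
    shared_nodes = []
    deferred = []
    frontier = [root_size]

    # Phase 1: large nodes are partitioned using every processor.
    while frontier:
        size = frontier.pop()
        if size <= leaf_size:
            continue  # treated as a leaf; nothing left to partition
        if size >= threshold:
            shared_nodes.append(size)
            frontier.extend(split_sizes(size))
        else:
            deferred.append(size)  # small enough to hand to one processor

    # Phase 2: give each deferred subtree to the currently least-loaded
    # processor, largest subtree first (an assumed balancing heuristic).
    heap = [(0, rank) for rank in range(num_procs)]
    assignment = {rank: [] for rank in range(num_procs)}
    for size in sorted(deferred, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(size)
        heapq.heappush(heap, (load + size, rank))
    return shared_nodes, assignment


if __name__ == "__main__":
    shared, per_proc = two_phase_plan(root_size=16, num_procs=2, threshold=8)
    print("built by all processors:", sorted(shared, reverse=True))
    print("built per processor:", per_proc)
```

In a real implementation the phase 1 nodes would operate on distributed data, while each phase 2 subtree would be local to its processor; the sketch only shows how the work is divided.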
The work is still in progress [51]. Our experience
shows that the availability of distributed array structures and
high-level intrinsic functions in Adlib allows the code writer to
concentrate on higher-level issues. We were able to represent the
above strategy easily in Adlib. Further, when developing code for a
realistic application, the right distribution (no distribution, block
distribution, etc.) for each structure is not always clear. The
ability to quickly change a structure from
distributed to replicated, and vice versa, was found to be very
beneficial.
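As a rough illustration of why switching between distributed and replicated layouts can be cheap when the layout is a separate, swappable piece, consider the Python sketch below. It does not use Adlib and does not reproduce its API: `block_range`, `replicated_range`, and `local_sum` are hypothetical names, and the ceiling-sized block convention is one common choice assumed here.

```python
def block_range(n, num_procs, rank):
    """Half-open index range [lo, hi) owned by `rank` under a block distribution."""
    block = -(-n // num_procs)  # ceiling division: size of each block
    lo = min(rank * block, n)
    hi = min(lo + block, n)
    return lo, hi


def replicated_range(n, num_procs, rank):
    """Every processor holds the whole structure."""
    return 0, n


def local_sum(data, layout, num_procs, rank):
    """Sum the elements that `rank` holds under the given layout function."""
    lo, hi = layout(len(data), num_procs, rank)
    return sum(data[lo:hi])


if __name__ == "__main__":
    data = list(range(10))
    for layout in (block_range, replicated_range):
        sums = [local_sum(data, layout, 4, rank) for rank in range(4)]
        print(layout.__name__, sums)
```

Here the computation in `local_sum` is unchanged when the layout changes; only the layout function passed in differs.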
Additionally, we completed our work on decision-tree
classification for large datasets [5].
Bryan Carpenter
2002-07-12