Using AstroWeka to Classify SuperCOSMOS data

This page gives a practical example of performing a typical data mining task on astronomical data - classifying stars and galaxies in the SSA.

The morphology data for the objects, including area profiles, combined with galactic latitude will be used to determine which are stars and which are galaxies. The morphology data tell us which objects look like stars, and we expect to see a higher proportion of stars near the galactic equator.

The Sloan Digital Sky Survey classification of the objects is used to label the training data.

Developing a classification scheme for tabular data typically involves the following steps:

  1. Acquiring the data
  2. Filtering and transforming the data
  3. Training a classifier
  4. Using this classifier

The line between steps 1 and 2 is often quite blurred. Also, step 3 almost always involves training several classifiers and then picking the best one.

Step 1: Acquiring SSA Data

The most convenient way of acquiring AstroGrid hosted data in AstroWeka is to use the AstroExplorer, which can load data from DSA services directly.

To load the SSA data into AstroExplorer, click on file->Open DSA. In the dialog box, type ssa AND dsa. Pick the service with an ADQL interface.

The following ADQL query gathers the data needed to train our classifier:

Select a.classMagB, a.ellipB, a.l, a.ellipR1, a.ellipI,
blue.ap1, blue.ap2, blue.ap3, blue.ap4, blue.ap5, blue.ap6, blue.ap7, b.sdssType
From Source as a, CrossNeighboursEDR as b, Detection as blue
Where a.objID=b.ssaID and blue.objID=a.objIDB

Step 2: Transforming the Data

Once the data have loaded, the AstroExplorer should look like this:

The data set in its current state cannot be used to train a classifier, because the label, sdssType, is numeric and needs to be converted to a nominal attribute. This can be achieved using a filter.

The Discretize filter converts numerical attributes to nominal ones. To select it, click on the filters button on the preprocess panel of the AstroExplorer. From the tree select unsupervised->attribute->Discretize.

Once the Discretize filter has been selected, it must be configured before it can be used. Left click on the text box where "Discretize" is written; this should pop up a dialog box. In the attribute text box type last, and for the number of bins enter 2.

Click "Ok", then back on the preprocess panel click "Apply". The AstroExplorer should now look like this, with colours in the histograms to denote the ratio of each class in each bin.
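To make the effect of the filter concrete, here is a minimal plain-Java sketch of equal-width binning, which as far as I know is the default mode of Weka's unsupervised Discretize filter. It is not the filter's actual code. The class name EqualWidthBins is made up for this example, the sample values assume that sdssType holds the SDSS type codes 3 (galaxy) and 6 (star), and placing a value that falls exactly on a cut point into the lower bin is an assumption.

```java
import java.util.Arrays;

/** Hypothetical helper: equal-width binning of one numeric column. */
final class EqualWidthBins {

    /** Returns the bin index (0 .. bins-1) of each value, using equal-width bins over [min, max]. */
    static int[] discretize(double[] values, int bins) {
        if (values.length == 0 || bins < 1) {
            throw new IllegalArgumentException("need at least one value and one bin");
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;

        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int bin = 0;
            // Cut points lie at min + k * width; a value strictly above cut k is in bin k or higher.
            for (int k = 1; k < bins; k++) {
                if (values[i] > min + k * width) {
                    bin = k;
                }
            }
            result[i] = bin;
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample labels: 3 = galaxy, 6 = star.
        double[] sdssType = {3, 6, 6, 3, 3};
        // Two bins split the range [3, 6] at 4.5, so each code gets its own bin.
        System.out.println(Arrays.toString(discretize(sdssType, 2))); // prints [0, 1, 1, 0, 0]
    }
}
```

With two bins, the numeric label collapses into two nominal values, one per object type, which is what the classifier needs as its class attribute.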

Step 3: Training a Classifier

Classifiers are trained from the Classifier panel, which is accessed by clicking on the "classify" tab.

From the classifier panel, click "Choose" and then select trees -> J48 in a similar manner to the way the filter was chosen.

J48 is a decision tree classifier based on the C4.5 algorithm. J48 has many parameters that change the way the tree is constructed from the data. In practice, the best values for these parameters are found by experimentation, but for this example the default values will do.

Click "Start" to build the classifier. One of the main advantages of the J48 classifier is that it is relatively quick to train, and it should finish almost immediately on a small data set such as the one we are using.

Once an AstroWeka classifier has finished training, statistics about its effectiveness are displayed. The J48 classifier displays additional information, including a text representation of the tree it uses to perform evaluations.
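The statistics shown typically include the fraction of correctly classified instances and a confusion matrix. As a rough illustration of how such figures relate to each other, the plain-Java sketch below derives accuracy and per-class recall from a confusion matrix whose rows are the actual classes and whose columns are the predicted classes. The class name ConfusionStats and the numbers in main are invented for this example; they are not results from the SSA data.

```java
/** Hypothetical helper: summary figures from a confusion matrix (rows = actual, columns = predicted). */
final class ConfusionStats {

    /** Fraction of all instances that lie on the diagonal, i.e. were classified correctly. */
    static double accuracy(long[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < matrix.length; actual++) {
            for (int predicted = 0; predicted < matrix[actual].length; predicted++) {
                total += matrix[actual][predicted];
                if (actual == predicted) {
                    correct += matrix[actual][predicted];
                }
            }
        }
        return (double) correct / total;
    }

    /** Fraction of the instances that really belong to classIndex and were labelled as such. */
    static double recall(long[][] matrix, int classIndex) {
        long rowTotal = 0;
        for (long count : matrix[classIndex]) {
            rowTotal += count;
        }
        return (double) matrix[classIndex][classIndex] / rowTotal;
    }

    public static void main(String[] args) {
        // Made-up counts: row 0 = actual galaxies, row 1 = actual stars.
        long[][] matrix = {{90, 10}, {5, 95}};
        System.out.println("accuracy      = " + accuracy(matrix));
        System.out.println("galaxy recall = " + recall(matrix, 0));
        System.out.println("star recall   = " + recall(matrix, 1));
    }
}
```

Looking at recall per class as well as overall accuracy helps when one class is much rarer than the other, because a classifier can score a high accuracy while still mislabelling most of the rare class.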

Once a classifier has been found that performs suitably well on the data, it can be saved to be used later. Right click on the classifier's entry in the result list and select "Save Model".

Step 4: Using the Classifier to Label New Data

Actually using the classifier to label new data is more difficult, and requires using the Java API. Essentially, the data to be labelled needs to be read in and formed into a weka.core.Instances object. Once this is done, each instance in the data can be evaluated using the classifyInstance() method of the classifier.
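The sketch below shows one way this might look with the standard Weka API; it needs the Weka jar on the classpath. It assumes that the saved model can be read back with weka.core.SerializationHelper, that the new data are stored in an ARFF file with the same attributes as the training set, and that the class attribute is the last column (its values may be missing, written as "?"). The file names j48.model and unlabelled.arff and the class name LabelNewData are placeholders invented for this example.

```java
import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;

/** Sketch: label the rows of an ARFF file with a previously saved classifier. */
final class LabelNewData {

    /** Returns the predicted class label, as text, for every instance in data. */
    static String[] labelAll(Classifier model, Instances data) throws Exception {
        String[] labels = new String[data.numInstances()];
        for (int i = 0; i < data.numInstances(); i++) {
            // classifyInstance() returns the index of the predicted class value as a double.
            double predicted = model.classifyInstance(data.instance(i));
            labels[i] = data.classAttribute().value((int) predicted);
        }
        return labels;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names; replace with the saved model and the data to label.
        Classifier model = (Classifier) SerializationHelper.read("j48.model");

        Instances data;
        try (BufferedReader reader = new BufferedReader(new FileReader("unlabelled.arff"))) {
            data = new Instances(reader);
        }
        // Tell Weka which attribute is the class; here it is assumed to be the last one.
        data.setClassIndex(data.numAttributes() - 1);

        String[] labels = labelAll(model, data);
        for (int i = 0; i < labels.length; i++) {
            System.out.println(i + "\t" + labels[i]);
        }
    }
}
```

The new data must go through the same transformations as the training data, so that the attributes and the class values seen by the classifier match those it was trained on.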

Brian Walshe