
Dataset Utilities

A set of routines to load and manage datasets for machine learning / data mining tasks.

A dataset is a table:


Outlook  Temperature  Humidity  Windy  Plays
sunny    hot          high      false  no
sunny    hot          high      true   no

Each column in the table is an attribute, and each row is an instance. Instances have values for each attribute. The whole table is called a relation, and can be given a name.

Exported Procedures

Creating datasets

[procedure] (make-nominal-attribute name value-1 ...)

Creates a nominal attribute with the given name and possible values, e.g.:

> (make-nominal-attribute 'outlook 'sunny 'overcast 'rainy)

[procedure] (make-numeric-attribute name)

Creates a numeric attribute, e.g.:

> (make-numeric-attribute 'temperature)

[procedure] (make-relation name attributes data)

Creates a relation with the given name. attributes must be a list of attributes, as created by make-nominal-attribute or make-numeric-attribute. data must be a list of lists: each sublist represents one instance and gives that instance's value for every attribute.

> (make-relation 'plays-tennis
                  (list (make-nominal-attribute 'outlook 'sunny 'overcast 'rainy)
                        (make-nominal-attribute 'temperature 'hot 'mild 'cool)
                        (make-nominal-attribute 'humidity 'high 'normal)
                        (make-nominal-attribute 'windy 'true 'false)
                        (make-nominal-attribute 'plays 'yes 'no))
                  '((sunny hot high false no)
                    (sunny hot high true no)
                    (overcast hot high false yes)
                    ...
                    (rainy mild high true no)))

Managing datasets

[procedure] (attribute-name attribute)

Returns the name of the given attribute.

[procedure] (attribute-definition attribute)

Returns a definition of the type of the given attribute. This definition will be one of:

  • '(numeric) for numeric attributes
  • '(nominal value-1 ...) for nominal attributes, listing the possible values

[procedure] (class-probability relation attribute-name value)

Returns the proportion of instances in relation whose value for attribute-name is value.
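The idea of a proportion of matching instances can be sketched in plain Scheme. This is not the egg's implementation: proportion-with-value is a hypothetical helper, instances are assumed to be plain lists, and the attribute is identified by its position rather than by name.

```scheme
;; Hypothetical helper, not part of dataset-utils: returns the fraction of
;; instances whose value at position `index` equals `value`.
(define (proportion-with-value instances index value)
  (let loop ((rest instances) (matches 0) (total 0))
    (cond ((null? rest)
           ;; An empty dataset has no matches; avoid dividing by zero.
           (if (= total 0) 0 (exact->inexact (/ matches total))))
          ((equal? (list-ref (car rest) index) value)
           (loop (cdr rest) (+ matches 1) (+ total 1)))
          (else
           (loop (cdr rest) matches (+ total 1))))))

;; Illustrative sample rows: outlook, temperature, humidity, windy, plays.
(define sample-instances
  '((sunny hot high false no)
    (sunny hot high true no)
    (overcast hot high false yes)
    (rainy mild high true no)))

;; Two of the four instances have 'sunny in the first position.
(display (proportion-with-value sample-instances 0 'sunny)) ; displays 0.5
(newline)
```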

[procedure] (entropy relation attribute-name)

Computes the entropy of the given relation, using the values of attribute-name to divide the instances into groups. attribute-name should name a nominal attribute.
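For reference, the usual Shannon entropy over the values of a nominal attribute A is written out below. The base-2 logarithm is an assumption: the text above does not say which base the egg uses.

```latex
H(A) = -\sum_{v \in \mathrm{values}(A)} p_v \log_2 p_v,
\qquad
p_v = \frac{\lvert \{\text{instances with } A = v\} \rvert}{\lvert \{\text{instances}\} \rvert}
```

Here p_v is the proportion of instances taking value v, the quantity described under class-probability. A relation in which every instance has the same value for A has entropy 0, and two equally frequent values give an entropy of 1 bit.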

[procedure] (filter-instances relation attribute-name value)

Returns a new relation containing those instances of relation which have the given value for attribute-name.

[procedure] (find-attribute-index relation attribute-name)

Returns the index of the attribute named attribute-name among the attributes of relation.

[procedure] (get-attribute-values relation attribute-name)

Returns the values taken by the instances in relation for the given attribute name.

[procedure] (information-gain relation target-class attribute-name)

Computes the information gain obtained by splitting the data in relation on attribute-name, that is, the reduction in entropy compared with the unsplit data. The entropy is computed with respect to target-class.
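The calculation can be illustrated with a minimal, self-contained sketch. It assumes the standard definition (entropy of the target minus the size-weighted entropy of the target within each subset) and a base-2 logarithm; neither detail is confirmed by the text above. All names here (gain, entropy-at, select, distinct-values) are hypothetical, instances are plain lists, and attributes are given by position rather than by name.

```scheme
;; Sketch only, not the egg's implementation.
(define (log2 x) (/ (log x) (log 2)))

;; Collect the distinct values found at position `index`.
(define (distinct-values instances index)
  (let loop ((rest instances) (seen '()))
    (cond ((null? rest) (reverse seen))
          ((member (list-ref (car rest) index) seen) (loop (cdr rest) seen))
          (else (loop (cdr rest) (cons (list-ref (car rest) index) seen))))))

;; Keep the instances whose value at position `index` equals `value`.
(define (select instances index value)
  (let loop ((rest instances) (kept '()))
    (cond ((null? rest) (reverse kept))
          ((equal? (list-ref (car rest) index) value)
           (loop (cdr rest) (cons (car rest) kept)))
          (else (loop (cdr rest) kept)))))

;; Shannon entropy (base 2) of the values at position `index`.
(define (entropy-at instances index)
  (let ((total (length instances)))
    (let loop ((vals (distinct-values instances index)) (sum 0))
      (if (null? vals)
          sum
          (let ((p (/ (length (select instances index (car vals))) total)))
            (loop (cdr vals) (- sum (* p (log2 p)))))))))

;; Entropy of the target minus the size-weighted entropy of each subset
;; produced by splitting on `split-index`.
(define (gain instances target-index split-index)
  (let ((total (length instances)))
    (let loop ((vals (distinct-values instances split-index)) (weighted 0))
      (if (null? vals)
          (- (entropy-at instances target-index) weighted)
          (let ((subset (select instances split-index (car vals))))
            (loop (cdr vals)
                  (+ weighted
                     (* (/ (length subset) total)
                        (entropy-at subset target-index)))))))))

;; Illustrative rows: outlook, windy, plays.
(define sample
  '((sunny true no) (sunny false no)
    (overcast true yes) (overcast false yes)))

(display (gain sample 2 0)) ; outlook separates the classes completely
(newline)
(display (gain sample 2 1)) ; windy tells us nothing about plays
(newline)
```

In this sample, splitting on the first column yields a gain of about 1 bit, because each subset is pure, while splitting on the second column yields a gain of about 0.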

[procedure] (relation-attributes relation)

Returns a list of the attributes of the given relation.

[procedure] (relation-data relation)

Returns a list of the instances in the given relation.

[procedure] (relation-name relation)

Returns the name of the given relation.

[procedure] (split-instances relation attribute-name)

Given the name of a nominal attribute, returns a list of relations, each containing the instances of relation that share one value for attribute-name.

Metrics

[procedure] (euclidean-distance instance-1 instance-2)

Computes the Euclidean distance between the two instances.
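As a rough illustration of the metric, the sketch below computes the distance between two instances represented as equal-length lists of numbers. distance-between is a hypothetical name, and how the egg itself treats nominal values is not stated here, so this sketch handles numeric values only.

```scheme
;; Sketch only, not the egg's implementation: square root of the sum of
;; squared differences between corresponding values.
(define (distance-between instance-1 instance-2)
  (let loop ((xs instance-1) (ys instance-2) (sum 0))
    (if (or (null? xs) (null? ys))
        (sqrt sum)
        (let ((diff (- (car xs) (car ys))))
          (loop (cdr xs) (cdr ys) (+ sum (* diff diff)))))))

;; A 3-4-5 right triangle: the two points are 5 units apart.
(display (distance-between '(0 0) '(3 4)))
(newline)
```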

Importing Data

[procedure] (read-arff filename)

Reads an ARFF definition from the given filename and returns a relation. Currently supports nominal and numeric attribute types; sparse ARFF files are not supported.
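For orientation, a small dense ARFF file using only nominal and numeric attributes might look like the following. The relation name, attribute names, and data rows are illustrative, and whether the egg accepts every detail shown here (such as comment lines) is an assumption.

```
% Illustrative sample, not shipped with the egg.
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute plays {yes, no}

@data
sunny,85,85,false,no
sunny,80,90,true,no
overcast,83,86,false,yes
```

Saved under a hypothetical name such as weather.arff, this file could be loaded with (read-arff "weather.arff").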

Author

Peter Lane.

License

GPL version 3.0.

Version History

in trunk.
