.. _library_c45:

``c45``
=======

This library implements the C4.5 decision tree learning algorithm. C4.5
is an extension of the ID3 algorithm that uses information gain ratio
instead of information gain for attribute selection, which avoids bias
towards attributes with many values (see below for implementation
details).

The library implements the ``classifier_protocol`` defined in the
``classifier_protocols`` library. It provides predicates for learning a
decision tree from a dataset, optionally prune it, using it to make
predictions, and exporting it as a list of predicate clauses or to a
file.

Datasets are represented as objects implementing the
``dataset_protocol`` protocol from the ``classifier_protocols`` library.
See ``test_files`` directory for examples.

API documentation
-----------------

Open the
`../../apis/library_index.html#c45 <../../apis/library_index.html#c45>`__
link in a web browser.

Loading
-------

To load all entities in this library, load the ``loader.lgt`` file:

::

   | ?- logtalk_load(c45(loader)).

Testing
-------

To test this library predicates, load the ``tester.lgt`` file:

::

   | ?- logtalk_load(c45(tester)).

Implemented features
--------------------

- Information gain ratio for attribute selection (avoids bias towards
  attributes with many values)
- Handling of discrete (categorical) attributes with multi-way splits
- Handling of continuous (numeric) attributes with binary threshold
  splits (selects the threshold with the highest gain ratio from
  midpoints between consecutive sorted values)
- Handling of missing attribute values (represented using anonymous
  variables): examples with missing values for an attribute are
  distributed to all branches during tree construction; gain ratio
  computation uses only examples with known values for the attribute
  being evaluated; prediction with missing values uses majority voting
  by exploring all possible branches and selecting the most common class
- Tree pruning using pessimistic error pruning (PEP): estimates error
  rates using the upper confidence bound of the binomial distribution
  (Wilson score interval) with a configurable confidence factor (default
  0.25) and minimum instances per leaf (default 2); replaces subtrees
  with leaf nodes when doing so would not increase the estimated error;
  helps reduce overfitting and improve generalization
- Export of learned decision trees as predicate clauses
- Pretty-printing of learned decision trees

Limitations
-----------

- No incremental learning (the tree must be rebuilt from scratch when
  new examples are added)

References
----------

- Quinlan, J.R. (1986). Induction of Decision Trees. *Machine Learning*,
  1(1), 81-106. https://doi.org/10.1007/BF00116251

- Quinlan, J.R. (1993). *C4.5: Programs for Machine Learning*. Morgan
  Kaufmann Publishers. https://doi.org/10.1016/C2009-0-27846-9

- Mitchell, T.M. (1997). *Machine Learning*. McGraw-Hill. Chapter 3:
  Decision Tree Learning.

Usage
-----

To learn a decision tree from a dataset:

::

   | ?- c45::learn(play_tennis, Tree).

To prune a learned tree with the default parameters (confidence factor
0.25, minimum instances per leaf 2):

::

   | ?- c45::learn(breast_cancer, Tree),
        c45::prune(breast_cancer, Tree, PrunedTree).

To prune a learned tree with both custom confidence factor and minimum
instances per leaf:

::

   | ?- c45::learn(breast_cancer, Tree),
        c45::prune(breast_cancer, Tree, 0.1, 3, PrunedTree).

To export the tree as a list of predicate clauses:

::

   | ?- c45::learn(play_tennis, Tree),
        c45::tree_to_clauses(play_tennis, Tree, classify, Clauses).

To export the tree to a file:

::

   | ?- c45::learn(play_tennis, Tree),
        c45::tree_to_file(play_tennis, Tree, classify, 'tree.pl').

To print the tree to the current output:

::

   | ?- c45::learn(play_tennis, Tree),
        c45::print_tree(Tree).

To predict the class for a new instance (as a list of attribute-value
pairs):

::

   | ?- c45::learn(play_tennis, Tree),
        c45::predict(Tree, [outlook-sunny, temperature-hot, humidity-high, wind-weak], Class).
