.. _library_stemming:

``stemming``
============

This library provides word stemming predicates for English text, with
support for different word representations: atoms, character lists, or
character code lists.

The library includes implementations of two well-known stemming
algorithms:

- **Porter Stemmer** - The Porter stemming algorithm (Porter, 1980) is a
  widely used algorithm that reduces English words to a common stem by
  applying a sequence of rules that strip common suffixes.

- **Lovins Stemmer** - The Lovins stemming algorithm (Lovins, 1968)
  removes the longest suffix from a word using a list of endings, each
  associated with a condition for removal. It then applies
  transformation rules to fix spelling.

API documentation
-----------------

Open the
`../../apis/library_index.html#stemming <../../apis/library_index.html#stemming>`__
link in a web browser.

Loading
-------

To load all entities in this library, load the ``loader.lgt`` file:

::

   | ?- logtalk_load(stemming(loader)).

Testing
-------

To test this library's predicates, load the ``tester.lgt`` file:

::

   | ?- logtalk_load(stemming(tester)).

Usage
-----

The stemming predicates are defined in parametric objects where the
parameter specifies the word representation:

- ``atom`` - words are represented as atoms
- ``chars`` - words are represented as lists of characters
- ``codes`` - words are represented as lists of character codes

The parameter must be bound when sending messages to the objects.

Porter Stemmer
~~~~~~~~~~~~~~

To stem a single word using atoms:

::

   | ?- porter_stemmer(atom)::stem(running, Stem).
   Stem = run
   yes

To stem a list of words:

::

   | ?- porter_stemmer(atom)::stems([running, walks, easily], Stems).
   Stems = [run, walk, easili]
   yes

Using character lists:

::

   | ?- porter_stemmer(chars)::stem([r,u,n,n,i,n,g], Stem).
   Stem = [r,u,n]
   yes
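
Using character code lists should work the same way. The query below is
a sketch that assumes the ``codes`` parameter mirrors the ``chars`` one.
It uses the standard ``atom_codes/2`` built-in predicate to convert to
and from atoms so that the result is easier to read:

::

   | ?- atom_codes(running, Codes),
        porter_stemmer(codes)::stem(Codes, StemCodes),
        atom_codes(Stem, StemCodes).

Given the atom and character list results above, ``Stem`` is expected to
be bound to ``run``.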

Lovins Stemmer
~~~~~~~~~~~~~~

To stem a single word using atoms:

::

   | ?- lovins_stemmer(atom)::stem(running, Stem).
   Stem = run
   yes

To stem a list of words:

::

   | ?- lovins_stemmer(atom)::stems([running, walks, easily], Stems).
   Stems = [run, walk, eas]
   yes

Algorithms
----------

.. _porter-stemmer-1:

Porter Stemmer
~~~~~~~~~~~~~~

The Porter stemming algorithm, developed by Martin Porter in 1980, is
one of the most widely used stemming algorithms for the English
language. It operates through a series of steps that progressively
remove suffixes from words:

1. **Step 1a**: Handle plurals (e.g., "caresses" → "caress", "ponies" →
   "poni")
2. **Step 1b**: Handle past tense and progressive forms (e.g., "agreed"
   → "agree")
3. **Step 1c**: Replace terminal "y" with "i" when the rest of the word
   contains a vowel (e.g., "happy" → "happi", while "sky" is unchanged)
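4. **Steps 2-4**: Replace or remove longer suffixes, subject to
   conditions on the "measure" of the remaining stem (e.g., "relational"
   → "relate" in step 2)
5. **Step 5**: Remove a final "e" and reduce a final double "l" when the
   measure of the stem allows it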

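The algorithm uses the concept of "measure" (m) to decide when a suffix
can be safely removed. Any word or stem can be written in the form
``[C](VC)^m[V]``, where ``C`` and ``V`` stand for maximal runs of
consonants and vowels, and m is the number of ``VC`` repetitions. For
example, "tree" has m = 0, "trouble" has m = 1, and "troubles" has
m = 2.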
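
The measure can be computed by counting vowel-to-consonant transitions
in a list of characters. The object below is a minimal sketch of that
idea, following the definition in Porter's paper, where "y" counts as a
vowel only when preceded by a consonant. The ``measure_sketch`` object
and its predicates are hypothetical names used for illustration; they
are not part of this library and may differ from its actual
implementation:

::

   :- object(measure_sketch).

       :- public(measure/2).

       % measure(+Chars, -M): M is the number of vowel-to-consonant
       % transitions in Chars, i.e. the m in the form [C](VC)^m[V]
       measure(Chars, M) :-
           classify(Chars, none, Classes),
           transitions(Classes, 0, M).

       % classify each character as vowel or consonant, remembering
       % the class of the previous character (none at the word start)
       classify([], _, []).
       classify([Char| Chars], Previous, [Class| Classes]) :-
           class(Char, Previous, Class),
           classify(Chars, Class, Classes).

       % "y" is a vowel only when preceded by a consonant
       class(y, consonant, vowel) :- !.
       class(Char, _, vowel) :- vowel(Char), !.
       class(_, _, consonant).

       vowel(a). vowel(e). vowel(i). vowel(o). vowel(u).

       % count the positions where a vowel is followed by a consonant
       transitions([vowel, consonant| Classes], M0, M) :-
           !,
           M1 is M0 + 1,
           transitions([consonant| Classes], M1, M).
       transitions([_| Classes], M0, M) :-
           transitions(Classes, M0, M).
       transitions([], M, M).

   :- end_object.

After saving and loading the object from a source file, it can be
queried with a list of characters:

::

   | ?- measure_sketch::measure([t,r,o,u,b,l,e], M).
   M = 1
   yes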

**Reference**: Porter, M.F. (1980). An algorithm for suffix stripping.
Program, 14(3), 130-137.

.. _lovins-stemmer-1:

Lovins Stemmer
~~~~~~~~~~~~~~

The Lovins stemming algorithm, developed by Julie Beth Lovins in 1968,
was one of the earliest stemming algorithms. It takes a different
approach from Porter:

1. **Ending removal**: The algorithm maintains a list of 294 possible
   endings, ordered by length. It removes the longest matching ending
   that satisfies its associated condition (e.g., minimum stem length).

2. **Transformation rules**: After removing the ending, spelling
   transformations are applied to fix common irregularities (e.g., "iev"
   → "ief", "uct" → "uc").
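
The two phases above can be sketched with a toy object. This is only an
illustration of the control flow: the ending table is a tiny, made-up
stand-in for the real list of 294 endings, every condition is reduced to
a minimum stem length, and only the two transformations mentioned above
are included. The ``lovins_sketch`` object and its predicates are
hypothetical names that are not part of this library:

::

   :- object(lovins_sketch).

       :- public(stem/2).

       % stem(+Word, -Stem): both arguments are atoms
       stem(Word, Stem) :-
           strip_ending(Word, Stripped),
           fix_spelling(Stripped, Stem).

       % toy table, longest endings first (NOT the real Lovins list):
       % ending(Ending, MinimumStemLength)
       ending(ations, 3).
       ending(ation, 3).
       ending(ing, 3).
       ending(ed, 3).
       ending(s, 3).

       % phase 1: remove the first (and thus longest) matching ending
       % whose condition holds; otherwise keep the word as is
       strip_ending(Word, Stripped) :-
           ending(Ending, Minimum),
           atom_concat(Stripped, Ending, Word),
           atom_length(Stripped, Length),
           Length >= Minimum,
           !.
       strip_ending(Word, Word).

       % toy table: transformation(OldTail, NewTail)
       transformation(iev, ief).
       transformation(uct, uc).

       % phase 2: rewrite the tail of the stripped word if a
       % transformation applies; otherwise keep it as is
       fix_spelling(Stripped, Stem) :-
           transformation(Old, New),
           atom_concat(Prefix, Old, Stripped),
           !,
           atom_concat(Prefix, New, Stem).
       fix_spelling(Stem, Stem).

   :- end_object.

With this toy table, "believing" loses "ing" and the remaining "believ"
is then rewritten, while "sing" is left alone because removing "ing"
would leave a stem that is too short:

::

   | ?- lovins_sketch::stem(believing, Stem).
   Stem = belief
   yes

   | ?- lovins_sketch::stem(sing, Stem).
   Stem = sing
   yes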

The Lovins algorithm tends to be more aggressive than Porter, sometimes
producing stems that are not actual words but are consistent across
related word forms.

**Reference**: Lovins, J.B. (1968). Development of a stemming algorithm.
Mechanical Translation and Computational Linguistics, 11(1-2), 22-31.

Choosing an Algorithm
---------------------

- **Porter**: More conservative, produces more readable stems, widely
  used in information retrieval and search applications. Good choice for
  most applications.

- **Lovins**: More aggressive, may conflate more word forms together.
  Can be useful when broader matching is desired, but may over-stem in
  some cases.

Both algorithms are designed for English text only.
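
To compare the two stemmers on the same input, both can be called in a
single query. The bindings below simply combine the results already
shown in the Usage section; the exact output format depends on the
backend Prolog compiler:

::

   | ?- Words = [running, walks, easily],
        porter_stemmer(atom)::stems(Words, Porter),
        lovins_stemmer(atom)::stems(Words, Lovins).
   Words = [running, walks, easily]
   Porter = [run, walk, easili]
   Lovins = [run, walk, eas]
   yes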

Known Limitations
-----------------

- Both algorithms work only with English words.
- Stemming is not lemmatization: stems may not be valid dictionary
  words (e.g., the Porter stem "easili" for "easily").
- Proper nouns and abbreviations may not be handled correctly.
- Very short words (1-2 characters) are returned unchanged.
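
The short-word behavior can be checked with a query such as the
following (a sketch; the words are chosen arbitrarily):

::

   | ?- porter_stemmer(atom)::stems([a, is, at], Stems).

Per the last limitation listed above, ``Stems`` should be bound to a
list with the same three words.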
