Class MalletCrfTrainer<F,T extends Comparable<T>>

java.lang.Object
org.coordinatekit.crf.mallet.train.MalletCrfTrainer<F,T>
Type Parameters:
F - the type of features produced by the feature extractor
T - the type of tags used for sequence labeling
All Implemented Interfaces:
CrfTrainer

@NullMarked public class MalletCrfTrainer<F,T extends Comparable<T>> extends Object implements CrfTrainer
A CRF trainer implementation using the MALLET (MAchine Learning for LanguagE Toolkit) library.

This class trains Conditional Random Field models using MALLET's CRFTrainerByThreadedLabelLikelihood, which performs multithreaded training with L-BFGS optimization. Regularization, the number of threads, and the weight storage strategy are configurable.

Example usage:

 
 FeatureExtractor<String> extractor = ...;
 TagProvider<String> tagProvider = new StringTagProvider("O");
 TrainingDataSequencer<String> sequencer = new XmlTrainingDataSequencer<>(tagProvider);

 MalletCrfTrainer<String, String> trainer = new MalletCrfTrainer<>(
     extractor, tagProvider, sequencer
 );
 trainer.train(List.of(Path.of("training.xml")), Path.of("model.ser"));
 
 
  • Field Details

    • featureExtractor

      protected final FeatureExtractor<F> featureExtractor
      The feature extractor for converting tokens to feature sets during training.
    • tagProvider

      protected final TagProvider<T> tagProvider
      The tag provider defining available tags and their encoding/decoding.
    • trainingDataSequencer

      protected final TrainingDataSequencer<T> trainingDataSequencer
      The sequencer for reading training data from files into training sequences.
    • configuration

      protected final MalletCrfTrainerConfiguration configuration
      The configuration parameters controlling the training process.
  • Constructor Details

    • MalletCrfTrainer

      public MalletCrfTrainer(FeatureExtractor<F> featureExtractor, TagProvider<T> tagProvider, TrainingDataSequencer<T> trainingDataSequencer)
      Creates a new trainer with the specified components and default configuration.
      Parameters:
      featureExtractor - the feature extractor for converting tokens to feature sets
      tagProvider - the tag provider defining available tags and encoding
      trainingDataSequencer - the sequencer for reading training data
    • MalletCrfTrainer

      public MalletCrfTrainer(FeatureExtractor<F> featureExtractor, TagProvider<T> tagProvider, TrainingDataSequencer<T> trainingDataSequencer, MalletCrfTrainerConfiguration config)
      Creates a new trainer with the specified components and configuration.
      Parameters:
      featureExtractor - the feature extractor for converting tokens to feature sets
      tagProvider - the tag provider defining available tags and encoding
      trainingDataSequencer - the sequencer for reading training data
      config - the training configuration parameters
  • Method Details

    • createCrf

      protected cc.mallet.fst.CRF createCrf(cc.mallet.types.InstanceList training)
      Creates a CRF model initialized with states derived from the training data.

      The CRF is configured with order-1 states (bigram label dependencies), using the tag provider's starting tag as the initial state. Every state other than the start state is given an impossible initial weight, so every sequence must begin in the designated start state.

      Parameters:
      training - the training instances used to initialize the CRF structure
      Returns:
      a new CRF model ready for training
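
The start-state constraint described for createCrf can be illustrated without MALLET. The following is a minimal sketch in plain Java, assuming that weights behave like log-scores so that negative infinity means "impossible". The class StartStateSketch, its methods, and the tags B-LOC and I-LOC are hypothetical and are not part of this library or of MALLET.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical illustration; not part of MalletCrfTrainer or MALLET. */
final class StartStateSketch {
    /** Assumed convention: weights are log-scores, so negative infinity means "impossible". */
    static final double IMPOSSIBLE = Double.NEGATIVE_INFINITY;

    /** Gives the start tag an initial weight of 0.0 and every other tag an impossible one. */
    static Map<String, Double> initialWeights(List<String> tags, String startTag) {
        if (!tags.contains(startTag)) {
            throw new IllegalArgumentException("Unknown start tag: " + startTag);
        }
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String tag : tags) {
            weights.put(tag, tag.equals(startTag) ? 0.0 : IMPOSSIBLE);
        }
        return weights;
    }

    /** Returns the tags with a finite initial weight, i.e. those a sequence may begin in. */
    static List<String> allowedInitialTags(Map<String, Double> weights) {
        return weights.entrySet().stream()
                .filter(entry -> entry.getValue() > IMPOSSIBLE)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> weights = initialWeights(List.of("O", "B-LOC", "I-LOC"), "O");
        System.out.println(allowedInitialTags(weights));
    }
}
```

Because only the start tag keeps a finite initial weight, any path that begins elsewhere scores as impossible, which is the effect the method description attributes to the initialization.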
    • createCrfTrainer

      protected cc.mallet.fst.CRFTrainerByThreadedLabelLikelihood createCrfTrainer(cc.mallet.fst.CRF crf)
      Creates and configures a threaded CRF trainer for the given model.

      The trainer is configured with parameters from the current configuration, including the number of threads, Gaussian prior variance for L2 regularization, and weight storage strategy.

      Parameters:
      crf - the CRF model to train
      Returns:
      a configured trainer ready to optimize the model
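
To make the role of the Gaussian prior variance concrete: a zero-mean Gaussian prior with variance v on each weight subtracts the sum of squared weights divided by 2v from the log-likelihood, so a smaller variance means stronger L2 regularization. The sketch below shows only that arithmetic in plain Java. GaussianPriorSketch and its methods are hypothetical names and do not reproduce MALLET's internal implementation.

```java
/** Hypothetical illustration of the L2 penalty implied by a Gaussian prior. */
final class GaussianPriorSketch {
    /** Penalty term: sum of squared weights divided by (2 * variance). */
    static double penalty(double[] weights, double variance) {
        if (variance <= 0.0) {
            throw new IllegalArgumentException("variance must be positive");
        }
        double sumOfSquares = 0.0;
        for (double w : weights) {
            sumOfSquares += w * w;
        }
        return sumOfSquares / (2.0 * variance);
    }

    /** Gradient of the penalty with respect to each weight: w_k / variance. */
    static double[] penaltyGradient(double[] weights, double variance) {
        if (variance <= 0.0) {
            throw new IllegalArgumentException("variance must be positive");
        }
        double[] gradient = new double[weights.length];
        for (int k = 0; k < weights.length; k++) {
            gradient[k] = weights[k] / variance;
        }
        return gradient;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 2.0};
        // A smaller variance yields a larger penalty for the same weights.
        System.out.println(penalty(weights, 10.0));
        System.out.println(penalty(weights, 1.0));
    }
}
```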
    • mapSequenceToInstance

      protected cc.mallet.types.Instance mapSequenceToInstance(cc.mallet.types.Alphabet dataAlphabet, cc.mallet.types.LabelAlphabet targetAlphabet, TrainingSequence<T> trainingSequence)
      Converts a training sequence into a MALLET Instance.

      This method extracts features from each token in the sequence using the configured feature extractor, then constructs a FeatureVectorSequence for the input data and a LabelSequence for the target labels. Features and labels are registered in the provided alphabets.

      Parameters:
      dataAlphabet - the alphabet for mapping feature names to indices
      targetAlphabet - the alphabet for mapping label names to indices
      trainingSequence - the training sequence to convert
      Returns:
      a MALLET instance containing feature vectors and label sequence
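
The idea of registering features and labels in alphabets can be sketched in plain Java. This is a simplified stand-in, assuming an alphabet is just a growing name-to-index map; the real method builds MALLET's FeatureVectorSequence and LabelSequence instead. AlphabetSketch, its nested Alphabet class, and the sample feature strings are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; the nested Alphabet is not MALLET's Alphabet class. */
final class AlphabetSketch {
    /** Minimal stand-in for an alphabet: assigns the next free index to each unseen entry. */
    static final class Alphabet {
        private final Map<String, Integer> indices = new LinkedHashMap<>();

        int lookupIndex(String entry) {
            Integer existing = indices.get(entry);
            if (existing != null) {
                return existing;
            }
            int next = indices.size();
            indices.put(entry, next);
            return next;
        }

        int size() {
            return indices.size();
        }
    }

    /** Encodes each token's feature names as indices, registering new names as they appear. */
    static int[][] encodeFeatures(Alphabet dataAlphabet, List<List<String>> tokenFeatures) {
        int[][] encoded = new int[tokenFeatures.size()][];
        for (int t = 0; t < tokenFeatures.size(); t++) {
            List<String> features = tokenFeatures.get(t);
            encoded[t] = new int[features.size()];
            for (int f = 0; f < features.size(); f++) {
                encoded[t][f] = dataAlphabet.lookupIndex(features.get(f));
            }
        }
        return encoded;
    }

    /** Encodes one label per token as an index in the target alphabet. */
    static int[] encodeLabels(Alphabet targetAlphabet, List<String> labels) {
        int[] encoded = new int[labels.size()];
        for (int t = 0; t < labels.size(); t++) {
            encoded[t] = targetAlphabet.lookupIndex(labels.get(t));
        }
        return encoded;
    }

    public static void main(String[] args) {
        Alphabet data = new Alphabet();
        Alphabet target = new Alphabet();
        int[][] features = encodeFeatures(data, List.of(
                List.of("word=paris", "capitalized"),
                List.of("word=is")));
        int[] labels = encodeLabels(target, List.of("B-LOC", "O"));
        System.out.println(Arrays.deepToString(features) + " " + Arrays.toString(labels));
    }
}
```

Sharing one data alphabet and one target alphabet across all sequences is what keeps the indices consistent between instances.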
    • splitTrainingData

      protected TrainingTestSplit splitTrainingData(Collection<Path> trainingPaths) throws IOException
      Reads training data and splits it into training and test sets.

      This method reads sequences from the specified paths using the configured TrainingDataSequencer, converts each sequence to a MALLET Instance, and splits the resulting data according to MalletCrfTrainerConfiguration.trainingFraction().

      The split is performed using MalletCrfTrainerConfiguration.randomSeed() for reproducibility. If the training fraction is 1.0 or greater, all data is placed in the training set and the test set will be empty.

      Parameters:
      trainingPaths - the paths to the training data files
      Returns:
      a TrainingTestSplit containing the partitioned data
      Throws:
      IOException - if an error occurs reading the training data
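
A seeded split of this kind can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the method's actual implementation: it shuffles with java.util.Random and rounds the cut point to the nearest item, whereas the real method operates on MALLET instances and may partition differently. SplitSketch and its nested Split class are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of a reproducible training/test split. */
final class SplitSketch {
    /** Holds the two partitions produced by a split. */
    static final class Split<E> {
        final List<E> training;
        final List<E> test;

        Split(List<E> training, List<E> test) {
            this.training = training;
            this.test = test;
        }
    }

    /** Shuffles a copy of the items with a fixed seed, then cuts it at the training fraction. */
    static <E> Split<E> split(List<E> items, double trainingFraction, long randomSeed) {
        if (trainingFraction <= 0.0) {
            throw new IllegalArgumentException("trainingFraction must be positive");
        }
        List<E> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(randomSeed));
        if (trainingFraction >= 1.0) {
            // Everything goes to the training set; the test set stays empty.
            return new Split<>(shuffled, new ArrayList<>());
        }
        // Assumption: the cut point is rounded to the nearest whole item.
        int cut = (int) Math.round(trainingFraction * shuffled.size());
        return new Split<>(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        Split<Integer> result = split(items, 0.8, 42L);
        System.out.println("training=" + result.training.size() + " test=" + result.test.size());
    }
}
```

Running the split twice with the same seed yields the same partitions, which is the reproducibility property the description relies on.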
    • train

      public void train(Collection<Path> trainingPaths, Path modelPath) throws IOException
      Description copied from interface: CrfTrainer
      Trains a CRF model using the training data at the specified paths and saves the model to the specified output path.
      Specified by:
      train in interface CrfTrainer
      Parameters:
      trainingPaths - the paths to the training data files
      modelPath - the path where the trained model should be saved
      Throws:
      IOException - if an error occurs during training or model serialization