Class MalletCrfTrainer<F,T extends Comparable<T>>
- Type Parameters:
F - the type of features produced by the feature extractor
T - the type of tags used for sequence labeling
- All Implemented Interfaces:
CrfTrainer
Trains Conditional Random Field (CRF) models using MALLET's
CRFTrainerByThreadedLabelLikelihood, which performs multithreaded training with L-BFGS optimization.
The trainer supports configurable regularization, threading, and weight storage
strategies.
Example usage:
FeatureExtractor<String> extractor = ...;
TagProvider<String> tagProvider = new StringTagProvider("O");
TrainingDataSequencer<String> sequencer = new XmlTrainingDataSequencer<>(tagProvider);
MalletCrfTrainer<String, String> trainer = new MalletCrfTrainer<>(
    extractor, tagProvider, sequencer
);
trainer.train(Path.of("training.xml"), Path.of("model.ser"));
-
Field Summary
Fields
Modifier and Type | Field | Description
protected final MalletCrfTrainerConfiguration | configuration | The configuration parameters controlling the training process.
protected final FeatureExtractor<F> | featureExtractor | The feature extractor for converting tokens to feature sets during training.
protected final TagProvider<T> | tagProvider | The tag provider defining available tags and their encoding/decoding.
protected final TrainingDataSequencer<T> | trainingDataSequencer | The sequencer for reading training data from files into training sequences.
-
Constructor Summary
Constructors
Constructor | Description
MalletCrfTrainer(FeatureExtractor<F> featureExtractor, TagProvider<T> tagProvider, TrainingDataSequencer<T> trainingDataSequencer) | Creates a new trainer with the specified components and default configuration.
MalletCrfTrainer(FeatureExtractor<F> featureExtractor, TagProvider<T> tagProvider, TrainingDataSequencer<T> trainingDataSequencer, MalletCrfTrainerConfiguration config) | Creates a new trainer with the specified components and configuration.
-
Method Summary
Modifier and Type | Method | Description
protected cc.mallet.fst.CRF | createCrf(cc.mallet.types.InstanceList training) | Creates a CRF model initialized with states derived from the training data.
protected cc.mallet.fst.CRFTrainerByThreadedLabelLikelihood | createCrfTrainer(cc.mallet.fst.CRF crf) | Creates and configures a threaded CRF trainer for the given model.
protected cc.mallet.types.Instance | mapSequenceToInstance(cc.mallet.types.Alphabet dataAlphabet, cc.mallet.types.LabelAlphabet targetAlphabet, TrainingSequence<T> trainingSequence) | Converts a training sequence into a MALLET Instance.
protected TrainingTestSplit | splitTrainingData(Collection<Path> trainingPaths) | Reads training data and splits it into training and test sets.
void | train(Collection<Path> trainingPaths, Path modelPath) | Trains a CRF model using the training data at the specified paths and saves the model to the specified output path.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.coordinatekit.crf.core.train.CrfTrainer
train
-
Field Details
-
featureExtractor
The feature extractor for converting tokens to feature sets during training. -
tagProvider
The tag provider defining available tags and their encoding/decoding. -
trainingDataSequencer
The sequencer for reading training data from files into training sequences. -
configuration
The configuration parameters controlling the training process.
-
-
Constructor Details
-
MalletCrfTrainer
public MalletCrfTrainer(FeatureExtractor<F> featureExtractor, TagProvider<T> tagProvider, TrainingDataSequencer<T> trainingDataSequencer)
Creates a new trainer with the specified components and default configuration.
- Parameters:
featureExtractor - the feature extractor for converting tokens to feature sets
tagProvider - the tag provider defining available tags and encoding
trainingDataSequencer - the sequencer for reading training data
-
MalletCrfTrainer
public MalletCrfTrainer(FeatureExtractor<F> featureExtractor, TagProvider<T> tagProvider, TrainingDataSequencer<T> trainingDataSequencer, MalletCrfTrainerConfiguration config)
Creates a new trainer with the specified components and configuration.
- Parameters:
featureExtractor - the feature extractor for converting tokens to feature sets
tagProvider - the tag provider defining available tags and encoding
trainingDataSequencer - the sequencer for reading training data
config - the training configuration parameters
-
-
Method Details
-
createCrf
protected cc.mallet.fst.CRF createCrf(cc.mallet.types.InstanceList training)
Creates a CRF model initialized with states derived from the training data.
The CRF is configured with order-1 states (bigram label dependencies) using the tag provider's starting tag as the initial state. All states except the start state are initialized with an impossible weight, ensuring that sequences begin from the designated start state.
- Parameters:
training - the training instances used to initialize the CRF structure
- Returns:
a new CRF model ready for training
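The start-state rule described for createCrf can be illustrated without MALLET. The sketch below is only a hypothetical stand-in: the class `StartStateSketch`, its methods, and the `IMPOSSIBLE` constant are invented for illustration, and it assumes log-space weights in which an impossible state is represented by negative infinity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of start-state initialization; not the library's code. */
final class StartStateSketch {

    /** Assumed log-space marker for a state that can never begin a sequence. */
    static final double IMPOSSIBLE = Double.NEGATIVE_INFINITY;

    /** Gives the start tag an initial weight of 0.0 and every other tag IMPOSSIBLE. */
    static Map<String, Double> initialWeights(List<String> tags, String startTag) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String tag : tags) {
            weights.put(tag, tag.equals(startTag) ? 0.0 : IMPOSSIBLE);
        }
        return weights;
    }

    /** A sequence can begin in a state only if exp(weight) is positive. */
    static boolean canStartIn(Map<String, Double> weights, String tag) {
        return Math.exp(weights.getOrDefault(tag, IMPOSSIBLE)) > 0.0;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = initialWeights(List.of("O", "B-LOC", "I-LOC"), "O");
        System.out.println(weights);
    }
}
```

Because exp(negative infinity) is zero, any path that begins outside the start tag contributes nothing to the model's score, which is one way to read "sequences begin from the designated start state".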
-
createCrfTrainer
protected cc.mallet.fst.CRFTrainerByThreadedLabelLikelihood createCrfTrainer(cc.mallet.fst.CRF crf)
Creates and configures a threaded CRF trainer for the given model.
The trainer is configured with parameters from the current configuration, including the number of threads, Gaussian prior variance for L2 regularization, and weight storage strategy.
- Parameters:
crf - the CRF model to train
- Returns:
a configured trainer ready to optimize the model
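To make the link between the Gaussian prior variance and L2 regularization concrete, here is a minimal sketch of the standard penalty term: a zero-mean Gaussian prior with variance σ² on each weight subtracts Σ w² / (2σ²) from the log-likelihood. The class and method names are hypothetical, and this is not the trainer's internal code.

```java
/** Hypothetical illustration of a Gaussian prior acting as an L2 penalty. */
final class GaussianPriorSketch {

    /** Returns sum(w_k^2) / (2 * variance); a larger variance means a weaker penalty. */
    static double penalty(double[] weights, double variance) {
        double sumSquares = 0.0;
        for (double w : weights) {
            sumSquares += w * w;
        }
        return sumSquares / (2.0 * variance);
    }

    /** The regularized objective an optimizer would maximize. */
    static double penalizedObjective(double logLikelihood, double[] weights, double variance) {
        return logLikelihood - penalty(weights, variance);
    }

    public static void main(String[] args) {
        double[] weights = {1.0, -2.0};
        System.out.println(penalty(weights, 10.0));
        System.out.println(penalty(weights, 1.0));
    }
}
```

Lowering the variance therefore pulls the weights more strongly toward zero, while a very large variance approaches unregularized training.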
-
mapSequenceToInstance
protected cc.mallet.types.Instance mapSequenceToInstance(cc.mallet.types.Alphabet dataAlphabet, cc.mallet.types.LabelAlphabet targetAlphabet, TrainingSequence<T> trainingSequence)
Converts a training sequence into a MALLET Instance.
This method extracts features from each token in the sequence using the configured feature extractor, then constructs a FeatureVectorSequence for the input data and a LabelSequence for the target labels. Features and labels are registered in the provided alphabets.
- Parameters:
dataAlphabet - the alphabet for mapping feature names to indices
targetAlphabet - the alphabet for mapping label names to indices
trainingSequence - the training sequence to convert
- Returns:
a MALLET instance containing feature vectors and label sequence
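The role of the two alphabets can be sketched with a simplified, hypothetical stand-in. `AlphabetSketch` below is invented for illustration and is far smaller than MALLET's own alphabet classes; it only shows the idea of registering names on first sight and reusing their indices afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified alphabet that maps names to dense indices. */
final class AlphabetSketch {
    private final Map<String, Integer> indices = new LinkedHashMap<>();

    /** Returns the index of an entry, registering it the first time it is seen. */
    int lookupIndex(String entry) {
        Integer existing = indices.get(entry);
        if (existing != null) {
            return existing;
        }
        int next = indices.size();
        indices.put(entry, next);
        return next;
    }

    int size() {
        return indices.size();
    }

    /** Encodes each token's feature names as indices in the data alphabet. */
    static List<List<Integer>> encodeFeatures(AlphabetSketch dataAlphabet, List<List<String>> tokenFeatures) {
        List<List<Integer>> encoded = new ArrayList<>();
        for (List<String> features : tokenFeatures) {
            List<Integer> row = new ArrayList<>();
            for (String feature : features) {
                row.add(dataAlphabet.lookupIndex(feature));
            }
            encoded.add(row);
        }
        return encoded;
    }

    /** Encodes one label per token as indices in the target alphabet. */
    static List<Integer> encodeLabels(AlphabetSketch targetAlphabet, List<String> labels) {
        List<Integer> encoded = new ArrayList<>();
        for (String label : labels) {
            encoded.add(targetAlphabet.lookupIndex(label));
        }
        return encoded;
    }
}
```

Sharing one data alphabet and one target alphabet across all sequences keeps a given feature or label on the same index throughout the training set.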
-
splitTrainingData
protected TrainingTestSplit splitTrainingData(Collection<Path> trainingPaths) throws IOException
Reads training data and splits it into training and test sets.
This method reads sequences from the specified paths using the configured TrainingDataSequencer, converts each sequence to a MALLET Instance, and splits the resulting data according to MalletCrfTrainerConfiguration.trainingFraction().
The split is performed using MalletCrfTrainerConfiguration.randomSeed() for reproducibility. If the training fraction is 1.0 or greater, all data is placed in the training set and the test set will be empty.
- Parameters:
trainingPaths - the paths to the training data files
- Returns:
a TrainingTestSplit containing the partitioned data
- Throws:
IOException - if an error occurs reading the training data
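A seeded split of the kind described for splitTrainingData could look roughly like the sketch below. It is a hypothetical stand-in that uses only the JDK: the class `SeededSplitSketch`, the nested `Split` holder, and the shuffle-then-cut strategy are assumptions for illustration, and the library may partition its data differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for a seeded train/test split; not the library's code. */
final class SeededSplitSketch {

    /** Holds the two partitions produced by a split. */
    static final class Split<E> {
        final List<E> training;
        final List<E> test;

        Split(List<E> training, List<E> test) {
            this.training = training;
            this.test = test;
        }
    }

    /**
     * Shuffles a copy of the items with a seeded generator and cuts it at the
     * training fraction. A fraction of 1.0 or more keeps every item in the
     * training partition and leaves the test partition empty.
     */
    static <E> Split<E> split(List<E> items, double trainingFraction, long randomSeed) {
        List<E> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(randomSeed));
        if (trainingFraction >= 1.0) {
            return new Split<>(shuffled, new ArrayList<>());
        }
        // Clamp at zero so a negative fraction cannot yield a negative cut index.
        int cut = (int) Math.round(Math.max(0.0, trainingFraction) * shuffled.size());
        return new Split<>(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<String> sequences = List.of("s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10");
        Split<String> result = split(sequences, 0.8, 42L);
        System.out.println("training=" + result.training.size() + ", test=" + result.test.size());
    }
}
```

Reusing the same seed on the same input reproduces the same partition, which is the property the random seed setting is meant to provide.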
-
train
public void train(Collection<Path> trainingPaths, Path modelPath) throws IOException
Description copied from interface: CrfTrainer
Trains a CRF model using the training data at the specified paths and saves the model to the specified output path.
- Specified by:
train in interface CrfTrainer
- Parameters:
trainingPaths - the paths to the training data files
modelPath - the path where the trained model should be saved
- Throws:
IOException - if an error occurs during training or model serialization
-