Package org.coordinatekit.crf.mallet.train
@NullMarked
package org.coordinatekit.crf.mallet.train
MALLET-based CRF training implementation.
This package trains CRF models using the MALLET (MAchine Learning for LanguagE Toolkit) library.
Example Usage
// Configure model checkpoint output
ModelOutputConfiguration modelOutput = ModelOutputConfiguration.builder()
    .outputDirectory(Path.of("checkpoints"))
    .iterationInterval(50)
    .build();

// Configure CoNLL evaluation output
ConllOutputConfiguration conllOutput = ConllOutputConfiguration.builder()
    .outputDirectory(Path.of("evaluations"))
    .iterationInterval(50)
    .build();

// Configure training parameters
MalletCrfTrainerConfiguration config = MalletCrfTrainerConfiguration.builder()
    .iterations(500)
    .gaussianVariance(10.0)
    .threads(8)
    .trainingFraction(0.8)
    .modelOutputConfiguration(modelOutput)
    .conllOutputConfiguration(conllOutput)
    .build();

// Create the trainer
// (featureExtractor, tagProvider, and trainingDataSequencer are assumed to be defined elsewhere)
MalletCrfTrainer<String, MyTag> trainer = new MalletCrfTrainer<>(
    featureExtractor,
    tagProvider,
    trainingDataSequencer,
    config);

// Train and save the model
trainer.train(Path.of("training-data.xml"), Path.of("model.ser"));
Configuration Options
Key parameters in MalletCrfTrainerConfiguration:
iterations - Maximum training iterations (default: 500)
gaussianVariance - L2 regularization strength; higher values mean less regularization (default: 10.0)
threads - Number of threads for parallel training (default: 6)
trainingFraction - Fraction of data used for training rather than testing (default: 0.5)
weightsType - Memory/speed trade-off for weight storage (default: SOME_DENSE)
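To illustrate why a higher gaussianVariance means less regularization, here is a minimal sketch of the penalty implied by a zero-mean Gaussian prior, sum(w^2) / (2 * variance). This assumes the conventional formulation of the prior; the class and method names are hypothetical and are not part of this package.

```java
public class GaussianPriorSketch {

    /**
     * Penalty implied by a zero-mean Gaussian prior with the given variance:
     * sum(w^2) / (2 * variance). Hypothetical helper for illustration only.
     */
    static double penalty(double[] weights, double variance) {
        double sumOfSquares = 0.0;
        for (double w : weights) {
            sumOfSquares += w * w;
        }
        return sumOfSquares / (2.0 * variance);
    }

    public static void main(String[] args) {
        double[] weights = {1.0, -2.0, 3.0}; // illustrative weights, not real model output

        // The same weights are penalized ten times less when the variance is ten times larger.
        System.out.println("variance 1.0:  " + penalty(weights, 1.0));
        System.out.println("variance 10.0: " + penalty(weights, 10.0));
    }
}
```

With the default of 10.0, the weights above incur a tenth of the penalty they would at a variance of 1.0, so the model is freer to fit the training data.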
Parameters in ModelOutputConfiguration for saving model checkpoints:
outputDirectory - Directory for checkpoint files
iterationInterval - Save model every N iterations (default: 10)
filePrefix - Prefix for checkpoint filenames (default: "model_")
fileSuffix - Suffix/extension for checkpoint files (default: ".ser")
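The following sketch shows how these four settings could combine into a set of checkpoint files. It assumes that a checkpoint is written whenever the iteration number is a multiple of iterationInterval and that the filename is filePrefix + iteration + fileSuffix; neither detail is stated in this page, and the helper class is hypothetical.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CheckpointNamingSketch {

    /**
     * Lists the checkpoint paths that would be produced under the assumed
     * scheme: one file per multiple of the interval, named prefix + iteration + suffix.
     */
    static List<Path> checkpointPaths(
            Path outputDirectory, String filePrefix, String fileSuffix,
            int iterationInterval, int maxIterations) {
        List<Path> paths = new ArrayList<>();
        for (int i = iterationInterval; i <= maxIterations; i += iterationInterval) {
            paths.add(outputDirectory.resolve(filePrefix + i + fileSuffix));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Values taken from the example usage: interval 50 over 500 iterations.
        List<Path> paths = checkpointPaths(Path.of("checkpoints"), "model_", ".ser", 50, 500);
        paths.forEach(System.out::println);
    }
}
```

Under these assumptions, an interval of 50 over 500 iterations yields ten files, whereas the default interval of 10 would yield fifty, which is worth considering when models are large.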
Parameters in ConllOutputConfiguration for CoNLL-format evaluation output:
outputDirectory - Directory for CoNLL output files
iterationInterval - Write predictions every N iterations (default: 10)
filePrefix - Prefix for output filenames (default: "output_iter")
fileSuffix - Suffix/extension for output files (default: ".conll")
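For readers unfamiliar with CoNLL-style evaluation files, the sketch below formats one labeled sequence. It assumes the common layout of one token per line with space-separated columns (token, actual label, predicted label) and a blank line after each sequence; the exact columns written by this package are not specified here, and the class, tokens, and labels are hypothetical.

```java
import java.util.List;

public class ConllFormatSketch {

    /**
     * Formats one sequence as "token actual predicted" lines followed by a
     * blank line. The column order is an assumption for illustration.
     */
    static String formatSequence(List<String> tokens, List<String> actual, List<String> predicted) {
        if (tokens.size() != actual.size() || tokens.size() != predicted.size()) {
            throw new IllegalArgumentException("tokens and labels must have the same length");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            out.append(tokens.get(i)).append(' ')
               .append(actual.get(i)).append(' ')
               .append(predicted.get(i)).append('\n');
        }
        out.append('\n'); // blank line marks the end of the sequence
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative data only: the second token is mislabeled by the "model".
        System.out.print(formatSequence(
                List.of("10", "Main"),
                List.of("NUMBER", "STREET"),
                List.of("NUMBER", "NUMBER")));
    }
}
```

Keeping actual and predicted labels side by side makes it easy to spot per-token errors or to feed the file to an external scoring script.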
Class - Description
A composite evaluator that logs iteration statistics along with both instance and token accuracy metrics on test data.
Configuration settings for ConllOutputEvaluator.
Builder for constructing ConllOutputConfiguration instances.
An evaluator that outputs predicted and actual labels in CoNLL format.
MalletCrfTrainer<F,T extends Comparable<T>> - A CRF trainer implementation using the MALLET (MAchine Learning for LanguagE Toolkit) library.
Configuration settings for MalletCrfTrainer.
Builder for constructing MalletCrfTrainerConfiguration instances.
Configuration settings for ModelOutputEvaluator.
Builder for constructing ModelOutputConfiguration instances.
An evaluator that writes the current transducer (model) to file using Java serialization.
A container for training and test data splits.
A container for training and test data splits.
Specifies the weight storage strategy for CRF training.