Package org.coordinatekit.crf.mallet.train


@NullMarked package org.coordinatekit.crf.mallet.train
MALLET-based CRF training implementation.

This package trains conditional random field (CRF) models using the MALLET (MAchine Learning for LanguagE Toolkit) library.

Example Usage

 
 // Configure model checkpoint output
 ModelOutputConfiguration modelOutput = ModelOutputConfiguration.builder()
         .outputDirectory(Path.of("checkpoints"))
         .iterationInterval(50)
         .build();

 // Configure CoNLL evaluation output
 ConllOutputConfiguration conllOutput = ConllOutputConfiguration.builder()
         .outputDirectory(Path.of("evaluations"))
         .iterationInterval(50)
         .build();

 // Configure training parameters
 MalletCrfTrainerConfiguration config = MalletCrfTrainerConfiguration.builder()
         .iterations(500)
         .gaussianVariance(10.0)
         .threads(8)
         .trainingFraction(0.8)
         .modelOutputConfiguration(modelOutput)
         .conllOutputConfiguration(conllOutput)
         .build();

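 // Create the trainer (featureExtractor, tagProvider, and trainingDataSequencer
 // are assumed to have been constructed earlier)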
 MalletCrfTrainer<String, MyTag> trainer = new MalletCrfTrainer<>(
         featureExtractor,
         tagProvider,
         trainingDataSequencer,
         config);

 // Train and save the model
 trainer.train(Path.of("training-data.xml"), Path.of("model.ser"));
 
 

Configuration Options

Key parameters in MalletCrfTrainerConfiguration:

  • iterations - Maximum training iterations (default: 500)
  • gaussianVariance - Variance of the Gaussian prior used for L2 regularization; higher values mean weaker regularization (default: 10.0)
  • threads - Number of threads for parallel training (default: 6)
  • trainingFraction - Fraction of the data used for training; the remainder is held out for testing (default: 0.5)
  • weightsType - Memory/speed trade-off for weight storage (default: SOME_DENSE)
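
To make the direction of gaussianVariance concrete, here is a minimal sketch of the L2 penalty implied by a zero-mean Gaussian prior, sum(w^2) / (2 * variance). The class and method names are hypothetical and not part of this package; the sketch only illustrates why a larger variance penalizes the same weights less.

```java
final class GaussianPriorSketch {

    /** L2 penalty implied by a zero-mean Gaussian prior: sum(w^2) / (2 * variance). */
    static double penalty(double[] weights, double variance) {
        if (variance <= 0.0) {
            throw new IllegalArgumentException("variance must be positive");
        }
        double sumOfSquares = 0.0;
        for (double w : weights) {
            sumOfSquares += w * w;
        }
        return sumOfSquares / (2.0 * variance);
    }

    public static void main(String[] args) {
        double[] weights = {1.0, -2.0, 0.5}; // illustrative values only
        // The same weights incur a tenfold smaller penalty when the variance is ten times larger.
        System.out.println("variance 1.0:  " + penalty(weights, 1.0));
        System.out.println("variance 10.0: " + penalty(weights, 10.0));
    }
}
```

With the weights fixed, raising the variance from 1.0 to 10.0 divides the penalty by ten, which is why higher values correspond to weaker regularization.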

Parameters in ModelOutputConfiguration for saving model checkpoints:

  • outputDirectory - Directory for checkpoint files
  • iterationInterval - Save model every N iterations (default: 10)
  • filePrefix - Prefix for checkpoint filenames (default: "model_")
  • fileSuffix - Suffix/extension for checkpoint files (default: ".ser")
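
As a rough illustration of how these four parameters interact, the sketch below lists the checkpoint paths a run might produce. It assumes that a checkpoint is written whenever the iteration number is a multiple of iterationInterval and that filenames are formed as prefix + iteration + suffix; neither assumption is confirmed by this page, and CheckpointScheduleSketch is a hypothetical helper, not part of the package.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class CheckpointScheduleSketch {

    /**
     * Lists the paths for every interval-th iteration up to maxIterations.
     * Assumed naming scheme: prefix + iteration + suffix (not confirmed by the library).
     */
    static List<Path> checkpointPaths(
            Path outputDirectory, String prefix, String suffix, int interval, int maxIterations) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<Path> paths = new ArrayList<>();
        for (int iteration = interval; iteration <= maxIterations; iteration += interval) {
            paths.add(outputDirectory.resolve(prefix + iteration + suffix));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Mirrors the example configuration: interval 50, at most 500 iterations.
        for (Path path : checkpointPaths(Path.of("checkpoints"), "model_", ".ser", 50, 500)) {
            System.out.println(path);
        }
    }
}
```

Under these assumptions, the example configuration above (interval 50, 500 iterations) would yield ten checkpoint files. Training may stop before the maximum iteration count, in which case fewer files would be written.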

Parameters in ConllOutputConfiguration for CoNLL-format evaluation output:

  • outputDirectory - Directory for CoNLL output files
  • iterationInterval - Write predictions every N iterations (default: 10)
  • filePrefix - Prefix for output filenames (default: "output_iter")
  • fileSuffix - Suffix/extension for output files (default: ".conll")
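
For readers unfamiliar with CoNLL-style evaluation files, the sketch below formats one tagged sequence in the column layout commonly consumed by conlleval-style scripts: one token per line followed by the gold tag and the predicted tag, with a blank line ending the sequence. The exact layout this package writes is not documented here, so treat the column order as an assumption; ConllFormatSketch and the tag names are hypothetical.

```java
import java.util.List;

final class ConllFormatSketch {

    /**
     * Formats one sequence as "token gold predicted" lines followed by a blank line.
     * Assumed column order; the library's actual output may differ.
     */
    static String format(List<String> tokens, List<String> gold, List<String> predicted) {
        if (tokens.size() != gold.size() || tokens.size() != predicted.size()) {
            throw new IllegalArgumentException("all columns must have the same length");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            out.append(tokens.get(i)).append(' ')
               .append(gold.get(i)).append(' ')
               .append(predicted.get(i)).append('\n');
        }
        out.append('\n'); // a blank line separates sequences
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative tokens and tags only.
        System.out.print(format(
                List.of("alpha", "beta"),
                List.of("TAG_A", "TAG_B"),
                List.of("TAG_A", "TAG_A")));
    }
}
```

Comparing the second and third columns line by line is what lets an evaluation script compute per-tag precision and recall at each written iteration.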
See Also: