edu.umn.cs.nlp.mt.tools
Class ExtractWordPairs

java.lang.Object
  extended by edu.umn.cs.nlp.mt.tools.ExtractWordPairs

public class ExtractWordPairs
extends Object

Utility to extract aligned word pairs from an aligned corpus.

All input files must use Unix-style (LF) newlines.
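If a corpus was produced on a platform that writes other line endings, it can be normalized before it is handed to this utility. The following is a minimal sketch of one way to do that in memory; the class name NewlineSketch and the helper toUnixNewlines are hypothetical and are not part of this package.

```java
import java.util.Scanner;

// Hypothetical helper, not part of edu.umn.cs.nlp.mt.tools.
class NewlineSketch {

    // Converts Windows (CRLF) and bare CR line endings to Unix (LF).
    static String toUnixNewlines(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windowsText = "das haus\r\nist klein\r\n";
        // A Scanner built on the normalized text sees only LF line endings.
        Scanner scanner = new Scanner(toUnixNewlines(windowsText));
        while (scanner.hasNextLine()) {
            System.out.println("[" + scanner.nextLine() + "]");
        }
    }
}
```

For large files, the same conversion is usually done once on disk (for example with a tool such as dos2unix) rather than in memory.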

Version:
2007-11-14 09:28:40 -0600 (Wed, 14 Nov 2007)
Author:
Lane Schwartz
See Also:
Section 4.4 of "Statistical Phrase-Based Translation" by Philipp Koehn, Franz Josef Och, & Daniel Marcu (HLT-NAACL, 2003)

Field Summary
static String UNALIGNED_MARKER
          Special marker to use with unaligned words
 
Constructor Summary
ExtractWordPairs()
           
 
Method Summary
static void extract(int number_of_lines, Scanner source_text, Scanner target_text, Scanner alignments, Writer outputFile)
          Extract aligned word pairs from an aligned corpus.
static void main(String[] args)
          Utility to extract aligned word pairs from an aligned corpus.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNALIGNED_MARKER

public static final String UNALIGNED_MARKER
Special marker to use with unaligned words

See Also:
Constant Field Values
Constructor Detail

ExtractWordPairs

public ExtractWordPairs()
Method Detail

extract

public static void extract(int number_of_lines,
                           Scanner source_text,
                           Scanner target_text,
                           Scanner alignments,
                           Writer outputFile)
                    throws IOException
Extract aligned word pairs from an aligned corpus.

This method performs no case conversion. All input must already be in the desired case (for example, lowercased in advance).

NOTE: The scanners provided for source text, target text, and alignments must all be backed by data that uses Unix-style newlines.

Parameters:
number_of_lines - The number of lines to process from the aligned corpus.
source_text - Scanner backed by the source language text
target_text - Scanner backed by the target language text
alignments - Scanner backed by the sentence alignment data
outputFile - Writer to use when producing output results
Throws:
IOException - Thrown if an I/O error occurs when writing results
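
The general idea of extracting aligned word pairs can be illustrated with a small, self-contained sketch. It is not this class's implementation, and several details are assumptions made only for illustration: the alignment format (whitespace-separated, zero-based "sourceIndex-targetIndex" links, one sentence per line), the output format (one "source target" pair per line), and the marker string "NULL", which stands in for UNALIGNED_MARKER because its actual value is not given here. The class WordPairSketch and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Scanner;

// Hypothetical illustration; not the implementation of ExtractWordPairs.
class WordPairSketch {

    // Assumed placeholder; the real UNALIGNED_MARKER value is not documented here.
    static final String UNALIGNED = "NULL";

    // Reads up to numberOfLines parallel lines and writes one word pair per line.
    // Assumed alignment format: zero-based "srcIndex-tgtIndex" links separated by spaces.
    static void extractPairs(int numberOfLines, Scanner source, Scanner target,
                             Scanner alignments, Writer out) throws IOException {
        for (int line = 0; line < numberOfLines && source.hasNextLine()
                && target.hasNextLine() && alignments.hasNextLine(); line++) {
            String[] src = splitWords(source.nextLine());
            String[] tgt = splitWords(target.nextLine());
            boolean[] srcAligned = new boolean[src.length];
            boolean[] tgtAligned = new boolean[tgt.length];

            // Emit every aligned pair and remember which words took part in a link.
            for (String link : splitWords(alignments.nextLine())) {
                int dash = link.indexOf('-');
                int s = Integer.parseInt(link.substring(0, dash));
                int t = Integer.parseInt(link.substring(dash + 1));
                srcAligned[s] = true;
                tgtAligned[t] = true;
                out.write(src[s] + " " + tgt[t] + "\n");
            }
            // Pair each unaligned word with the marker.
            for (int s = 0; s < src.length; s++) {
                if (!srcAligned[s]) out.write(src[s] + " " + UNALIGNED + "\n");
            }
            for (int t = 0; t < tgt.length; t++) {
                if (!tgtAligned[t]) out.write(UNALIGNED + " " + tgt[t] + "\n");
            }
        }
    }

    // Splits a line on whitespace; an empty line yields no words.
    static String[] splitWords(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        extractPairs(1,
                new Scanner("das haus ist klein\n"),
                new Scanner("the house is small indeed\n"),
                new Scanner("0-0 1-1 3-3\n"),
                out);
        System.out.print(out);
    }
}
```

In this sketch the parameter order mirrors the extract signature above (line count, source, target, alignments, writer), so the in-memory Scanner and StringWriter setup in main also suggests how the real method might be exercised in a unit test.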

main

public static void main(String[] args)
Utility to extract aligned word pairs from an aligned corpus.

Parameters:
args - Command line arguments