Data

Tokenizer

TokenizerBase

class texar.torch.data.TokenizerBase(hparams)[source]

Base class inherited by all tokenizer classes. This class handles downloading and loading pre-trained tokenizers and adding tokens to the vocabulary.

Derived classes can set up a few special tokens to be used in common scripts and internals: bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, and additional_special_tokens.

We define an added_tokens_encoder to add new tokens to the vocabulary without having to handle the specific vocabulary-augmentation methods of the various underlying dictionary structures (BPE, sentencepiece, etc.).

classmethod load(pretrained_model_path, configs=None)[source]

Instantiate a tokenizer from the vocabulary files or the saved tokenizer files.

Parameters
  • pretrained_model_path – The path to a vocabulary file or a folder that contains the saved pre-trained tokenizer files.

  • configs – Tokenizer configurations. You can overwrite the original tokenizer configurations saved in the configuration file by this dictionary.

Returns

A tokenizer instance.

save(save_dir)[source]

Save the tokenizer vocabulary files (with added tokens), the tokenizer configuration file, and a dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<unk>, <cls>, …) to a directory, so that the tokenizer can be re-loaded using load().

Parameters

save_dir – The path to a folder in which the tokenizer files will be saved.

Returns

The paths to the vocabulary file, added token file, special token mapping file, and the configuration file.

save_vocab(save_dir)[source]

Save the tokenizer vocabulary to a directory. This method does not save added tokens, special token mappings, and the configuration file.

Please use save() to save the full tokenizer state so that it can be reloaded using load().

add_tokens(new_tokens)[source]

Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to the added_tokens_encoder with indices starting from the last index of the current vocabulary.

Parameters

new_tokens – A list of new tokens.

Returns

Number of tokens added to the vocabulary which can be used to correspondingly increase the size of the associated model embedding matrices.

add_special_tokens(special_tokens_dict)[source]

Add a dictionary of special tokens to the encoder and link them to class attributes. If the special tokens are not in the vocabulary, they are added to it and indexed starting from the last index of the current vocabulary.

Parameters

special_tokens_dict – A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<unk>, <cls>, …).

Returns

Number of tokens added to the vocabulary which can be used to correspondingly increase the size of the associated model embedding matrices.
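Example (a sketch of both methods together, assuming a BERTTokenizer instance whose pre-trained files can be downloaded or are already cached; the token strings and the additional_special_tokens value format are illustrative assumptions):

import texar.torch as tx

tokenizer = tx.data.BERTTokenizer(pretrained_model_name='bert-base-uncased')

# Plain tokens are appended after the current vocabulary.
num_added = tokenizer.add_tokens(['new_tok1', 'new_tok2'])

# Special tokens are linked to class attributes; the
# `additional_special_tokens` attribute takes a list of strings
# (hypothetical tokens here).
num_special = tokenizer.add_special_tokens(
    {'additional_special_tokens': ['<speaker1>']})

# The returned counts indicate how many rows to add to the
# model's embedding matrix.
print(num_added + num_special)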

map_text_to_token(text, **kwargs)[source]

Maps a string to a sequence of tokens (strings), using the tokenizer. Splits into words for word-based vocabularies, or sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). This function also takes care of added tokens.

Parameters

text – An input string.

Returns

A list of tokens.

map_token_to_id(tokens)[source]

Maps a single token to an integer id, or a sequence of tokens to a sequence of ids, using the vocabulary.

Parameters

tokens – A single token or a list of tokens.

Returns

A single token id or a list of token ids.

map_text_to_id(text)[source]

Maps a string to a sequence of ids (integer), using the tokenizer and vocabulary. Same as self.map_token_to_id(self.map_text_to_token(text)).

Parameters

text – An input string.

Returns

A list of token ids.

map_id_to_token(token_ids, skip_special_tokens=False)[source]

Maps a single id to a token, or a sequence of ids to a sequence of tokens, using the vocabulary and added tokens.

Parameters
  • token_ids – A single token id or a list of token ids.

  • skip_special_tokens – Whether to skip the special tokens.

Returns

A single token or a list of tokens.

map_token_to_text(tokens)[source]

Maps a sequence of tokens (strings) into a single string. The simplest way to do this is ' '.join(tokens), but we often want to remove sub-word tokenization artifacts at the same time.

map_id_to_text(token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True)[source]

Maps a sequence of ids (integer) to a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.

Parameters
  • token_ids – A list of token ids.

  • skip_special_tokens – Whether to skip the special tokens.

  • clean_up_tokenization_spaces – Whether to clean up simple English tokenization artifacts, like spaces before punctuation and abbreviated forms.
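Example (a sketch of how the mapping methods above compose, assuming tokenizer is any concrete TokenizerBase instance, e.g. a BERTTokenizer):

text = 'A sentence to tokenize.'

tokens = tokenizer.map_text_to_token(text)    # list of str
ids = tokenizer.map_token_to_id(tokens)       # list of int
assert ids == tokenizer.map_text_to_id(text)  # equivalent shortcut

tokens = tokenizer.map_id_to_token(ids)       # back to tokens
text = tokenizer.map_id_to_text(ids, skip_special_tokens=True)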

encode_text(text_a, text_b=None, max_seq_length=None)[source]

Adds special tokens to a sequence or sequence pair and computes other information such as segment ids, input mask, and sequence length for specific tasks.

property special_tokens_map

A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<unk>, <cls>, …).

property all_special_tokens

List all the special tokens (<unk>, <cls>, …) mapped to class attributes (cls_token, unk_token, …).

property all_special_ids

List the vocabulary indices of the special tokens (<unk>, <cls>, …) mapped to class attributes (cls_token, unk_token, …).

static clean_up_tokenization(out_string)[source]

Clean up simple English tokenization artifacts, like spaces before punctuation and abbreviated forms.

SentencePieceTokenizer

class texar.torch.data.SentencePieceTokenizer(cache_dir=None, hparams=None)[source]

SentencePiece Tokenizer. This class is a wrapper of Google’s SentencePiece with richer ready-to-use functionalities such as adding tokens and saving/loading.

SentencePiece is an unsupervised text tokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements sub-word units (e.g., byte-pair-encoding (BPE) and unigram language model) with the extension of direct training from raw sentences.

The supported algorithms in SentencePiece are bpe, word, char, and unigram, which are specified in hparams.

Parameters
  • cache_dir (optional) – the path to a folder in which the trained sentencepiece model will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

classmethod train(cmd, cache_dir=None)[source]

Trains the tokenizer from a raw text file. This function is a wrapper of the sentencepiece.SentencePieceTrainer.Train function.

Example:

SentencePieceTokenizer.train('--input=test/botchan.txt '
                             '--model_prefix=m --vocab_size=1000')
Parameters
  • cmd (str) – the command for the tokenizer training procedure. See sentencepiece.SentencePieceTrainer.Train for the detailed usage.

  • cache_dir (optional) – the path to a folder in which the trained sentencepiece model will be cached. If None (default), a default directory (texar_data folder under the user’s home directory) will be used.

Returns

Path to the cache directory.

save_vocab(save_dir)[source]

Save the sentencepiece vocabulary (copy original file) to a directory.

map_token_to_text(tokens)[source]

Maps a sequence of tokens (strings) into a single string.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • If hparams[‘vocab_file’] is specified, the tokenizer is directly loaded from the vocabulary file. In this case, all other configurations in hparams are ignored.

  • Otherwise, the tokenizer is automatically trained based on hparams[‘text_file’]. In this case, hparams[‘vocab_size’] must be specified.

  • hparams[‘vocab_file’] and hparams[‘text_file’] cannot both be None.

{
    "vocab_file": None,
    "text_file": None,
    "vocab_size": None,
    "model_type": "unigram",
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
}

Here:

“vocab_file”: str or None

The path to a sentencepiece vocabulary file.

“text_file”: str or None

Comma-separated list of input sentences.

“vocab_size”: int or None

Vocabulary size. The user can specify the vocabulary size, and the tokenizer training procedure will train and yield a vocabulary of the specified size.

“model_type”: str

Model algorithm to train the tokenizer. Available algorithms are: bpe, word, char, and unigram.

“bos_token”: str or None

Beginning of sentence token. Set None to disable bos_token.

“eos_token”: str or None

End of sentence token. Set None to disable eos_token.

“unk_token”: str or None

Unknown token. Set None to disable unk_token.

“pad_token”: str or None

Padding token. Set None to disable pad_token.
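Example (a sketch of the two configuration modes described above; 'data.txt' and 'm.model' are hypothetical paths):

import texar.torch as tx

# Train a new tokenizer from raw text; `vocab_size` is required here.
tokenizer = tx.data.SentencePieceTokenizer(hparams={
    'text_file': 'data.txt',
    'vocab_size': 1000,
    'model_type': 'bpe',
})

# Or load a previously trained sentencepiece model directly; all other
# configurations are ignored in this mode.
tokenizer = tx.data.SentencePieceTokenizer(hparams={
    'vocab_file': 'm.model',
})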

BERTTokenizer

class texar.torch.data.BERTTokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Pre-trained BERT Tokenizer.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

save_vocab(save_dir)[source]

Save the tokenizer vocabulary to a directory or file.

map_token_to_text(tokens)[source]

Maps a sequence of tokens (strings) into a single string.

encode_text(text_a, text_b=None, max_seq_length=None)[source]

Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for BERT specific tasks. The sequence will be truncated if its length is larger than max_seq_length.

A BERT sequence has the following format: [cls_token] X [sep_token]

A BERT sequence pair has the following format: [cls_token] A [sep_token] B [sep_token]

Parameters
  • text_a – The first input text.

  • text_b – The second input text.

  • max_seq_length – Maximum sequence length.

Returns

A tuple of (input_ids, segment_ids, input_mask), where

  • input_ids: A list of input token ids with added special token ids.

  • segment_ids: A list of segment ids.

  • input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
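Example (a sketch, assuming the bert-base-uncased files can be downloaded or are already cached; the input texts are arbitrary):

import texar.torch as tx

tokenizer = tx.data.BERTTokenizer(pretrained_model_name='bert-base-uncased')

input_ids, segment_ids, input_mask = tokenizer.encode_text(
    text_a='The quick brown fox.',
    text_b='It jumps over the lazy dog.',
    max_seq_length=16)

# `segment_ids` is 0 at positions belonging to sentence A and 1 at
# positions belonging to sentence B; `input_mask` is 1 for real tokens
# and 0 for padding.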

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The tokenizer is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the tokenizer is defined by the configurations in hparams.

{
    "pretrained_model_name": "bert-base-uncased",
    "vocab_file": None,
    "max_len": 512,
    "unk_token": "[UNK]",
    "sep_token": "[SEP]",
    "pad_token": "[PAD]",
    "cls_token": "[CLS]",
    "mask_token": "[MASK]",
    "tokenize_chinese_chars": True,
    "do_lower_case": True,
    "do_basic_tokenize": True,
    "non_split_tokens": None,
    "name": "bert_tokenizer",
}

Here:

“pretrained_model_name”: str or None

The name of the pre-trained BERT model.

“vocab_file”: str or None

The path to a one-wordpiece-per-line vocabulary file.

“max_len”: int

The maximum sequence length that this model might ever be used with.

“unk_token”: str

Unknown token.

“sep_token”: str

Separation token.

“pad_token”: str

Padding token.

“cls_token”: str

Classification token.

“mask_token”: str

Masking token.

“tokenize_chinese_chars”: bool

Whether to tokenize Chinese characters.

“do_lower_case”: bool

Whether to lower-case the input. Only has an effect when do_basic_tokenize=True.

“do_basic_tokenize”: bool

Whether to do basic tokenization before wordpiece.

“non_split_tokens”: list

List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.

“name”: str

Name of the tokenizer.

GPT2Tokenizer

class texar.torch.data.GPT2Tokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Pre-trained GPT2 Tokenizer.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., 117M). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

save_vocab(save_dir)[source]

Save the tokenizer vocabulary and merge files to a directory.

map_token_to_text(tokens)[source]

Maps a sequence of tokens (strings) into a single string.

encode_text(text, max_seq_length=None, append_eos_token=True)[source]

Adds special tokens to a sequence and computes the corresponding sequence length for GPT2 specific tasks. The sequence will be truncated if its length is larger than max_seq_length.

A GPT2 sequence has the following format: [bos_token] X [eos_token] [pad_token]

Parameters
  • text – Input text.

  • max_seq_length – Maximum sequence length.

  • append_eos_token – Whether to append eos_token after the sequence.

Returns

A tuple of (input_ids, seq_len), where

  • input_ids: A list of input token ids with added special tokens.

  • seq_len: The sequence length.
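Example (a sketch, assuming the 117M GPT-2 files can be downloaded or are already cached):

import texar.torch as tx

tokenizer = tx.data.GPT2Tokenizer(pretrained_model_name='117M')

input_ids, seq_len = tokenizer.encode_text(
    text='The quick brown fox', max_seq_length=16, append_eos_token=True)

# `seq_len` counts the real tokens (special tokens included); positions
# beyond `seq_len` in `input_ids` hold the pad token id.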

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The tokenizer is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the tokenizer is defined by the configurations in hparams.

{
    "pretrained_model_name": "117M",
    "vocab_file": None,
    "merges_file": None,
    "max_len": 1024,
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "unk_token": "<|endoftext|>",
    "pad_token": "<|endoftext|>",
    "errors": "replace",
    "name": "gpt2_tokenizer",
}

Here:

“pretrained_model_name”: str or None

The name of the pre-trained GPT2 model.

“vocab_file”: str or None

The path to a vocabulary json file mapping tokens to ids.

“merges_file”: str or None

The path to a merges file.

“max_len”: int

The maximum sequence length that this model might ever be used with.

“bos_token”: str

Beginning of sentence token.

“eos_token”: str

End of sentence token.

“unk_token”: str

Unknown token.

“pad_token”: str

Padding token.

“errors”: str

Response when mapping tokens to text fails. The possible values are ignore, replace, and strict.

“name”: str

Name of the tokenizer.

RoBERTaTokenizer

class texar.torch.data.RoBERTaTokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Pre-trained RoBERTa Tokenizer.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., roberta-base). Please refer to PretrainedRoBERTaMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

encode_text(text_a, text_b=None, max_seq_length=None)[source]

Adds special tokens to a sequence or sequence pair and computes the corresponding input mask for RoBERTa specific tasks. The sequence will be truncated if its length is larger than max_seq_length.

A RoBERTa sequence has the following format: [cls_token] X [sep_token]

A RoBERTa sequence pair has the following format: [cls_token] A [sep_token] [sep_token] B [sep_token]

Parameters
  • text_a – The first input text.

  • text_b – The second input text.

  • max_seq_length – Maximum sequence length.

Returns

A tuple of (input_ids, segment_ids, input_mask), where

  • input_ids: A list of input token ids with added special token ids.

  • segment_ids: A list of segment ids.

  • input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The tokenizer is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the tokenizer is defined by the configurations in hparams.

{
    "pretrained_model_name": "roberta-base",
    "vocab_file": None,
    "merges_file": None,
    "max_len": 512,
    "bos_token": "<s>",
    "eos_token": "</s>",
    "sep_token": "</s>",
    "cls_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "errors": "replace",
    "name": "roberta_tokenizer",
}

Here:

“pretrained_model_name”: str or None

The name of the pre-trained RoBERTa model.

“vocab_file”: str or None

The path to a vocabulary json file mapping tokens to ids.

“merges_file”: str or None

The path to a merges file.

“max_len”: int

The maximum sequence length that this model might ever be used with.

“bos_token”: str

Beginning of sentence token.

“eos_token”: str

End of sentence token.

“sep_token”: str

Separation token.

“cls_token”: str

Classification token.

“unk_token”: str

Unknown token.

“pad_token”: str

Padding token.

“mask_token”: str

Masking token.

“errors”: str

Response when decoding fails. The possible values are ignore, replace, and strict.

“name”: str

Name of the tokenizer.

XLNetTokenizer

class texar.torch.data.XLNetTokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Pre-trained XLNet Tokenizer.

Parameters
  • pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

save_vocab(save_dir)[source]

Save the sentencepiece vocabulary (copy original file) to a directory.

map_token_to_text(tokens)[source]

Maps a sequence of tokens (strings) into a single string.

encode_text(text_a, text_b=None, max_seq_length=None)[source]

Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for XLNet specific tasks. The sequence will be truncated if its length is larger than max_seq_length.

An XLNet sequence has the following format: X [sep_token] [cls_token]

An XLNet sequence pair has the following format: A [sep_token] B [sep_token] [cls_token]

Parameters
  • text_a – The first input text.

  • text_b – The second input text.

  • max_seq_length – Maximum sequence length.

Returns

A tuple of (input_ids, segment_ids, input_mask), where

  • input_ids: A list of input token ids with added special token ids.

  • segment_ids: A list of segment ids.

  • input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.

encode_text_for_generation(text, max_seq_length=None, append_eos_token=True)[source]

Adds special tokens to a sequence and computes the corresponding sequence length for XLNet specific tasks. The sequence will be truncated if its length is larger than max_seq_length.

An XLNet sequence has the following format: [bos_token] X [eos_token] [pad_token]

Parameters
  • text – Input text.

  • max_seq_length – Maximum sequence length.

  • append_eos_token – Whether to append eos_token after the sequence.

Returns

A tuple of (input_ids, seq_len), where

  • input_ids: A list of input token ids with added special tokens.

  • seq_len: The sequence length.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The tokenizer is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the tokenizer is defined by the configurations in hparams.

{
    "pretrained_model_name": "xlnet-base-cased",
    "vocab_file": None,
    "max_len": None,
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "sep_token": "<sep>",
    "pad_token": "<pad>",
    "cls_token": "<cls>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<eop>", "<eod>"],
    "do_lower_case": False,
    "remove_space": True,
    "keep_accents": False,
    "name": "xlnet_tokenizer",
}

Here:

“pretrained_model_name”: str or None

The name of the pre-trained XLNet model.

“vocab_file”: str or None

The path to a sentencepiece vocabulary file.

“max_len”: int or None

The maximum sequence length that this model might ever be used with.

“bos_token”: str

Beginning of sentence token.

“eos_token”: str

End of sentence token.

“unk_token”: str

Unknown token.

“sep_token”: str

Separation token.

“pad_token”: str

Padding token.

“cls_token”: str

Classification token.

“mask_token”: str

Masking token.

“additional_special_tokens”: list

A list of additional special tokens.

“do_lower_case”: bool

Whether to lower-case the text.

“remove_space”: bool

Whether to remove the space in the text.

“keep_accents”: bool

Whether to keep the accents in the text.

“name”: str

Name of the tokenizer.

T5Tokenizer

class texar.torch.data.T5Tokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Pre-trained T5 Tokenizer.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., T5-Small). Please refer to PretrainedT5Mixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The tokenizer is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the tokenizer is defined by the configurations in hparams.

{
    "pretrained_model_name": "T5-Small",
    "vocab_file": None,
    "max_len": 512,
    "bos_token": None,
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "extra_ids": 100,
    "additional_special_tokens": [],
    "name": "t5_tokenizer",
}

Here:

“pretrained_model_name”: str or None

The name of the pre-trained T5 model.

“vocab_file”: str or None

The path to a sentencepiece vocabulary file.

“max_len”: int or None

The maximum sequence length that this model might ever be used with.

“bos_token”: str or None

Beginning of sentence token. Set None to disable bos_token.

“eos_token”: str

End of sentence token. Set None to disable eos_token.

“unk_token”: str

Unknown token. Set None to disable unk_token.

“pad_token”: str

Padding token. Set None to disable pad_token.

“extra_ids”: int

The number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are accessible as <extra_id_{%d}>, where {%d} is a number between 0 and extra_ids-1. Extra tokens are indexed from the end of the vocabulary towards the beginning (<extra_id_0> is the last token in the vocabulary), as in T5 preprocessing. See: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117

“additional_special_tokens”: list

A list of additional special tokens.

“name”: str

Name of the tokenizer.
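Example (a sketch of the sentinel tokens created by "extra_ids", assuming the T5-Small files can be downloaded or are already cached):

import texar.torch as tx

tokenizer = tx.data.T5Tokenizer(pretrained_model_name='T5-Small')

# With the default `extra_ids=100`, sentinels `<extra_id_0>` through
# `<extra_id_99>` occupy the last 100 positions of the vocabulary.
sentinel_id = tokenizer.map_token_to_id('<extra_id_0>')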

Vocabulary

SpecialTokens

class texar.torch.data.SpecialTokens[source]

Special tokens, including PAD, BOS, EOS, UNK. These tokens will by default have token ids 0, 1, 2, 3, respectively.

Vocab

class texar.torch.data.Vocab(filename, pad_token='<PAD>', bos_token='<BOS>', eos_token='<EOS>', unk_token='<UNK>')[source]

Vocabulary class that loads vocabulary from file, and maintains mapping tables between token strings and indexes.

Each line of the vocab file should contain one vocabulary token, e.g.:

vocab_token_1
vocab token 2
vocab       token | 3 .
...
Parameters
  • filename (str) – Path to the vocabulary file where each line contains one token.

  • bos_token (str) – A special token that will be added to the beginning of sequences.

  • eos_token (str) – A special token that will be added to the end of sequences.

  • unk_token (str) – A special token that will replace all unknown tokens (tokens not included in the vocabulary).

  • pad_token (str) – A special token that is used to do padding.

load(filename)[source]

Loads the vocabulary from the file.

Parameters

filename (str) – Path to the vocabulary file.

Returns

A tuple of mapping tables between word string and index, (id_to_token_map_py, token_to_id_map_py), where id_to_token_map_py and token_to_id_map_py are python defaultdict instances.

map_ids_to_tokens_py(ids)[source]

Maps ids into text tokens.

The input ids and returned tokens are both python arrays or lists.

Parameters

ids – An int numpy array or (possibly nested) list of token ids.

Returns

A numpy array of text tokens of the same shape as ids.

map_tokens_to_ids_py(tokens)[source]

Maps text tokens into ids.

The input tokens and returned ids are both python arrays or lists.

Parameters

tokens – A numpy array or (possibly nested) list of text tokens.

Returns

A numpy array of token ids of the same shape as tokens.
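Example (a sketch of the two mapping methods, assuming 'vocab.txt' is a hypothetical vocabulary file with one token per line):

import texar.torch as tx

vocab = tx.data.Vocab('vocab.txt')

ids = vocab.map_tokens_to_ids_py([['a', 'sentence'], ['another', 'one']])
tokens = vocab.map_ids_to_tokens_py(ids)  # same shape as `ids`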

property id_to_token_map_py

The dictionary instance that maps from token index to the string form.

property token_to_id_map_py

The dictionary instance that maps from token string to the index.

property size

The vocabulary size.

property bos_token

A string of the special token indicating the beginning of sequence.

property bos_token_id

The int index of the special token indicating the beginning of sequence.

property eos_token

A string of the special token indicating the end of sequence.

property eos_token_id

The int index of the special token indicating the end of sequence.

property unk_token

A string of the special token indicating unknown token.

property unk_token_id

The int index of the special token indicating unknown token.

property pad_token

A string of the special token indicating padding token. The default padding token is an empty string.

property pad_token_id

The int index of the special token indicating padding token.

property special_tokens

The list of special tokens [pad_token, bos_token, eos_token, unk_token].

map_ids_to_strs

texar.torch.data.map_ids_to_strs(ids, vocab, join=True, strip_pad='<PAD>', strip_bos='<BOS>', strip_eos='<EOS>')[source]

Transforms int indexes to strings by mapping ids to tokens, concatenating tokens into sentences, and stripping special tokens, etc.

Parameters
  • ids – An n-D numpy array or (possibly nested) list of int indexes.

  • vocab – An instance of Vocab.

  • join (bool) – Whether to concatenate the tokens along the last dimension into a string, separated with a space character.

  • strip_pad (str) – The PAD token to strip from the strings (i.e., remove the leading and trailing PAD tokens of the strings). Default is "<PAD>" as defined in SpecialTokens.PAD. Set to None or False to disable the stripping.

  • strip_bos (str) – The BOS token to strip from the strings (i.e., remove the leading BOS tokens of the strings). Default is "<BOS>" as defined in SpecialTokens.BOS. Set to None or False to disable the stripping.

  • strip_eos (str) – The EOS token to strip from the strings (i.e., remove the EOS tokens and all subsequent tokens of the strings). Default is "<EOS>" as defined in SpecialTokens.EOS. Set to None or False to disable the stripping.

Returns

If join is True, returns a (n-1)-D numpy array (or list) of concatenated strings. If join is False, returns an n-D numpy array (or list) of str tokens.

Example

text_ids = [[1, 9, 6, 2, 0, 0], [1, 28, 7, 8, 2, 0]]

text = map_ids_to_strs(text_ids, data.vocab)
# text == ['a sentence', 'parsed from ids']

text = map_ids_to_strs(
    text_ids, data.vocab, join=False,
    strip_pad=None, strip_bos=None, strip_eos=None)
# text == [['<BOS>', 'a', 'sentence', '<EOS>', '<PAD>', '<PAD>'],
#          ['<BOS>', 'parsed', 'from', 'ids', '<EOS>', '<PAD>']]

Embedding

Embedding

class texar.torch.data.Embedding(vocab, hparams=None)[source]

Embedding class that loads token embedding vectors from file. Token embeddings not in the embedding file are initialized as specified in hparams.

Parameters
  • vocab (dict) – A dictionary that maps token strings to integer index.

  • hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values:

{
    "file": "",
    "dim": 50,
    "read_fn": "load_word2vec",
    "init_fn": {
        "type": "numpy.random.uniform",
        "kwargs": {
            "low": -0.1,
            "high": 0.1,
        }
    },
}

Here:

“file”: str

Path to the embedding file. If not provided, all embeddings are initialized with the initialization function.

“dim”: int

Dimension size of each embedding vector.

“read_fn”: str or callable

Function to read the embedding file. This can be the function, or its string name or full module path. For example,

"read_fn": texar.torch.data.load_word2vec
"read_fn": "load_word2vec"
"read_fn": "texar.torch.data.load_word2vec"
"read_fn": "my_module.my_read_fn"

If function string name is used, the function must be in one of the modules: texar.torch.data or texar.torch.custom.

The function must have the same signature as load_word2vec().

“init_fn”: dict

Hyperparameters of the initialization function, used to initialize embeddings of tokens missing in the embedding file.

The function must accept an argument named size or shape to specify the output shape, and return a numpy array of that shape.

The dict has the following fields:

“type”: str or callable

The initialization function. Can be either the function, or its string name or full module path.

“kwargs”: dict

Keyword arguments for calling the function. The function is called with init_fn(size=[.., ..], **kwargs).

property word_vecs

2D numpy array of shape [vocab_size, embedding_dim].

property vector_size

The embedding dimension size.
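Example (a sketch of loading GloVe vectors for a vocabulary; 'vocab.txt' and 'glove.6B.50d.txt' are hypothetical paths):

import texar.torch as tx

vocab = tx.data.Vocab('vocab.txt')

# Tokens missing from the GloVe file are initialized uniformly in
# [-0.1, 0.1], per the default "init_fn".
embedding = tx.data.Embedding(
    vocab.token_to_id_map_py,
    hparams={'file': 'glove.6B.50d.txt', 'dim': 50, 'read_fn': 'load_glove'})

word_vecs = embedding.word_vecs  # numpy array of shape [vocab_size, 50]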

load_word2vec

texar.torch.data.load_word2vec(filename, vocab, word_vecs)[source]

Loads embeddings in the word2vec binary format, which has a header line containing the number of vectors and their dimensionality (two integers), followed by that number of lines, each formatted as <word-string> <embedding-vector>.

Parameters
  • filename (str) – Path to the embedding file.

  • vocab (dict) – A dictionary that maps token strings to integer index. Tokens not in vocab are not read.

  • word_vecs – A 2D numpy array of shape [vocab_size, embed_dim], which is updated while reading from the file.

Returns

The updated word_vecs.

load_glove

texar.torch.data.load_glove(filename, vocab, word_vecs)[source]

Loads embeddings in the glove text format in which each line is <word-string> <embedding-vector>. Dimensions of the embedding vector are separated with whitespace characters.

Parameters
  • filename (str) – Path to the embedding file.

  • vocab (dict) – A dictionary that maps token strings to integer index. Tokens not in vocab are not read.

  • word_vecs – A 2D numpy array of shape [vocab_size, embed_dim], which is updated while reading from the file.

Returns

The updated word_vecs.

Data Sources

DataSource

class texar.torch.data.DataSource(*args, **kwds)[source]

Base class for all data sources. A data source represents the source of the data, from which raw data examples are read and returned.

Different from PyTorch Dataset, subclasses of this class are not required to implement __getitem__() (the default implementation raises TypeError), which is beneficial for sources that only support iteration (reading from text files, reading Python iterators, etc.).

SequenceDataSource

class texar.torch.data.SequenceDataSource(sequence)[source]

Data source for reading from Python sequences.

This data source supports indexing.

Parameters

sequence – The Python sequence to read from. Note that the sequence should be iterable and support len().

IterDataSource

class texar.torch.data.IterDataSource(iterable)[source]

Data source for reading from Python iterables. Please note: if passed an iterator and caching strategy is set to ‘none’, then the data source can only be iterated over once.

This data source does not support indexing.

Parameters

iterable – The Python iterable to read from.

ZipDataSource

class texar.torch.data.ZipDataSource(*sources)[source]

Data source by combining multiple sources. The raw examples returned from this data source are tuples, with elements being raw examples from each of the constituting data sources.

This data source supports indexing if all the constituting data sources support indexing.

Parameters

sources – The list of data sources to combine.

FilterDataSource

class texar.torch.data.FilterDataSource(source, filter_fn)[source]

Data source for filtering raw examples with a user-specified filter function. Only examples for which the filter function returns True are returned.

This data source supports indexing if the wrapped data source supports indexing.

Parameters
  • source – The data source to filter.

  • filter_fn – A callable taking a raw example as argument and returning a boolean value, indicating whether the raw example should be kept.

RecordDataSource

class texar.torch.data.RecordDataSource(sources)[source]

Data source by structuring multiple sources. The raw examples returned from this data source are dictionaries, with values being raw examples from each of the constituting data sources.

This data source supports indexing if all the constituting data sources support indexing.

Parameters

sources – A dictionary mapping names to data sources, containing the data sources to combine.
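Example (a sketch of composing the sources above over in-memory data):

import texar.torch as tx

texts = tx.data.SequenceDataSource(['a sentence', 'another sentence'])
labels = tx.data.SequenceDataSource([0, 1])

# Tuples of raw examples from each constituent source.
zipped = tx.data.ZipDataSource(texts, labels)

# Dictionaries keyed by name instead of tuples.
records = tx.data.RecordDataSource({'text': texts, 'label': labels})

# Keep only examples whose label is 1.
positives = tx.data.FilterDataSource(
    records, filter_fn=lambda ex: ex['label'] == 1)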

TextLineDataSource

class texar.torch.data.TextLineDataSource(file_paths, compression_type=None, encoding=None, delimiter=None, max_length=None)[source]

Data source for reading from (multiple) text files. Each line is tokenized and yielded as an example.

This data source does not support indexing.

Parameters
  • file_paths (str or list[str]) – Paths to the text files.

  • compression_type (str, optional) – The compression type for the text files, "gzip" and "zlib" are supported. Default is None, in which case files are treated as plain text files.

  • encoding (str, optional) – Encoding for the files. By default uses the default locale of the system (usually UTF-8).

  • delimiter (str, optional) – Delimiter for tokenization purposes. This is used in combination with max_length. If None, text is split on any blank character.

  • max_length (int, optional) – Maximum length for data examples. Length is measured as the number of tokens in a line after being tokenized using the provided delimiter. Lines with more than max_length tokens will be dropped.
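Example (a sketch; 'data.txt' is a hypothetical path, and each line is split on blank characters under the default delimiter):

import texar.torch as tx

# Lines with more than 20 tokens are dropped.
source = tx.data.TextLineDataSource('data.txt', max_length=20)

for tokens in source:
    print(tokens)  # e.g. ['a', 'tokenized', 'line']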

PickleDataSource

class texar.torch.data.PickleDataSource(file_paths, lists_are_examples=True, **pickle_kwargs)[source]

Data source for reading from (multiple) pickled binary files. Each file could contain multiple pickled objects, and each object is yielded as an example.

This data source does not support indexing.

Parameters
  • file_paths (str or list[str]) – Paths to pickled binary files.

  • lists_are_examples (bool) –

    If True, each list is treated as a single example; if False, each element in a list is treated as a separate example. Default is True. Set this to False if the entire pickled binary file is a list.

    Note

    It is not recommended to store all examples as a single list, because in that case, examples can only be accessed after the whole list is parsed.

  • pickle_kwargs – Additional keyword arguments to pass to pickle.load().
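Example (a sketch of writing and reading multiple pickled objects; 'examples.pkl' is a hypothetical path):

import pickle

import texar.torch as tx

# Each pickled object in the file is yielded as one example.
with open('examples.pkl', 'wb') as f:
    pickle.dump({'id': 0, 'text': 'first'}, f)
    pickle.dump({'id': 1, 'text': 'second'}, f)

source = tx.data.PickleDataSource('examples.pkl')
for example in source:
    print(example['id'], example['text'])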

Data Loaders

DatasetBase

class texar.torch.data.DatasetBase(source, hparams=None, device=None)[source]

Base class inherited by all data classes.

Parameters
  • source – An instance of type DataSource.

  • hparams – A dict or instance of HParams containing hyperparameters. See default_hparams() for the defaults.

  • device

    The device of the produced batches. For GPU training, set to current CUDA device.

    Note

    When device is set to a CUDA device, tensors in the batch will be automatically moved to the specified device. This may result in performance issues if your data examples contain complex structures (e.g., nested lists with many elements). In this case, it is recommended to set device to None and manually move your data.

    For more details, see collate().

Users can also directly inherit from this class to implement customized data processing routines. Two methods should be implemented in the subclass:

  • process(): Process a single data example read from the data source (raw example). Default implementation returns the raw example as is.

  • collate(): Combine a list of processed examples into a single batch, and return an object of type Batch.

Example

Here, we define a custom data class named MyDataset, which is equivalent to the most basic usage of MonoTextData.

import torch
import texar.torch as tx

class MyDataset(tx.data.DatasetBase):
    def __init__(self, data_path, vocab, hparams=None, device=None):
        source = tx.data.TextLineDataSource(data_path)
        self.vocab = vocab
        super().__init__(source, hparams, device)

    def process(self, raw_example):
        # `raw_example` is a data example read from `self.source`,
        # in this case, a line of tokenized text, represented as a
        # list of `str`.
        return {
            "text": raw_example,
            "ids": self.vocab.map_tokens_to_ids_py(raw_example),
        }

    def collate(self, examples):
        # `examples` is a list of objects returned from the
        # `process` method. These data examples should be collated
        # into a batch.

        # `text` is a list of list of `str`, storing the tokenized
        # sentences for each example in the batch.
        text = [ex["text"] for ex in examples]
        # `ids` is the NumPy tensor built from the token IDs of each
        # sentence, and `lengths` the lengths of each sentence.
        # The `tx.data.padded_batch` function pads IDs to the same
        # length and then stack them together. This function is
        # commonly used in `collate` methods.
        ids, lengths = tx.data.padded_batch(
            [ex["ids"] for ex in examples])
        return tx.data.Batch(
            len(examples),
            text=text,
            text_ids=torch.from_numpy(ids),
            lengths=torch.tensor(lengths))

vocab = tx.data.Vocab("vocab.txt")
hparams = {'batch_size': 1}
data = MyDataset("data.txt", vocab, hparams)
iterator = tx.data.DataIterator(data)
for batch in iterator:
    # `batch` contains the following
    # batch == {
    #    'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
    #    'text_ids': [[1, 5, 10, 2]],
    #    'lengths': [4]
    # }
static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "lazy_strategy": 'none',
    "cache_strategy": 'processed',
    "parallelize_processing": True,
    "name": "data"
}

Here:

“num_epochs”: int

Number of times the dataset should be repeated.

Note

This option only exists for compatibility, and will be ignored. A warning will be generated if any value other than 1 is used.

“batch_size”: int

Batch size, i.e., the number of consecutive elements of the dataset to combine in a single batch.

“allow_smaller_final_batch”: bool

Whether to allow the final batch to be smaller if there are insufficient elements left. If False, the final batch is discarded if it is smaller than the batch size. Note that if this is False, output_shapes of the resulting dataset will have a static batch_size dimension equal to "batch_size".

“shuffle”: bool

Whether to randomly shuffle the elements of the dataset.

“shuffle_buffer_size”: int

The buffer size for data shuffling. The larger the buffer, the better the resulting data is mixed.

If None (default), the buffer size is set to the size of the whole dataset (i.e., making the shuffling maximally effective).

“shard_and_shuffle”: bool

Whether to first shard the dataset and then shuffle each block respectively. Useful when the whole data is too large to be loaded efficiently into the memory.

If True, shuffle_buffer_size must be specified to determine the size of each shard.

Warning

Sharding is not yet supported. This option will be ignored.

“num_parallel_calls”: int

Number of elements from the datasets to process in parallel. When "num_parallel_calls" equals 0, no worker processes will be created; when the value is greater than 0, the number of worker processes will be equal to "num_parallel_calls".

“prefetch_buffer_size”: int

The maximum number of elements that will be buffered when prefetching.

Note

This option exists only for compatibility. Currently data is only prefetched when "num_parallel_calls" is greater than 1, and the number of examples to prefetch is controlled internally by PyTorch DataLoader.

“max_dataset_size”: int

Maximum number of instances to include in the dataset. If set to -1 or greater than the size of dataset, all instances will be included. This constraint is imposed after data shuffling and filtering.

“seed”: int, optional

The random seed for shuffle.

Note that if a seed is set, the shuffle order will be exactly the same every time the (repeated) dataset is iterated over.

Warning

Manual seeding is not yet supported. This option will be ignored.

“lazy_strategy”: str

Lazy strategy for data examples. Lazy loading/processing defers data loading/processing until when it’s being accessed. Non-lazy (eager) loading/processing would load/process all data upon construction of dataset. Available options are:

  • none: Perform eager loading and processing.

  • process: Perform eager loading and lazy processing.

  • all: Perform lazy loading and processing.

Defaults to none. Note that currently, all eager operations are performed on a single process only.

“cache_strategy”: str

Caching strategy for data examples. Available options are:

  • none: No data is cached. Data is always loaded from source (e.g. file) and processed upon access.

  • loaded: Only cache raw data loaded from source, processing routines are performed upon access.

  • processed: Processed data is cached. Note: raw data will not be cached in this case, because raw data is only used to construct the processed data.

The default value is processed. This option depends on the value of lazy_strategy, specifically:

  • When lazy_strategy is none, all choices of cache_strategy are equivalent to processed.

  • When lazy_strategy is process, none is equivalent to loaded.

“parallelize_processing”: bool

Whether to perform parallelized processing of data. Since multi-processing parallelism is utilized, this flag should be False if your process routine involves modifying a shared object across examples.

Note that this only affects cases where lazy_strategy is not none. If lazy_strategy is none, processing will be performed on a single process regardless of this value.

“max_batch_size”: int

AdaptDL parameter. Maximum global batch size used for distributed training with AdaptDL.

“local_bsz_bounds”: (int, int)

AdaptDL parameter. Local batch size bounds (min, max) per replica for distributed training.

“gradient_accumulation”: bool

AdaptDL parameter. Enable gradient accumulation.

“name”: str

Name of the data.

to(device)[source]

Move the dataset to the specific device. Note that we don’t actually move data or do anything here – data will be moved to the appropriate device after DataIterator fetches the batch.

process(raw_example)[source]

The process routine. A default implementation of no-op is provided, but subclasses are free to override this behavior.

The process routine takes raw examples loaded from the data source as input, and returns processed examples. If parallelize_processing is True, this method must not access shared variables that are modified during iteration (e.g., constructing vocabularies on-the-fly).

Parameters

raw_example – The raw example loaded from data.

Returns

The processed example.

property num_epochs

Number of epochs.

property batch_size

The batch size.

property hparams

A HParams instance of the data hyperparameters.

property name

Name of the module.

property dataset

The data source.

collate(examples)[source]

The collate routine. Subclasses must implement this method.

The collate routine is called to collate (combine) examples into batches. This function takes a list of processed examples, and returns an instance of Batch.

Note

Implementations should make sure that this method is safe and efficient under multi-processing scenarios. Basically, do not rely on variables that could be modified during iteration, and avoid accessing unnecessary variables, as each access would result in a cross-process memory copy.

Warning

It is recommended not to move tensor storage within this method, though you are free to do so.

However, if multiple workers are used (num_parallel_calls > 0), moving tensors to CUDA devices within this method will result in CUDA errors being thrown.

Parameters

examples – A list of processed examples in a batch.

Returns

The collated batch.

MonoTextData

class texar.torch.data.MonoTextData(hparams, device=None, vocab=None, embedding=None, data_source=None)[source]

Text data processor that reads a single set of text files. This can be used for, e.g., language models and auto-encoders.

Parameters
  • hparams – A dict or instance of HParams containing hyperparameters. See default_hparams() for the defaults.

  • device – The device of the produced batches. For GPU training, set to current CUDA device.

By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a Dataset whose element is a python dict including three fields:

“text”:

A list of [batch_size] elements each containing a list of raw text tokens of the sequences. Short sequences in the batch are padded with empty string. By default only EOS token is appended to each sequence. Out-of-vocabulary tokens are NOT replaced with UNK.

“text_ids”:

A list of [batch_size] elements each containing a list of token indexes of source sequences in the batch.

“length”:

A list of [batch_size] elements of integers containing the length of each source sequence in the batch (including BOS and EOS if added).

The above field names can be accessed through text_name, text_id_name, length_name.

Example

hparams={
    'dataset': { 'files': 'data.txt', 'vocab_file': 'vocab.txt' },
    'batch_size': 1
}
data = MonoTextData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # `batch` contains the following
    # batch == {
    #    'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
    #    'text_ids': [[1, 5, 10, 2]],
    #    'length': [4]
    # }
static default_hparams()[source]

Returns a dictionary of default hyperparameters:

{
    # (1) Hyperparameters specific to text dataset
    "dataset": {
        "files": [],
        "compression_type": None,
        "vocab_file": "",
        "embedding_init": {},
        "delimiter": None,
        "max_seq_length": None,
        "length_filter_mode": "truncate",
        "pad_to_max_seq_length": False,
        "bos_token": "<BOS>"
        "eos_token": "<EOS>"
        "other_transformations": [],
        "variable_utterance": False,
        "utterance_delimiter": "|||",
        "max_utterance_cnt": 5,
        "data_name": None,
    }
    # (2) General hyperparameters
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "mono_text_data",
    # (3) Bucketing
    "bucket_boundaries": [],
    "bucket_batch_sizes": None,
    "bucket_length_fn": None,
}

Here:

  1. For the hyperparameters in the "dataset" field:

“files”: str or list

A (list of) text file path(s).

Each line contains a single text sequence.

“compression_type”: str, optional

One of None (no compression), "ZLIB", or "GZIP".

“vocab_file”: str

Path to vocabulary file. Each line of the file should contain one vocabulary token.

Used to create an instance of Vocab.

“embedding_init”: dict

The hyperparameters for pre-trained embedding loading and initialization.

The structure and default values are defined in texar.torch.data.Embedding.default_hparams().

“delimiter”: str, optional

The delimiter to split each line of the text files into tokens. If None (default), behavior will be equivalent to str.split(), i.e. split on any blank character.

“max_seq_length”: int, optional

Maximum length of output sequences. Data samples exceeding the length will be truncated or discarded according to "length_filter_mode". The length does not include any added "bos_token" or "eos_token". If None (default), no filtering is performed.

“length_filter_mode”: str

Either "truncate" or "discard". If "truncate" (default), tokens exceeding "max_seq_length" will be truncated. If "discard", data samples longer than "max_seq_length" will be discarded.

“pad_to_max_seq_length”: bool

If True, pad all data instances to length "max_seq_length". Raises error if "max_seq_length" is not provided.

“bos_token”: str

The Begin-Of-Sequence token prepended to each sequence.

Set to an empty string to avoid prepending.

“eos_token”: str

The End-Of-Sequence token appended to each sequence.

Set to an empty string to avoid appending.

“other_transformations”: list

A list of transformation functions or function names/paths to further transform each single data instance.

(More documentation to be added.)

“variable_utterance”: bool

If True, each line of the text file is considered to contain multiple sequences (utterances) separated by "utterance_delimiter".

For example, in dialog data, each line can contain a series of dialog history utterances. See the example in examples/hierarchical_dialog for a use case.

Warning

Variable utterances is not yet supported. This option (and related ones below) will be ignored.

“utterance_delimiter”: str

The delimiter for splitting at the utterance level. Should not be the same as "delimiter". Used only when "variable_utterance" is True.

“max_utterance_cnt”: int

Maximum allowed number of utterances in a data instance. Extra utterances are truncated.

“data_name”: str

Name of the dataset.

2. For the general hyperparameters, see texar.torch.data.DatasetBase.default_hparams() for details.

3. Bucketing is to group elements of the dataset together by length and then pad and batch. For bucketing hyperparameters:

“bucket_boundaries”: list

An int list containing the upper length boundaries of the buckets.

Set to an empty list (default) to disable bucketing.

“bucket_batch_sizes”: list

An int list containing batch size per bucket. Length should be len(bucket_boundaries) + 1.

If None, every bucket will have the same batch size specified in batch_size.

“bucket_length_fn”: str or callable

A function that maps a dataset element to an int, determining the length of the element.

This can be a function, or the name or full module path to the function. If function name is given, the function must be in the texar.torch.custom module.

If None (default), length is determined by the number of tokens (including BOS and EOS if added) of the element.

Warning

Bucketing is not yet supported. These options will be ignored.

list_items()[source]

Returns the list of item names that the data can produce.

Returns

A list of strings.

property vocab

The vocabulary, an instance of Vocab.

property text_name

The name for the text field.

property text_id_name

The name for text ids.

property length_name

The name for text length.

property embedding_init_value

The Tensor containing the embedding value loaded from file. None if embedding is not specified.

PairedTextData

class texar.torch.data.PairedTextData(hparams, device=None)[source]

Text data processor that reads parallel source and target text. This can be used in, e.g., seq2seq models.

Parameters
  • hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

  • device – The device of the produced batches. For GPU training, set to current CUDA device.

By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a Dataset whose element is a python dict including six fields:

“source_text”:

A list of [batch_size] elements each containing a list of raw text tokens of source sequences. Short sequences in the batch are padded with empty string. By default only EOS token is appended to each sequence. Out-of-vocabulary tokens are NOT replaced with UNK.

“source_text_ids”:

A list of [batch_size] elements each containing a list of token indexes of source sequences in the batch.

“source_length”:

A list of [batch_size] elements of integers containing the length of each source sequence in the batch.

“target_text”:

A list same as “source_text” but for target sequences. By default both BOS and EOS are added.

“target_text_ids”:

A list same as “source_text_ids” but for target sequences.

“target_length”:

A list same as “source_length” but for target sequences.

The above field names can be accessed through source_text_name, source_text_id_name, source_length_name, and those prefixed with target_, respectively.

Example:

hparams={
    'source_dataset': {'files': 's', 'vocab_file': 'vs'},
    'target_dataset': {'files': ['t1', 't2'], 'vocab_file': 'vt'},
    'batch_size': 1
}
data = PairedTextData(hparams)
iterator = DataIterator(data)

for batch in iterator:
    # `batch` contains the following
    # batch == {
    #    'source_text': [['source', 'sequence', '<EOS>']],
    #    'source_text_ids': [[5, 10, 2]],
    #    'source_length': [3]
    #    'target_text': [['<BOS>', 'target', 'sequence', '1',
    #                     '<EOS>']],
    #    'target_text_ids': [[1, 6, 10, 20, 2]],
    #    'target_length': [5]
    # }
static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    # (1) Hyperparams specific to text dataset
    "source_dataset": {
        "files": [],
        "compression_type": None,
        "vocab_file": "",
        "embedding_init": {},
        "delimiter": None,
        "max_seq_length": None,
        "length_filter_mode": "truncate",
        "pad_to_max_seq_length": False,
        "bos_token": None,
        "eos_token": "<EOS>",
        "other_transformations": [],
        "variable_utterance": False,
        "utterance_delimiter": "|||",
        "max_utterance_cnt": 5,
        "data_name": "source",
    },
    "target_dataset": {
        # ...
        # Same fields are allowed as in "source_dataset" with the
        # same default values, except the
        # following new fields/values:
        "bos_token": "<BOS>"
        "vocab_share": False,
        "embedding_init_share": False,
        "processing_share": False,
        "data_name": "target"
    }
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "paired_text_data",
    # (3) Bucketing
    "bucket_boundaries": [],
    "bucket_batch_sizes": None,
    "bucket_length_fn": None,
}

Here:

  1. Hyperparameters in the "source_dataset" and "target_dataset" fields have the same definition as those in texar.torch.data.MonoTextData.default_hparams(), for source and target text, respectively.

    For the new hyperparameters in "target_dataset":

    “vocab_share”: bool

    Whether to share the vocabulary of the source. If True, the vocab file of the target is ignored.

    “embedding_init_share”: bool

    Whether to share the embedding initial value of the source. If True, "embedding_init" of the target is ignored.

    "vocab_share" must be True to share the embedding initial value.

    “processing_share”: bool

    Whether to share the processing configurations of the source, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”. A configuration sketch follows this list.

  2. For the general hyperparameters, see texar.torch.data.DatasetBase.default_hparams() for details.

  3. For bucketing hyperparameters, see texar.torch.data.MonoTextData.default_hparams() for details, except that the default “bucket_length_fn” is the maximum sequence length of source and target sequences.

    Warning

    Bucketing is not yet supported. These options will be ignored.
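
As referenced above, a minimal configuration sketch that shares the source vocabulary, embedding, and processing options with the target (the file names 's', 'vs', and 't' are placeholders):

hparams = {
    'source_dataset': {'files': 's', 'vocab_file': 'vs'},
    'target_dataset': {
        'files': 't',
        'vocab_share': True,           # reuse the source vocab; target vocab file is ignored
        'embedding_init_share': True,  # requires 'vocab_share' to be True
        'processing_share': True,      # reuse delimiter/BOS/EOS/other_transformations
    },
    'batch_size': 32,
}
data = PairedTextData(hparams)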

list_items()[source]

Returns the list of item names that the data can produce.

Returns

A list of strings.

property vocab

A pair of Vocab instances for the source and target vocabularies, respectively.

property source_vocab

The source vocab, an instance of Vocab.

property target_vocab

The target vocab, an instance of Vocab.

property source_text_name

The name for the source text.

property source_text_id_name

The name for the source text ids.

property source_length_name

The name for the source length.

property target_text_name

The name for the target text.

property target_text_id_name

The name for the target text ids.

property target_length_name

The name for the target length.

embedding_init_value()[source]

A pair of Tensors containing the embedding values of the source and target data loaded from file.

ScalarData

class texar.torch.data.ScalarData(hparams, device=None, data_source=None)[source]

Scalar data where each line of the files is a scalar (int or float), e.g., a data label.

Parameters
  • hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

  • device – The device of the produced batches. For GPU training, set to current CUDA device.

The processor reads and processes raw data and results in a dataset whose element is a python dict including one field. The field name is specified in hparams["dataset"]["data_name"]. If not specified, the default name is “data”. The field name can be accessed through data_name.

This field is a Tensor of shape [batch_size] containing a batch of scalars, of either int or float type as specified in hparams.

Example

hparams={
    'dataset': { 'files': 'data.txt', 'data_name': 'label' },
    'batch_size': 2
}
data = ScalarData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch == {
    #     'label': [2, 9]
    # }
static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    # (1) Hyperparams specific to scalar dataset
    "dataset": {
        "files": [],
        "compression_type": None,
        "data_type": "int",
        "other_transformations": [],
        "data_name": "data",
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "scalar_data",
}

Here:

  1. For the hyperparameters in the "dataset" field:

    “files”: str or list

    A (list of) file path(s).

    Each line contains a single scalar number.

    “compression_type”: str, optional

    One of “” (no compression), “ZLIB”, or “GZIP”.

    “data_type”: str

    The scalar type. Types defined in get_supported_scalar_types() are supported.

    “other_transformations”: list

    A list of transformation functions or function names/paths to further transform each single data instance.

    (More documentation to be added; a sketch of a transformation function follows this list.)

    “data_name”: str

    Name of the dataset.

  2. For the general hyperparameters, see texar.torch.data.DatasetBase.default_hparams() for details.
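
As referenced above in "other_transformations", a minimal sketch of a transformation function (the clamping function and the file name 'labels.txt' are hypothetical; each function is assumed to receive one parsed scalar and return the transformed value):

def clamp_label(label):
    # Clip each scalar label into the range [0, 9].
    return min(max(label, 0), 9)

hparams = {
    'dataset': {
        'files': 'labels.txt',
        'data_name': 'label',
        'other_transformations': [clamp_label],
    },
    'batch_size': 2,
}
data = ScalarData(hparams)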

list_items()[source]

Returns the list of item names that the data can produce.

Returns

A list of strings.

property data_name

The name of the data tensor, “data” by default if not specified in hparams.

MultiAlignedData

class texar.torch.data.MultiAlignedData(hparams, device=None)[source]

Data consisting of multiple aligned parts.

Parameters
  • hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

  • device – The device of the produced batches. For GPU training, set to current CUDA device.

The processor can read any number of parallel fields as specified in the “datasets” list of hparams, and results in a Dataset whose element is a python dict containing data fields from each of the specified datasets. Fields from a text dataset or Record dataset have names prefixed by its "data_name". Fields from a scalar dataset are named by its "data_name".

Example

hparams={
    'datasets': [
        {'files': 'a.txt', 'vocab_file': 'v.a', 'data_name': 'x'},
        {'files': 'b.txt', 'vocab_file': 'v.b', 'data_name': 'y'},
        {'files': 'c.txt', 'data_type': 'int', 'data_name': 'z'}
    ],
    'batch_size': 1
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)

for batch in iterator:
    # batch contains the following
    # batch == {
    #    'x_text': [['<BOS>', 'x', 'sequence', '<EOS>']],
    #    'x_text_ids': [[1, 5, 10, 2]],
    #    'x_length': [4],
    #    'y_text': [['<BOS>', 'y', 'sequence', '1', '<EOS>']],
    #    'y_text_ids': [[1, 6, 10, 20, 2]],
    #    'y_length': [5],
    #    'z': [1000],
    # }

...

hparams={
    'datasets': [
        {'files': 'd.txt', 'vocab_file': 'v.d', 'data_name': 'm'},
        {
            'files': 'd.tfrecord',
            'data_type': 'record',
            "feature_types": {
                'image': ['tf.string', 'stacked_tensor']
            },
            'image_options': {
                'image_feature_name': 'image',
                'resize_height': 512,
                'resize_width': 512,
            },
            'data_name': 't',
        }
    ],
    'batch_size': 1
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch == {
    #    'm_text': [['<BOS>', 'NewYork', 'City', 'Map', '<EOS>']],
    #    'm_text_ids': [[1, 100, 80, 65, 2]],
    #    'm_length': [5],
    #
    #    # "t_image" is a list containing a "numpy.ndarray"
    #    # image in this example. Its width is equal to 512
    #    # and its height is equal to 512.
    #    't_image': [...]
    # }
static default_hparams()[source]

Returns a dictionary of default hyperparameters:

{
    # (1) Hyperparams specific to text dataset
    "datasets": []
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "multi_aligned_data",
}

Here:

  1. “datasets” is a list of dicts, each of which specifies a dataset that can be text, scalar, or Record. The "data_name" field of each dataset is used as the name prefix of the data fields from the respective dataset, and must be unique across datasets.

    1. For scalar dataset, the allowed hyperparameters and default values are the same as the “dataset” field of texar.torch.data.ScalarData.default_hparams(). Note that "data_type" must be explicitly specified (either “int” or “float”).

    2. For Record dataset, the allowed hyperparameters and default values are the same as the “dataset” field of texar.torch.data.RecordData.default_hparams(). Note that "data_type" must be explicitly specified (“record”).

    3. For text dataset, the allowed hyperparameters and default values are the same as the “dataset” field of texar.torch.data.MonoTextData.default_hparams(), with several extra hyperparameters:

      “data_type”: str

      The type of the dataset, one of {“text”, “int”, “float”, “record”}. If set to “int” or “float”, the dataset is considered to be a scalar dataset. If set to “record”, the dataset is considered to be a Record dataset.

      If not specified or set to “text”, the dataset is considered to be a text dataset.

      “vocab_share_with”: int, optional

      Share the vocabulary of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.

      If specified, the vocab file of current dataset is ignored. Default is None which disables the vocab sharing.

      “embedding_init_share_with”: int, optional

      Share the embedding initial value of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.

      If specified, the "embedding_init" field of the current dataset is ignored. Default is None which disables the initial value sharing.

      “processing_share_with”: int, optional

      Share the processing configurations of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.

      If specified, the relevant fields of the current dataset are ignored, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”. Default is None, which disables processing sharing. A configuration sketch follows this list.

2. For the general hyperparameters, see texar.torch.data.DatasetBase.default_hparams() for details.
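
As referenced above, a sketch of two aligned text datasets where the second shares the vocabulary and processing configurations of the first (the file names are placeholders):

hparams = {
    'datasets': [
        {'files': 'src.txt', 'vocab_file': 'v.src', 'data_name': 'src'},
        {
            'files': 'tgt.txt',
            'data_name': 'tgt',
            'vocab_share_with': 0,       # reuse the vocab of dataset index 0
            'processing_share_with': 0,  # reuse delimiter/BOS/EOS/other_transformations
        },
    ],
    'batch_size': 16,
}
data = MultiAlignedData(hparams)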

list_items()[source]

Returns the list of item names that the data can produce.

Returns

A list of strings.

vocab(name_or_id)[source]

Returns the Vocab of text dataset by its name or id. None if the dataset is not of text type.

Parameters

name_or_id (str or int) – Data name or the index of text dataset.

embedding_init_value(name_or_id)[source]

Returns the Tensor of embedding initial value of the dataset by its name or id. None if the dataset is not of text type.

text_name(name_or_id)[source]

The name of the text tensor of a text dataset by its name or id. If the dataset is not of text type, returns None.

length_name(name_or_id)[source]

The name of the length tensor of a text dataset by its name or id. If the dataset is not of text type, returns None.

text_id_name(name_or_id)[source]

The name of the text id tensor of a text dataset by its name or id. If the dataset is not of text type, returns None.

data_name(name_or_id)[source]

The name of the data tensor of a scalar dataset by its name or id. If the dataset is not a scalar dataset, returns None.
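
For example, with the first MultiAlignedData configuration above (the 'x', 'y', and 'z' datasets), these accessors behave roughly as sketched below (illustrative, not verbatim output):

data = MultiAlignedData(hparams)
data.vocab('x')        # Vocab instance of the text dataset named 'x'
data.text_name('x')    # 'x_text'
data.text_id_name(0)   # 'x_text_ids' (index 0 is the first dataset)
data.length_name('x')  # 'x_length'
data.data_name('z')    # 'z', the scalar dataset's field name
data.vocab('z')        # None, since 'z' is not a text dataset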

RecordData

class texar.torch.data.RecordData(hparams=None, device=None, data_source=None)[source]

Record data which loads and processes pickled files.

This module can be used to process image data, features, etc.

Parameters
  • hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

  • device – The device of the produced batches. For GPU training, set to current CUDA device.

The module reads and restores data from pickled files and results in a dataset whose element is a Python dict that maps feature names to feature values. The feature names and dtypes are specified in hparams.dataset.feature_types.

The module also provides simple processing options for image data, such as image resize.

Example

# Read data from pickled file
hparams={
    'dataset': {
        'files': 'image1.pkl',
        'feature_types': {
            'height': ['int64', 'list'],  # or 'stacked_tensor'
            'width': ['int64', 'list'],   # or 'stacked_tensor'
            'label': ['int64', 'stacked_tensor'],
            'image_raw': ['bytes', 'stacked_tensor'],
        }
    },
    'batch_size': 1
}
data = RecordData(hparams)
iterator = DataIterator(data)

batch = next(iter(iterator))  # get the first batch in dataset
# batch == {
#    'data': {
#        'height': [239],
#        'width': [149],
#        'label': tensor([1]),
#
#        # 'image_raw' is a NumPy ndarray of raw image bytes in this
#        # example.
#        'image_raw': [...],
#    }
# }
# Read image data from pickled file and do resizing
hparams={
    'dataset': {
        'files': 'image2.pkl',
        'feature_types': {
            'label': ['int64', 'stacked_tensor'],
            'image_raw': ['bytes', 'stacked_tensor'],
        },
        'image_options': {
            'image_feature_name': 'image_raw',
            'resize_height': 512,
            'resize_width': 512,
        }
    },
    'batch_size': 1
}
data = RecordData(hparams)
iterator = DataIterator(data)

batch = next(iter(iterator))  # get the first batch in dataset
# batch == {
#    'data': {
#        'label': tensor([1]),
#
#        # "image_raw" is a tensor of image pixel data in this
#        # example. Each image has a width of 512 and height of 512.
#        'image_raw': tensor([...])
#    }
# }
classmethod writer(file_path, feature_types)[source]

Construct a file writer object that saves records in pickled format.

Example:

file_path = "data/train.pkl"
feature_types = {
    "input_ids": ["int64", "stacked_tensor", 128],
    "label_ids": ["int64", "stacked_tensor"],
}
with tx.data.RecordData.writer(file_path, feature_types) as writer:
    writer.write({
        "input_ids": np.random.randint(0, 100, size=128),
        "label_ids": np.random.randint(0, 100),
    })
Parameters
  • file_path (str) – Path to save the dataset.

  • feature_types – Feature names and types. Please refer to default_hparams() for details.

Returns

A file writer object.

static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    # (1) Hyperparameters specific to the record data
    'dataset': {
        'files': [],
        'feature_types': {},
        'feature_convert_types': {},
        'image_options': {},
        "num_shards": None,
        "shard_id": None,
        "other_transformations": [],
        "data_name": None,
    },
    # (2) General hyperparameters
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "tfrecord_data",
}

Here:

  1. For the hyperparameters in the "dataset" field:

    “files”: str or list

    A (list of) pickled file path(s).

    “feature_types”: dict

    The feature names (str) with their descriptions in the form of feature_name: [dtype, feature_collate_method, shape]:

    • dtype is a Python type (int, str), dtype instance from PyTorch (torch.float), NumPy (np.int64), or TensorFlow (tf.string), or their stringified names such as "torch.float" and "np.int64". The feature will be read from the files and parsed into this dtype.

    • feature_collate_method is of type str, and describes how features are collated in the batch. Available values are:

      • "stacked_tensor": Features are assumed to be tensors of a fixed shape (or scalars). When collating, features are stacked, with the batch dimension being the first dimension. This is the default value if feature_collate_method is not specified. For example:

        • 5 scalar features -> a tensor of shape [5].

        • 4 tensor features, each of shape [6, 5] -> a tensor of shape [4, 6, 5].

      • "padded_tensor": Features are assumed to be tensors, with all dimensions except the first having the same size. When collating, features are padded with zero values along the end of the first dimension so that every tensor has the same size, and then stacked, with the batch dimension being the first dimension. For example:

        • 3 tensor features, with shapes [4, 7, 8], [5, 7, 8], and [4, 7, 8] -> a tensor of shape [3, 5, 7, 8].

      • "list": Features can be any objects. When collating, the features are stored in a Python list.

    • shape is optional, and can be of type int, tuple, or torch.Size. If specified, shapes of tensor features will be checked, depending on the feature_collate_method:

      • "stacked_tensor": The shape of every feature tensor must be shape.

      • "padded_tensor": The shape (excluding first dimension) of every feature tensor must be shape.

      • "list": shape is ignored.

      Note

      Shape check is performed before any transformations are applied.

    Example:

    feature_types = {
        "input_ids": ["int64", "stacked_tensor", 128],
        "label_ids": ["int64", "stacked_tensor"],
        "name_lists": ["string", "list"],
    }
    

    Note

    This field is named “feature_original_types” in Texar-TF. This name is still supported, but is deprecated in favor of “feature_types”.

    Texar-TF also uses different names for feature types:

    • "FixedLenFeature" corresponds to "stacked_tensor".

    • "FixedLenSequenceFeature" corresponds to "padded_tensor".

    • "VarLenFeature" corresponds to "list".

    These names are also accepted in Texar-PyTorch, but are deprecated in favor of the new names.

    “feature_convert_types”: dict, optional

    Specifies dtype conversion after reading the data files. This dict maps feature names to desired data dtypes. For example, you can first read a feature into dtype torch.int32 by specifying it in "feature_types" above, and convert the feature to dtype "torch.long" by specifying it here. Features not specified here will not be converted.

    • dtype is a Python type (int, str), dtype instance from PyTorch (torch.float), NumPy (np.int64), or TensorFlow (tf.string), or their stringified names such as "torch.float" and "np.int64".

    Note that this converting process happens after all the data are restored.

    Example:

    feature_convert_types = {
        "input_ids": "int32",
        "label_ids": "int32",
    }
    
    “image_options”: dict, optional

    Specifies the image feature name and performs image resizing; includes three fields:

    • “image_feature_name”: str

      The name of the feature which contains the image data. If set, the image data will be restored in a numpy.ndarray.

    • “resize_height”: int

      The height of the image after resizing.

    • “resize_width”: int

      The width of the image after resizing.

    If any of "resize_height" or "resize_width" is not set, image data will be restored with original shape.

    “num_shards”: int, optional

    The number of data shards in distributed mode. Usually set to the number of processes in distributed computing. Used in combination with "shard_id".

    Warning

    Sharding is not yet supported. This option (and related ones below) will be ignored.

    “shard_id”: int, optional

    Sets the unique id to identify a shard. The module will process only the corresponding shard of the whole data. Used in combination with "num_shards".

    For example, in a case of distributed computing on 2 GPUs, the hyperparameters of the data module for the two processes can be configured as below, respectively.

    For GPU 0:

    dataset: {
        ...
        "num_shards": 2,
        "shard_id": 0
    }
    

    For GPU 1:

    dataset: {
        ...
        "num_shards": 2,
        "shard_id": 1
    }
    

    Also refer to examples/bert for a use case.

    “other_transformations”: list

    A list of transformation functions or function names/paths to further transform each single data instance.

    “data_name”: str

    Name of the dataset.

  2. For the general hyperparameters, see texar.torch.data.DatasetBase.default_hparams() for details.

list_items()[source]

Returns the list of item names that the data can produce.

Returns

A list of strings.

property feature_names

A list of feature names.

Data Iterators

Batch

class texar.torch.data.Batch(batch_size, batch=None, **kwargs)[source]

Wrapper over Python dictionaries representing a batch. It provides a dictionary-like interface to access its fields. This class can be used in the following way:

hparams = {
    'dataset': { 'files': 'data.txt', 'vocab_file': 'vocab.txt' },
    'batch_size': 1
}

data = MonoTextData(hparams)
iterator = DataIterator(data)

for batch in iterator:
    # batch is a Batch object and contains the following fields
    # batch == {
    #    'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
    #    'text_ids': [[1, 5, 10, 2]],
    #    'length': [4]
    # }

    input_ids = torch.tensor(batch['text_ids'])

    # we can also access the elements using dot notation
    input_text = batch.text

DataIterator

class texar.torch.data.DataIterator(datasets, batching_strategy=None, pin_memory=None)[source]

Data iterator that switches and iterates through multiple datasets.

This is a wrapper of SingleDatasetIterator.

Parameters
  • datasets

    Datasets to iterate through. This can be a single instance of DatasetBase, a list of DatasetBase instances, or a dictionary mapping dataset names (str) to DatasetBase instances.

  • batching_strategy – The batching strategy to use when performing dynamic batching. If None, fixed-sized batching is used.

  • pin_memory

    If True, tensors will be moved onto page-locked memory before returning. This argument is passed into the constructor for DataLoader.

    Defaults to None, which will set the value to True if the DatasetBase instance is set to use a CUDA device. Set to True or False to override this behavior.

Example

Create an iterator over two datasets and generate fixed-sized batches:

train_data = MonoTextData(hparams_train)
test_data = MonoTextData(hparams_test)
iterator = DataIterator({'train': train_data, 'test': test_data})

for epoch in range(200): # Run 200 epochs of train/test
    # Starts iterating through training data from the beginning.
    iterator.switch_to_dataset('train')
    for batch in iterator:
        ... # Do training with the batch.

    # Starts iterating through test data from the beginning
    for batch in iterator.get_iterator('test'):
        ... # Do testing with the batch.

Dynamic batching based on total number of tokens:

iterator = DataIterator(
    {'train': train_data, 'test': test_data},
    batching_strategy=TokenCountBatchingStrategy(max_tokens=1000))

Dynamic batching with a custom strategy (e.g., limiting the total number of tokens in examples from PairedTextData, including padding):

from typing import List, Tuple

from texar.torch.data import BatchingStrategy

class CustomBatchingStrategy(BatchingStrategy):
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.reset_batch()

    def reset_batch(self) -> None:
        self.max_src_len = 0
        self.max_tgt_len = 0
        self.cur_batch_size = 0

    def add_example(self, ex: Tuple[List[str], List[str]]) -> bool:
        # With padding, every example in the batch occupies
        # (max_src_len + max_tgt_len) tokens.
        max_src_len = max(self.max_src_len, len(ex[0]))
        max_tgt_len = max(self.max_tgt_len, len(ex[1]))
        if ((max_src_len + max_tgt_len) *
                (self.cur_batch_size + 1) > self.max_tokens):
            return False
        self.max_src_len = max_src_len
        self.max_tgt_len = max_tgt_len
        self.cur_batch_size += 1
        return True

iterator = DataIterator(
    {'train': train_data, 'test': test_data},
    batching_strategy=CustomBatchingStrategy(max_tokens=1000))
property num_datasets

Number of datasets.

property dataset_names

A list of dataset names.

switch_to_dataset(dataset_name=None)[source]

Re-initializes the iterator of a given dataset and starts iterating over the dataset (from the beginning).

Parameters

dataset_name (optional) – Name of the dataset. If not provided, there must be only one Dataset.

get_iterator(dataset_name=None)[source]

Re-initializes the iterator of a given dataset and starts iterating over the dataset (from the beginning).

Parameters

dataset_name (optional) – Name of the dataset. If not provided, there must be only one Dataset.

TrainTestDataIterator

class texar.torch.data.TrainTestDataIterator(train=None, val=None, test=None, batching_strategy=None, pin_memory=None)[source]

Data iterator that alternates between training, validation, and test datasets.

train, val, and test are instances of DatasetBase. At least one of them must be provided.

This is a wrapper of DataIterator.

Parameters
  • train (optional) – Training data.

  • val (optional) – Validation data.

  • test (optional) – Test data.

  • batching_strategy – The batching strategy to use when performing dynamic batching. If None, fixed-sized batching is used.

  • pin_memory

    If True, tensors will be moved onto page-locked memory before returning. This argument is passed into the constructor for DataLoader.

    Defaults to None, which will set the value to True if the DatasetBase instance is set to use a CUDA device. Set to True or False to override this behavior.

Example

train_data = MonoTextData(hparams_train)
val_data = MonoTextData(hparams_val)
iterator = TrainTestDataIterator(train=train_data, val=val_data)

for epoch in range(200): # Run 200 epochs of train/val
    # Starts iterating through training data from the beginning.
    iterator.switch_to_train_data()
    for batch in iterator:
        ... # Do training with the batch.

    # Starts iterating through val data from the beginning.
    for batch in iterator.get_val_iterator():
        ... # Do validation on the batch.
switch_to_train_data()[source]

Switch to training data.

switch_to_val_data()[source]

Switch to validation data.

switch_to_test_data()[source]

Switch to test data.

get_train_iterator()[source]

Obtain an iterator over training data.

get_val_iterator()[source]

Obtain an iterator over validation data.

get_test_iterator()[source]

Obtain an iterator over test data.

BatchingStrategy

class texar.torch.data.BatchingStrategy(*args, **kwds)[source]

Decides batch boundaries in dynamic batching. Please refer to TokenCountBatchingStrategy for a concrete example.

reset_batch()[source]

Reset the internal state of the batching strategy. This method is called at the start of iteration, and after each batch is yielded.

add_example(example)[source]

Add an example into the current batch, and modify internal states accordingly. If the example should not be added to the batch, this method does not modify the internal state, and returns False.

Parameters

example – The example to add to the batch.

Returns

A boolean value indicating whether example should be added to the batch.
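
To make this contract concrete, a minimal sketch of a strategy that simply caps the number of examples per batch (a hypothetical FixedCountBatchingStrategy, equivalent to fixed-size batching):

from texar.torch.data import BatchingStrategy

class FixedCountBatchingStrategy(BatchingStrategy):
    def __init__(self, max_count: int):
        self.max_count = max_count
        self.count = 0

    def reset_batch(self) -> None:
        # Called at the start of iteration and after each yielded batch.
        self.count = 0

    def add_example(self, example) -> bool:
        if self.count >= self.max_count:
            return False  # reject; the current batch is yielded first
        self.count += 1
        return True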

TokenCountBatchingStrategy

class texar.torch.data.TokenCountBatchingStrategy(max_tokens, max_batch_size=None, length_fn=None)[source]

Create dynamically-sized batches so that the total number of tokens inside each batch is constrained.

Parameters
  • max_tokens (int) – The maximum number of tokens inside each batch.

  • max_batch_size (int, optional) – The maximum number of examples for each batch. If None, batches can contain an arbitrary number of examples as long as the total number of tokens does not exceed max_tokens.

  • length_fn (callable, optional) – A function taking a data example as argument, and returning the number of tokens in the example. By default, len is used, which is the desired behavior if the dataset in question is a MonoTextData.
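
For instance, a sketch that caps batches at roughly 1000 tokens and at most 32 examples (train_data is assumed to be a MonoTextData instance as in earlier examples):

strategy = TokenCountBatchingStrategy(
    max_tokens=1000,
    max_batch_size=32,
    length_fn=len,  # the default; counts tokens per example
)
iterator = DataIterator(train_data, batching_strategy=strategy)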

Data Utilities

maybe_download

texar.torch.data.maybe_download(urls, path, filenames=None, extract=False)[source]

Downloads a set of files.

Parameters
  • urls – A (list of) URLs to download files.

  • path (str) – The destination path to save the files.

  • filenames – A (list of) strings of the file names. If given, must have the same length as urls. If None, filenames are extracted from urls.

  • extract (bool) – Whether to extract compressed files.

Returns

A list of paths to the downloaded files.
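
For instance, a sketch with a placeholder URL (not a real dataset location):

import texar.torch as tx

# Download and unpack an archive into ./data; returns the local paths.
paths = tx.data.maybe_download(
    urls='https://example.com/corpus.zip',
    path='./data',
    extract=True,
)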

read_words

texar.torch.data.read_words(filename, newline_token=None)[source]

Reads words from a file.

Parameters
  • filename (str) – Path to the file.

  • newline_token (str, optional) – The token to replace the original newline token “\n”. For example, tx.data.SpecialTokens.EOS. If None, no replacement is performed.

Returns

A list of words.

make_vocab

texar.torch.data.make_vocab(filenames, max_vocab_size=-1, newline_token=None, return_type='list', return_count=False)[source]

Builds the vocabulary from the files.

Parameters
  • filenames (str) – A (list of) files.

  • max_vocab_size (int) – Maximum size of the vocabulary. Low-frequency words exceeding the limit will be discarded. Set to -1 (default) if no truncation is wanted.

  • newline_token (str, optional) – The token to replace the original newline token “\n”. For example, tx.data.SpecialTokens.EOS. If None, no replacement is performed.

  • return_type (str) – Either list or dict. If list (default), this function returns a list of words sorted by frequency. If dict, this function returns a dict mapping words to their index sorted by frequency.

  • return_count (bool) – Whether to return word counts. If True and return_type is dict, then a count dict is returned, which is a mapping from words to their frequency.

Returns

  • If return_count is False, returns a list or dict containing the vocabulary words.

  • If return_count is True, returns a pair of lists or dicts (a, b), where a contains the vocabulary words and b contains the word counts.
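
A sketch of typical usage (the file names are placeholders):

import texar.torch as tx

# Map the 10,000 most frequent words to indices sorted by frequency.
word2id = tx.data.make_vocab(
    ['train.src.txt', 'train.tgt.txt'],
    max_vocab_size=10000,
    return_type='dict',
)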

count_file_lines

texar.torch.data.count_file_lines(filenames)[source]

Counts the number of lines in the file(s).