Data¶
Tokenizer¶
TokenizerBase¶
- class texar.torch.data.TokenizerBase(hparams)[source]¶
Base class inherited by all tokenizer classes. This class handles downloading and loading pre-trained tokenizers and adding tokens to the vocabulary.
Derived classes can set up a few special tokens to be used in common scripts and internals: bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, and additional_special_tokens.
We defined an added_tokens_encoder to add new tokens to the vocabulary without having to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, SentencePiece, …).
- classmethod load(pretrained_model_path, configs=None)[source]¶
Instantiate a tokenizer from the vocabulary files or the saved tokenizer files.
- Parameters
pretrained_model_path – The path to a vocabulary file or a folder that contains the saved pre-trained tokenizer files.
configs – Tokenizer configurations. Values in this dictionary overwrite the original tokenizer configurations saved in the configuration file.
- Returns
A tokenizer instance.
- save(save_dir)[source]¶
Save the tokenizer vocabulary files (with added tokens), the tokenizer configuration file, and a dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<cls>, <unk>, …) to a directory, so that the tokenizer can be re-loaded using load().
- Parameters
save_dir – The path to a folder in which the tokenizer files will be saved.
- Returns
The paths to the vocabulary file, added token file, special token mapping file, and the configuration file.
- save_vocab(save_dir)[source]¶
Save the tokenizer vocabulary to a directory. This method does not save added tokens, special token mappings, and the configuration file.
Please use save() to save the full tokenizer state so that it can be reloaded using load().
- add_tokens(new_tokens)[source]¶
Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to the added_tokens_encoder with indices starting from the last index of the current vocabulary.
- Parameters
new_tokens – A list of new tokens.
- Returns
Number of tokens added to the vocabulary, which can be used to correspondingly increase the size of the associated model embedding matrices.
- add_special_tokens(special_tokens_dict)[source]¶
Add a dictionary of special tokens to the encoder and link them to class attributes. If the special tokens are not in the vocabulary, they are added to it and indexed starting from the last index of the current vocabulary.
- Parameters
special_tokens_dict – A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<cls>, <unk>, …).
- Returns
Number of tokens added to the vocabulary, which can be used to correspondingly increase the size of the associated model embedding matrices.
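Example
A sketch of extending a tokenizer's vocabulary (the BERTTokenizer instance and the token strings are illustrative):

from texar.torch.data import BERTTokenizer

tokenizer = BERTTokenizer(pretrained_model_name="bert-base-uncased")

# Add ordinary tokens; the return value counts only tokens that
# were not already in the vocabulary.
num_added = tokenizer.add_tokens(["<tech_term>", "<product>"])

# Add special tokens keyed by class attribute name.
num_special = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<speaker1>", "<speaker2>"]})

# The associated model embedding matrix should be enlarged by
# `num_added + num_special` rows to cover the new ids.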
- map_text_to_token(text, **kwargs)[source]¶
Maps a string to a sequence of tokens (strings), using the tokenizer: splits into words for word-based vocabularies, or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). This method also takes care of added tokens.
- Parameters
text – An input string.
- Returns
A list of tokens.
- map_token_to_id(tokens)[source]¶
Maps a single token to an integer id, or a sequence of tokens to a sequence of ids, using the vocabulary.
- Parameters
tokens – A single token or a list of tokens.
- Returns
A single token id or a list of token ids.
- map_text_to_id(text)[source]¶
Maps a string to a sequence of ids (integer), using the tokenizer and vocabulary. Same as self.map_token_to_id(self.map_text_to_token(text)).
- Parameters
text – An input string.
- Returns
A list of token ids.
- map_id_to_token(token_ids, skip_special_tokens=False)[source]¶
Maps a single id to a token, or a sequence of ids to a sequence of tokens, using the vocabulary and added tokens.
- Parameters
token_ids – A single token id or a list of token ids.
skip_special_tokens – Whether to skip the special tokens.
- Returns
A single token or a list of tokens.
- map_token_to_text(tokens)[source]¶
Maps a sequence of tokens (strings) into a single string. The simplest way to do this is ' '.join(tokens), but we often want to remove sub-word tokenization artifacts at the same time.
- map_id_to_text(token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True)[source]¶
Maps a sequence of ids (integer) to a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.
- Parameters
token_ids – A list of token ids.
skip_special_tokens – Whether to skip the special tokens.
clean_up_tokenization_spaces – Whether to clean up a list of simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms.
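Example
A sketch of how the mapping methods compose (the tokenizer instance and texts are illustrative):

tokens = tokenizer.map_text_to_token("a raw sentence")
ids = tokenizer.map_token_to_id(tokens)
# Equivalent shortcut:
ids = tokenizer.map_text_to_id("a raw sentence")

# Map back, dropping special tokens and cleaning up spaces:
tokens = tokenizer.map_id_to_token(ids, skip_special_tokens=True)
text = tokenizer.map_id_to_text(
    ids, skip_special_tokens=True,
    clean_up_tokenization_spaces=True)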
- encode_text(text_a, text_b=None, max_seq_length=None)[source]¶
Adds special tokens to a sequence or sequence pair and computes other information such as segment ids, input mask, and sequence length for specific tasks.
- property special_tokens_map¶
A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<cls>, <unk>, …).
- property all_special_tokens¶
List all the special tokens (<cls>, <unk>, …) mapped to class attributes (cls_token, unk_token, …).
- property all_special_ids¶
List the vocabulary indices of the special tokens (<cls>, <unk>, …) mapped to class attributes (cls_token, unk_token, …).
SentencePieceTokenizer¶
- class texar.torch.data.SentencePieceTokenizer(cache_dir=None, hparams=None)[source]¶
SentencePiece Tokenizer. This class is a wrapper of Google’s SentencePiece with richer ready-to-use functionalities such as adding tokens and saving/loading.
SentencePiece is an unsupervised text tokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements sub-word units (e.g., byte-pair-encoding (BPE) and unigram language model) with the extension of direct training from raw sentences.
The supported algorithms in SentencePiece are bpe, word, char, and unigram, specified in hparams.
- Parameters
cache_dir (optional) – the path to a folder in which the trained sentencepiece model will be cached. If None (default), a default directory (texar_data folder under the user's home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- classmethod train(cmd, cache_dir=None)[source]¶
Trains the tokenizer from the raw text file. This function is a wrapper of the sentencepiece.SentencePieceTrainer.Train function.
Example:
SentencePieceTokenizer.train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
- Parameters
cmd (str) – the command for the tokenizer training procedure. See sentencepiece.SentencePieceTrainer.Train for the detailed usage.
cache_dir (optional) – the path to a folder in which the trained sentencepiece model will be cached. If None (default), a default directory (texar_data folder under the user's home directory) will be used.
- Returns
Path to the cache directory.
- save_vocab(save_dir)[source]¶
Save the sentencepiece vocabulary (copy original file) to a directory.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
If hparams[‘vocab_file’] is specified, the tokenizer is directly loaded from the vocabulary file. In this case, all other configurations in hparams are ignored.
Otherwise, the tokenizer is automatically trained based on hparams[‘text_file’]. In this case, hparams[‘vocab_size’] must be specified.
hparams[‘vocab_file’] and hparams[‘text_file’] cannot both be None.
{ "vocab_file": None, "text_file": None, "vocab_size": None, "model_type": "unigram", "bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", }
Here:
- “vocab_file”: str or None
The path to a sentencepiece vocabulary file.
- “text_file”: str or None
Comma separated list of input sentences.
- “vocab_size”: int or None
Vocabulary size. The user can specify the vocabulary size, and the tokenizer training procedure will train and yield a vocabulary of the specified size.
- “model_type”: str
Model algorithm used to train the tokenizer. Available algorithms are: bpe, word, char, and unigram.
- “bos_token”: str or None
Beginning of sentence token. Set None to disable bos_token.
- “eos_token”: str or None
End of sentence token. Set None to disable eos_token.
- “unk_token”: str or None
Unknown token. Set None to disable unk_token.
- “pad_token”: str or None
Padding token. Set None to disable pad_token.
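Example
A sketch of constructing a tokenizer via hparams (the file name and vocabulary size are illustrative):

from texar.torch.data import SentencePieceTokenizer

# No "vocab_file" is given, so a unigram model is trained on
# "train.txt" with the requested vocabulary size.
tokenizer = SentencePieceTokenizer(hparams={
    "text_file": "train.txt",
    "vocab_size": 8000,
    "model_type": "unigram",
})
tokens = tokenizer.map_text_to_token("A raw input sentence.")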
BERTTokenizer¶
- class texar.torch.data.BERTTokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Pre-trained BERT Tokenizer.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under the user's home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- encode_text(text_a, text_b=None, max_seq_length=None)[source]¶
Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for BERT specific tasks. The sequence will be truncated if its length is larger than max_seq_length.
A BERT sequence has the following format: [cls_token] X [sep_token]
A BERT sequence pair has the following format: [cls_token] A [sep_token] B [sep_token]
- Parameters
text_a – The first input text.
text_b – The second input text.
max_seq_length – Maximum sequence length.
- Returns
A tuple of (input_ids, segment_ids, input_mask), where
input_ids: A list of input token ids with added special token ids.
segment_ids: A list of segment ids.
input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
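Example
A sketch of encoding a sentence pair, assuming tokenizer is a BERTTokenizer instance (the texts and length are illustrative):

input_ids, segment_ids, input_mask = tokenizer.encode_text(
    text_a="The premise.",
    text_b="The hypothesis.",
    max_seq_length=128)
# All three lists have length 128; `input_mask` is 0 at the
# padded positions.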
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The tokenizer is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the tokenizer is defined by the configurations in hparams.
{ "pretrained_model_name": "bert-base-uncased", "vocab_file": None, "max_len": 512, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": True, "do_lower_case": True, "do_basic_tokenize": True, "non_split_tokens": None, "name": "bert_tokenizer", }
Here:
- “pretrained_model_name”: str or None
The name of the pre-trained BERT model.
- “vocab_file”: str or None
The path to a one-wordpiece-per-line vocabulary file.
- “max_len”: int
The maximum sequence length that this model might ever be used with.
- “unk_token”: str
Unknown token.
- “sep_token”: str
Separation token.
- “pad_token”: str
Padding token.
- “cls_token”: str
Classification token.
- “mask_token”: str
Masking token.
- “tokenize_chinese_chars”: bool
Whether to tokenize Chinese characters.
- “do_lower_case”: bool
Whether to lower-case the input. Only has an effect when do_basic_tokenize is True.
- “do_basic_tokenize”: bool
Whether to do basic tokenization before wordpiece.
- “non_split_tokens”: list
List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize is True.
- “name”: str
Name of the tokenizer.
GPT2Tokenizer¶
- class texar.torch.data.GPT2Tokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Pre-trained GPT2 Tokenizer.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., 117M). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under the user's home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- encode_text(text, max_seq_length=None, append_eos_token=True)[source]¶
Adds special tokens to a sequence and computes the corresponding sequence length for GPT2 specific tasks. The sequence will be truncated if its length is larger than max_seq_length.
A GPT2 sequence has the following format: [bos_token] X [eos_token] [pad_token]
- Parameters
text – Input text.
max_seq_length – Maximum sequence length.
append_eos_token – Whether to append eos_token after the sequence.
- Returns
A tuple of (input_ids, seq_len), where
input_ids: A list of input token ids with added special tokens.
seq_len: The sequence length.
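Example
A sketch, assuming tokenizer is a GPT2Tokenizer instance (the prompt and length are illustrative):

input_ids, seq_len = tokenizer.encode_text(
    "An example prompt.",
    max_seq_length=64,
    append_eos_token=True)
# `input_ids` is padded to length 64; `seq_len` counts only the
# non-padding positions.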
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The tokenizer is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the tokenizer is defined by the configurations in hparams.
{ "pretrained_model_name": "117M", "vocab_file": None, "merges_file": None, "max_len": 1024, "bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>", "pad_token": "<|endoftext|>", "errors": "replace", "name": "gpt2_tokenizer", }
Here:
- “pretrained_model_name”: str or None
The name of the pre-trained GPT2 model.
- “vocab_file”: str or None
The path to a vocabulary json file mapping tokens to ids.
- “merges_file”: str or None
The path to a merges file.
- “max_len”: int
The maximum sequence length that this model might ever be used with.
- “bos_token”: str
Beginning of sentence token.
- “eos_token”: str
End of sentence token.
- “unk_token”: str
Unknown token.
- “pad_token”: str
Padding token.
- “errors”: str
Response when mapping tokens to text fails. The possible values are ignore, replace, and strict.
- “name”: str
Name of the tokenizer.
RoBERTaTokenizer¶
- class texar.torch.data.RoBERTaTokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Pre-trained RoBERTa Tokenizer.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., roberta-base). Please refer to PretrainedRoBERTaMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under the user's home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- encode_text(text_a, text_b=None, max_seq_length=None)[source]¶
Adds special tokens to a sequence or sequence pair and computes the corresponding input mask for RoBERTa specific tasks. The sequence will be truncated if its length is larger than max_seq_length.
A RoBERTa sequence has the following format: [cls_token] X [sep_token]
A RoBERTa sequence pair has the following format: [cls_token] A [sep_token] [sep_token] B [sep_token]
- Parameters
text_a – The first input text.
text_b – The second input text.
max_seq_length – Maximum sequence length.
- Returns
A tuple of (input_ids, segment_ids, input_mask), where
input_ids: A list of input token ids with added special token ids.
input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
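Example
A sketch of encoding a sentence pair, assuming tokenizer is a RoBERTaTokenizer instance (the texts and length are illustrative):

input_ids, segment_ids, input_mask = tokenizer.encode_text(
    text_a="The premise.",
    text_b="The hypothesis.",
    max_seq_length=128)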
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The tokenizer is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the tokenizer is defined by the configurations in hparams.
{ "pretrained_model_name": "roberta-base", "vocab_file": None, "merges_file": None, "max_len": 512, "bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": "<mask>", "errors": "replace", "name": "roberta_tokenizer", }
Here:
- “pretrained_model_name”: str or None
The name of the pre-trained RoBERTa model.
- “vocab_file”: str or None
The path to a vocabulary json file mapping tokens to ids.
- “merges_file”: str or None
The path to a merges file.
- “max_len”: int
The maximum sequence length that this model might ever be used with.
- “bos_token”: str
Beginning of sentence token.
- “eos_token”: str
End of sentence token.
- “sep_token”: str
Separation token.
- “cls_token”: str
Classification token.
- “unk_token”: str
Unknown token.
- “pad_token”: str
Padding token.
- “mask_token”: str
Masking token.
- “errors”: str
Response when decoding fails. The possible values are ignore, replace, and strict.
- “name”: str
Name of the tokenizer.
XLNetTokenizer¶
- class texar.torch.data.XLNetTokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Pre-trained XLNet Tokenizer.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under the user's home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- save_vocab(save_dir)[source]¶
Save the sentencepiece vocabulary (copy original file) to a directory.
- encode_text(text_a, text_b=None, max_seq_length=None)[source]¶
Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for XLNet specific tasks. The sequence will be truncated if its length is larger than max_seq_length.
An XLNet sequence has the following format: X [sep_token] [cls_token]
An XLNet sequence pair has the following format: A [sep_token] B [sep_token] [cls_token]
- Parameters
text_a – The first input text.
text_b – The second input text.
max_seq_length – Maximum sequence length.
- Returns
A tuple of (input_ids, segment_ids, input_mask), where
input_ids: A list of input token ids with added special token ids.
segment_ids: A list of segment ids.
input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
- encode_text_for_generation(text, max_seq_length=None, append_eos_token=True)[source]¶
Adds special tokens to a sequence and computes the corresponding sequence length for XLNet specific tasks. The sequence will be truncated if its length is larger than max_seq_length.
An XLNet sequence has the following format: [bos_token] X [eos_token] [pad_token]
- Parameters
text – Input text.
max_seq_length – Maximum sequence length.
append_eos_token – Whether to append eos_token after the sequence.
- Returns
A tuple of (input_ids, seq_len), where
input_ids: A list of input token ids with added special tokens.
seq_len: The sequence length.
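Example
A sketch of both encoding modes, assuming tokenizer is an XLNetTokenizer instance (the texts and lengths are illustrative):

# Classification-style input:
input_ids, segment_ids, input_mask = tokenizer.encode_text(
    text_a="The premise.",
    text_b="The hypothesis.",
    max_seq_length=128)

# Generation-style input:
input_ids, seq_len = tokenizer.encode_text_for_generation(
    "A prompt for generation.", max_seq_length=64)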
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The tokenizer is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the tokenizer is defined by the configurations in hparams.
{ "pretrained_model_name": "xlnet-base-cased", "vocab_file": None, "max_len": None, "bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "<sep>", "pad_token": "<pad>", "cls_token": "<cls>", "mask_token": "<mask>", "additional_special_tokens": ["<eop>", "<eod>"], "do_lower_case": False, "remove_space": True, "keep_accents": False, "name": "xlnet_tokenizer", }
Here:
- “pretrained_model_name”: str or None
The name of the pre-trained XLNet model.
- “vocab_file”: str or None
The path to a sentencepiece vocabulary file.
- “max_len”: int or None
The maximum sequence length that this model might ever be used with.
- “bos_token”: str
Beginning of sentence token.
- “eos_token”: str
End of sentence token.
- “unk_token”: str
Unknown token.
- “sep_token”: str
Separation token.
- “pad_token”: str
Padding token.
- “cls_token”: str
Classification token.
- “mask_token”: str
Masking token.
- “additional_special_tokens”: list
A list of additional special tokens.
- “do_lower_case”: bool
Whether to lower-case the text.
- “remove_space”: bool
Whether to remove the space in the text.
- “keep_accents”: bool
Whether to keep the accents in the text.
- “name”: str
Name of the tokenizer.
T5Tokenizer¶
- class texar.torch.data.T5Tokenizer(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Pre-trained T5 Tokenizer.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., T5-Small). Please refer to PretrainedT5Mixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under the user's home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The tokenizer is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the tokenizer is defined by the configurations in hparams.
{ "pretrained_model_name": "T5-Small", "vocab_file": None, "max_len": 512, "bos_token": None, "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 100, "additional_special_tokens": [], "name": "t5_tokenizer", }
Here:
- “pretrained_model_name”: str or None
The name of the pre-trained T5 model.
- “vocab_file”: str or None
The path to a sentencepiece vocabulary file.
- “max_len”: int or None
The maximum sequence length that this model might ever be used with.
- “bos_token”: str or None
Beginning of sentence token. Set None to disable bos_token.
- “eos_token”: str
End of sentence token. Set None to disable eos_token.
- “unk_token”: str
Unknown token. Set None to disable unk_token.
- “pad_token”: str
Padding token. Set None to disable pad_token.
- “extra_ids”: int
The number of extra ids appended to the end of the vocabulary for use as sentinels. These tokens are accessible as <extra_id_{%d}>, where {%d} is a number between 0 and extra_ids - 1. Extra tokens are indexed from the end of the vocabulary towards the beginning (<extra_id_0> is the last token in the vocabulary), as in T5 preprocessing; see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117. A usage sketch follows this list.
- “additional_special_tokens”: list
A list of additional special tokens.
- “name”: str
Name of the tokenizer.
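Example
A sketch of accessing the sentinel tokens (the resulting id depends on the loaded vocabulary):

from texar.torch.data import T5Tokenizer

tokenizer = T5Tokenizer(pretrained_model_name="T5-Small")

# With the default `extra_ids=100`, sentinels <extra_id_0> ...
# <extra_id_99> occupy the last 100 ids of the vocabulary, and
# <extra_id_0> is the very last token.
sentinel_id = tokenizer.map_token_to_id("<extra_id_0>")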
Vocabulary¶
SpecialTokens¶
Vocab¶
- class texar.torch.data.Vocab(filename, pad_token='<PAD>', bos_token='<BOS>', eos_token='<EOS>', unk_token='<UNK>')[source]¶
Vocabulary class that loads vocabulary from file, and maintains mapping tables between token strings and indexes.
Each line of the vocab file should contain one vocabulary token, e.g.:

vocab_token_1
vocab token 2
vocab token | 3
...
- Parameters
filename (str) – Path to the vocabulary file where each line contains one token.
bos_token (str) – A special token that will be added to the beginning of sequences.
eos_token (str) – A special token that will be added to the end of sequences.
unk_token (str) – A special token that will replace all unknown tokens (tokens not included in the vocabulary).
pad_token (str) – A special token that is used to do padding.
- load(filename)[source]¶
Loads the vocabulary from the file.
- Parameters
filename (str) – Path to the vocabulary file.
- Returns
A tuple of mapping tables between word string and index, (id_to_token_map_py, token_to_id_map_py), where id_to_token_map_py and token_to_id_map_py are python defaultdict instances.
- map_ids_to_tokens_py(ids)[source]¶
Maps ids into text tokens.
The input ids and the returned tokens are both python arrays or lists.
- Parameters
ids – An int numpy array or (possibly nested) list of token ids.
- Returns
A numpy array of text tokens of the same shape as ids.
- map_tokens_to_ids_py(tokens)[source]¶
Maps text tokens into ids.
The input tokens and the returned ids are both python arrays or lists.
- Parameters
tokens – A numpy array or (possibly nested) list of text tokens.
- Returns
A numpy array of token ids of the same shape as tokens.
- property id_to_token_map_py¶
The dictionary instance that maps from token index to the string form.
- property token_to_id_map_py¶
The dictionary instance that maps from token string to the index.
- property size¶
The vocabulary size.
- property bos_token¶
A string of the special token indicating the beginning of sequence.
- property bos_token_id¶
The int index of the special token indicating the beginning of sequence.
- property eos_token¶
A string of the special token indicating the end of sequence.
- property eos_token_id¶
The int index of the special token indicating the end of sequence.
- property unk_token¶
A string of the special token indicating unknown token.
- property unk_token_id¶
The int index of the special token indicating unknown token.
- property pad_token¶
A string of the special token indicating padding token. The default padding token is an empty string.
- property pad_token_id¶
The int index of the special token indicating padding token.
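Example
A sketch of typical usage (the vocabulary file name and contents are illustrative):

from texar.torch.data import Vocab

vocab = Vocab("vocab.txt")
ids = vocab.map_tokens_to_ids_py([["a", "sentence"]])
tokens = vocab.map_ids_to_tokens_py(ids)
vocab_size = vocab.size
pad_id = vocab.pad_token_id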
map_ids_to_strs¶
- texar.torch.data.map_ids_to_strs(ids, vocab, join=True, strip_pad='<PAD>', strip_bos='<BOS>', strip_eos='<EOS>')[source]¶
Transforms int indexes to strings by mapping ids to tokens, concatenating tokens into sentences, stripping special tokens, etc.
- Parameters
ids – An n-D numpy array or (possibly nested) list of int indexes.
vocab – An instance of Vocab.
join (bool) – Whether to concatenate along the last dimension of the tokens into a string separated with a space character.
strip_pad (str) – The PAD token to strip from the strings (i.e., remove the leading and trailing PAD tokens of the strings). Default is "<PAD>" as defined in SpecialTokens.PAD. Set to None or False to disable the stripping.
strip_bos (str) – The BOS token to strip from the strings (i.e., remove the leading BOS tokens of the strings). Default is "<BOS>" as defined in SpecialTokens.BOS. Set to None or False to disable the stripping.
strip_eos (str) – The EOS token to strip from the strings (i.e., remove the EOS tokens and all subsequent tokens of the strings). Default is "<EOS>" as defined in SpecialTokens.EOS. Set to None or False to disable the stripping.
- Returns
If join is True, returns a (n-1)-D numpy array (or list) of concatenated strings. If join is False, returns an n-D numpy array (or list) of str tokens.
Example
text_ids = [[1, 9, 6, 2, 0, 0], [1, 28, 7, 8, 2, 0]]

text = map_ids_to_strs(text_ids, data.vocab)
# text == ['a sentence', 'parsed from ids']

text = map_ids_to_strs(
    text_ids, data.vocab, join=False,
    strip_pad=None, strip_bos=None, strip_eos=None)
# text == [['<BOS>', 'a', 'sentence', '<EOS>', '<PAD>', '<PAD>'],
#          ['<BOS>', 'parsed', 'from', 'ids', '<EOS>', '<PAD>']]
Embedding¶
Embedding¶
- class texar.torch.data.Embedding(vocab, hparams=None)[source]¶
Embedding class that loads token embedding vectors from a file. Token embeddings not in the embedding file are initialized as specified in hparams.
- Parameters
vocab (dict) – A dictionary that maps token strings to integer indexes.
hparams (dict) – Hyperparameters. See default_hparams() for the defaults.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values:
{ "file": "", "dim": 50, "read_fn": "load_word2vec", "init_fn": { "type": "numpy.random.uniform", "kwargs": { "low": -0.1, "high": 0.1, } }, }
Here:
- “file”: str
Path to the embedding file. If not provided, all embeddings are initialized with the initialization function.
- “dim”: int
Dimension size of each embedding vector.
- “read_fn”: str or callable
Function to read the embedding file. This can be the function, or its string name or full module path. For example,
"read_fn": texar.torch.data.load_word2vec "read_fn": "load_word2vec" "read_fn": "texar.torch.data.load_word2vec" "read_fn": "my_module.my_read_fn"
If a function string name is used, the function must be in one of the modules texar.torch.data or texar.torch.custom.
The function must have the same signature as load_word2vec().
- “init_fn”: dict
Hyperparameters of the initialization function used to initialize embedding of tokens missing in the embedding file.
The function must accept argument named size or shape to specify the output shape, and return a numpy array of the shape.
The dict has the following fields:
- “type”: str or callable
The initialization function. Can be either the function, or its string name or full module path.
- “kwargs”: dict
Keyword arguments for calling the function. The function is called with init_fn(size=[.., ..], **kwargs).
- property word_vecs¶
2D numpy array of shape [vocab_size, embedding_dim].
- property vector_size¶
The embedding dimension size.
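Example
A sketch of loading pre-trained GloVe vectors for a vocabulary (file names are illustrative):

from texar.torch.data import Embedding, Vocab

vocab = Vocab("vocab.txt")
embedding = Embedding(
    vocab.token_to_id_map_py,
    hparams={
        "file": "glove.6B.50d.txt",
        "dim": 50,
        "read_fn": "load_glove",
    })
word_vecs = embedding.word_vecs  # shape [vocab_size, 50]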
load_word2vec¶
- texar.torch.data.load_word2vec(filename, vocab, word_vecs)[source]¶
Loads embeddings in the word2vec binary format, which has a header line containing the number of vectors and their dimensionality (two integers), followed by number-of-vectors lines, each of which is formatted as <word-string> <embedding-vector>.
load_glove¶
Data Sources¶
DataSource¶
- class texar.torch.data.DataSource(*args, **kwds)[source]¶
Base class for all data sources. A data source represents the source of the data, from which raw data examples are read and returned.
Different from PyTorch Dataset, subclasses of this class are not required to implement __getitem__() (the default implementation raises TypeError), which is beneficial for certain sources that only support iteration (reading from text files, reading Python iterators, etc.).
SequenceDataSource¶
IterDataSource¶
- class texar.torch.data.IterDataSource(iterable)[source]¶
Data source for reading from Python iterables. Note that if an iterator is passed and the caching strategy is set to ‘none’, the data source can only be iterated over once.
This data source does not support indexing.
- Parameters
iterable – The Python iterable to read from.
ZipDataSource¶
- class texar.torch.data.ZipDataSource(*sources)[source]¶
Data source by combining multiple sources. The raw examples returned from this data source are tuples, with elements being raw examples from each of the constituting data sources.
This data source supports indexing if all the constituting data sources support indexing.
- Parameters
sources – The list of data sources to combine.
FilterDataSource¶
- class texar.torch.data.FilterDataSource(source, filter_fn)[source]¶
Data source for filtering raw examples with a user-specified filter function. Only examples for which the filter function returns True are returned.
This data source supports indexing if the wrapped data source supports indexing.
- Parameters
source – The data source to filter.
filter_fn – A callable taking a raw example as argument and returning a boolean value, indicating whether the raw example should be kept.
RecordDataSource¶
- class texar.torch.data.RecordDataSource(sources)[source]¶
Data source by structuring multiple sources. The raw examples returned from this data source are dictionaries, with values being raw examples from each of the constituting data sources.
This data source supports indexing if all the constituting data sources support indexing.
- Parameters
sources – A dictionary mapping names to data sources, containing the data sources to combine.
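Example
A sketch of composing data sources (the wrapped iterables are illustrative):

import texar.torch as tx

tokens = tx.data.IterDataSource([["a", "b"], ["c"]])
labels = tx.data.IterDataSource([0, 1])

zipped = tx.data.ZipDataSource(tokens, labels)   # yields tuples
named = tx.data.RecordDataSource(
    {"tokens": tokens, "label": labels})         # yields dicts
short = tx.data.FilterDataSource(
    tokens, filter_fn=lambda ex: len(ex) <= 1)   # keeps short examples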
TextLineDataSource¶
- class texar.torch.data.TextLineDataSource(file_paths, compression_type=None, encoding=None, delimiter=None, max_length=None)[source]¶
Data source for reading from (multiple) text files. Each line is tokenized and yielded as an example.
This data source does not support indexing.
- Parameters
file_paths (str or list[str]) – Paths to the text files.
compression_type (str, optional) – The compression type for the text files; "gzip" and "zlib" are supported. Default is None, in which case files are treated as plain text files.
encoding (str, optional) – Encoding for the files. By default uses the default locale of the system (usually UTF-8).
delimiter (str, optional) – Delimiter for tokenization purposes. This is used in combination with max_length. If None, text is split on any blank character.
max_length (int, optional) – Maximum length for data examples. Length is measured as the number of tokens in a line after being tokenized using the provided delimiter. Lines with more than max_length tokens will be dropped.
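Example
A sketch of reading compressed text files (file names and settings are illustrative):

import texar.torch as tx

source = tx.data.TextLineDataSource(
    ["train.0.txt.gz", "train.1.txt.gz"],
    compression_type="gzip",
    max_length=100)
for example in source:
    ...  # `example` is the list of tokens from one line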
PickleDataSource¶
- class texar.torch.data.PickleDataSource(file_paths, lists_are_examples=True, **pickle_kwargs)[source]¶
Data source for reading from (multiple) pickled binary files. Each file could contain multiple pickled objects, and each object is yielded as an example.
This data source does not support indexing.
- Parameters
file_paths (str or list[str]) – Paths to pickled binary files.
lists_are_examples (bool) –
If True, each list will be treated as a single example; if False, each element in the list will be treated as a separate example. Default is True. Set this to False if the entire pickled binary file is a list.
Note
Storing all examples in a single list is not recommended, because in that case, examples can only be accessed after the whole list is parsed.
pickle_kwargs – Additional keyword arguments to pass to pickle.load().
Data Loaders¶
DatasetBase¶
- class texar.torch.data.DatasetBase(source, hparams=None, device=None)[source]¶
Base class inherited by all data classes.
- Parameters
source – An instance of type DataSource.
hparams – A dict or instance of HParams containing hyperparameters. See default_hparams() for the defaults.
device – The device of the produced batches. For GPU training, set to the current CUDA device.
Note
When device is set to a CUDA device, tensors in the batch will be automatically moved to the specified device. This may result in performance issues if your data examples contain complex structures (e.g., nested lists with many elements). In this case, it is recommended to set device to None and manually move your data.
For more details, see collate().
Users can also directly inherit from this class to implement customized data processing routines. Two methods should be implemented in the subclass:
process(): Process a single data example read from the data source (raw example). The default implementation returns the raw example as is.
collate(): Combine a list of processed examples into a single batch, and return an object of type Batch.
Example
Here, we define a custom data class named MyDataset, which is equivalent to the most basic usage of MonoTextData.

class MyDataset(tx.data.DatasetBase):
    def __init__(self, data_path, vocab, hparams=None, device=None):
        source = tx.data.TextLineDataSource(data_path)
        self.vocab = vocab
        super().__init__(source, hparams, device)

    def process(self, raw_example):
        # `raw_example` is a data example read from `self.source`,
        # in this case, a line of tokenized text, represented as a
        # list of `str`.
        return {
            "text": raw_example,
            "ids": self.vocab.map_tokens_to_ids_py(raw_example),
        }

    def collate(self, examples):
        # `examples` is a list of objects returned from the
        # `process` method. These data examples should be collated
        # into a batch.
        # `text` is a list of list of `str`, storing the tokenized
        # sentences for each example in the batch.
        text = [ex["text"] for ex in examples]
        # `ids` is the NumPy tensor built from the token IDs of each
        # sentence, and `lengths` the lengths of each sentence.
        # The `tx.data.padded_batch` function pads IDs to the same
        # length and then stacks them together. This function is
        # commonly used in `collate` methods.
        ids, lengths = tx.data.padded_batch(
            [ex["ids"] for ex in examples])
        return tx.data.Batch(
            len(examples),
            text=text,
            text_ids=torch.from_numpy(ids),
            lengths=torch.tensor(lengths))

vocab = tx.data.Vocab("vocab.txt")
hparams = {'batch_size': 1}
data = MyDataset("data.txt", vocab, hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch_ == {
    #     'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
    #     'text_ids': [[1, 5, 10, 2]],
    #     'length': [4]
    # }
- static default_hparams()[source]¶
Returns a dictionary of default hyperparameters.
{ "num_epochs": 1, "batch_size": 64, "allow_smaller_final_batch": True, "shuffle": True, "shuffle_buffer_size": None, "shard_and_shuffle": False, "num_parallel_calls": 1, "prefetch_buffer_size": 0, "max_dataset_size": -1, "seed": None, "lazy_strategy": 'none', "cache_strategy": 'processed', "parallelize_processing": True, "name": "data" }
Here:
- “num_epochs”: int
Number of times the dataset should be repeated.
Note
This option only exists for compatibility, and will be ignored. A warning will be generated if any value other than 1 is used.
- “batch_size”: int
Batch size, i.e., the number of consecutive elements of the dataset to combine in a single batch.
- “allow_smaller_final_batch”: bool
Whether to allow the final batch to be smaller if there are insufficient elements left. If False, the final batch is discarded if it is smaller than the batch size. Note that, if True, output_shapes of the resulting dataset will have a static batch_size dimension equal to “batch_size”.
- “shuffle”: bool
Whether to randomly shuffle the elements of the dataset.
- “shuffle_buffer_size”: int
The buffer size for data shuffling. The larger, the better the resulting data is mixed.
If None (default), the buffer size is set to the size of the whole dataset (i.e., shuffling is maximally effective).
- “shard_and_shuffle”: bool
Whether to first shard the dataset and then shuffle each block respectively. Useful when the whole data is too large to be loaded efficiently into the memory.
If True, shuffle_buffer_size must be specified to determine the size of each shard.
Warning
Sharding is not yet supported. This option will be ignored.
- “num_parallel_calls”: int
Number of elements from the datasets to process in parallel. When "num_parallel_calls" equals 0, no worker processes will be created; when the value is greater than 0, the number of worker processes will be equal to "num_parallel_calls".
- “prefetch_buffer_size”: int
The maximum number of elements that will be buffered when prefetching.
Note
This option exists only for compatibility. Currently, data is only prefetched when "num_parallel_calls" is greater than 1, and the number of examples to prefetch is controlled internally by the PyTorch DataLoader.
- “max_dataset_size”: int
Maximum number of instances to include in the dataset. If set to -1 or greater than the size of dataset, all instances will be included. This constraint is imposed after data shuffling and filtering.
- “seed”: int, optional
The random seed for shuffle.
Note that if a seed is set, the shuffle order will be exactly the same every time the (repeated) dataset is iterated over.
Warning
Manual seeding is not yet supported. This option will be ignored.
- “lazy_strategy”: str
Lazy strategy for data examples. Lazy loading/processing defers data loading/processing until when it’s being accessed. Non-lazy (eager) loading/processing would load/process all data upon construction of dataset. Available options are:
none: Perform eager loading and processing.
process: Perform eager loading and lazy processing.
all: Perform lazy loading and processing.
Defaults to all. Note that currently, all eager operations are performed on a single process only.
- “cache_strategy”: str
Caching strategy for data examples. Available options are:
none: No data is cached. Data is always loaded from source (e.g. file) and processed upon access.
loaded: Only cache raw data loaded from source, processing routines are performed upon access.
processed: Processed data is cached. Note: raw data will not be cached in this case, because raw data is only used to construct the processed data.
Default value is loaded. This option depends on the value of lazy_strategy, specifically:
When lazy_strategy is none, all choices of cache_strategy are equivalent to processed.
When lazy_strategy is process, none is equivalent to loaded.
- “parallelize_processing”: bool
Whether to perform parallelized processing of data. Since multi-processing parallelism is utilized, this flag should be False if your process routine involves modifying a shared object across examples.
Note that this only affects cases where lazy_strategy is not none. If lazy_strategy is none, processing will be performed on a single process regardless of this value.
- “max_batch_size”: int
AdaptDL parameter. Maximum global batch size used for distributed training with AdaptDL.
- “local_bsz_bounds”: (int, int)
AdaptDL parameter. Local batch size bounds (min, max) per replica for distributed training.
- “gradient_accumulation”: bool
AdaptDL parameter. Enable gradient accumulation.
- “name”: str
Name of the data.
- to(device)[source]¶
Move the dataset to the specified device. Note that we don't actually move data or do anything here; data will be moved to the appropriate device after DataIterator fetches the batch.
- process(raw_example)[source]¶
The process routine. A default implementation of no-op is provided, but subclasses are free to override this behavior.
The process routine takes raw examples loaded from the data source as input, and returns processed examples. If parallelize_processing is True, this method must not access shared variables that are modified during iteration (e.g., constructing vocabularies on-the-fly).
- Parameters
raw_example – The raw example loaded from data.
- Returns
The processed example.
- property num_epochs¶
Number of epochs.
- property batch_size¶
The batch size.
- property name¶
Name of the module.
- property dataset¶
The data source.
- collate(examples)[source]¶
The collate routine. Subclasses must implement this method.
The collate routine is called to collate (combine) examples into batches. This method takes a list of processed examples, and returns an instance of Batch.
Note
Implementations should make sure that this method is safe and efficient under multi-processing scenarios. Basically, do not rely on variables that could be modified during iteration, and avoid accessing unnecessary variables, as each access would result in a cross-process memory copy.
Warning
The recommended pattern is not to move tensor storage within this method, but you are free to do so.
However, if multiple workers are used (num_parallel_calls > 0), moving tensors to CUDA devices within this method would result in CUDA errors being thrown.
- Parameters
examples – A list of processed examples in a batch.
- Returns
The collated batch.
MonoTextData¶
- class texar.torch.data.MonoTextData(hparams, device=None, vocab=None, embedding=None, data_source=None)[source]¶
Text data processor that reads single set of text files. This can be used for, e.g., language models, auto-encoders, etc.
- Parameters
hparams – A dict or instance of HParams containing hyperparameters. See default_hparams() for the defaults.
device – The device of the produced batches. For GPU training, set to the current CUDA device.
By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a Dataset whose element is a python dict including three fields:
- “text”:
A list of [batch_size] elements, each containing a list of raw text tokens of the sequences. Short sequences in the batch are padded with empty strings. By default only the EOS token is appended to each sequence. Out-of-vocabulary tokens are NOT replaced with UNK.
- “text_ids”:
A list of [batch_size] elements, each containing a list of token indexes of the source sequences in the batch.
- “length”:
A list of [batch_size] integers containing the length of each source sequence in the batch (including BOS and EOS if added).
The above field names can be accessed through text_name, text_id_name, and length_name.
Example
hparams = {
    'dataset': {'files': 'data.txt', 'vocab_file': 'vocab.txt'},
    'batch_size': 1
}
data = MonoTextData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch_ == {
    #     'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
    #     'text_ids': [[1, 5, 10, 2]],
    #     'length': [4]
    # }
- static default_hparams()[source]¶
Returns a dictionary of default hyperparameters:
{
    # (1) Hyperparameters specific to text dataset
    "dataset": {
        "files": [],
        "compression_type": None,
        "vocab_file": "",
        "embedding_init": {},
        "delimiter": None,
        "max_seq_length": None,
        "length_filter_mode": "truncate",
        "pad_to_max_seq_length": False,
        "bos_token": "<BOS>",
        "eos_token": "<EOS>",
        "other_transformations": [],
        "variable_utterance": False,
        "utterance_delimiter": "|||",
        "max_utterance_cnt": 5,
        "data_name": None,
    },
    # (2) General hyperparameters
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "mono_text_data",
    # (3) Bucketing
    "bucket_boundaries": [],
    "bucket_batch_sizes": None,
    "bucket_length_fn": None,
}
Here:
1. For the hyperparameters in the "dataset" field:
- “files”: str or list
A (list of) text file path(s).
Each line contains a single text sequence.
- “compression_type”: str, optional
One of None (no compression), "ZLIB", or "GZIP".
- “vocab_file”: str
Path to vocabulary file. Each line of the file should contain one vocabulary token.
Used to create an instance of Vocab.
- “embedding_init”: dict
The hyperparameters for pre-trained embedding loading and initialization.
The structure and default values are defined in texar.torch.data.Embedding.default_hparams().
- “delimiter”: str, optional
The delimiter to split each line of the text files into tokens. If None (default), behavior will be equivalent to str.split(), i.e. split on any blank character.
- “max_seq_length”: int, optional
Maximum length of output sequences. Data samples exceeding the length will be truncated or discarded according to "length_filter_mode". The length does not include any added "bos_token" or "eos_token". If None (default), no filtering is performed.
- “length_filter_mode”: str
Either "truncate" or "discard". If "truncate" (default), tokens exceeding "max_seq_length" will be truncated. If "discard", data samples longer than "max_seq_length" will be discarded.
- “pad_to_max_seq_length”: bool
If True, pad all data instances to length "max_seq_length". Raises an error if "max_seq_length" is not provided.
- “bos_token”: str
The Begin-Of-Sequence token prepended to each sequence.
Set to an empty string to avoid prepending.
- “eos_token”: str
The End-Of-Sequence token appended to each sequence.
Set to an empty string to avoid appending.
- “other_transformations”: list
A list of transformation functions or function names/paths to further transform each single data instance.
(More documentation to be added.)
- “variable_utterance”: bool
If True, each line of the text file is considered to contain multiple sequences (utterances) separated by "utterance_delimiter".
For example, in dialog data, each line can contain a series of dialog history utterances. See the example in examples/hierarchical_dialog for a use case.
Warning
Variable utterances are not yet supported. This option (and the related ones below) will be ignored.
- “utterance_delimiter”: str
The delimiter to split at the utterance level. Should not be the same as "delimiter". Used only when "variable_utterance" is True.
Maximally allowed number of utterances in a data instance. Extra utterances are truncated out.
- “data_name”: str
Name of the dataset.
2. For the general hyperparameters, see texar.torch.data.DatasetBase.default_hparams() for details.
3. Bucketing groups elements of the dataset together by length, then pads and batches them. For the bucketing hyperparameters:
- “bucket_boundaries”: list
An int list containing the upper length boundaries of the buckets.
Set to an empty list (default) to disable bucketing.
- “bucket_batch_sizes”: list
An int list containing batch size per bucket. Length should be len(bucket_boundaries) + 1.
If None, every bucket will have the same batch size specified in batch_size.
- “bucket_length_fn”: str or callable
A function that maps a dataset element to an int, determining the length of the element.
This can be a function, or the name or full module path to the function. If a function name is given, the function must be in the texar.torch.custom module.
If None (default), length is determined by the number of tokens (including BOS and EOS if added) of the element.
Warning
Bucketing is not yet supported. These options will be ignored.
- list_items()[source]¶
Returns the list of item names that the data can produce.
- Returns
A list of strings.
- property text_name¶
The name for the text field.
- property text_id_name¶
The name for the text ids.
- property length_name¶
The name for the text length.
- property embedding_init_value¶
The Tensor containing the embedding value loaded from file. None if embedding is not specified.
PairedTextData¶
- class texar.torch.data.PairedTextData(hparams, device=None)[source]¶
Text data processor that reads parallel source and target text. This can be used in, e.g., seq2seq models.
- Parameters
hparams (dict) – Hyperparameters. See default_hparams() for the defaults.
device – The device of the produced batches. For GPU training, set to the current CUDA device.
By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a Dataset whose element is a python dict including six fields:
- “source_text”:
A list of [batch_size] elements, each containing a list of raw text tokens of the source sequences. Short sequences in the batch are padded with empty strings. By default only the EOS token is appended to each sequence. Out-of-vocabulary tokens are NOT replaced with UNK.
- “source_text_ids”:
A list of [batch_size] elements, each containing a list of token indexes of the source sequences in the batch.
- “source_length”:
A list of [batch_size] integers containing the length of each source sequence in the batch.
- “target_text”:
A list same as “source_text” but for target sequences. By default both BOS and EOS are added.
- “target_text_ids”:
A list same as “source_text_ids” but for target sequences.
- “target_length”:
A list same as “source_length” but for target sequences.
The above field names can be accessed through source_text_name, source_text_id_name, source_length_name, and those prefixed with target_, respectively.
Example:
hparams = {
    'source_dataset': {'files': 's', 'vocab_file': 'vs'},
    'target_dataset': {'files': ['t1', 't2'], 'vocab_file': 'vt'},
    'batch_size': 1
}
data = PairedTextData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch_ == {
    #     'source_text': [['source', 'sequence', '<EOS>']],
    #     'source_text_ids': [[5, 10, 2]],
    #     'source_length': [3],
    #     'target_text': [['<BOS>', 'target', 'sequence', '1', '<EOS>']],
    #     'target_text_ids': [[1, 6, 10, 20, 2]],
    #     'target_length': [5]
    # }
- static default_hparams()[source]¶
Returns a dictionary of default hyperparameters.
{
    # (1) Hyperparams specific to text dataset
    "source_dataset": {
        "files": [],
        "compression_type": None,
        "vocab_file": "",
        "embedding_init": {},
        "delimiter": None,
        "max_seq_length": None,
        "length_filter_mode": "truncate",
        "pad_to_max_seq_length": False,
        "bos_token": None,
        "eos_token": "<EOS>",
        "other_transformations": [],
        "variable_utterance": False,
        "utterance_delimiter": "|||",
        "max_utterance_cnt": 5,
        "data_name": "source",
    },
    "target_dataset": {
        # ...
        # Same fields are allowed as in "source_dataset" with the
        # same default values, except the following new
        # fields/values:
        "bos_token": "<BOS>",
        "vocab_share": False,
        "embedding_init_share": False,
        "processing_share": False,
        "data_name": "target"
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "paired_text_data",
    # (3) Bucketing
    "bucket_boundaries": [],
    "bucket_batch_sizes": None,
    "bucket_length_fn": None,
}
Here:
Hyperparameters in the "source_dataset" and "target_dataset" fields have the same definition as those in texar.torch.data.MonoTextData.default_hparams(), for source and target text, respectively.
For the new hyperparameters in "target_dataset":
- “vocab_share”: bool
Whether to share the vocabulary of the source. If True, the vocab file of the target is ignored.
- “embedding_init_share”: bool
Whether to share the embedding initial value of the source. If True, "embedding_init" of the target is ignored. "vocab_share" must be true to share the embedding initial value.
- “processing_share”: bool
Whether to share the processing configurations of the source, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”.
For the general hyperparameters, see texar.torch.data.DatasetBase.default_hparams() for details.
For the bucketing hyperparameters, see texar.torch.data.MonoTextData.default_hparams() for details, except that the default “bucket_length_fn” is the maximum sequence length of the source and target sequences.
Warning
Bucketing is not yet supported. These options will be ignored.
- list_items()[source]¶
Returns the list of item names that the data can produce.
- Returns
A list of strings.
- property source_text_name¶
The name for the source text.
- property source_text_id_name¶
The name for the source text ids.
- property source_length_name¶
The name for the source length.
- property target_text_name¶
The name for the target text.
- property target_text_id_name¶
The name for the target text ids.
- property target_length_name¶
The name for the target length.
ScalarData¶
- class texar.torch.data.ScalarData(hparams, device=None, data_source=None)[source]¶
Scalar data where each line of the files is a scalar (int or float), e.g., a data label.
- Parameters
hparams (dict) – Hyperparameters. See default_hparams() for the defaults.
device – The device of the produced batches. For GPU training, set to the current CUDA device.
The processor reads and processes raw data and results in a dataset whose element is a python dict including one field. The field name is specified in hparams["dataset"]["data_name"]. If not specified, the default name is “data”. The field name can be accessed through data_name.
This field is a Tensor of shape [batch_size] containing a batch of scalars, of either int or float type as specified in hparams.
Example
hparams = {
    'dataset': {'files': 'data.txt', 'data_name': 'label'},
    'batch_size': 2
}
data = ScalarData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch == {
    #     'label': [2, 9]
    # }
- static default_hparams()[source]¶
Returns a dictionary of default hyperparameters.
{
    # (1) Hyperparams specific to scalar dataset
    "dataset": {
        "files": [],
        "compression_type": None,
        "data_type": "int",
        "other_transformations": [],
        "data_name": "data",
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "scalar_data",
}
Here:
For the hyperparameters in the "dataset" field:

- “files”: str or list
A (list of) file path(s).
Each line contains a single scalar number.
- “compression_type”: str, optional
One of “” (no compression), “ZLIB”, or “GZIP”.
- “data_type”: str
The scalar type. Types defined in get_supported_scalar_types() are supported.
- “other_transformations”: list
A list of transformation functions or function names/paths to further transform each single data instance.
(More documentation to be added.)
- “data_name”: str
Name of the dataset.
For the general hyperparameters, see
texar.torch.data.DatasetBase.default_hparams()
for details.
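For instance, a minimal sketch (the file name scores.txt is hypothetical, holding one float per line) that reads float-valued scalars:

hparams = {
    'dataset': {
        'files': 'scores.txt',   # hypothetical file: one float per line
        'data_type': 'float',
        'data_name': 'score',
    },
    'batch_size': 4,
}
data = ScalarData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    scores = batch['score']  # a float Tensor of shape [batch_size]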
- list_items()[source]¶
Returns the list of item names that the data can produce.
- Returns
A list of strings.
- property data_name¶
The name of the data tensor, “data” by default if not specified in
hparams
.
MultiAlignedData¶
- class texar.torch.data.MultiAlignedData(hparams, device=None)[source]¶
Data consisting of multiple aligned parts.
- Parameters
hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

device – The device of the produced batches. For GPU training, set to current CUDA device.
The processor can read any number of parallel fields as specified in the "datasets" list of hparams, and results in a dataset whose element is a Python dict containing data fields from each of the specified datasets. Fields from a text dataset or Record dataset have names prefixed by its "data_name". Fields from a scalar dataset are named by its "data_name" directly.

Example
hparams = {
    'datasets': [
        {'files': 'a.txt', 'vocab_file': 'v.a', 'data_name': 'x'},
        {'files': 'b.txt', 'vocab_file': 'v.b', 'data_name': 'y'},
        {'files': 'c.txt', 'data_type': 'int', 'data_name': 'z'},
    ],
    'batch_size': 1,
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch == {
    #     'x_text': [['<BOS>', 'x', 'sequence', '<EOS>']],
    #     'x_text_ids': [[1, 5, 10, 2]],
    #     'x_length': [4],
    #     'y_text': [['<BOS>', 'y', 'sequence', '1', '<EOS>']],
    #     'y_text_ids': [[1, 6, 10, 20, 2]],
    #     'y_length': [5],
    #     'z': [1000],
    # }
    ...

hparams = {
    'datasets': [
        {'files': 'd.txt', 'vocab_file': 'v.d', 'data_name': 'm'},
        {
            'files': 'd.tfrecord',
            'data_type': 'record',
            'feature_types': {
                'image': ['tf.string', 'stacked_tensor'],
            },
            'image_options': {
                'image_feature_name': 'image',
                'resize_height': 512,
                'resize_width': 512,
            },
            'data_name': 't',
        },
    ],
    'batch_size': 1,
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch contains the following
    # batch == {
    #     'm_text': [['<BOS>', 'NewYork', 'City', 'Map', '<EOS>']],
    #     'm_text_ids': [[1, 100, 80, 65, 2]],
    #     'm_length': [5],
    #
    #     # "t_image" is a list of a "numpy.ndarray" image in this
    #     # example. Its width is 512 and its height is 512.
    #     't_image': [...]
    # }
- static default_hparams()[source]¶
Returns a dictionary of default hyperparameters:
{
    # (1) Hyperparams specific to the datasets
    "datasets": [],
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "multi_aligned_data",
}
Here:
“datasets” is a list of dicts, each of which specifies a dataset that can be text, scalar, or Record. The "data_name" field of each dataset is used as the name prefix of the data fields from the respective dataset. The "data_name" of each dataset must be unique.

For a scalar dataset, the allowed hyperparameters and default values are the same as the "dataset" field of texar.torch.data.ScalarData.default_hparams(). Note that "data_type" must be explicitly specified (either "int" or "float").

For a Record dataset, the allowed hyperparameters and default values are the same as the "dataset" field of texar.torch.data.RecordData.default_hparams(). Note that "data_type" must be explicitly specified ("record").

For a text dataset, the allowed hyperparameters and default values are the same as the "dataset" field of texar.torch.data.MonoTextData.default_hparams(), with several extra hyperparameters:

- “data_type”: str
The type of the dataset, one of {“text”, “int”, “float”, “record”}. If set to “int” or “float”, the dataset is considered to be a scalar dataset. If set to “record”, the dataset is considered to be a Record dataset.
If not specified or set to “text”, the dataset is considered to be a text dataset.
- “vocab_share_with”: int, optional
Share the vocabulary with a preceding text dataset, specified by its index in the list (starting from 0). The specified dataset must be a text dataset and must have an index smaller than that of the current dataset. If specified, the vocab file of the current dataset is ignored. Default is None, which disables vocab sharing.
- “embedding_init_share_with”: int, optional
Share the embedding initial value with a preceding text dataset, specified by its index in the list (starting from 0). The specified dataset must be a text dataset and must have an index smaller than that of the current dataset. If specified, the "embedding_init" field of the current dataset is ignored. Default is None, which disables initial value sharing.
- “processing_share_with”: int, optional
Share the processing configurations with a preceding text dataset, specified by its index in the list (starting from 0). The specified dataset must be a text dataset and must have an index smaller than that of the current dataset. If specified, the relevant fields of the current dataset are ignored, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”. Default is None, which disables processing sharing. A configuration sketch using these options follows this list.
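For illustration, a minimal sketch (hypothetical file names) where the second text dataset shares the first's vocabulary and processing configuration:

hparams = {
    'datasets': [
        {'files': 'src.txt', 'vocab_file': 'vocab.txt', 'data_name': 'src'},
        {
            'files': 'tgt.txt',
            'data_name': 'tgt',
            'vocab_share_with': 0,       # reuse dataset 0's vocabulary
            'processing_share_with': 0,  # reuse dataset 0's processing config
        },
    ],
    'batch_size': 2,
}
data = MultiAlignedData(hparams)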
For the general hyperparameters, see
texar.torch.data.DatasetBase.default_hparams()
for details.
- list_items()[source]¶
Returns the list of item names that the data can produce.
- Returns
A list of strings.
- vocab(name_or_id)[source]¶
Returns the
Vocab
of the text dataset by its name or id, or None if the dataset is not of text type.
- embedding_init_value(name_or_id)[source]¶
Returns the Tensor of the embedding initial value of the dataset by its name or id, or None if the dataset is not of text type.
- text_name(name_or_id)[source]¶
The name of the text tensor of the text dataset, looked up by its name or id. If the dataset is not of text type, returns None.
- length_name(name_or_id)[source]¶
The name of the length tensor of the text dataset, looked up by its name or id. If the dataset is not of text type, returns None.
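As an illustration, a minimal sketch (assuming the configuration of the first example above, with datasets named 'x', 'y', and 'z') of querying datasets by name or index:

data = MultiAlignedData(hparams)  # hparams as in the first example above
vocab_x = data.vocab('x')    # Vocab of the text dataset named 'x'
vocab_0 = data.vocab(0)      # the same dataset, queried by index
name = data.text_name('x')   # the batch field holding the text, e.g. 'x_text'
data.vocab('z')              # None: 'z' is a scalar dataset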
RecordData¶
- class texar.torch.data.RecordData(hparams=None, device=None, data_source=None)[source]¶
Record data which loads and processes pickled files.
This module can be used to process image data, features, etc.
- Parameters
hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

device – The device of the produced batches. For GPU training, set to current CUDA device.
The module reads and restores data from pickled files and results in a dataset whose element is a Python dict that maps feature names to feature values. The feature names and dtypes are specified in hparams.dataset.feature_types.

The module also provides simple processing options for image data, such as image resizing.
Example
# Read data from pickled file
hparams = {
    'dataset': {
        'files': 'image1.pkl',
        'feature_types': {
            'height': ['int64', 'list'],  # or 'stacked_tensor'
            'width': ['int64', 'list'],   # or 'stacked_tensor'
            'label': ['int64', 'stacked_tensor'],
            'image_raw': ['bytes', 'stacked_tensor'],
        },
    },
    'batch_size': 1,
}
data = RecordData(hparams)
iterator = DataIterator(data)
batch = next(iter(iterator))  # get the first batch in dataset
# batch == {
#     'data': {
#         'height': [239],
#         'width': [149],
#         'label': tensor([1]),
#
#         # 'image_raw' is a NumPy ndarray of raw image bytes in this
#         # example.
#         'image_raw': [...],
#     }
# }
# Read image data from pickled file and do resizing
hparams = {
    'dataset': {
        'files': 'image2.pkl',
        'feature_types': {
            'label': ['int64', 'stacked_tensor'],
            'image_raw': ['bytes', 'stacked_tensor'],
        },
        'image_options': {
            'image_feature_name': 'image_raw',
            'resize_height': 512,
            'resize_width': 512,
        },
    },
    'batch_size': 1,
}
data = RecordData(hparams)
iterator = DataIterator(data)
batch = next(iter(iterator))  # get the first batch in dataset
# batch == {
#     'data': {
#         'label': tensor([1]),
#
#         # "image_raw" is a tensor of image pixel data in this
#         # example. Each image has a width of 512 and height of 512.
#         'image_raw': tensor([...])
#     }
# }
- classmethod writer(file_path, feature_types)[source]¶
Construct a file writer object that saves records in pickled format.
Example:
import numpy as np
import texar.torch as tx

file_path = "data/train.pkl"
feature_types = {
    "input_ids": ["int64", "stacked_tensor", 128],
    "label_ids": ["int64", "stacked_tensor"],
}
with tx.data.RecordData.writer(file_path, feature_types) as writer:
    writer.write({
        "input_ids": np.random.randint(0, 100, size=128),
        "label_ids": np.random.randint(0, 100),
    })
- Parameters
file_path (str) – Path to save the dataset.
feature_types – Feature names and types. Please refer to
default_hparams()
for details.
- Returns
A file writer object.
- static default_hparams()[source]¶
Returns a dictionary of default hyperparameters.
{
    # (1) Hyperparameters specific to the record data
    "dataset": {
        "files": [],
        "feature_types": {},
        "feature_convert_types": {},
        "image_options": {},
        "num_shards": None,
        "shard_id": None,
        "other_transformations": [],
        "data_name": None,
    },
    # (2) General hyperparameters
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "tfrecord_data",
}
Here:
For the hyperparameters in the "dataset" field:

- “files”: str or list
A (list of) pickled file path(s).
- “feature_types”: dict
The feature names (str) with their descriptions in the form of feature_name: [dtype, feature_collate_method, shape]:

dtype is a Python type (int, str), a dtype instance from PyTorch (torch.float), NumPy (np.int64), or TensorFlow (tf.string), or their stringified names such as "torch.float" and "np.int64". The feature will be read from the files and parsed into this dtype.

feature_collate_method is of type str, and describes how features are collated in the batch. Available values are:

- "stacked_tensor": Features are assumed to be tensors of a fixed shape (or scalars). When collating, features are stacked, with the batch dimension being the first dimension. This is the default value if feature_collate_method is not specified. For example:
  - 5 scalar features -> a tensor of shape [5].
  - 4 tensor features, each of shape [6, 5] -> a tensor of shape [4, 6, 5].
- "padded_tensor": Features are assumed to be tensors, with all dimensions except the first having the same size. When collating, features are padded with zero values along the end of the first dimension so that every tensor has the same size, and then stacked, with the batch dimension being the first dimension. For example:
  - 3 tensor features, with shapes [4, 7, 8], [5, 7, 8], and [4, 7, 8] -> a tensor of shape [3, 5, 7, 8].
- "list": Features can be any objects. When collating, the features are stored in a Python list.

shape is optional, and can be of type int, tuple, or torch.Size. If specified, shapes of tensor features will be checked depending on the feature_collate_method:

- "stacked_tensor": The shape of every feature tensor must be shape.
- "padded_tensor": The shape (excluding the first dimension) of every feature tensor must be shape.
- "list": shape is ignored.
Note
Shape check is performed before any transformations are applied.
Example:
feature_types = {
    "input_ids": ["int64", "stacked_tensor", 128],
    "label_ids": ["int64", "stacked_tensor"],
    "name_lists": ["string", "list"],
}
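To illustrate the collate methods, a hedged sketch (the file name var_len.pkl is hypothetical) that writes variable-length "padded_tensor" features and reads them back; within a batch, examples are zero-padded to the longest first dimension:

import numpy as np
import texar.torch as tx

# "padded_tensor": the first dimension may vary across examples.
feature_types = {"token_ids": ["int64", "padded_tensor"]}
with tx.data.RecordData.writer("var_len.pkl", feature_types) as writer:
    for length in (3, 5, 4):
        writer.write({"token_ids": np.arange(length)})

data = tx.data.RecordData({
    'dataset': {'files': 'var_len.pkl', 'feature_types': feature_types},
    'batch_size': 3,
    'shuffle': False,
})
batch = next(iter(tx.data.DataIterator(data)))
# The collated "token_ids" feature is a tensor of shape [3, 5]: the
# length-3 and length-4 examples are zero-padded to length 5.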
Note
This field is named “feature_original_types” in Texar-TF. This name is still supported, but is deprecated in favor of “feature_types”.
Texar-TF also uses different names for feature types:
"FixedLenFeature"
corresponds to"stacked_tensor"
."FixedLenSequenceFeature"
corresponds to"padded_tensor"
."VarLenFeature"
corresponds to"list"
.
These names are also accepted in Texar-PyTorch, but are deprecated in favor of the new names.
- “feature_convert_types”: dict, optional
Specifies dtype conversion after reading the data files. This dict maps feature names to desired dtypes. For example, you can first read a feature into dtype torch.int32 by specifying it in "feature_types" above, and then convert the feature to dtype "torch.long" by specifying it here. Features not specified here will not be converted.

dtype is a Python type (int, str), a dtype instance from PyTorch (torch.float), NumPy (np.int64), or TensorFlow (tf.string), or their stringified names such as "torch.float" and "np.int64".

Note that this conversion happens after all the data are restored.
Example:
feature_convert_types = {
    "input_ids": "int32",
    "label_ids": "int32",
}
- “image_options”: dict, optional
Specifies the image feature name and image resizing options. It includes three fields:
- “image_feature_name”: str
The name of the feature which contains the image data. If set, the image data will be restored as a numpy.ndarray.
- “resize_height”: int
The height of the image after resizing.
- “resize_width”: int
The width of the image after resizing.
If either "resize_height" or "resize_width" is not set, the image data will be restored with its original shape.
- “num_shards”: int, optional
The number of data shards in distributed mode. Usually set to the number of processes in distributed computing. Used in combination with "shard_id".

Warning
Sharding is not yet supported. This option (and related ones below) will be ignored.
- “shard_id”: int, optional
Sets the unique id to identify a shard. The module will process only the corresponding shard of the whole data. Used in combination with "num_shards".

For example, in a case of distributed computing on 2 GPUs, the hyperparameters of the data module for the two processes can be configured as below, respectively.
For GPU 0:
dataset: {
    ...
    "num_shards": 2,
    "shard_id": 0
}
For GPU 1:
dataset: {
    ...
    "num_shards": 2,
    "shard_id": 1
}
Also refer to examples/bert for a use case.
- “other_transformations”: list
A list of transformation functions or function names/paths to further transform each single data instance.
- “data_name”: str
Name of the dataset.
For the general hyperparameters, see
texar.torch.data.DatasetBase.default_hparams()
for details.
- list_items()[source]¶
Returns the list of item names that the data can produce.
- Returns
A list of strings.
- property feature_names¶
A list of feature names.
Data Iterators¶
Batch¶
- class texar.torch.data.Batch(batch_size, batch=None, **kwargs)[source]¶
Wrapper over Python dictionaries representing a batch. It provides a dictionary-like interface to access its fields. This class can be used in the following way:
hparams = {
    'dataset': {
        'files': 'data.txt',
        'vocab_file': 'vocab.txt',
    },
    'batch_size': 1,
}
data = MonoTextData(hparams)
iterator = DataIterator(data)
for batch in iterator:
    # batch is a Batch object and contains the following fields
    # batch == {
    #     'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
    #     'text_ids': [[1, 5, 10, 2]],
    #     'length': [4]
    # }
    input_ids = torch.tensor(batch['text_ids'])
    # we can also access the elements using dot notation
    input_text = batch.text
DataIterator¶
- class texar.torch.data.DataIterator(datasets, batching_strategy=None, pin_memory=None)[source]¶
Data iterator that switches and iterates through multiple datasets.
This is a wrapper of SingleDatasetIterator.

- Parameters
datasets –
Datasets to iterate through. This can be:
- A single instance of DatasetBase.
- A dict that maps dataset names to instances of DatasetBase.
- A list of instances of texar.torch.data.DatasetBase. The names of the instances (texar.torch.data.DatasetBase.name) must be unique.
batching_strategy – The batching strategy to use when performing dynamic batching. If None, fixed-sized batching is used.
pin_memory –
If True, tensors will be moved onto page-locked memory before returning. This argument is passed into the constructor for DataLoader.
Defaults to None, which will set the value to True if the
DatasetBase
instance is set to use a CUDA device. Set to True or False to override this behavior.
Example
Create an iterator over two datasets, generating fixed-sized batches:
train_data = MonoTextData(hparams_train)
test_data = MonoTextData(hparams_test)
iterator = DataIterator({'train': train_data, 'test': test_data})

for epoch in range(200):  # Run 200 epochs of train/test
    # Starts iterating through training data from the beginning.
    iterator.switch_to_dataset('train')
    for batch in iterator:
        ...  # Do training with the batch.

    # Starts iterating through test data from the beginning.
    for batch in iterator.get_iterator('test'):
        ...  # Do testing with the batch.
Dynamic batching based on total number of tokens:
iterator = DataIterator(
    {'train': train_data, 'test': test_data},
    batching_strategy=TokenCountBatchingStrategy(max_tokens=1000))
Dynamic batching with a custom strategy (e.g., total number of tokens in examples from PairedTextData, including padding):

class CustomBatchingStrategy(BatchingStrategy):
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.reset_batch()

    def reset_batch(self) -> None:
        self.max_src_len = 0
        self.max_tgt_len = 0
        self.cur_batch_size = 0

    def add_example(self, ex: Tuple[List[str], List[str]]) -> bool:
        max_src_len = max(self.max_src_len, len(ex[0]))
        max_tgt_len = max(self.max_tgt_len, len(ex[1]))
        if ((max_src_len + max_tgt_len)
                * (self.cur_batch_size + 1) > self.max_tokens):
            return False
        self.max_src_len = max_src_len
        self.max_tgt_len = max_tgt_len
        self.cur_batch_size += 1
        return True

iterator = DataIterator(
    {'train': train_data, 'test': test_data},
    batching_strategy=CustomBatchingStrategy(max_tokens=1000))
- property num_datasets¶
Number of datasets.
- property dataset_names¶
A list of dataset names.
TrainTestDataIterator¶
- class texar.torch.data.TrainTestDataIterator(train=None, val=None, test=None, batching_strategy=None, pin_memory=None)[source]¶
Data iterator that alternates between training, validation, and test datasets.
train, val, and test are instances of DatasetBase. At least one of them must be provided.

This is a wrapper of DataIterator.

- Parameters
train (optional) – Training data.
val (optional) – Validation data.
test (optional) – Test data.
batching_strategy – The batching strategy to use when performing dynamic batching. If None, fixed-sized batching is used.
pin_memory –
If True, tensors will be moved onto page-locked memory before returning. This argument is passed into the constructor for DataLoader.
Defaults to None, which will set the value to True if the
DatasetBase
instance is set to use a CUDA device. Set to True or False to override this behavior.
Example
train_data = MonoTextData(hparams_train)
val_data = MonoTextData(hparams_val)
iterator = TrainTestDataIterator(train=train_data, val=val_data)

for epoch in range(200):  # Run 200 epochs of train/val
    # Starts iterating through training data from the beginning.
    iterator.switch_to_train_data()
    for batch in iterator:
        ...  # Do training with the batch.

    # Starts iterating through val data from the beginning.
    for batch in iterator.get_val_iterator():
        ...  # Do validation on the batch.
BatchingStrategy¶
- class texar.torch.data.BatchingStrategy(*args, **kwds)[source]¶
Decides batch boundaries in dynamic batching. Please refer to TokenCountBatchingStrategy for a concrete example.

- reset_batch()[source]¶
Reset the internal state of the batching strategy. This method is called at the start of iteration, and after each batch is yielded.
- add_example(example)[source]¶
Add an example into the current batch, and modify internal states accordingly. If the example should not be added to the batch, this method does not modify the internal state, and returns False.
- Parameters
example – The example to add to the batch.
- Returns
A boolean value indicating whether
example
should be added to the batch.
TokenCountBatchingStrategy¶
- class texar.torch.data.TokenCountBatchingStrategy(max_tokens, max_batch_size=None, length_fn=None)[source]¶
Create dynamically-sized batches so that the total number of tokens inside each batch is constrained.
- Parameters
max_tokens (int) – The maximum number of tokens inside each batch.
max_batch_size (int, optional) – The maximum number of examples in each batch. If None, batches can contain an arbitrary number of examples as long as the total number of tokens does not exceed max_tokens.

length_fn (callable, optional) – A function taking a data example as argument and returning the number of tokens in the example. By default, len is used, which is the desired behavior if the dataset in question is a MonoTextData.
Data Utilities¶
maybe_download¶
read_words¶
make_vocab¶
- texar.torch.data.make_vocab(filenames, max_vocab_size=-1, newline_token=None, return_type='list', return_count=False)[source]¶
Builds the vocabulary from the given files.
- Parameters
filenames (str) – A (list of) file path(s).

max_vocab_size (int) – Maximum size of the vocabulary. Low-frequency words exceeding the limit will be discarded. Set to -1 (default) if no truncation is wanted.
newline_token (str, optional) – The token to replace the original newline token “\n”. For example, tx.data.SpecialTokens.EOS. If None, no replacement is performed.

return_type (str) – Either list or dict. If list (default), this function returns a list of words sorted by frequency. If dict, this function returns a dict mapping words to their index sorted by frequency.

return_count (bool) – Whether to return word counts. If True and return_type is dict, then a count dict is returned, which is a mapping from words to their frequency.
- Returns
If return_count is False, returns a list or dict containing the vocabulary words.

If return_count is True, returns a pair of lists or dicts (a, b), where a is a list or dict containing the vocabulary words, and b is a list or dict containing the word counts.