Modules¶
ModuleBase¶
- class texar.torch.ModuleBase(hparams=None)[source]¶
Base class inherited by modules that are configurable through hyperparameters.
This is a subclass of torch.nn.Module.
A Texar module inheriting ModuleBase is configurable through hyperparameters. That is, each module defines allowed hyperparameters and default values. Hyperparameters not specified by users take default values.
- Parameters
hparams (dict, optional) – Hyperparameters of the module. See default_hparams() for the structure and default values.
- static default_hparams()[source]¶
Returns a dict of hyperparameters of the module with default values. Used to replace the missing values of input hparams during module construction.
{ "name": "module" }
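The way missing hyperparameters fall back to defaults can be pictured as a recursive dictionary merge. The following is only a minimal sketch: merge_hparams is a hypothetical helper, not part of Texar, and the "initializer" entry in the sample defaults is made up for illustration. The library's own handling may also validate names and types.

```python
import copy


def merge_hparams(user_hparams, defaults):
    """Return a copy of `defaults` with values from `user_hparams` applied.

    Hypothetical helper for illustration; not the Texar implementation.
    """
    merged = copy.deepcopy(defaults)
    for key, value in (user_hparams or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts are merged key by key, so unspecified
            # sub-keys keep their default values.
            merged[key] = merge_hparams(value, merged[key])
        else:
            merged[key] = value
    return merged


# Made-up defaults, only to show the nested case.
defaults = {"name": "module", "initializer": {"type": "uniform", "seed": None}}
merged = merge_hparams({"initializer": {"seed": 1}}, defaults)
print(merged)
```

Only "seed" is overridden here; "name" and "type" keep their defaults, and the defaults dict itself is left unmodified.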
- property trainable_variables¶
The list of trainable variables (parameters) of the module. Parameters of this module and all its submodules are included.
Note
The list returned may contain duplicate parameters (e.g. output layer shares parameters with embeddings). For most usages, it’s not necessary to ensure uniqueness.
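If uniqueness does matter (for example, when counting parameters), duplicates can be removed by object identity. The sketch below uses plain Python objects as stand-ins for parameters; unique_by_identity is a hypothetical helper, not a Texar function.

```python
def unique_by_identity(params):
    """Keep the first occurrence of each object, comparing by identity."""
    seen = set()
    unique = []
    for p in params:
        if id(p) not in seen:
            seen.add(id(p))
            unique.append(p)
    return unique


# Stand-ins for parameters; `shared` plays the role of a tied weight.
shared = object()
other = object()
params = [shared, other, shared]
unique = unique_by_identity(params)
print(len(params), len(unique))
```

Identity is used rather than equality because comparing tensors with == is element-wise and does not tell whether two entries are the same parameter.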
- property output_size¶
The feature size of the forward() output tensor(s). It usually equals the size of the last dimension of the output tensor.
Embedders¶
WordEmbedder¶
- class texar.torch.modules.WordEmbedder(init_value=None, vocab_size=None, hparams=None)[source]¶
Simple word embedder that maps indexes into embeddings. The indexes can be soft (e.g., distributions over vocabulary).
Either init_value or vocab_size is required. If both are given, init_value.shape[0] must equal vocab_size.
- Parameters
init_value (optional) – A Tensor or numpy array that contains the initial value of embeddings. It is typically of shape [vocab_size] + embedding-dim. Embeddings can have dimensionality > 1. If None, embedding is initialized as specified in hparams["initializer"]. Otherwise, the "initializer" and "dim" hyperparameters in hparams are ignored.
vocab_size (int, optional) – The vocabulary size. Required if init_value is not given.
hparams (dict, optional) – Embedder hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs of the embedder.
Example:
    ids = torch.empty([32, 10]).uniform_(to=10).type(torch.int64)
    soft_ids = torch.empty([32, 10, 100]).uniform_()

    embedder = WordEmbedder(vocab_size=100, hparams={'dim': 256})
    ids_emb = embedder(ids=ids)                 # shape: [32, 10, 256]
    soft_ids_emb = embedder(soft_ids=soft_ids)  # shape: [32, 10, 256]

    # Use with Texar data module
    hparams = {
        'dataset': {
            'embedding_init': {'file': 'word2vec.txt'}
            ...
        },
    }

    data = MonoTextData(hparams)
    iterator = DataIterator(data)
    batch = next(iter(iterator))

    # Use data vocab size
    embedder_1 = WordEmbedder(vocab_size=data.vocab.size)
    emb_1 = embedder_1(batch['text_ids'])

    # Use pre-trained embedding
    embedder_2 = WordEmbedder(init_value=data.embedding_init_value)
    emb_2 = embedder_2(batch['text_ids'])
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        "dim": 100,
        "dropout_rate": 0,
        "dropout_strategy": 'element',
        "initializer": {
            "type": "random_uniform_initializer",
            "kwargs": {
                "minval": -0.1,
                "maxval": 0.1,
                "seed": None
            }
        },
        "trainable": True,
        "name": "word_embedder",
    }
Here:
- “dim”: int or list
Embedding dimension. Can be a list of integers to yield embeddings with dimensionality > 1.
Ignored if init_value is given to the embedder constructor.
- “dropout_rate”: float
The dropout rate between 0 and 1. For example, dropout_rate=0.1 would zero out 10% of the embeddings. Set to 0 to disable dropout.
- “dropout_strategy”: str
The dropout strategy. Can be one of the following:
"element": The regular strategy that drops individual elements in the embedding vectors.
"item": Drops individual items (e.g., words) entirely. For example, for the word sequence “the simpler the better”, the strategy can yield “_ simpler the better”, where the first “the” is dropped.
"item_type": Drops item types (e.g., word types). For example, for the above sequence, the strategy can yield “_ simpler _ better”, where the word type “the” is dropped. The dropout will never yield “_ simpler the better” as in the "item" strategy.
- “initializer”: dict or None
Hyperparameters of the initializer for embedding values. See get_initializer() for the details. Ignored if init_value is given to the embedder constructor.
- “trainable”: bool
Whether the embedding parameters are trainable. If False, the embedding parameters are frozen.
- “name”: str
Name of the embedding variable.
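The difference between the "item" and "item_type" dropout strategies can be illustrated on the word sequence used above. This is a simplified sketch on token strings rather than embedding tensors. The function names are hypothetical, and any rescaling of kept values that a real dropout layer applies is omitted.

```python
import random


def item_dropout(tokens, rate, rng):
    """"item" strategy: each occurrence is dropped independently."""
    return ["_" if rng.random() < rate else tok for tok in tokens]


def item_type_dropout(tokens, rate, rng):
    """"item_type" strategy: a dropped word type loses all its occurrences."""
    dropped = {tok for tok in sorted(set(tokens)) if rng.random() < rate}
    return ["_" if tok in dropped else tok for tok in tokens]


tokens = "the simpler the better".split()
rng = random.Random(0)
item_samples = [item_dropout(tokens, 0.5, rng) for _ in range(200)]
type_samples = [item_type_dropout(tokens, 0.5, rng) for _ in range(200)]

# Under "item_type", both occurrences of "the" always share the same fate.
print(all(s[0] == s[2] for s in type_samples))
```

Under "item", the two occurrences of "the" are sampled separately, so a result such as “_ simpler the better” is possible. Under "item_type" it is not.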
- extra_repr()[source]¶
Sets the extra representation of the module.
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(ids=None, soft_ids=None, **kwargs)[source]¶
Embeds (soft) ids.
Either ids or soft_ids must be given, and they must not be given at the same time.
- Parameters
ids (optional) – An integer tensor containing the ids to embed.
soft_ids (optional) – A tensor of weights (probabilities) used to mix the embedding vectors.
kwargs – Additional keyword arguments for torch.nn.functional.embedding besides input and weight.
- Returns
If ids is given, returns a Tensor of shape list(ids.shape) + embedding-dim. For example, if list(ids.shape) == [batch_size, max_time] and list(embedding.shape) == [vocab_size, emb_dim], then the return tensor has shape [batch_size, max_time, emb_dim].
If soft_ids is given, returns a Tensor of shape list(soft_ids.shape)[:-1] + embedding-dim. For example, if list(soft_ids.shape) == [batch_size, max_time, vocab_size] and list(embedding.shape) == [vocab_size, emb_dim], then the return tensor has shape [batch_size, max_time, emb_dim].
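One way to read the two input modes is that hard ids select rows of the embedding table, while soft ids take a weighted mix of the rows. The sketch below uses nested Python lists in place of tensors. It assumes the mix is a plain weighted sum over the vocabulary axis, and both function names are hypothetical.

```python
def embed_ids(ids, embedding):
    """Look up one embedding row per integer id."""
    return [embedding[i] for i in ids]


def embed_soft_ids(soft_ids, embedding):
    """Mix embedding rows with per-token weights over the vocabulary."""
    emb_dim = len(embedding[0])
    return [
        [sum(w * row[d] for w, row in zip(weights, embedding))
         for d in range(emb_dim)]
        for weights in soft_ids
    ]


embedding = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # [vocab_size=3, emb_dim=2]

hard = embed_ids([2, 0], embedding)
one_hot = embed_soft_ids([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]], embedding)
mixed = embed_soft_ids([[0.5, 0.5, 0.0]], embedding)
print(hard, one_hot, mixed)
```

With one-hot weights the soft lookup reproduces the hard lookup, which is why both modes return tensors with the same trailing embedding dimension.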
- property embedding¶
The embedding tensor, of shape [vocab_size] + dim.
- property dim¶
The embedding dimension.
- property vocab_size¶
The vocabulary size.
- property num_embeddings¶
The vocabulary size. This interface matches torch.nn.Embedding.
PositionEmbedder¶
- class texar.torch.modules.PositionEmbedder(position_size=None, init_value=None, hparams=None)[source]¶
Simple position embedder that maps position indexes into embeddings via lookup.
Either init_value or position_size is required. If both are given, init_value.shape[0] must equal position_size.
- Parameters
init_value (optional) – A Tensor or numpy array that contains the initial value of embeddings. It is typically of shape [position_size, embedding dim]. If None, embedding is initialized as specified in hparams["initializer"]. Otherwise, the "initializer" and "dim" hyperparameters in hparams are ignored.
position_size (int, optional) – The number of possible positions, e.g., the maximum sequence length. Required if init_value is not given.
hparams (dict, optional) – Embedder hyperparameters. If it is not specified, the default hyperparameter setting is used. See default_hparams() for the structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        "dim": 100,
        "initializer": {
            "type": "random_uniform_initializer",
            "kwargs": {
                "minval": -0.1,
                "maxval": 0.1,
                "seed": None
            }
        },
        "dropout_rate": 0,
        "dropout_strategy": 'element',
        "trainable": True,
        "name": "position_embedder"
    }
The hyperparameters have the same meaning as those in texar.torch.modules.WordEmbedder.default_hparams().
- extra_repr()[source]¶
Sets the extra representation of the module.
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(positions=None, sequence_length=None, **kwargs)[source]¶
Embeds the positions.
Either positions or sequence_length is required:
If both are given, sequence_length is used to mask out embeddings of those time steps beyond the respective sequence lengths.
If only sequence_length is given, then positions from 0 to sequence_length - 1 are embedded.
- Parameters
positions (optional) – A torch.LongTensor containing the position IDs to embed.
sequence_length (optional) – A torch.LongTensor of shape [batch_size]. Time steps beyond the respective sequence lengths will have zero-valued embeddings.
kwargs – Additional keyword arguments for torch.nn.functional.embedding besides input and weight.
- Returns
A Tensor of shape shape(inputs) + embedding dimension.
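The case where only sequence_length is given can be sketched as building position ids 0 to max length - 1 for every batch element, plus a mask that zeroes the steps past each sequence's length. This is an illustration with plain lists. positions_and_mask is a hypothetical helper, and padding to the longest sequence in the batch is an assumption.

```python
def positions_and_mask(sequence_length):
    """Build position ids and a 0/1 mask from per-example lengths."""
    max_len = max(sequence_length)
    positions = [list(range(max_len)) for _ in sequence_length]
    mask = [
        [1.0 if t < n else 0.0 for t in range(max_len)]
        for n in sequence_length
    ]
    return positions, mask


positions, mask = positions_and_mask([3, 1])
print(positions)
print(mask)
```

Multiplying each looked-up embedding by its mask entry would give the zero-valued embeddings described for time steps beyond the sequence length.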
- property embedding¶
The embedding tensor.
- property dim¶
The embedding dimension.
- property position_size¶
The position size, i.e., maximum number of positions.
SinusoidsPositionEmbedder¶
- class texar.torch.modules.SinusoidsPositionEmbedder(position_size=None, hparams=None)[source]¶
Sinusoid position embedder that maps position indexes into embeddings via sinusoid calculation. This module does not have trainable parameters. Used in, e.g., Transformer models (Vaswani et al.) “Attention Is All You Need”.
Each channel of the input Tensor is incremented by a sinusoid of a different frequency and phase. This allows attention to learn to use absolute and relative positions.
Timing signals should be added to some precursors of both the query and the memory inputs to attention. The use of relative position is possible because sin(x+y) and cos(x+y) can be expressed in terms of y, sin(x), and cos(x). In particular, we use a geometric sequence of timescales starting with min_timescale and ending with max_timescale. The number of different timescales is equal to dim / 2. For each timescale, we generate the two sinusoidal signals sin(timestep/timescale) and cos(timestep/timescale). All of these sinusoids are concatenated in the dim dimension.
- Parameters
position_size (int) – The number of possible positions, e.g., the maximum sequence length. Set position_size=None and hparams['cache_embeddings']=False to use arbitrarily large or negative position indices.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values. We use a geometric sequence of timescales starting with min_timescale and ending with max_timescale. The number of different timescales is equal to dim / 2.
    {
        'min_timescale': 1.0,
        'max_timescale': 10000.0,
        'dim': 512,
        'cache_embeddings': True,
        'name': 'sinusoid_position_embedder',
    }
Here:
- “cache_embeddings”: bool
If True, precompute embeddings for positions in range [0, position_size - 1]. This leads to faster lookup but requires lookup indices to be within this range.
If False, embeddings are computed on-the-fly during lookup. Set to False if your application needs to handle sequences of arbitrary length, or requires embeddings at negative positions.
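The geometric timescale construction described for this embedder can be sketched for a single position as follows. This is a minimal pure-Python illustration, not the module's code: sinusoid_embedding is a hypothetical name, dim is assumed to be even, and placing all sines before all cosines is one possible concatenation order.

```python
import math


def sinusoid_embedding(position, dim=512, min_timescale=1.0,
                       max_timescale=1.0e4):
    """Sinusoid embedding of one position (illustrative sketch)."""
    num_timescales = dim // 2
    # Geometric sequence running from min_timescale to max_timescale.
    log_increment = (math.log(max_timescale / min_timescale)
                     / max(num_timescales - 1, 1))
    timescales = [min_timescale * math.exp(i * log_increment)
                  for i in range(num_timescales)]
    sines = [math.sin(position / t) for t in timescales]
    cosines = [math.cos(position / t) for t in timescales]
    return sines + cosines


vec = sinusoid_embedding(position=3, dim=8)
print(len(vec))
```

Because the value is computed directly from the position, nothing restricts it to the range [0, position_size - 1]. That is the behaviour "cache_embeddings": False allows, including negative positions.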
- extra_repr()[source]¶
Sets the extra representation of the module.
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(positions=None, sequence_length=None, **kwargs)[source]¶
Embeds the positions. Either positions or sequence_length is required:
If both are given, sequence_length is used to mask out embeddings of those time steps beyond the respective sequence lengths.
If only sequence_length is given, then positions from 0 to sequence_length - 1 are embedded.
- Parameters
positions (optional) – A torch.LongTensor containing the position IDs to embed.
sequence_length (optional) – A torch.LongTensor of shape [batch_size]. Time steps beyond the respective sequence lengths will have zero-valued embeddings.
- Returns
A Tensor of shape [batch_size, position_size, dim].
- property dim¶
The embedding dimension.
EmbedderBase¶
- class texar.torch.modules.EmbedderBase(num_embeds=None, init_value=None, hparams=None)[source]¶
The base embedder class that all embedder classes inherit.
- Parameters
num_embeds (int, optional) – The number of embedding elements, e.g., the vocabulary size of a word embedder.
init_value (Tensor or numpy array, optional) – Initial values of the embedding variable. If not given, embedding is initialized as specified in hparams["initializer"].
hparams (dict or HParams, optional) – Embedder hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "name": "embedder" }
- property num_embeds¶
The number of embedding elements.
Encoders¶
UnidirectionalRNNEncoder¶
- class texar.torch.modules.UnidirectionalRNNEncoder(input_size, cell=None, output_layer=None, hparams=None)[source]¶
Unidirectional RNN encoder.
- Parameters
input_size (int) – The number of expected features in the input for the cell.
cell (RNNCell, optional) – If not specified, a cell is created as specified in hparams["rnn_cell"].
output_layer (optional) – An instance of torch.nn.Module. Applies to the RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer"].
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs of the encoder.
Example:
    # Use with embedder
    embedder = WordEmbedder(vocab_size=vocab_size, hparams=emb_hparams)
    # input_size is required; here it is the embedding dimension
    encoder = UnidirectionalRNNEncoder(
        input_size=embedder.dim, hparams=enc_hparams)

    outputs, final_state = encoder(
        inputs=embedder(data_batch['text_ids']),
        sequence_length=data_batch['length'])
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        "rnn_cell": default_rnn_cell_hparams(),
        "output_layer": {
            "num_layers": 0,
            "layer_size": 128,
            "activation": "identity",
            "final_layer_activation": None,
            "other_dense_kwargs": None,
            "dropout_layer_ids": [],
            "dropout_rate": 0.5,
            "variational_dropout": False
        },
        "name": "unidirectional_rnn_encoder"
    }
Here:
- “rnn_cell”: dict
A dictionary of RNN cell hyperparameters. Ignored if cell is given to the encoder constructor. The default value is defined in default_rnn_cell_hparams().
- “output_layer”: dict
Output layer hyperparameters. Ignored if output_layer is given to the encoder constructor. Includes:
- “num_layers”: int
The number of output (dense) layers. Set to 0 to avoid any output layers applied to the cell outputs.
- “layer_size”: int or list
The size of each of the output (dense) layers.
If an int, each output layer will have the same size. If a list, its length must equal num_layers.
- “activation”: str or callable or None
Activation function for each of the output (dense) layers except for the final layer. This can be a function, or its string name or module path. If a function name is given, the function must be from torch.nn. For example:
    "activation": "relu"                        # function name
    "activation": "my_module.my_activation_fn"  # module path
    "activation": my_module.my_activation_fn    # function
The default is "identity", as shown above. None also results in an identity activation.
- “final_layer_activation”: str or callable or None
The activation function for the final output layer.
- “other_dense_kwargs”: dict or None
Other keyword arguments to construct each of the output dense layers, e.g., bias. See torch.nn.Linear for the keyword arguments.
- “dropout_layer_ids”: int or list
The indexes of layers (starting from 0) whose inputs are applied with dropout. The index = num_layers means dropout applies to the final layer output. For example,
    { "num_layers": 2, "dropout_layer_ids": [0, 2] }
leads to a series of layers as -dropout-layer0-layer1-dropout-.
The dropout mode (training or not) is controlled by self.training.
- “dropout_rate”: float
The dropout rate, between 0 and 1. For example, "dropout_rate": 0.1 would zero out 10% of elements.
- “variational_dropout”: bool
Whether the dropout mask is the same across all time steps.
- “name”: str
Name of the encoder
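The placement rule for "dropout_layer_ids" described above can be made concrete with a small sketch that lists the resulting layer order. layer_sequence is a hypothetical helper that only produces layer names; it does not build any modules.

```python
def layer_sequence(num_layers, dropout_layer_ids):
    """List the order of dropout and dense layers in the output layer."""
    if isinstance(dropout_layer_ids, int):
        dropout_layer_ids = [dropout_layer_ids]
    sequence = []
    for i in range(num_layers):
        # Index i means dropout is applied to the input of layer i.
        if i in dropout_layer_ids:
            sequence.append("dropout")
        sequence.append(f"layer{i}")
    # Index == num_layers means dropout on the final layer output.
    if num_layers in dropout_layer_ids:
        sequence.append("dropout")
    return sequence


print("-".join(layer_sequence(2, [0, 2])))  # dropout-layer0-layer1-dropout
```

The printed order matches the -dropout-layer0-layer1-dropout- example given for "num_layers": 2 and "dropout_layer_ids": [0, 2].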
- forward(inputs, sequence_length=None, initial_state=None, time_major=False, return_cell_output=False, return_output_size=False)[source]¶
Encodes the inputs.
- Parameters
inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.
sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero-out outputs when past a batch element’s sequence length.
initial_state (optional) – Initial state of the RNN.
time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].
return_cell_output (bool) – Whether to return the output of the RNN cell. This is the result prior to the output layer.
return_output_size (bool) – Whether to return the size of the output (i.e., the results after output layers).
- Returns
By default (both return_cell_output and return_output_size are False), returns a pair (outputs, final_state), where
outputs: The RNN output tensor by the output layer (if exists) or the RNN cell (otherwise). The tensor is of shape [batch_size, max_time, output_size] if time_major is False, or [max_time, batch_size, output_size] if time_major is True. If RNN cell output is a (nested) tuple of Tensors, then the outputs will be a (nested) tuple having the same nest structure as the cell output.
final_state: The final state of the RNN, which is a Tensor of shape [batch_size] + cell.state_size or a (nested) tuple of Tensors if cell.state_size is a (nested) tuple.
If return_cell_output is True, returns a triple (outputs, final_state, cell_outputs), where
cell_outputs: The outputs by the RNN cell prior to the output layer, having the same structure as outputs except for the output_dim.
If return_output_size is True, returns a tuple (outputs, final_state, output_size), where
output_size: A (possibly nested tuple of) int representing the size of outputs. If a single int or an int array, then outputs has shape [batch/time, time/batch] + output_size. If a (nested) tuple, then output_size has the same structure as outputs.
If both return_cell_output and return_output_size are True, returns (outputs, final_state, cell_outputs, output_size).
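Because the length of the returned tuple depends on two flags, callers need to unpack it accordingly. The sketch below mirrors the four documented cases with placeholder strings in place of tensors; assemble_returns is a hypothetical helper, not the encoder's code.

```python
def assemble_returns(outputs, final_state, cell_outputs, output_size,
                     return_cell_output=False, return_output_size=False):
    """Build the return tuple in the documented order."""
    result = (outputs, final_state)
    if return_cell_output:
        result += (cell_outputs,)
    if return_output_size:
        result += (output_size,)
    return result


values = ("outputs", "final_state", "cell_outputs", "output_size")
print(assemble_returns(*values))
print(assemble_returns(*values, return_cell_output=True))
print(assemble_returns(*values, return_output_size=True))
print(assemble_returns(*values, return_cell_output=True,
                       return_output_size=True))
```

Note that the third element is cell_outputs in one case and output_size in another, so the position alone does not identify it.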
- property cell¶
The RNN cell.
- property state_size¶
The state size of the encoder cell. Same as encoder.cell.state_size.
- property output_layer¶
The output layer.
BidirectionalRNNEncoder¶
- class texar.torch.modules.BidirectionalRNNEncoder(input_size, cell_fw=None, cell_bw=None, output_layer_fw=None, output_layer_bw=None, hparams=None)[source]¶
Bidirectional forward-backward RNN encoder.
- Parameters
cell_fw (RNNCell, optional) – The forward RNN cell. If not given, a cell is created as specified in hparams["rnn_cell_fw"].
cell_bw (RNNCell, optional) – The backward RNN cell. If not given, a cell is created as specified in hparams["rnn_cell_bw"].
output_layer_fw (optional) – An instance of torch.nn.Module. Applies to the forward RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer_fw"].
output_layer_bw (optional) – An instance of torch.nn.Module. Applies to the backward RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer_bw"].
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs of the encoder.
Example:
    # Use with embedder
    embedder = WordEmbedder(vocab_size=vocab_size, hparams=emb_hparams)
    # input_size is required; here it is the embedding dimension
    encoder = BidirectionalRNNEncoder(
        input_size=embedder.dim, hparams=enc_hparams)

    outputs, final_state = encoder(
        inputs=embedder(data_batch['text_ids']),
        sequence_length=data_batch['length'])
    # outputs == (outputs_fw, outputs_bw)
    # final_state == (final_state_fw, final_state_bw)
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        "rnn_cell_fw": default_rnn_cell_hparams(),
        "rnn_cell_bw": default_rnn_cell_hparams(),
        "rnn_cell_share_config": True,
        "output_layer_fw": {
            "num_layers": 0,
            "layer_size": 128,
            "activation": "identity",
            "final_layer_activation": None,
            "other_dense_kwargs": None,
            "dropout_layer_ids": [],
            "dropout_rate": 0.5,
            "variational_dropout": False
        },
        "output_layer_bw": {
            # Same hyperparams and default values as "output_layer_fw"
            # ...
        },
        "output_layer_share_config": True,
        "name": "bidirectional_rnn_encoder"
    }
Here:
- “rnn_cell_fw”: dict
Hyperparameters of the forward RNN cell. Ignored if cell_fw is given to the encoder constructor. The default value is defined in default_rnn_cell_hparams().
- “rnn_cell_bw”: dict
Hyperparameters of the backward RNN cell. Ignored if cell_bw is given to the encoder constructor, or if “rnn_cell_share_config” is True. The default value is defined in default_rnn_cell_hparams().
- “rnn_cell_share_config”: bool
Whether to share hyperparameters of the backward cell with the forward cell. Note that the cell parameters (variables) are not shared.
- “output_layer_fw”: dict
Hyperparameters of the forward output layer. Ignored if output_layer_fw is given to the constructor. See the "output_layer" field of UnidirectionalRNNEncoder() for details.
- “output_layer_bw”: dict
Hyperparameters of the backward output layer. Ignored if output_layer_bw is given to the constructor, or if “output_layer_share_config” is True. Has the same structure and defaults as "output_layer_fw".
- “output_layer_share_config”: bool
Whether to share hyperparameters of the backward output layer with the forward output layer. Note that the layer parameters (variables) are not shared.
- “name”: str
Name of the encoder
- forward(inputs, sequence_length=None, initial_state_fw=None, initial_state_bw=None, time_major=False, return_cell_output=False, return_output_size=False)[source]¶
Encodes the inputs.
- Parameters
inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.
sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero-out outputs when past a batch element’s sequence length.
initial_state_fw (optional) – Initial state of the forward RNN.
initial_state_bw (optional) – Initial state of the backward RNN.
time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].
return_cell_output (bool) – Whether to return the output of the RNN cell. This is the result prior to the output layer.
return_output_size (bool) – Whether to return the size of the output (i.e., the results after the output layers).
- Returns
By default (both return_cell_output and return_output_size are False), returns a pair (outputs, final_state), where
outputs: A tuple (outputs_fw, outputs_bw) containing the forward and the backward RNN outputs, each of which is of shape [batch_size, max_time, output_dim] if time_major is False, or [max_time, batch_size, output_dim] if time_major is True. If RNN cell output is a (nested) tuple of Tensors, then outputs_fw and outputs_bw will be a (nested) tuple having the same structure as the cell output.
final_state: A tuple (final_state_fw, final_state_bw) containing the final states of the forward and backward RNNs, each of which is a Tensor of shape [batch_size] + cell.state_size, or a (nested) tuple of Tensors if cell.state_size is a (nested) tuple.
If return_cell_output is True, returns a triple (outputs, final_state, cell_outputs), where
cell_outputs: A tuple (cell_outputs_fw, cell_outputs_bw) containing the outputs by the forward and backward RNN cells prior to the output layers, having the same structure as outputs except for the output_dim.
If return_output_size is True, returns a tuple (outputs, final_state, output_size), where
output_size: A tuple (output_size_fw, output_size_bw) containing the size of outputs_fw and outputs_bw, respectively. Take *_fw for example: output_size_fw is a (possibly nested tuple of) int. If a single int or an int array, then outputs_fw has shape [batch/time, time/batch] + output_size_fw. If a (nested) tuple, then output_size_fw has the same structure as outputs_fw. The same applies to output_size_bw.
If both return_cell_output and return_output_size are True, returns (outputs, final_state, cell_outputs, output_size).
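Since outputs is a pair (outputs_fw, outputs_bw), downstream code often joins the two directions along the feature dimension. The source does not prescribe this step; the sketch below shows the idea on nested lists, with concat_last_dim as a hypothetical helper. With real tensors, torch.cat((outputs_fw, outputs_bw), dim=-1) would be the usual equivalent.

```python
def concat_last_dim(outputs_fw, outputs_bw):
    """Concatenate forward and backward features at every time step."""
    return [
        [step_fw + step_bw for step_fw, step_bw in zip(seq_fw, seq_bw)]
        for seq_fw, seq_bw in zip(outputs_fw, outputs_bw)
    ]


outputs_fw = [[[1.0, 2.0], [3.0, 4.0]]]  # [batch=1, time=2, dim_fw=2]
outputs_bw = [[[5.0], [6.0]]]            # [batch=1, time=2, dim_bw=1]
merged = concat_last_dim(outputs_fw, outputs_bw)
print(merged)
```

The merged feature size is output_size_fw + output_size_bw, which is why the two sizes are reported separately when return_output_size is True.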
- property cell_fw¶
The forward RNN cell.
- property cell_bw¶
The backward RNN cell.
- property state_size_fw¶
The state size of the forward encoder cell. Same as encoder.cell_fw.state_size.
- property state_size_bw¶
The state size of the backward encoder cell. Same as encoder.cell_bw.state_size.
- property output_layer_fw¶
The output layer of the forward RNN.
- property output_layer_bw¶
The output layer of the backward RNN.
MultiheadAttentionEncoder¶
- class texar.torch.modules.MultiheadAttentionEncoder(input_size, hparams=None)[source]¶
Multi-head Attention Encoder.
- Parameters
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        "initializer": None,
        'num_heads': 8,
        'output_dim': 512,
        'num_units': 512,
        'dropout_rate': 0.1,
        'use_bias': False,
        "name": "multihead_attention"
    }
Here:
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “num_heads”: int
Number of heads for attention calculation.
- “output_dim”: int
Output dimension of the returned tensor.
- “num_units”: int
Hidden dimension of the unsplit attention space. Should be divisible by “num_heads”.
- “dropout_rate”: float
Dropout rate in the attention.
- “use_bias”: bool
Use bias when projecting the key, value and query.
- “name”: str
Name of the module.
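The requirement that "num_units" be divisible by "num_heads" follows from splitting the attention space evenly across heads. Below is a small sketch of that split on a plain list. split_heads is a hypothetical helper, and contiguous slicing is an assumption about how the space is divided.

```python
def split_heads(vector, num_heads):
    """Split a num_units-long vector into num_heads equal slices."""
    num_units = len(vector)
    if num_units % num_heads != 0:
        raise ValueError("num_units must be divisible by num_heads")
    head_dim = num_units // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim]
            for h in range(num_heads)]


heads = split_heads(list(range(8)), num_heads=4)
print(heads)

# With the defaults above (num_units=512, num_heads=8), each head
# works on 512 // 8 = 64 dimensions.
default_heads = split_heads([0.0] * 512, num_heads=8)
print(len(default_heads), len(default_heads[0]))
```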
- forward(queries, memory, memory_attention_bias, cache=None)[source]¶
Encodes the inputs.
- Parameters
queries – A 3D tensor with shape of [batch, length_query, depth_query].
memory – A 3D tensor with shape of [batch, length_key, depth_key].
memory_attention_bias – A 3D tensor with shape of [batch, length_key, num_units].
cache – Memory cache only when inferring the sentence from scratch.
- Returns
A tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
TransformerEncoder¶
- class texar.torch.modules.TransformerEncoder(hparams=None)[source]¶
Transformer encoder that applies multi-head self attention for encoding sequences.
This module basically stacks MultiheadAttentionEncoder, FeedForwardNetwork and residual connections. This module supports two types of architectures, namely, the standard Transformer Encoder architecture first proposed in (Vaswani et al.) “Attention is All You Need”, and the variant first used in (Devlin et al.) BERT. See default_hparams() for the differences between the two types of architectures.
- Parameters
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- initialize_blocks()[source]¶
Helper function which initializes blocks for encoder.
Should be overridden by any classes where block initialization varies.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        "num_blocks": 6,
        "dim": 512,
        'use_bert_config': False,
        "embedding_dropout": 0.1,
        "residual_dropout": 0.1,
        "poswise_feedforward": default_transformer_poswise_net_hparams,
        'multihead_attention': {
            'name': 'multihead_attention',
            'num_units': 512,
            'num_heads': 8,
            'dropout_rate': 0.1,
            'output_dim': 512,
            'use_bias': False,
        },
        "eps": 1e-6,
        "initializer": None,
        "name": "transformer_encoder"
    }
Here:
- “num_blocks”: int
Number of stacked blocks.
- “dim”: int
Hidden dimension of the encoders.
- “use_bert_config”: bool
If False, apply the standard Transformer Encoder architecture from the original paper (Vaswani et al.) “Attention is All You Need”. If True, apply the Transformer Encoder architecture used in BERT (Devlin et al.) and the default setting of TensorFlow. The differences lie in:
The standard arch restricts the word embedding of PAD token to all zero. The BERT arch does not.
The attention bias for padding tokens: Standard architectures use -1e8 for negative attention mask. BERT uses -1e4 instead.
The residual connections between internal tensors: In BERT, a residual layer connects the tensors after layer normalization. In standard architectures, the tensors are connected before layer normalization.
- “embedding_dropout”: float
Dropout rate of the input embedding.
- “residual_dropout”: float
Dropout rate of the residual connections.
- “eps”: float
Epsilon values for layer norm layers.
- “poswise_feedforward”: dict
Hyperparameters for a feed-forward network used in residual connections. Make sure the dimension of the output tensor is equal to "dim". See default_transformer_poswise_net_hparams() for details.
- “multihead_attention”: dict
Hyperparameters for the multi-head attention strategy. Make sure the "output_dim" in this module is equal to "dim". See MultiheadAttentionEncoder for details.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
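The padding-bias difference noted under "use_bert_config" can be illustrated by building the bias from sequence lengths. This is a sketch on plain lists. It assumes the bias is added to attention scores before the softmax and is zero for real tokens, and padding_attention_bias is a hypothetical helper.

```python
def padding_attention_bias(sequence_length, max_time, use_bert_config=False):
    """Per-position bias: 0 for real tokens, a large negative for padding."""
    negative = -1e4 if use_bert_config else -1e8
    return [
        [0.0 if t < n else negative for t in range(max_time)]
        for n in sequence_length
    ]


standard_bias = padding_attention_bias([2, 3], max_time=3)
bert_bias = padding_attention_bias([2, 3], max_time=3, use_bert_config=True)
print(standard_bias)
print(bert_bias)
```

Either value makes the padded positions receive near-zero attention weight after the softmax; only the magnitude of the constant differs between the two configurations.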
- forward(inputs, sequence_length)[source]¶
Encodes the inputs.
- Parameters
inputs – A 3D Tensor of shape [batch_size, max_time, dim], containing the embedding of input sequences. Note that the embedding dimension dim must equal “dim” in hparams. The input embedding is typically an aggregation of word embedding and position embedding.
sequence_length – A 1D torch.LongTensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
BERTEncoder¶
- class texar.torch.modules.BERTEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw BERT Transformer for encoding sequences. Please see PretrainedBERTMixin for a brief description of BERT.
This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder and a dense pooler.
- Parameters
pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (the texar_data folder under the user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- reset_parameters()[source]¶
Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams['pretrained_model_name'] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
    {
        "pretrained_model_name": "bert-base-uncased",
        "embed": {
            "dim": 768,
            "name": "word_embeddings"
        },
        "vocab_size": 30522,
        "segment_embed": {
            "dim": 768,
            "name": "token_type_embeddings"
        },
        "type_vocab_size": 2,
        "position_embed": {
            "dim": 768,
            "name": "position_embeddings"
        },
        "position_size": 512,
        "encoder": {
            "dim": 768,
            "embedding_dropout": 0.1,
            "multihead_attention": {
                "dropout_rate": 0.1,
                "name": "self",
                "num_heads": 12,
                "num_units": 768,
                "output_dim": 768,
                "use_bias": True
            },
            "name": "encoder",
            "num_blocks": 12,
            "eps": 1e-12,
            "poswise_feedforward": {
                "layers": [
                    {
                        "kwargs": {
                            "in_features": 768,
                            "out_features": 3072,
                            "bias": True
                        },
                        "type": "Linear"
                    },
                    {"type": "BertGELU"},
                    {
                        "kwargs": {
                            "in_features": 3072,
                            "out_features": 768,
                            "bias": True
                        },
                        "type": "Linear"
                    }
                ]
            },
            "residual_dropout": 0.1,
            "use_bert_config": True
        },
        "hidden_size": 768,
        "initializer": None,
        "name": "bert_encoder",
    }
Here:
The default parameters are values for uncased BERT-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained BERT model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in BERT model.
- “segment_embed”: dict
Hyperparameters for segment embedding layer.
- “type_vocab_size”: int
The vocabulary size of the segment_ids passed into BertModel.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “encoder”: dict
Hyperparameters for the TransformerEncoder. See default_hparams() for details.
- “hidden_size”: int
Size of the pooler dense layer.
- “eps”: float
Epsilon values for layer norm layers.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
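The precedence described for default_hparams() (constructor argument first, then hparams, then random initialization) can be sketched as a small function. This is an illustrative sketch only; `resolve_pretrained_name` and the placeholder model name are assumptions, not part of the library.

```python
from typing import Any, Dict, Optional


def resolve_pretrained_name(ctor_name: Optional[str],
                            hparams: Dict[str, Any]) -> Optional[str]:
    """Hypothetical helper mirroring the documented precedence."""
    if ctor_name is not None:
        # The constructor argument wins; hparams are ignored in this case.
        return ctor_name
    # Otherwise fall back to the hparams entry, which may itself be None.
    return hparams.get("pretrained_model_name")


hparams = {"pretrained_model_name": "bert-base-uncased"}
# "some-other-model" is a placeholder name, not a real checkpoint.
print(resolve_pretrained_name("some-other-model", hparams))
print(resolve_pretrained_name(None, hparams))
# None means: build the architecture from hparams with random weights.
print(resolve_pretrained_name(None, {"pretrained_model_name": None}))
```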
- forward(inputs, sequence_length=None, segment_ids=None)[source]¶
Encodes the inputs. Note that the SpanBERT model does not use segmentation embedding. As a result, SpanBERT does not require segment_ids as an input when you use pre-trained SpanBERT checkpoint files.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids (optional) – A 2D Tensor of shape [batch_size, max_time], containing the segment ids of tokens in input sequences. If None (default), a tensor with all elements set to zero is used.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A pair (outputs, pooled_output)
outputs: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
pooled_output: A Tensor of size [batch_size, hidden_size] which is the output of a pooler pre-trained on top of the hidden state associated to the first token of the input (CLS), see BERT’s paper.
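To make the role of sequence_length concrete, the sketch below builds the kind of boolean mask that "tokens beyond respective sequence lengths are masked out" implies. It uses plain Python lists rather than tensors, and `length_mask` is a hypothetical name; the encoder's internal masking may differ in detail.

```python
from typing import List


def length_mask(sequence_length: List[int], max_time: int) -> List[List[bool]]:
    """Hypothetical helper: True marks a real token, False marks padding."""
    return [[t < length for t in range(max_time)] for length in sequence_length]


# Two sequences padded to max_time=4, with true lengths 3 and 1.
mask = length_mask([3, 1], max_time=4)
for row in mask:
    print(row)
```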
RoBERTaEncoder¶
- class texar.torch.modules.RoBERTaEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
RoBERTa Transformer for encoding sequences. Please see PretrainedRoBERTaMixin for a brief description of RoBERTa.
This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder and a dense pooler.
- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., roberta-base). Please refer to PretrainedRoBERTaMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{ "pretrained_model_name": "roberta-base", "embed": { "dim": 768, "name": "word_embeddings" }, "vocab_size": 50265, "position_embed": { "dim": 768, "name": "position_embeddings" }, "position_size": 514, "encoder": { "dim": 768, "embedding_dropout": 0.1, "multihead_attention": { "dropout_rate": 0.1, "name": "self", "num_heads": 12, "num_units": 768, "output_dim": 768, "use_bias": True }, "name": "encoder", "num_blocks": 12, "eps": 1e-12, "poswise_feedforward": { "layers": [ { "kwargs": { "in_features": 768, "out_features": 3072, "bias": True }, "type": "Linear" }, {"type": "BertGELU"}, { "kwargs": { "in_features": 3072, "out_features": 768, "bias": True }, "type": "Linear" } ] }, "residual_dropout": 0.1, "use_bert_config": True }, "hidden_size": 768, "initializer": None, "name": "roberta_encoder", }
Here:
The default parameters are values for RoBERTa-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained RoBERTa model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in RoBERTa model.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “encoder”: dict
Hyperparameters for the TransformerEncoder. See default_hparams() for details.
- “hidden_size”: int
Size of the pooler dense layer.
- “eps”: float
Epsilon values for layer norm layers.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- forward(inputs, sequence_length=None, segment_ids=None)[source]¶
Encodes the inputs. Differing from the standard BERT, the RoBERTa model does not use segmentation embedding. As a result, RoBERTa does not require segment_ids as an input.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A pair (outputs, pooled_output)
outputs: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
pooled_output: A Tensor of size [batch_size, hidden_size] which is the output of a pooler pre-trained on top of the hidden state associated to the first token of the input (CLS), see RoBERTa’s paper.
GPT2Encoder¶
- class texar.torch.modules.GPT2Encoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw GPT2 Transformer for encoding sequences. Please see PretrainedGPT2Mixin for a brief description of GPT2.
This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder.
- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., gpt2-small). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{ "pretrained_model_name": "gpt2-small", "vocab_size": 50257, "context_size": 1024, "embedding_size": 768, "embed": { "dim": 768, "name": "word_embeddings" }, "position_size": 1024, "position_embed": { "dim": 768, "name": "position_embeddings" }, "encoder": { "dim": 768, "num_blocks": 12, "use_bert_config": False, "embedding_dropout": 0, "residual_dropout": 0, "multihead_attention": { "use_bias": True, "num_units": 768, "num_heads": 12, "output_dim": 768 }, "eps": 1e-6, "initializer": { "type": "variance_scaling_initializer", "kwargs": { "factor": 1.0, "mode": "FAN_AVG", "uniform": True } }, "poswise_feedforward": { "layers": [ { "type": "Linear", "kwargs": { "in_features": 768, "out_features": 3072, "bias": True } }, { "type": "GPTGELU", "kwargs": {} }, { "type": "Linear", "kwargs": { "in_features": 3072, "out_features": 768, "bias": True } } ], "name": "ffn" } }, "initializer": None, "name": "gpt2_encoder", }
Here:
The default parameters are values for 124M GPT2 model.
- “pretrained_model_name”: str or None
The name of the pre-trained GPT2 model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in GPT2Model.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “encoder”: dict
Hyperparameters for the TransformerEncoder. See default_hparams() for details.
- “eps”: float
Epsilon values for layer norm layers.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- forward(inputs, sequence_length=None)[source]¶
Encodes the inputs.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
- Return type
outputs
XLNetEncoder¶
- class texar.torch.modules.XLNetEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw XLNet module for encoding sequences. Please see PretrainedXLNetMixin for a brief description of XLNet.
- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{ "pretrained_model_name": "xlnet-base-cased", "untie_r": True, "num_layers": 12, "mem_len": 0, "reuse_len": 0, "num_heads": 12, "hidden_dim": 768, "head_dim": 64, "dropout": 0.1, "attention_dropout": 0.1, "use_segments": True, "ffn_inner_dim": 3072, "activation": 'gelu', "vocab_size": 32000, "max_seq_length": 512, "initializer": None, "name": "xlnet_encoder", }
Here:
The default parameters are values for cased XLNet-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained XLNet model. If None, the model will be randomly initialized.
- “untie_r”: bool
Whether to untie the biases in attention.
- “num_layers”: int
The number of stacked layers.
- “mem_len”: int
The number of tokens to cache.
- “reuse_len”: int
The number of tokens in the current batch to be cached and reused in the future.
- “num_heads”: int
The number of attention heads.
- “hidden_dim”: int
The hidden size.
- “head_dim”: int
The dimension size of each attention head.
- “dropout”: float
Dropout rate.
- “attention_dropout”: float
Dropout rate on attention probabilities.
- “use_segments”: bool
Whether to use segment embedding.
- “ffn_inner_dim”: int
The hidden size in feed-forward layers.
- “activation”: str
relu or gelu.
- “vocab_size”: int
The vocabulary size.
- “max_seq_length”: int
The maximum sequence length for RelativePositionalEncoding.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]¶
Create parameter groups for optimizers. When lr_layer_scale is not 1.0, parameters from each layer form separate groups with different base learning rates.
The return value of this method can be used in the constructor of optimizers, for example:
    model = XLNetEncoder(...)
    param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
    optim = torch.optim.Adam(param_groups)
- Parameters
lr (float) – The learning rate. Can be omitted if lr_layer_scale is 1.0.
lr_layer_scale (float) – Per-layer LR scaling rate. The learning rate of the i-th layer is scaled by lr_layer_scale ^ (num_layers - i - 1).
decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.
- Returns
The parameter groups, used as the first argument for optimizers.
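The per-layer scaling rule, lr_layer_scale ^ (num_layers - i - 1), can be worked through numerically. The sketch below only computes the resulting learning rates in plain Python; `layer_learning_rates` is a hypothetical name and it does not build actual optimizer parameter groups.

```python
from typing import List


def layer_learning_rates(lr: float, lr_layer_scale: float,
                         num_layers: int) -> List[float]:
    """Hypothetical helper: learning rate of layer i under the documented rule."""
    return [lr * lr_layer_scale ** (num_layers - i - 1)
            for i in range(num_layers)]


# Values taken from the example call above: lr=2e-5, lr_layer_scale=0.8,
# with the default num_layers=12.
rates = layer_learning_rates(lr=2e-5, lr_layer_scale=0.8, num_layers=12)
for i, rate in enumerate(rates):
    print(f"layer {i:2d}: {rate:.3e}")
```

With a scale below 1.0, the last layer keeps the full learning rate and earlier layers receive progressively smaller ones.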
- forward(inputs, segment_ids=None, input_mask=None, memory=None, permute_mask=None, target_mapping=None, bi_data=False, clamp_len=None, cache_len=0, same_length=False, attn_type='bi', two_stream=False)[source]¶
Compute XLNet representations for the input.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids – Shape [batch_size, max_time].
input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.
memory – Memory from previous batches. A list of length num_layers, each tensor of shape [batch_size, mem_len, hidden_dim].
permute_mask – The permutation mask. Float tensor of shape [batch_size, max_time, max_time]. A value of 0 for permute_mask[i, j, k] indicates that position i attends to position j in batch k.
target_mapping – The target token mapping. Float tensor of shape [batch_size, num_targets, max_time]. A value of 1 for target_mapping[i, j, k] indicates that the i-th target token (in order of permutation) in batch k is the token at position j. Each row target_mapping[i, :, k] can have no more than one value of 1.
bi_data (bool) – Whether to use bidirectional data input pipeline.
clamp_len (int) – Clamp all relative distances larger than clamp_len. A value of -1 means no clamping.
cache_len (int) – Length of memory (number of tokens) to cache.
same_length (bool) – Whether to use the same attention length for each token.
attn_type (str) – Attention type. Supported values are “uni” and “bi”.
two_stream (bool) – Whether to use two-stream attention. Only set to True when pre-training or generating text. Defaults to False.
- Returns
A tuple of (output, new_memory):
`output`: The final layer output representations. Shape [batch_size, max_time, hidden_dim].
`new_memory`: The memory of the current batch. If cache_len is 0, then new_memory is None. Otherwise, it is a list of length num_layers, each tensor of shape [batch_size, cache_len, hidden_dim]. This can be used as the memory argument in the next batch.
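Note that input_mask uses the convention that a value of 1 marks a masked-out position, which is the opposite of a "1 means keep" mask. As a minimal sketch, assuming padding is the only reason to mask, such a mask could be derived from sequence lengths as follows; `xlnet_input_mask` is a hypothetical name and plain lists stand in for float tensors.

```python
from typing import List


def xlnet_input_mask(sequence_length: List[int],
                     max_time: int) -> List[List[float]]:
    """Hypothetical helper: 1.0 marks padding (masked out), 0.0 a real token."""
    return [[0.0 if t < length else 1.0 for t in range(max_time)]
            for length in sequence_length]


# Two sequences padded to max_time=4, with true lengths 3 and 1.
input_mask = xlnet_input_mask([3, 1], max_time=4)
for row in input_mask:
    print(row)
```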
Conv1DEncoder¶
- class texar.torch.modules.Conv1DEncoder(in_channels, in_features=None, hparams=None)[source]¶
Simple Conv-1D encoder which consists of a sequence of convolutional layers followed by a sequence of dense layers.
Wraps Conv1DNetwork to be a subclass of EncoderBase. Has exactly the same functionality as Conv1DNetwork.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The same as default_hparams() of Conv1DNetwork, except that the default name is "conv_encoder".
EncoderBase¶
RNNEncoderBase¶
- class texar.torch.modules.RNNEncoderBase(hparams=None)[source]¶
Base class for all RNN encoder classes to inherit.
- Parameters
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
default_transformer_poswise_net_hparams¶
- texar.torch.modules.default_transformer_poswise_net_hparams(input_dim, output_dim=512)[source]¶
Returns default hyperparameters of a FeedForwardNetwork as a position-wise network used in TransformerEncoder and TransformerDecoder. This is a 2-layer dense network with dropout in-between.
{ "layers": [ { "type": "Linear", "kwargs": { "in_features": input_dim, "out_features": output_dim * 4, "bias": True, } }, { "type": "nn.ReLU", "kwargs": { "inplace": True } }, { "type": "Dropout", "kwargs": { "p": 0.1, } }, { "type": "Linear", "kwargs": { "in_features": output_dim * 4, "out_features": output_dim, "bias": True, } } ], "name": "ffn" }
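To see how input_dim and output_dim flow into the layer sizes, the documented structure can be rebuilt as a plain function. This is a sketch that reproduces the dictionary shown above, not the library function itself; `poswise_net_hparams_sketch` is a hypothetical name.

```python
from typing import Any, Dict


def poswise_net_hparams_sketch(input_dim: int,
                               output_dim: int = 512) -> Dict[str, Any]:
    """Hypothetical re-creation of the documented default structure."""
    inner_dim = output_dim * 4  # the hidden layer is 4x the output width
    return {
        "layers": [
            {"type": "Linear",
             "kwargs": {"in_features": input_dim,
                        "out_features": inner_dim,
                        "bias": True}},
            {"type": "nn.ReLU", "kwargs": {"inplace": True}},
            {"type": "Dropout", "kwargs": {"p": 0.1}},
            {"type": "Linear",
             "kwargs": {"in_features": inner_dim,
                        "out_features": output_dim,
                        "bias": True}},
        ],
        "name": "ffn",
    }


ffn_hparams = poswise_net_hparams_sketch(input_dim=768, output_dim=768)
for layer in ffn_hparams["layers"]:
    print(layer["type"], layer["kwargs"])
```

For input_dim = output_dim = 768, the inner width is 3072, which matches the feed-forward sizes listed in the encoder defaults earlier in this section.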
Decoders¶
DecoderBase¶
- class texar.torch.modules.DecoderBase(token_embedder=None, token_pos_embedder=None, input_time_major=False, output_time_major=False, hparams=None)[source]¶
Base class inherited by all RNN decoder classes. See BasicRNNDecoder for the arguments.
See forward() for the inputs and outputs of RNN decoders in general.
- embed_tokens(tokens, positions)[source]¶
Convert tokens along with positions to embeddings.
- Parameters
tokens – A torch.LongTensor denoting the token indices to convert to embeddings.
positions – A torch.LongTensor with the same size as tokens, denoting the positions of the tokens. This is useful if the decoder uses positional embeddings.
- Returns
A torch.Tensor of size tokens.size() + (embed_dim,), denoting the converted embeddings.
- create_helper(*, decoding_strategy=None, start_tokens=None, end_token=None, softmax_temperature=None, infer_mode=None, **kwargs)[source]¶
Create a helper instance for the decoder. This is a shared interface for both BasicRNNDecoder and AttentionRNNDecoder.
The function provides 3 ways to specify the decoding method, with varying flexibility:
The decoding_strategy argument: A string taking one of the following values:
- “train_greedy”: decoding in teacher-forcing fashion (i.e., feeding ground truth to decode the next step), and each sample is obtained by taking the argmax of the output logits. Arguments (inputs, sequence_length) are required for this strategy, and argument embedding is optional.
- “infer_greedy”: decoding in inference fashion (i.e., feeding the generated sample to decode the next step), and each sample is obtained by taking the argmax of the output logits. Arguments (embedding, start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.
- “infer_sample”: decoding in inference fashion, and each sample is obtained by random sampling from the RNN output distribution. Arguments (embedding, start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.
This argument is used only when argument helper is None.
Example:
    embedder = WordEmbedder(vocab_size=data.vocab.size)
    decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

    # Teacher-forcing decoding
    outputs_1, _, _ = decoder(
        decoding_strategy='train_greedy',
        inputs=embedder(data_batch['text_ids']),
        sequence_length=data_batch['length'] - 1)

    # Random sample decoding. Gets 100 sequence samples
    outputs_2, _, sequence_length = decoder(
        decoding_strategy='infer_sample',
        start_tokens=[data.vocab.bos_token_id] * 100,
        end_token=data.vocab.eos_token_id,
        embedding=embedder,
        max_decoding_length=60)
The helper argument: An instance of a subclass of Helper. This provides a superset of the decoding strategies above, for example:
- TrainingHelper corresponding to the “train_greedy” strategy.
- ScheduledEmbeddingTrainingHelper and ScheduledOutputTrainingHelper for scheduled sampling.
- SoftmaxEmbeddingHelper and GumbelSoftmaxEmbeddingHelper for soft decoding and gradient backpropagation.
This method gives the maximal flexibility of configuring the decoding strategy.
Example:
    embedder = WordEmbedder(vocab_size=data.vocab.size)
    decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

    # Teacher-forcing decoding, same as above with
    # `decoding_strategy='train_greedy'`
    helper_1 = TrainingHelper(
        inputs=embedder(data_batch['text_ids']),
        sequence_length=data_batch['length'] - 1)
    outputs_1, _, _ = decoder(helper=helper_1)

    # Gumbel-softmax decoding
    helper_2 = GumbelSoftmaxEmbeddingHelper(
        embedding=embedder,
        start_tokens=[data.vocab.bos_token_id] * 100,
        end_token=data.vocab.eos_token_id,
        tau=0.1)
    outputs_2, _, sequence_length = decoder(
        max_decoding_length=60, helper=helper_2)
hparams["helper_train"] and hparams["helper_infer"]: Specifying the helper through hyperparameters. Train and infer strategy is toggled based on mode. Appropriate arguments (e.g., inputs, start_tokens, etc) are selected to construct the helper. Additional arguments for helper constructor can be provided either through **kwargs, or through hparams["helper_train/infer"]["kwargs"].
This method is used only when both decoding_strategy and helper are None.
Example:
    h = {
        "helper_infer": {
            "type": "GumbelSoftmaxEmbeddingHelper",
            "kwargs": {
                "tau": 0.1
            }
        }
    }
    embedder = WordEmbedder(vocab_size=data.vocab.size)
    decoder = BasicRNNDecoder(vocab_size=data.vocab.size, hparams=h)

    # Gumbel-softmax decoding
    decoder.eval()  # disable dropout
    output, _, _ = decoder(
        decoding_strategy=None,  # Set to None explicitly
        embedding=embedder,
        start_tokens=[data.vocab.bos_token_id] * 100,
        end_token=data.vocab.eos_token_id,
        max_decoding_length=60)
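The three ways of specifying the decoding method follow a fixed precedence: an explicit helper first, then decoding_strategy, then the hparams-configured helper chosen by mode. The sketch below only models that selection logic in plain Python; `select_helper_source` is a hypothetical function and the returned strings are labels for illustration, not library objects.

```python
from typing import Any, Optional


def select_helper_source(helper: Optional[Any],
                         decoding_strategy: Optional[str],
                         training: bool,
                         infer_mode: Optional[bool] = None) -> str:
    """Hypothetical helper: report which configuration source would be used."""
    if helper is not None:
        return "helper"  # an explicit helper overrides everything else
    if decoding_strategy is not None:
        return "decoding_strategy"
    # Fall back to hparams; infer_mode, if given, overrides the training flag.
    is_infer = infer_mode if infer_mode is not None else not training
    return "helper_infer" if is_infer else "helper_train"


print(select_helper_source(object(), "infer_greedy", training=True))
print(select_helper_source(None, "infer_greedy", training=True))
print(select_helper_source(None, None, training=True))
print(select_helper_source(None, None, training=True, infer_mode=True))
```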
- Parameters
decoding_strategy (str) – A string specifying the decoding strategy. Different arguments are required based on the strategy. Ignored if helper is given.
start_tokens (optional) – A torch.LongTensor of shape [batch_size], the start tokens. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when hparams-configured helper is used. When used with the Texar data module, to get batch_size samples where batch_size is changing according to the data module, this can be set as start_tokens=torch.full_like(batch['length'], bos_token_id).
end_token (optional) – An integer or 0D torch.LongTensor, the token that marks the end of decoding. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when hparams-configured helper is used.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples. Must be > 0. If None, 1.0 is used. Used when decoding_strategy="infer_sample".
infer_mode (optional) – If not None, overrides mode given by self.training.
**kwargs – Other keyword arguments for constructing helpers defined by hparams["helper_train"] or hparams["helper_infer"].
- Returns
The constructed helper instance.
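The effect of softmax_temperature, dividing the logits before the softmax, can be checked with a few numbers. This is a standalone sketch using only the standard library; `softmax_with_temperature` is a hypothetical name and the logits are made up for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float],
                             temperature: float = 1.0) -> List[float]:
    """Hypothetical helper: softmax over logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values only
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print("T=0.5:", [round(p, 3) for p in cold])
print("T=2.0:", [round(p, 3) for p in hot])
```

The higher temperature yields a flatter distribution, so sampling from it produces more varied tokens.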
- set_default_train_helper(helper)[source]¶
Set the default helper used in training mode.
- Parameters
helper – The helper to set as default training helper.
- set_default_infer_helper(helper)[source]¶
Set the default helper used in eval (inference) mode.
- Parameters
helper – The helper to set as default inference helper.
- dynamic_decode(helper, inputs, sequence_length, initial_state, max_decoding_length=None, impute_finished=False, step_hook=None)[source]¶
Generic routine for dynamic decoding. Please check the documentation for the TensorFlow counterpart.
- Returns
A tuple of output, final state, and sequence lengths. Note that final state could be None, when all sequences are of zero length and initial_state is also None.
- abstract initialize(helper, inputs, sequence_length, initial_state)[source]¶
Called before any decoding iterations.
This method must compute initial input values and initial state.
- Parameters
helper – The Helper instance to use.
inputs (optional) – A (structure of) input tensors.
sequence_length (optional) – A torch.LongTensor representing lengths of each sequence.
initial_state – A possibly nested structure of tensors indicating the initial decoder state.
- Returns
A tuple (finished, initial_inputs, initial_state) representing initial values of finished flags, inputs, and state.
- abstract step(helper, time, inputs, state)[source]¶
Compute the output and the state at the current time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (outputs, next_state). outputs is an object containing the decoder output. next_state is the decoder state for the next time step.
- abstract next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (next_inputs, finished). next_inputs is the tensor that should be used as input for the next step. finished is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
- finalize(outputs, final_state, sequence_lengths)[source]¶
Called after all decoding iterations have finished.
- Parameters
outputs – Outputs at each time step.
final_state – The RNNCell state after the last time step.
sequence_lengths – Sequence lengths for each sequence in batch.
- Returns
A tuple (outputs, final_state). outputs is an object containing the decoder output. final_state is the final decoder state.
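The order in which initialize(), step(), next_inputs() and finalize() are called can be illustrated with a toy loop. This is a simplified sketch, not the library's dynamic_decode: it handles a single unbatched sequence, drops the helper argument, and `CountdownDecoder` and `decode` are hypothetical names.

```python
from typing import List, Optional, Tuple


class CountdownDecoder:
    """Toy decoder: emits its input and counts down until it reaches zero."""

    def initialize(self, initial_state: int) -> Tuple[bool, int, int]:
        # Returns (finished, initial_inputs, initial_state).
        return initial_state <= 0, initial_state, initial_state

    def step(self, time: int, inputs: int, state: int) -> Tuple[int, int]:
        # Returns (outputs, next_state).
        return inputs, state - 1

    def next_inputs(self, time: int, outputs: int) -> Tuple[int, bool]:
        # Returns (next_inputs, finished).
        nxt = outputs - 1
        return nxt, nxt <= 0

    def finalize(self, outputs: List[int], final_state: int,
                 sequence_lengths: int) -> Tuple[List[int], int]:
        return outputs, final_state


def decode(decoder: CountdownDecoder, initial_state: int,
           max_decoding_length: Optional[int] = None
           ) -> Tuple[List[int], int, int]:
    finished, inputs, state = decoder.initialize(initial_state)
    outputs: List[int] = []
    time = 0
    while not finished:
        if max_decoding_length is not None and time >= max_decoding_length:
            break
        output, state = decoder.step(time, inputs, state)
        inputs, finished = decoder.next_inputs(time, output)
        outputs.append(output)
        time += 1
    outputs, final_state = decoder.finalize(outputs, state, time)
    return outputs, final_state, time


print(decode(CountdownDecoder(), 3))
print(decode(CountdownDecoder(), 3, max_decoding_length=2))
```

The returned triple mirrors the (outputs, final_state, sequence_lengths) shape described for decoding, with the sequence length being the number of steps actually taken.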
- property vocab_size¶
The vocabulary size.
- property output_layer¶
The output layer.
RNNDecoderBase¶
- class texar.torch.modules.RNNDecoderBase(input_size, vocab_size, token_embedder=None, token_pos_embedder=None, cell=None, output_layer=None, input_time_major=False, output_time_major=False, hparams=None)[source]¶
Base class inherited by all RNN decoder classes. See BasicRNNDecoder for the arguments.
See forward() for the inputs and outputs of RNN decoders in general.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The hyperparameters are the same as in default_hparams() of BasicRNNDecoder, except that the default "name" here is "rnn_decoder".
- forward(inputs=None, sequence_length=None, initial_state=None, helper=None, max_decoding_length=None, impute_finished=False, infer_mode=None, **kwargs)[source]¶
Performs decoding. This is a shared interface for both BasicRNNDecoder and AttentionRNNDecoder.
Implementation calls initialize() once and step() repeatedly on the decoder object. Please refer to tf.contrib.seq2seq.dynamic_decode.
See also
Arguments of create_helper(), for arguments like decoding_strategy.
- Parameters
inputs (optional) – Input tensors for teacher forcing decoding. Used when decoding_strategy is set to "train_greedy", or when hparams-configured helper is used.
The inputs is a torch.LongTensor used as index to look up embeddings and feed in the decoder. For example, if embedder is an instance of WordEmbedder, then inputs is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.
sequence_length (optional) – A 1D int Tensor containing the sequence length of inputs. Used when decoding_strategy="train_greedy" or hparams-configured helper is used.
initial_state (optional) – Initial state of decoding. If None (default), zero state is used.
max_decoding_length – An int scalar Tensor indicating the maximum allowed number of decoding steps. If None (default), either hparams["max_decoding_length_train"] or hparams["max_decoding_length_infer"] is used according to mode.
impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished.
helper (optional) – An instance of Helper that defines the decoding strategy. If given, decoding_strategy and helper configurations in hparams are ignored. create_helper() can be used to create some of the common helpers for, e.g., teacher-forcing decoding, greedy decoding, sample decoding, etc.
infer_mode (optional) – If not None, overrides mode given by self.training.
**kwargs – Other keyword arguments for constructing helpers defined by hparams["helper_train"] or hparams["helper_infer"].
- Returns
(outputs, final_state, sequence_lengths), where
outputs: an object containing the decoder output on all time steps.
final_state: the cell state of the final time step.
sequence_lengths: a torch.LongTensor of shape [batch_size] containing the length of each sample.
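The fallback for max_decoding_length (call argument first, then the mode-specific hyperparameter) can be sketched as below. This is an illustration of the documented rule in plain Python, not the decoder's code; `resolve_max_decoding_length` is a hypothetical name and the hparams values are made up.

```python
from typing import Any, Dict, Optional


def resolve_max_decoding_length(max_decoding_length: Optional[int],
                                hparams: Dict[str, Any],
                                training: bool) -> Optional[int]:
    """Hypothetical helper: pick the step limit following the documented rule."""
    if max_decoding_length is not None:
        return max_decoding_length  # an explicit argument takes priority
    key = ("max_decoding_length_train" if training
           else "max_decoding_length_infer")
    # None here means: decode until every sequence is finished.
    return hparams[key]


# Illustrative hyperparameter values, not library defaults.
hparams = {"max_decoding_length_train": None, "max_decoding_length_infer": 50}
print(resolve_max_decoding_length(60, hparams, training=False))
print(resolve_max_decoding_length(None, hparams, training=False))
print(resolve_max_decoding_length(None, hparams, training=True))
```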
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (next_inputs, finished). next_inputs is the tensor that should be used as input for the next step. finished is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
- property cell¶
The RNN cell.
- property state_size¶
The state size of decoder cell. Equivalent to decoder.cell.state_size.
- property output_layer¶
The output layer.
BasicRNNDecoder¶
- class texar.torch.modules.BasicRNNDecoder(input_size, vocab_size, token_embedder=None, token_pos_embedder=None, cell=None, output_layer=None, input_time_major=False, output_time_major=False, hparams=None)[source]¶
Basic RNN decoder.
- Parameters
input_size (int) – Dimension of input embeddings.
vocab_size (int, optional) – Vocabulary size. Required if output_layer is None.
token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor tokens as argument. This is the embedder called in embed_tokens() to convert input tokens to embeddings.
token_pos_embedder – An instance of torch.nn.Module, or a function taking two torch.LongTensors tokens and positions as argument. This is the embedder called in embed_tokens() to convert input tokens with positions to embeddings.
Note
Only one among token_embedder and token_pos_embedder should be specified. If neither is specified, you must subclass BasicRNNDecoder and override embed_tokens().
cell (RNNCellBase, optional) – An instance of RNNCellBase. If None (default), a cell is created as specified in hparams.
output_layer (optional) – An instance of torch.nn.Module. Apply to the RNN cell output to get logits. If None, a torch.nn.Linear layer is used with output dimension set to vocab_size. Set output_layer to identity() if you do not want to have an output layer after the RNN cell outputs.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs of the decoder. The decoder returns (outputs, final_state, sequence_lengths), where outputs is an instance of BasicRNNDecoderOutput.
Example
    embedder = WordEmbedder(vocab_size=data.vocab.size)
    decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

    # Training loss
    outputs, _, _ = decoder(
        decoding_strategy='train_greedy',
        inputs=embedder(data_batch['text_ids']),
        sequence_length=data_batch['length']-1)
    loss = tx.losses.sequence_sparse_softmax_cross_entropy(
        labels=data_batch['text_ids'][:, 1:],
        logits=outputs.logits,
        sequence_length=data_batch['length']-1)

    # Create helper
    helper = decoder.create_helper(
        decoding_strategy='infer_sample',
        start_tokens=[data.vocab.bos_token_id]*100,
        end_token=data.vocab.eos_token_id,
        embedding=embedder)

    # Inference sample
    outputs, _, _ = decoder(
        helper=helper,
        max_decoding_length=60)

    sample_text = tx.utils.map_ids_to_strs(
        outputs.sample_id, data.vocab)
    print(sample_text)
    # [
    #   the first sequence sample .
    #   the second sequence sample .
    #   ...
    # ]
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        "rnn_cell": default_rnn_cell_hparams(),
        "max_decoding_length_train": None,
        "max_decoding_length_infer": None,
        "helper_train": {
            "type": "TrainingHelper",
            "kwargs": {}
        },
        "helper_infer": {
            "type": "SampleEmbeddingHelper",
            "kwargs": {}
        },
        "name": "basic_rnn_decoder"
    }
Here:
- “rnn_cell”: dict
A dictionary of RNN cell hyperparameters. Ignored if cell is given to the decoder constructor. The default value is defined in default_rnn_cell_hparams().
- “max_decoding_length_train”: int or None
Maximum allowed number of decoding steps in training mode. If None (default), decoding is performed until fully done, e.g., encountering the <EOS> token. Ignored if "max_decoding_length" is given (not None) when calling the decoder.
- “max_decoding_length_infer”: int or None
Same as "max_decoding_length_train" but for inference mode.
- “helper_train”: dict
The hyperparameters of the helper used in training. "type" can be a helper class, its name or module path, or a helper instance. If a class name is given, the class must be from module texar.torch.modules or texar.torch.custom. This is used only when both the "decoding_strategy" and "helper" arguments are None when calling the decoder. See forward() for more details.
- “helper_infer”: dict
Same as "helper_train" but for inference mode.
- “name”: str
Name of the decoder. The default value is "basic_rnn_decoder".
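To illustrate how missing hyperparameters fall back to their defaults, here is a simplified sketch of a recursive merge. The function `merge_with_defaults` is hypothetical; Texar's own hyperparameter handling may treat some keys (such as "type" and "kwargs") specially, which this sketch does not attempt to reproduce.

```python
import copy
from typing import Any, Dict


def merge_with_defaults(user: Dict[str, Any],
                        defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively fill in keys missing from `user` with values from `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


# A subset of the defaults listed above ("rnn_cell" is omitted for brevity).
defaults = {
    "max_decoding_length_train": None,
    "max_decoding_length_infer": None,
    "helper_train": {"type": "TrainingHelper", "kwargs": {}},
    "helper_infer": {"type": "SampleEmbeddingHelper", "kwargs": {}},
    "name": "basic_rnn_decoder",
}

hparams = merge_with_defaults(
    {"max_decoding_length_infer": 60,
     "helper_infer": {"type": "GreedyEmbeddingHelper"}},
    defaults,
)
print(hparams["helper_infer"])  # {'type': 'GreedyEmbeddingHelper', 'kwargs': {}}
print(hparams["name"])          # basic_rnn_decoder
```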
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (next_inputs, finished). next_inputs is the tensor that should be used as input for the next step. finished is a torch.ByteTensor telling whether the sequence is complete, for each sequence in the batch.
BasicRNNDecoderOutput¶
- class texar.torch.modules.BasicRNNDecoderOutput(logits, sample_id, cell_output)[source]¶
The outputs of BasicRNNDecoder that include both RNN outputs and sampled IDs at each step. This is also used to store results of all the steps after decoding the whole sequence.
- property logits¶
The outputs of RNN (at each step/of all steps) by applying the output layer on cell outputs. For example, in BasicRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, vocab_size] after decoding the whole sequence.
- property sample_id¶
The sampled results (at each step/of all steps). For example, in BasicRNNDecoder with decoding strategy of "train_greedy", this is a torch.LongTensor of shape [batch_size, max_time] containing the sampled token indices of all steps. Note that the shape of sample_id differs across decoding strategies and helpers. Please refer to Helper for the detailed information.
- property cell_output¶
The output of RNN cell (at each step/of all steps). This contains the results prior to the output layer. For example, in BasicRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, cell_output_size] after decoding the whole sequence.
AttentionRNNDecoder¶
- class texar.torch.modules.AttentionRNNDecoder(input_size, encoder_output_size, vocab_size, token_embedder=None, token_pos_embedder=None, cell=None, output_layer=None, cell_input_fn=None, hparams=None)[source]¶
RNN decoder with attention mechanism.
- Parameters
input_size (int) – Dimension of input embeddings.
encoder_output_size (int) – The output size of the encoder cell.
vocab_size (int) – Vocabulary size. Required if output_layer is None.
token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor tokens as argument. This is the embedder called in embed_tokens() to convert input tokens to embeddings.
token_pos_embedder – An instance of torch.nn.Module, or a function taking two torch.LongTensors tokens and positions as argument. This is the embedder called in embed_tokens() to convert input tokens with positions to embeddings.
Note
Only one among token_embedder and token_pos_embedder should be specified. If neither is specified, you must subclass AttentionRNNDecoder and override embed_tokens().
cell (RNNCellBase, optional) – An instance of RNNCellBase. If None, a cell is created as specified in hparams.
output_layer (optional) –
An output layer that transforms cell output to logits. This can be:
A callable layer, e.g., an instance of torch.nn.Module.
A tensor. A dense layer will be created using the tensor as the kernel weights. The bias of the dense layer is determined by hparams.output_layer_bias. This can be used to tie the output layer with the input embedding matrix, as proposed in https://arxiv.org/pdf/1608.05859.pdf
None. A dense layer will be created based on vocab_size and hparams.output_layer_bias.
If no output layer after the cell output is needed, set (vocab_size=None, output_layer=texar.torch.core.identity).
cell_input_fn (callable, optional) – A callable that produces RNN cell inputs. If None (default), the default is used: lambda inputs, attention: torch.cat([inputs, attention], -1), which concatenates regular RNN cell inputs with attentions.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See texar.torch.modules.RNNDecoderBase.forward() for the inputs and outputs of the decoder. The decoder returns (outputs, final_state, sequence_lengths), where outputs is an instance of AttentionRNNDecoderOutput.

Example

    # Encodes the source
    enc_embedder = WordEmbedder(vocab_size=data.source_vocab.size, ...)
    encoder = UnidirectionalRNNEncoder(...)
    enc_outputs, _ = encoder(
        inputs=enc_embedder(data_batch['source_text_ids']),
        sequence_length=data_batch['source_length'])

    # Decodes while attending to the source
    dec_embedder = WordEmbedder(vocab_size=data.target_vocab.size, ...)
    decoder = AttentionRNNDecoder(
        # Feature size of the encoder outputs (see ModuleBase.output_size)
        encoder_output_size=encoder.output_size,
        input_size=dec_embedder.dim,
        vocab_size=data.target_vocab.size)
    outputs, _, _ = decoder(
        decoding_strategy='train_greedy',
        memory=enc_outputs,
        memory_sequence_length=data_batch['source_length'],
        inputs=dec_embedder(data_batch['target_text_ids']),
        sequence_length=data_batch['target_length']-1)
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values. Common hyperparameters are the same as in BasicRNNDecoder.default_hparams(). Additional hyperparameters are for attention mechanism configuration.

    {
        "attention": {
            "type": "LuongAttention",
            "kwargs": {
                "num_units": 256,
            },
            "attention_layer_size": None,
            "alignment_history": False,
            "output_attention": True,
        },
        # The following hyperparameters are the same as with
        # `BasicRNNDecoder`
        "rnn_cell": default_rnn_cell_hparams(),
        "max_decoding_length_train": None,
        "max_decoding_length_infer": None,
        "helper_train": {
            "type": "TrainingHelper",
            "kwargs": {}
        },
        "helper_infer": {
            "type": "SampleEmbeddingHelper",
            "kwargs": {}
        },
        "name": "attention_rnn_decoder"
    }
Here:
- “attention”: dict
Attention hyperparameters, including:
- “type”: str or class or instance
The attention type. Can be an attention class, its name or module path, or a class instance. The class must be a subclass of AttentionMechanism. See Attention Mechanism for all supported attention mechanisms. If a class name is given, the class must be from modules texar.torch.core or texar.torch.custom.
Example:

    # class name
    "type": "LuongAttention"
    "type": "BahdanauAttention"
    # module path
    "type": "texar.torch.core.BahdanauMonotonicAttention"
    "type": "my_module.MyAttentionMechanismClass"
    # class
    "type": texar.torch.core.LuongMonotonicAttention
    # instance
    "type": LuongAttention(...)

- “kwargs”: dict
Keyword arguments for the attention class constructor. Arguments memory and memory_sequence_length should not be specified here because they are passed to the decoder's forward(). Ignored if “type” is an attention class instance. For example:

    "type": "LuongAttention",
    "kwargs": {
        "num_units": 256,
        "probability_fn": torch.nn.functional.softmax,
    }

Here “probability_fn” can also be set to the string name or module path to a probability function.
- “attention_layer_size”: int or None
The depth of the attention (output) layer. The context and cell output are fed into the attention layer to generate attention at each time step. If None (default), use the context as attention at each time step.
- “alignment_history”: bool
Whether to store alignment history from all time steps in the final output state. (Stored as a time major TensorArray on which you must call stack().)
- “output_attention”: bool
If True (default), the output at each time step is the attention value. This is the behavior of Luong-style attention mechanisms. If False, the output at each time step is the output of cell. This is the behavior of Bahdanau-style attention mechanisms. In both cases, the attention tensor is propagated to the next time step via the state and is used there. This flag only controls whether the attention mechanism is propagated up to the next cell in an RNN stack or to the top RNN output.
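The "type" hyperparameter described above accepts a class name, a module path, a class, or an instance. One way such a value could be resolved is sketched below. The function `resolve_type` is hypothetical and is not Texar's actual lookup code; standard-library classes stand in for the attention mechanisms so the snippet runs without Texar installed.

```python
import collections
import importlib
from typing import Any, Sequence


def resolve_type(spec: Any, default_modules: Sequence[str]) -> Any:
    """Resolve a "type" value to a class, or pass a class/instance through.

    - dotted string: imported as a module path plus attribute name
    - bare string:   looked up in each module of `default_modules`
    - anything else: assumed to already be a class or an instance
    """
    if isinstance(spec, str):
        if "." in spec:
            module_path, _, class_name = spec.rpartition(".")
            return getattr(importlib.import_module(module_path), class_name)
        for module_path in default_modules:
            module = importlib.import_module(module_path)
            if hasattr(module, spec):
                return getattr(module, spec)
        raise ValueError(f"Class not found: {spec}")
    return spec


# Standard-library stand-ins for attention classes:
by_name = resolve_type("OrderedDict", ["collections"])
by_path = resolve_type("collections.Counter", [])
by_class = resolve_type(dict, [])
print(by_name, by_path, by_class)
```

In the documented setting, `default_modules` would correspond to texar.torch.core and texar.torch.custom.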
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (next_inputs, finished). next_inputs is the tensor that should be used as input for the next step. finished is a torch.ByteTensor telling whether the sequence is complete, for each sequence in the batch.
- forward(memory, memory_sequence_length=None, inputs=None, sequence_length=None, initial_state=None, helper=None, max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]¶
Performs decoding.
Implementation calls initialize() once and step() repeatedly on the Decoder object. Please refer to tf.contrib.seq2seq.dynamic_decode.
See also
Arguments of create_helper().
- Parameters
memory – The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, …].
memory_sequence_length – (optional) Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths.
inputs (optional) –
Input tensors for teacher forcing decoding. Used when decoding_strategy is set to "train_greedy", or when hparams-configured helper is used.
inputs is a torch.LongTensor used as indices to look up embeddings that are fed into the decoder. For example, if embedder is an instance of WordEmbedder, then inputs is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.
sequence_length (optional) – A 1D int Tensor containing the sequence length of inputs. Used when decoding_strategy="train_greedy" or hparams-configured helper is used.
initial_state (optional) – Initial state of decoding. If None (default), zero state is used.
helper (optional) – An instance of Helper that defines the decoding strategy. If given, decoding_strategy and helper configurations in hparams are ignored.
max_decoding_length – An int scalar Tensor indicating the maximum allowed number of decoding steps. If None (default), either hparams["max_decoding_length_train"] or hparams["max_decoding_length_infer"] is used according to mode.
impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished.
infer_mode (optional) – If not None, overrides mode given by self.training.
beam_width (int) – Set to use beam search. If given, decoding_strategy is ignored.
length_penalty (float) – Length penalty coefficient used in beam search decoding. Refer to https://arxiv.org/abs/1609.08144 for more details. It should be larger if longer sentences are desired.
**kwargs – Other keyword arguments for constructing helpers defined by hparams["helper_train"] or hparams["helper_infer"].
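The fallback rule for max_decoding_length described above can be summarized in a few lines. This is only a sketch of the documented precedence; `resolve_max_decoding_length` is a hypothetical helper, not a Texar function.

```python
from typing import Dict, Optional


def resolve_max_decoding_length(max_decoding_length: Optional[int],
                                hparams: Dict[str, Optional[int]],
                                training: bool,
                                infer_mode: Optional[bool] = None) -> Optional[int]:
    """Pick the effective step limit; None means "decode until done"."""
    # An explicit argument always wins over hyperparameters.
    if max_decoding_length is not None:
        return max_decoding_length
    # `infer_mode`, when given, overrides the mode implied by `training`.
    is_infer = (not training) if infer_mode is None else infer_mode
    key = "max_decoding_length_infer" if is_infer else "max_decoding_length_train"
    return hparams[key]


hp = {"max_decoding_length_train": None, "max_decoding_length_infer": 60}
print(resolve_max_decoding_length(None, hp, training=True))                   # None
print(resolve_max_decoding_length(None, hp, training=True, infer_mode=True))  # 60
print(resolve_max_decoding_length(20, hp, training=False))                    # 20
```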
- Returns
For beam search decoding, returns a dict containing keys "sample_id" and "log_prob".
"sample_id" is a torch.LongTensor of shape [batch_size, max_time, beam_width] containing generated token indexes. sample_id[:,:,0] is the highest-probable sample.
"log_prob" is a torch.Tensor of shape [batch_size, beam_width] containing the log probability of each sequence sample.
For "infer_greedy" and "infer_sample" decoding or decoding with helper, returns a tuple (outputs, final_state, sequence_lengths), where
outputs: an object containing the decoder output on all time steps.
final_state: the cell state of the final time step.
sequence_lengths: an int Tensor of shape [batch_size] containing the length of each sample.
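To see why a larger length_penalty favors longer outputs, the sketch below uses the length normalization from the paper linked in the parameter description (https://arxiv.org/abs/1609.08144), where a hypothesis of length n has its log probability divided by ((5 + n) / 6) ** alpha. Whether Texar applies exactly this form internally is an assumption here, and the function names and scores are illustrative.

```python
def length_penalty(length: int, alpha: float) -> float:
    """Length normalization term from the referenced paper."""
    return ((5.0 + length) / 6.0) ** alpha


def rescored(log_prob: float, length: int, alpha: float) -> float:
    """Log probability divided by the length penalty (higher is better)."""
    return log_prob / length_penalty(length, alpha)


# Two made-up hypotheses: a short one and a longer, less probable one.
short = {"log_prob": -4.0, "length": 5}
long_ = {"log_prob": -5.0, "length": 12}

for alpha in (0.0, 1.0):
    s = rescored(short["log_prob"], short["length"], alpha)
    l = rescored(long_["log_prob"], long_["length"], alpha)
    print(alpha, "short" if s > l else "long")
```

With alpha = 0.0 the penalty is 1 and the short hypothesis wins on raw log probability; with alpha = 1.0 the longer hypothesis scores higher after normalization.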
AttentionRNNDecoderOutput¶
- class texar.torch.modules.AttentionRNNDecoderOutput(logits, sample_id, cell_output, attention_scores, attention_context)[source]¶
The outputs of AttentionRNNDecoder that additionally include attention results.
- property logits¶
The outputs of RNN (at each step/of all steps) by applying the output layer on cell outputs. For example, in AttentionRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, vocab_size] after decoding the whole sequence.
- property sample_id¶
The sampled results (at each step/of all steps). For example, in AttentionRNNDecoder with decoding strategy of "train_greedy", this is a torch.LongTensor of shape [batch_size, max_time] containing the sampled token indices of all steps. Note that the shape of sample_id differs across decoding strategies and helpers. Please refer to Helper for the detailed information.
- property cell_output¶
The output of RNN cell (at each step/of all steps). This contains the results prior to the output layer. For example, in AttentionRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, cell_output_size] after decoding the whole sequence.
- property attention_scores¶
A single or tuple of Tensor(s) containing the alignments emitted (at the previous time step/of all time steps) for each attention mechanism.
- property attention_context¶
The attention emitted (at the previous time step/of all time steps).
GPT2Decoder¶
- class texar.torch.modules.GPT2Decoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw GPT2 Transformer for decoding sequences. Please see PretrainedGPT2Mixin for a brief description of GPT2.
This module basically stacks WordEmbedder, PositionEmbedder, TransformerDecoder.
This module supports the architecture first proposed in (Radford et al.) GPT2.
- Parameters
pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., gpt2-small). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The decoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
Otherwise, the decoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the decoder arch is defined by the configurations in hparams and weights are randomly initialized.
    {
        "name": "gpt2_decoder",
        "pretrained_model_name": "gpt2-small",
        "vocab_size": 50257,
        "context_size": 1024,
        "embedding_size": 768,
        "embed": {
            "dim": 768,
            "name": "word_embeddings"
        },
        "position_size": 1024,
        "position_embed": {
            "dim": 768,
            "name": "position_embeddings"
        },

        # hparams for TransformerDecoder
        "decoder": {
            "dim": 768,
            "num_blocks": 12,
            "embedding_dropout": 0,
            "residual_dropout": 0,
            "multihead_attention": {
                "use_bias": True,
                "num_units": 768,
                "num_heads": 12,
                "dropout_rate": 0.0,
                "output_dim": 768
            },
            "initializer": {
                "type": "variance_scaling_initializer",
                "kwargs": {
                    "factor": 1.0,
                    "mode": "FAN_AVG",
                    "uniform": True
                }
            },
            "eps": 1e-5,
            "poswise_feedforward": {
                "layers": [
                    {
                        "type": "Linear",
                        "kwargs": {
                            "in_features": 768,
                            "out_features": 3072,
                            "bias": True
                        }
                    },
                    {
                        "type": "GPTGELU",
                        "kwargs": {}
                    },
                    {
                        "type": "Linear",
                        "kwargs": {
                            "in_features": 3072,
                            "out_features": 768,
                            "bias": True
                        }
                    }
                ],
                "name": "ffn"
            }
        },
    }
Here:
The default parameters are values for 124M GPT2 model.
- “pretrained_model_name”: str or None
The name of the pre-trained GPT2 model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in GPT2Model.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “eps”: float
Epsilon values for layer norm layers.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “name”: str
Name of the module.
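As a sanity check on the statement that the defaults describe the 124M GPT2 model, the parameter count can be worked out from the sizes in the default hyperparameters. The sketch below assumes the usual GPT-2 layout, which the text above does not spell out: four biased linear projections per attention layer, two layer norms per block, one final layer norm, and an output layer tied to the word embeddings (so it adds no parameters). The helper names are hypothetical.

```python
# Sizes copied from the default hyperparameters listed above.
sizes = {
    "vocab_size": 50257,
    "position_size": 1024,
    "dim": 768,
    "num_blocks": 12,
    "ffn_dim": 3072,  # "out_features" of the first Linear layer
}


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights plus optional bias of one linear layer."""
    return in_features * out_features + (out_features if bias else 0)


def layer_norm_params(dim: int) -> int:
    """Scale and shift vectors of one layer norm."""
    return 2 * dim


def gpt2_param_count(s: dict) -> int:
    dim = s["dim"]
    embeddings = s["vocab_size"] * dim + s["position_size"] * dim
    attention = 4 * linear_params(dim, dim)  # query, key, value, output
    ffn = linear_params(dim, s["ffn_dim"]) + linear_params(s["ffn_dim"], dim)
    block = attention + ffn + 2 * layer_norm_params(dim)
    return embeddings + s["num_blocks"] * block + layer_norm_params(dim)


total = gpt2_param_count(sizes)
print(f"{total:,}")  # 124,439,808
```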
- forward(inputs=None, sequence_length=None, memory=None, memory_sequence_length=None, memory_attention_bias=None, context=None, context_sequence_length=None, helper=None, decoding_strategy='train_greedy', max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]¶
Performs decoding. Has exactly the same interface as texar.torch.modules.TransformerDecoder.forward(). Please refer to it for the detailed usage.
XLNetDecoder¶
- class texar.torch.modules.XLNetDecoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw XLNet module for decoding sequences. Please see PretrainedXLNetMixin for a brief description of XLNet.
- Parameters
pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The decoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
Otherwise, the decoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the decoder arch is defined by the configurations in hparams and weights are randomly initialized.
    {
        "pretrained_model_name": "xlnet-base-cased",
        "untie_r": True,
        "num_layers": 12,
        "mem_len": 0,
        "reuse_len": 0,
        "num_heads": 12,
        "hidden_dim": 768,
        "head_dim": 64,
        "dropout": 0.1,
        "attention_dropout": 0.1,
        "use_segments": True,
        "ffn_inner_dim": 3072,
        "activation": 'gelu',
        "vocab_size": 32000,
        "max_seq_length": 512,
        "initializer": None,
        "name": "xlnet_decoder",
    }
Here:
The default parameters are values for cased XLNet-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained XLNet model. If None, the model will be randomly initialized.
- “untie_r”: bool
Whether to untie the biases in attention.
- “num_layers”: int
The number of stacked layers.
- “mem_len”: int
The number of tokens to cache.
- “reuse_len”: int
The number of tokens in the current batch to be cached and reused in the future.
- “num_heads”: int
The number of attention heads.
- “hidden_dim”: int
The hidden size.
- “head_dim”: int
The dimension size of each attention head.
- “dropout”: float
Dropout rate.
- “attention_dropout”: float
Dropout rate on attention probabilities.
- “use_segments”: bool
Whether to use segment embedding.
- “ffn_inner_dim”: int
The hidden size in feed-forward layers.
- “activation”: str
relu or gelu.
- “vocab_size”: int
The vocabulary size.
- “max_seq_length”: int
The maximum sequence length for RelativePositionalEncoding.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- embed_tokens(tokens, positions)[source]¶
Convert tokens along with positions to embeddings.
- Parameters
tokens – A torch.LongTensor denoting the token indices to convert to embeddings.
positions – A torch.LongTensor with the same size as
tokens, denoting the positions of the tokens. This is useful if the decoder uses positional embeddings.
- Returns
A torch.Tensor of size
tokens.size() + (embed_dim,), denoting the converted embeddings.
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (next_inputs, finished). next_inputs is the tensor that should be used as input for the next step. finished is a torch.ByteTensor telling whether the sequence is complete, for each sequence in the batch.
- forward(start_tokens, memory=None, cache_len=512, max_decoding_length=500, recompute_memory=True, print_steps=False, helper_type=None, **helper_kwargs)[source]¶
Perform autoregressive decoding using XLNet. The algorithm is largely inspired by: https://github.com/rusiaaman/XLNet-gen.
- Parameters
start_tokens – A LongTensor of shape [batch_size, prompt_len], representing the tokenized initial prompt.
memory (optional) – The initial memory.
cache_len – Length of memory (number of tokens) to cache.
max_decoding_length (int) – Maximum number of tokens to decode.
recompute_memory (bool) – If True, the entire memory is recomputed for each token to generate. This leads to better generation quality because every generated token can attend to every other one, whereas reusing previous memory is equivalent to using a causal attention mask. However, it is computationally more expensive. Defaults to True.
print_steps (bool) – If True, will print decoding progress.
helper_type – Type (or name of the type) of any sub-class of Helper.
helper_kwargs – The keyword arguments to pass to constructor of the specific helper type.
- Returns
A tuple of (output, new_memory):
- output: The sampled tokens as a list of integers.
- new_memory: The memory of the sampled tokens.
XLNetDecoderOutput¶
- class texar.torch.modules.XLNetDecoderOutput(logits, sample_id)[source]¶
The output of XLNetDecoder.
- property logits¶
A torch.Tensor of shape [batch_size, max_time, vocab_size] containing the logits.
- property sample_id¶
A torch.LongTensor of shape [batch_size, max_time] (or [batch_size, max_time, vocab_size]) containing the sampled token indices. Note that the shape of sample_id differs across decoding strategies and helpers. Please refer to Helper for the detailed information.
TransformerDecoder¶
- class texar.torch.modules.TransformerDecoder(token_embedder=None, token_pos_embedder=None, vocab_size=None, output_layer=None, hparams=None)[source]¶
Transformer decoder that applies multi-head self-attention for sequence decoding.
It is a stack of MultiheadAttentionEncoder, FeedForwardNetwork, and residual connections.
- Parameters
token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor tokens as argument. This is the embedder called in embed_tokens() to convert input tokens to embeddings.
token_pos_embedder – An instance of torch.nn.Module, or a function taking two torch.LongTensors tokens and positions as argument. This is the embedder called in embed_tokens() to convert input tokens with positions to embeddings.
Note
Only one among token_embedder and token_pos_embedder should be specified. If neither is specified, you must subclass TransformerDecoder and override embed_tokens().
vocab_size (int, optional) – Vocabulary size. Required if output_layer is None.
output_layer (optional) –
An output layer that transforms cell output to logits. This can be:
A callable layer, e.g., an instance of torch.nn.Module.
A tensor. A torch.nn.Linear layer will be created using the tensor as weights. The bias of the dense layer is determined by hparams.output_layer_bias. This can be used to tie the output layer with the input embedding matrix, as proposed in https://arxiv.org/pdf/1608.05859.pdf.
None. A torch.nn.Linear layer will be created based on vocab_size and hparams.output_layer_bias.
If no output layer is needed at the end, set vocab_size to None and output_layer to identity().
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- initialize_blocks()[source]¶
Helper function which initializes blocks for decoder.
Should be overridden by any classes where block initialization varies.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
    {
        # Same as in TransformerEncoder
        "num_blocks": 6,
        "dim": 512,
        "embedding_dropout": 0.1,
        "residual_dropout": 0.1,
        "poswise_feedforward": default_transformer_poswise_net_hparams,
        "multihead_attention": {
            'name': 'multihead_attention',
            'num_units': 512,
            'output_dim': 512,
            'num_heads': 8,
            'dropout_rate': 0.1,
            'use_bias': False,
        },
        "eps": 1e-12,
        "initializer": None,
        "name": "transformer_decoder",

        # Additional for TransformerDecoder
        "embedding_tie": True,
        "output_layer_bias": False,
        "max_decoding_length": int(1e10),
    }
Here:
- “num_blocks”: int
Number of stacked blocks.
- “dim”: int
Hidden dimension of the decoder.
- “embedding_dropout”: float
Dropout rate of the input word and position embeddings.
- “residual_dropout”: float
Dropout rate of the residual connections.
- “poswise_feedforward”: dict
Hyperparameters for a feed-forward network used in residual connections. Make sure the dimension of the output tensor is equal to dim.
See default_transformer_poswise_net_hparams() for details.
- “multihead_attention”: dict
Hyperparameters for the multi-head attention strategy. Make sure the output_dim in this module is equal to dim.
See MultiheadAttentionEncoder for details.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module.
See get_initializer() for details.
- “embedding_tie”: bool
Whether to use the word embedding matrix as the output layer that computes logits. If False, a new dense layer is created.
- “eps”: float
Epsilon values for layer norm layers.
- “output_layer_bias”: bool
Whether to use bias to the output layer.
- “max_decoding_length”: int
The maximum allowed number of decoding steps. Set to a very large number to avoid the length constraint. Ignored if provided in forward() or "train_greedy" decoding is used.
- “name”: str
Name of the module.
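The "embedding_tie" option above reuses the word embedding matrix as the output layer. Conceptually, the logit for each vocabulary entry is then the dot product between the decoder's hidden vector and that entry's embedding. A tiny pure-Python sketch with made-up numbers follows; `tied_logits` is a hypothetical name, and the real module operates on batched tensors.

```python
from typing import List


def tied_logits(hidden: List[float], embedding: List[List[float]]) -> List[float]:
    """Compute logits as hidden . embedding[v] for every vocabulary entry v.

    hidden:    one decoder output vector of size [dim]
    embedding: the word embedding matrix of size [vocab_size][dim]
    """
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]


embedding = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
]
hidden = [0.5, 2.0]
logits = tied_logits(hidden, embedding)
print(logits)  # [0.5, 2.0, 2.5]
```

Because the same matrix serves both lookups, this only works when the hidden dimension equals the embedding dimension.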
- forward(inputs=None, sequence_length=None, memory=None, memory_sequence_length=None, memory_attention_bias=None, context=None, context_sequence_length=None, helper=None, decoding_strategy='train_greedy', max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]¶
Performs decoding.
The interface is very similar to that of RNN decoders (RNNDecoderBase). In particular, the function provides 3 ways to specify the decoding method, with varying flexibility:
- The decoding_strategy argument.
    - "train_greedy": decoding in teacher-forcing fashion (i.e., feeding ground truth to decode the next step), and for each step sample is obtained by taking the argmax of logits. Argument inputs is required for this strategy. sequence_length is optional.
    - "infer_greedy": decoding in inference fashion (i.e., feeding generated sample to decode the next step), and for each step sample is obtained by taking the argmax of logits. Arguments (start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.
    - "infer_sample": decoding in inference fashion, and for each step sample is obtained by random sampling from the logits. Arguments (start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.
  This argument is used only when arguments helper and beam_width are both None.
- The helper argument: An instance of subclass of Helper. This provides a superset of the decoding strategies above. The interface is the same as in RNN decoders. Please refer to texar.torch.modules.RNNDecoderBase.forward() for detailed usage and examples.
  Note that although a TrainingHelper corresponds to the "train_greedy" strategy above, using it is slower than directly setting decoding_strategy="train_greedy" (the output results are the same).
  Argument max_decoding_length is optional.
- Beam search: set beam_width to use beam search decoding. Arguments (start_tokens, end_token) are required, and argument max_decoding_length is optional.
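The three ways of choosing a decoding method can be read as a small dispatch rule. The sketch below encodes what the text states (decoding_strategy is consulted only when helper and beam_width are both None). The relative order of the beam_width and helper checks is an assumption, and `select_decoding_method` is a hypothetical function rather than part of Texar.

```python
from typing import Any, Optional

STRATEGIES = ("train_greedy", "infer_greedy", "infer_sample")


def select_decoding_method(decoding_strategy: str = "train_greedy",
                           helper: Optional[Any] = None,
                           beam_width: Optional[int] = None) -> str:
    """Return which decoding path would be taken for the given arguments."""
    if beam_width is not None:
        return "beam_search"  # assumption: checked before `helper`
    if helper is not None:
        return "helper"
    if decoding_strategy not in STRATEGIES:
        raise ValueError(f"Unknown decoding strategy: {decoding_strategy}")
    return decoding_strategy


print(select_decoding_method())                                  # train_greedy
print(select_decoding_method("infer_sample"))                    # infer_sample
print(select_decoding_method("infer_greedy", helper=object()))   # helper
print(select_decoding_method("infer_greedy", beam_width=5))      # beam_search
```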
- Parameters
memory (optional) – The memory to attend, e.g., the output of an RNN encoder. A torch.Tensor of shape [batch_size, memory_max_time, dim].
memory_sequence_length (optional) – A torch.Tensor of shape [batch_size] containing the sequence lengths for the batch entries in memory. Used to create the attention bias if memory_attention_bias is not given. Ignored if memory_attention_bias is provided.
memory_attention_bias (optional) – A torch.Tensor of shape [batch_size, num_heads, memory_max_time, dim]. An attention bias typically sets the value of a padding position to a large negative value for masking. If not given, memory_sequence_length is used to automatically create an attention bias.
inputs (optional) –
Input tensors for teacher forcing decoding. Used when decoding_strategy is set to "train_greedy", or when hparams-configured helper is used.
inputs is a torch.LongTensor used as indices to look up embeddings that are fed into the decoder. For example, if embedder is an instance of WordEmbedder, then inputs is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.
sequence_length (optional) – A torch.LongTensor of shape [batch_size], containing the sequence length of inputs. Tokens beyond the respective sequence length are masked out. Used when decoding_strategy is set to "train_greedy".
decoding_strategy (str) – A string specifying the decoding strategy, including "train_greedy", "infer_greedy", "infer_sample". Different arguments are required based on the strategy. See above for details. Ignored if beam_width or helper is set.
beam_width (int) – Set to use beam search. If given, decoding_strategy is ignored.
length_penalty (float) – Length penalty coefficient used in beam search decoding. Refer to https://arxiv.org/abs/1609.08144 for more details. It should be larger if longer sentences are desired.
context (optional) – A torch.LongTensor of shape [batch_size, length], containing the starting tokens for decoding. If context is set, start_tokens of the Helper will be ignored.
context_sequence_length (optional) – Specify the length of context.
max_decoding_length (int, optional) – The maximum allowed number of decoding steps. If None (default), use "max_decoding_length" defined in hparams. Ignored in "train_greedy" decoding.
impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished. Ignored in "train_greedy" decoding.
helper (optional) – An instance of Helper that defines the decoding strategy. If given, decoding_strategy and helper configurations in hparams are ignored.
infer_mode (optional) – If not None, overrides mode given by self.training.
**kwargs (optional, dict) –
Other keyword arguments. Typically ones such as:
start_tokens: A torch.LongTensor of shape [batch_size], the start tokens. Used when decoding_strategy is "infer_greedy" or "infer_sample" or when beam_width is set. Ignored when context is set.
When used with the Texar data module, to get batch_size samples where batch_size is changing according to the data module, this can be set as start_tokens=torch.full_like(batch['length'], bos_token_id).
end_token: An integer or 0D torch.LongTensor, the token that marks the end of decoding. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when beam_width is set.
- Returns
For “train_greedy” decoding, returns an instance of TransformerDecoderOutput which contains sample_id and logits.
For “infer_greedy” and “infer_sample” decoding or decoding with helper, returns a tuple (outputs, sequence_lengths), where outputs is an instance of TransformerDecoderOutput as in “train_greedy”, and sequence_lengths is a torch.LongTensor of shape [batch_size] containing the length of each sample.
For beam search decoding, returns a dict containing keys "sample_id" and "log_prob". "sample_id" is a torch.LongTensor of shape [batch_size, max_time, beam_width] containing generated token indexes. sample_id[:,:,0] is the most probable sample. "log_prob" is a torch.Tensor of shape [batch_size, beam_width] containing the log probability of each sequence sample.
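As a rough illustration of the shapes described above, the sketch below builds start_tokens with torch.full_like and picks the top beam out of a beam-search result. The lengths tensor, bos_token_id, and the random result dict are stand-ins invented for this example; no Texar decoder is called.

```python
import torch

# Hypothetical stand-ins: three sequence lengths and a BOS token id of 1.
lengths = torch.tensor([5, 3, 7])
bos_token_id = 1

# One start token per batch entry, with the same shape and dtype as `lengths`.
start_tokens = torch.full_like(lengths, bos_token_id)

# Fake beam-search result laid out as documented:
#   "sample_id": [batch_size, max_time, beam_width]
#   "log_prob":  [batch_size, beam_width]
batch_size, max_time, beam_width = 3, 6, 4
result = {
    "sample_id": torch.randint(0, 100, (batch_size, max_time, beam_width)),
    "log_prob": torch.randn(batch_size, beam_width),
}

# Beam 0 holds the most probable sequence for each batch entry.
best_ids = result["sample_id"][:, :, 0]    # [batch_size, max_time]
best_log_prob = result["log_prob"][:, 0]   # [batch_size]
```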
- property output_size¶
Output size of one step.
- initialize(helper, inputs, sequence_length, initial_state)[source]¶
Called before any decoding iterations.
This method must compute initial input values and initial state.
- Parameters
helper – The Helper instance to use.
inputs (optional) – A (structure of) input tensors.
sequence_length (optional) – A torch.LongTensor representing lengths of each sequence.
initial_state – A possibly nested structure of tensors indicating the initial decoder state.
- Returns
A tuple (finished, initial_inputs, initial_state) representing initial values of finished flags, inputs, and state.
- step(helper, time, inputs, state)[source]¶
Compute the output and the state at the current time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (outputs, next_state). outputs is an object containing the decoder output. next_state is the decoder state for the next time step.
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple (next_inputs, finished). next_inputs is the tensor that should be used as input for the next step. finished is a torch.ByteTensor indicating, for each sequence in the batch, whether the sequence is complete.
- finalize(outputs, final_state, sequence_lengths)[source]¶
Called after all decoding iterations have finished.
- Parameters
outputs – Outputs at each time step.
final_state – The RNNCell state after the last time step.
sequence_lengths – Sequence lengths for each sequence in batch.
- Returns
A tuple (outputs, final_state). outputs is an object containing the decoder output. final_state is the final decoder state.
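The four hooks above (initialize, step, next_inputs, finalize) can be read as a step-wise decoding loop. The sketch below is a toy under stated assumptions: ToyHelper, ToyDecoder, and decode are hypothetical names written for this example, the "model" merely predicts the previous token plus one, and the loop is not Texar's actual implementation.

```python
import torch


class ToyHelper:
    """Hypothetical helper that only stores start tokens and the end token."""

    def __init__(self, start_tokens, end_token):
        self.start_tokens = start_tokens
        self.end_token = end_token


class ToyDecoder:
    """Hypothetical decoder exposing the four hooks described above."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def initialize(self, helper, inputs, sequence_length, initial_state):
        finished = torch.zeros_like(helper.start_tokens, dtype=torch.bool)
        return finished, helper.start_tokens, initial_state

    def step(self, helper, time, inputs, state):
        # Toy "model": always predicts (previous token + 1) mod vocab_size.
        next_ids = (inputs + 1) % self.vocab_size
        logits = torch.nn.functional.one_hot(next_ids, self.vocab_size).float()
        return logits, state

    def next_inputs(self, helper, time, outputs):
        sample_ids = outputs.argmax(dim=-1)
        finished = sample_ids == helper.end_token
        return sample_ids, finished

    def finalize(self, outputs, final_state, sequence_lengths):
        # Stack per-step outputs into [batch_size, max_time, vocab_size].
        return torch.stack(outputs, dim=1), final_state


def decode(decoder, helper, max_decoding_length):
    finished, inputs, state = decoder.initialize(helper, None, None, None)
    lengths = torch.zeros_like(helper.start_tokens)
    outputs = []
    for time in range(max_decoding_length):
        step_out, state = decoder.step(helper, time, inputs, state)
        inputs, step_finished = decoder.next_inputs(helper, time, step_out)
        lengths += (~finished).long()  # count this step for unfinished rows
        finished = finished | step_finished
        outputs.append(step_out)
        if bool(finished.all()):
            break
    return decoder.finalize(outputs, state, lengths) + (lengths,)


helper = ToyHelper(start_tokens=torch.tensor([0, 2]), end_token=4)
logits, final_state, lengths = decode(ToyDecoder(vocab_size=5), helper, 10)
```

In this toy run the second sequence reaches the end token two steps before the first, so the loop keeps stepping until every row is finished and the returned lengths differ per row.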
TransformerDecoderOutput¶
- class texar.torch.modules.TransformerDecoderOutput(logits, sample_id)[source]¶
The output of TransformerDecoder.
- property logits¶
A torch.Tensor of shape [batch_size, max_time, vocab_size] containing the logits.
- property sample_id¶
A torch.LongTensor of shape [batch_size, max_time] (or [batch_size, max_time, vocab_size]) containing the sampled token indices. Note that the shape of sample_id differs across decoding strategies and helpers. Please refer to Helper for details.
Helper¶
- class texar.torch.modules.Helper(*args, **kwds)[source]¶
Interface for implementing sampling in seq2seq decoders.
Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.Helper.
- initialize(embedding_fn, inputs, sequence_length)[source]¶
Initialize the current batch.
- Parameters
embedding_fn – A function taking input tokens and timestamps, returning embedding tensors.
inputs – Input tensors.
sequence_length – An int32 vector tensor.
- Returns
(initial_finished, initial_inputs).
TrainingHelper¶
- class texar.torch.modules.TrainingHelper(time_major=False)[source]¶
A helper for use during training. Only reads inputs.
Returned sample_ids are the argmax of the RNN output logits.
Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.TrainingHelper.
- Parameters
time_major (bool) – Whether the tensors in inputs are time major. If False (default), they are assumed to be batch major.
EmbeddingHelper¶
- class texar.torch.modules.EmbeddingHelper(start_tokens, end_token)[source]¶
A generic helper for use during inference.
Uses output logits for sampling, and passes the result through an embedding layer to get the next input.
- Parameters
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
- Raises
ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
GreedyEmbeddingHelper¶
- class texar.torch.modules.GreedyEmbeddingHelper(start_tokens, end_token)[source]¶
A helper for use during inference.
Uses the argmax of the output (treated as logits) and passes the result through an embedding layer to get the next input.
Note that for greedy decoding, Texar’s decoders provide a simpler interface by specifying decoding_strategy='infer_greedy' when calling a decoder (see, e.g., RNN decoder). In this case, use of GreedyEmbeddingHelper is not necessary.
Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.GreedyEmbeddingHelper.
- Parameters
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
- Raises
ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
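A minimal sketch of one greedy step as described above: take the argmax of the logits, mark sequences that produced end_token as finished, and embed the chosen ids for the next step. The embedding layer, end_token value, and logits here are made-up stand-ins, not Texar internals.

```python
import torch

torch.manual_seed(0)
vocab_size, dim = 6, 4
embedding = torch.nn.Embedding(vocab_size, dim)  # stand-in embedder
end_token = 2                                    # assumed end-of-sequence id

# Pretend these are the decoder's output logits at one time step.
logits = torch.tensor([[0.1, 0.2, 3.0, 0.0, 0.0, 0.0],
                       [2.0, 0.1, 0.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0, 0.5, 1.5]])

sample_ids = logits.argmax(dim=-1)     # greedy choice per batch entry
finished = sample_ids == end_token     # rows that just emitted end_token
next_inputs = embedding(sample_ids)    # [batch_size, dim], fed to the next step
```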
SampleEmbeddingHelper¶
- class texar.torch.modules.SampleEmbeddingHelper(start_tokens, end_token, softmax_temperature=None)[source]¶
A helper for use during inference.
Uses sampling (from a distribution) instead of argmax and passes the result through an embedding layer to get the next input.
Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.SampleEmbeddingHelper.
- Parameters
embedding – A callable or the params argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor of ids (argmax ids), or take two arguments (ids, times), where ids is a vector of argmax ids, and times is a vector of current time steps (i.e., position ids). The latter case can be used when embedding is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.
- Raises
ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
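The effect of softmax_temperature can be sketched in plain PyTorch. The function name sample_with_temperature is hypothetical, and the code is an illustration of dividing logits by the temperature before sampling, not the helper's actual source.

```python
import torch


def sample_with_temperature(logits, softmax_temperature=1.0):
    """Sample one token id per row after scaling logits by the temperature."""
    if softmax_temperature <= 0:
        raise ValueError("softmax_temperature must be strictly greater than 0")
    probs = torch.softmax(logits / softmax_temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)


torch.manual_seed(0)
logits = torch.tensor([[4.0, 1.0, 0.0], [0.0, 0.0, 5.0]])
sample_ids = sample_with_temperature(logits, softmax_temperature=0.7)

# A low temperature sharpens the distribution; a high one flattens it.
sharp = torch.softmax(logits / 0.1, dim=-1)
flat = torch.softmax(logits / 10.0, dim=-1)
```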
TopKSampleEmbeddingHelper¶
- class texar.torch.modules.TopKSampleEmbeddingHelper(start_tokens, end_token, top_k=10, softmax_temperature=None)[source]¶
A helper for use during inference.
Samples from top_k most likely candidates from a vocab distribution, and passes the result through an embedding layer to get the next input.
- Parameters
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
top_k (int, optional) – Number of top candidates to sample from. Must be >=0. If set to 0, samples from all candidates (i.e., regular random sample decoding). Defaults to 10.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.
- Raises
ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
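One way to realize top-k sampling is to mask every logit below the k-th largest before sampling. The sketch below assumes that approach; top_k_sample is a hypothetical function, and the helper's real implementation may differ in detail.

```python
import torch


def top_k_sample(logits, top_k=10, softmax_temperature=1.0):
    """Sample one id per row from the top_k most likely candidates."""
    logits = logits / softmax_temperature
    if top_k > 0:  # top_k == 0 means "sample from all candidates"
        k = min(top_k, logits.size(-1))
        # Value of the k-th largest logit in each row, shaped [batch, 1].
        kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)


torch.manual_seed(0)
logits = torch.tensor([[1.0, 5.0, 3.0, 0.5], [2.0, 0.0, 4.0, 6.0]])
greedy_like = top_k_sample(logits, top_k=1)  # only the argmax survives
```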
TopPSampleEmbeddingHelper¶
- class texar.torch.modules.TopPSampleEmbeddingHelper(start_tokens, end_token, p=0.9, softmax_temperature=None)[source]¶
A helper for use during inference.
Samples from candidates that have a cumulative probability of at most p when arranged in decreasing order, and passes the result through an embedding layer to get the next input. This is also known as “Nucleus Sampling”, as proposed in the paper “The Curious Case of Neural Text Degeneration” (Holtzman et al.).
- Parameters
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
p (float, optional) – A value used to filter out tokens whose cumulative probability is greater than p when arranged in decreasing order of probabilities. Must be in the range [0, 1.0]. If set to 1, samples from all candidates (i.e., regular random sample decoding). Defaults to 0.9.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.
- Raises
ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
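A sketch of the filtering rule described above, following the "cumulative probability of at most p" wording. The function top_p_sample is hypothetical, and always keeping the single most likely token (so that sampling stays well defined for small p) is an assumption of this example.

```python
import torch


def top_p_sample(logits, p=0.9, softmax_temperature=1.0):
    """Sample one id per row from the smallest-rank tokens within mass p."""
    logits = logits / softmax_temperature
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    # Drop tokens whose cumulative probability exceeds p ...
    remove = cum_probs > p
    # ... but always keep the most likely token (assumption of this sketch).
    remove[..., 0] = False
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    probs = torch.softmax(sorted_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    # Map positions in the sorted order back to vocabulary ids.
    return sorted_idx.gather(-1, choice).squeeze(-1)


torch.manual_seed(0)
example_logits = torch.tensor([[0.1, 0.6, 0.3]]).log()
sample_id = top_p_sample(example_logits, p=0.95)
```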
SoftmaxEmbeddingHelper¶
- class texar.torch.modules.SoftmaxEmbeddingHelper(start_tokens, end_token, tau, stop_gradient=False, use_finish=True)[source]¶
A helper that feeds softmax probabilities over vocabulary to the next step.
Uses the softmax probability vector to pass through word embeddings to get the next input (i.e., a mixed word embedding).
A subclass of Helper. Used as a helper to RNNDecoderBase in inference mode.
- Parameters
embedding – A callable or the params argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor of ids (argmax ids), or take two arguments (ids, times), where ids is a vector of argmax ids, and times is a vector of current time steps (i.e., position ids). The latter case can be used when embedding is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
tau – A float scalar tensor, the softmax temperature.
stop_gradient (bool) – Whether to stop the gradient backpropagation when feeding softmax vector to the next step.
use_finish (bool) – Whether to stop decoding once end_token is generated. If False, decoding will continue until max_decoding_length of the decoder is reached.
- Raises
ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
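The "mixed word embedding" can be pictured as a probability-weighted average of embedding rows. In the sketch below, soft_embed is a hypothetical function, and reading stop_gradient as detaching the probability vector is an assumption of this example.

```python
import torch


def soft_embed(logits, embedding_weight, tau, stop_gradient=False):
    """Mix embedding rows using softmax(logits / tau) as weights."""
    probs = torch.softmax(logits / tau, dim=-1)   # [batch_size, vocab_size]
    if stop_gradient:
        probs = probs.detach()                    # block gradients to logits
    return probs @ embedding_weight               # [batch_size, dim]


torch.manual_seed(0)
embedding = torch.nn.Embedding(5, 3)              # stand-in word embedder
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.0],
                       [0.0, 0.0, 3.0, 1.0, -2.0]])
next_inputs = soft_embed(logits, embedding.weight, tau=0.5)
```

As tau shrinks, the weights approach a one-hot vector and the mixed embedding approaches the embedding of the argmax token.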
GumbelSoftmaxEmbeddingHelper¶
- class texar.torch.modules.GumbelSoftmaxEmbeddingHelper(start_tokens, end_token, tau, straight_through=False, stop_gradient=False, use_finish=True)[source]¶
A helper that feeds Gumbel softmax sample to the next step.
Uses the Gumbel softmax vector to pass through word embeddings to get the next input (i.e., a mixed word embedding).
A subclass of Helper. Used as a helper to RNNDecoderBase in inference mode.
Same as SoftmaxEmbeddingHelper except that here Gumbel softmax (instead of softmax) is used.
- Parameters
embedding – A callable or the params argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor of ids (argmax ids), or take two arguments (ids, times), where ids is a vector of argmax ids, and times is a vector of current time steps (i.e., position ids). The latter case can be used when embedding is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
tau – A float scalar tensor, the softmax temperature.
straight_through (bool) – Whether to use straight through gradient between time steps. If True, a single token with highest probability (i.e., greedy sample) is fed to the next step and gradient is computed using straight through. If False (default), the soft Gumbel-softmax distribution is fed to the next step.
stop_gradient (bool) – Whether to stop the gradient backpropagation when feeding softmax vector to the next step.
use_finish (bool) – Whether to stop decoding once end_token is generated. If False, decoding will continue until max_decoding_length of the decoder is reached.
- Raises
ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
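The difference between the soft and straight-through variants can be sketched with torch.nn.functional.gumbel_softmax, whose hard flag plays the role of straight_through here. This is an illustration with made-up tensors, not the helper's own code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embedding = torch.nn.Embedding(5, 3)                  # stand-in word embedder
logits = torch.randn(2, 5, requires_grad=True)
tau = 0.8

# Soft sample: a dense distribution over the vocabulary.
soft = F.gumbel_softmax(logits, tau=tau, hard=False)
# Straight-through sample: one-hot in the forward pass, soft gradients backward.
hard = F.gumbel_softmax(logits, tau=tau, hard=True)

next_inputs_soft = soft @ embedding.weight            # mixed embedding
next_inputs_hard = hard @ embedding.weight            # single-token embedding
next_inputs_hard.sum().backward()                     # gradients reach logits
```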
get_helper¶
- texar.torch.modules.get_helper(helper_type, start_tokens=None, end_token=None, **kwargs)[source]¶
Creates a Helper instance.
- Parameters
helper_type – A Helper class, its name or module path, or a class instance. If a class instance is given, it is returned directly.
start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.
end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
**kwargs – Additional keyword arguments for constructing the helper.
- Returns
A helper instance.
Classifiers¶
BERTClassifier¶
- class texar.torch.modules.BERTClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on BERT modules. Please see PretrainedBERTMixin for a brief description of BERT.
This is a combination of the BERTEncoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.
Arguments are the same as in BERTEncoder.
- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in BERTEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "bert_classifier"
}
Here:
Same hyperparameters as in BERTEncoder. See the default_hparams(). An instance of BERTEncoder is created for feature extraction.
Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor, except for argument out_features which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the first time step (which is the CLS token). Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “dropout”: float
The dropout rate of the BERT encoder output.
- “name”: str
Name of the classifier.
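The three clas_strategy options can be pictured as different ways of feeding encoder output into the logit layer. The sketch below uses a random tensor in place of real encoder output, and the flattening used for all_time is an assumption of this example; it would explain why max_seq_length is required for that strategy.

```python
import torch

batch_size, max_time, dim, num_classes = 4, 6, 8, 3
# Stand-in for encoder output of shape [batch_size, max_time, dim].
enc_output = torch.randn(batch_size, max_time, dim)

# cls_time: classify from the first time step only.
cls_layer = torch.nn.Linear(dim, num_classes)
cls_logits = cls_layer(enc_output[:, 0, :])                  # [batch, classes]

# all_time: classify from all steps at once. The flattened input size
# depends on the sequence length, so the length must be known up front.
all_layer = torch.nn.Linear(max_time * dim, num_classes)
all_logits = all_layer(enc_output.reshape(batch_size, -1))   # [batch, classes]

# time_wise: one prediction per time step.
step_layer = torch.nn.Linear(dim, num_classes)
step_logits = step_layer(enc_output)                 # [batch, time, classes]
```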
- forward(inputs, sequence_length=None, segment_ids=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in BERTEncoder.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
segment_ids (optional) – A 2D Tensor of shape [batch_size, max_time], containing the segment ids of tokens in input sequences. If None (default), a tensor with all elements set to zero is used.
- Returns
A tuple (logits, pred), containing the logits over classes and the predictions, respectively.
If clas_strategy is cls_time or all_time:
    If num_classes == 1, logits and pred are both of shape [batch_size].
    If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].
If clas_strategy is time_wise:
    If num_classes == 1, logits and pred are both of shape [batch_size, max_time].
    If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].
RoBERTaClassifier¶
- class texar.torch.modules.RoBERTaClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on RoBERTa modules. Please see PretrainedRoBERTaMixin for a brief description of RoBERTa.
This is a combination of the RoBERTaEncoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.
Arguments are the same as in RoBERTaEncoder.
- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., roberta-base). Please refer to PretrainedRoBERTaMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in RoBERTaEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "roberta_classifier"
}
Here:
Same hyperparameters as in RoBERTaEncoder. See the default_hparams(). An instance of RoBERTaEncoder is created for feature extraction.
Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor, except for argument out_features which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the first time step (which is the CLS token). Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “dropout”: float
The dropout rate of the RoBERTa encoder output.
- “name”: str
Name of the classifier.
- forward(inputs, sequence_length=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in RoBERTaEncoder.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A tuple (logits, pred), containing the logits over classes and the predictions, respectively.
If clas_strategy is cls_time or all_time:
    If num_classes == 1, logits and pred are both of shape [batch_size].
    If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].
If clas_strategy is time_wise:
    If num_classes == 1, logits and pred are both of shape [batch_size, max_time].
    If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].
GPT2Classifier¶
- class texar.torch.modules.GPT2Classifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on GPT2 modules. Please see PretrainedGPT2Mixin for a brief description of GPT2.
This is a combination of the GPT2Encoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.
Arguments are the same as in GPT2Encoder.
- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., gpt2-small). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in GPT2Encoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "gpt2_classifier"
}
Here:
Same hyperparameters as in GPT2Encoder. See the default_hparams(). An instance of GPT2Encoder is created for feature extraction.
Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor, except for argument out_features which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the last time step. Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “dropout”: float
The dropout rate of the GPT2 encoder output.
- “name”: str
Name of the classifier.
- forward(inputs, sequence_length=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in GPT2Encoder.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A tuple (logits, pred), containing the logits over classes and the predictions, respectively.
If clas_strategy is cls_time or all_time:
    If num_classes == 1, logits and pred are both of shape [batch_size].
    If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].
If clas_strategy is time_wise:
    If num_classes == 1, logits and pred are both of shape [batch_size, max_time].
    If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].
UnidirectionalRNNClassifier¶
- class texar.torch.modules.UnidirectionalRNNClassifier(input_size, cell=None, output_layer=None, hparams=None)[source]¶
Unidirectional RNN classifier. This is a combination of the UnidirectionalRNNEncoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.
Arguments are the same as in UnidirectionalRNNEncoder.
- Parameters
input_size (int) – The number of expected features in the input for the cell.
cell (RNNCell, optional) – If not specified, a cell is created as specified in hparams["rnn_cell"].
output_layer (optional) – An instance of torch.nn.Module. Applies to the RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer"].
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in UnidirectionalRNNEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "final_time",
    "max_seq_length": None,
    "name": "unidirectional_rnn_classifier"
}
Here:
Same hyperparameters as in UnidirectionalRNNEncoder. See the default_hparams(). An instance of UnidirectionalRNNEncoder is created for feature extraction.
Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor, except for argument out_features which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
final_time: Sequence-level classification based on the output of the final time step. Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “name”: str
Name of the classifier.
- forward(inputs, sequence_length=None, initial_state=None, time_major=False)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in UnidirectionalRNNEncoder.
- Parameters
inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.
sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero-out outputs when past a batch element’s sequence length.
initial_state (optional) – Initial state of the RNN.
time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].
- Returns
A tuple (logits, pred), containing the logits over classes and the predictions, respectively.
If clas_strategy is final_time or all_time:
    If num_classes == 1, logits and pred are both of shape [batch_size].
    If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].
If clas_strategy is time_wise:
    If num_classes == 1, logits and pred are both of shape [batch_size, max_time].
    If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].
    If time_major is True, the batch and time dimensions are exchanged.
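Two of the shape conventions above can be sketched with plain tensors. Gathering each sequence's output at its last valid step is only one plausible reading of final_time (the classifier may use the final RNN state instead), and all tensors here are invented stand-ins.

```python
import torch

batch_size, max_time, dim = 3, 5, 4
# Stand-in for batch-major RNN outputs of shape [batch_size, max_time, dim].
outputs = torch.arange(batch_size * max_time * dim, dtype=torch.float32)
outputs = outputs.reshape(batch_size, max_time, dim)
sequence_length = torch.tensor([5, 2, 3])

# final_time (assumed reading): pick each row's output at its last valid step.
last_index = (sequence_length - 1).view(-1, 1, 1).expand(-1, 1, dim)
final_outputs = outputs.gather(1, last_index).squeeze(1)   # [batch_size, dim]

# time_wise with num_classes == 1: logits are [batch_size, max_time];
# with time_major=True the two leading dimensions are exchanged.
time_wise_logits = torch.randn(batch_size, max_time)
time_major_logits = time_wise_logits.transpose(0, 1)       # [max_time, batch_size]
```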
Conv1DClassifier¶
- class texar.torch.modules.Conv1DClassifier(in_channels, in_features=None, hparams=None)[source]¶
Simple Conv-1D classifier. This is a combination of the Conv1DEncoder with a classification layer.
- Parameters
in_channels (int) – Number of channels in the input tensor.
in_features (int) – Size of the feature dimension in the input tensor.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs. If "data_format" is set to "channels_first" (this is the default), inputs must be a tensor of shape [batch_size, channels, length]. If "data_format" is set to "channels_last", inputs must be a tensor of shape [batch_size, length, channels]. For example, for sequence classification, length corresponds to time steps, and channels corresponds to embedding dim.
Example:
import torch
from texar.torch.modules import Conv1DClassifier

inputs = torch.randn([64, 20, 256])
clas = Conv1DClassifier(in_channels=20, in_features=256,
                        hparams={'num_classes': 10})
logits, pred = clas(inputs)
# logits == Tensor of shape [64, 10]
# pred == Tensor of shape [64]
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in Conv1DEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": {
        "use_bias": False
    },
    "name": "conv1d_classifier"
}
Here:
Same hyperparameters as in Conv1DEncoder. See the default_hparams(). An instance of Conv1DEncoder is created for feature extraction.
Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional torch.nn.Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be equal to out_features of the final dense layer of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor, except for argument out_features which is set to "num_classes". Ignored if no extra logit layer is appended.
- “name”: str
Name of the classifier.
- forward(input, sequence_length=None, dtype=None, data_format=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in Conv1DEncoder.
The predictions of binary classification (num_classes=1) and multi-way classification (num_classes>1) are different, as explained below.
- Parameters
input – The inputs to the network, which is a 3D tensor. See Conv1DEncoder for more details.
sequence_length (optional) – An int tensor of shape [batch_size] or a python array containing the length of each element in input. If given, time steps beyond the length will first be masked out before feeding to the layers.
dtype (optional) – Type of the inputs. If not provided, infers from inputs automatically.
data_format (optional) – Data format of the input tensor. If channels_last, the last dimension is treated as the channel dimension, so the size of the input should be [batch_size, X, channel]. If channels_first, the first dimension after the batch dimension is treated as the channel dimension, so the size should be [batch_size, channel, X]. Defaults to None, in which case the value is taken from the hyperparameters.
- Returns
A tuple (logits, pred), where logits is a torch.Tensor of shape [batch_size, num_classes] for num_classes>1, and [batch_size] for num_classes=1 (i.e., binary classification). pred is the prediction, a torch.LongTensor of shape [batch_size]. For binary classification, the standard sigmoid function is used for prediction, and the class labels are {0, 1}.
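The difference between the binary and multi-way cases can be sketched in plain Python. This is only an illustration of the rule stated above, not the library's code: the helper name `predict` is hypothetical, and thresholding the sigmoid at 0.5 is an assumption.

```python
import math


def predict(logits, num_classes):
    """Turn the logits of one example into a class label.

    Assumption: binary predictions threshold the sigmoid output at 0.5.
    """
    if num_classes == 1:
        # Binary case: `logits` is a single float, labels are {0, 1}.
        prob = 1.0 / (1.0 + math.exp(-logits))
        return 1 if prob > 0.5 else 0
    # Multi-way case: `logits` holds one score per class; pick the largest.
    return max(range(num_classes), key=lambda k: logits[k])


binary_pred = predict(1.2, num_classes=1)
multi_pred = predict([0.1, 2.3, -0.7], num_classes=3)
```

In the binary case the logits carry no class dimension, which is why their shape is [batch_size] rather than [batch_size, 1].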
- property num_classes¶
The number of classes.
- property encoder¶
The classifier neural network.
- has_layer(layer_name)[source]¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)[source]¶
Returns the layer with the name. Returns None if the layer name does not exist.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.
XLNetClassifier¶
- class texar.torch.modules.XLNetClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on XLNet modules. Please see PretrainedXLNetMixin for a brief description of XLNet.
Arguments are the same as in XLNetEncoder.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in XLNetEncoder
    ...
    # (2) Additional hyperparameters
    "clas_strategy": "cls_time",
    "use_projection": True,
    "num_classes": 2,
    "name": "xlnet_classifier",
}
Here:
- Same hyperparameters as in XLNetEncoder. See its default_hparams(). An instance of XLNetEncoder is created for feature extraction.
Additional hyperparameters:
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the last time step (which is the CLS token). Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “use_projection”: bool
If True, an additional Linear layer is added after the summary step.
- “num_classes”: int
Number of classes:
If > 0, an additional torch.nn.Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “name”: str
Name of the classifier.
- param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]¶
Create parameter groups for optimizers. When lr_layer_scale is not 1.0, parameters from each layer form separate groups with different base learning rates.
The return value of this method can be used in the constructor of optimizers, for example:
model = XLNetClassifier(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)
- Parameters
lr (float) – The learning rate. Can be omitted if lr_layer_scale is 1.0.
lr_layer_scale (float) – Per-layer LR scaling rate. The learning rate of the i-th layer is scaled by lr_layer_scale ^ (num_layers - i - 1).
decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.
- Returns
The parameter groups, used as the first argument for optimizers.
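The per-layer scaling formula in the parameter description can be worked through with a few lines of plain Python. This is a sketch under two assumptions: layers are indexed from 0 at the bottom (closest to the input), and the scale multiplies the base learning rate. The helper name `layer_learning_rates` is hypothetical.

```python
def layer_learning_rates(lr, lr_layer_scale, num_layers):
    """Learning rate of each layer, indexed from the bottom layer (i = 0)."""
    return [
        lr * lr_layer_scale ** (num_layers - i - 1)
        for i in range(num_layers)
    ]


rates = layer_learning_rates(lr=2e-5, lr_layer_scale=0.8, num_layers=3)
# The top layer (i = num_layers - 1) keeps the base rate, because its
# exponent is 0; every layer below it receives a smaller rate.
```

With a scale below 1.0, the layers nearest the input are updated most conservatively, which is the usual motivation for layer-wise learning rate decay when fine-tuning.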
- forward(inputs, segment_ids=None, input_mask=None)[source]¶
Feeds the inputs through the network and makes classification.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids – Shape [batch_size, max_time].
input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.
- Returns
A tuple (logits, preds), containing the logits over classes and the predictions, respectively.
If clas_strategy is cls_time or all_time:
If num_classes == 1, logits and preds are both of shape [batch_size].
If num_classes > 1, logits is of shape [batch_size, num_classes] and preds is of shape [batch_size].
If clas_strategy is time_wise:
If num_classes == 1, logits and preds are both of shape [batch_size, max_time].
If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and preds is of shape [batch_size, max_time].
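The shape rules above can be summarised as a small lookup. The function below is a hypothetical helper written for illustration only; it covers num_classes >= 1 and does not call the library.

```python
def classifier_output_shapes(clas_strategy, num_classes, batch_size, max_time):
    """Return (logits_shape, preds_shape) following the rules listed above."""
    if clas_strategy in ("cls_time", "all_time"):
        # One prediction per sequence.
        preds_shape = (batch_size,)
    elif clas_strategy == "time_wise":
        # One prediction per time step.
        preds_shape = (batch_size, max_time)
    else:
        raise ValueError(f"Unknown clas_strategy: {clas_strategy!r}")
    # Binary classification (num_classes == 1) has no class dimension.
    if num_classes == 1:
        logits_shape = preds_shape
    else:
        logits_shape = preds_shape + (num_classes,)
    return logits_shape, preds_shape


shapes = classifier_output_shapes("time_wise", 3, batch_size=8, max_time=16)
```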
Regressors¶
XLNetRegressor¶
- class texar.torch.modules.XLNetRegressor(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Regressor based on XLNet modules. Please see PretrainedXLNetMixin for a brief description of XLNet.
Arguments are the same as in XLNetEncoder.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in XLNetEncoder
    ...
    # (2) Additional hyperparameters
    "regr_strategy": "cls_time",
    "use_projection": True,
    "logit_layer_kwargs": None,
    "name": "xlnet_regressor",
}
Here:
Same hyperparameters as in XLNetEncoder. See its default_hparams(). An instance of XLNetEncoder is created for feature extraction.
Additional hyperparameters:
- “regr_strategy”: str
The regression strategy, one of:
cls_time: Sequence-level regression based on the output of the last time step (which is the CLS token). Each sequence has a prediction.
all_time: Sequence-level regression based on the output of all time steps. Each sequence has a prediction.
time_wise: Step-wise regression, i.e., make regression for each time step based on its output.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor. Ignored if no extra logit layer is appended.
- “use_projection”: bool
If True, an additional torch.nn.Linear layer is added after the summary step.
- “name”: str
Name of the regressor.
- param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]¶
Create parameter groups for optimizers. When lr_layer_scale is not 1.0, parameters from each layer form separate groups with different base learning rates.
The return value of this method can be used in the constructor of optimizers, for example:
model = XLNetRegressor(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)
- Parameters
lr (float) – The learning rate. Can be omitted if lr_layer_scale is 1.0.
lr_layer_scale (float) – Per-layer LR scaling rate. The learning rate of the i-th layer is scaled by lr_layer_scale ^ (num_layers - i - 1).
decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.
- Returns
The parameter groups, used as the first argument for optimizers.
- forward(inputs, segment_ids=None, input_mask=None)[source]¶
Feeds the inputs through the network and makes regression.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids – Shape [batch_size, max_time].
input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.
- Returns
Regression predictions.
If regr_strategy is cls_time or all_time, predictions have shape [batch_size].
If regr_strategy is time_wise, predictions have shape [batch_size, max_time].
EncoderDecoders¶
T5EncoderDecoder¶
- class texar.torch.modules.T5EncoderDecoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
The pre-trained T5 model. Please see PretrainedT5Mixin for a brief description of T5.
This module basically stacks WordEmbedder, T5Encoder, and T5Decoder.
- Parameters
pretrained_model_name (optional) – a str, the name of the pre-trained model (e.g., T5-Small). Please refer to PretrainedT5Mixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- reset_parameters()[source]¶
Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The model arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
Otherwise, the model arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the model arch is defined by the configurations in hparams and weights are randomly initialized.
{
    "pretrained_model_name": "T5-Small",
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "vocab_size": 32128,
    "encoder": {
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": False,
            "is_decoder": False,
            "relative_attention_num_buckets": 32,
        },
        "eps": 1e-6,
        "name": "encoder",
        "num_blocks": 12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": False
                    },
                    "type": "Linear"
                },
                {"type": "ReLU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": False
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
    },
    "decoder": {
        "eps": 1e-6,
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": False,
            "is_decoder": True,
            "relative_attention_num_buckets": 32,
        },
        "name": "decoder",
        "num_blocks": 12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": False
                    },
                    "type": "Linear"
                },
                {"type": "ReLU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": False
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
    },
    "hidden_size": 768,
    "initializer": None,
    "name": "t5_encoder_decoder",
}
Here:
The default parameters are values for T5-Small model.
- “pretrained_model_name”: str or None
The name of the pre-trained T5 model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in T5 model.
- “encoder”: dict
Hyperparameters for the T5Encoder. See default_hparams() for details.
- “decoder”: dict
Hyperparameters for the T5Decoder. See default_hparams() for details.
- “hidden_size”: int
Size of the hidden layer.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
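The precedence described for pretrained_model_name (constructor argument first, then the hparams entry, then random initialization) can be sketched as follows. The function `resolve_model_name` is a hypothetical helper for illustration and is not part of the library.

```python
def resolve_model_name(pretrained_model_name=None, hparams=None):
    """Return the pre-trained model name to load, or None for random init."""
    # 1. The constructor argument wins whenever it is given.
    if pretrained_model_name is not None:
        return pretrained_model_name
    # 2. Otherwise fall back to the name stored in hparams, which may
    #    itself be None (meaning: build from hparams, random weights).
    return (hparams or {}).get("pretrained_model_name")


chosen = resolve_model_name("T5-Base", {"pretrained_model_name": "T5-Small"})
```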
- forward(inputs, sequence_length=None)[source]¶
Performs encoding and decoding.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length – A 1D torch.Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A pair (encoder_output, decoder_output):
encoder_output: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
decoder_output: An instance of TransformerDecoderOutput which contains sample_id and logits.
Pre-trained¶
PretrainedMixin¶
- class texar.torch.modules.PretrainedMixin(hparams=None)[source]¶
A mixin class for all pre-trained classes to inherit.
- load_pretrained_config(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Load paths and configurations of the pre-trained model.
- Parameters
pretrained_model_name (optional) – A str with the name of a pre-trained model to load. If None, will use the model name in hparams.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- reset_parameters()[source]¶
Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "pretrained_model_name": None, "name": "pretrained_base" }
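The phrase "missing hyperparameters will be set to default values" amounts to a recursive dictionary merge. The sketch below shows that idea in plain Python. It is not the library's HParams implementation, which may perform additional checks, and the helper name `merge_with_defaults` is hypothetical.

```python
import copy


def merge_with_defaults(hparams, defaults):
    """Fill in every key missing from `hparams` with its default value."""
    merged = copy.deepcopy(defaults)
    for key, value in (hparams or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts are merged key by key rather than replaced.
            merged[key] = merge_with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


defaults = {"pretrained_model_name": None, "name": "pretrained_base"}
merged = merge_with_defaults({"name": "my_model"}, defaults)
```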
- classmethod download_checkpoint(pretrained_model_name, cache_dir=None)[source]¶
Download the specified pre-trained checkpoint, and return the directory in which the checkpoint is cached.
- abstract classmethod _transform_config(pretrained_model_name, cache_dir)[source]¶
Load the official configuration file and transform it into Texar-style hyperparameters.
PretrainedBERTMixin¶
- class texar.torch.modules.PretrainedBERTMixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the BERT model.
Both standard BERT models and many domain specific BERT-based models are supported. You can specify the pretrained_model_name argument to pick which pre-trained BERT model to use. All available categories of pre-trained models (and names) include:
Standard BERT: proposed in (Devlin et al. 2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. A bidirectional Transformer language model pre-trained on large text corpora. Available model names include:
- bert-base-uncased: 12-layer, 768-hidden, 12-heads, 110M parameters.
- bert-large-uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters.
- bert-base-cased: 12-layer, 768-hidden, 12-heads, 110M parameters.
- bert-large-cased: 24-layer, 1024-hidden, 16-heads, 340M parameters.
- bert-base-multilingual-uncased: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters.
- bert-base-multilingual-cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters.
- bert-base-chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters.
BioBERT: proposed in (Lee et al. 2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. A domain specific language representation model pre-trained on large-scale biomedical corpora. Based on the BERT architecture, BioBERT effectively transfers the knowledge from a large amount of biomedical texts to biomedical text mining models with minimal task-specific architecture modifications. Available model names include:
- biobert-v1.0-pmc: BioBERT v1.0 (+ PMC 270K) - based on BERT-base-Cased (same vocabulary).
- biobert-v1.0-pubmed-pmc: BioBERT v1.0 (+ PubMed 200K + PMC 270K) - based on BERT-base-Cased (same vocabulary).
- biobert-v1.0-pubmed: BioBERT v1.0 (+ PubMed 200K) - based on BERT-base-Cased (same vocabulary).
- biobert-v1.1-pubmed: BioBERT v1.1 (+ PubMed 1M) - based on BERT-base-Cased (same vocabulary).
SciBERT: proposed in (Beltagy et al. 2019) SciBERT: A Pretrained Language Model for Scientific Text. A BERT model trained on scientific text. SciBERT leverages unsupervised pre-training on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. Available model names include:
- scibert-scivocab-uncased: Uncased version of the model trained on its own vocabulary.
- scibert-scivocab-cased: Cased version of the model trained on its own vocabulary.
- scibert-basevocab-uncased: Uncased version of the model trained on the original BERT vocabulary.
- scibert-basevocab-cased: Cased version of the model trained on the original BERT vocabulary.
SpanBERT: proposed in (Joshi et al. 2019) SpanBERT: Improving Pre-training by Representing and Predicting Spans. As a variant of the standard BERT model, SpanBERT extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Differing from the standard BERT, the SpanBERT model does not use segmentation embedding. Available model names include:
- spanbert-base-cased: SpanBERT using the BERT-base architecture, 12-layer, 768-hidden, 12-heads, 110M parameters.
- spanbert-large-cased: SpanBERT using the BERT-large architecture, 24-layer, 1024-hidden, 16-heads, 340M parameters.
We provide the following BERT classes:
- BERTEncoder for text encoding.
- BERTClassifier for text classification and sequence tagging.
PretrainedRoBERTaMixin¶
- class texar.torch.modules.PretrainedRoBERTaMixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the RoBERTa model.
The RoBERTa model was proposed in (Liu et al. 2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. As a variant of the standard BERT model, RoBERTa trains for more iterations on more data with a larger batch size as well as other tweaks in pre-training. Differing from the standard BERT, the RoBERTa model does not use segmentation embedding. Available model names include:
- roberta-base: RoBERTa using the BERT-base architecture, 125M parameters.
- roberta-large: RoBERTa using the BERT-large architecture, 355M parameters.
We provide the following RoBERTa classes:
- RoBERTaEncoder for text encoding.
- RoBERTaClassifier for text classification and sequence tagging.
PretrainedGPT2Mixin¶
- class texar.torch.modules.PretrainedGPT2Mixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the GPT2 model.
The GPT2 model was proposed in Language Models are Unsupervised Multitask Learners by Radford et al. from OpenAI. It is a unidirectional Transformer model pre-trained using the vanilla language modeling objective on a large corpus.
The available GPT2 models are as follows:
- gpt2-small: Small version of GPT-2, 124M parameters.
- gpt2-medium: Medium version of GPT-2, 355M parameters.
- gpt2-large: Large version of GPT-2, 774M parameters.
- gpt2-xl: XL version of GPT-2, 1558M parameters.
We provide the following GPT2 classes:
- GPT2Encoder for text encoding.
- GPT2Decoder for text generation and decoding.
- GPT2Classifier for text classification and sequence tagging.
PretrainedXLNetMixin¶
- class texar.torch.modules.PretrainedXLNetMixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the XLNet model.
The XLNet model was proposed in XLNet: Generalized Autoregressive Pretraining for Language Understanding by Yang et al. It is based on the Transformer-XL model, pre-trained on a large corpus using a language modeling objective that considers all permutations of the input sentence.
The available XLNet models are as follows:
- xlnet-base-cased: 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).
- xlnet-large-cased: 24-layer, 1024-hidden, 16-heads.
We provide the following XLNet classes:
- XLNetEncoder for text encoding.
- XLNetDecoder for text generation and decoding.
- XLNetClassifier for text classification and sequence tagging.
- XLNetRegressor for text regression.
PretrainedT5Mixin¶
- class texar.torch.modules.PretrainedT5Mixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the T5 model.
The T5 model was proposed in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Raffel et al. from Google. It treats multiple NLP tasks in a similar manner by encoding the different tasks as text directives in the input stream. This enables a single model to be trained supervised on a wide variety of NLP tasks. The T5 model examines factors relevant for leveraging transfer learning at scale from pure unsupervised pre-training to supervised tasks.
The available T5 models are as follows:
- T5-Small: Small version of T5, 60 million parameters.
- T5-Base: Baseline version of T5, 220 million parameters.
- T5-Large: Large version of T5, 770 million parameters.
- T5-3B: A version of T5 with 3 billion parameters.
- T5-11B: A version of T5 with 11 billion parameters.
We provide the following classes:
- T5Encoder for loading weights for the encoder stack.
- T5Decoder for loading weights for the decoding stack.
- T5EncoderDecoder as a raw pre-trained model.
Connectors¶
ConnectorBase¶
- class texar.torch.modules.ConnectorBase(output_size, hparams=None)[source]¶
Base class inherited by all connector classes. A connector transforms inputs into outputs with any specified structure and shape. For example, it can transform the final state of an encoder into the initial state of a decoder, or perform stochastic sampling in between as in Variational Autoencoders (VAEs).
- Parameters
output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- property output_size¶
The feature size of the forward() output tensor(s). It is usually equal to the last dimension of the output tensor size.
ConstantConnector¶
- class texar.torch.modules.ConstantConnector(output_size, hparams=None)[source]¶
Creates a constant tensor or (nested) tuple of Tensors that contains a constant value.
- Parameters
output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size. If output_size is a tuple (1, 2, 3), then the output structure will be ([batch_size, 1], [batch_size, 2], [batch_size, 3]). If output_size is torch.Size([1, 2, 3]), then the output structure will be [batch_size, 1, 2, 3].
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
This connector does not have trainable parameters.
Example
import torch
from texar.torch.modules import ConstantConnector

state_size = (1, 2, 3)
connector = ConstantConnector(state_size, hparams={"value": 1.})
one_state = connector(batch_size=64)
# `one_state` structure: (Tensor_1, Tensor_2, Tensor_3),
# Tensor_1.size() == torch.Size([64, 1])
# Tensor_2.size() == torch.Size([64, 2])
# Tensor_3.size() == torch.Size([64, 3])
# Tensors are filled with 1.0.

size = torch.Size([1, 2, 3])
connector_size = ConstantConnector(size, hparams={"value": 2.})
size_state = connector_size(batch_size=64)
# `size_state` structure: Tensor with size [64, 1, 2, 3].
# Tensor is filled with 2.0.
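The distinction between a plain tuple (a structure of several tensors) and a torch.Size (the shape of one tensor) can be sketched without torch. In the illustration below, `Size` is a hypothetical stand-in for torch.Size and `constant_output_shapes` is a hypothetical helper. The code only computes shapes and does not create tensors.

```python
class Size(tuple):
    """Hypothetical stand-in for torch.Size, so no torch import is needed."""


def constant_output_shapes(output_size, batch_size):
    """Return the output shape(s) implied by `output_size`."""
    if isinstance(output_size, Size):
        # A Size describes a single tensor: prepend the batch dimension.
        return (batch_size,) + tuple(output_size)
    if isinstance(output_size, int):
        return (batch_size, output_size)
    # A plain tuple describes a structure: handle each element separately.
    return tuple(constant_output_shapes(s, batch_size) for s in output_size)


tuple_shapes = constant_output_shapes((1, 2, 3), batch_size=64)
size_shape = constant_output_shapes(Size([1, 2, 3]), batch_size=64)
```

The `Size` check comes first because `Size` is itself a tuple subclass and would otherwise be treated as a structure.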
ForwardConnector¶
- class texar.torch.modules.ForwardConnector(output_size, hparams=None)[source]¶
Transforms inputs to have specified structure.
Example:
from collections import namedtuple
from texar.torch.modules import ForwardConnector

state_size = namedtuple('LSTMStateTuple', ['h', 'c'])(256, 256)
# state_size == LSTMStateTuple(h=256, c=256)
connector = ForwardConnector(state_size)
output = connector([tensor_1, tensor_2])
# output == LSTMStateTuple(h=tensor_1, c=tensor_2)
- Parameters
output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
This connector does not have trainable parameters. See forward() for the inputs and outputs of the connector. The input to the connector must have the same structure as output_size, or must have the same number of elements and be re-packable into the structure of output_size. Note that if the input is or contains a dict instance, the keys will be sorted to pack in deterministic order (see pack_sequence_as()).
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "name": "forward_connector" }
Here:
- “name”: str
Name of the connector.
- forward(inputs)[source]¶
Transforms inputs to have the same structure as output_size. Values of the inputs are not changed. inputs must either have the same structure as output_size, or have the same number of elements.
- Parameters
inputs – The input (structure of) tensor to pass forward.
- Returns
A (structure of) tensors that re-packs inputs to have the specified structure of output_size.
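Re-packing a flat sequence into a target structure can be illustrated with the standard library alone. The function `repack` below is a hypothetical sketch of the idea, including the sorted-key rule for dicts. It is not the library's pack_sequence_as(), and strings stand in for tensors.

```python
from collections import namedtuple


def repack(flat_inputs, structure):
    """Re-pack a flat list of values into the (possibly nested) structure."""
    values = iter(flat_inputs)
    missing = object()

    def build(node):
        if isinstance(node, dict):
            # Dict keys are visited in sorted order for determinism.
            return {key: build(node[key]) for key in sorted(node)}
        if isinstance(node, tuple) and hasattr(node, "_fields"):
            # namedtuple: fill the fields positionally.
            return type(node)(*[build(child) for child in node])
        if isinstance(node, (list, tuple)):
            return type(node)([build(child) for child in node])
        value = next(values, missing)
        if value is missing:
            raise ValueError("inputs has fewer elements than the structure")
        return value

    packed = build(structure)
    if next(values, missing) is not missing:
        raise ValueError("inputs has more elements than the structure")
    return packed


LSTMStateTuple = namedtuple("LSTMStateTuple", ["h", "c"])
state_size = LSTMStateTuple(h=256, c=256)
output = repack(["tensor_1", "tensor_2"], state_size)
```

The values themselves pass through unchanged; only the container around them is rebuilt.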
MLPTransformConnector¶
- class texar.torch.modules.MLPTransformConnector(output_size, linear_layer_dim, hparams=None)[source]¶
Transforms inputs with an MLP layer and packs the results into the specified structure and size.
Example
cell = LSTMCell(num_units=256)
# cell.state_size == LSTMStateTuple(c=256, h=256)
# linear_layer_dim must match the final dimension of the inputs (10 here).
connector = MLPTransformConnector(cell.state_size, linear_layer_dim=10)
inputs = torch.zeros([64, 10])
output = connector(inputs)
# output == LSTMStateTuple(c=tensor_of_shape_(64, 256),
#                          h=tensor_of_shape_(64, 256))

## Use to connect encoder and decoder with different state size
encoder = UnidirectionalRNNEncoder(...)
_, final_state = encoder(inputs=...)
decoder = BasicRNNDecoder(...)
connector = MLPTransformConnector(decoder.state_size, linear_layer_dim=...)
_ = decoder(
    initial_state=connector(final_state),
    ...)
- Parameters
output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size.
linear_layer_dim (int) – Size of the final dimension of the input tensors, i.e., the input dimension of the MLP linear layer.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
The input to the connector can have arbitrary structure and size.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "activation_fn": "texar.torch.core.layers.identity", "name": "mlp_connector" }
Here:
- “activation_fn”: str or callable
The activation function applied to the outputs of the MLP transformation layer. Can be a function, or its name or module path.
- “name”: str
Name of the connector.
- forward(inputs)[source]¶
Transforms inputs with an MLP layer and packs the results to have the same structure as specified by output_size.
- Parameters
inputs – Input (structure of) tensors to be transformed. Must be a tensor of shape [batch_size, ...] or a (nested) tuple of such Tensors. That is, the first dimension of (each) tensor must be the batch dimension.
- Returns
A tensor or a (nested) tuple of tensors with the same structure as output_size.
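One plausible way to pack the result of a single linear layer into a structured output is to emit sum(output_size) features and split them. The sketch below shows only that splitting step for one example row, using plain lists. It is an assumption about the mechanics, not a description of the library's code, and `split_and_pack` is a hypothetical name.

```python
from collections import namedtuple


def split_and_pack(flat_output, output_size):
    """Split one row of layer output into chunks matching output_size."""
    sizes = list(output_size)
    if len(flat_output) != sum(sizes):
        raise ValueError("output width must equal the sum of output sizes")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(flat_output[start:start + size])
        start += size
    # Preserve namedtuple types such as an LSTM state tuple.
    if hasattr(output_size, "_fields"):
        return type(output_size)(*chunks)
    return tuple(chunks)


packed = split_and_pack([1, 2, 3, 4, 5, 6], (1, 2, 3))
```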
Networks¶
FeedForwardNetworkBase¶
- class texar.torch.modules.FeedForwardNetworkBase(hparams=None)[source]¶
Base class inherited by all feed-forward network classes.
- Parameters
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "name": "NN" }
- forward(input)[source]¶
Feeds forward inputs through the network layers and returns outputs.
- Parameters
input – The inputs to the network. The requirements on inputs depends on the first layer and subsequent layers in the network.
- Returns
The output of the network.
- append_layer(layer)[source]¶
Appends a layer to the end of the network.
- Parameters
layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.
- has_layer(layer_name)[source]¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)[source]¶
Returns the layer with the name. Returns None if the layer name does not exist.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.
FeedForwardNetwork¶
- class texar.torch.modules.FeedForwardNetwork(layers=None, hparams=None)[source]¶
Feed-forward neural network that consists of a sequence of layers.
- Parameters
layers (list, optional) – A list of torch.nn.Linear instances composing the network. If not given, layers are created according to hparams.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs.
Example
hparams = { # Builds a two-layer dense NN
    "layers": [
        {
            "type": "Linear",
            "kwargs": {"in_features": 100, "out_features": 256}
        },
        {
            "type": "Linear",
            "kwargs": {"in_features": 256, "out_features": 10}
        }
    ]
}
nn = FeedForwardNetwork(hparams=hparams)
inputs = torch.randn([64, 100])
outputs = nn(inputs)
# outputs == Tensor of shape [64, 10]
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "layers": [], "name": "NN" }
Here:
- “layers”: list
A list of layer hyperparameters. See get_layer() for details on layer hyperparameters.
- “name”: str
Name of the network.
- forward(input)¶
Feeds forward inputs through the network layers and returns outputs.
- Parameters
input – The inputs to the network. The requirements on inputs depends on the first layer and subsequent layers in the network.
- Returns
The output of the network.
- append_layer(layer)¶
Appends a layer to the end of the network.
- Parameters
layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.
- has_layer(layer_name)¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)¶
Returns the layer with the name. Returns None if the layer name does not exist.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.
Conv1DNetwork¶
- class texar.torch.modules.Conv1DNetwork(in_channels, in_features=None, hparams=None)[source]¶
Simple Conv-1D network which consists of a sequence of convolutional layers followed by a sequence of dense layers.
- Parameters
in_channels (int) – Number of channels in the input tensor.
in_features (int, optional) – Size of the feature dimension in the input tensor.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs. If "data_format" is set to "channels_first" (this is the default), inputs must be a tensor of shape [batch_size, channels, length]. If "data_format" is set to "channels_last", inputs must be a tensor of shape [batch_size, length, channels]. For example, for sequence classification, length corresponds to time steps, and channels corresponds to embedding dim.
Example:
import torch
from texar.torch.modules import Conv1DNetwork

nn = Conv1DNetwork(in_channels=20, in_features=256)  # Use the default hparams
inputs = torch.randn([64, 20, 256])
outputs = nn(inputs)
# outputs == Tensor of shape [64, 256], because the final dense layer
# has size 256.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Conv layers
    "num_conv_layers": 1,
    "out_channels": 128,
    "kernel_size": [3, 4, 5],
    "conv_activation": "ReLU",
    "conv_activation_kwargs": None,
    "other_conv_kwargs": {},
    "data_format": "channels_first",
    # (2) Pooling layers
    "pooling": "MaxPool1d",
    "pool_size": None,
    "pool_stride": 1,
    "other_pool_kwargs": {},
    # (3) Dense layers
    "num_dense_layers": 1,
    "out_features": 256,
    "dense_activation": None,
    "dense_activation_kwargs": None,
    "final_dense_activation": None,
    "final_dense_activation_kwargs": None,
    "other_dense_kwargs": None,
    # (4) Dropout
    "dropout_conv": [1],
    "dropout_dense": [],
    "dropout_rate": 0.75,
    # (5) Others
    "name": "conv1d_network"
}
Here:
For convolutional layers:
- “num_conv_layers”: int
Number of convolutional layers.
- “out_channels”: int or list
The number of out_channels in the convolution, i.e., the dimensionality of the output space.
If "num_conv_layers" > 1 and "out_channels" is an int, all convolution layers will have the same number of output channels.
If "num_conv_layers" > 1 and "out_channels" is a list, the length must equal "num_conv_layers". The number of output channels of each convolution layer will be the corresponding element from this list.
- “kernel_size”: int or list
Lengths of 1D convolution windows.
If "num_conv_layers" = 1, this can also be a list of int of arbitrary length denoting differently sized convolution windows. The number of output channels of each size is specified by "out_channels". For example, the default values will create 3 convolution layers with kernel sizes 3, 4, and 5, respectively, each with 128 output channels.
If "num_conv_layers" > 1, this must be a list of length "num_conv_layers". Each element can be an int or a list of int of arbitrary length denoting the kernel size of each layer.
- “conv_activation”: str or callable
Activation applied to the output of the convolutional layers. Set to None to maintain a linear activation. See get_layer() for more details.
- “conv_activation_kwargs”: dict, optional
Keyword arguments for the activation following the convolutional layer. See get_layer() for more details.
- “other_conv_kwargs”: list or dict, optional
Other keyword arguments for the torch.nn.Conv1d constructor, e.g., padding.
If a dict, the same dict is applied to all the convolution layers.
If a list, the length must equal "num_conv_layers". This list can contain nested lists. If the convolution layer at index i has multiple kernel sizes, then the corresponding element of this list can also be a list whose length equals the number of kernel sizes at index i. If the element at index i is instead a dict, then the same dict gets applied to all the convolution layers at index i.
- “data_format”: str, optional
Data format of the input tensor. Defaults to channels_first, which treats the first dimension after the batch dimension as the channel dimension. Set it to channels_last to treat the last dimension as the channel dimension. This argument can also be passed to the forward function, in which case the value specified here will be ignored.
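The rules for "out_channels" and "kernel_size" above can be summarised in a short plain-Python sketch. The helper `expand_conv_hparams` is hypothetical and is not part of Texar; it only normalises the two hyperparameters into one `(kernel_sizes, out_channels)` pair per convolutional layer, following the description above.

```python
def expand_conv_hparams(num_conv_layers, out_channels, kernel_size):
    """Hypothetical helper: one (kernel_sizes, out_channels) pair per layer."""
    # An int is shared by all layers; a list must give one entry per layer.
    if isinstance(out_channels, int):
        channels = [out_channels] * num_conv_layers
    else:
        if len(out_channels) != num_conv_layers:
            raise ValueError("out_channels must have num_conv_layers entries")
        channels = list(out_channels)

    if num_conv_layers == 1:
        # A single layer may take an int or a list of window sizes.
        kernels = [kernel_size]
    else:
        is_list = isinstance(kernel_size, list)
        if not is_list or len(kernel_size) != num_conv_layers:
            raise ValueError("kernel_size must have num_conv_layers entries")
        kernels = kernel_size

    # Normalise every entry to a list of window sizes.
    kernels = [[k] if isinstance(k, int) else list(k) for k in kernels]
    return list(zip(kernels, channels))


# Default hyperparameters: one layer index with three window sizes.
print(expand_conv_hparams(1, 128, [3, 4, 5]))
# → [([3, 4, 5], 128)]

# Two layers, the second one with two window sizes.
print(expand_conv_hparams(2, [64, 128], [3, [4, 5]]))
# → [([3], 64), ([4, 5], 128)]
```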
For pooling layers:
- “pooling”: str or class or instance
Pooling layer after each of the convolutional layer(s). Can be a pooling layer class, its name or module path, or a class instance.
- “pool_size”: int or list, optional
Size of the pooling window. If an int, all pooling layers will have the same pool size. If a list, the list length must equal "num_conv_layers". If None and the pooling type is either MaxPool1d or AvgPool1d, the pool size will be set to the input size. That is, the output of the pooling layer is a single unit.
- “pool_stride”: int or list, optional
Strides of the pooling operation. If an int, all layers will have the same stride. If a list, the list length must equal "num_conv_layers".
- “other_pool_kwargs”: list or dict, optional
Other keyword arguments for the pooling layer class constructor.
If a dict, the same dict is applied to all the pooling layers.
If a list, the length must equal "num_conv_layers". The pooling arguments for layer i will be the element at index i from this list.
For dense layers (note that here dense layers always follow convolutional and pooling layers):
- “num_dense_layers”: int
Number of dense layers.
- “out_features”: int or list
Dimension of features after the dense layers. If an int, all dense layers will have the same feature dimension. If a list of int, the list length must equal "num_dense_layers".
- “dense_activation”: str or callable
Activation function applied to the output of the dense layers except the last dense layer output. Set to None to maintain a linear activation.
- “dense_activation_kwargs”: dict, optional
Keyword arguments for dense layer activation functions before the last dense layer.
- “final_dense_activation”: str or callable
Activation function applied to the output of the last dense layer. Set to None to maintain a linear activation.
- “final_dense_activation_kwargs”: dict, optional
Keyword arguments for the activation function of the last dense layer.
- “other_dense_kwargs”: dict, optional
Other keyword arguments for dense layer class constructor.
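The split between "dense_activation" and "final_dense_activation" can be sketched as follows. The helper `dense_layer_plan` is hypothetical and not part of Texar, and the activation strings are used only as labels. It pairs every dense layer with its output size and the activation the description above assigns to it.

```python
def dense_layer_plan(num_dense_layers, out_features,
                     dense_activation=None, final_dense_activation=None):
    """Hypothetical helper: (out_features, activation) for each dense layer."""
    if isinstance(out_features, int):
        sizes = [out_features] * num_dense_layers
    else:
        if len(out_features) != num_dense_layers:
            raise ValueError("out_features must have num_dense_layers entries")
        sizes = list(out_features)

    plan = []
    for i, size in enumerate(sizes):
        is_last = i == num_dense_layers - 1
        # Only the last layer uses the "final" activation.
        activation = final_dense_activation if is_last else dense_activation
        plan.append((size, activation))
    return plan


# Default hyperparameters: a single linear layer of size 256.
print(dense_layer_plan(1, 256))
# → [(256, None)]

print(dense_layer_plan(3, [512, 256, 10], "ReLU", "Tanh"))
# → [(512, 'ReLU'), (256, 'ReLU'), (10, 'Tanh')]
```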
For dropouts:
- “dropout_conv”: int or list
The indices of convolutional layers (starting from 0) whose inputs have dropout applied. The index = num_conv_layers means dropout applies to the final convolutional layer output. For example,
{ "num_conv_layers": 2, "dropout_conv": [0, 2] }
will lead to a series of layers as -dropout-conv0-conv1-dropout-.
The dropout mode (training or not) is controlled by self.training.
- “dropout_dense”: int or list
Same as "dropout_conv" but applied to dense layers (index starting from 0).
- “dropout_rate”: float
The dropout rate, between 0 and 1. For example, "dropout_rate": 0.1 would drop out 10% of elements.
Others:
- “name”: str
Name of the network.
- forward(input, sequence_length=None, dtype=None, data_format=None)[source]¶
Feeds forward inputs through the network layers and returns outputs.
- Parameters
input – The inputs to the network, which is a 3D tensor.
sequence_length (optional) – A torch.LongTensor of shape [batch_size] or a python array containing the length of each element in input. If given, time steps beyond the length will first be masked out before feeding to the layers.
dtype (optional) – Type of the inputs. If not provided, infers from inputs automatically.
data_format (optional) – Data format of the input tensor. If channels_last, the last dimension will be treated as the channel dimension, so the size of the input should be [batch_size, X, channel]. If channels_first, the first dimension after the batch dimension will be treated as the channel dimension, so the size should be [batch_size, channel, X]. Defaults to None. If None, the value will be picked from hyperparameters.
- Returns
The output of the final layer.
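The effect of `sequence_length` can be illustrated on nested Python lists. The helper `mask_beyond_length` is hypothetical and not part of Texar, and it assumes that "masked out" means replaced by zeros. It also shows how the position of the time axis depends on `data_format`.

```python
def mask_beyond_length(batch, sequence_length, data_format="channels_first"):
    """Hypothetical helper: zero out time steps at or past each length.

    `batch` is a nested list shaped [batch, channels, time] for
    "channels_first" or [batch, time, channels] for "channels_last".
    """
    masked = []
    for example, length in zip(batch, sequence_length):
        if data_format == "channels_first":
            masked.append([
                [v if t < length else 0.0 for t, v in enumerate(channel)]
                for channel in example
            ])
        else:
            masked.append([
                list(step) if t < length else [0.0] * len(step)
                for t, step in enumerate(example)
            ])
    return masked


# One example with 2 channels, 4 time steps and a true length of 3.
batch = [[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]]
result = mask_beyond_length(batch, [3])
print(result)  # [[[1.0, 2.0, 3.0, 0.0], [5.0, 6.0, 7.0, 0.0]]]
```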
- append_layer(layer)¶
Appends a layer to the end of the network.
- Parameters
layer – An instance of a torch.nn.Module subclass, or a dict of layer hyperparameters.
- has_layer(layer_name)¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)¶
Returns the layer with the given name. Returns None if no layer has that name.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.