Modules¶
ModuleBase¶
- class texar.torch.ModuleBase(hparams=None)[source]¶
Base class inherited by modules that are configurable through hyperparameters.
This is a subclass of torch.nn.Module.
A Texar module inheriting ModuleBase is configurable through hyperparameters. That is, each module defines allowed hyperparameters and default values. Hyperparameters not specified by users will take default values.
- Parameters
hparams (dict, optional) – Hyperparameters of the module. See default_hparams() for the structure and default values.
- static default_hparams()[source]¶
Returns a dict of hyperparameters of the module with default values. Used to replace the missing values of input hparams during module construction.
{ "name": "module" }
- property trainable_variables¶
The list of trainable variables (parameters) of the module. Parameters of this module and all its submodules are included.
Note
The list returned may contain duplicate parameters (e.g. output layer shares parameters with embeddings). For most usages, it’s not necessary to ensure uniqueness.
- property output_size¶
The feature size of forward() output tensor(s), usually equal to the last dimension of the output tensor.
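To make the hyperparameter mechanism concrete, below is a minimal sketch of a custom subclass. The class name and the "in_dim"/"out_dim" hyperparameters are invented for illustration, and the sketch assumes the parsed hyperparameters are available as self._hparams, as in Texar modules.

from torch import nn
from texar.torch import ModuleBase

class MyProjection(ModuleBase):
    """A toy module: a single linear projection configured via hparams."""

    def __init__(self, hparams=None):
        super().__init__(hparams=hparams)
        # Hyperparameters not given by the user fall back to default_hparams()
        self.linear = nn.Linear(self._hparams.in_dim, self._hparams.out_dim)

    @staticmethod
    def default_hparams():
        # "in_dim" and "out_dim" are hypothetical; only "name" is required
        return {"name": "my_projection", "in_dim": 64, "out_dim": 32}

    def forward(self, inputs):
        return self.linear(inputs)

    @property
    def output_size(self):
        return self._hparams.out_dim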
Embedders¶
WordEmbedder¶
- class texar.torch.modules.WordEmbedder(init_value=None, vocab_size=None, hparams=None)[source]¶
Simple word embedder that maps indexes into embeddings. The indexes can be soft (e.g., distributions over vocabulary).
Either init_value or vocab_size is required. If both are given, there must be init_value.shape[0] == vocab_size.
- Parameters
init_value (optional) –
A Tensor or numpy array that contains the initial value of embeddings. It is typically of shape [vocab_size] + embedding-dim. Embeddings can have dimensionality > 1.
If None, embedding is initialized as specified in hparams["initializer"]. Otherwise, the "initializer" and "dim" hyperparameters in hparams are ignored.
vocab_size (int, optional) – The vocabulary size. Required if init_value is not given.
hparams (dict, optional) – Embedder hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs of the embedder.
Example:
ids = torch.empty([32, 10]).uniform_(to=10).type(torch.int64)
soft_ids = torch.empty([32, 10, 100]).uniform_()

embedder = WordEmbedder(vocab_size=100, hparams={'dim': 256})

ids_emb = embedder(ids=ids)                  # shape: [32, 10, 256]
soft_ids_emb = embedder(soft_ids=soft_ids)   # shape: [32, 10, 256]
# Use with Texar data module
hparams = {
    'dataset': {
        'embedding_init': {'file': 'word2vec.txt'}
        ...
    },
}
data = MonoTextData(data_params)
iterator = DataIterator(data)
batch = next(iter(iterator))

# Use data vocab size
embedder_1 = WordEmbedder(vocab_size=data.vocab.size)
emb_1 = embedder_1(batch['text_ids'])

# Use pre-trained embedding
embedder_2 = WordEmbedder(init_value=data.embedding_init_value)
emb_2 = embedder_2(batch['text_ids'])
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    "dim": 100,
    "dropout_rate": 0,
    "dropout_strategy": 'element',
    "initializer": {
        "type": "random_uniform_initializer",
        "kwargs": {
            "minval": -0.1,
            "maxval": 0.1,
            "seed": None
        }
    },
    "trainable": True,
    "name": "word_embedder",
}
Here:
- “dim”: int or list
Embedding dimension. Can be a list of integers to yield embeddings with dimensionality > 1.
Ignored if init_value is given to the embedder constructor.
- “dropout_rate”: float
The dropout rate between 0 and 1. For example, dropout_rate=0.1 would zero out 10% of the embeddings. Set to 0 to disable dropout.
- “dropout_strategy”: str
The dropout strategy. Can be one of the following:
"element": The regular strategy that drops individual elements in the embedding vectors.
"item": Drops individual items (e.g., words) entirely. For example, for the word sequence “the simpler the better”, the strategy can yield “_ simpler the better”, where the first “the” is dropped.
"item_type": Drops item types (e.g., word types). For example, for the above sequence, the strategy can yield “_ simpler _ better”, where the word type “the” is dropped. The dropout will never yield “_ simpler the better” as in the "item" strategy.
- “initializer”: dict or None
Hyperparameters of the initializer for embedding values. See get_initializer() for details. Ignored if init_value is given to the embedder constructor.
- “trainable”: bool
Whether the embedding parameters are trainable. If False, the embedding parameters are frozen.
- “name”: str
Name of the embedding variable.
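For instance, a hypothetical configuration that uses the "item" strategy to drop whole word embeddings (the dimension, rate, and vocabulary size below are illustrative):

emb_hparams = {
    'dim': 300,
    'dropout_rate': 0.2,
    'dropout_strategy': 'item',   # drop entire items (e.g., words) at random
}
embedder = WordEmbedder(vocab_size=20000, hparams=emb_hparams)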
- extra_repr()[source]¶
Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(ids=None, soft_ids=None, **kwargs)[source]¶
Embeds (soft) ids.
Either ids or soft_ids must be given, but not both.
- Parameters
ids (optional) – An integer tensor containing the ids to embed.
soft_ids (optional) – A tensor of weights (probabilities) used to mix the embedding vectors.
kwargs – Additional keyword arguments for torch.nn.functional.embedding besides params and ids.
- Returns
If ids is given, returns a Tensor of shape list(ids.shape) + embedding-dim. For example, if list(ids.shape) == [batch_size, max_time] and list(embedding.shape) == [vocab_size, emb_dim], then the return tensor has shape [batch_size, max_time, emb_dim].
If soft_ids is given, returns a Tensor of shape list(soft_ids.shape)[:-1] + embedding-dim. For example, if list(soft_ids.shape) == [batch_size, max_time, vocab_size] and list(embedding.shape) == [vocab_size, emb_dim], then the return tensor has shape [batch_size, max_time, emb_dim].
- property embedding¶
The embedding tensor, of shape [vocab_size] + dim.
- property dim¶
The embedding dimension.
- property vocab_size¶
The vocabulary size.
- property num_embeddings¶
The vocabulary size. This interface matches torch.nn.Embedding.
PositionEmbedder¶
- class texar.torch.modules.PositionEmbedder(position_size=None, init_value=None, hparams=None)[source]¶
Simple position embedder that maps position indexes into embeddings via lookup.
Either init_value or position_size is required. If both are given, there must be init_value.shape[0] == position_size.
- Parameters
init_value (optional) –
A Tensor or numpy array that contains the initial value of embeddings. It is typically of shape [position_size, embedding dim].
If None, embedding is initialized as specified in hparams["initializer"]. Otherwise, the "initializer" and "dim" hyperparameters in hparams are ignored.
position_size (int, optional) – The number of possible positions, e.g., the maximum sequence length. Required if init_value is not given.
hparams (dict, optional) – Embedder hyperparameters. If not specified, the default hyperparameter setting is used. See default_hparams for the structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    "dim": 100,
    "initializer": {
        "type": "random_uniform_initializer",
        "kwargs": {
            "minval": -0.1,
            "maxval": 0.1,
            "seed": None
        }
    },
    "dropout_rate": 0,
    "dropout_strategy": 'element',
    "trainable": True,
    "name": "position_embedder"
}
The hyperparameters have the same meaning as those in texar.torch.modules.WordEmbedder.default_hparams().
- extra_repr()[source]¶
Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(positions=None, sequence_length=None, **kwargs)[source]¶
Embeds the positions.
Either positions or sequence_length is required:
If both are given, sequence_length is used to mask out embeddings of time steps beyond the respective sequence lengths.
If only sequence_length is given, then positions from 0 to sequence_length - 1 are embedded.
- Parameters
positions (optional) – A torch.LongTensor containing the position IDs to embed.
sequence_length (optional) – A torch.LongTensor of shape [batch_size]. Time steps beyond the respective sequence lengths will have zero-valued embeddings.
kwargs – Additional keyword arguments for torch.nn.functional.embedding besides params and ids.
- Returns
A Tensor of shape shape(inputs) + embedding dimension.
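A short usage sketch (the batch size, lengths, and dimension below are illustrative choices):

import torch
from texar.torch.modules import PositionEmbedder

embedder = PositionEmbedder(position_size=50, hparams={'dim': 256})

# Embed explicit position indices: output shape [32, 10, 256]
positions = torch.arange(10).expand(32, 10)
pos_emb = embedder(positions=positions)

# Embed positions 0 .. length-1 per example; steps beyond each length
# receive zero-valued embeddings
lengths = torch.randint(1, 11, (32,))
pos_emb_masked = embedder(sequence_length=lengths)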
- property embedding¶
The embedding tensor.
- property dim¶
The embedding dimension.
- property position_size¶
The position size, i.e., maximum number of positions.
SinusoidsPositionEmbedder¶
- class texar.torch.modules.SinusoidsPositionEmbedder(position_size=None, hparams=None)[source]¶
Sinusoid position embedder that maps position indexes into embeddings via sinusoid calculation. This module does not have trainable parameters. Used in, e.g., Transformer models (Vaswani et al.) “Attention Is All You Need”.
Each channel of the input Tensor is incremented by a sinusoid of a different frequency and phase. This allows attention to learn to use absolute and relative positions.
Timing signals should be added to some precursors of both the query and the memory inputs to attention. The use of relative position is possible because sin(x+y) and cos(x+y) can be expressed in terms of y, sin(x), and cos(x). In particular, we use a geometric sequence of timescales starting with min_timescale and ending with max_timescale. The number of different timescales is equal to dim / 2. For each timescale, we generate the two sinusoidal signals sin(timestep/timescale) and cos(timestep/timescale). All of these sinusoids are concatenated in the dim dimension.
- Parameters
position_size (int) – The number of possible positions, e.g., the maximum sequence length. Set position_size=None and hparams['cache_embeddings']=False to use arbitrarily large or negative position indices.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values. We use a geometric sequence of timescales starting with min_timescale and ending with max_timescale. The number of different timescales is equal to dim / 2.

{
    'min_timescale': 1.0,
    'max_timescale': 10000.0,
    'dim': 512,
    'cache_embeddings': True,
    'name': 'sinusoid_position_embedder',
}
Here:
- “cache_embeddings”: bool
If True, precompute embeddings for positions in range [0, position_size - 1]. This leads to faster lookup but requires lookup indices to be within this range.
If False, embeddings are computed on-the-fly during lookup. Set to False if your application needs to handle sequences of arbitrary length, or requires embeddings at negative positions.
- extra_repr()[source]¶
Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
- forward(positions=None, sequence_length=None, **kwargs)[source]¶
Embeds the positions. Either positions or sequence_length is required:
If both are given, sequence_length is used to mask out embeddings of time steps beyond the respective sequence lengths.
If only sequence_length is given, then positions from 0 to sequence_length - 1 are embedded.
- Parameters
positions (optional) – A torch.LongTensor containing the position IDs to embed.
sequence_length (optional) – A torch.LongTensor of shape [batch_size]. Time steps beyond the respective sequence lengths will have zero-valued embeddings.
- Returns
A Tensor of shape [batch_size, position_size, dim].
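A brief usage sketch (the sizes are illustrative; the module has no trainable parameters):

import torch
from texar.torch.modules import SinusoidsPositionEmbedder

embedder = SinusoidsPositionEmbedder(position_size=100, hparams={'dim': 512})

positions = torch.arange(20).expand(8, 20)
emb = embedder(positions=positions)  # one 512-dim sinusoid vector per position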
- property dim¶
The embedding dimension.
EmbedderBase¶
- class texar.torch.modules.EmbedderBase(num_embeds=None, init_value=None, hparams=None)[source]¶
The base embedder class that all embedder classes inherit.
- Parameters
num_embeds (int, optional) – The number of embedding elements, e.g., the vocabulary size of a word embedder.
init_value (Tensor or numpy array, optional) – Initial values of the embedding variable. If not given, embedding is initialized as specified in hparams["initializer"].
hparams (dict or HParams, optional) – Embedder hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "name": "embedder" }
- property num_embeds¶
The number of embedding elements.
Encoders¶
UnidirectionalRNNEncoder¶
- class texar.torch.modules.UnidirectionalRNNEncoder(input_size, cell=None, output_layer=None, hparams=None)[source]¶
One-directional RNN encoder.
- Parameters
input_size (int) – The number of expected features in the input for the cell.
cell (RNNCell, optional) – If not specified, a cell is created as specified in hparams["rnn_cell"].
output_layer (optional) – An instance of torch.nn.Module. Applied to the RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer"].
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs of the encoder.
Example:
# Use with embedder
embedder = WordEmbedder(vocab_size, hparams=emb_hparams)
encoder = UnidirectionalRNNEncoder(hparams=enc_hparams)

outputs, final_state = encoder(
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'])
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    "rnn_cell": default_rnn_cell_hparams(),
    "output_layer": {
        "num_layers": 0,
        "layer_size": 128,
        "activation": "identity",
        "final_layer_activation": None,
        "other_dense_kwargs": None,
        "dropout_layer_ids": [],
        "dropout_rate": 0.5,
        "variational_dropout": False
    },
    "name": "unidirectional_rnn_encoder"
}
Here:
- “rnn_cell”: dict
A dictionary of RNN cell hyperparameters. Ignored if cell is given to the encoder constructor.
The default value is defined in default_rnn_cell_hparams().
- “output_layer”: dict
Output layer hyperparameters. Ignored if output_layer is given to the encoder constructor. Includes:
- “num_layers”: int
The number of output (dense) layers. Set to 0 to avoid any output layers applied to the cell outputs.
- “layer_size”: int or list
The size of each of the output (dense) layers.
If an int, each output layer will have the same size. If a list, its length must equal num_layers.
- “activation”: str or callable or None
Activation function for each of the output (dense) layers except the final layer. This can be a function, or its string name or module path. If a function name is given, the function must be from torch.nn. For example:

    "activation": "relu"                         # function name
    "activation": "my_module.my_activation_fn"   # module path
    "activation": my_module.my_activation_fn     # function

Default is None, which results in an identity activation.
- “final_layer_activation”: str or callable or None
The activation function for the final output layer.
- “other_dense_kwargs”: dict or None
Other keyword arguments used to construct each of the output dense layers, e.g., bias. See torch.nn.Linear for the keyword arguments.
- “dropout_layer_ids”: int or list
The indexes of layers (starting from 0) whose inputs have dropout applied. An index equal to num_layers means dropout applies to the final layer output. For example,

    {
        "num_layers": 2,
        "dropout_layer_ids": [0, 2]
    }

leads to a series of layers as -dropout-layer0-layer1-dropout-.
The dropout mode (training or not) is controlled by self.training.
- “dropout_rate”: float
The dropout rate, between 0 and 1. For example, "dropout_rate": 0.1 would zero out 10% of elements.
- “variational_dropout”: bool
Whether the dropout mask is the same across all time steps.
- “name”: str
Name of the encoder.
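As an example of the "output_layer" hyperparameters above, the following hypothetical configuration stacks two 256-unit dense layers on the cell outputs, with dropout applied to the input of the first layer (all values are illustrative):

enc_hparams = {
    'output_layer': {
        'num_layers': 2,
        'layer_size': 256,
        'dropout_layer_ids': [0],
    },
}
encoder = UnidirectionalRNNEncoder(input_size=100, hparams=enc_hparams)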
- forward(inputs, sequence_length=None, initial_state=None, time_major=False, return_cell_output=False, return_output_size=False)[source]¶
Encodes the inputs.
- Parameters
inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.
sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero out outputs when past a batch element’s sequence length.
initial_state (optional) – Initial state of the RNN.
time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].
return_cell_output (bool) – Whether to return the output of the RNN cell. This is the result prior to the output layer.
return_output_size (bool) – Whether to return the size of the output (i.e., the result after output layers).
- Returns
By default (both return_cell_output and return_output_size are False), returns a pair (outputs, final_state), where
outputs: The RNN output tensor by the output layer (if it exists) or the RNN cell (otherwise). The tensor is of shape [batch_size, max_time, output_size] if time_major is False, or [max_time, batch_size, output_size] if time_major is True. If the RNN cell output is a (nested) tuple of Tensors, then outputs will be a (nested) tuple having the same nest structure as the cell output.
final_state: The final state of the RNN, which is a Tensor of shape [batch_size] + cell.state_size or a (nested) tuple of Tensors if cell.state_size is a (nested) tuple.
If return_cell_output is True, returns a triple (outputs, final_state, cell_outputs), where
cell_outputs: The outputs by the RNN cell prior to the output layer, having the same structure as outputs except for the output_dim.
If return_output_size is True, returns a tuple (outputs, final_state, output_size), where
output_size: A (possibly nested tuple of) int representing the size of outputs. If a single int or an int array, then outputs has shape [batch/time, time/batch] + output_size. If a (nested) tuple, then output_size has the same structure as outputs.
If both return_cell_output and return_output_size are True, returns (outputs, final_state, cell_outputs, output_size).
- property cell¶
The RNN cell.
- property state_size¶
The state size of the encoder cell. Same as encoder.cell.state_size.
- property output_layer¶
The output layer.
BidirectionalRNNEncoder¶
- class texar.torch.modules.BidirectionalRNNEncoder(input_size, cell_fw=None, cell_bw=None, output_layer_fw=None, output_layer_bw=None, hparams=None)[source]¶
Bidirectional forward-backward RNN encoder.
- Parameters
cell_fw (RNNCell, optional) – The forward RNN cell. If not given, a cell is created as specified in hparams["rnn_cell_fw"].
cell_bw (RNNCell, optional) – The backward RNN cell. If not given, a cell is created as specified in hparams["rnn_cell_bw"].
output_layer_fw (optional) – An instance of torch.nn.Module. Applied to the forward RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer_fw"].
output_layer_bw (optional) – An instance of torch.nn.Module. Applied to the backward RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer_bw"].
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
See forward() for the inputs and outputs of the encoder.
Example:
# Use with embedder
embedder = WordEmbedder(vocab_size, hparams=emb_hparams)
encoder = BidirectionalRNNEncoder(hparams=enc_hparams)

outputs, final_state = encoder(
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'])
# outputs == (outputs_fw, outputs_bw)
# final_state == (final_state_fw, final_state_bw)
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    "rnn_cell_fw": default_rnn_cell_hparams(),
    "rnn_cell_bw": default_rnn_cell_hparams(),
    "rnn_cell_share_config": True,
    "output_layer_fw": {
        "num_layers": 0,
        "layer_size": 128,
        "activation": "identity",
        "final_layer_activation": None,
        "other_dense_kwargs": None,
        "dropout_layer_ids": [],
        "dropout_rate": 0.5,
        "variational_dropout": False
    },
    "output_layer_bw": {
        # Same hyperparams and default values as "output_layer_fw"
        # ...
    },
    "output_layer_share_config": True,
    "name": "bidirectional_rnn_encoder"
}
Here:
- “rnn_cell_fw”: dict
Hyperparameters of the forward RNN cell. Ignored if cell_fw is given to the encoder constructor.
The default value is defined in default_rnn_cell_hparams().
- “rnn_cell_bw”: dict
Hyperparameters of the backward RNN cell. Ignored if cell_bw is given to the encoder constructor, or if “rnn_cell_share_config” is True.
The default value is defined in default_rnn_cell_hparams().
- “rnn_cell_share_config”: bool
Whether to share hyperparameters of the backward cell with the forward cell. Note that the cell parameters (variables) are not shared.
- “output_layer_fw”: dict
Hyperparameters of the forward output layer. Ignored if output_layer_fw is given to the constructor. See the "output_layer" field of UnidirectionalRNNEncoder for details.
- “output_layer_bw”: dict
Hyperparameters of the backward output layer. Ignored if output_layer_bw is given to the constructor. Has the same structure and defaults as "output_layer_fw".
Also ignored if output_layer_share_config is True.
- “output_layer_share_config”: bool
Whether to share hyperparameters of the backward output layer with the forward output layer. Note that the layer parameters (variables) are not shared.
- “name”: str
Name of the encoder.
- forward(inputs, sequence_length=None, initial_state_fw=None, initial_state_bw=None, time_major=False, return_cell_output=False, return_output_size=False)[source]¶
Encodes the inputs.
- Parameters
inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.
sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero out outputs when past a batch element’s sequence length.
initial_state_fw (optional) – Initial state of the forward RNN.
initial_state_bw (optional) – Initial state of the backward RNN.
time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].
return_cell_output (bool) – Whether to return the output of the RNN cell. This is the result prior to the output layer.
return_output_size (bool) – Whether to return the size of the output (i.e., the result after output layers).
- Returns
By default (both return_cell_output and return_output_size are False), returns a pair (outputs, final_state), where
outputs: A tuple (outputs_fw, outputs_bw) containing the forward and the backward RNN outputs, each of which is of shape [batch_size, max_time, output_dim] if time_major is False, or [max_time, batch_size, output_dim] if time_major is True. If the RNN cell output is a (nested) tuple of Tensors, then outputs_fw and outputs_bw will be a (nested) tuple having the same structure as the cell output.
final_state: A tuple (final_state_fw, final_state_bw) containing the final states of the forward and backward RNNs, each of which is a Tensor of shape [batch_size] + cell.state_size, or a (nested) tuple of Tensors if cell.state_size is a (nested) tuple.
If return_cell_output is True, returns a triple (outputs, final_state, cell_outputs), where
cell_outputs: A tuple (cell_outputs_fw, cell_outputs_bw) containing the outputs by the forward and backward RNN cells prior to the output layers, having the same structure as outputs except for the output_dim.
If return_output_size is True, returns a tuple (outputs, final_state, output_size), where
output_size: A tuple (output_size_fw, output_size_bw) containing the sizes of outputs_fw and outputs_bw, respectively. Taking *_fw for example, output_size_fw is a (possibly nested tuple of) int. If a single int or an int array, then outputs_fw has shape [batch/time, time/batch] + output_size_fw. If a (nested) tuple, then output_size_fw has the same structure as outputs_fw. The same applies to output_size_bw.
If both return_cell_output and return_output_size are True, returns (outputs, final_state, cell_outputs, output_size).
- property cell_fw¶
The forward RNN cell.
- property cell_bw¶
The backward RNN cell.
- property state_size_fw¶
The state size of the forward encoder cell. Same as encoder.cell_fw.state_size.
- property state_size_bw¶
The state size of the backward encoder cell. Same as encoder.cell_bw.state_size.
- property output_layer_fw¶
The output layer of the forward RNN.
- property output_layer_bw¶
The output layer of the backward RNN.
MultiheadAttentionEncoder¶
- class texar.torch.modules.MultiheadAttentionEncoder(input_size, hparams=None)[source]¶
Multi-head Attention Encoder.
- Parameters
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    "initializer": None,
    'num_heads': 8,
    'output_dim': 512,
    'num_units': 512,
    'dropout_rate': 0.1,
    'use_bias': False,
    "name": "multihead_attention"
}
Here:
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “num_heads”: int
Number of heads for attention calculation.
- “output_dim”: int
Output dimension of the returned tensor.
- “num_units”: int
Hidden dimension of the unsplit attention space. Should be divisible by “num_heads”.
- “dropout_rate”: float
Dropout rate in the attention.
- “use_bias”: bool
Use bias when projecting the key, value and query.
- “name”: str
Name of the module.
- forward(queries, memory, memory_attention_bias, cache=None)[source]¶
Encodes the inputs.
- Parameters
queries – A 3D tensor of shape [batch, length_query, depth_query].
memory – A 3D tensor of shape [batch, length_key, depth_key].
memory_attention_bias – A 3D tensor of shape [batch, length_key, num_units].
cache – Memory cache, used only when inferring the sentence from scratch.
- Returns
A tensor of shape
[batch_size, max_time, dim]
containing the encoded vectors.
TransformerEncoder¶
- class texar.torch.modules.TransformerEncoder(hparams=None)[source]¶
Transformer encoder that applies multi-head self attention for encoding sequences.
This module basically stacks MultiheadAttentionEncoder, FeedForwardNetwork, and residual connections. It supports two types of architectures: the standard Transformer encoder architecture first proposed in (Vaswani et al.) “Attention is All You Need”, and the variant first used in (Devlin et al.) BERT. See default_hparams() for the differences between the two architectures.
- Parameters
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- initialize_blocks()[source]¶
Helper function which initializes blocks for encoder.
Should be overridden by any classes where block initialization varies.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    "num_blocks": 6,
    "dim": 512,
    'use_bert_config': False,
    "embedding_dropout": 0.1,
    "residual_dropout": 0.1,
    "poswise_feedforward": default_transformer_poswise_net_hparams,
    'multihead_attention': {
        'name': 'multihead_attention',
        'num_units': 512,
        'num_heads': 8,
        'dropout_rate': 0.1,
        'output_dim': 512,
        'use_bias': False,
    },
    "eps": 1e-6,
    "initializer": None,
    "name": "transformer_encoder"
}
Here:
- “num_blocks”: int
Number of stacked blocks.
- “dim”: int
Hidden dimension of the encoders.
- “use_bert_config”: bool
If False, apply the standard Transformer Encoder architecture from the original paper (Vaswani et al.) “Attention is All You Need”. If True, apply the Transformer Encoder architecture used in BERT (Devlin et al.) and the default setting of TensorFlow. The differences lie in:
The standard architecture restricts the word embedding of the PAD token to all zeros; the BERT architecture does not.
The attention bias for padding tokens: standard architectures use -1e8 for the negative attention mask; BERT uses -1e4 instead.
The residual connections between internal tensors: in BERT, a residual layer connects the tensors after layer normalization; in standard architectures, the tensors are connected before layer normalization.
- “embedding_dropout”: float
Dropout rate of the input embedding.
- “residual_dropout”: float
Dropout rate of the residual connections.
- “eps”: float
Epsilon values for layer norm layers.
- “poswise_feedforward”: dict
Hyperparameters for a feed-forward network used in residual connections. Make sure the dimension of the output tensor is equal to "dim". See default_transformer_poswise_net_hparams() for details.
- “multihead_attention”: dict
Hyperparameters for the multi-head attention strategy. Make sure the "output_dim" in this module is equal to "dim". See MultiheadAttentionEncoder for details.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- forward(inputs, sequence_length)[source]¶
Encodes the inputs.
- Parameters
inputs – A 3D Tensor of shape [batch_size, max_time, dim], containing the embedding of input sequences. Note that the embedding dimension dim must equal “dim” in hparams. The input embedding is typically an aggregation of word embedding and position embedding.
sequence_length – A 1D torch.LongTensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A Tensor of shape
[batch_size, max_time, dim]
containing the encoded vectors.
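Putting the pieces together, a usage sketch of the encoder (the vocabulary size, sequence length, and the choice of sinusoidal position embeddings are illustrative assumptions, not requirements of the module):

import torch
from texar.torch.modules import (SinusoidsPositionEmbedder, TransformerEncoder,
                                 WordEmbedder)

word_embedder = WordEmbedder(vocab_size=10000, hparams={'dim': 512})
pos_embedder = SinusoidsPositionEmbedder(position_size=256, hparams={'dim': 512})
encoder = TransformerEncoder()  # default hparams: dim 512, 6 blocks

ids = torch.randint(0, 10000, (8, 30))
lengths = torch.full((8,), 30, dtype=torch.long)

# The encoder input aggregates word and position embeddings
inputs = word_embedder(ids) + pos_embedder(sequence_length=lengths)
outputs = encoder(inputs, sequence_length=lengths)  # shape: [8, 30, 512]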
BERTEncoder¶
- class texar.torch.modules.BERTEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw BERT Transformer for encoding sequences. Please see PretrainedBERTMixin for a brief description of BERT.
This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder, and a dense pooler.
- Parameters
pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (the texar_data folder under the user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- reset_parameters()[source]¶
Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams['pretrained_model_name'] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{
    "pretrained_model_name": "bert-base-uncased",
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "vocab_size": 30522,
    "segment_embed": {
        "dim": 768,
        "name": "token_type_embeddings"
    },
    "type_vocab_size": 2,
    "position_embed": {
        "dim": 768,
        "name": "position_embeddings"
    },
    "position_size": 512,
    "encoder": {
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": True
        },
        "name": "encoder",
        "num_blocks": 12,
        "eps": 1e-12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": True
                    },
                    "type": "Linear"
                },
                {"type": "BertGELU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": True
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
        "use_bert_config": True
    },
    "hidden_size": 768,
    "initializer": None,
    "name": "bert_encoder",
}
Here:
The default parameters are values for the uncased BERT-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained BERT model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in BERT model.
- “segment_embed”: dict
Hyperparameters for segment embedding layer.
- “type_vocab_size”: int
The vocabulary size of the segment_ids passed into BertModel.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “encoder”: dict
Hyperparameters for the TransformerEncoder. See default_hparams() for details.
- “hidden_size”: int
Size of the pooler dense layer.
- “eps”: float
Epsilon values for layer norm layers.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- forward(inputs, sequence_length=None, segment_ids=None)[source]¶
Encodes the inputs. Note that the SpanBERT model does not use segmentation embedding. As a result, SpanBERT does not require segment_ids as an input when you use pre-trained SpanBERT checkpoint files.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids (optional) – A 2D Tensor of shape [batch_size, max_time], containing the segment ids of tokens in input sequences. If None (default), a tensor with all elements set to zero is used.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A pair (outputs, pooled_output):
outputs: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
pooled_output: A Tensor of size [batch_size, hidden_size], which is the output of a pooler pre-trained on top of the hidden state associated with the first token of the input (CLS); see BERT’s paper.
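A brief usage sketch (assumes the pre-trained checkpoint can be downloaded; the batch contents are illustrative):

import torch
from texar.torch.modules import BERTEncoder

encoder = BERTEncoder(pretrained_model_name="bert-base-uncased")

ids = torch.randint(0, 30522, (2, 16))
lengths = torch.tensor([16, 12])

outputs, pooled_output = encoder(inputs=ids, sequence_length=lengths)
# outputs: [2, 16, 768];  pooled_output: [2, 768]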
RoBERTaEncoder¶
- class texar.torch.modules.RoBERTaEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
RoBERTa Transformer for encoding sequences. Please see PretrainedRoBERTaMixin for a brief description of RoBERTa.
This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder, and a dense pooler.
- Parameters
pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., roberta-base). Please refer to PretrainedRoBERTaMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (the texar_data folder under the user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams['pretrained_model_name'] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{
    "pretrained_model_name": "roberta-base",
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "vocab_size": 50265,
    "position_embed": {
        "dim": 768,
        "name": "position_embeddings"
    },
    "position_size": 514,
    "encoder": {
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": True
        },
        "name": "encoder",
        "num_blocks": 12,
        "eps": 1e-12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": True
                    },
                    "type": "Linear"
                },
                {"type": "BertGELU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": True
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
        "use_bert_config": True
    },
    "hidden_size": 768,
    "initializer": None,
    "name": "roberta_encoder",
}
Here:
The default parameters are values for the RoBERTa-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained RoBERTa model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in RoBERTa model.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “encoder”: dict
Hyperparameters for the TransformerEncoder. See default_hparams() for details.
- “hidden_size”: int
Size of the pooler dense layer.
- “eps”: float
Epsilon values for layer norm layers.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- forward(inputs, sequence_length=None, segment_ids=None)[source]¶
Encodes the inputs. Unlike standard BERT, the RoBERTa model does not use segmentation embeddings. As a result, RoBERTa does not require segment_ids as an input.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A pair (outputs, pooled_output):
outputs: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
pooled_output: A Tensor of size [batch_size, hidden_size], which is the output of a pooler pre-trained on top of the hidden state associated with the first token of the input (CLS); see RoBERTa’s paper.
GPT2Encoder¶
- class texar.torch.modules.GPT2Encoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw GPT2 Transformer for encoding sequences. Please see PretrainedGPT2Mixin for a brief description of GPT2.
This module basically stacks WordEmbedder, PositionEmbedder, and TransformerEncoder.
- Parameters
pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., gpt2-small). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (the texar_data folder under the user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams['pretrained_model_name'] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{
    "pretrained_model_name": "gpt2-small",
    "vocab_size": 50257,
    "context_size": 1024,
    "embedding_size": 768,
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "position_size": 1024,
    "position_embed": {
        "dim": 768,
        "name": "position_embeddings"
    },
    "encoder": {
        "dim": 768,
        "num_blocks": 12,
        "use_bert_config": False,
        "embedding_dropout": 0,
        "residual_dropout": 0,
        "multihead_attention": {
            "use_bias": True,
            "num_units": 768,
            "num_heads": 12,
            "output_dim": 768
        },
        "eps": 1e-6,
        "initializer": {
            "type": "variance_scaling_initializer",
            "kwargs": {
                "factor": 1.0,
                "mode": "FAN_AVG",
                "uniform": True
            }
        },
        "poswise_feedforward": {
            "layers": [
                {
                    "type": "Linear",
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": True
                    }
                },
                {
                    "type": "GPTGELU",
                    "kwargs": {}
                },
                {
                    "type": "Linear",
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": True
                    }
                }
            ],
            "name": "ffn"
        }
    },
    "initializer": None,
    "name": "gpt2_encoder",
}
Here:
The default parameters are values for the 124M GPT2 model.
- “pretrained_model_name”: str or None
The name of the pre-trained GPT2 model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in GPT2Model.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “encoder”: dict
Hyperparameters for the TransformerEncoder. See default_hparams() for details.
- “eps”: float
Epsilon values for layer norm layers.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- forward(inputs, sequence_length=None)[source]¶
Encodes the inputs.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.
- Return type
outputs
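A brief usage sketch (the batch contents are illustrative; the first call downloads the pre-trained weights):

import torch
from texar.torch.modules import GPT2Encoder

encoder = GPT2Encoder(pretrained_model_name="gpt2-small")

ids = torch.randint(0, 50257, (2, 20))
outputs = encoder(inputs=ids)  # shape: [2, 20, 768]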
XLNetEncoder¶
- class texar.torch.modules.XLNetEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw XLNet module for encoding sequences. Please see PretrainedXLNetMixin for a brief description of XLNet.
- Parameters
pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (the texar_data folder under the user’s home directory) will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The encoder arch is determined by the constructor argument pretrained_model_name if it is specified. In this case, hparams are ignored.
Otherwise, the encoder arch is determined by hparams['pretrained_model_name'] if it is specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{
    "pretrained_model_name": "xlnet-base-cased",
    "untie_r": True,
    "num_layers": 12,
    "mem_len": 0,
    "reuse_len": 0,
    "num_heads": 12,
    "hidden_dim": 768,
    "head_dim": 64,
    "dropout": 0.1,
    "attention_dropout": 0.1,
    "use_segments": True,
    "ffn_inner_dim": 3072,
    "activation": 'gelu',
    "vocab_size": 32000,
    "max_seq_length": 512,
    "initializer": None,
    "name": "xlnet_encoder",
}
Here:
The default parameters are values for the cased XLNet-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained XLNet model. If None, the model will be randomly initialized.
- “untie_r”: bool
Whether to untie the biases in attention.
- “num_layers”: int
The number of stacked layers.
- “mem_len”: int
The number of tokens to cache.
- “reuse_len”: int
The number of tokens in the current batch to be cached and reused in the future.
- “num_heads”: int
The number of attention heads.
- “hidden_dim”: int
The hidden size.
- “head_dim”: int
The dimension size of each attention head.
- “dropout”: float
Dropout rate.
- “attention_dropout”: float
Dropout rate on attention probabilities.
- “use_segments”: bool
Whether to use segment embedding.
- “ffn_inner_dim”: int
The hidden size in feed-forward layers.
- “activation”: str
relu or gelu.
- “vocab_size”: int
The vocabulary size.
- “max_seq_length”: int
The maximum sequence length for RelativePositionalEncoding.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.
- “name”: str
Name of the module.
- param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]¶
Create parameter groups for optimizers. When lr_layer_scale is not 1.0, parameters from each layer form separate groups with different base learning rates.
The return value of this method can be used in the constructor of optimizers, for example:
model = XLNetEncoder(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)
- Parameters
lr (float) – The learning rate. Can be omitted if lr_layer_scale is 1.0.
lr_layer_scale (float) – Per-layer LR scaling rate. The i-th layer will be scaled by lr_layer_scale ^ (num_layers - i - 1).
decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.
- Returns
The parameter groups, used as the first argument for optimizers.
- forward(inputs, segment_ids=None, input_mask=None, memory=None, permute_mask=None, target_mapping=None, bi_data=False, clamp_len=None, cache_len=0, same_length=False, attn_type='bi', two_stream=False)[source]¶
Compute XLNet representations for the input.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids – Shape [batch_size, max_time].
input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.
memory – Memory from previous batches. A list of length num_layers, each tensor of shape [batch_size, mem_len, hidden_dim].
permute_mask – The permutation mask. Float tensor of shape [batch_size, max_time, max_time]. A value of 0 for permute_mask[i, j, k] indicates that position i attends to position j in batch k.
target_mapping – The target token mapping. Float tensor of shape [batch_size, num_targets, max_time]. A value of 1 for target_mapping[i, j, k] indicates that the i-th target token (in order of permutation) in batch k is the token at position j. Each row target_mapping[i, :, k] can have no more than one value of 1.
bi_data (bool) – Whether to use the bidirectional data input pipeline.
clamp_len (int) – Clamp all relative distances larger than clamp_len. A value of -1 means no clamping.
cache_len (int) – Length of memory (number of tokens) to cache.
same_length (bool) – Whether to use the same attention length for each token.
attn_type (str) – Attention type. Supported values are “uni” and “bi”.
two_stream (bool) – Whether to use two-stream attention. Only set to True when pre-training or generating text. Defaults to False.
- Returns
A tuple of (output, new_memory):
`output`: The final layer output representations. Shape [batch_size, max_time, hidden_dim].
`new_memory`: The memory of the current batch. If cache_len is 0, then new_memory is None. Otherwise, it is a list of length num_layers, each tensor of shape [batch_size, cache_len, hidden_dim]. This can be used as the memory argument in the next batch.
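A brief usage sketch showing memory caching (the batch contents are illustrative):

import torch
from texar.torch.modules import XLNetEncoder

encoder = XLNetEncoder(pretrained_model_name="xlnet-base-cased")

ids = torch.randint(0, 32000, (2, 24))
output, new_memory = encoder(inputs=ids, cache_len=24)
# output: [2, 24, 768]; new_memory: a list with one [2, 24, 768] tensor per layer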
Conv1DEncoder¶
- class texar.torch.modules.Conv1DEncoder(in_channels, in_features=None, hparams=None)[source]¶
Simple Conv-1D encoder which consists of a sequence of convolutional layers followed by a sequence of dense layers.
Wraps Conv1DNetwork to be a subclass of EncoderBase. Has exactly the same functionality as Conv1DNetwork.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The same as default_hparams() of Conv1DNetwork, except that the default name is "conv_encoder".
EncoderBase¶
RNNEncoderBase¶
- class texar.torch.modules.RNNEncoderBase(hparams=None)[source]¶
Base class for all RNN encoder classes to inherit.
- Parameters
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
default_transformer_poswise_net_hparams¶
- texar.torch.modules.default_transformer_poswise_net_hparams(input_dim, output_dim=512)[source]¶
Returns default hyperparameters of a FeedForwardNetwork as a position-wise network used in TransformerEncoder and TransformerDecoder. This is a 2-layer dense network with dropout in-between.

{
    "layers": [
        {
            "type": "Linear",
            "kwargs": {
                "in_features": input_dim,
                "out_features": output_dim * 4,
                "bias": True,
            }
        },
        {
            "type": "nn.ReLU",
            "kwargs": {
                "inplace": True
            }
        },
        {
            "type": "Dropout",
            "kwargs": {
                "p": 0.1,
            }
        },
        {
            "type": "Linear",
            "kwargs": {
                "in_features": output_dim * 4,
                "out_features": output_dim,
                "bias": True,
            }
        }
    ],
    "name": "ffn"
}
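For instance, this helper can be used to build the "poswise_feedforward" hyperparameters of a TransformerEncoder (a sketch; the dimension values are illustrative):

from texar.torch.modules import (TransformerEncoder,
                                 default_transformer_poswise_net_hparams)

ffn_hparams = default_transformer_poswise_net_hparams(input_dim=512, output_dim=512)
encoder = TransformerEncoder(hparams={'dim': 512,
                                      'poswise_feedforward': ffn_hparams})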
Decoders¶
DecoderBase¶
- class texar.torch.modules.DecoderBase(token_embedder=None, token_pos_embedder=None, input_time_major=False, output_time_major=False, hparams=None)[source]¶
Base class inherited by all RNN decoder classes. See BasicRNNDecoder for the arguments.
See forward() for the inputs and outputs of RNN decoders in general.
- embed_tokens(tokens, positions)[source]¶
Convert tokens along with positions to embeddings.
- Parameters
tokens – A torch.LongTensor denoting the token indices to convert to embeddings.
positions – A torch.LongTensor with the same size as tokens, denoting the positions of the tokens. This is useful if the decoder uses positional embeddings.
- Returns
A torch.Tensor of size tokens.size() + (embed_dim,), denoting the converted embeddings.
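For instance, the token_pos_embedder argument of a decoder constructor can be a callable with the same (tokens, positions) -> embeddings contract, as in this sketch (the embedder sizes are illustrative):

from texar.torch.modules import PositionEmbedder, WordEmbedder

word_embedder = WordEmbedder(vocab_size=10000, hparams={'dim': 256})
pos_embedder = PositionEmbedder(position_size=128, hparams={'dim': 256})

def token_pos_embedder(tokens, positions):
    # Combine word and position embeddings, matching the embed_tokens() contract
    return word_embedder(tokens) + pos_embedder(positions)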
- create_helper(*, decoding_strategy=None, start_tokens=None, end_token=None, softmax_temperature=None, infer_mode=None, **kwargs)[source]¶
Create a helper instance for the decoder. This is a shared interface for both BasicRNNDecoder and AttentionRNNDecoder.
The function provides 3 ways to specify the decoding method, with varying flexibility:
The decoding_strategy argument: A string taking one of the following values:
“train_greedy”: decoding in teacher-forcing fashion (i.e., feeding the ground truth to decode the next step), where each sample is obtained by taking the argmax of the output logits. Arguments (inputs, sequence_length) are required for this strategy, and argument embedding is optional.
“infer_greedy”: decoding in inference fashion (i.e., feeding the generated sample to decode the next step), where each sample is obtained by taking the argmax of the output logits. Arguments (embedding, start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.
“infer_sample”: decoding in inference fashion, where each sample is obtained by random sampling from the RNN output distribution. Arguments (embedding, start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.
This argument is used only when argument helper is None.
Example:
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

# Teacher-forcing decoding
outputs_1, _, _ = decoder(
    decoding_strategy='train_greedy',
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'] - 1)

# Random sample decoding. Gets 100 sequence samples
outputs_2, _, sequence_length = decoder(
    decoding_strategy='infer_sample',
    start_tokens=[data.vocab.bos_token_id] * 100,
    end_token=data.vocab.eos_token_id,
    embedding=embedder,
    max_decoding_length=60)
The helper argument: An instance of a subclass of Helper. This provides a superset of the decoding strategies above, for example:
TrainingHelper, corresponding to the “train_greedy” strategy.
ScheduledEmbeddingTrainingHelper and ScheduledOutputTrainingHelper for scheduled sampling.
SoftmaxEmbeddingHelper and GumbelSoftmaxEmbeddingHelper for soft decoding and gradient backpropagation.
This approach gives maximal flexibility in configuring the decoding strategy.
Example:
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

# Teacher-forcing decoding, same as above with
# `decoding_strategy='train_greedy'`
helper_1 = TrainingHelper(
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'] - 1)
outputs_1, _, _ = decoder(helper=helper_1)

# Gumbel-softmax decoding
helper_2 = GumbelSoftmaxEmbeddingHelper(
    embedding=embedder,
    start_tokens=[data.vocab.bos_token_id] * 100,
    end_token=data.vocab.eos_token_id,
    tau=0.1)
outputs_2, _, sequence_length = decoder(
    max_decoding_length=60, helper=helper_2)
hparams["helper_train"] and hparams["helper_infer"]: Specifying the helper through hyperparameters. The train or infer strategy is toggled based on mode. Appropriate arguments (e.g., inputs, start_tokens, etc.) are selected to construct the helper. Additional arguments for the helper constructor can be provided either through **kwargs, or through hparams["helper_train/infer"]["kwargs"].
This approach is used only when both decoding_strategy and helper are None.
Example:
h = {
    "helper_infer": {
        "type": "GumbelSoftmaxEmbeddingHelper",
        "kwargs": {"tau": 0.1}
    }
}

embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size, hparams=h)

# Gumbel-softmax decoding
decoder.eval()  # disable dropout
output, _, _ = decoder(
    decoding_strategy=None,  # Set to None explicitly
    embedding=embedder,
    start_tokens=[data.vocab.bos_token_id] * 100,
    end_token=data.vocab.eos_token_id,
    max_decoding_length=60)
- Parameters
decoding_strategy (str) – A string specifying the decoding strategy. Different arguments are required based on the strategy. Ignored if helper is given.
start_tokens (optional) – A torch.LongTensor of shape [batch_size], the start tokens. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when an hparams-configured helper is used. When used with the Texar data module, to get batch_size samples where batch_size changes according to the data module, this can be set as start_tokens=torch.full_like(batch['length'], bos_token_id).
end_token (optional) – An integer or 0D torch.LongTensor, the token that marks the end of decoding. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when an hparams-configured helper is used.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples. Must be > 0. If None, 1.0 is used. Used when decoding_strategy="infer_sample".
infer_mode (optional) – If not None, overrides the mode given by self.training.
**kwargs – Other keyword arguments for constructing helpers defined by hparams["helper_train"] or hparams["helper_infer"].
- Returns
The constructed helper instance.
- set_default_train_helper(helper)[source]¶
Set the default helper used in training mode.
- Parameters
helper – The helper to set as default training helper.
- set_default_infer_helper(helper)[source]¶
Set the default helper used in eval (inference) mode.
- Parameters
helper – The helper to set as default inference helper.
- dynamic_decode(helper, inputs, sequence_length, initial_state, max_decoding_length=None, impute_finished=False, step_hook=None)[source]¶
Generic routine for dynamic decoding. Please check the documentation for the TensorFlow counterpart.
- Returns
A tuple of output, final state, and sequence lengths. Note that the final state could be None when all sequences are of zero length and initial_state is also None.
- abstract initialize(helper, inputs, sequence_length, initial_state)[source]¶
Called before any decoding iterations.
This method must compute initial input values and initial state.
- Parameters
helper – The
Helper
instance to use.inputs (optional) – A (structure of) input tensors.
sequence_length (optional) – A torch.LongTensor representing lengths of each sequence.
initial_state – A possibly nested structure of tensors indicating the initial decoder state.
- Returns
A tuple
(finished, initial_inputs, initial_state)
representing initial values offinished
flags, inputs, and state.
- abstract step(helper, time, inputs, state)[source]¶
Compute the output and the state at the current time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(outputs, next_state)
.outputs
is an object containing the decoder output.next_state
is the decoder state for the next time step.
- abstract next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(next_inputs, finished)
.next_inputs
is the tensor that should be used as input for the next step.finished
is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
- finalize(outputs, final_state, sequence_lengths)[source]¶
Called after all decoding iterations have finished.
- Parameters
outputs – Outputs at each time step.
final_state – The RNNCell state after the last time step.
sequence_lengths – Sequence lengths for each sequence in batch.
- Returns
A tuple
(outputs, final_state)
.outputs
is an object containing the decoder output.final_state
is the final decoder state.
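The following is a rough, simplified sketch (not the library's actual implementation) of how a dynamic decoding routine composes initialize(), step(), next_inputs(), and finalize(); the real routine additionally stacks per-step outputs into output objects, imputes finished states, and enforces maximum decoding lengths:

import torch

def sketch_dynamic_decode(decoder, helper, inputs=None,
                          sequence_length=None, initial_state=None,
                          max_steps=100):
    # Compute initial `finished` flags, first inputs, and initial state.
    finished, step_inputs, state = decoder.initialize(
        helper, inputs, sequence_length, initial_state)
    step_outputs = []
    lengths = torch.zeros_like(finished, dtype=torch.long)
    for time in range(max_steps):
        # One decoding step: produce outputs and the next state.
        outputs, state = decoder.step(helper, time, step_inputs, state)
        # Ask for the next inputs and updated finished flags.
        step_inputs, next_finished = decoder.next_inputs(helper, time, outputs)
        lengths = torch.where(finished, lengths, lengths + 1)
        finished = finished | next_finished
        step_outputs.append(outputs)
        if bool(finished.all()):
            break
    # The real routine stacks the per-step outputs before finalizing.
    outputs, final_state = decoder.finalize(step_outputs, state, lengths)
    return outputs, final_state, lengths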
- property vocab_size¶
The vocabulary size.
- property output_layer¶
The output layer.
RNNDecoderBase¶
- class texar.torch.modules.RNNDecoderBase(input_size, vocab_size, token_embedder=None, token_pos_embedder=None, cell=None, output_layer=None, input_time_major=False, output_time_major=False, hparams=None)[source]¶
Base class inherited by all RNN decoder classes. See
BasicRNNDecoder
for the arguments.See
forward()
for the inputs and outputs of RNN decoders in general.- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The hyperparameters are the same as in
default_hparams()
ofBasicRNNDecoder
, except that the default"name"
here is"rnn_decoder"
.
- forward(inputs=None, sequence_length=None, initial_state=None, helper=None, max_decoding_length=None, impute_finished=False, infer_mode=None, **kwargs)[source]¶
Performs decoding. This is a shared interface for both
BasicRNNDecoder
andAttentionRNNDecoder
.Implementation calls
initialize()
once andstep()
repeatedly on the decoder object. Please refer to tf.contrib.seq2seq.dynamic_decode.See also
Arguments of
create_helper()
, for arguments likedecoding_strategy
.- Parameters
inputs (optional) –
Input tensors for teacher forcing decoding. Used when
decoding_strategy
is set to"train_greedy"
, or when hparams-configured helper is used.The
inputs
is a torch.LongTensor used as index to look up embeddings and feed in the decoder. For example, ifembedder
is an instance ofWordEmbedder
, theninputs
is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.sequence_length (optional) – A 1D int Tensor containing the sequence length of
inputs
. Used when decoding_strategy=”train_greedy” or hparams-configured helper is used.initial_state (optional) – Initial state of decoding. If None (default), zero state is used.
max_decoding_length – A int scalar Tensor indicating the maximum allowed number of decoding steps. If None (default), either hparams[“max_decoding_length_train”] or hparams[“max_decoding_length_infer”] is used according to
mode
.impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished.
helper (optional) –
An instance of
Helper
that defines the decoding strategy. If given,decoding_strategy
and helper configurations inhparams
are ignored.create_helper()
can be used to create some of the common helpers for, e.g., teacher-forcing decoding, greedy decoding, sample decoding, etc.infer_mode (optional) – If not None, overrides mode given by self.training.
**kwargs – Other keyword arguments for constructing helpers defined by
hparams["helper_train"]
orhparams["helper_infer"]
.
- Returns
(outputs, final_state, sequence_lengths)
, whereoutputs: an object containing the decoder output on all time steps.
final_state: the cell state of the final time step.
sequence_lengths: a torch.LongTensor of shape
[batch_size]
containing the length of each sample.
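For example, an inference-mode call might look as follows; this is a sketch that assumes a decoder with a token embedder and a Texar data batch as in the surrounding examples:

import torch

decoder.eval()
outputs, final_state, lengths = decoder(
    decoding_strategy='infer_greedy',
    # Batch-size-aware start tokens, as suggested in the argument
    # description of `create_helper()` above.
    start_tokens=torch.full_like(batch['length'], data.vocab.bos_token_id),
    end_token=data.vocab.eos_token_id,
    max_decoding_length=60)
# `outputs.sample_id`: [batch_size, max_decoded_time]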
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(next_inputs, finished)
.next_inputs
is the tensor that should be used as input for the next step.finished
is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
- property cell¶
The RNN cell.
- property state_size¶
The state size of decoder cell. Equivalent to
decoder.cell.state_size
.
- property output_layer¶
The output layer.
BasicRNNDecoder¶
- class texar.torch.modules.BasicRNNDecoder(input_size, vocab_size, token_embedder=None, token_pos_embedder=None, cell=None, output_layer=None, input_time_major=False, output_time_major=False, hparams=None)[source]¶
Basic RNN decoder.
- Parameters
input_size (int) – Dimension of input embeddings.
vocab_size (int, optional) – Vocabulary size. Required if
output_layer
is None.token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor
tokens
as argument. This is the embedder called inembed_tokens()
to convert input tokens to embeddings.token_pos_embedder –
An instance of torch.nn.Module, or a function taking two torch.LongTensors
tokens
andpositions
as argument. This is the embedder called inembed_tokens()
to convert input tokens with positions to embeddings.Note
Only one among
token_embedder
andtoken_pos_embedder
should be specified. If neither is specified, you must subclassBasicRNNDecoder
and overrideembed_tokens()
.cell (RNNCellBase, optional) – An instance of
RNNCellBase
. If None (default), a cell is created as specified inhparams
.output_layer (optional) – An instance of torch.nn.Module. Apply to the RNN cell output to get logits. If None, a torch.nn.Linear layer is used with output dimension set to
vocab_size
. Setoutput_layer
toidentity()
if you do not want to have an output layer after the RNN cell outputs.hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
See
forward()
for the inputs and outputs of the decoder. The decoder returns(outputs, final_state, sequence_lengths)
, whereoutputs
is an instance ofBasicRNNDecoderOutput
.Example
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

# Training loss
outputs, _, _ = decoder(
    decoding_strategy='train_greedy',
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'] - 1)
loss = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=data_batch['text_ids'][:, 1:],
    logits=outputs.logits,
    sequence_length=data_batch['length'] - 1)

# Create helper
helper = decoder.create_helper(
    decoding_strategy='infer_sample',
    start_tokens=[data.vocab.bos_token_id] * 100,
    end_token=data.vocab.eos_token_id,
    embedding=embedder)

# Inference sample
outputs, _, _ = decoder(
    helper=helper,
    max_decoding_length=60)

sample_text = tx.utils.map_ids_to_strs(
    outputs.sample_id, data.vocab)
print(sample_text)
# [
#   the first sequence sample .
#   the second sequence sample .
#   ...
# ]
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "rnn_cell": default_rnn_cell_hparams(), "max_decoding_length_train": None, "max_decoding_length_infer": None, "helper_train": { "type": "TrainingHelper", "kwargs": {} } "helper_infer": { "type": "SampleEmbeddingHelper", "kwargs": {} } "name": "basic_rnn_decoder" }
Here:
- “rnn_cell”: dict
A dictionary of RNN cell hyperparameters. Ignored if
cell
is given to the decoder constructor. The default value is defined indefault_rnn_cell_hparams()
.- “max_decoding_length_train”: int or None
Maximum allowed number of decoding steps in training mode. If None (default), decoding is performed until fully done, e.g., encountering the
<EOS>
token. Ignored if"max_decoding_length"
is not None given when calling the decoder.- “max_decoding_length_infer”: int or None
Same as
"max_decoding_length_train"
but for inference mode.- “helper_train”: dict
The hyperparameters of the helper used in training.
"type"
can be a helper class, its name or module path, or a helper instance. If a class name is given, the class must be from moduletexar.torch.modules
, ortexar.torch.custom
. This is used only when both"decoding_strategy"
and"helper"
arguments are None when calling the decoder. Seeforward()
for more details.- “helper_infer”: dict
Same as
"helper_train"
but during inference mode.- “name”: str
Name of the decoder. The default value is
"basic_rnn_decoder"
.
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(next_inputs, finished)
.next_inputs
is the tensor that should be used as input for the next step.finished
is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
BasicRNNDecoderOutput¶
- class texar.torch.modules.BasicRNNDecoderOutput(logits, sample_id, cell_output)[source]¶
The outputs of
BasicRNNDecoder
that include both RNN outputs and sampled IDs at each step. This is also used to store results of all the steps after decoding the whole sequence.- property logits¶
The outputs of RNN (at each step/of all steps) by applying the output layer on cell outputs. For example, in
BasicRNNDecoder
with default hyperparameters, this is a torch.Tensor of shape[batch_size, max_time, vocab_size]
after decoding the whole sequence.
- property sample_id¶
The sampled results (at each step/of all steps). For example, in
BasicRNNDecoder
with decoding strategy of"train_greedy"
, this is a torch.LongTensor of shape[batch_size, max_time]
containing the sampled token indices of all steps. Note that the shape ofsample_id
is different for different decoding strategy or helper. Please refer toHelper
for the detailed information.
- property cell_output¶
The output of RNN cell (at each step/of all steps). This contains the results prior to the output layer. For example, in
BasicRNNDecoder
with default hyperparameters, this is a torch.Tensor of shape[batch_size, max_time, cell_output_size]
after decoding the whole sequence.
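As a small sketch of how these fields are typically consumed (decoder, embedder, and data_batch as in the BasicRNNDecoder example above):

outputs, final_state, lengths = decoder(
    decoding_strategy='train_greedy',
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'] - 1)

# `logits` feed the training loss, `sample_id` holds the argmax tokens,
# and `cell_output` exposes the features before the output layer.
print(outputs.logits.shape)       # [batch_size, max_time, vocab_size]
print(outputs.sample_id.shape)    # [batch_size, max_time]
print(outputs.cell_output.shape)  # [batch_size, max_time, cell_output_size]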
AttentionRNNDecoder¶
- class texar.torch.modules.AttentionRNNDecoder(input_size, encoder_output_size, vocab_size, token_embedder=None, token_pos_embedder=None, cell=None, output_layer=None, cell_input_fn=None, hparams=None)[source]¶
RNN decoder with attention mechanism.
- Parameters
input_size (int) – Dimension of input embeddings.
encoder_output_size (int) – The output size of the encoder cell.
vocab_size (int) – Vocabulary size. Required if
output_layer
is None.token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor
tokens
as argument. This is the embedder called inembed_tokens()
to convert input tokens to embeddings.token_pos_embedder –
An instance of torch.nn.Module, or a function taking two torch.LongTensors
tokens
andpositions
as argument. This is the embedder called inembed_tokens()
to convert input tokens with positions to embeddings.Note
Only one among
token_embedder
andtoken_pos_embedder
should be specified. If neither is specified, you must subclassAttentionRNNDecoder
and overrideembed_tokens()
.cell (RNNCellBase, optional) – An instance of
RNNCellBase
. If None, a cell is created as specified inhparams
.output_layer (optional) –
An output layer that transforms cell output to logits. This can be:
A callable layer, e.g., an instance of torch.nn.Module.
A tensor. A dense layer will be created using the tensor as the kernel weights. The bias of the dense layer is determined by hparams.output_layer_bias. This can be used to tie the output layer with the input embedding matrix, as proposed in https://arxiv.org/pdf/1608.05859.pdf
None. A dense layer will be created based on
vocab_size
and hparams.output_layer_bias.If no output layer after the cell output is needed, set (vocab_size=None, output_layer=texar.torch.core.identity).
cell_input_fn (callable, optional) – A callable that produces RNN cell inputs. If None (default), the default is used:
lambda inputs, attention: torch.cat([inputs, attention], -1)
, which concatenates regular RNN cell inputs with attentions.hparams (dict, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
See
texar.torch.modules.RNNDecoderBase.forward()
for the inputs and outputs of the decoder. The decoder returns (outputs, final_state, sequence_lengths), where outputs is an instance ofAttentionRNNDecoderOutput
.Example
# Encodes the source
enc_embedder = WordEmbedder(data.source_vocab.size, ...)
encoder = UnidirectionalRNNEncoder(...)
enc_outputs, _ = encoder(
    inputs=enc_embedder(data_batch['source_text_ids']),
    sequence_length=data_batch['source_length'])

# Decodes while attending to the source
dec_embedder = WordEmbedder(vocab_size=data.target_vocab.size, ...)
decoder = AttentionRNNDecoder(
    encoder_output_size=(self.encoder.cell_fw.hidden_size +
                         self.encoder.cell_bw.hidden_size),
    input_size=dec_embedder.dim,
    vocab_size=data.target_vocab.size)

outputs, _, _ = decoder(
    decoding_strategy='train_greedy',
    memory=enc_outputs,
    memory_sequence_length=data_batch['source_length'],
    inputs=dec_embedder(data_batch['target_text_ids']),
    sequence_length=data_batch['target_length'] - 1)
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values. Common hyperparameters are the same as in
BasicRNNDecoder
.default_hparams()
. Additional hyperparameters are for attention mechanism configuration.

{
    "attention": {
        "type": "LuongAttention",
        "kwargs": {
            "num_units": 256,
        },
        "attention_layer_size": None,
        "alignment_history": False,
        "output_attention": True,
    },
    # The following hyperparameters are the same as with
    # `BasicRNNDecoder`
    "rnn_cell": default_rnn_cell_hparams(),
    "max_decoding_length_train": None,
    "max_decoding_length_infer": None,
    "helper_train": {
        "type": "TrainingHelper",
        "kwargs": {}
    },
    "helper_infer": {
        "type": "SampleEmbeddingHelper",
        "kwargs": {}
    },
    "name": "attention_rnn_decoder"
}
Here:
- “attention”: dict
Attention hyperparameters, including:
- “type”: str or class or instance
The attention type. Can be an attention class, its name or module path, or a class instance. The class must be a subclass of
AttentionMechanism
. See Attention Mechanism for all supported attention mechanisms. If class name is given, the class must be from modulestexar.torch.core
ortexar.torch.custom
.Example:
# class name "type": "LuongAttention" "type": "BahdanauAttention" # module path "type": "texar.torch.core.BahdanauMonotonicAttention" "type": "my_module.MyAttentionMechanismClass" # class "type": texar.torch.core.LuongMonotonicAttention # instance "type": LuongAttention(...)
- “kwargs”: dict
Keyword arguments for the attention class constructor. Arguments memory and memory_sequence_length should not be specified here because they are given to the decoder constructor. Ignored if "type" is an attention class instance. For example:

"type": "LuongAttention",
"kwargs": {
    "num_units": 256,
    "probability_fn": torch.nn.functional.softmax,
}
Here “probability_fn” can also be set to the string name or module path to a probability function.
- “attention_layer_size”: int or None
The depth of the attention (output) layer. The context and cell output are fed into the attention layer to generate attention at each time step. If None (default), use the context as attention at each time step.
- “alignment_history”: bool
whether to store alignment history from all time steps in the final output state. (Stored as a time major TensorArray on which you must call stack().)
- “output_attention”: bool
If True (default), the output at each time step is the attention value. This is the behavior of Luong-style attention mechanisms. If False, the output at each time step is the output of cell. This is the behavior of Bahdanau-style attention mechanisms. In both cases, the attention tensor is propagated to the next time step via the state and is used there. This flag only controls whether the attention mechanism is propagated up to the next cell in an RNN stack or to the top RNN output.
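For instance, a Bahdanau-style configuration could be specified roughly as below; the sizes are illustrative and dec_embedder / data follow the class example above:

attn_hparams = {
    'attention': {
        'type': 'BahdanauAttention',
        'kwargs': {'num_units': 256},
        'attention_layer_size': 256,
        # Bahdanau-style: emit the cell output rather than the attention.
        'output_attention': False,
    }
}
decoder = AttentionRNNDecoder(
    encoder_output_size=256,
    input_size=dec_embedder.dim,
    vocab_size=data.target_vocab.size,
    hparams=attn_hparams)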
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(next_inputs, finished)
.next_inputs
is the tensor that should be used as input for the next step.finished
is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
- forward(memory, memory_sequence_length=None, inputs=None, sequence_length=None, initial_state=None, helper=None, max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]¶
Performs decoding.
Implementation calls initialize() once and step() repeatedly on the Decoder object. Please refer to tf.contrib.seq2seq.dynamic_decode.
See also
Arguments of
create_helper()
.- Parameters
memory – The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, …].
memory_sequence_length – (optional) Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths.
inputs (optional) –
Input tensors for teacher forcing decoding. Used when
decoding_strategy
is set to"train_greedy"
, or when hparams-configured helper is used.The attr:inputs is a torch.LongTensor used as index to look up embeddings and feed in the decoder. For example, if
embedder
is an instance ofWordEmbedder
, theninputs
is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.sequence_length (optional) – A 1D int Tensor containing the sequence length of
inputs
. Used when decoding_strategy=”train_greedy” or hparams-configured helper is used.initial_state (optional) – Initial state of decoding. If None (default), zero state is used.
helper (optional) – An instance of
Helper
that defines the decoding strategy. If given,decoding_strategy
and helper configurations inhparams
are ignored.max_decoding_length – A int scalar Tensor indicating the maximum allowed number of decoding steps. If None (default), either hparams[“max_decoding_length_train”] or hparams[“max_decoding_length_infer”] is used according to
mode
.impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished.
infer_mode (optional) – If not None, overrides mode given by self.training.
beam_width (int) – Set to use beam search. If given,
decoding_strategy
is ignored.length_penalty (float) – Length penalty coefficient used in beam search decoding. Refer to https://arxiv.org/abs/1609.08144 for more details. It should be larger if longer sentences are desired.
**kwargs – Other keyword arguments for constructing helpers defined by
hparams["helper_train"]
orhparams["helper_infer"]
.
- Returns
For beam search decoding, returns a
dict
containing keys"sample_id"
and"log_prob"
."sample_id"
is a torch.LongTensor of shape[batch_size, max_time, beam_width]
containing generated token indexes.sample_id[:,:,0]
is the highest-probable sample."log_prob"
is a torch.Tensor of shape[batch_size, beam_width]
containing the log probability of each sequence sample.
For “infer_greedy” and “infer_sample” decoding or decoding with
helper
, returns a tuple (outputs, final_state, sequence_lengths), whereoutputs: an object containing the decoder output on all time steps.
final_state: is the cell state of the final time step.
sequence_lengths: is an int Tensor of shape [batch_size] containing the length of each sample.
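For example, beam-search decoding could look roughly like this (a sketch reusing the encoder outputs and vocabularies from the class example above; the beam width and length penalty are illustrative):

import torch

beam_results = decoder(
    memory=enc_outputs,
    memory_sequence_length=data_batch['source_length'],
    beam_width=5,
    length_penalty=0.6,
    start_tokens=torch.full_like(data_batch['source_length'],
                                 data.target_vocab.bos_token_id),
    end_token=data.target_vocab.eos_token_id,
    max_decoding_length=60)

best_ids = beam_results['sample_id'][:, :, 0]  # highest-probability beam
log_probs = beam_results['log_prob']           # [batch_size, beam_width]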
AttentionRNNDecoderOutput¶
- class texar.torch.modules.AttentionRNNDecoderOutput(logits, sample_id, cell_output, attention_scores, attention_context)[source]¶
The outputs of
AttentionRNNDecoder
that additionally includes attention results.- property logits¶
The outputs of RNN (at each step/of all steps) by applying the output layer on cell outputs. For example, in
AttentionRNNDecoder
with default hyperparameters, this is a torch.Tensor of shape[batch_size, max_time, vocab_size]
after decoding the whole sequence.
- property sample_id¶
The sampled results (at each step/of all steps). For example, in
AttentionRNNDecoder
with decoding strategy of"train_greedy"
, this is a torch.LongTensor of shape[batch_size, max_time]
containing the sampled token indices of all steps. Note that the shape ofsample_id
is different for different decoding strategy or helper. Please refer toHelper
for the detailed information.
- property cell_output¶
The output of RNN cell (at each step/of all steps). This contains the results prior to the output layer. For example, in
AttentionRNNDecoder
with default hyperparameters, this is a torch.Tensor of shape[batch_size, max_time, cell_output_size]
after decoding the whole sequence.
- property attention_scores¶
A single or tuple of Tensor(s) containing the alignments emitted (at the previous time step/of all time steps) for each attention mechanism.
- property attention_context¶
The attention emitted (at the previous time step/of all time steps).
GPT2Decoder¶
- class texar.torch.modules.GPT2Decoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw GPT2 Transformer for decoding sequences. Please see
PretrainedGPT2Mixin
for a brief description of GPT2.This module basically stacks
WordEmbedder
,PositionEmbedder
,TransformerDecoder
.This module supports the architecture first proposed in (Radford et al.) GPT2.
- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
gpt2-small
). Please refer toPretrainedGPT2Mixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The decoder arch is determined by the constructor argument
pretrained_model_name
if it's specified. In this case, hparams are ignored. Otherwise, the decoder arch is determined by hparams['pretrained_model_name'] if it's specified. All other configurations in hparams are ignored.
If the above two are None, the decoder arch is defined by the configurations in hparams and weights are randomly initialized.
{ "name": "gpt2_decoder", "pretrained_model_name": "gpt2-small", "vocab_size": 50257, "context_size": 1024, "embedding_size": 768, "embed": { "dim": 768, "name": "word_embeddings" }, "position_size": 1024, "position_embed": { "dim": 768, "name": "position_embeddings" }, # hparams for TransformerDecoder "decoder": { "dim": 768, "num_blocks": 12, "embedding_dropout": 0, "residual_dropout": 0, "multihead_attention": { "use_bias": True, "num_units": 768, "num_heads": 12, "dropout_rate": 0.0, "output_dim": 768 }, "initializer": { "type": "variance_scaling_initializer", "kwargs": { "factor": 1.0, "mode": "FAN_AVG", "uniform": True } }, "eps": 1e-5, "poswise_feedforward": { "layers": [ { "type": "Linear", "kwargs": { "in_features": 768, "out_features": 3072, "bias": True } }, { "type": "GPTGELU", "kwargs": {} }, { "type": "Linear", "kwargs": { "in_features": 3072, "out_features": 768, "bias": True } } ], "name": "ffn" } }, }
Here:
The default parameters are values for 124M GPT2 model.
- “pretrained_model_name”: str or None
The name of the pre-trained GPT2 model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in GPT2Model.
- “position_embed”: dict
Hyperparameters for position embedding layer.
- “eps”: float
Epsilon values for layer norm layers.
- “position_size”: int
The maximum sequence length that this model might ever be used with.
- “name”: str
Name of the module.
- forward(inputs=None, sequence_length=None, memory=None, memory_sequence_length=None, memory_attention_bias=None, context=None, context_sequence_length=None, helper=None, decoding_strategy='train_greedy', max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]¶
Performs decoding. Has exactly the same interface as
texar.torch.modules.TransformerDecoder.forward()
. Please refer to it for the detailed usage.
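A rough usage sketch (not taken from the library documentation): loading a pre-trained model and sampling a continuation of a tokenized prompt. Here context_ids / context_lengths are assumed GPT2 token IDs and lengths, and 50256 is the GPT2 end-of-text ID:

decoder = GPT2Decoder('gpt2-small')
decoder.eval()

outputs, lengths = decoder(
    context=context_ids,                     # [batch_size, prompt_len]
    context_sequence_length=context_lengths,
    start_tokens=context_ids[:, 0],
    end_token=50256,
    decoding_strategy='infer_sample',
    max_decoding_length=128)
generated_ids = outputs.sample_id            # [batch_size, max_time]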
XLNetDecoder¶
- class texar.torch.modules.XLNetDecoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Raw XLNet module for decoding sequences. Please see
PretrainedXLNetMixin
for a brief description of XLNet.- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
xlnet-base-cased
). Please refer toPretrainedXLNetMixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The decoder arch is determined by the constructor argument
pretrained_model_name
if it’s specified. In this case, hparams are ignored.Otherwise, the decoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the decoder arch is defined by the configurations in hparams and weights are randomly initialized.
{ "pretrained_model_name": "xlnet-base-cased", "untie_r": True, "num_layers": 12, "mem_len": 0, "reuse_len": 0, "num_heads": 12, "hidden_dim": 768, "head_dim": 64, "dropout": 0.1, "attention_dropout": 0.1, "use_segments": True, "ffn_inner_dim": 3072, "activation": 'gelu', "vocab_size": 32000, "max_seq_length": 512, "initializer": None, "name": "xlnet_decoder", }
Here:
The default parameters are values for cased XLNet-Base model.
- “pretrained_model_name”: str or None
The name of the pre-trained XLNet model. If None, the model will be randomly initialized.
- “untie_r”: bool
Whether to untie the biases in attention.
- “num_layers”: int
The number of stacked layers.
- “mem_len”: int
The number of tokens to cache.
- “reuse_len”: int
The number of tokens in the current batch to be cached and reused in the future.
- “num_heads”: int
The number of attention heads.
- “hidden_dim”: int
The hidden size.
- “head_dim”: int
The dimension size of each attention head.
- “dropout”: float
Dropout rate.
- “attention_dropout”: float
Dropout rate on attention probabilities.
- “use_segments”: bool
Whether to use segment embedding.
- “ffn_inner_dim”: int
The hidden size in feed-forward layers.
- “activation”: str
relu or gelu.
- “vocab_size”: int
The vocabulary size.
- “max_seq_length”: int
The maximum sequence length for RelativePositionalEncoding.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See
get_initializer()
for details.- “name”: str
Name of the module.
- embed_tokens(tokens, positions)[source]¶
Convert tokens along with positions to embeddings.
- Parameters
tokens – A torch.LongTensor denoting the token indices to convert to embeddings.
positions – A torch.LongTensor with the same size as
tokens
, denoting the positions of the tokens. This is useful if the decoder uses positional embeddings.
- Returns
A torch.Tensor of size
tokens.size() + (embed_dim,)
, denoting the converted embeddings.
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(next_inputs, finished)
.next_inputs
is the tensor that should be used as input for the next step.finished
is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
- forward(start_tokens, memory=None, cache_len=512, max_decoding_length=500, recompute_memory=True, print_steps=False, helper_type=None, **helper_kwargs)[source]¶
Perform autoregressive decoding using XLNet. The algorithm is largely inspired by: https://github.com/rusiaaman/XLNet-gen.
- Parameters
start_tokens – A LongTensor of shape [batch_size, prompt_len], representing the tokenized initial prompt.
memory (optional) – The initial memory.
cache_len – Length of memory (number of tokens) to cache.
max_decoding_length (int) – Maximum number of tokens to decode.
recompute_memory (bool) – If True, the entire memory is recomputed for each token to generate. This leads to better performance because it enables every generated token to attend to each other, compared to reusing previous memory which is equivalent to using a causal attention mask. However, it is computationally more expensive. Defaults to True.
print_steps (bool) – If True, will print decoding progress.
helper_type – Type (or name of the type) of any sub-class of
Helper
.helper_kwargs – The keyword arguments to pass to constructor of the specific helper type.
- Returns
A tuple of (output, new_memory):
- output: The sampled tokens as a list of integers.
- new_memory: The memory of the sampled tokens.
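As a brief sketch (prompt_ids is an assumed [batch_size, prompt_len] torch.LongTensor of XLNet token IDs), autoregressive continuation of a prompt with the default helper:

decoder = XLNetDecoder('xlnet-base-cased')
decoder.eval()

output_ids, new_memory = decoder(
    start_tokens=prompt_ids,
    cache_len=512,
    max_decoding_length=100)
# `output_ids` holds the sampled continuation; `new_memory` can be fed
# back in as `memory` for further decoding.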
XLNetDecoderOutput¶
- class texar.torch.modules.XLNetDecoderOutput(logits, sample_id)[source]¶
The output of
XLNetDecoder
.- property logits¶
A torch.Tensor of shape
[batch_size, max_time, vocab_size]
containing the logits.
- property sample_id¶
A torch.LongTensor of shape
[batch_size, max_time]
(or[batch_size, max_time, vocab_size]
) containing the sampled token indices. Note that the shape ofsample_id
is different for different decoding strategy or helper. Please refer toHelper
for the detailed information.
TransformerDecoder¶
- class texar.torch.modules.TransformerDecoder(token_embedder=None, token_pos_embedder=None, vocab_size=None, output_layer=None, hparams=None)[source]¶
Transformer decoder that applies multi-head self-attention for sequence decoding.
It is a stack of
MultiheadAttentionEncoder
,FeedForwardNetwork
, and residual connections.- Parameters
token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor
tokens
as argument. This is the embedder called inembed_tokens()
to convert input tokens to embeddings.token_pos_embedder –
An instance of torch.nn.Module, or a function taking two torch.LongTensors
tokens
andpositions
as argument. This is the embedder called inembed_tokens()
to convert input tokens with positions to embeddings.Note
Only one among
token_embedder
andtoken_pos_embedder
should be specified. If neither is specified, you must subclassTransformerDecoder
and overrideembed_tokens()
.vocab_size (int, optional) – Vocabulary size. Required if
output_layer
is None.output_layer (optional) –
An output layer that transforms cell output to logits. This can be:
A callable layer, e.g., an instance of torch.nn.Module.
A tensor. A torch.nn.Linear layer will be created using the tensor as weights. The bias of the dense layer is determined by
hparams.output_layer_bias
. This can be used to tie the output layer with the input embedding matrix, as proposed in https://arxiv.org/pdf/1608.05859.pdf.None. A torch.nn.Linear layer will be created based on
vocab_size
andhparams.output_layer_bias
.If no output layer is needed at the end, set
vocab_size
to None andoutput_layer
toidentity()
.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- initialize_blocks()[source]¶
Helper function which initializes blocks for decoder.
Should be overridden by any classes where block initialization varies.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # Same as in TransformerEncoder
    "num_blocks": 6,
    "dim": 512,
    "embedding_dropout": 0.1,
    "residual_dropout": 0.1,
    "poswise_feedforward": default_transformer_poswise_net_hparams,
    "multihead_attention": {
        'name': 'multihead_attention',
        'num_units': 512,
        'output_dim': 512,
        'num_heads': 8,
        'dropout_rate': 0.1,
        'use_bias': False,
    },
    "eps": 1e-12,
    "initializer": None,
    "name": "transformer_decoder",

    # Additional for TransformerDecoder
    "embedding_tie": True,
    "output_layer_bias": False,
    "max_decoding_length": int(1e10),
}
Here:
- “num_blocks”: int
Number of stacked blocks.
- “dim”: int
Hidden dimension of the encoder.
- “embedding_dropout”: float
Dropout rate of the input word and position embeddings.
- “residual_dropout”: float
Dropout rate of the residual connections.
- “poswise_feedforward”: dict
Hyperparameters for a feed-forward network used in residual connections. Make sure the dimension of the output tensor is equal to
dim
.See
default_transformer_poswise_net_hparams()
for details.- “multihead_attention”: dict
Hyperparameters for the multi-head attention strategy. Make sure the
output_dim
in this module is equal todim
.See
MultiheadAttentionEncoder
for details.- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module.
See
get_initializer()
for details.- “embedding_tie”: bool
Whether to use the word embedding matrix as the output layer that computes logits. If False, a new dense layer is created.
- “eps”: float
Epsilon values for layer norm layers.
- “output_layer_bias”: bool
Whether to use bias to the output layer.
- “max_decoding_length”: int
The maximum allowed number of decoding steps. Set to a very large number to avoid the length constraint. Ignored if provided in
forward()
or"train_greedy"
decoding is used.- “name”: str
Name of the module.
- forward(inputs=None, sequence_length=None, memory=None, memory_sequence_length=None, memory_attention_bias=None, context=None, context_sequence_length=None, helper=None, decoding_strategy='train_greedy', max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]¶
Performs decoding.
The interface is very similar to that of RNN decoders (
RNNDecoderBase
). In particular, the function provides 3 ways to specify the decoding method, with varying flexibility:The
decoding_strategy
argument.“train_greedy”: decoding in teacher-forcing fashion (i.e., feeding ground truth to decode the next step), and for each step sample is obtained by taking the argmax of logits. Argument
inputs
is required for this strategy.sequence_length
is optional.“infer_greedy”: decoding in inference fashion (i.e., feeding generated sample to decode the next step), and for each step sample is obtained by taking the argmax of logits. Arguments
(start_tokens, end_token)
are required for this strategy, and argumentmax_decoding_length
is optional.“infer_sample”: decoding in inference fashion, and for each step sample is obtained by random sampling from the logits. Arguments
(start_tokens, end_token)
are required for this strategy, and argumentmax_decoding_length
is optional.
This argument is used only when arguments
helper
andbeam_width
are both None.The
helper
argument: An instance of subclass ofHelper
. This provides a superset of decoding strategies than above. The interface is the same as in RNN decoders. Please refer totexar.torch.modules.RNNDecoderBase.forward()
for detailed usage and examples.Note that, here, though using a
TrainingHelper
corresponding to the"train_greedy"
strategy above, the implementation is slower than directly settingdecoding_strategy="train_greedy"
(though output results are the same).Argument
max_decoding_length
is optional.Beam search: set
beam_width
to use beam search decoding. Arguments(start_tokens, end_token)
are required, and argumentmax_decoding_length
is optional.
- Parameters
memory (optional) – The memory to attend, e.g., the output of an RNN encoder. A torch.Tensor of shape
[batch_size, memory_max_time, dim]
.memory_sequence_length (optional) – A torch.Tensor of shape
[batch_size]
containing the sequence lengths for the batch entries in memory. Used to create attention bias ofmemory_attention_bias
is not given. Ignored ifmemory_attention_bias
is provided.memory_attention_bias (optional) – A torch.Tensor of shape
[batch_size, num_heads, memory_max_time, dim]
. An attention bias typically sets the value of a padding position to a large negative value for masking. If not given,memory_sequence_length
is used to automatically create an attention bias.inputs (optional) –
Input tensors for teacher forcing decoding. Used when
decoding_strategy
is set to"train_greedy"
, or when hparams-configured helper is used.The attr:inputs is a torch.LongTensor used as index to look up embeddings and feed in the decoder. For example, if
embedder
is an instance ofWordEmbedder
, theninputs
is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.sequence_length (optional) – A torch.LongTensor of shape
[batch_size]
, containing the sequence length ofinputs
. Tokens beyond the respective sequence length are masked out. Used whendecoding_strategy
is set to"train_greedy"
.decoding_strategy (str) – A string specifying the decoding strategy, including
"train_greedy"
,"infer_greedy"
,"infer_sample"
. Different arguments are required based on the strategy. See above for details. Ignored ifbeam_width
orhelper
is set.beam_width (int) – Set to use beam search. If given,
decoding_strategy
is ignored.length_penalty (float) – Length penalty coefficient used in beam search decoding. Refer to https://arxiv.org/abs/1609.08144 for more details. It should be larger if longer sentences are desired.
context (optional) – A torch.LongTensor of shape
[batch_size, length]
, containing the starting tokens for decoding. If context is set,start_tokens
of theHelper
will be ignored.context_sequence_length (optional) – Specify the length of context.
max_decoding_length (int, optional) – The maximum allowed number of decoding steps. If None (default), use
"max_decoding_length"
defined inhparams
. Ignored in"train_greedy"
decoding.impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished. Ignored in
"train_greedy"
decoding.helper (optional) – An instance of
Helper
that defines the decoding strategy. If given,decoding_strategy
and helper configurations inhparams
are ignored.infer_mode (optional) – If not None, overrides mode given by
self.training
.**kwargs (optional, dict) –
Other keyword arguments. Typically ones such as:
start_tokens: A torch.LongTensor of shape
[batch_size]
, the start tokens. Used whendecoding_strategy
is"infer_greedy"
or"infer_sample"
or whenbeam_search
is set. Ignored whencontext
is set.When used with the Texar data module, to get
batch_size
samples wherebatch_size
is changing according to the data module, this can be set asstart_tokens=torch.full_like(batch['length'], bos_token_id)
.end_token: An integer or 0D torch.LongTensor, the token that marks the end of decoding. Used when
decoding_strategy
is"infer_greedy"
or"infer_sample"
, or whenbeam_search
is set.
- Returns
For “train_greedy” decoding, returns an instance of
TransformerDecoderOutput
which contains sample_id and logits.For “infer_greedy” and “infer_sample” decoding or decoding with
helper
, returns a tuple(outputs, sequence_lengths)
, whereoutputs
is an instance ofTransformerDecoderOutput
as in “train_greedy”, andsequence_lengths
is a torch.LongTensor of shape[batch_size]
containing the length of each sample.For beam search decoding, returns a
dict
containing keys"sample_id"
and"log_prob"
."sample_id"
is a torch.LongTensor of shape[batch_size, max_time, beam_width]
containing generated token indexes.sample_id[:,:,0]
is the highest-probable sample."log_prob"
is a torch.Tensor of shape[batch_size, beam_width]
containing the log probability of each sequence sample.
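To tie the modes together, here is a condensed sketch of teacher-forcing training followed by beam-search inference; enc_outputs, the batch fields, and the BOS/EOS IDs are assumptions carried over from the earlier examples:

import torch

# Teacher forcing ("train_greedy"): token-ID inputs and lengths required.
train_outputs = decoder(
    memory=enc_outputs,
    memory_sequence_length=batch['source_length'],
    inputs=batch['target_text_ids'],
    sequence_length=batch['target_length'] - 1,
    decoding_strategy='train_greedy')
# train_outputs.logits: [batch_size, max_time, vocab_size]

# Beam search: only start/end tokens are needed.
beam_out = decoder(
    memory=enc_outputs,
    memory_sequence_length=batch['source_length'],
    beam_width=4,
    start_tokens=torch.full_like(batch['source_length'], bos_token_id),
    end_token=eos_token_id,
    max_decoding_length=60)
best_ids = beam_out['sample_id'][:, :, 0]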
- property output_size¶
Output size of one step.
- initialize(helper, inputs, sequence_length, initial_state)[source]¶
Called before any decoding iterations.
This method must compute initial input values and initial state.
- Parameters
helper – The
Helper
instance to use.inputs (optional) – A (structure of) input tensors.
sequence_length (optional) – A torch.LongTensor representing lengths of each sequence.
initial_state – A possibly nested structure of tensors indicating the initial decoder state.
- Returns
A tuple
(finished, initial_inputs, initial_state)
representing initial values offinished
flags, inputs, and state.
- step(helper, time, inputs, state)[source]¶
Compute the output and the state at the current time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(outputs, next_state)
.outputs
is an object containing the decoder output.next_state
is the decoder state for the next time step.
- next_inputs(helper, time, outputs)[source]¶
Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).
- Parameters
- Returns
A tuple
(next_inputs, finished)
.next_inputs
is the tensor that should be used as input for the next step.finished
is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.
- finalize(outputs, final_state, sequence_lengths)[source]¶
Called after all decoding iterations have finished.
- Parameters
outputs – Outputs at each time step.
final_state – The RNNCell state after the last time step.
sequence_lengths – Sequence lengths for each sequence in batch.
- Returns
A tuple
(outputs, final_state)
.outputs
is an object containing the decoder output.final_state
is the final decoder state.
TransformerDecoderOutput¶
- class texar.torch.modules.TransformerDecoderOutput(logits, sample_id)[source]¶
The output of
TransformerDecoder
.- property logits¶
A torch.Tensor of shape
[batch_size, max_time, vocab_size]
containing the logits.
- property sample_id¶
A torch.LongTensor of shape
[batch_size, max_time]
(or[batch_size, max_time, vocab_size]
) containing the sampled token indices. Note that the shape ofsample_id
is different for different decoding strategy or helper. Please refer toHelper
for the detailed information.
Helper¶
- class texar.torch.modules.Helper(*args, **kwds)[source]¶
Interface for implementing sampling in seq2seq decoders.
Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.Helper.
- initialize(embedding_fn, inputs, sequence_length)[source]¶
Initialize the current batch.
- Parameters
embedding_fn – A function taking input tokens and timestamps, returning embedding tensors.
inputs – Input tensors.
sequence_length – An int32 vector tensor.
- Returns
(initial_finished, initial_inputs)
.
TrainingHelper¶
- class texar.torch.modules.TrainingHelper(time_major=False)[source]¶
A helper for use during training. Only reads inputs.
Returned
sample_ids
are the argmax of the RNN output logits.Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.TrainingHelper.
- Parameters
time_major (bool) – Whether the tensors in
inputs
are time major. If False (default), they are assumed to be batch major.
EmbeddingHelper¶
- class texar.torch.modules.EmbeddingHelper(start_tokens, end_token)[source]¶
A generic helper for use during inference.
Uses output logits for sampling, and passes the result through an embedding layer to get the next input.
- Parameters
start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
- Raises
ValueError – if
start_tokens
is not a 1D tensor orend_token
is not a scalar.
GreedyEmbeddingHelper¶
- class texar.torch.modules.GreedyEmbeddingHelper(start_tokens, end_token)[source]¶
A helper for use during inference.
Uses the argmax of the output (treated as logits) and passes the result through an embedding layer to get the next input.
Note that for greedy decoding, Texar’s decoders provide a simpler interface by specifying
decoding_strategy='infer_greedy'
when calling a decoder (see, e.g., RNN decoder
). In this case, use ofGreedyEmbeddingHelper
is not necessary.Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.GreedyEmbeddingHelper.
- Parameters
start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
- Raises
ValueError – if
start_tokens
is not a 1D tensor orend_token
is not a scalar.
SampleEmbeddingHelper¶
- class texar.torch.modules.SampleEmbeddingHelper(start_tokens, end_token, softmax_temperature=None)[source]¶
A helper for use during inference.
Uses sampling (from a distribution) instead of argmax and passes the result through an embedding layer to get the next input.
Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.SampleEmbeddingHelper.
- Parameters
embedding – A callable or the
params
argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor ofids
(argmax ids), or take two arguments (ids
,times
), whereids
is a vector of argmax ids, andtimes
is a vector of current time steps (i.e., position ids). The latter case can be used whenembedding
is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.
- Raises
ValueError – if
start_tokens
is not a 1D tensor orend_token
is not a scalar.
TopKSampleEmbeddingHelper¶
- class texar.torch.modules.TopKSampleEmbeddingHelper(start_tokens, end_token, top_k=10, softmax_temperature=None)[source]¶
A helper for use during inference.
Samples from
top_k
most likely candidates from a vocab distribution, and passes the result through an embedding layer to get the next input.- Parameters
start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
top_k (int, optional) – Number of top candidates to sample from. Must be >=0. If set to 0, samples from all candidates (i.e., regular random sample decoding). Defaults to 10.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.
- Raises
ValueError – if
start_tokens
is not a 1D tensor orend_token
is not a scalar.
TopPSampleEmbeddingHelper¶
- class texar.torch.modules.TopPSampleEmbeddingHelper(start_tokens, end_token, p=0.9, softmax_temperature=None)[source]¶
A helper for use during inference.
Samples from candidates that have a cumulative probability of at most p when arranged in decreasing order, and passes the result through an embedding layer to get the next input. This is also known as "Nucleus Sampling", proposed in the paper "The Curious Case of Neural Text Degeneration" (Holtzman et al.).
- Parameters
start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
p (float, optional) – A value used to filter out tokens whose cumulative probability is greater than p when arranged in decreasing order of probabilities. Must be in [0, 1.0]. If set to 1, samples from all candidates (i.e., regular random sample decoding). Defaults to 0.9.
softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.
- Raises
ValueError – if
start_tokens
is not a 1D tensor orend_token
is not a scalar.
SoftmaxEmbeddingHelper¶
- class texar.torch.modules.SoftmaxEmbeddingHelper(start_tokens, end_token, tau, stop_gradient=False, use_finish=True)[source]¶
A helper that feeds softmax probabilities over vocabulary to the next step.
Uses the softmax probability vector to pass through word embeddings to get the next input (i.e., a mixed word embedding).
A subclass of
Helper
. Used as a helper toRNNDecoderBase
in inference mode.- Parameters
embedding – A callable or the
params
argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor ofids
(argmax ids), or take two arguments (ids
,times
), whereids
is a vector of argmax ids, andtimes
is a vector of current time steps (i.e., position ids). The latter case can be used whenembedding
is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
tau – A float scalar tensor, the softmax temperature.
stop_gradient (bool) – Whether to stop the gradient backpropagation when feeding softmax vector to the next step.
use_finish (bool) – Whether to stop decoding once
end_token
is generated. If False, decoding will continue untilmax_decoding_length
of the decoder is reached.
- Raises
ValueError – if
start_tokens
is not a 1D tensor orend_token
is not a scalar.
GumbelSoftmaxEmbeddingHelper¶
- class texar.torch.modules.GumbelSoftmaxEmbeddingHelper(start_tokens, end_token, tau, straight_through=False, stop_gradient=False, use_finish=True)[source]¶
A helper that feeds Gumbel softmax sample to the next step.
Uses the Gumbel softmax vector to pass through word embeddings to get the next input (i.e., a mixed word embedding).
A subclass of
Helper
. Used as a helper toRNNDecoderBase
in inference mode.Same as
SoftmaxEmbeddingHelper
except that here Gumbel softmax (instead of softmax) is used.- Parameters
embedding – A callable or the
params
argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor ofids
(argmax ids), or take two arguments (ids
,times
), whereids
is a vector of argmax ids, andtimes
is a vector of current time steps (i.e., position ids). The latter case can be used whenembedding
is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
tau – A float scalar tensor, the softmax temperature.
straight_through (bool) – Whether to use straight through gradient between time steps. If True, a single token with highest probability (i.e., greedy sample) is fed to the next step and gradient is computed using straight through. If False (default), the soft Gumbel-softmax distribution is fed to the next step.
stop_gradient (bool) – Whether to stop the gradient backpropagation when feeding softmax vector to the next step.
use_finish (bool) – Whether to stop decoding once
end_token
is generated. If False, decoding will continue untilmax_decoding_length
of the decoder is reached.
- Raises
ValueError – if
start_tokens
is not a 1D tensor orend_token
is not a scalar.
get_helper¶
- texar.torch.modules.get_helper(helper_type, start_tokens=None, end_token=None, **kwargs)[source]¶
Creates a Helper instance.
- Parameters
helper_type – A
Helper
class, its name or module path, or a class instance. If a class instance is given, it is returned directly.start_tokens – 1D torch.LongTensor shaped
[batch_size]
, representing the start tokens for each sequence in batch.end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.
**kwargs – Additional keyword arguments for constructing the helper.
- Returns
A helper instance.
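For instance (a sketch; decoder, batch, and the BOS/EOS IDs are assumptions), a nucleus-sampling helper can be built by name and handed to any decoder that accepts a helper argument:

import torch

helper = get_helper(
    'TopPSampleEmbeddingHelper',
    start_tokens=torch.full_like(batch['length'], bos_token_id),
    end_token=eos_token_id,
    p=0.9,
    softmax_temperature=0.7)

outputs, _, lengths = decoder(helper=helper, max_decoding_length=60)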
Classifiers¶
BERTClassifier¶
- class texar.torch.modules.BERTClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on BERT modules. Please see
PretrainedBERTMixin
for a brief description of BERT.This is a combination of the
BERTEncoder
with a classification layer. Both step-wise classification and sequence-level classification are supported, specified inhparams
.Arguments are the same as in
BERTEncoder
.- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
bert-base-uncased
). Please refer toPretrainedBERTMixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in BertEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "bert_classifier"
}
Here:
Same hyperparameters as in
BERTEncoder
. See thedefault_hparams()
. An instance of BERTEncoder is created for feature extraction.Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit Dense layer constructor, except for argument “units” which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the first time step (which is the CLS token). Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “dropout”: float
The dropout rate of the BERT encoder output.
- “name”: str
Name of the classifier.
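A brief usage sketch (the input tensors are assumptions, e.g., token IDs produced by a BERT tokenizer): building a three-class sequence-level classifier and classifying a batch with forward(), documented below:

classifier = BERTClassifier(
    pretrained_model_name='bert-base-uncased',
    hparams={'num_classes': 3})

logits, preds = classifier(
    inputs=input_ids,               # [batch_size, max_time] token IDs
    sequence_length=input_lengths)  # [batch_size]
# logits: [batch_size, 3]; preds: [batch_size]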
- forward(inputs, sequence_length=None, segment_ids=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in
BERTEncoder
.- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
segment_ids (optional) – A 2D Tensor of shape [batch_size, max_time], containing the segment ids of tokens in input sequences. If None (default), a tensor with all elements set to zero is used.
- Returns
A tuple (logits, preds), containing the logits over classes and the predictions, respectively.
If
clas_strategy
iscls_time
orall_time
:If
num_classes
== 1,logits
andpred
are both of shape[batch_size]
.If
num_classes
> 1,logits
is of shape[batch_size, num_classes]
andpred
is of shape[batch_size]
.
If
clas_strategy
istime_wise
: If num_classes
== 1,logits
andpred
are both of shape[batch_size, max_time]
.If
num_classes
> 1,logits
is of shape[batch_size, max_time, num_classes]
andpred
is of shape[batch_size, max_time]
.
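Example (a minimal sketch): assumes the default pre-trained BERT weights can be downloaded; the token ids, lengths, and class count below are arbitrary placeholders.

import torch
from texar.torch.modules import BERTClassifier

clas = BERTClassifier(hparams={"num_classes": 5})
input_ids = torch.randint(0, 100, (8, 16))          # [batch_size, max_time]
lengths = torch.full((8,), 16, dtype=torch.long)
logits, preds = clas(input_ids, sequence_length=lengths)
# With the default "cls_time" strategy and num_classes=5:
#   logits.shape == [8, 5], preds.shape == [8]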
RoBERTaClassifier¶
- class texar.torch.modules.RoBERTaClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on RoBERTa modules. Please see
PretrainedRoBERTaMixin
for a brief description of RoBERTa.This is a combination of the
RoBERTaEncoder
with a classification layer. Both step-wise classification and sequence-level classification are supported, specified inhparams
.Arguments are the same as in
RoBERTaEncoder
.- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
roberta-base
). Please refer toPretrainedRoBERTaMixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in RoBERTaEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "roberta_classifier"
}
Here:
Same hyperparameters as in
RoBERTaEncoder
. See thedefault_hparams()
. An instance of RoBERTaEncoder is created for feature extraction.Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit Dense layer constructor, except for argument “units” which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the first time step (which is the CLS token). Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “dropout”: float
The dropout rate of the RoBERTa encoder output.
- “name”: str
Name of the classifier.
- forward(inputs, sequence_length=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in
RoBERTaEncoder
.- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A tuple (logits, preds), containing the logits over classes and the predictions, respectively.
If
clas_strategy
iscls_time
orall_time
:If
num_classes
== 1,logits
andpred
are both of shape[batch_size]
.If
num_classes
> 1,logits
is of shape[batch_size, num_classes]
andpred
is of shape[batch_size]
.
If
clas_strategy
istime_wise
: If num_classes
== 1,logits
andpred
are both of shape[batch_size, max_time]
.If
num_classes
> 1,logits
is of shape[batch_size, max_time, num_classes]
andpred
is of shape[batch_size, max_time]
.
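Example (a minimal sketch of step-wise classification, i.e. one prediction per token): assumes the default pre-trained RoBERTa weights can be downloaded; the token ids and sizes below are placeholders.

import torch
from texar.torch.modules import RoBERTaClassifier

tagger = RoBERTaClassifier(
    hparams={"clas_strategy": "time_wise", "num_classes": 7})
input_ids = torch.randint(0, 100, (4, 12))          # [batch_size, max_time]
logits, preds = tagger(input_ids)
# With "time_wise": logits.shape == [4, 12, 7], preds.shape == [4, 12]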
GPT2Classifier¶
- class texar.torch.modules.GPT2Classifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on GPT2 modules. Please see
PretrainedGPT2Mixin
for a brief description of GPT2.This is a combination of the
GPT2Encoder
with a classification layer. Both step-wise classification and sequence-level classification are supported, specified inhparams
.Arguments are the same as in
GPT2Encoder
.- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
gpt2-small
). Please refer toPretrainedGPT2Mixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in GPT2Encoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "gpt2_classifier"
}
Here:
Same hyperparameters as in
GPT2Encoder
. See thedefault_hparams()
. An instance of GPT2Encoder is created for feature extraction.Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit Dense layer constructor, except for argument “units” which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the last time step. Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “dropout”: float
The dropout rate of the GPT2 encoder output.
- “name”: str
Name of the classifier.
- forward(inputs, sequence_length=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in
GPT2Encoder
.- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A tuple (logits, preds), containing the logits over classes and the predictions, respectively.
If
clas_strategy
iscls_time
orall_time
:If
num_classes
== 1,logits
andpred
are both of shape[batch_size]
.If
num_classes
> 1,logits
is of shape[batch_size, num_classes]
andpred
is of shape[batch_size]
.
If
clas_strategy
istime_wise
:If
num_classes
== 1,logits
andpred
are both of shape[batch_size, max_time]
.If
num_classes
> 1,logits
is of shape[batch_size, max_time, num_classes]
andpred
is of shape[batch_size, max_time]
.
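Example (a minimal sketch of binary classification, i.e. num_classes == 1): assumes the default pre-trained GPT2 weights can be downloaded; the token ids and sizes below are placeholders.

import torch
from texar.torch.modules import GPT2Classifier

clas = GPT2Classifier(hparams={"num_classes": 1})
input_ids = torch.randint(0, 100, (4, 16))          # [batch_size, max_time]
logits, preds = clas(input_ids)
# With num_classes == 1: logits.shape == [4], preds.shape == [4]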
UnidirectionalRNNClassifier¶
- class texar.torch.modules.UnidirectionalRNNClassifier(input_size, cell=None, output_layer=None, hparams=None)[source]¶
One directional RNN classifier. This is a combination of the
UnidirectionalRNNEncoder
with a classification layer. Both step-wise classification and sequence-level classification are supported, specified inhparams
.Arguments are the same as in
UnidirectionalRNNEncoder
.- Parameters
input_size (int) – The number of expected features in the input for the cell.
cell – (RNNCell, optional) If not specified, a cell is created as specified in
hparams["rnn_cell"]
.output_layer (optional) – An instance of torch.nn.Module. Applies to the RNN cell output of each step. If None (default), the output layer is created as specified in
hparams["output_layer"]
.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in UnidirectionalRNNEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "final_time",
    "max_seq_length": None,
    "name": "unidirectional_rnn_classifier"
}
Here:
Same hyperparameters as in
UnidirectionalRNNEncoder
. See thedefault_hparams()
. An instance of UnidirectionalRNNEncoder is created for feature extraction.Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit Dense layer constructor, except for argument “units” which is set to num_classes. Ignored if no extra logit layer is appended.
- “clas_strategy”: str
The classification strategy, one of:
final_time: Sequence-level classification based on the output of the final time step. Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “max_seq_length”: int, optional
Maximum possible length of input sequences. Required if clas_strategy is all_time.
- “name”: str
Name of the classifier.
- forward(inputs, sequence_length=None, initial_state=None, time_major=False)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in
UnidirectionalRNNEncoder
.- Parameters
inputs – A 3D Tensor of shape
[batch_size, max_time, dim]
. The first two dimensionsbatch_size
andmax_time
are exchanged iftime_major
is True.sequence_length (optional) – A 1D torch.LongTensor of shape
[batch_size]
. Sequence lengths of the batch inputs. Used to copy-through state and zero-out outputs when past a batch element’s sequence length.initial_state (optional) – Initial state of the RNN.
time_major (bool) – The shape format of the
inputs
andoutputs
Tensors. If True, these tensors are of shape[max_time, batch_size, depth]
. If False (default), these tensors are of shape[batch_size, max_time, depth]
.
- Returns
A tuple (logits, preds), containing the logits over classes and the predictions, respectively.
If
clas_strategy
isfinal_time
orall_time
:If
num_classes
== 1,logits
andpred
are both of shape[batch_size]
.If
num_classes
> 1,logits
is of shape[batch_size, num_classes]
andpred
is of shape[batch_size]
.
If
clas_strategy
istime_wise
: If num_classes
== 1,logits
andpred
are both of shape[batch_size, max_time]
.If
num_classes
> 1,logits
is of shape[batch_size, max_time, num_classes]
andpred
is of shape[batch_size, max_time]
.If
time_major
is True, the batch and time dimensions are exchanged.
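Example (a minimal sketch): the inputs are already-embedded features whose last dimension equals input_size; all sizes below are placeholders.

import torch
from texar.torch.modules import UnidirectionalRNNClassifier

clas = UnidirectionalRNNClassifier(input_size=100,
                                   hparams={"num_classes": 4})
inputs = torch.randn(32, 10, 100)                   # [batch_size, max_time, dim]
lengths = torch.randint(1, 11, (32,))
logits, preds = clas(inputs, sequence_length=lengths)
# With the default "final_time" strategy:
#   logits.shape == [32, 4], preds.shape == [32]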
Conv1DClassifier¶
- class texar.torch.modules.Conv1DClassifier(in_channels, in_features=None, hparams=None)[source]¶
Simple Conv-1D classifier. This is a combination of the
Conv1DEncoder
with a classification layer.- Parameters
in_channels (int) – Number of channels in the input tensor.
in_features (int) – Size of the feature dimension in the input tensor.
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
See
forward()
for the inputs and outputs. If"data_format"
is set to"channels_first"
(this is the default), inputs must be a tensor of shape [batch_size, channels, length]. If"data_format"
is set to"channels_last"
, inputs must be a tensor of shape [batch_size, length, channels]. For example, for sequence classification, length corresponds to time steps, and channels corresponds to embedding dim.Example:
inputs = torch.randn([64, 20, 256])

clas = Conv1DClassifier(in_channels=20, in_features=256,
                        hparams={'num_classes': 10})

logits, pred = clas(inputs)
# logits == Tensor of shape [64, 10]
# pred   == Tensor of shape [64]
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in Conv1DEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": {
        "use_bias": False
    },
    "name": "conv1d_classifier"
}
Here:
Same hyperparameters as in
Conv1DEncoder
. See thedefault_hparams()
. An instance ofConv1DEncoder
is created for feature extraction.Additional hyperparameters:
- “num_classes”: int
Number of classes:
If > 0, an additional torch.nn.Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be equal to
out_features
of the final dense layer of the encoder.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor, except for argument
out_features
which is set to"num_classes"
. Ignored if no extra logit layer is appended.- “name”: str
Name of the classifier.
- forward(input, sequence_length=None, dtype=None, data_format=None)[source]¶
Feeds the inputs through the network and makes classification.
The arguments are the same as in
Conv1DEncoder
.The predictions of binary classification (
num_classes
=1) and multi-way classification (num_classes
>1) are different, as explained below.- Parameters
input – The inputs to the network, which is a 3D tensor. See
Conv1DEncoder
for more details.sequence_length (optional) – An int tensor of shape [batch_size] or a python array containing the length of each element in
inputs
. If given, time steps beyond the length will first be masked out before feeding to the layers.dtype (optional) – Type of the inputs. If not provided, infers from inputs automatically.
data_format (optional) – Data format of the input tensor. If
channels_last
, the last dimension will be treated as channel dimension so the size of theinput
should be [batch_size, X, channel]. Ifchannels_first
, first dimension will be treated as channel dimension so the size should be [batch_size, channel, X]. Defaults to None. If None, the value will be picked from hyperparameters.
- Returns
A tuple
(logits, pred)
, wherelogits
is a torch.Tensor of shape[batch_size, num_classes]
fornum_classes
>1, and[batch_size]
fornum_classes
=1 (i.e., binary classification).pred
is the prediction, a torch.LongTensor of shape[batch_size]
. For binary classification, the standard sigmoid function is used for prediction, and the class labels are{0, 1}
.
- property num_classes¶
The number of classes.
- property encoder¶
The classifier neural network.
- has_layer(layer_name)[source]¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)[source]¶
Returns the layer with the name. Returns None if the layer name does not exist.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.
XLNetClassifier¶
- class texar.torch.modules.XLNetClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Classifier based on XLNet modules. Please see
PretrainedXLNetMixin
for a brief description of XLNet.Arguments are the same as in
XLNetEncoder
.- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
xlnet-base-cased
). Please refer toPretrainedXLNetMixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in XLNetEncoder
    ...
    # (2) Additional hyperparameters
    "clas_strategy": "cls_time",
    "use_projection": True,
    "num_classes": 2,
    "name": "xlnet_classifier",
}
Here:
- Same hyperparameters as in
XLNetEncoder
. See thedefault_hparams()
. An instance of XLNetEncoder is created for feature extraction.
Additional hyperparameters:
- “clas_strategy”: str
The classification strategy, one of:
cls_time: Sequence-level classification based on the output of the last time step (which is the CLS token). Each sequence has a class.
all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.
time_wise: Step-wise classification, i.e., make classification for each time step based on its output.
- “use_projection”: bool
If True, an additional Linear layer is added after the summary step.
- “num_classes”: int
Number of classes:
If > 0, an additional torch.nn.Linear layer is appended to the encoder to compute the logits over classes.
If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.
- “name”: str
Name of the classifier.
- param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]¶
Create parameter groups for optimizers. When
lr_layer_decay_rate
is not 1.0, parameters from each layer form separate groups with different base learning rates.The return value of this method can be used in the constructor of optimizers, for example:
model = XLNetClassifier(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)
- Parameters
lr (float) – The learning rate. Can be omitted if
lr_layer_decay_rate
is 1.0.lr_layer_scale (float) – Per-layer LR scaling rate. The i-th layer will be scaled by lr_layer_scale ^ (num_layers - i - 1).
decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.
- Returns
The parameter groups, used as the first argument for optimizers.
- forward(inputs, segment_ids=None, input_mask=None)[source]¶
Feeds the inputs through the network and makes classification.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids – Shape [batch_size, max_time].
input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.
- Returns
A tuple (logits, preds), containing the logits over classes and the predictions, respectively.
If
clas_strategy
iscls_time
orall_time
:If
num_classes
== 1,logits
andpred
are both of shape[batch_size]
.If
num_classes
> 1,logits
is of shape[batch_size, num_classes]
andpred
is of shape[batch_size]
.
If
clas_strategy
istime_wise
: If num_classes
== 1,logits
andpred
are both of shape[batch_size, max_time]
.If
num_classes
> 1,logits
is of shape[batch_size, max_time, num_classes]
andpred
is of shape[batch_size, max_time]
.
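Example (a minimal sketch of forward()): assumes the default pre-trained XLNet weights can be downloaded; the ids and the mask below are placeholders. Recall that positions with value 1 in input_mask are masked out.

import torch
from texar.torch.modules import XLNetClassifier

clas = XLNetClassifier(hparams={"num_classes": 2})
input_ids = torch.randint(0, 100, (4, 16))          # [batch_size, max_time]
input_mask = torch.zeros(4, 16)                     # nothing masked out
logits, preds = clas(input_ids, input_mask=input_mask)
# With the default "cls_time" strategy:
#   logits.shape == [4, 2], preds.shape == [4]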
Regressors¶
XLNetRegressor¶
- class texar.torch.modules.XLNetRegressor(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Regressor based on XLNet modules. Please see
PretrainedXLNetMixin
for a brief description of XLNet.Arguments are the same as in
XLNetEncoder
.- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
xlnet-base-cased
). Please refer toPretrainedXLNetMixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Same hyperparameters as in XLNetEncoder
    ...
    # (2) Additional hyperparameters
    "regr_strategy": "cls_time",
    "use_projection": True,
    "logit_layer_kwargs": None,
    "name": "xlnet_regressor",
}
Here:
Same hyperparameters as in
XLNetEncoder
. See thedefault_hparams()
. An instance of XLNetEncoder is created for feature extraction.Additional hyperparameters:
- “regr_strategy”: str
The regression strategy, one of:
cls_time: Sequence-level regression based on the output of the first time step (which is the CLS token). Each sequence has a prediction.
all_time: Sequence-level regression based on the output of all time steps. Each sequence has a prediction.
time_wise: Step-wise regression, i.e., make regression for each time step based on its output.
- “logit_layer_kwargs”: dict
Keyword arguments for the logit torch.nn.Linear layer constructor. Ignored if no extra logit layer is appended.
- “use_projection”: bool
If True, an additional torch.nn.Linear layer is added after the summary step.
- “name”: str
Name of the regressor.
- param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]¶
Create parameter groups for optimizers. When
lr_layer_decay_rate
is not 1.0, parameters from each layer form separate groups with different base learning rates.The return value of this method can be used in the constructor of optimizers, for example:
model = XLNetRegressor(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)
- Parameters
lr (float) – The learning rate. Can be omitted if
lr_layer_decay_rate
is 1.0.lr_layer_scale (float) – Per-layer LR scaling rate. The i-th layer will be scaled by lr_layer_scale ^ (num_layers - i - 1).
decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.
- Returns
The parameter groups, used as the first argument for optimizers.
- forward(inputs, segment_ids=None, input_mask=None)[source]¶
Feeds the inputs through the network and makes regression.
- Parameters
inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.
segment_ids – Shape [batch_size, max_time].
input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.
- Returns
Regression predictions.
If
regr_strategy
iscls_time
orall_time
, predictions have shape [batch_size].If
regr_strategy
istime_wise
, predictions have shape [batch_size, max_time].
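Example (a minimal sketch): assumes the default pre-trained XLNet weights can be downloaded; the token ids below are placeholders.

import torch
from texar.torch.modules import XLNetRegressor

regressor = XLNetRegressor()
input_ids = torch.randint(0, 100, (4, 16))          # [batch_size, max_time]
preds = regressor(input_ids)
# With the default "cls_time" strategy: preds.shape == [4]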
EncoderDecoders¶
T5EncoderDecoder¶
- class texar.torch.modules.T5EncoderDecoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
The pre-trained T5 model. Please see
PretrainedT5Mixin
for a brief description of T5.This module basically stacks
WordEmbedder
,T5Encoder
, andT5Decoder
.- Parameters
pretrained_model_name (optional) – a str, the name of pre-trained model (e.g.,
T5-Small
). Please refer toPretrainedT5Mixin
for all supported models. If None, the model name inhparams
is used.cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (
texar_data
folder under user’s home directory) will be used.hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- reset_parameters()[source]¶
Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
The model arch is determined by the constructor argument
pretrained_model_name
if it’s specified. In this case, hparams are ignored.Otherwise, the model arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.
{ "pretrained_model_name": "T5-Small", "embed": { "dim": 768, "name": "word_embeddings" }, "vocab_size": 32128, "encoder": { "dim": 768, "embedding_dropout": 0.1, "multihead_attention": { "dropout_rate": 0.1, "name": "self", "num_heads": 12, "num_units": 768, "output_dim": 768, "use_bias": False, "is_decoder": False, "relative_attention_num_buckets": 32, }, "eps": 1e-6, "name": "encoder", "num_blocks": 12, "poswise_feedforward": { "layers": [ { "kwargs": { "in_features": 768, "out_features": 3072, "bias": False }, "type": "Linear" }, {"type": "ReLU"}, { "kwargs": { "in_features": 3072, "out_features": 768, "bias": False }, "type": "Linear" } ] }, "residual_dropout": 0.1, }, "decoder": { "eps": 1e-6, "dim": 768, "embedding_dropout": 0.1, "multihead_attention": { "dropout_rate": 0.1, "name": "self", "num_heads": 12, "num_units": 768, "output_dim": 768, "use_bias": False, "is_decoder": True, "relative_attention_num_buckets": 32, }, "name": "decoder", "num_blocks": 12, "poswise_feedforward": { "layers": [ { "kwargs": { "in_features": 768, "out_features": 3072, "bias": False }, "type": "Linear" }, {"type": "ReLU"}, { "kwargs": { "in_features": 3072, "out_features": 768, "bias": False }, "type": "Linear" } ] }, "residual_dropout": 0.1, }, "hidden_size": 768, "initializer": None, "name": "t5_encoder_decoder", }
Here:
The default parameters are values for T5-Small model.
- “pretrained_model_name”: str or None
The name of the pre-trained T5 model. If None, the model will be randomly initialized.
- “embed”: dict
Hyperparameters for word embedding layer.
- “vocab_size”: int
The vocabulary size of inputs in T5 model.
- “encoder”: dict
Hyperparameters for the T5Encoder. See
default_hparams()
for details.- “decoder”: dict
Hyperparameters for the T5Decoder. See
default_hparams()
for details.- “hidden_size”: int
Size of the hidden layer.
- “initializer”: dict, optional
Hyperparameters of the default initializer that initializes variables created in this module. See
get_initializer()
for details.- “name”: str
Name of the module.
- forward(inputs, sequence_length=None)[source]¶
Performs encoding and decoding.
- Parameters
inputs – Either a 2D Tensor of shape
[batch_size, max_time]
, containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.sequence_length – A 1D torch.Tensor of shape
[batch_size]
. Input tokens beyond respective sequence lengths are masked out automatically.
- Returns
A pair
(encoder_output, decoder_output)
encoder_output
: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.decoder_output
: An instance ofTransformerDecoderOutput
which contains sample_id and logits.
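Example (a minimal sketch): assumes the default T5-Small weights can be downloaded; the token ids and lengths below are placeholders.

import torch
from texar.torch.modules import T5EncoderDecoder

model = T5EncoderDecoder()
input_ids = torch.randint(0, 100, (2, 8))           # [batch_size, max_time]
lengths = torch.full((2,), 8, dtype=torch.long)
encoder_output, decoder_output = model(input_ids, sequence_length=lengths)
# encoder_output: [2, 8, 768] for T5-Small
# decoder_output.logits and decoder_output.sample_id hold the decoding results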
Pre-trained¶
PretrainedMixin¶
- class texar.torch.modules.PretrainedMixin(hparams=None)[source]¶
A mixin class for all pre-trained classes to inherit.
- load_pretrained_config(pretrained_model_name=None, cache_dir=None, hparams=None)[source]¶
Load paths and configurations of the pre-trained model.
- Parameters
pretrained_model_name (optional) – A str with the name of a pre-trained model to load. If None, will use the model name in
hparams
.cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory will be used.
hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- reset_parameters()[source]¶
Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "pretrained_model_name": None, "name": "pretrained_base" }
- classmethod download_checkpoint(pretrained_model_name, cache_dir=None)[source]¶
Download the specified pre-trained checkpoint, and return the directory in which the checkpoint is cached.
- abstract classmethod _transform_config(pretrained_model_name, cache_dir)[source]¶
Load the official configuration file and transform it into Texar-style hyperparameters.
PretrainedBERTMixin¶
- class texar.torch.modules.PretrainedBERTMixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the BERT model.
Both standard BERT models and many domain specific BERT-based models are supported. You can specify the
pretrained_model_name
argument to pick which pre-trained BERT model to use. All available categories of pre-trained models (and names) include:Standard BERT: proposed in (Devlin et al. 2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . A bidirectional Transformer language model pre-trained on large text corpora. Available model names include:
bert-base-uncased
: 12-layer, 768-hidden, 12-heads, 110M parameters.bert-large-uncased
: 24-layer, 1024-hidden, 16-heads, 340M parameters.bert-base-cased
: 12-layer, 768-hidden, 12-heads , 110M parameters.bert-large-cased
: 24-layer, 1024-hidden, 16-heads, 340M parameters.bert-base-multilingual-uncased
: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters.bert-base-multilingual-cased
: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters.bert-base-chinese
: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters.
BioBERT: proposed in (Lee et al. 2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining . A domain specific language representation model pre-trained on large-scale biomedical corpora. Based on the BERT architecture, BioBERT effectively transfers the knowledge from a large amount of biomedical texts to biomedical text mining models with minimal task-specific architecture modifications. Available model names include:
biobert-v1.0-pmc
: BioBERT v1.0 (+ PMC 270K) - based on BERT-base-Cased (same vocabulary).biobert-v1.0-pubmed-pmc
: BioBERT v1.0 (+ PubMed 200K + PMC 270K) - based on BERT-base-Cased (same vocabulary).biobert-v1.0-pubmed
: BioBERT v1.0 (+ PubMed 200K) - based on BERT-base-Cased (same vocabulary).biobert-v1.1-pubmed
: BioBERT v1.1 (+ PubMed 1M) - based on BERT-base-Cased (same vocabulary).
SciBERT: proposed in (Beltagy et al. 2019) SciBERT: A Pretrained Language Model for Scientific Text. A BERT model trained on scientific text. SciBERT leverages unsupervised pre-training on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. Available model names include:
scibert-scivocab-uncased
: Uncased version of the model trained on its own vocabulary.scibert-scivocab-cased
: Cased version of the model trained on its own vocabulary.scibert-basevocab-uncased
: Uncased version of the model trained on the original BERT vocabulary.scibert-basevocab-cased
: Cased version of the model trained on the original BERT vocabulary.
SpanBERT: proposed in (Joshi et al. 2019) SpanBERT: Improving Pre-training by Representing and Predicting Spans. As a variant of the standard BERT model, SpanBERT extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Differing from the standard BERT, the SpanBERT model does not use segmentation embedding. Available model names include:
spanbert-base-cased
: SpanBERT using the BERT-base architecture, 12-layer, 768-hidden, 12-heads , 110M parameters.spanbert-large-cased
: SpanBERT using the BERT-large architecture, 24-layer, 1024-hidden, 16-heads, 340M parameters.
We provide the following BERT classes:
BERTEncoder
for text encoding.BERTClassifier
for text classification and sequence tagging.
PretrainedRoBERTaMixin¶
- class texar.torch.modules.PretrainedRoBERTaMixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the RoBERTa model.
The RoBERTa model was proposed in (Liu et al. 2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. As a variant of the standard BERT model, RoBERTa trains for more iterations on more data with a larger batch size as well as other tweaks in pre-training. Differing from the standard BERT, the RoBERTa model does not use segmentation embedding. Available model names include:
roberta-base
: RoBERTa using the BERT-base architecture, 125M parameters.roberta-large
: RoBERTa using the BERT-large architecture, 355M parameters.
We provide the following RoBERTa classes:
RoBERTaEncoder
for text encoding.RoBERTaClassifier
for text classification and sequence tagging.
PretrainedGPT2Mixin¶
- class texar.torch.modules.PretrainedGPT2Mixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the GPT2 model.
The GPT2 model was proposed in Language Models are Unsupervised Multitask Learners by Radford et al. from OpenAI. It is a unidirectional Transformer model pre-trained using the vanilla language modeling objective on a large corpus.
The available GPT2 models are as follows:
gpt2-small
: Small version of GPT-2, 124M parameters.gpt2-medium
: Medium version of GPT-2, 355M parameters.gpt2-large
: Large version of GPT-2, 774M parameters.gpt2-xl
: XL version of GPT-2, 1558M parameters.
We provide the following GPT2 classes:
GPT2Encoder
for text encoding.GPT2Decoder
for text generation and decoding.GPT2Classifier
for text classification and sequence tagging.
PretrainedXLNetMixin¶
- class texar.torch.modules.PretrainedXLNetMixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the XLNet model.
The XLNet model was proposed in XLNet: Generalized Autoregressive Pretraining for Language Understanding by Yang et al. It is based on the Transformer-XL model, pre-trained on a large corpus using a language modeling objective that considers all permutations of the input sentence.
The available XLNet models are as follows:
xlnet-base-cased
: 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).xlnet-large-cased
: 24-layer, 1024-hidden, 16-heads.
We provide the following XLNet classes:
XLNetEncoder
for text encoding.XLNetDecoder
for text generation and decoding.XLNetClassifier
for text classification and sequence tagging.XLNetRegressor
for text regression.
PretrainedT5Mixin¶
- class texar.torch.modules.PretrainedT5Mixin(hparams=None)[source]¶
A mixin class to support loading pre-trained checkpoints for modules that implement the T5 model.
The T5 model was proposed in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Raffel et al. from Google. It treats multiple NLP tasks in a similar manner by encoding the different tasks as text directives in the input stream. This enables a single model to be trained supervised on a wide variety of NLP tasks. The T5 model examines factors relevant for leveraging transfer learning at scale from pure unsupervised pre-training to supervised tasks.
The available T5 models are as follows:
T5-Small
: Small version of T5, 60 million parameters.T5-Base
: Base-line version of T5, 220 million parameters.T5-Large
: Large Version of T5, 770 million parameters.T5-3B
: A version of T5 with 3 billion parameters.T5-11B
: A version of T5 with 11 billion parameters.
We provide the following classes:
T5Encoder
for loading weights for the encoder stack.T5Decoder
for loading weights for the decoding stack.T5EncoderDecoder
as a raw pre-trained model.
Connectors¶
ConnectorBase¶
- class texar.torch.modules.ConnectorBase(output_size, hparams=None)[source]¶
Base class inherited by all connector classes. A connector transforms inputs into outputs with a specified structure and shape, for example, transforming the final state of an encoder into the initial state of a decoder, optionally performing stochastic sampling in between as in Variational Autoencoders (VAEs).
- Parameters
output_size – Size of output excluding the batch dimension. For example, set
output_size
todim
to generate output of shape[batch_size, dim]
. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Sizes. For example, to transform inputs to have decoder state size, setoutput_size=decoder.state_size
.hparams (dict, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
- property output_size¶
The feature size of
forward()
output tensor(s), usually it is equal to the last dimension value of the output tensor size.
ConstantConnector¶
- class texar.torch.modules.ConstantConnector(output_size, hparams=None)[source]¶
Creates a constant tensor or (nested) tuple of Tensors that contains a constant value.
- Parameters
output_size – Size of output excluding the batch dimension. For example, set
output_size
todim
to generate output of shape[batch_size, dim]
. Can be anint
, a tuple ofint
, atorch.Size
, or a tuple oftorch.Size
. For example, to transform inputs to have decoder state size, setoutput_size=decoder.state_size
. Ifoutput_size
is a tuple(1, 2, 3)
, then the output structure will be([batch_size * 1], [batch_size * 2], [batch_size * 3])
. Ifoutput_size
istorch.Size([1, 2, 3])
, then the output structure will be[batch_size, 1, 2, 3]
.hparams (dict, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
This connector does not have trainable parameters.
Example
state_size = (1, 2, 3)
connector = ConstantConnector(state_size, hparams={"value": 1.})
one_state = connector(batch_size=64)
# `one_state` structure: (Tensor_1, Tensor_2, Tensor_3),
# Tensor_1.size() == torch.Size([64, 1])
# Tensor_2.size() == torch.Size([64, 2])
# Tensor_3.size() == torch.Size([64, 3])
# Tensors are filled with 1.0.

size = torch.Size([1, 2, 3])
connector_size = ConstantConnector(size, hparams={"value": 2.})
size_state = connector_size(batch_size=64)
# `size_state` structure: Tensor with size [64, 1, 2, 3].
# Tensor is filled with 2.0.
ForwardConnector¶
- class texar.torch.modules.ForwardConnector(output_size, hparams=None)[source]¶
Transforms inputs to have specified structure.
Example:
state_size = namedtuple('LSTMStateTuple', ['h', 'c'])(256, 256)
# state_size == LSTMStateTuple(c=256, h=256)

connector = ForwardConnector(state_size)
output = connector([tensor_1, tensor_2])
# output == LSTMStateTuple(c=tensor_1, h=tensor_2)
- Parameters
output_size – Size of output excluding the batch dimension. For example, set
output_size
todim
to generate output of shape[batch_size, dim]
. Can be anint
, a tuple ofint
, atorch.Size
, or a tuple oftorch.Size
. For example, to transform inputs to have decoder state size, setoutput_size=decoder.state_size
.hparams (dict, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
This connector does not have trainable parameters. See
forward()
for the inputs and outputs of the connector. The input to the connector must have the same structure withoutput_size
, or must have the same number of elements and be re-packable into the structure ofoutput_size
. Note that if input is or contains adict
instance, the keys will be sorted to pack in deterministic order (Seepack_sequence_as()
).- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "name": "forward_connector" }
Here:
- “name”: str
Name of the connector.
- forward(inputs)[source]¶
Transforms inputs to have the same structure as with
output_size
. Values of the inputs are not changed.inputs
must either have the same structure, or have the same number of elements withoutput_size
.- Parameters
inputs – The input (structure of) tensor to pass forward.
- Returns
A (structure of) tensors that re-packs
inputs
to have the specified structure ofoutput_size
.
MLPTransformConnector¶
- class texar.torch.modules.MLPTransformConnector(output_size, linear_layer_dim, hparams=None)[source]¶
Transforms inputs with an MLP layer and packs the results into the specified structure and size.
Example
cell = LSTMCell(num_units=256)
# cell.state_size == LSTMStateTuple(c=256, h=256)

connector = MLPTransformConnector(cell.state_size, linear_layer_dim=10)
inputs = torch.zeros([64, 10])
output = connector(inputs)
# output == LSTMStateTuple(c=tensor_of_shape_(64, 256),
#                          h=tensor_of_shape_(64, 256))
## Use to connect encoder and decoder with different state size
encoder = UnidirectionalRNNEncoder(...)
_, final_state = encoder(inputs=...)

decoder = BasicRNNDecoder(...)

connector = MLPTransformConnector(decoder.state_size)

_ = decoder(
    initial_state=connector(final_state),
    ...)
- Parameters
output_size – Size of output excluding the batch dimension. For example, set
output_size
todim
to generate output of shape[batch_size, dim]
. Can be anint
, a tuple ofint
, atorch.Size
, or a tuple oftorch.Size
. For example, to transform inputs to have decoder state size, setoutput_size=decoder.state_size
linear_layer_dim (int) – The size of the final dimension of the input tensors, i.e., the input dimension of the MLP linear layer.
hparams (dict, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
The input to the connector can have arbitrary structure and size.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "activation_fn": "texar.torch.core.layers.identity", "name": "mlp_connector" }
Here:
- “activation_fn”: str or callable
The activation function applied to the outputs of the MLP transformation layer. Can be a function, or its name or module path.
- “name”: str
Name of the connector.
- forward(inputs)[source]¶
Transforms inputs with an MLP layer and packs the results to have the same structure as specified by
output_size
.- Parameters
inputs – Input (structure of) tensors to be transformed. Must be a tensor of shape
[batch_size, ...]
or a (nested) tuple of such Tensors. That is, the first dimension of (each) tensor must be the batch dimension.- Returns
A tensor or a (nested) tuple of tensors of the same structure of
output_size
.
Networks¶
FeedForwardNetworkBase¶
- class texar.torch.modules.FeedForwardNetworkBase(hparams=None)[source]¶
Base class inherited by all feed-forward network classes.
- Parameters
hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
See
forward()
for the inputs and outputs.- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "name": "NN" }
- forward(input)[source]¶
Feeds forward inputs through the network layers and returns outputs.
- Parameters
input – The inputs to the network. The requirements on inputs depends on the first layer and subsequent layers in the network.
- Returns
The output of the network.
- append_layer(layer)[source]¶
Appends a layer to the end of the network.
- Parameters
layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.
- has_layer(layer_name)[source]¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)[source]¶
Returns the layer with the name. Returns None if the layer name does not exist.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.
FeedForwardNetwork¶
- class texar.torch.modules.FeedForwardNetwork(layers=None, hparams=None)[source]¶
Feed-forward neural network that consists of a sequence of layers.
- Parameters
layers (list, optional) – A list of torch.nn.Module instances composing the network. If not given, layers are created according to
hparams
.hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
See
forward()
for the inputs and outputs.Example
hparams = {  # Builds a two-layer dense NN
    "layers": [
        {"type": "Linear", "kwargs": {"in_features": 100, "out_features": 256}},
        {"type": "Linear", "kwargs": {"in_features": 256, "out_features": 10}},
    ]
}
nn = FeedForwardNetwork(hparams=hparams)

inputs = torch.randn([64, 100])
outputs = nn(inputs)
# outputs == Tensor of shape [64, 10]
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{ "layers": [], "name": "NN" }
Here:
- “layers”: list
A list of layer hyperparameters. See
get_layer()
for details on layer hyperparameters.- “name”: str
Name of the network.
- forward(input)¶
Feeds forward inputs through the network layers and returns outputs.
- Parameters
input – The inputs to the network. The requirements on inputs depends on the first layer and subsequent layers in the network.
- Returns
The output of the network.
- append_layer(layer)¶
Appends a layer to the end of the network.
- Parameters
layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.
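Example (a minimal sketch of growing a network layer by layer): the layer sizes below are placeholders; the dict form follows the get_layer() hyperparameter format used elsewhere in this document.

import torch
from texar.torch.modules import FeedForwardNetwork

nn = FeedForwardNetwork()                 # starts with no layers
nn.append_layer({"type": "Linear",
                 "kwargs": {"in_features": 100, "out_features": 64}})
nn.append_layer(torch.nn.ReLU())          # a torch.nn.Module instance also works
outputs = nn(torch.randn(32, 100))
# outputs.shape == [32, 64]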
- has_layer(layer_name)¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)¶
Returns the layer with the name. Returns None if the layer name does not exist.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.
Conv1DNetwork¶
- class texar.torch.modules.Conv1DNetwork(in_channels, in_features=None, hparams=None)[source]¶
Simple Conv-1D network which consists of a sequence of convolutional layers followed by a sequence of dense layers.
- Parameters
in_channels (int) – Number of channels in the input tensor.
in_features (int) – Size of the feature dimension in the input tensor.
hparams (dict, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See
default_hparams()
for the hyperparameter structure and default values.
See
forward()
for the inputs and outputs. If"data_format"
is set to"channels_first"
(this is the default), inputs must be a tensor of shape [batch_size, channels, length]. If"data_format"
is set to"channels_last"
, inputs must be a tensor of shape [batch_size, length, channels]. For example, for sequence classification, length corresponds to time steps, and channels corresponds to embedding dim.Example:
nn = Conv1DNetwork(in_channels=20, in_features=256)  # Use the default hyperparameters

inputs = torch.randn([64, 20, 256])
outputs = nn(inputs)
# outputs == Tensor of shape [64, 256], because the final dense layer
# has size 256.
- static default_hparams()[source]¶
Returns a dictionary of hyperparameters with default values.
{
    # (1) Conv layers
    "num_conv_layers": 1,
    "out_channels": 128,
    "kernel_size": [3, 4, 5],
    "conv_activation": "ReLU",
    "conv_activation_kwargs": None,
    "other_conv_kwargs": {},
    "data_format": "channels_first",
    # (2) Pooling layers
    "pooling": "MaxPool1d",
    "pool_size": None,
    "pool_stride": 1,
    "other_pool_kwargs": {},
    # (3) Dense layers
    "num_dense_layers": 1,
    "out_features": 256,
    "dense_activation": None,
    "dense_activation_kwargs": None,
    "final_dense_activation": None,
    "final_dense_activation_kwargs": None,
    "other_dense_kwargs": None,
    # (4) Dropout
    "dropout_conv": [1],
    "dropout_dense": [],
    "dropout_rate": 0.75,
    # (5) Others
    "name": "conv1d_network"
}
Here:
For convolutional layers:
- “num_conv_layers”: int
Number of convolutional layers.
- “out_channels”: int or list
The number of out_channels in the convolution, i.e., the dimensionality of the output space.
If
"num_conv_layers"
> 1 and"out_channels"
is an int, all convolution layers will have the same number of output channels.If
"num_conv_layers"
> 1 and"out_channels"
is a list, the length must equal"num_conv_layers"
. The number of output channels of each convolution layer will be the corresponding element from this list.
- “kernel_size”: int or list
Lengths of 1D convolution windows.
If “num_conv_layers” = 1, this can also be a
int
list of arbitrary length denoting differently sized convolution windows. The number of output channels of each size is specified by"out_channels"
. For example, the default values will create 3 convolution layers, each of which has kernel size of 3, 4, and 5, respectively, and has output channel 128.If “num_conv_layers” > 1, this must be a list of length
"num_conv_layers"
. Each element can be anint
or aint
list of arbitrary length denoting the kernel size of each layer.
- “conv_activation”: str or callable
Activation applied to the output of the convolutional layers. Set to None to maintain a linear activation. See
get_layer()
for more details.- “conv_activation_kwargs”: dict, optional
Keyword arguments for the activation following the convolutional layer. See
get_layer()
for more details.- “other_conv_kwargs”: list or dict, optional
Other keyword arguments for torch.nn.Conv1d constructor, e.g.,
padding
.If a dict, the same dict is applied to all the convolution layers.
If a list, the length must equal
"num_conv_layers"
. This list can contain nested lists. If the convolution layer at index i has multiple kernel sizes, then the corresponding element of this list can also be a list of length equal to"kernel_size"
at index i. If the element at index i is instead a dict, then the same dict gets applied to all the convolution layers at index i.
- “data_format”: str, optional
Data format of the input tensor. Defaults to
channels_first
denoting the first dimension to be the channel dimension. Set it tochannels_last
to treat last dimension as the channel dimension. This argument can also be passed inforward
function, in which case the value specified here will be ignored.
For pooling layers:
- “pooling”: str or class or instance
Pooling layer after each of the convolutional layer(s). Can be a pooling layer class, its name or module path, or a class instance.
- “pool_size”: int or list, optional
Size of the pooling window. If an
int
, all pooling layer will have the same pool size. If a list, the list length must equal"num_conv_layers"
. If None and the pooling type is either MaxPool1d or AvgPool1d, the pool size will be set to input size. That is, the output of the pooling layer is a single unit.- “pool_stride”: int or list, optional
Strides of the pooling operation. If an
int
, all layers will have the same stride. If a list, the list length must equal"num_conv_layers"
.- “other_pool_kwargs”: list or dict, optional
Other keyword arguments for pooling layer class constructor.
If a dict, the same dict is applied to all the pooling layers.
If a list, the length must equal
"num_conv_layers"
. The pooling arguments for layer i will be the element at index i from this list.
For dense layers (note that here dense layers always follow convolutional and pooling layers):
- “num_dense_layers”: int
Number of dense layers.
- “out_features”: int or list
Dimension of features after the dense layers. If an
int
, all dense layers will have the same feature dimension. If a list ofint
, the list length must equal"num_dense_layers"
.- “dense_activation”: str or callable
Activation function applied to the output of the dense layers except the last dense layer output. Set to None to maintain a linear activation.
- “dense_activation_kwargs”: dict, optional
Keyword arguments for dense layer activation functions before the last dense layer.
- “final_dense_activation”: str or callable
Activation function applied to the output of the last dense layer. Set to None to maintain a linear activation.
- “final_dense_activation_kwargs”: dict, optional
Keyword arguments for the activation function of last dense layer.
- “other_dense_kwargs”: dict, optional
Other keyword arguments for dense layer class constructor.
For dropouts:
- “dropout_conv”: int or list
The indices of convolutional layers (starting from 0) whose inputs are applied with dropout. The index =
num_conv_layers
means dropout applies to the final convolutional layer output. For example,{ "num_conv_layers": 2, "dropout_conv": [0, 2] }
will lead to a series of layers as -dropout-conv0-conv1-dropout-.
The dropout mode (training or not) is controlled by
self.training
.- “dropout_dense”: int or list
Same as
"dropout_conv"
but applied to dense layers (index starting from 0).- “dropout_rate”: float
The dropout rate, between 0 and 1. For example,
"dropout_rate": 0.1
would drop out 10% of elements.
Others:
- “name”: str
Name of the network.
- forward(input, sequence_length=None, dtype=None, data_format=None)[source]¶
Feeds forward inputs through the network layers and returns outputs.
- Parameters
input – The inputs to the network, which is a 3D tensor.
sequence_length (optional) – An torch.LongTensor of shape
[batch_size]
or a python array containing the length of each element ininputs
. If given, time steps beyond the length will first be masked out before feeding to the layers.dtype (optional) – Type of the inputs. If not provided, infers from inputs automatically.
data_format (optional) – Data type of the input tensor. If
channels_last
, the last dimension will be treated as channel dimension so the size of theinput
should be [batch_size, X, channel]. Ifchannels_first
, first dimension will be treated as channel dimension so the size should be [batch_size, channel, X]. Defaults to None. If None, the value will be picked from hyperparameters.
- Returns
The output of the final layer.
- append_layer(layer)¶
Appends a layer to the end of the network.
- Parameters
layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.
- has_layer(layer_name)¶
Returns True if a layer with the given name exists. Returns False otherwise.
- Parameters
layer_name (str) – Name of the layer.
- layer_by_name(layer_name)¶
Returns the layer with the name. Returns None if the layer name does not exist.
- Parameters
layer_name (str) – Name of the layer.
- property layers_by_name¶
A dictionary mapping layer names to the layers.
- property layers¶
A list of the layers.
- property layer_names¶
A list of uniquified layer names.