Modules

ModuleBase

class texar.torch.ModuleBase(hparams=None)[source]

Base class inherited by modules that are configurable through hyperparameters.

This is a subclass of torch.nn.Module.

A Texar module inheriting ModuleBase is configurable through hyperparameters. That is, each module defines allowed hyperparameters and default values. Hyperparameters not specified by users will take default values.

Parameters

hparams (dict, optional) – Hyperparameters of the module. See default_hparams() for the structure and default values.

static default_hparams()[source]

Returns a dict of hyperparameters of the module with default values. Used to replace the missing values of input hparams during module construction.

{
    "name": "module"
}
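The default-filling behaviour described above can be sketched as a recursive dictionary merge. This is only an illustration in plain Python, not Texar's HParams implementation: the helper name fill_defaults and the "initializer" entry in the toy defaults are hypothetical.

```python
import copy


def fill_defaults(user_hparams, defaults):
    """Return a copy of `defaults` overridden by the values in `user_hparams`.

    Hypothetical helper for illustration; nested dicts are merged recursively.
    """
    merged = copy.deepcopy(defaults)
    for key, value in (user_hparams or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = fill_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Toy defaults; only "name" comes from the structure shown above.
defaults = {"name": "module", "initializer": {"type": "uniform", "seed": None}}
merged = fill_defaults({"initializer": {"seed": 1}}, defaults)
```

Values the user leaves out keep their defaults, and the defaults dictionary itself is never modified.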
property trainable_variables

The list of trainable variables (parameters) of the module. Parameters of this module and all its submodules are included.

Note

The list returned may contain duplicate parameters (e.g. output layer shares parameters with embeddings). For most usages, it’s not necessary to ensure uniqueness.
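Where uniqueness does matter (for example, when counting parameters), duplicates can be filtered by object identity. The sketch below uses plain Python objects as stand-ins for parameters, and the helper name unique_by_identity is hypothetical. Identity is compared rather than equality because == on tensors is element-wise.

```python
def unique_by_identity(params):
    """Drop repeated objects while keeping first-seen order."""
    seen = set()
    unique = []
    for p in params:
        if id(p) not in seen:
            seen.add(id(p))
            unique.append(p)
    return unique


# Stand-ins for parameters; `shared` plays the role of a tied weight.
shared, other = object(), object()
deduped = unique_by_identity([shared, other, shared])
```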

property hparams

An HParams instance. The hyperparameters of the module.

property output_size

The feature size of the forward() output tensor(s). It usually equals the size of the last dimension of the output tensor.

Embedders

WordEmbedder

class texar.torch.modules.WordEmbedder(init_value=None, vocab_size=None, hparams=None)[source]

Simple word embedder that maps indexes into embeddings. The indexes can be soft (e.g., distributions over vocabulary).

Either init_value or vocab_size is required. If both are given, init_value.shape[0] must equal vocab_size.

Parameters
  • init_value (optional) –

    A Tensor or numpy array that contains the initial value of embeddings. It is typically of shape [vocab_size] + embedding-dim. Embeddings can have dimensionality > 1.

    If None, embedding is initialized as specified in hparams["initializer"]. Otherwise, the "initializer" and "dim" hyperparameters in hparams are ignored.

  • vocab_size (int, optional) – The vocabulary size. Required if init_value is not given.

  • hparams (dict, optional) – Embedder hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See forward() for the inputs and outputs of the embedder.

Example:

ids = torch.empty([32, 10]).uniform_(to=10).type(torch.int64)
soft_ids = torch.empty([32, 10, 100]).uniform_()

embedder = WordEmbedder(vocab_size=100, hparams={'dim': 256})
ids_emb = embedder(ids=ids) # shape: [32, 10, 256]
soft_ids_emb = embedder(soft_ids=soft_ids) # shape: [32, 10, 256]
# Use with Texar data module
data_params = {
    'dataset': {
        'embedding_init': {'file': 'word2vec.txt'}
        ...
    },
}
data = MonoTextData(data_params)
iterator = DataIterator(data)
batch = next(iter(iterator))

# Use data vocab size
embedder_1 = WordEmbedder(vocab_size=data.vocab.size)
emb_1 = embedder_1(batch['text_ids'])

# Use pre-trained embedding
embedder_2 = WordEmbedder(init_value=data.embedding_init_value)
emb_2 = embedder_2(batch['text_ids'])
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "dim": 100,
    "dropout_rate": 0,
    "dropout_strategy": 'element',
    "initializer": {
        "type": "random_uniform_initializer",
        "kwargs": {
            "minval": -0.1,
            "maxval": 0.1,
            "seed": None
        }
    },
    "trainable": True,
    "name": "word_embedder",
}

Here:

“dim”: int or list

Embedding dimension. Can be a list of integers to yield embeddings with dimensionality > 1.

Ignored if init_value is given to the embedder constructor.

“dropout_rate”: float

The dropout rate between 0 and 1. For example, dropout_rate=0.1 would zero out 10% of the embeddings. Set to 0 to disable dropout.

“dropout_strategy”: str

The dropout strategy. Can be one of the following:

  • "element": The regular strategy that drops individual elements in the embedding vectors.

  • "item": Drops individual items (e.g., words) entirely. For example, for the word sequence “the simpler the better”, the strategy can yield “_ simpler the better”, where the first “the” is dropped.

  • "item_type": Drops item types (e.g., word types). For example, for the above sequence, the strategy can yield “_ simpler _ better”, where the word type “the” is dropped. The dropout will never yield “_ simpler the better” as in the "item" strategy.

“initializer”: dict or None

Hyperparameters of the initializer for embedding values. See get_initializer() for the details. Ignored if init_value is given to the embedder constructor.

“trainable”: bool

Whether the embedding parameters are trainable. If false, freeze the embedding parameters.

“name”: str

Name of the embedding variable.
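The difference between the "item" and "item_type" values of "dropout_strategy" can be sketched on the word sequence used above. This is a plain-Python illustration, not Texar's implementation: the function names are hypothetical, and dropped words are marked with "_" as in the text instead of zeroing embedding vectors.

```python
import random


def drop_items(tokens, rate, rng):
    """'item' strategy: each position is dropped independently."""
    return ["_" if rng.random() < rate else tok for tok in tokens]


def drop_item_types(tokens, rate, rng):
    """'item_type' strategy: each distinct word is dropped everywhere or nowhere."""
    dropped = {tok for tok in sorted(set(tokens)) if rng.random() < rate}
    return ["_" if tok in dropped else tok for tok in tokens]


tokens = "the simpler the better".split()
rng = random.Random(0)
by_item = drop_items(tokens, 0.5, rng)
by_type = drop_item_types(tokens, 0.5, rng)
```

With drop_item_types, both occurrences of "the" always share the same fate, whereas drop_items may drop only one of them.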

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(ids=None, soft_ids=None, **kwargs)[source]

Embeds (soft) ids.

Either ids or soft_ids must be given, and they must not be given at the same time.

Parameters
  • ids (optional) – An integer tensor containing the ids to embed.

  • soft_ids (optional) – A tensor of weights (probabilities) used to mix the embedding vectors.

  • kwargs – Additional keyword arguments for torch.nn.functional.embedding besides params and ids.

Returns

If ids is given, returns a Tensor of shape list(ids.shape) + embedding-dim. For example, if list(ids.shape) == [batch_size, max_time] and list(embedding.shape) == [vocab_size, emb_dim], then the return tensor has shape [batch_size, max_time, emb_dim].

If soft_ids is given, returns a Tensor of shape list(soft_ids.shape)[:-1] + embedding-dim. For example, if list(soft_ids.shape) == [batch_size, max_time, vocab_size] and list(embedding.shape) == [vocab_size, emb_dim], then the return tensor has shape [batch_size, max_time, emb_dim].
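The relation between the two input modes can be sketched with plain Python lists standing in for tensors. The sketch handles a single sequence without a batch dimension, and the function names are hypothetical: hard ids select rows of the embedding table, while soft ids compute a weighted mix of all rows.

```python
def embed_ids(embedding, ids):
    """Look up one embedding row per id."""
    return [list(embedding[i]) for i in ids]


def embed_soft_ids(embedding, soft_ids):
    """Mix embedding rows using per-position weights over the vocabulary."""
    emb_dim = len(embedding[0])
    return [
        [sum(w * row[d] for w, row in zip(weights, embedding)) for d in range(emb_dim)]
        for weights in soft_ids
    ]


# Toy table: vocab_size = 3, emb_dim = 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
hard = embed_ids(embedding, [2, 0])
one_hot = embed_soft_ids(embedding, [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
mixed = embed_soft_ids(embedding, [[0.5, 0.5, 0.0]])
```

A one-hot soft id reproduces the hard lookup, and in both modes the output has one emb_dim-sized vector per input position.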

property embedding

The embedding tensor, of shape [vocab_size] + dim.

property dim

The embedding dimension.

property vocab_size

The vocabulary size.

property num_embeddings

The vocabulary size. This interface matches torch.nn.Embedding.

property output_size

The feature size of forward() output. If the dim hyperparameter is a list or tuple, the feature size equals its final dimension; otherwise, if dim is an int, the feature size equals dim.

PositionEmbedder

class texar.torch.modules.PositionEmbedder(position_size=None, init_value=None, hparams=None)[source]

Simple position embedder that maps position indexes into embeddings via lookup.

Either init_value or position_size is required. If both are given, init_value.shape[0] must equal position_size.

Parameters
  • init_value (optional) –

    A Tensor or numpy array that contains the initial value of embeddings. It is typically of shape [position_size, embedding dim].

    If None, embedding is initialized as specified in hparams["initializer"]. Otherwise, the "initializer" and "dim" hyperparameters in hparams are ignored.

  • position_size (int, optional) – The number of possible positions, e.g., the maximum sequence length. Required if init_value is not given.

  • hparams (dict, optional) – Embedder hyperparameters. If not specified, the default hyperparameter setting is used. See default_hparams() for the structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "dim": 100,
    "initializer": {
        "type": "random_uniform_initializer",
        "kwargs": {
            "minval": -0.1,
            "maxval": 0.1,
            "seed": None
        }
    },
    "dropout_rate": 0,
    "dropout_strategy": 'element',
    "trainable": True,
    "name": "position_embedder"
}

The hyperparameters have the same meaning as those in texar.torch.modules.WordEmbedder.default_hparams().

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(positions=None, sequence_length=None, **kwargs)[source]

Embeds the positions.

Either positions or sequence_length is required:

  • If both are given, sequence_length is used to mask out embeddings of those time steps beyond the respective sequence lengths.

  • If only sequence_length is given, then positions from 0 to sequence_length - 1 are embedded.
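The two cases above can be sketched with plain Python lists standing in for tensors. The helper names are hypothetical, and padding every sequence to the longest length in the batch is an assumption of this sketch.

```python
def positions_from_lengths(sequence_length):
    """Build position ids 0..max_len-1 for every sequence in the batch."""
    max_len = max(sequence_length)
    return [list(range(max_len)) for _ in sequence_length]


def mask_beyond_length(embedded, sequence_length):
    """Zero out the embedding of every time step at or past its sequence length."""
    return [
        [vec if t < length else [0.0] * len(vec) for t, vec in enumerate(seq)]
        for seq, length in zip(embedded, sequence_length)
    ]


table = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]  # toy table, position_size = 3
lengths = [3, 1]
positions = positions_from_lengths(lengths)
embedded = [[table[p] for p in seq] for seq in positions]
masked = mask_beyond_length(embedded, lengths)
```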

Parameters
  • positions (optional) – A torch.LongTensor containing the position IDs to embed.

  • sequence_length (optional) – A torch.LongTensor of shape [batch_size]. Time steps beyond the respective sequence lengths will have zero-valued embeddings.

  • kwargs – Additional keyword arguments for torch.nn.functional.embedding besides params and ids.

Returns

A Tensor whose shape is the shape of the embedded positions followed by the embedding dimension.

property embedding

The embedding tensor.

property dim

The embedding dimension.

property position_size

The position size, i.e., maximum number of positions.

property output_size

The feature size of forward() output. If the dim hyperparameter is a list or tuple, the feature size equals its final dimension; otherwise, if dim is an int, the feature size equals dim.

SinusoidsPositionEmbedder

class texar.torch.modules.SinusoidsPositionEmbedder(position_size=None, hparams=None)[source]

Sinusoid position embedder that maps position indexes into embeddings via sinusoid calculation. This module does not have trainable parameters. Used in, e.g., Transformer models (Vaswani et al.) “Attention Is All You Need”.

Each channel of the input Tensor is incremented by a sinusoid of a different frequency and phase. This allows attention to learn to use absolute and relative positions.

Timing signals should be added to some precursors of both the query and the memory inputs to attention. The use of relative position is possible because sin(x+y) and cos(x+y) can be expressed in terms of y, sin(x), and cos(x). In particular, we use a geometric sequence of timescales starting with min_timescale and ending with max_timescale. The number of different timescales is equal to dim / 2. For each timescale, we generate the two sinusoidal signals sin(timestep/timescale) and cos(timestep/timescale). All of these sinusoids are concatenated in the dim dimension.
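The computation described above can be sketched for a single position in plain Python. This is a minimal sketch, not the module's implementation: the function name is hypothetical, and placing all sine channels before all cosine channels is an assumption about the concatenation order.

```python
import math


def sinusoid_embedding(position, dim=512, min_timescale=1.0, max_timescale=1.0e4):
    """Return the `dim`-sized sinusoid signal for one position (dim must be even)."""
    num_timescales = dim // 2
    # Geometric sequence of timescales from min_timescale to max_timescale.
    log_increment = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    timescales = [min_timescale * math.exp(i * log_increment) for i in range(num_timescales)]
    sines = [math.sin(position / t) for t in timescales]
    cosines = [math.cos(position / t) for t in timescales]
    # All sine channels first, then all cosine channels.
    return sines + cosines


signal = sinusoid_embedding(position=3, dim=8)
```

Because the signal is computed directly from the position, nothing in this sketch restricts it to non-negative or bounded positions.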

Parameters

position_size (int) – The number of possible positions, e.g., the maximum sequence length. Set position_size=None and hparams['cache_embeddings']=False to use arbitrarily large or negative position indices.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values. We use a geometric sequence of timescales starting with min_timescale and ending with max_timescale. The number of different timescales is equal to dim / 2.

{
    'min_timescale': 1.0,
    'max_timescale': 10000.0,
    'dim': 512,
    'cache_embeddings': True,
    'name':'sinusoid_position_embedder',
}

Here:

“cache_embeddings”: bool

If True, precompute embeddings for positions in range [0, position_size - 1]. This leads to faster lookup but requires lookup indices to be within this range.

If False, embeddings are computed on-the-fly during lookup. Set to False if your application needs to handle sequences of arbitrary length, or requires embeddings at negative positions.

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(positions=None, sequence_length=None, **kwargs)[source]

Embeds the positions. Either positions or sequence_length is required:

  • If both are given, sequence_length is used to mask out embeddings of those time steps beyond the respective sequence lengths.

  • If only sequence_length is given, then positions from 0 to sequence_length - 1 are embedded.

Parameters
  • positions (optional) – A torch.LongTensor containing the position IDs to embed.

  • sequence_length (optional) – A torch.LongTensor of shape [batch_size]. Time steps beyond the respective sequence lengths will have zero-valued embeddings.

Returns

A Tensor of shape [batch_size, position_size, dim].

property dim

The embedding dimension.

property output_size

The feature size of forward() output. If the dim hyperparameter is a list or tuple, the feature size equals its final dimension; otherwise, if dim is an int, the feature size equals dim.

EmbedderBase

class texar.torch.modules.EmbedderBase(num_embeds=None, init_value=None, hparams=None)[source]

The base embedder class that all embedder classes inherit.

Parameters
  • num_embeds (int, optional) – The number of embedding elements, e.g., the vocabulary size of a word embedder.

  • init_value (Tensor or numpy array, optional) – Initial values of the embedding variable. If not given, embedding is initialized as specified in hparams["initializer"].

  • hparams (dict or HParams, optional) – Embedder hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "name": "embedder"
}
property num_embeds

The number of embedding elements.

Encoders

UnidirectionalRNNEncoder

class texar.torch.modules.UnidirectionalRNNEncoder(*args, **kwds)[source]

Unidirectional RNN encoder.

Parameters
  • input_size (int) – The number of expected features in the input for the cell.

  • cell (RNNCell, optional) – If not specified, a cell is created as specified in hparams["rnn_cell"].

  • output_layer (optional) – An instance of torch.nn.Module. Applies to the RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer"].

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See forward() for the inputs and outputs of the encoder.

Example:

# Use with embedder
embedder = WordEmbedder(vocab_size, hparams=emb_hparams)
encoder = UnidirectionalRNNEncoder(input_size=embedder.dim, hparams=enc_hparams)

outputs, final_state = encoder(
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'])
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "rnn_cell": default_rnn_cell_hparams(),
    "output_layer": {
        "num_layers": 0,
        "layer_size": 128,
        "activation": "identity",
        "final_layer_activation": None,
        "other_dense_kwargs": None,
        "dropout_layer_ids": [],
        "dropout_rate": 0.5,
        "variational_dropout": False
    },
    "name": "unidirectional_rnn_encoder"
}

Here:

“rnn_cell”: dict

A dictionary of RNN cell hyperparameters. Ignored if cell is given to the encoder constructor.

The default value is defined in default_rnn_cell_hparams().

“output_layer”: dict

Output layer hyperparameters. Ignored if output_layer is given to the encoder constructor. Includes:

“num_layers”: int

The number of output (dense) layers. Set to 0 to avoid any output layers applied to the cell outputs.

“layer_size”: int or list

The size of each of the output (dense) layers.

If an int, each output layer will have the same size. If a list, its length must equal num_layers.

“activation”: str or callable or None

Activation function for each output (dense) layer except the final layer. This can be a function, its string name, or its module path. If a function name is given, the function must be from torch.nn. For example:

"activation": "relu" # function name
"activation": "my_module.my_activation_fn" # module path
"activation": my_module.my_activation_fn # function

Default is None which results in an identity activation.

“final_layer_activation”: str or callable or None

The activation function for the final output layer.

“other_dense_kwargs”: dict or None

Other keyword arguments to construct each of the output dense layers, e.g., bias. See torch.nn.Linear for the keyword arguments.

“dropout_layer_ids”: int or list

The indexes of layers (starting from 0) whose inputs have dropout applied. An index equal to num_layers applies dropout to the final layer output. For example,

{
    "num_layers": 2,
    "dropout_layer_ids": [0, 2]
}

leads to the layer sequence -dropout-layer0-layer1-dropout-.

The dropout mode (training or not) is controlled by self.training.

“dropout_rate”: float

The dropout rate, between 0 and 1. For example, "dropout_rate": 0.1 would zero out 10% of elements.

“variational_dropout”: bool

Whether the dropout mask is the same across all time steps.

“name”: str

Name of the encoder.
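The placement rule for "dropout_layer_ids" can be sketched by listing the resulting layer order. The helper below is hypothetical and only produces layer names. It builds no torch.nn modules.

```python
def layer_plan(num_layers, dropout_layer_ids):
    """List the order of dropout and dense layers in the output stack."""
    if isinstance(dropout_layer_ids, int):
        dropout_layer_ids = [dropout_layer_ids]
    plan = []
    for i in range(num_layers):
        if i in dropout_layer_ids:
            plan.append("dropout")  # dropout on the input of layer i
        plan.append(f"layer{i}")
    if num_layers in dropout_layer_ids:
        plan.append("dropout")  # dropout on the final layer output
    return plan


plan = layer_plan(num_layers=2, dropout_layer_ids=[0, 2])
```

For the configuration shown under "dropout_layer_ids", plan is ["dropout", "layer0", "layer1", "dropout"].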

forward(inputs, sequence_length=None, initial_state=None, time_major=False, return_cell_output=False, return_output_size=False)[source]

Encodes the inputs.

Parameters
  • inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.

  • sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero-out outputs when past a batch element’s sequence length.

  • initial_state (optional) – Initial state of the RNN.

  • time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].

  • return_cell_output (bool) – Whether to return the output of the RNN cell. These are the results prior to the output layer.

  • return_output_size (bool) – Whether to return the size of the output (i.e., the results after output layers).

Returns

  • By default (both return_cell_output and return_output_size are False), returns a pair (outputs, final_state), where

    • outputs: The RNN output tensor by the output layer (if exists) or the RNN cell (otherwise). The tensor is of shape [batch_size, max_time, output_size] if time_major is False, or [max_time, batch_size, output_size] if time_major is True. If RNN cell output is a (nested) tuple of Tensors, then the outputs will be a (nested) tuple having the same nest structure as the cell output.

    • final_state: The final state of the RNN, which is a Tensor of shape [batch_size] + cell.state_size or a (nested) tuple of Tensors if cell.state_size is a (nested) tuple.

  • If return_cell_output is True, returns a triple (outputs, final_state, cell_outputs)

    • cell_outputs: The outputs of the RNN cell prior to the output layer, having the same structure as outputs except for the output_dim.

  • If return_output_size is True, returns a tuple (outputs, final_state, output_size)

    • output_size: A (possibly nested tuple of) int representing the size of outputs. If a single int or an int array, then outputs has shape [batch/time, time/batch] + output_size. If a (nested) tuple, then output_size has the same structure as outputs.

  • If both return_cell_output and return_output_size are True, returns (outputs, final_state, cell_outputs, output_size).
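How the two flags determine the length and order of the returned tuple can be sketched as follows. The function pack_results is a hypothetical stand-in, and placeholder strings take the place of tensors and states.

```python
def pack_results(outputs, final_state, cell_outputs, output_size,
                 return_cell_output=False, return_output_size=False):
    """Assemble the return tuple in the order listed above."""
    results = [outputs, final_state]
    if return_cell_output:
        results.append(cell_outputs)
    if return_output_size:
        results.append(output_size)
    return tuple(results)


# Placeholder strings stand in for tensors and states.
both = pack_results("outputs", "state", "cell_outputs", 128,
                    return_cell_output=True, return_output_size=True)
size_only = pack_results("outputs", "state", "cell_outputs", 128,
                         return_output_size=True)
```

Callers therefore need to unpack two, three, or four values depending on the flags they pass.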

property cell

The RNN cell.

property state_size

The state size of encoder cell. Same as encoder.cell.state_size.

property output_layer

The output layer.

property output_size

The feature size of the outputs returned by forward(). If no output layer exists, the feature size equals encoder.cell.hidden_size; otherwise it equals the last dimension of the output layer's output size.

BidirectionalRNNEncoder

class texar.torch.modules.BidirectionalRNNEncoder(*args, **kwds)[source]

Bidirectional forward-backward RNN encoder.

Parameters
  • cell_fw (RNNCell, optional) – The forward RNN cell. If not given, a cell is created as specified in hparams["rnn_cell_fw"].

  • cell_bw (RNNCell, optional) – The backward RNN cell. If not given, a cell is created as specified in hparams["rnn_cell_bw"].

  • output_layer_fw (optional) – An instance of torch.nn.Module. Apply to the forward RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer_fw"].

  • output_layer_bw (optional) – An instance of torch.nn.Module. Apply to the backward RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer_bw"].

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See forward() for the inputs and outputs of the encoder.

Example:

# Use with embedder
embedder = WordEmbedder(vocab_size, hparams=emb_hparams)
encoder = BidirectionalRNNEncoder(hparams=enc_hparams)

outputs, final_state = encoder(
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'])
# outputs == (outputs_fw, outputs_bw)
# final_state == (final_state_fw, final_state_bw)
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "rnn_cell_fw": default_rnn_cell_hparams(),
    "rnn_cell_bw": default_rnn_cell_hparams(),
    "rnn_cell_share_config": True,
    "output_layer_fw": {
        "num_layers": 0,
        "layer_size": 128,
        "activation": "identity",
        "final_layer_activation": None,
        "other_dense_kwargs": None,
        "dropout_layer_ids": [],
        "dropout_rate": 0.5,
        "variational_dropout": False
    },
    "output_layer_bw": {
        # Same hyperparams and default values as "output_layer_fw"
        # ...
    },
    "output_layer_share_config": True,
    "name": "bidirectional_rnn_encoder"
}

Here:

“rnn_cell_fw”: dict

Hyperparameters of the forward RNN cell. Ignored if cell_fw is given to the encoder constructor.

The default value is defined in default_rnn_cell_hparams().

“rnn_cell_bw”: dict

Hyperparameters of the backward RNN cell. Ignored if cell_bw is given to the encoder constructor, or if “rnn_cell_share_config” is True.

The default value is defined in default_rnn_cell_hparams().

“rnn_cell_share_config”: bool

Whether to share the hyperparameters of the forward cell with the backward cell. Note that the cell parameters (variables) are not shared.

“output_layer_fw”: dict

Hyperparameters of the forward output layer. Ignored if output_layer_fw is given to the constructor. See the "output_layer" field of UnidirectionalRNNEncoder() for details.

“output_layer_bw”: dict

Hyperparameters of the backward output layer. Ignored if output_layer_bw is given to the constructor. Has the same structure and defaults as "output_layer_fw".

Ignored if output_layer_share_config is True.

“output_layer_share_config”: bool

Whether to share the hyperparameters of the forward output layer with the backward output layer. Note that the layer parameters (variables) are not shared.

“name”: str

Name of the encoder.
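The effect of "rnn_cell_share_config" and "output_layer_share_config" can be sketched as a choice of which configuration dictionary is used for the backward component. The helper name and the cell settings in the toy hparams are hypothetical. The sketch only selects a configuration and builds no cells.

```python
import copy


def resolve_backward_config(hparams, fw_key, bw_key, share_key):
    """Pick the config dict used to build the backward component."""
    if hparams[share_key]:
        # Copy the forward settings; the backward component is still built
        # separately, so no parameters are shared.
        return copy.deepcopy(hparams[fw_key])
    return hparams[bw_key]


# Toy values for illustration only.
hparams = {
    "rnn_cell_fw": {"type": "LSTMCell", "kwargs": {"num_units": 256}},
    "rnn_cell_bw": {"type": "GRUCell", "kwargs": {"num_units": 64}},
    "rnn_cell_share_config": True,
}
bw_config = resolve_backward_config(
    hparams, "rnn_cell_fw", "rnn_cell_bw", "rnn_cell_share_config")
```

With sharing enabled, the backward settings in "rnn_cell_bw" are ignored and the forward settings are used instead.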

forward(inputs, sequence_length=None, initial_state_fw=None, initial_state_bw=None, time_major=False, return_cell_output=False, return_output_size=False)[source]

Encodes the inputs.

Parameters
  • inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.

  • sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero-out outputs when past a batch element’s sequence length.

  • initial_state_fw (optional) – Initial state of the forward RNN.

  • initial_state_bw (optional) – Initial state of the backward RNN.

  • time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].

  • return_cell_output (bool) – Whether to return the output of the RNN cell. These are the results prior to the output layer.

  • return_output_size (bool) – Whether to return the size of the outputs (i.e., the results after the output layers).

Returns

  • By default (both return_cell_output and return_output_size are False), returns a pair (outputs, final_state)

    • outputs: A tuple (outputs_fw, outputs_bw) containing the forward and the backward RNN outputs, each of which is of shape [batch_size, max_time, output_dim] if time_major is False, or [max_time, batch_size, output_dim] if time_major is True. If RNN cell output is a (nested) tuple of Tensors, then outputs_fw and outputs_bw will be a (nested) tuple having the same structure as the cell output.

    • final_state: A tuple (final_state_fw, final_state_bw) containing the final states of the forward and backward RNNs, each of which is a Tensor of shape [batch_size] + cell.state_size, or a (nested) tuple of Tensors if cell.state_size is a (nested) tuple.

  • If return_cell_output is True, returns a triple (outputs, final_state, cell_outputs) where

    • cell_outputs: A tuple (cell_outputs_fw, cell_outputs_bw) containing the outputs of the forward and backward RNN cells prior to the output layers, having the same structure as outputs except for the output_dim.

  • If return_output_size is True, returns a tuple (outputs, final_state, output_size) where

    • output_size: A tuple (output_size_fw, output_size_bw) containing the size of outputs_fw and outputs_bw, respectively. Take *_fw for example, output_size_fw is a (possibly nested tuple of) int. If a single int or an int array, then outputs_fw has shape [batch/time, time/batch] + output_size_fw. If a (nested) tuple, then output_size_fw has the same structure as outputs_fw. The same applies to output_size_bw.

  • If both return_cell_output and return_output_size are True, returns (outputs, final_state, cell_outputs, output_size).

property cell_fw

The forward RNN cell.

property cell_bw

The backward RNN cell.

property state_size_fw

The state size of the forward encoder cell. Same as encoder.cell_fw.state_size.

property state_size_bw

The state size of the backward encoder cell. Same as encoder.cell_bw.state_size.

property output_layer_fw

The output layer of the forward RNN.

property output_layer_bw

The output layer of the backward RNN.

property output_size

The feature sizes of the forward() outputs, output_size_fw and output_size_bw. Each feature size equals the last dimension of the corresponding output's size.

MultiheadAttentionEncoder

class texar.torch.modules.MultiheadAttentionEncoder(input_size, hparams=None)[source]

Multi-head Attention Encoder.

Parameters

hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "initializer": None,
    'num_heads': 8,
    'output_dim': 512,
    'num_units': 512,
    'dropout_rate': 0.1,
    'use_bias': False,
    "name": "multihead_attention"
}

Here:

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“num_heads”: int

Number of heads for attention calculation.

“output_dim”: int

Output dimension of the returned tensor.

“num_units”: int

Hidden dimension of the unsplit attention space. Should be divisible by “num_heads”.

“dropout_rate”: float

Dropout rate in the attention.

“use_bias”: bool

Use bias when projecting the key, value and query.

“name”: str

Name of the module.
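The divisibility requirement on "num_units" follows from splitting the attention space evenly across heads. The sketch below shows the split for a single vector in plain Python. The function name is hypothetical, and the module itself operates on batched tensors.

```python
def split_heads(vector, num_heads):
    """Split one `num_units`-sized vector into `num_heads` equal slices."""
    num_units = len(vector)
    if num_units % num_heads != 0:
        raise ValueError("num_units must be divisible by num_heads")
    head_dim = num_units // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


heads = split_heads(list(range(512)), num_heads=8)
```

With the default values (num_units of 512 and 8 heads), each head works on 64 units.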

forward(queries, memory, memory_attention_bias, cache=None)[source]

Encodes the inputs.

Parameters
  • queries – A 3D tensor with shape of [batch, length_query, depth_query].

  • memory – A 3D tensor with shape of [batch, length_key, depth_key].

  • memory_attention_bias – A 3D tensor with shape of [batch, length_key, num_units].

  • cache – Memory cache, used only when inferring the sentence from scratch.

Returns

A tensor of shape [batch_size, max_time, dim] containing the encoded vectors.

property output_size

The feature size of forward() output.

TransformerEncoder

class texar.torch.modules.TransformerEncoder(hparams=None)[source]

Transformer encoder that applies multi-head self attention for encoding sequences.

This module basically stacks MultiheadAttentionEncoder, FeedForwardNetwork and residual connections. This module supports two types of architectures, namely, the standard Transformer Encoder architecture first proposed in (Vaswani et al.) “Attention is All You Need”, and the variant first used in (Devlin et al.) BERT. See default_hparams() for the differences between the two types of architectures.

Parameters

hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

initialize_blocks()[source]

Helper function which initializes blocks for encoder.

Should be overridden by any classes where block initialization varies.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "num_blocks": 6,
    "dim": 512,
    'use_bert_config': False,
    "embedding_dropout": 0.1,
    "residual_dropout": 0.1,
    "poswise_feedforward": default_transformer_poswise_net_hparams,
    'multihead_attention': {
        'name': 'multihead_attention',
        'num_units': 512,
        'num_heads': 8,
        'dropout_rate': 0.1,
        'output_dim': 512,
        'use_bias': False,
    },
    "eps": 1e-6,
    "initializer": None,
    "name": "transformer_encoder"
}

Here:

“num_blocks”: int

Number of stacked blocks.

“dim”: int

Hidden dimension of the encoders.

“use_bert_config”: bool

If False, apply the standard Transformer Encoder architecture from the original paper (Vaswani et al.) “Attention is All You Need”. If True, apply the Transformer Encoder architecture used in BERT (Devlin et al.) and the default setting of TensorFlow. The differences lie in:

  1. The standard arch restricts the word embedding of PAD token to all zero. The BERT arch does not.

  2. The attention bias for padding tokens: Standard architectures use -1e8 for negative attention mask. BERT uses -1e4 instead.

  3. The residual connections between internal tensors: In BERT, a residual layer connects the tensors after layer normalization. In standard architectures, the tensors are connected before layer normalization.

“embedding_dropout”: float

Dropout rate of the input embedding.

“residual_dropout”: float

Dropout rate of the residual connections.

“eps”: float

Epsilon values for layer norm layers.

“poswise_feedforward”: dict

Hyperparameters for a feed-forward network used in residual connections. Make sure the dimension of the output tensor is equal to "dim". See default_transformer_poswise_net_hparams() for details.

“multihead_attention”: dict

Hyperparameters for the multi-head attention strategy. Make sure the "output_dim" in this module is equal to "dim". See MultiheadAttentionEncoder for details.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“name”: str

Name of the module.

forward(inputs, sequence_length)[source]

Encodes the inputs.

Parameters
  • inputs – A 3D Tensor of shape [batch_size, max_time, dim], containing the embedding of input sequences. Note that the embedding dimension dim must equal “dim” in hparams. The input embedding is typically an aggregation of word embedding and position embedding.

  • sequence_length – A 1D torch.LongTensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

Returns

A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.

property output_size

The feature size of the forward() output tensor(s). It is usually equal to the size of the last dimension of the output tensor.

BERTEncoder

class texar.torch.modules.BERTEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Raw BERT Transformer for encoding sequences. Please see PretrainedBERTMixin for a brief description of BERT.

This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder and a dense pooler.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.
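The way missing hyperparameters fall back to defaults can be sketched in plain Python. The recursive `merge_hparams` helper below is hypothetical and only mimics the documented behavior on nested dicts; the library handles this through its HParams class. The default values are a small excerpt of the structure returned by default_hparams().

```python
import copy


def merge_hparams(defaults, overrides):
    """Recursively fill `overrides` with values from `defaults`.

    Hypothetical helper for illustration; not the library's implementation.
    """
    merged = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts are merged key by key instead of being replaced.
            merged[key] = merge_hparams(merged[key], value)
        else:
            merged[key] = value
    return merged


# Excerpt of the documented BERTEncoder defaults.
defaults = {
    "pretrained_model_name": "bert-base-uncased",
    "vocab_size": 30522,
    "encoder": {"dim": 768, "num_blocks": 12, "residual_dropout": 0.1},
    "name": "bert_encoder",
}

# Override only what differs; everything else keeps its default value.
user_hparams = {"pretrained_model_name": None, "encoder": {"num_blocks": 6}}
hparams = merge_hparams(defaults, user_hparams)
```

Here only "pretrained_model_name" and the number of blocks change, while "vocab_size" and the remaining encoder settings keep their defaults.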

reset_parameters()[source]

Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the encoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.

{
    "pretrained_model_name": "bert-base-uncased",
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "vocab_size": 30522,
    "segment_embed": {
        "dim": 768,
        "name": "token_type_embeddings"
    },
    "type_vocab_size": 2,
    "position_embed": {
        "dim": 768,
        "name": "position_embeddings"
    },
    "position_size": 512,

    "encoder": {
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": True
        },
        "name": "encoder",
        "num_blocks": 12,
        "eps": 1e-12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": True
                    },
                    "type": "Linear"
                },
                {"type": "BertGELU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": True
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
        "use_bert_config": True
    },
    "hidden_size": 768,
    "initializer": None,
    "name": "bert_encoder",
}

Here:

The default parameters are values for uncased BERT-Base model.

“pretrained_model_name”: str or None

The name of the pre-trained BERT model. If None, the model will be randomly initialized.

“embed”: dict

Hyperparameters for word embedding layer.

“vocab_size”: int

The vocabulary size of inputs in BERT model.

“segment_embed”: dict

Hyperparameters for segment embedding layer.

“type_vocab_size”: int

The vocabulary size of the segment_ids passed into BertModel.

“position_embed”: dict

Hyperparameters for position embedding layer.

“position_size”: int

The maximum sequence length that this model might ever be used with.

“encoder”: dict

Hyperparameters for the TransformerEncoder. See default_hparams() for details.

“hidden_size”: int

Size of the pooler dense layer.

“eps”: float

Epsilon values for layer norm layers.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“name”: str

Name of the module.
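The three precedence rules listed under default_hparams() for choosing the encoder architecture can be sketched as a small decision function. The name `resolve_architecture_source` and its return labels are assumptions made for this example and are not part of the library.

```python
def resolve_architecture_source(pretrained_model_name=None, hparams=None):
    """Return (source, model_name) following the documented precedence.

    Hypothetical helper for illustration; not part of the Texar API.
    """
    hparams = hparams or {}
    if pretrained_model_name is not None:
        # 1. The constructor argument wins; hparams are ignored.
        return "constructor", pretrained_model_name
    if hparams.get("pretrained_model_name") is not None:
        # 2. Otherwise the name given in hparams is used.
        return "hparams", hparams["pretrained_model_name"]
    # 3. Neither is set: the architecture comes from hparams and the
    #    weights are randomly initialized.
    return "random_init", None


from_ctor = resolve_architecture_source(
    "bert-base-uncased", {"pretrained_model_name": None})
from_hparams = resolve_architecture_source(
    None, {"pretrained_model_name": "bert-base-uncased"})
from_scratch = resolve_architecture_source(
    None, {"pretrained_model_name": None})
```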

forward(inputs, sequence_length=None, segment_ids=None)[source]

Encodes the inputs. Note that the SpanBERT model does not use segmentation embedding. As a result, SpanBERT does not require segment_ids as an input when you use pre-trained SpanBERT checkpoint files.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • segment_ids (optional) – A 2D Tensor of shape [batch_size, max_time], containing the segment ids of tokens in input sequences. If None (default), a tensor with all elements set to zero is used.

  • sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

Returns

A pair (outputs, pooled_output)

  • outputs: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.

  • pooled_output: A Tensor of size [batch_size, hidden_size] which is the output of a pooler pre-trained on top of the hidden state associated with the first token of the input (CLS); see BERT’s paper.

property output_size

The feature size of forward() output pooled_output.

RoBERTaEncoder

class texar.torch.modules.RoBERTaEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

RoBERTa Transformer for encoding sequences. Please see PretrainedRoBERTaMixin for a brief description of RoBERTa.

This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder and a dense pooler.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., roberta-base). Please refer to PretrainedRoBERTaMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the encoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.

{
    "pretrained_model_name": "roberta-base",
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "vocab_size": 50265,
    "position_embed": {
        "dim": 768,
        "name": "position_embeddings"
    },
    "position_size": 514,

    "encoder": {
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": True
        },
        "name": "encoder",
        "num_blocks": 12,
        "eps": 1e-12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": True
                    },
                    "type": "Linear"
                },
                {"type": "BertGELU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": True
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
        "use_bert_config": True
    },
    "hidden_size": 768,
    "initializer": None,
    "name": "roberta_encoder",
}

Here:

The default parameters are values for RoBERTa-Base model.

“pretrained_model_name”: str or None

The name of the pre-trained RoBERTa model. If None, the model will be randomly initialized.

“embed”: dict

Hyperparameters for word embedding layer.

“vocab_size”: int

The vocabulary size of inputs in RoBERTa model.

“position_embed”: dict

Hyperparameters for position embedding layer.

“position_size”: int

The maximum sequence length that this model might ever be used with.

“encoder”: dict

Hyperparameters for the TransformerEncoder. See default_hparams() for details.

“hidden_size”: int

Size of the pooler dense layer.

“eps”: float

Epsilon values for layer norm layers.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“name”: str

Name of the module.

forward(inputs, sequence_length=None, segment_ids=None)[source]

Encodes the inputs. Differing from the standard BERT, the RoBERTa model does not use segmentation embedding. As a result, RoBERTa does not require segment_ids as an input.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

Returns

A pair (outputs, pooled_output)

  • outputs: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.

  • pooled_output: A Tensor of size [batch_size, hidden_size] which is the output of a pooler pre-trained on top of the hidden state associated with the first token of the input (CLS); see RoBERTa’s paper.

GPT2Encoder

class texar.torch.modules.GPT2Encoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Raw GPT2 Transformer for encoding sequences. Please see PretrainedGPT2Mixin for a brief description of GPT2.

This module basically stacks WordEmbedder, PositionEmbedder, TransformerEncoder.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., gpt2-small). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the encoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.

{
    "pretrained_model_name": "gpt2-small",
    "vocab_size": 50257,
    "context_size": 1024,
    "embedding_size": 768,
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "position_size": 1024,
    "position_embed": {
        "dim": 768,
        "name": "position_embeddings"
    },

    "encoder": {
        "dim": 768,
        "num_blocks": 12,
        "use_bert_config": False,
        "embedding_dropout": 0,
        "residual_dropout": 0,
        "multihead_attention": {
            "use_bias": True,
            "num_units": 768,
            "num_heads": 12,
            "output_dim": 768
        },
        "eps": 1e-6,
        "initializer": {
            "type": "variance_scaling_initializer",
            "kwargs": {
                "factor": 1.0,
                "mode": "FAN_AVG",
                "uniform": True
            }
        },
        "poswise_feedforward": {
            "layers": [
                {
                    "type": "Linear",
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": True
                    }
                },
                {
                    "type": "GPTGELU",
                    "kwargs": {}
                },
                {
                    "type": "Linear",
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": True
                    }
                }
            ],
            "name": "ffn"
        }
    },
    "initializer": None,
    "name": "gpt2_encoder",
}

Here:

The default parameters are values for 124M GPT2 model.

“pretrained_model_name”: str or None

The name of the pre-trained GPT2 model. If None, the model will be randomly initialized.

“embed”: dict

Hyperparameters for word embedding layer.

“vocab_size”: int

The vocabulary size of inputs in GPT2Model.

“position_embed”: dict

Hyperparameters for position embedding layer.

“position_size”: int

The maximum sequence length that this model might ever be used with.

“encoder”: dict

Hyperparameters for the TransformerEncoder. See default_hparams() for details.

“eps”: float

Epsilon values for layer norm layers.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“name”: str

Name of the module.

forward(inputs, sequence_length=None)[source]

Encodes the inputs.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

Returns

outputs: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.

property output_size

The feature size of forward() output.

XLNetEncoder

class texar.torch.modules.XLNetEncoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Raw XLNet module for encoding sequences. Please see PretrainedXLNetMixin for a brief description of XLNet.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The encoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the encoder arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.

{
    "pretrained_model_name": "xlnet-base-cased",
    "untie_r": True,
    "num_layers": 12,
    "mem_len": 0,
    "reuse_len": 0,
    "num_heads": 12,
    "hidden_dim": 768,
    "head_dim": 64,
    "dropout": 0.1,
    "attention_dropout": 0.1,
    "use_segments": True,
    "ffn_inner_dim": 3072,
    "activation": 'gelu',
    "vocab_size": 32000,
    "max_seq_length": 512,
    "initializer": None,
    "name": "xlnet_encoder",
}

Here:

The default parameters are values for cased XLNet-Base model.

“pretrained_model_name”: str or None

The name of the pre-trained XLNet model. If None, the model will be randomly initialized.

“untie_r”: bool

Whether to untie the biases in attention.

“num_layers”: int

The number of stacked layers.

“mem_len”: int

The number of tokens to cache.

“reuse_len”: int

The number of tokens in the current batch to be cached and reused in the future.

“num_heads”: int

The number of attention heads.

“hidden_dim”: int

The hidden size.

“head_dim”: int

The dimension size of each attention head.

“dropout”: float

Dropout rate.

“attention_dropout”: float

Dropout rate on attention probabilities.

“use_segments”: bool

Whether to use segment embedding.

“ffn_inner_dim”: int

The hidden size in feed-forward layers.

“activation”: str

relu or gelu.

“vocab_size”: int

The vocabulary size.

“max_seq_length”: int

The maximum sequence length for RelativePositionalEncoding.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“name”: str

Name of the module.

param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]

Create parameter groups for optimizers. When lr_layer_scale is not 1.0, parameters from each layer form separate groups with different base learning rates.

The return value of this method can be used in the constructor of optimizers, for example:

model = XLNetEncoder(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)
Parameters
  • lr (float) – The learning rate. Can be omitted if lr_layer_scale is 1.0.

  • lr_layer_scale (float) – Per-layer LR scaling rate. The i-th layer will be scaled by lr_layer_scale ^ (num_layers - i - 1).

  • decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.

Returns

The parameter groups, used as the first argument for optimizers.
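The per-layer scaling formula given for lr_layer_scale can be worked through in plain Python. The helper `layer_learning_rates` below is hypothetical and only evaluates the documented expression lr_layer_scale ^ (num_layers - i - 1); it does not create actual optimizer parameter groups.

```python
def layer_learning_rates(lr, lr_layer_scale, num_layers):
    """Return the scaled learning rate for layers 0 .. num_layers - 1.

    Hypothetical helper for illustration; not part of the Texar API.
    """
    return [
        lr * lr_layer_scale ** (num_layers - i - 1)
        for i in range(num_layers)
    ]


# Same values as in the example above: lr=2e-5, lr_layer_scale=0.8,
# with the default of 12 layers.
rates = layer_learning_rates(lr=2e-5, lr_layer_scale=0.8, num_layers=12)
```

With a scale below 1.0, the top layer keeps the base learning rate and each layer below it receives a progressively smaller one.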

property output_size

The feature size of forward() output.

forward(inputs, segment_ids=None, input_mask=None, memory=None, permute_mask=None, target_mapping=None, bi_data=False, clamp_len=None, cache_len=0, same_length=False, attn_type='bi', two_stream=False)[source]

Compute XLNet representations for the input.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • segment_ids – Shape [batch_size, max_time].

  • input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.

  • memory – Memory from previous batches. A list of length num_layers, each tensor of shape [batch_size, mem_len, hidden_dim].

  • permute_mask – The permutation mask. Float tensor of shape [batch_size, max_time, max_time]. A value of 0 for permute_mask[i, j, k] indicates that position i attends to position j in batch k.

  • target_mapping – The target token mapping. Float tensor of shape [batch_size, num_targets, max_time]. A value of 1 for target_mapping[i, j, k] indicates that the i-th target token (in order of permutation) in batch k is the token at position j. Each row target_mapping[i, :, k] can have no more than one value of 1.

  • bi_data (bool) – Whether to use bidirectional data input pipeline.

  • clamp_len (int) – Clamp all relative distances larger than clamp_len. A value of -1 means no clamping.

  • cache_len (int) – Length of memory (number of tokens) to cache.

  • same_length (bool) – Whether to use the same attention length for each token.

  • attn_type (str) – Attention type. Supported values are “uni” and “bi”.

  • two_stream (bool) – Whether to use two-stream attention. Only set to True when pre-training or generating text. Defaults to False.

Returns

A tuple of (output, new_memory):

  • `output`: The final layer output representations. Shape [batch_size, max_time, hidden_dim].

  • `new_memory`: The memory of the current batch. If cache_len is 0, then new_memory is None. Otherwise, it is a list of length num_layers, each tensor of shape [batch_size, cache_len, hidden_dim]. This can be used as the memory argument in the next batch.
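Because input_mask marks masked positions with 1, it is easy to pass a mask with the opposite meaning by mistake. The sketch below derives such a mask from sequence lengths using plain lists; the helper name `make_input_mask` is an assumption for this example, and in practice the mask would be a float tensor of the same shape.

```python
def make_input_mask(sequence_length, max_time):
    """Build a [batch_size, max_time] float mask where 1.0 marks padding.

    Hypothetical helper for illustration; not part of the Texar API.
    """
    return [
        [0.0 if t < length else 1.0 for t in range(max_time)]
        for length in sequence_length
    ]


# Two sequences of lengths 3 and 1, padded to max_time = 4.
input_mask = make_input_mask([3, 1], max_time=4)
```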

Conv1DEncoder

class texar.torch.modules.Conv1DEncoder(in_channels, in_features=None, hparams=None)[source]

Simple Conv-1D encoder which consists of a sequence of convolutional layers followed by a sequence of dense layers.

Wraps Conv1DNetwork to be a subclass of EncoderBase. It has exactly the same functionality as Conv1DNetwork.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

The same as default_hparams() of Conv1DNetwork, except that the default name is "conv_encoder".

EncoderBase

class texar.torch.modules.EncoderBase(hparams=None)[source]

Base class inherited by all encoder classes.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

RNNEncoderBase

class texar.torch.modules.RNNEncoderBase(*args, **kwds)[source]

Base class for all RNN encoder classes to inherit.

Parameters

hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "name": "rnn_encoder"
}

default_transformer_poswise_net_hparams

texar.torch.modules.default_transformer_poswise_net_hparams(input_dim, output_dim=512)[source]

Returns default hyperparameters of a FeedForwardNetwork as a position-wise network used in TransformerEncoder and TransformerDecoder. This is a 2-layer dense network with dropout in-between.

{
    "layers": [
        {
            "type": "Linear",
            "kwargs": {
                "in_features": input_dim,
                "out_features": output_dim * 4,
                "bias": True,
            }
        },
        {
            "type": "nn.ReLU",
            "kwargs": {
                "inplace": True
            }
        },
        {
            "type": "Dropout",
            "kwargs": {
                "p": 0.1,
            }
        },
        {
            "type": "Linear",
            "kwargs": {
                "in_features": output_dim * 4,
                "out_features": output_dim,
                "bias": True,
            }
        }
    ],
    "name": "ffn"
}
Parameters
  • input_dim (int) – The size of dense layer input.

  • output_dim (int) – The size of dense layer output.
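To make the relationship between input_dim, output_dim and the inner layer size explicit, the structure above can be rebuilt as a plain dict. The function `poswise_net_hparams` is an illustrative re-implementation written for this example; real code should call default_transformer_poswise_net_hparams() instead.

```python
def poswise_net_hparams(input_dim, output_dim=512):
    """Rebuild the documented default structure as a plain dict.

    Illustrative re-implementation; not the library function itself.
    """
    inner_dim = output_dim * 4  # the hidden layer is 4x the output size
    return {
        "layers": [
            {"type": "Linear",
             "kwargs": {"in_features": input_dim,
                        "out_features": inner_dim,
                        "bias": True}},
            {"type": "nn.ReLU", "kwargs": {"inplace": True}},
            {"type": "Dropout", "kwargs": {"p": 0.1}},
            {"type": "Linear",
             "kwargs": {"in_features": inner_dim,
                        "out_features": output_dim,
                        "bias": True}},
        ],
        "name": "ffn",
    }


ffn = poswise_net_hparams(input_dim=768, output_dim=768)
```

With input_dim and output_dim both set to 768, the layer sizes are 768, 3072 and 768, the same sizes that appear in the BERTEncoder defaults.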

Decoders

DecoderBase

class texar.torch.modules.DecoderBase(*args, **kwds)[source]

Base class inherited by all RNN decoder classes. See BasicRNNDecoder for the arguments.

See forward() for the inputs and outputs of RNN decoders in general.

embed_tokens(tokens, positions)[source]

Convert tokens along with positions to embeddings.

Parameters
  • tokens – A torch.LongTensor denoting the token indices to convert to embeddings.

  • positions – A torch.LongTensor with the same size as tokens, denoting the positions of the tokens. This is useful if the decoder uses positional embeddings.

Returns

A torch.Tensor of size tokens.size() + (embed_dim,), denoting the converted embeddings.

create_helper(*, decoding_strategy=None, start_tokens=None, end_token=None, softmax_temperature=None, infer_mode=None, **kwargs)[source]

Create a helper instance for the decoder. This is a shared interface for both BasicRNNDecoder and AttentionRNNDecoder.

The function provides 3 ways to specify the decoding method, with varying flexibility:

  1. The decoding_strategy argument: A string taking value of:

    • “train_greedy”: decoding in teacher-forcing fashion (i.e., feeding ground truth to decode the next step), and each sample is obtained by taking the argmax of the output logits. Arguments (inputs, sequence_length) are required for this strategy, and argument embedding is optional.

    • “infer_greedy”: decoding in inference fashion (i.e., feeding the generated sample to decode the next step), and each sample is obtained by taking the argmax of the output logits. Arguments (embedding, start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.

    • “infer_sample”: decoding in inference fashion, and each sample is obtained by random sampling from the RNN output distribution. Arguments (embedding, start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.

This argument is used only when argument helper is None.

Example:

embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

# Teacher-forcing decoding
outputs_1, _, _ = decoder(
    decoding_strategy='train_greedy',
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'] - 1)

# Random sample decoding. Gets 100 sequence samples
outputs_2, _, sequence_length = decoder(
    decoding_strategy='infer_sample',
    start_tokens=[data.vocab.bos_token_id] * 100,
    end_token=data.vocab.eos_token_id,
    embedding=embedder,
    max_decoding_length=60)
  2. The helper argument: An instance of a subclass of Helper. This provides a superset of the decoding strategies above, such as Gumbel-softmax decoding.

This approach gives maximal flexibility in configuring the decoding strategy.

Example:

embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)

# Teacher-forcing decoding, same as above with
# `decoding_strategy='train_greedy'`
helper_1 = TrainingHelper(
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length'] - 1)
outputs_1, _, _ = decoder(helper=helper_1)

# Gumbel-softmax decoding
helper_2 = GumbelSoftmaxEmbeddingHelper(
    embedding=embedder,
    start_tokens=[data.vocab.bos_token_id] * 100,
    end_token=data.vocab.eos_token_id,
    tau=0.1)
outputs_2, _, sequence_length = decoder(
    max_decoding_length=60, helper=helper_2)
  3. hparams["helper_train"] and hparams["helper_infer"]: Specifying the helper through hyperparameters. The train or infer strategy is selected based on mode. Appropriate arguments (e.g., inputs, start_tokens) are selected to construct the helper. Additional arguments for the helper constructor can be provided either through **kwargs, or through hparams["helper_train/infer"]["kwargs"].

    This approach is used only when both decoding_strategy and helper are None.

    Example:

    h = {
        "helper_infer": {
            "type": "GumbelSoftmaxEmbeddingHelper",
            "kwargs": { "tau": 0.1 }
        }
    }
    embedder = WordEmbedder(vocab_size=data.vocab.size)
    decoder = BasicRNNDecoder(vocab_size=data.vocab.size, hparams=h)
    
    # Gumbel-softmax decoding
    decoder.eval()  # disable dropout
    output, _, _ = decoder(
        decoding_strategy=None,  # explicitly set to None
        embedding=embedder,
        start_tokens=[data.vocab.bos_token_id] * 100,
        end_token=data.vocab.eos_token_id,
        max_decoding_length=60)
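The precedence among the three mechanisms described above can be sketched as a small decision function. The name `select_helper_source` and its return labels are assumptions made for this example; it only restates which mechanism applies and does not construct a real helper.

```python
def select_helper_source(helper=None, decoding_strategy=None, training=True):
    """Return which of the three mechanisms would determine the helper.

    Hypothetical function for illustration; not part of the Texar API.
    """
    if helper is not None:
        # An explicit helper overrides everything else.
        return "helper"
    if decoding_strategy is not None:
        # Used only when no helper is given.
        return "decoding_strategy"
    # Used only when both of the above are None; chosen by the mode.
    return "hparams.helper_train" if training else "hparams.helper_infer"


explicit = select_helper_source(helper=object(), decoding_strategy="infer_greedy")
by_strategy = select_helper_source(decoding_strategy="infer_sample")
by_hparams_eval = select_helper_source(training=False)
```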
    
Parameters
  • decoding_strategy (str) – A string specifying the decoding strategy. Different arguments are required based on the strategy. Ignored if helper is given.

  • start_tokens (optional) – A torch.LongTensor of shape [batch_size], the start tokens. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when hparams-configured helper is used. When used with the Texar data module, to get batch_size samples where batch_size is changing according to the data module, this can be set as start_tokens=torch.full_like(batch['length'], bos_token_id).

  • end_token (optional) – An integer or 0D torch.LongTensor, the token that marks the end of decoding. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when hparams-configured helper is used.

  • softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples. Must be > 0. If None, 1.0 is used. Used when decoding_strategy="infer_sample".

  • infer_mode (optional) – If not None, overrides mode given by self.training.

  • **kwargs – Other keyword arguments for constructing helpers defined by hparams["helper_train"] or hparams["helper_infer"].

Returns

The constructed helper instance.
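The effect of softmax_temperature described in the parameter list can be illustrated with a small pure-Python softmax. This is a minimal sketch of the general technique (dividing the logits by the temperature before the softmax), not the library's sampling code; the function name is an assumption for this example.

```python
import math


def softmax_with_temperature(logits, softmax_temperature=None):
    """Divide the logits by the temperature, then apply a softmax."""
    temperature = 1.0 if softmax_temperature is None else softmax_temperature
    if temperature <= 0:
        raise ValueError("softmax_temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)   # below 1.0: more peaked
default = softmax_with_temperature(logits)      # None behaves like 1.0
flat = softmax_with_temperature(logits, 2.0)    # above 1.0: more uniform
```

A flatter distribution spreads probability mass over more tokens, which is why larger temperatures produce more random samples.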

set_default_train_helper(helper)[source]

Set the default helper used in training mode.

Parameters

helper – The helper to set as default training helper.

set_default_infer_helper(helper)[source]

Set the default helper used in eval (inference) mode.

Parameters

helper – The helper to set as default inference helper.

dynamic_decode(helper, inputs, sequence_length, initial_state, max_decoding_length=None, impute_finished=False, step_hook=None)[source]

Generic routine for dynamic decoding. Please check the documentation for the TensorFlow counterpart.

Returns

A tuple of output, final state, and sequence lengths. Note that final state could be None, when all sequences are of zero length and initial_state is also None.

abstract initialize(helper, inputs, sequence_length, initial_state)[source]

Called before any decoding iterations.

This method must compute initial input values and initial state.

Parameters
  • helper – The Helper instance to use.

  • inputs (optional) – A (structure of) input tensors.

  • sequence_length (optional) – A torch.LongTensor representing lengths of each sequence.

  • initial_state – A possibly nested structure of tensors indicating the initial decoder state.

Returns

A tuple (finished, initial_inputs, initial_state) representing initial values of finished flags, inputs, and state.

abstract step(helper, time, inputs, state)[source]

Compute the output and the state at the current time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • inputs – Inputs for this time step.

  • state – Decoder state from the previous time step.

Returns

A tuple (outputs, next_state).

  • outputs is an object containing the decoder output.

  • next_state is the decoder state for the next time step.

abstract next_inputs(helper, time, outputs)[source]

Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • outputs – An object containing the decoder output.

Returns

A tuple (next_inputs, finished).

  • next_inputs is the tensor that should be used as input for the next step.

  • finished is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.

finalize(outputs: Output, final_state: State, sequence_lengths: torch.LongTensor) → Tuple[Output, State][source]
finalize(outputs: Output, final_state: Optional[State], sequence_lengths: torch.LongTensor) → Tuple[Output, Optional[State]]

Called after all decoding iterations have finished.

Parameters
  • outputs – Outputs at each time step.

  • final_state – The RNNCell state after the last time step.

  • sequence_lengths – Sequence lengths for each sequence in batch.

Returns

A tuple (outputs, final_state).

  • outputs is an object containing the decoder output.

  • final_state is the final decoder state.
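The order in which initialize(), step(), next_inputs(), and finalize() are called can be sketched as a plain-Python loop. This is only an illustration of the protocol described above, not the library's implementation: ToyDecoder and dynamic_decode are hypothetical names, plain lists stand in for tensors, and the helper is not used.

```python
class ToyDecoder:
    """Hypothetical decoder: counts each input down and finishes at 0."""

    def initialize(self, helper, inputs, sequence_length, initial_state):
        # Initial finished flags, initial inputs, and initial state.
        finished = [x <= 0 for x in inputs]
        return finished, list(inputs), initial_state

    def step(self, helper, time, inputs, state):
        # The output is the input minus one; the state counts the steps taken.
        outputs = [x - 1 for x in inputs]
        return outputs, state + 1

    def next_inputs(self, helper, time, outputs):
        # Feed the outputs back in; a sequence is complete once it reaches 0.
        return list(outputs), [x <= 0 for x in outputs]

    def finalize(self, outputs, final_state, sequence_lengths):
        return outputs, final_state


def dynamic_decode(decoder, inputs, initial_state=0, max_decoding_length=None):
    finished, step_inputs, state = decoder.initialize(
        None, inputs, None, initial_state)
    lengths = [0] * len(inputs)
    all_outputs = []
    time = 0
    while not all(finished):
        if max_decoding_length is not None and time >= max_decoding_length:
            break
        outputs, state = decoder.step(None, time, step_inputs, state)
        step_inputs, step_finished = decoder.next_inputs(None, time, outputs)
        # Sequences that were still running at this step grow by one.
        lengths = [n if done else n + 1 for n, done in zip(lengths, finished)]
        finished = [a or b for a, b in zip(finished, step_finished)]
        all_outputs.append(outputs)
        time += 1
    outputs, state = decoder.finalize(all_outputs, state, lengths)
    return outputs, state, lengths


outputs, final_state, sequence_lengths = dynamic_decode(ToyDecoder(), [2, 3])
```

When every sequence is already finished after initialize(), the loop body never runs and the returned state is simply the initial state.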

property vocab_size

The vocabulary size.

property output_layer

The output layer.

RNNDecoderBase

class texar.torch.modules.RNNDecoderBase(*args, **kwds)[source]

Base class inherited by all RNN decoder classes. See BasicRNNDecoder for the arguments.

See forward() for the inputs and outputs of RNN decoders in general.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

The hyperparameters are the same as in default_hparams() of BasicRNNDecoder, except that the default "name" here is "rnn_decoder".

forward(inputs=None, sequence_length=None, initial_state=None, helper=None, max_decoding_length=None, impute_finished=False, infer_mode=None, **kwargs)[source]

Performs decoding. This is a shared interface for both BasicRNNDecoder and AttentionRNNDecoder.

Implementation calls initialize() once and step() repeatedly on the decoder object. Please refer to tf.contrib.seq2seq.dynamic_decode.

See also

Arguments of create_helper(), for arguments like decoding_strategy.

Parameters
  • inputs (optional) –

    Input tensors for teacher forcing decoding. Used when decoding_strategy is set to "train_greedy", or when hparams-configured helper is used.

    The inputs is a torch.LongTensor used as index to look up embeddings and feed in the decoder. For example, if embedder is an instance of WordEmbedder, then inputs is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.

  • sequence_length (optional) – A 1D int Tensor containing the sequence length of inputs. Used when decoding_strategy="train_greedy" or when an hparams-configured helper is used.

  • initial_state (optional) – Initial state of decoding. If None (default), zero state is used.

  • max_decoding_length – An int scalar Tensor indicating the maximum allowed number of decoding steps. If None (default), either hparams["max_decoding_length_train"] or hparams["max_decoding_length_infer"] is used, according to the mode.

  • impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished.

  • helper (optional) –

    An instance of Helper that defines the decoding strategy. If given, decoding_strategy and helper configurations in hparams are ignored.

    create_helper() can be used to create some of the common helpers for, e.g., teacher-forcing decoding, greedy decoding, sample decoding, etc.

  • infer_mode (optional) – If not None, overrides mode given by self.training.

  • **kwargs – Other keyword arguments for constructing helpers defined by hparams["helper_train"] or hparams["helper_infer"].

Returns

(outputs, final_state, sequence_lengths), where

  • outputs: an object containing the decoder output on all time steps.

  • final_state: the cell state of the final time step.

  • sequence_lengths: a torch.LongTensor of shape [batch_size] containing the length of each sample.
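The effect of impute_finished on a single decoding step can be illustrated with a small sketch. It assumes per-batch lists of floats in place of tensors, and impute_step is a hypothetical helper name, not part of the library.

```python
def impute_step(finished, new_outputs, new_state, prev_state):
    """Apply impute_finished-style masking for one decoding step.

    All arguments are per-batch lists; plain floats stand in for tensors.
    """
    # Outputs of finished entries are zeroed out.
    outputs = [0.0 if done else out
               for done, out in zip(finished, new_outputs)]
    # States of finished entries are copied through from the previous step.
    state = [prev if done else new
             for done, new, prev in zip(finished, new_state, prev_state)]
    return outputs, state


step_outputs, step_state = impute_step(
    finished=[False, True],
    new_outputs=[1.5, 2.5],
    new_state=[10.0, 20.0],
    prev_state=[9.0, 19.0])
```

Here the second batch entry is already finished, so its output is replaced by zero and its state keeps the previous value.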

next_inputs(helper, time, outputs)[source]

Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • outputs – An object containing the decoder output.

Returns

A tuple (next_inputs, finished).

  • next_inputs is the tensor that should be used as input for the next step.

  • finished is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.

property cell

The RNN cell.

zero_state(batch_size)[source]

Zero state of the RNN cell. Equivalent to decoder.cell.zero_state.

property state_size

The state size of decoder cell. Equivalent to decoder.cell.state_size.

property output_layer

The output layer.

BasicRNNDecoder

class texar.torch.modules.BasicRNNDecoder(*args, **kwds)[source]

Basic RNN decoder.

Parameters
  • input_size (int) – Dimension of input embeddings.

  • vocab_size (int, optional) – Vocabulary size. Required if output_layer is None.

  • token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor tokens as argument. This is the embedder called in embed_tokens() to convert input tokens to embeddings.

  • token_pos_embedder

    An instance of torch.nn.Module, or a function taking two torch.LongTensors tokens and positions as argument. This is the embedder called in embed_tokens() to convert input tokens with positions to embeddings.

    Note

    Only one among token_embedder and token_pos_embedder should be specified. If neither is specified, you must subclass BasicRNNDecoder and override embed_tokens().

  • cell (RNNCellBase, optional) – An instance of RNNCellBase. If None (default), a cell is created as specified in hparams.

  • output_layer (optional) – An instance of torch.nn.Module. Applied to the RNN cell output to get logits. If None, a torch.nn.Linear layer is used with output dimension set to vocab_size. Set output_layer to identity() if you do not want to have an output layer after the RNN cell outputs.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
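How missing hyperparameters fall back to default values can be sketched with a recursive dictionary merge. This is a simplified illustration only: fill_defaults is a hypothetical function, and the library's HParams class may perform additional checks that are not reproduced here.

```python
import copy


def fill_defaults(hparams, defaults):
    """Return a copy of `defaults` overridden by the values in `hparams`."""
    merged = copy.deepcopy(defaults)
    for key, value in (hparams or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dictionaries are merged key by key.
            merged[key] = fill_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


defaults = {
    "max_decoding_length_train": None,
    "max_decoding_length_infer": None,
    "helper_train": {"type": "TrainingHelper", "kwargs": {}},
    "name": "basic_rnn_decoder",
}
merged = fill_defaults(
    {"max_decoding_length_infer": 60, "name": "my_decoder"}, defaults)
```

Only the two keys given by the user change; every other entry keeps its default value.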

See forward() for the inputs and outputs of the decoder. The decoder returns (outputs, final_state, sequence_lengths), where outputs is an instance of BasicRNNDecoderOutput.

Example

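import texar.torch as tx
from texar.torch.modules import BasicRNNDecoder, WordEmbedder

# `data` and `data_batch` are assumed to come from a Texar data pipeline.
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(
    input_size=embedder.dim,
    vocab_size=data.vocab.size)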
# Training loss
outputs, _, _ = decoder(
    decoding_strategy='train_greedy',
    inputs=embedder(data_batch['text_ids']),
    sequence_length=data_batch['length']-1)
loss = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=data_batch['text_ids'][:, 1:],
    logits=outputs.logits,
    sequence_length=data_batch['length']-1)

# Create helper
helper = decoder.create_helper(
    decoding_strategy='infer_sample',
    start_tokens=[data.vocab.bos_token_id]*100,
    end_token=data.vocab.eos_token_id,
    embedding=embedder)

# Inference sample
outputs, _, _ = decoder(
    helper=helper,
    max_decoding_length=60)

sample_text = tx.utils.map_ids_to_strs(
    outputs.sample_id, data.vocab)
print(sample_text)
# [
#   the first sequence sample .
#   the second sequence sample .
#   ...
# ]
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "rnn_cell": default_rnn_cell_hparams(),
    "max_decoding_length_train": None,
    "max_decoding_length_infer": None,
    "helper_train": {
        "type": "TrainingHelper",
        "kwargs": {}
    },
    "helper_infer": {
        "type": "SampleEmbeddingHelper",
        "kwargs": {}
    },
    "name": "basic_rnn_decoder"
}

Here:

“rnn_cell”: dict

A dictionary of RNN cell hyperparameters. Ignored if cell is given to the decoder constructor. The default value is defined in default_rnn_cell_hparams().

“max_decoding_length_train”: int or None

Maximum allowed number of decoding steps in training mode. If None (default), decoding is performed until fully done, e.g., encountering the <EOS> token. Ignored if max_decoding_length is given (i.e., is not None) when calling the decoder.

“max_decoding_length_infer”: int or None

Same as "max_decoding_length_train" but for inference mode.

“helper_train”: dict

The hyperparameters of the helper used in training. "type" can be a helper class, its name or module path, or a helper instance. If a class name is given, the class must be from module texar.torch.modules, or texar.torch.custom. This is used only when both "decoding_strategy" and "helper" arguments are None when calling the decoder. See forward() for more details.

“helper_infer”: dict

Same as "helper_train" but during inference mode.

“name”: str

Name of the decoder. The default value is "basic_rnn_decoder".
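The different forms accepted by "type" (a class, a class name, a module path, or an instance) can be resolved along the following lines. This is a sketch under stated assumptions, not the library's code: resolve_type and build are hypothetical names, the standard-library fractions module stands in for texar.torch.modules and texar.torch.custom, and "kwargs" is assumed to be ignored when an instance is given.

```python
import importlib
import inspect
from fractions import Fraction


def resolve_type(type_or_name, module_paths):
    """Resolve a "type" hyperparameter to a class or an instance."""
    if not isinstance(type_or_name, str):
        # Already a class or an instance: use it as given.
        return type_or_name
    if "." in type_or_name:
        # Full module path, e.g. "fractions.Fraction".
        module_name, class_name = type_or_name.rsplit(".", 1)
        return getattr(importlib.import_module(module_name), class_name)
    # Bare class name: search the allowed modules in order.
    for module_name in module_paths:
        module = importlib.import_module(module_name)
        if hasattr(module, type_or_name):
            return getattr(module, type_or_name)
    raise ValueError(f"Class not found in {module_paths}: {type_or_name}")


def build(type_or_name, kwargs, module_paths):
    resolved = resolve_type(type_or_name, module_paths)
    # kwargs only apply when a class (not an instance) was specified.
    return resolved(**kwargs) if inspect.isclass(resolved) else resolved


by_name = build("Fraction", {"numerator": 1, "denominator": 2}, ["fractions"])
by_path = build("fractions.Fraction", {"numerator": 3}, [])
by_class = build(Fraction, {"numerator": 5}, [])
instance = Fraction(1, 3)
by_instance = build(instance, {"numerator": 7}, [])
```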

next_inputs(helper, time, outputs)[source]

Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • outputs – An object containing the decoder output.

Returns

A tuple (next_inputs, finished).

  • next_inputs is the tensor that should be used as input for the next step.

  • finished is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.

BasicRNNDecoderOutput

class texar.torch.modules.BasicRNNDecoderOutput(logits, sample_id, cell_output)[source]

The outputs of BasicRNNDecoder that include both RNN outputs and sampled IDs at each step. This is also used to store results of all the steps after decoding the whole sequence.

property logits

The outputs of RNN (at each step/of all steps) by applying the output layer on cell outputs. For example, in BasicRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, vocab_size] after decoding the whole sequence.

property sample_id

The sampled results (at each step/of all steps). For example, in BasicRNNDecoder with decoding strategy of "train_greedy", this is a torch.LongTensor of shape [batch_size, max_time] containing the sampled token indices of all steps. Note that the shape of sample_id is different for different decoding strategy or helper. Please refer to Helper for the detailed information.

property cell_output

The output of RNN cell (at each step/of all steps). This contains the results prior to the output layer. For example, in BasicRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, cell_output_size] after decoding the whole sequence.
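The relation between the shapes of logits and sample_id can be illustrated for greedy strategies. The sketch assumes that a greedy strategy takes the arg-max over the vocabulary dimension at every step; nested lists stand in for tensors, and greedy_sample_id is a hypothetical name.

```python
def greedy_sample_id(logits):
    """Pick the arg-max token index at every step.

    `logits` is a nested list of shape [batch_size, max_time, vocab_size];
    the result has shape [batch_size, max_time].
    """
    return [[max(range(len(step)), key=step.__getitem__) for step in seq]
            for seq in logits]


logits = [
    [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]],   # first sequence, two steps
    [[0.0, 0.0, 1.0], [0.2, 0.9, 0.1]],    # second sequence, two steps
]
sample_id = greedy_sample_id(logits)
```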

AttentionRNNDecoder

class texar.torch.modules.AttentionRNNDecoder(*args, **kwds)[source]

RNN decoder with attention mechanism.

Parameters
  • input_size (int) – Dimension of input embeddings.

  • encoder_output_size (int) – The output size of the encoder cell.

  • vocab_size (int, optional) – Vocabulary size. Required if output_layer is None.

  • token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor tokens as argument. This is the embedder called in embed_tokens() to convert input tokens to embeddings.

  • token_pos_embedder

    An instance of torch.nn.Module, or a function taking two torch.LongTensors tokens and positions as argument. This is the embedder called in embed_tokens() to convert input tokens with positions to embeddings.

    Note

    Only one among token_embedder and token_pos_embedder should be specified. If neither is specified, you must subclass AttentionRNNDecoder and override embed_tokens().

  • cell (RNNCellBase, optional) – An instance of RNNCellBase. If None, a cell is created as specified in hparams.

  • output_layer (optional) –

    An output layer that transforms cell output to logits. This can be:

    • A callable layer, e.g., an instance of torch.nn.Module.

    • A tensor. A dense layer will be created using the tensor as the kernel weights. The bias of the dense layer is determined by hparams.output_layer_bias. This can be used to tie the output layer with the input embedding matrix, as proposed in https://arxiv.org/pdf/1608.05859.pdf

    • None. A dense layer will be created based on vocab_size and hparams.output_layer_bias.

    • If no output layer after the cell output is needed, set (vocab_size=None, output_layer=texar.torch.core.identity).

  • cell_input_fn (callable, optional) – A callable that produces RNN cell inputs. If None (default), the default is used: lambda inputs, attention: torch.cat([inputs, attention], -1), which concatenates regular RNN cell inputs with attentions.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See texar.torch.modules.RNNDecoderBase.forward() for the inputs and outputs of the decoder. The decoder returns (outputs, final_state, sequence_lengths), where outputs is an instance of AttentionRNNDecoderOutput.

Example

from texar.torch.modules import (
    AttentionRNNDecoder, UnidirectionalRNNEncoder, WordEmbedder)

# `data` and `data_batch` are assumed to come from a Texar data pipeline.
# Encodes the source
enc_embedder = WordEmbedder(vocab_size=data.source_vocab.size, ...)
encoder = UnidirectionalRNNEncoder(...)
enc_outputs, _ = encoder(
    inputs=enc_embedder(data_batch['source_text_ids']),
    sequence_length=data_batch['source_length'])
# Decodes while attending to the source
dec_embedder = WordEmbedder(vocab_size=data.target_vocab.size, ...)
decoder = AttentionRNNDecoder(
    # A unidirectional encoder has a single output size.
    encoder_output_size=encoder.output_size,
    input_size=dec_embedder.dim,
    vocab_size=data.target_vocab.size)
outputs, _, _ = decoder(
    decoding_strategy='train_greedy',
    memory=enc_outputs,
    memory_sequence_length=data_batch['source_length'],
    inputs=dec_embedder(data_batch['target_text_ids']),
    sequence_length=data_batch['target_length']-1)
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values. Common hyperparameters are the same as in BasicRNNDecoder.default_hparams(). Additional hyperparameters are for attention mechanism configuration.

{
    "attention": {
        "type": "LuongAttention",
        "kwargs": {
            "num_units": 256,
        },
        "attention_layer_size": None,
        "alignment_history": False,
        "output_attention": True,
    },
    # The following hyperparameters are the same as with
    # `BasicRNNDecoder`
    "rnn_cell": default_rnn_cell_hparams(),
    "max_decoding_length_train": None,
    "max_decoding_length_infer": None,
    "helper_train": {
        "type": "TrainingHelper",
        "kwargs": {}
    },
    "helper_infer": {
        "type": "SampleEmbeddingHelper",
        "kwargs": {}
    },
    "name": "attention_rnn_decoder"
}

Here:

“attention”: dict

Attention hyperparameters, including:

“type”: str or class or instance

The attention type. Can be an attention class, its name or module path, or a class instance. The class must be a subclass of AttentionMechanism. See Attention Mechanism for all supported attention mechanisms. If class name is given, the class must be from modules texar.torch.core or texar.torch.custom.

Example:

# class name
"type": "LuongAttention"
"type": "BahdanauAttention"
# module path
"type": "texar.torch.core.BahdanauMonotonicAttention"
"type": "my_module.MyAttentionMechanismClass"
# class
"type": texar.torch.core.LuongMonotonicAttention
# instance
"type": LuongAttention(...)
“kwargs”: dict

Keyword arguments for the attention class constructor. Arguments memory and memory_sequence_length should not be specified here because they are passed when calling the decoder (see forward()). Ignored if "type" is an attention class instance. For example:

"type": "LuongAttention",
"kwargs": {
    "num_units": 256,
    "probability_fn": torch.nn.functional.softmax,
}

Here “probability_fn” can also be set to the string name or module path to a probability function.

“attention_layer_size”: int or None

The depth of the attention (output) layer. The context and cell output are fed into the attention layer to generate attention at each time step. If None (default), use the context as attention at each time step.

“alignment_history”: bool

Whether to store alignment history from all time steps in the final output state. (Stored as a time major TensorArray on which you must call stack().)

“output_attention”: bool

If True (default), the output at each time step is the attention value. This is the behavior of Luong-style attention mechanisms. If False, the output at each time step is the output of cell. This is the behavior of Bahdanau-style attention mechanisms. In both cases, the attention tensor is propagated to the next time step via the state and is used there. This flag only controls whether the attention mechanism is propagated up to the next cell in an RNN stack or to the top RNN output.
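How the alignments, the context, and the output_attention flag relate can be sketched for a single decoding step. This is a simplified, dependency-free illustration rather than the library's attention classes: it uses plain dot-product scoring without any learned projection, softmax as the probability function, and the hypothetical names softmax, dot_attention, and step_output.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(query, memory):
    """Return (alignments, context) for one query against `memory`.

    `query` is a vector of length d; `memory` is a list of max_time
    vectors of length d.
    """
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    alignments = softmax(scores)
    # The context is the alignment-weighted sum of the memory rows.
    context = [sum(a * row[i] for a, row in zip(alignments, memory))
               for i in range(len(query))]
    return alignments, context


def step_output(cell_output, context, output_attention=True):
    # With attention_layer_size=None, the context itself is the attention.
    attention = context
    return attention if output_attention else cell_output


alignments, context = dot_attention(
    query=[1.0, 0.0], memory=[[1.0, 0.0], [1.0, 0.0]])
luong_style = step_output([9.0, 9.0], context, output_attention=True)
bahdanau_style = step_output([9.0, 9.0], context, output_attention=False)
```

In both cases the attention is still carried to the next step via the state; the flag only selects what is emitted as the step output.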

next_inputs(helper, time, outputs)[source]

Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • outputs – An object containing the decoder output.

Returns

A tuple (next_inputs, finished).

  • next_inputs is the tensor that should be used as input for the next step.

  • finished is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.

forward(memory, memory_sequence_length=None, inputs=None, sequence_length=None, initial_state=None, helper=None, max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]

Performs decoding.

Implementation calls initialize() once and step() repeatedly on the Decoder object. Please refer to tf.contrib.seq2seq.dynamic_decode.

See also

Arguments of create_helper().

Parameters
  • memory – The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, …].

  • memory_sequence_length – (optional) Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths.

  • inputs (optional) –

    Input tensors for teacher forcing decoding. Used when decoding_strategy is set to "train_greedy", or when hparams-configured helper is used.

    The inputs is a torch.LongTensor used as index to look up embeddings and feed in the decoder. For example, if embedder is an instance of WordEmbedder, then inputs is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.

  • sequence_length (optional) – A 1D int Tensor containing the sequence length of inputs. Used when decoding_strategy="train_greedy" or when an hparams-configured helper is used.

  • initial_state (optional) – Initial state of decoding. If None (default), zero state is used.

  • helper (optional) – An instance of Helper that defines the decoding strategy. If given, decoding_strategy and helper configurations in hparams are ignored.

  • max_decoding_length – An int scalar Tensor indicating the maximum allowed number of decoding steps. If None (default), either hparams["max_decoding_length_train"] or hparams["max_decoding_length_infer"] is used, according to the mode.

  • impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished.

  • infer_mode (optional) – If not None, overrides mode given by self.training.

  • beam_width (int) – Set to use beam search. If given, decoding_strategy is ignored.

  • length_penalty (float) – Length penalty coefficient used in beam search decoding. Refer to https://arxiv.org/abs/1609.08144 for more details. It should be larger if longer sentences are desired.

  • **kwargs – Other keyword arguments for constructing helpers defined by hparams["helper_train"] or hparams["helper_infer"].
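The masking applied when memory_sequence_length is provided can be sketched as follows. Nested lists stand in for a memory tensor of shape [batch_size, max_time, dim], and mask_memory is a hypothetical name used only for this illustration.

```python
def mask_memory(memory, memory_sequence_length):
    """Zero out memory rows at positions past each sequence's length."""
    return [
        [row if t < length else [0.0] * len(row)
         for t, row in enumerate(seq)]
        for seq, length in zip(memory, memory_sequence_length)
    ]


memory = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],   # true length 2
    [[7.0, 8.0], [9.0, 1.0], [2.0, 3.0]],   # true length 3
]
masked = mask_memory(memory, [2, 3])
```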

Returns

  • For beam search decoding, returns a dict containing keys "sample_id" and "log_prob".

    • "sample_id" is a torch.LongTensor of shape [batch_size, max_time, beam_width] containing generated token indexes. sample_id[:, :, 0] is the most probable sample.

    • "log_prob" is a torch.Tensor of shape [batch_size, beam_width] containing the log probability of each sequence sample.

  • For "infer_greedy" and "infer_sample" decoding or decoding with helper, returns a tuple (outputs, final_state, sequence_lengths), where

    • outputs: an object containing the decoder output on all time steps.

    • final_state: the cell state of the final time step.

    • sequence_lengths: an int Tensor of shape [batch_size] containing the length of each sample.
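The role of length_penalty in beam search can be illustrated with the length normalization from the paper referenced above, lp(Y) = ((5 + |Y|) / 6) ** alpha, where a hypothesis is ranked by its log probability divided by lp(Y). Whether the decoder applies exactly this form is an assumption of the sketch, and both function names are hypothetical.

```python
def length_penalty(length, alpha):
    """Length normalization term ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def penalized_score(log_prob, length, alpha):
    """Score used to rank a hypothesis: log probability over the penalty."""
    return log_prob / length_penalty(length, alpha)


# A short and a long hypothesis; the long one has the lower raw log probability.
short = {"log_prob": -4.0, "length": 1}
long_ = {"log_prob": -6.0, "length": 7}

no_penalty = [penalized_score(h["log_prob"], h["length"], alpha=0.0)
              for h in (short, long_)]
with_penalty = [penalized_score(h["log_prob"], h["length"], alpha=1.0)
                for h in (short, long_)]
```

With alpha=0.0 the short hypothesis wins on raw log probability; with alpha=1.0 the long hypothesis is ranked higher, which is why a larger coefficient favors longer sentences.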

AttentionRNNDecoderOutput

class texar.torch.modules.AttentionRNNDecoderOutput(logits, sample_id, cell_output, attention_scores, attention_context)[source]

The outputs of AttentionRNNDecoder that additionally includes attention results.

property logits

The outputs of RNN (at each step/of all steps) by applying the output layer on cell outputs. For example, in AttentionRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, vocab_size] after decoding the whole sequence.

property sample_id

The sampled results (at each step/of all steps). For example, in AttentionRNNDecoder with decoding strategy of "train_greedy", this is a torch.LongTensor of shape [batch_size, max_time] containing the sampled token indices of all steps. Note that the shape of sample_id is different for different decoding strategy or helper. Please refer to Helper for the detailed information.

property cell_output

The output of RNN cell (at each step/of all steps). This contains the results prior to the output layer. For example, in AttentionRNNDecoder with default hyperparameters, this is a torch.Tensor of shape [batch_size, max_time, cell_output_size] after decoding the whole sequence.

property attention_scores

A single or tuple of Tensor(s) containing the alignments emitted (at the previous time step/of all time steps) for each attention mechanism.

property attention_context

The attention emitted (at the previous time step/of all time steps).

GPT2Decoder

class texar.torch.modules.GPT2Decoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Raw GPT2 Transformer for decoding sequences. Please see PretrainedGPT2Mixin for a brief description of GPT2.

This module basically stacks WordEmbedder, PositionEmbedder, and TransformerDecoder.

This module supports the architecture first proposed in (Radford et al.) GPT2.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., gpt2-small). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The decoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the decoder arch is determined by hparams['pretrained_model_name'] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the decoder arch is defined by the configurations in hparams and weights are randomly initialized.

{
    "name": "gpt2_decoder",
    "pretrained_model_name": "gpt2-small",
    "vocab_size": 50257,
    "context_size": 1024,
    "embedding_size": 768,
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "position_size": 1024,
    "position_embed": {
        "dim": 768,
        "name": "position_embeddings"
    },

    # hparams for TransformerDecoder
    "decoder": {
        "dim": 768,
        "num_blocks": 12,
        "embedding_dropout": 0,
        "residual_dropout": 0,
        "multihead_attention": {
            "use_bias": True,
            "num_units": 768,
            "num_heads": 12,
            "dropout_rate": 0.0,
            "output_dim": 768
        },
        "initializer": {
            "type": "variance_scaling_initializer",
            "kwargs": {
                "factor": 1.0,
                "mode": "FAN_AVG",
                "uniform": True
            }
        },
        "eps": 1e-5,
        "poswise_feedforward": {
            "layers": [
                {
                    "type": "Linear",
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": True
                    }
                },
                {
                    "type": "GPTGELU",
                    "kwargs": {}
                },
                {
                    "type": "Linear",
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": True
                    }
                }
            ],
            "name": "ffn"
        }
    },
}

Here:

The default parameters are values for the 124M GPT2 model.

“pretrained_model_name”: str or None

The name of the pre-trained GPT2 model. If None, the model will be randomly initialized.

“embed”: dict

Hyperparameters for word embedding layer.

“vocab_size”: int

The vocabulary size of inputs in GPT2Model.

“position_embed”: dict

Hyperparameters for position embedding layer.

“eps”: float

Epsilon values for layer norm layers.

“position_size”: int

The maximum sequence length that this model might ever be used with.

“name”: str

Name of the module.
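As a rough check that the default hyperparameters correspond to the 124M model, the parameter count can be worked out by hand. The sketch assumes the standard GPT-2 block layout (two layer norms per block, a final layer norm, biased linear layers, and an output layer that shares the word embedding matrix); how this module organizes its weights internally may differ, and gpt2_param_count is a hypothetical name.

```python
def gpt2_param_count(vocab_size, position_size, dim, num_blocks, ffn_dim):
    # Word and position embedding matrices.
    embeddings = vocab_size * dim + position_size * dim
    # One layer norm has a scale and a bias vector.
    layer_norm = 2 * dim
    # Query/key/value projections plus the attention output projection.
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)
    # Two linear layers of the position-wise feed-forward network.
    ffn = (dim * ffn_dim + ffn_dim) + (ffn_dim * dim + dim)
    block = 2 * layer_norm + attention + ffn
    # All blocks plus one final layer norm.
    return embeddings + num_blocks * block + layer_norm


total = gpt2_param_count(
    vocab_size=50257, position_size=1024, dim=768,
    num_blocks=12, ffn_dim=3072)
print(total)  # 124439808
```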

forward(inputs=None, sequence_length=None, memory=None, memory_sequence_length=None, memory_attention_bias=None, context=None, context_sequence_length=None, helper=None, decoding_strategy='train_greedy', max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]

Performs decoding. Has exactly the same interface as texar.torch.modules.TransformerDecoder.forward(). Please refer to it for the detailed usage.

XLNetDecoder

class texar.torch.modules.XLNetDecoder(*args, **kwds)[source]

Raw XLNet module for decoding sequences. Please see PretrainedXLNetMixin for a brief description of XLNet.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The decoder arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the decoder arch is determined by hparams['pretrained_model_name'] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the decoder arch is defined by the configurations in hparams and weights are randomly initialized.

{
    "pretrained_model_name": "xlnet-base-cased",
    "untie_r": True,
    "num_layers": 12,
    "mem_len": 0,
    "reuse_len": 0,
    "num_heads": 12,
    "hidden_dim": 768,
    "head_dim": 64,
    "dropout": 0.1,
    "attention_dropout": 0.1,
    "use_segments": True,
    "ffn_inner_dim": 3072,
    "activation": 'gelu',
    "vocab_size": 32000,
    "max_seq_length": 512,
    "initializer": None,
    "name": "xlnet_decoder",
}

Here:

The default parameters are values for the cased XLNet-Base model.

“pretrained_model_name”: str or None

The name of the pre-trained XLNet model. If None, the model will be randomly initialized.

“untie_r”: bool

Whether to untie the biases in attention.

“num_layers”: int

The number of stacked layers.

“mem_len”: int

The number of tokens to cache.

“reuse_len”: int

The number of tokens in the current batch to be cached and reused in the future.

“num_heads”: int

The number of attention heads.

“hidden_dim”: int

The hidden size.

“head_dim”: int

The dimension size of each attention head.

“dropout”: float

Dropout rate.

“attention_dropout”: float

Dropout rate on attention probabilities.

“use_segments”: bool

Whether to use segment embedding.

“ffn_inner_dim”: int

The hidden size in feed-forward layers.

“activation”: str

The activation function, either "relu" or "gelu".

“vocab_size”: int

The vocabulary size.

“max_seq_length”: int

The maximum sequence length for RelativePositionalEncoding.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“name”: str

Name of the module.
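The precedence between the constructor argument pretrained_model_name and the hparams entry of the same name can be summarized in a few lines. This is a sketch of the rule stated in default_hparams(), with the hypothetical function name resolve_pretrained_name; "some-other-model" is a placeholder, not a real model name.

```python
def resolve_pretrained_name(constructor_name, hparams):
    """Return the pre-trained model name to load, or None for random init."""
    if constructor_name is not None:
        # The constructor argument takes precedence over hparams.
        return constructor_name
    # Otherwise fall back to hparams; None means random initialization.
    return hparams.get("pretrained_model_name")


from_constructor = resolve_pretrained_name(
    "some-other-model", {"pretrained_model_name": "xlnet-base-cased"})
from_hparams = resolve_pretrained_name(
    None, {"pretrained_model_name": "xlnet-base-cased"})
random_init = resolve_pretrained_name(None, {"pretrained_model_name": None})
```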

embed_tokens(tokens, positions)[source]

Convert tokens along with positions to embeddings.

Parameters
  • tokens – A torch.LongTensor denoting the token indices to convert to embeddings.

  • positions – A torch.LongTensor with the same size as tokens, denoting the positions of the tokens. This is useful if the decoder uses positional embeddings.

Returns

A torch.Tensor of size tokens.size() + (embed_dim,), denoting the converted embeddings.

next_inputs(helper, time, outputs)[source]

Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • outputs – An object containing the decoder output.

Returns

A tuple (next_inputs, finished).

  • next_inputs is the tensor that should be used as input for the next step.

  • finished is a torch.ByteTensor tensor telling whether the sequence is complete, for each sequence in the batch.

forward(start_tokens, memory=None, cache_len=512, max_decoding_length=500, recompute_memory=True, print_steps=False, helper_type=None, **helper_kwargs)[source]

Perform autoregressive decoding using XLNet. The algorithm is largely inspired by: https://github.com/rusiaaman/XLNet-gen.

Parameters
  • start_tokens – A LongTensor of shape [batch_size, prompt_len], representing the tokenized initial prompt.

  • memory (optional) – The initial memory.

  • cache_len – Length of memory (number of tokens) to cache.

  • max_decoding_length (int) – Maximum number of tokens to decode.

  • recompute_memory (bool) – If True, the entire memory is recomputed for each token to generate. This leads to better performance because it enables every generated token to attend to each other, compared to reusing previous memory which is equivalent to using a causal attention mask. However, it is computationally more expensive. Defaults to True.

  • print_steps (bool) – If True, will print decoding progress.

  • helper_type – Type (or name of the type) of any sub-class of Helper.

  • helper_kwargs – The keyword arguments to pass to constructor of the specific helper type.

Returns

A tuple (output, new_memory).

  • output is the sampled tokens as a list of integers.

  • new_memory is the memory of the sampled tokens.

XLNetDecoderOutput

class texar.torch.modules.XLNetDecoderOutput(logits, sample_id)[source]

The output of XLNetDecoder.

property logits

A torch.Tensor of shape [batch_size, max_time, vocab_size] containing the logits.

property sample_id

A torch.LongTensor of shape [batch_size, max_time] (or [batch_size, max_time, vocab_size]) containing the sampled token indices. Note that the shape of sample_id is different for different decoding strategy or helper. Please refer to Helper for the detailed information.

TransformerDecoder

class texar.torch.modules.TransformerDecoder(*args, **kwds)[source]

Transformer decoder that applies multi-head self-attention for sequence decoding.

It is a stack of MultiheadAttentionEncoder, FeedForwardNetwork, and residual connections.

Parameters
  • token_embedder – An instance of torch.nn.Module, or a function taking a torch.LongTensor tokens as argument. This is the embedder called in embed_tokens() to convert input tokens to embeddings.

  • token_pos_embedder

    An instance of torch.nn.Module, or a function taking two torch.LongTensors tokens and positions as argument. This is the embedder called in embed_tokens() to convert input tokens with positions to embeddings.

    Note

    Only one among token_embedder and token_pos_embedder should be specified. If neither is specified, you must subclass TransformerDecoder and override embed_tokens().

  • vocab_size (int, optional) – Vocabulary size. Required if output_layer is None.

  • output_layer (optional) –

    An output layer that transforms cell output to logits. This can be:

    • A callable layer, e.g., an instance of torch.nn.Module.

    • A tensor. A torch.nn.Linear layer will be created using the tensor as weights. The bias of the dense layer is determined by hparams.output_layer_bias. This can be used to tie the output layer with the input embedding matrix, as proposed in https://arxiv.org/pdf/1608.05859.pdf.

    • None. A torch.nn.Linear layer will be created based on vocab_size and hparams.output_layer_bias.

    • If no output layer is needed at the end, set vocab_size to None and output_layer to identity().

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

initialize_blocks()[source]

Helper function which initializes blocks for decoder.

Should be overridden by any classes where block initialization varies.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # Same as in TransformerEncoder
    "num_blocks": 6,
    "dim": 512,
    "embedding_dropout": 0.1,
    "residual_dropout": 0.1,
    "poswise_feedforward": default_transformer_poswise_net_hparams,
    "multihead_attention": {
        'name': 'multihead_attention',
        'num_units': 512,
        'output_dim': 512,
        'num_heads': 8,
        'dropout_rate': 0.1,
        'use_bias': False,
    },
    "eps": 1e-12,
    "initializer": None,
    "name": "transformer_decoder",

    # Additional for TransformerDecoder
    "embedding_tie": True,
    "output_layer_bias": False,
    "max_decoding_length": int(1e10),
}

Here:

“num_blocks”: int

Number of stacked blocks.

“dim”: int

Hidden dimension of the decoder.

“embedding_dropout”: float

Dropout rate of the input word and position embeddings.

“residual_dropout”: float

Dropout rate of the residual connections.

“poswise_feedforward”: dict

Hyperparameters for a feed-forward network used in residual connections. Make sure the dimension of the output tensor is equal to dim.

See default_transformer_poswise_net_hparams() for details.

“multihead_attention”: dict

Hyperparameters for the multi-head attention strategy. Make sure the output_dim in this module is equal to dim.

See MultiheadAttentionEncoder for details.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module.

See get_initializer() for details.

“embedding_tie”: bool

Whether to use the word embedding matrix as the output layer that computes logits. If False, a new dense layer is created.

“eps”: float

Epsilon value for layer norm layers.

“output_layer_bias”: bool

Whether to add a bias to the output layer.

“max_decoding_length”: int

The maximum allowed number of decoding steps. Set to a very large number to avoid the length constraint. Ignored if provided in forward() or "train_greedy" decoding is used.

“name”: str

Name of the module.
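The dimension constraints described above can be checked before a decoder is built. The following is a minimal sketch using plain dicts; check_decoder_hparams is a hypothetical helper written for this illustration and is not part of Texar. It only checks the constraint on "multihead_attention" stated above; the "poswise_feedforward" structure is defined elsewhere and is left out.

```python
def check_decoder_hparams(hparams):
    """Hypothetical helper (not a Texar function): verify that the
    attention output size matches the decoder hidden size."""
    dim = hparams["dim"]
    output_dim = hparams["multihead_attention"]["output_dim"]
    if output_dim != dim:
        raise ValueError(
            f"multihead_attention.output_dim ({output_dim}) "
            f"must equal dim ({dim})"
        )


# Overrides for a smaller decoder; keys not listed here would take the
# default values shown in default_hparams().
overrides = {
    "num_blocks": 4,
    "dim": 256,
    "multihead_attention": {
        "num_units": 256,
        "output_dim": 256,
        "num_heads": 8,
    },
}
check_decoder_hparams(overrides)  # passes silently
```

Changing "dim" without also changing "output_dim" makes the check raise ValueError.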

forward(inputs=None, sequence_length=None, memory=None, memory_sequence_length=None, memory_attention_bias=None, context=None, context_sequence_length=None, helper=None, decoding_strategy='train_greedy', max_decoding_length=None, impute_finished=False, infer_mode=None, beam_width=None, length_penalty=0.0, **kwargs)[source]

Performs decoding.

The interface is very similar to that of RNN decoders (RNNDecoderBase). In particular, the function provides 3 ways to specify the decoding method, with varying flexibility:

  1. The decoding_strategy argument.

    • “train_greedy”: decoding in teacher-forcing fashion (i.e., feeding the ground truth to decode the next step); at each step the sample is obtained by taking the argmax of the logits. Argument inputs is required for this strategy. sequence_length is optional.

    • “infer_greedy”: decoding in inference fashion (i.e., feeding the generated sample to decode the next step); at each step the sample is obtained by taking the argmax of the logits. Arguments (start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.

    • “infer_sample”: decoding in inference fashion; at each step the sample is obtained by random sampling from the logits. Arguments (start_tokens, end_token) are required for this strategy, and argument max_decoding_length is optional.

This argument is used only when arguments helper and beam_width are both None.

  2. The helper argument: An instance of a subclass of Helper. This provides a superset of the decoding strategies above. The interface is the same as in RNN decoders. Please refer to texar.torch.modules.RNNDecoderBase.forward() for detailed usage and examples.

    Note that although using a TrainingHelper corresponds to the "train_greedy" strategy above, the implementation is slower than directly setting decoding_strategy="train_greedy" (the output results are the same).

    Argument max_decoding_length is optional.

  3. Beam search: set beam_width to use beam search decoding. Arguments (start_tokens, end_token) are required, and argument max_decoding_length is optional.
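The difference between teacher-forcing and inference-style greedy decoding can be sketched with a toy model. Everything below is illustrative and not Texar code: the LOGITS table stands in for a decoder whose logits depend only on the previous token, and the function names simply mirror the strategy names.

```python
# Toy next-token scores over a vocabulary of 4 tokens, keyed by the
# previous token. Token 3 plays the role of end_token.
LOGITS = {
    0: [0.1, 2.0, 0.3, 0.0],  # after token 0, token 1 scores highest
    1: [0.0, 0.2, 1.5, 0.1],  # after token 1, token 2 scores highest
    2: [0.0, 0.1, 0.2, 3.0],  # after token 2, token 3 scores highest
    3: [1.0, 0.0, 0.0, 0.0],
}


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def train_greedy(inputs):
    """Teacher forcing: every step is fed the ground-truth token."""
    return [argmax(LOGITS[token]) for token in inputs]


def infer_greedy(start_token, end_token, max_decoding_length):
    """Inference: every step is fed the previously generated token."""
    samples, previous = [], start_token
    for _ in range(max_decoding_length):
        previous = argmax(LOGITS[previous])
        samples.append(previous)
        if previous == end_token:
            break
    return samples


print(train_greedy([0, 2, 1]))  # [1, 3, 2]
print(infer_greedy(0, end_token=3, max_decoding_length=10))  # [1, 2, 3]
```

In the teacher-forcing case the number of steps is fixed by the inputs, which is why max_decoding_length only matters for the inference strategies.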

Parameters
  • memory (optional) – The memory to attend, e.g., the output of an RNN encoder. A torch.Tensor of shape [batch_size, memory_max_time, dim].

  • memory_sequence_length (optional) – A torch.Tensor of shape [batch_size] containing the sequence lengths for the batch entries in memory. Used to create the attention bias if memory_attention_bias is not given. Ignored if memory_attention_bias is provided.

  • memory_attention_bias (optional) – A torch.Tensor of shape [batch_size, num_heads, memory_max_time, dim]. An attention bias typically sets the value of a padding position to a large negative value for masking. If not given, memory_sequence_length is used to automatically create an attention bias.

  • inputs (optional) –

    Input tensors for teacher forcing decoding. Used when decoding_strategy is set to "train_greedy", or when hparams-configured helper is used.

    inputs is a torch.LongTensor used as index to look up embeddings and feed in the decoder. For example, if embedder is an instance of WordEmbedder, then inputs is usually a 2D int Tensor [batch_size, max_time] (or [max_time, batch_size] if input_time_major == True) containing the token indexes.

  • sequence_length (optional) – A torch.LongTensor of shape [batch_size], containing the sequence length of inputs. Tokens beyond the respective sequence length are masked out. Used when decoding_strategy is set to "train_greedy".

  • decoding_strategy (str) – A string specifying the decoding strategy, including "train_greedy", "infer_greedy", "infer_sample". Different arguments are required based on the strategy. See above for details. Ignored if beam_width or helper is set.

  • beam_width (int) – Set to use beam search. If given, decoding_strategy is ignored.

  • length_penalty (float) – Length penalty coefficient used in beam search decoding. Refer to https://arxiv.org/abs/1609.08144 for more details. It should be larger if longer sentences are desired.

  • context (optional) – A torch.LongTensor of shape [batch_size, length], containing the starting tokens for decoding. If context is set, start_tokens of the Helper will be ignored.

  • context_sequence_length (optional) – Specify the length of context.

  • max_decoding_length (int, optional) – The maximum allowed number of decoding steps. If None (default), use "max_decoding_length" defined in hparams. Ignored in "train_greedy" decoding.

  • impute_finished (bool) – If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished. Ignored in "train_greedy" decoding.

  • helper (optional) – An instance of Helper that defines the decoding strategy. If given, decoding_strategy and helper configurations in hparams are ignored.

  • infer_mode (optional) – If not None, overrides mode given by self.training.

  • **kwargs (optional, dict) –

    Other keyword arguments. Typically ones such as:

    • start_tokens: A torch.LongTensor of shape [batch_size], the start tokens. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when beam_width is set. Ignored when context is set.

      When used with the Texar data module, to get batch_size samples where batch_size is changing according to the data module, this can be set as start_tokens=torch.full_like(batch['length'], bos_token_id).

    • end_token: An integer or 0D torch.LongTensor, the token that marks the end of decoding. Used when decoding_strategy is "infer_greedy" or "infer_sample", or when beam_width is set.
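As described for memory_sequence_length and memory_attention_bias above, an attention bias masks padding positions with a large negative value. The sketch below shows the idea with nested lists. It is an assumption-laden illustration: the function name and the constant -1e18 are chosen for this example, and the result has the simplified shape [batch_size, max_time] rather than the full bias shape given above.

```python
def padding_attention_bias(sequence_lengths, max_time, neg_value=-1e18):
    """Return a [batch_size, max_time] nested list holding 0.0 at valid
    positions and a large negative value at padding positions."""
    return [
        [0.0 if t < length else neg_value for t in range(max_time)]
        for length in sequence_lengths
    ]


# Two memory sequences of lengths 3 and 1, padded to 4 time steps.
bias = padding_attention_bias([3, 1], max_time=4)
for row in bias:
    print(row)
```

Adding such a bias to the attention scores before the softmax drives the weight of padded positions to effectively zero.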

Returns

  • For “train_greedy” decoding, returns an instance of TransformerDecoderOutput which contains sample_id and logits.

  • For “infer_greedy” and “infer_sample” decoding or decoding with helper, returns a tuple (outputs, sequence_lengths), where outputs is an instance of TransformerDecoderOutput as in “train_greedy”, and sequence_lengths is a torch.LongTensor of shape [batch_size] containing the length of each sample.

  • For beam search decoding, returns a dict containing keys "sample_id" and "log_prob".

    • "sample_id" is a torch.LongTensor of shape [batch_size, max_time, beam_width] containing generated token indexes. sample_id[:,:,0] is the most probable sample.

    • "log_prob" is a torch.Tensor of shape [batch_size, beam_width] containing the log probability of each sequence sample.
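For beam search, the paper referenced under length_penalty (https://arxiv.org/abs/1609.08144) normalizes a hypothesis score by lp(Y) = ((5 + |Y|) / 6) ** alpha, where |Y| is the hypothesis length and alpha is the penalty coefficient. The sketch below illustrates that formula only; whether Texar applies it in exactly this form is an assumption, and the function names are made up for the example.

```python
def gnmt_length_penalty(length, alpha):
    """Length normalization term from the referenced paper."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob, length, alpha):
    """Divide a (negative) log probability by the length penalty."""
    return log_prob / gnmt_length_penalty(length, alpha)


# A short hypothesis with a higher log probability versus a longer one.
for alpha in (0.0, 0.6):
    short = rescore(-4.0, length=4, alpha=alpha)
    longer = rescore(-5.0, length=10, alpha=alpha)
    print(alpha, "prefers", "longer" if longer > short else "short")
```

With alpha set to 0.0 the raw log probabilities are compared and the short hypothesis wins; with alpha set to 0.6 the longer hypothesis scores higher, which matches the note that the coefficient should be larger when longer sentences are desired.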

property output_size

Output size of one step.

initialize(helper, inputs, sequence_length, initial_state)[source]

Called before any decoding iterations.

This method must compute initial input values and initial state.

Parameters
  • helper – The Helper instance to use.

  • inputs (optional) – A (structure of) input tensors.

  • sequence_length (optional) – A torch.LongTensor representing lengths of each sequence.

  • initial_state – A possibly nested structure of tensors indicating the initial decoder state.

Returns

A tuple (finished, initial_inputs, initial_state) representing initial values of finished flags, inputs, and state.

step(helper, time, inputs, state)[source]

Compute the output and the state at the current time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • inputs – Inputs for this time step.

  • state – Decoder state from the previous time step.

Returns

A tuple (outputs, next_state).

  • outputs is an object containing the decoder output.

  • next_state is the decoder state for the next time step.

next_inputs(helper, time, outputs)[source]

Compute the input for the next time step. Called per step of decoding (but only once for dynamic decoding).

Parameters
  • helper – The Helper instance to use.

  • time (int) – Current step number.

  • outputs – An object containing the decoder output.

Returns

A tuple (next_inputs, finished).

  • next_inputs is the tensor that should be used as input for the next step.

  • finished is a torch.ByteTensor indicating, for each sequence in the batch, whether the sequence is complete.

finalize(outputs, final_state, sequence_lengths)[source]

Called after all decoding iterations have finished.

Parameters
  • outputs – Outputs at each time step.

  • final_state – The RNNCell state after the last time step.

  • sequence_lengths – Sequence lengths for each sequence in batch.

Returns

A tuple (outputs, final_state).

  • outputs is an object containing the decoder output.

  • final_state is the final decoder state.
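The four methods above (initialize, step, next_inputs, finalize) fit together as a decoding loop. The sketch below shows one plausible way to compose them for a single sequence. CountingDecoder and run_decoding are toy stand-ins invented for this illustration; the actual loop inside Texar may differ, and the helper argument is passed as None because the toy decoder does not use it.

```python
class CountingDecoder:
    """Toy stand-in (not a Texar class) whose 'model' emits input + 1."""

    def __init__(self, end_token):
        self.end_token = end_token

    def initialize(self, helper, inputs, sequence_length, initial_state):
        finished = False
        return finished, inputs, initial_state

    def step(self, helper, time, inputs, state):
        outputs = inputs + 1
        return outputs, state + 1  # the state counts the steps taken

    def next_inputs(self, helper, time, outputs):
        # Feed the output back in; stop once end_token is produced.
        return outputs, outputs == self.end_token

    def finalize(self, outputs, final_state, sequence_lengths):
        return outputs, final_state


def run_decoding(decoder, start_token, max_decoding_length):
    finished, inputs, state = decoder.initialize(None, start_token, None, 0)
    all_outputs, time = [], 0
    while not finished and time < max_decoding_length:
        outputs, state = decoder.step(None, time, inputs, state)
        inputs, finished = decoder.next_inputs(None, time, outputs)
        all_outputs.append(outputs)
        time += 1
    return decoder.finalize(all_outputs, state, [len(all_outputs)])


outputs, final_state = run_decoding(CountingDecoder(end_token=3), 0, 10)
print(outputs, final_state)  # [1, 2, 3] 3
```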

TransformerDecoderOutput

class texar.torch.modules.TransformerDecoderOutput(logits, sample_id)[source]

The output of TransformerDecoder.

property logits

A torch.Tensor of shape [batch_size, max_time, vocab_size] containing the logits.

property sample_id

A torch.LongTensor of shape [batch_size, max_time] (or [batch_size, max_time, vocab_size]) containing the sampled token indices. Note that the shape of sample_id is different for different decoding strategy or helper. Please refer to Helper for the detailed information.

Helper

class texar.torch.modules.Helper(*args, **kwds)[source]

Interface for implementing sampling in seq2seq decoders.

Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.Helper.

initialize(embedding_fn, inputs, sequence_length)[source]

Initialize the current batch.

Parameters
  • embedding_fn – A function taking input tokens and timestamps, returning embedding tensors.

  • inputs – Input tensors.

  • sequence_length – An int32 vector tensor.

Returns

(initial_finished, initial_inputs).

sample(time, outputs)[source]

Returns sample_ids.

next_inputs(embedding_fn, time, outputs, sample_ids)[source]

Returns (finished, next_inputs, next_state).

TrainingHelper

class texar.torch.modules.TrainingHelper(*args, **kwds)[source]

A helper for use during training. Only reads inputs.

Returned sample_ids are the argmax of the RNN output logits.

Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.TrainingHelper.

Parameters

time_major (bool) – Whether the tensors in inputs are time major. If False (default), they are assumed to be batch major.

EmbeddingHelper

class texar.torch.modules.EmbeddingHelper(*args, **kwds)[source]

A generic helper for use during inference.

Uses output logits for sampling, and passes the result through an embedding layer to get the next input.

Parameters
  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

Raises

ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.

GreedyEmbeddingHelper

class texar.torch.modules.GreedyEmbeddingHelper(*args, **kwds)[source]

A helper for use during inference.

Uses the argmax of the output (treated as logits) and passes the result through an embedding layer to get the next input.

Note that for greedy decoding, Texar’s decoders provide a simpler interface by specifying decoding_strategy='infer_greedy' when calling a decoder (see, e.g., RNN decoder). In this case, use of GreedyEmbeddingHelper is not necessary.

Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.GreedyEmbeddingHelper.

Parameters
  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

Raises

ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.

SampleEmbeddingHelper

class texar.torch.modules.SampleEmbeddingHelper(*args, **kwds)[source]

A helper for use during inference.

Uses sampling (from a distribution) instead of argmax and passes the result through an embedding layer to get the next input.

Please refer to the documentation for the TensorFlow counterpart tf.contrib.seq2seq.SampleEmbeddingHelper.

Parameters
  • embedding – A callable or the params argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor of ids (argmax ids), or take two arguments (ids, times), where ids is a vector of argmax ids, and times is a vector of current time steps (i.e., position ids). The latter case can be used when embedding is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.

  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

  • softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.

Raises

ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.
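The effect of softmax_temperature can be seen in a few lines of plain Python. This is a sketch of the general technique (divide the logits by the temperature, apply softmax, then sample), not the helper's actual implementation; the function names are chosen for the example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("softmax_temperature must be strictly greater than 0")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.5)  # closer to the argmax
flat = softmax(logits, temperature=2.0)   # closer to uniform
print(sharp, flat)
print(sample_id(logits, temperature=1.0, rng=random.Random(0)))
```

A temperature below 1.0 concentrates probability on the highest logit, while a temperature above 1.0 spreads it across the vocabulary.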

TopKSampleEmbeddingHelper

class texar.torch.modules.TopKSampleEmbeddingHelper(*args, **kwds)[source]

A helper for use during inference.

Samples from top_k most likely candidates from a vocab distribution, and passes the result through an embedding layer to get the next input.

Parameters
  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

  • top_k (int, optional) – Number of top candidates to sample from. Must be >=0. If set to 0, samples from all candidates (i.e., regular random sample decoding). Defaults to 10.

  • softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.

Raises

ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.

sample(time, outputs)[source]

Returns sample_ids.
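Top-k sampling, as described for this helper, can be sketched as follows. This illustrates the general technique under stated assumptions and is not the helper's code: the function names are invented, the temperature is applied before filtering, and ties at the threshold keep every tied token.

```python
import math
import random


def top_k_filter(logits, top_k):
    """Keep the top_k largest logits and set the rest to -inf.
    top_k == 0 keeps everything (regular random sampling)."""
    if top_k < 0:
        raise ValueError("top_k must be >= 0")
    if top_k == 0 or top_k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def sample_top_k(logits, top_k, temperature=1.0, rng=random):
    kept = top_k_filter([x / temperature for x in logits], top_k)
    highest = max(kept)
    # exp(-inf) evaluates to 0.0, so filtered tokens get zero weight.
    weights = [math.exp(x - highest) for x in kept]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0, 0.5]
print(top_k_filter(logits, top_k=2))  # [-inf, 3.0, 2.0, -inf]
print(sample_top_k(logits, top_k=2, rng=random.Random(0)))
```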

TopPSampleEmbeddingHelper

class texar.torch.modules.TopPSampleEmbeddingHelper(*args, **kwds)[source]

A helper for use during inference.

Samples from candidates that have a cumulative probability of at most p when arranged in decreasing order, and passes the result through an embedding layer to get the next input. This is also known as “Nucleus Sampling”, as proposed in the paper “The Curious Case of Neural Text Degeneration” (Holtzman et al.).

Parameters
  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

  • p (float, optional) – A value used to filter out tokens whose cumulative probability is greater than p when arranged in decreasing order of probabilities. Must be in the range [0, 1.0]. If set to 1, samples from all candidates (i.e., regular random sample decoding). Defaults to 0.5.

  • softmax_temperature (float, optional) – Value to divide the logits by before computing the softmax. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to 1.0.

Raises

ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.

sample(time, outputs)[source]

Returns sample_ids.
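The nucleus (top-p) filter can be sketched in plain Python. This is one common variant, written under assumptions: tokens are taken in decreasing order of probability until the accumulated mass first reaches p, so the token that crosses the threshold is kept and at least one token always survives. How the helper treats that boundary token is not stated here, and the function names are made up for the example.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(logits, p):
    """Return the indices of the tokens kept by the nucleus filter."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break
    return kept


# Logits chosen so the probabilities are roughly 0.5, 0.3, 0.15, 0.05.
logits = [math.log(q) for q in (0.5, 0.3, 0.15, 0.05)]
print(top_p_keep(logits, p=0.7))  # [0, 1]
print(top_p_keep(logits, p=0.9))  # [0, 1, 2]
```

Sampling then proceeds over the kept indices only, with their probabilities renormalized.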

SoftmaxEmbeddingHelper

class texar.torch.modules.SoftmaxEmbeddingHelper(*args, **kwds)[source]

A helper that feeds softmax probabilities over vocabulary to the next step.

Uses the softmax probability vector to pass through word embeddings to get the next input (i.e., a mixed word embedding).

A subclass of Helper. Used as a helper to RNNDecoderBase in inference mode.

Parameters
  • embedding – A callable or the params argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor of ids (argmax ids), or take two arguments (ids, times), where ids is a vector of argmax ids, and times is a vector of current time steps (i.e., position ids). The latter case can be used when embedding is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.

  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

  • tau – A float scalar tensor, the softmax temperature.

  • stop_gradient (bool) – Whether to stop the gradient backpropagation when feeding softmax vector to the next step.

  • use_finish (bool) – Whether to stop decoding once end_token is generated. If False, decoding will continue until max_decoding_length of the decoder is reached.

Raises

ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.

sample(time, outputs)[source]

Returns sample_id which is softmax distributions over vocabulary with temperature tau. Shape = [batch_size, vocab_size].
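The "mixed word embedding" fed to the next step is the probability-weighted sum of the embedding rows. The sketch below uses nested lists and a toy embedding table; the names are illustrative and not part of Texar.

```python
import math


def softmax(logits, tau):
    scaled = [x / tau for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mix_embeddings(probs, embedding):
    """Weighted sum of embedding rows, i.e. probs @ embedding."""
    dim = len(embedding[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding))
        for d in range(dim)
    ]


# Toy embedding table of shape [vocab_size=3, dim=2].
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = softmax([2.0, 0.0, 0.0], tau=1.0)
next_input = mix_embeddings(probs, embedding)
print(probs, next_input)
```

As tau approaches 0 the distribution approaches a one-hot vector, and the mixed embedding approaches the embedding of the argmax token.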

GumbelSoftmaxEmbeddingHelper

class texar.torch.modules.GumbelSoftmaxEmbeddingHelper(*args, **kwds)[source]

A helper that feeds Gumbel softmax sample to the next step.

Uses the Gumbel softmax vector to pass through word embeddings to get the next input (i.e., a mixed word embedding).

A subclass of Helper. Used as a helper to RNNDecoderBase in inference mode.

Same as SoftmaxEmbeddingHelper except that here Gumbel softmax (instead of softmax) is used.

Parameters
  • embedding – A callable or the params argument for torch.nn.functional.embedding. If a callable, it can take a vector tensor of ids (argmax ids), or take two arguments (ids, times), where ids is a vector of argmax ids, and times is a vector of current time steps (i.e., position ids). The latter case can be used when embedding is a combination of word embedding and position embedding. The returned tensor will be passed to the decoder input.

  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

  • tau – A float scalar tensor, the softmax temperature.

  • straight_through (bool) – Whether to use straight through gradient between time steps. If True, a single token with highest probability (i.e., greedy sample) is fed to the next step and gradient is computed using straight through. If False (default), the soft Gumbel-softmax distribution is fed to the next step.

  • stop_gradient (bool) – Whether to stop the gradient backpropagation when feeding softmax vector to the next step.

  • use_finish (bool) – Whether to stop decoding once end_token is generated. If False, decoding will continue until max_decoding_length of the decoder is reached.

Raises

ValueError – if start_tokens is not a 1D tensor or end_token is not a scalar.

sample(time, outputs)[source]

Returns sample_id of shape [batch_size, vocab_size]. If straight_through is False, this contains the Gumbel softmax distributions over vocabulary with temperature tau. If straight_through is True, this contains one-hot vectors of the greedy samples.
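A Gumbel softmax sample for one sequence can be sketched as follows: add Gumbel(0, 1) noise to the logits, divide by tau, and apply softmax. This is a plain-Python illustration of the forward computation only. The function name is invented, and the straight-through gradient cannot be shown without an autograd framework, so the straight_through branch simply returns the one-hot vector.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, straight_through=False, rng=random):
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        # u is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((x - math.log(-math.log(u))) / tau)
    highest = max(noisy)
    exps = [math.exp(v - highest) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if straight_through:
        best = max(range(len(soft)), key=soft.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(soft))]
    return soft


logits = [2.0, 1.0, 0.1]
soft = gumbel_softmax_sample(logits, tau=0.5, rng=random.Random(0))
hard = gumbel_softmax_sample(
    logits, tau=0.5, straight_through=True, rng=random.Random(0)
)
print(soft, hard)
```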

get_helper

texar.torch.modules.get_helper(helper_type, start_tokens=None, end_token=None, **kwargs)[source]

Creates a Helper instance.

Parameters
  • helper_type – A Helper class, its name or module path, or a class instance. If a class instance is given, it is returned directly.

  • start_tokens – 1D torch.LongTensor shaped [batch_size], representing the start tokens for each sequence in batch.

  • end_token – Python int or scalar torch.LongTensor, denoting the token that marks end of decoding.

  • **kwargs – Additional keyword arguments for constructing the helper.

Returns

A helper instance.

Classifiers

BERTClassifier

class texar.torch.modules.BERTClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Classifier based on BERT modules. Please see PretrainedBERTMixin for a brief description of BERT.

This is a combination of the BERTEncoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.

Arguments are the same as in BERTEncoder.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Same hyperparameters as in BERTEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "bert_classifier"
}

Here:

  1. Same hyperparameters as in BERTEncoder. See the default_hparams(). An instance of BERTEncoder is created for feature extraction.

  2. Additional hyperparameters:

    “num_classes”: int

    Number of classes:

    • If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.

    • If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.

    “logit_layer_kwargs”: dict

    Keyword arguments for the logit Dense layer constructor, except for argument “units” which is set to num_classes. Ignored if no extra logit layer is appended.

    “clas_strategy”: str

    The classification strategy, one of:

    • cls_time: Sequence-level classification based on the output of the first time step (which is the CLS token). Each sequence has a class.

    • all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.

    • time_wise: Step-wise classification, i.e., make classification for each time step based on its output.

    “max_seq_length”: int, optional

    Maximum possible length of input sequences. Required if clas_strategy is all_time.

    “dropout”: float

    The dropout rate of the BERT encoder output.

    “name”: str

    Name of the classifier.
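The constraints on the classifier hyperparameters described above can be validated up front. The sketch below works on a plain dict; check_classifier_hparams is a hypothetical helper for illustration and is not provided by Texar.

```python
VALID_STRATEGIES = ("cls_time", "all_time", "time_wise")


def check_classifier_hparams(hparams):
    """Hypothetical helper (not a Texar function)."""
    strategy = hparams.get("clas_strategy", "cls_time")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown clas_strategy: {strategy!r}")
    if strategy == "all_time" and hparams.get("max_seq_length") is None:
        raise ValueError(
            "max_seq_length is required when clas_strategy is 'all_time'"
        )


# Sequence-level classification over all time steps with 3 classes.
hparams = {
    "num_classes": 3,
    "clas_strategy": "all_time",
    "max_seq_length": 128,
}
check_classifier_hparams(hparams)  # passes silently
```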

forward(inputs, sequence_length=None, segment_ids=None)[source]

Feeds the inputs through the network and makes classification.

The arguments are the same as in BERTEncoder.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

  • segment_ids (optional) – A 2D Tensor of shape [batch_size, max_time], containing the segment ids of tokens in input sequences. If None (default), a tensor with all elements set to zero is used.

Returns

A tuple (logits, preds), containing the logits over classes and the predictions, respectively.

  • If clas_strategy is cls_time or all_time:

    • If num_classes == 1, logits and preds are both of shape [batch_size].

    • If num_classes > 1, logits is of shape [batch_size, num_classes] and preds is of shape [batch_size].

  • If clas_strategy is time_wise:

    • If num_classes == 1, logits and preds are both of shape [batch_size, max_time].

    • If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and preds is of shape [batch_size, max_time].
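The shapes above follow from how predictions are usually derived from logits. The sketch below assumes a common convention that this page does not confirm: with a single class the prediction is 1 when the logit is positive, and with several classes it is the argmax over the class dimension. The predict function is illustrative only.

```python
def predict(logits_entry, num_classes):
    """logits_entry is a float when num_classes == 1, otherwise a list
    of num_classes floats. Returns an integer prediction."""
    if num_classes == 1:
        return int(logits_entry > 0)
    return max(range(num_classes), key=logits_entry.__getitem__)


# Sequence-level case with batch_size == 2.
binary_logits = [-1.2, 0.7]                             # [batch_size]
multi_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]      # [batch_size, 3]

binary_preds = [predict(x, 1) for x in binary_logits]
multi_preds = [predict(x, 3) for x in multi_logits]
print(binary_preds, multi_preds)  # [0, 1] [1, 0]
```

In both cases the predictions have one entry per sequence; for time_wise classification the same rule is applied at every time step.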

property output_size

The feature size of forward() output logits. If logits size is only determined by input (i.e. if num_classes == 1), the feature size is equal to -1. Otherwise it is equal to last dimension value of logits size.

RoBERTaClassifier

class texar.torch.modules.RoBERTaClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Classifier based on RoBERTa modules. Please see PretrainedRoBERTaMixin for a brief description of RoBERTa.

This is a combination of the RoBERTaEncoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.

Arguments are the same as in RoBERTaEncoder.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., roberta-base). Please refer to PretrainedRoBERTaMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Same hyperparameters as in RoBERTaEncoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "roberta_classifier"
}

Here:

  1. Same hyperparameters as in RoBERTaEncoder. See the default_hparams(). An instance of RoBERTaEncoder is created for feature extraction.

  2. Additional hyperparameters:

    “num_classes”: int

    Number of classes:

    • If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.

    • If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.

    “logit_layer_kwargs”: dict

    Keyword arguments for the logit Dense layer constructor, except for argument “units” which is set to num_classes. Ignored if no extra logit layer is appended.

    “clas_strategy”: str

    The classification strategy, one of:

    • cls_time: Sequence-level classification based on the output of the first time step (which is the CLS token). Each sequence has a class.

    • all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.

    • time_wise: Step-wise classification, i.e., make classification for each time step based on its output.

    “max_seq_length”: int, optional

    Maximum possible length of input sequences. Required if clas_strategy is all_time.

    “dropout”: float

    The dropout rate of the RoBERTa encoder output.

    “name”: str

    Name of the classifier.

forward(inputs, sequence_length=None)[source]

Feeds the inputs through the network and makes classification.

The arguments are the same as in RoBERTaEncoder.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

Returns

A tuple (logits, preds), containing the logits over classes and the predictions, respectively.

  • If clas_strategy is cls_time or all_time:

    • If num_classes == 1, logits and pred are both of shape [batch_size].

    • If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].

  • If clas_strategy is time_wise:

    • If num_classes == 1, logits and pred are both of shape [batch_size, max_time].

    • If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].
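The shape rules above can be summarized in a small helper. This is a plain-Python sketch of the documented rules only and is not part of Texar; the function name expected_shapes is hypothetical, and the case num_classes <= 0 is deliberately left out because its output size depends on the encoder.

```python
def expected_shapes(clas_strategy, num_classes, batch_size, max_time):
    """Return (logits_shape, preds_shape) following the rules listed above."""
    if num_classes < 1:
        raise ValueError("this sketch only covers num_classes >= 1")
    if clas_strategy in ("cls_time", "all_time"):
        base = [batch_size]            # one prediction per sequence
    elif clas_strategy == "time_wise":
        base = [batch_size, max_time]  # one prediction per time step
    else:
        raise ValueError(f"unknown clas_strategy: {clas_strategy!r}")
    # With a single class, logits carry no trailing class dimension.
    logits = list(base) if num_classes == 1 else base + [num_classes]
    return logits, list(base)


print(expected_shapes("cls_time", 2, 8, 16))   # ([8, 2], [8])
print(expected_shapes("time_wise", 1, 8, 16))  # ([8, 16], [8, 16])
```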

GPT2Classifier

class texar.torch.modules.GPT2Classifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Classifier based on GPT2 modules. Please see PretrainedGPT2Mixin for a brief description of GPT2.

This is a combination of the GPT2Encoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.

Arguments are the same as in GPT2Encoder.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., gpt2-small). Please refer to PretrainedGPT2Mixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Same hyperparameters as in GPT2Encoder
    ...
    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "gpt2_classifier"
}

Here:

  1. Same hyperparameters as in GPT2Encoder. See the default_hparams(). An instance of GPT2Encoder is created for feature extraction.

  2. Additional hyperparameters:

    “num_classes”: int

    Number of classes:

    • If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.

    • If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.

    “logit_layer_kwargs”: dict

    Keyword arguments for the logit Linear layer constructor, except for the argument out_features, which is set to num_classes. Ignored if no extra logit layer is appended.

    “clas_strategy”: str

    The classification strategy, one of:

    • cls_time: Sequence-level classification based on the output of the last time step. Each sequence has a class.

    • all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.

    • time_wise: Step-wise classification, i.e., make classification for each time step based on its output.

    “max_seq_length”: int, optional

    Maximum possible length of input sequences. Required if clas_strategy is all_time.

    “dropout”: float

    The dropout rate of the GPT2 encoder output.

    “name”: str

    Name of the classifier.
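As a rough illustration of how user-supplied hyperparameters combine with the defaults, the sketch below fills missing keys from a defaults dict and recurses into nested dicts. It is a simplified stand-in written for this note, not Texar's HParams implementation; merge_hparams is a hypothetical helper, and DEFAULTS holds only the additional classifier entries listed above.

```python
import copy

# Only the "additional" classifier hyperparameters documented above.
DEFAULTS = {
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "cls_time",
    "max_seq_length": None,
    "dropout": 0.1,
    "name": "gpt2_classifier",
}


def merge_hparams(user, defaults):
    """Return defaults overridden by user values (hypothetical helper)."""
    merged = copy.deepcopy(defaults)
    for key, value in (user or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts are merged key by key rather than replaced.
            merged[key] = merge_hparams(value, merged[key])
        else:
            merged[key] = value
    return merged


hparams = merge_hparams({"num_classes": 5, "clas_strategy": "time_wise"}, DEFAULTS)
print(hparams["num_classes"], hparams["dropout"])  # 5 0.1
```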

forward(inputs, sequence_length=None)[source]

Feeds the inputs through the network and makes classification.

The arguments are the same as in GPT2Encoder.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • sequence_length (optional) – A 1D Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

Returns

A tuple (logits, preds), containing the logits over classes and the predictions, respectively.

  • If clas_strategy is cls_time or all_time:

    • If num_classes == 1, logits and pred are both of shape [batch_size].

    • If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].

  • If clas_strategy is time_wise:

    • If num_classes == 1, logits and pred are both of shape [batch_size, max_time].

    • If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].

property output_size

The feature size of forward() output logits. If logits size is only determined by input (i.e. if num_classes == 1), the feature size is equal to -1. Otherwise, it is equal to the last dimension of the logits size.

UnidirectionalRNNClassifier

class texar.torch.modules.UnidirectionalRNNClassifier(input_size, cell=None, output_layer=None, hparams=None)[source]

One directional RNN classifier. This is a combination of the UnidirectionalRNNEncoder with a classification layer. Both step-wise classification and sequence-level classification are supported, specified in hparams.

Arguments are the same as in UnidirectionalRNNEncoder.

Parameters
  • input_size (int) – The number of expected features in the input for the cell.

  • cell (RNNCell, optional) – If not specified, a cell is created as specified in hparams["rnn_cell"].

  • output_layer (optional) – An instance of torch.nn.Module. Applies to the RNN cell output of each step. If None (default), the output layer is created as specified in hparams["output_layer"].

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Same hyperparameters as in UnidirectionalRNNEncoder
    ...

    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": None,
    "clas_strategy": "final_time",
    "max_seq_length": None,
    "name": "unidirectional_rnn_classifier"
}

Here:

  1. Same hyperparameters as in UnidirectionalRNNEncoder. See the default_hparams(). An instance of UnidirectionalRNNEncoder is created for feature extraction.

  2. Additional hyperparameters:

    “num_classes”: int

    Number of classes:

    • If > 0, an additional Linear layer is appended to the encoder to compute the logits over classes.

    • If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.

    “logit_layer_kwargs”: dict

    Keyword arguments for the logit Linear layer constructor, except for the argument out_features, which is set to num_classes. Ignored if no extra logit layer is appended.

    “clas_strategy”: str

    The classification strategy, one of:

    • final_time: Sequence-level classification based on the output of the final time step. Each sequence has a class.

    • all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.

    • time_wise: Step-wise classification, i.e., make classification for each time step based on its output.

    “max_seq_length”: int, optional

    Maximum possible length of input sequences. Required if clas_strategy is all_time.

    “name”: str

    Name of the classifier.
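The constraints described above (three allowed strategies, with max_seq_length required for all_time) can be checked before a classifier is built. A hedged sketch: check_clas_hparams is a hypothetical helper, and Texar's own validation may differ in both timing and error type.

```python
ALLOWED_STRATEGIES = ("final_time", "all_time", "time_wise")


def check_clas_hparams(hparams):
    """Validate the classification-strategy settings (hypothetical helper)."""
    strategy = hparams.get("clas_strategy", "final_time")
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unknown clas_strategy: {strategy!r}")
    # all_time classifies from every time step, so the maximum length must be known.
    if strategy == "all_time" and hparams.get("max_seq_length") is None:
        raise ValueError("max_seq_length is required when clas_strategy is all_time")
    return strategy


print(check_clas_hparams({"clas_strategy": "all_time", "max_seq_length": 128}))
```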

forward(inputs, sequence_length=None, initial_state=None, time_major=False)[source]

Feeds the inputs through the network and makes classification.

The arguments are the same as in UnidirectionalRNNEncoder.

Parameters
  • inputs – A 3D Tensor of shape [batch_size, max_time, dim]. The first two dimensions batch_size and max_time are exchanged if time_major is True.

  • sequence_length (optional) – A 1D torch.LongTensor of shape [batch_size]. Sequence lengths of the batch inputs. Used to copy-through state and zero-out outputs when past a batch element’s sequence length.

  • initial_state (optional) – Initial state of the RNN.

  • time_major (bool) – The shape format of the inputs and outputs Tensors. If True, these tensors are of shape [max_time, batch_size, depth]. If False (default), these tensors are of shape [batch_size, max_time, depth].

Returns

A tuple (logits, preds), containing the logits over classes and the predictions, respectively.

  • If clas_strategy is final_time or all_time:

    • If num_classes == 1, logits and pred are both of shape [batch_size].

    • If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].

  • If clas_strategy is time_wise:

    • If num_classes == 1, logits and pred are both of shape [batch_size, max_time].

    • If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].

    • If time_major is True, the batch and time dimensions are exchanged.

property output_size

The feature size of forward() output logits. If logits size is only determined by input (i.e. if num_classes == 1), the feature size is equal to -1. Otherwise, it is equal to the last dimension of the logits size.

Conv1DClassifier

class texar.torch.modules.Conv1DClassifier(in_channels, in_features=None, hparams=None)[source]

Simple Conv-1D classifier. This is a combination of the Conv1DEncoder with a classification layer.

Parameters
  • in_channels (int) – Number of channels in the input tensor.

  • in_features (int) – Size of the feature dimension in the input tensor.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See forward() for the inputs and outputs. If "data_format" is set to "channels_first" (this is the default), inputs must be a tensor of shape [batch_size, channels, length]. If "data_format" is set to "channels_last", inputs must be a tensor of shape [batch_size, length, channels]. For example, for sequence classification, length corresponds to time steps, and channels corresponds to embedding dim.

Example:

import torch
from texar.torch.modules import Conv1DClassifier

# Channels-first input: [batch_size, channels, length]
inputs = torch.randn([64, 20, 256])

clas = Conv1DClassifier(in_channels=20, in_features=256,
                        hparams={'num_classes': 10})

logits, pred = clas(inputs)
# logits == Tensor of shape [64, 10]
# pred   == Tensor of shape [64]

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Same hyperparameters as in Conv1DEncoder
    ...

    # (2) Additional hyperparameters
    "num_classes": 2,
    "logit_layer_kwargs": {
        "use_bias": False
    },
    "name": "conv1d_classifier"
}

Here:

  1. Same hyperparameters as in Conv1DEncoder. See the default_hparams(). An instance of Conv1DEncoder is created for feature extraction.

  2. Additional hyperparameters:

    “num_classes”: int

    Number of classes:

    • If > 0, an additional torch.nn.Linear layer is appended to the encoder to compute the logits over classes.

    • If <= 0, no dense layer is appended. The number of classes is assumed to be equal to the out_features of the final dense layer of the encoder.

    “logit_layer_kwargs”: dict

    Keyword arguments for the logit torch.nn.Linear layer constructor, except for argument out_features which is set to "num_classes". Ignored if no extra logit layer is appended.

    “name”: str

    Name of the classifier.

forward(input, sequence_length=None, dtype=None, data_format=None)[source]

Feeds the inputs through the network and makes classification.

The arguments are the same as in Conv1DEncoder.

The predictions of binary classification (num_classes = 1) and multi-way classification (num_classes > 1) are different, as explained below.

Parameters
  • input – The inputs to the network, which is a 3D tensor. See Conv1DEncoder for more details.

  • sequence_length (optional) – An int tensor of shape [batch_size] or a python array containing the length of each element in inputs. If given, time steps beyond the length will first be masked out before feeding to the layers.

  • dtype (optional) – Type of the inputs. If not provided, infers from inputs automatically.

  • data_format (optional) – Data format of the input tensor. If channels_last, the last dimension is treated as the channel dimension, so the input should be of size [batch_size, X, channel]. If channels_first, the first dimension after the batch dimension is treated as the channel dimension, so the input should be of size [batch_size, channel, X]. Defaults to None, in which case the value is taken from the hyperparameters.
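To make the two layouts concrete, the following plain-Python sketch names the dimensions of a 3D input shape for each data_format value. The helper describe_input is hypothetical and only mirrors the description above; it does not call Texar.

```python
def describe_input(shape, data_format="channels_first"):
    """Name the three dimensions of an input shape (hypothetical helper)."""
    if len(shape) != 3:
        raise ValueError("expected a 3D shape")
    batch_size = shape[0]
    if data_format == "channels_first":
        channels, length = shape[1], shape[2]
    elif data_format == "channels_last":
        length, channels = shape[1], shape[2]
    else:
        raise ValueError(f"unknown data_format: {data_format!r}")
    return {"batch_size": batch_size, "channels": channels, "length": length}


print(describe_input([64, 20, 256]))
print(describe_input([64, 256, 20], data_format="channels_last"))
```

Both calls describe the same logical input: 64 examples, 20 channels, length 256.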

Returns

A tuple (logits, pred), where

  • logits is a torch.Tensor of shape [batch_size, num_classes] for num_classes > 1, and [batch_size] for num_classes = 1 (i.e., binary classification).

  • pred is the prediction, a torch.LongTensor of shape [batch_size]. For binary classification, the standard sigmoid function is used for prediction, and the class labels are {0, 1}.
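The difference between the binary and multi-way cases can be sketched in plain Python. This is an illustration under stated assumptions, not Texar's code: the 0.5 threshold on the sigmoid output and the use of argmax for the multi-way case are assumptions, and predict is a hypothetical helper operating on Python lists rather than tensors.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(logits, num_classes):
    """Turn logits into class labels (hypothetical helper)."""
    if num_classes == 1:
        # Binary case: logits is a list of floats, labels are in {0, 1}.
        # The 0.5 threshold is an assumption of this sketch.
        return [1 if sigmoid(z) > 0.5 else 0 for z in logits]
    # Multi-way case: logits is a list of rows; pick the largest entry per row.
    return [max(range(num_classes), key=row.__getitem__) for row in logits]


print(predict([-1.2, 0.3, 2.0], num_classes=1))                      # [0, 1, 1]
print(predict([[0.1, 2.0, -1.0], [3.0, 0.0, 1.0]], num_classes=3))  # [1, 0]
```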

property num_classes

The number of classes.

property encoder

The classifier neural network.

has_layer(layer_name)[source]

Returns True if a layer with the given name exists, and False otherwise.

Parameters

layer_name (str) – Name of the layer.

layer_by_name(layer_name)[source]

Returns the layer with the given name, or None if no layer has that name.

Parameters

layer_name (str) – Name of the layer.

property layers_by_name

A dictionary mapping layer names to the layers.

property layers

A list of the layers.

property layer_names

A list of uniquified layer names.
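The layer accessors above behave like a name-to-layer registry. The sketch below is a toy stand-in and not Texar's code: the class LayerRegistry, its add method, and the _1, _2 suffix scheme used to make duplicate names unique are all assumptions made for illustration.

```python
class LayerRegistry:
    """Toy registry mimicking the documented accessors (hypothetical class)."""

    def __init__(self):
        self._layers_by_name = {}

    def add(self, name, layer):
        # Assumed scheme: append _1, _2, ... until the name is unused.
        unique, i = name, 0
        while unique in self._layers_by_name:
            i += 1
            unique = f"{name}_{i}"
        self._layers_by_name[unique] = layer
        return unique

    def has_layer(self, layer_name):
        return layer_name in self._layers_by_name

    def layer_by_name(self, layer_name):
        # Returns None when the name is unknown.
        return self._layers_by_name.get(layer_name)

    @property
    def layer_names(self):
        return list(self._layers_by_name)

    @property
    def layers(self):
        return list(self._layers_by_name.values())


reg = LayerRegistry()
reg.add("conv", "layer A")
reg.add("conv", "layer B")
print(reg.layer_names)               # ['conv', 'conv_1']
print(reg.layer_by_name("missing"))  # None
```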

property output_size

The feature size of forward() output logits. If logits size is only determined by input (i.e. if num_classes == 1), the feature size is equal to -1. Otherwise, if num_classes > 1, it is equal to num_classes.

XLNetClassifier

class texar.torch.modules.XLNetClassifier(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Classifier based on XLNet modules. Please see PretrainedXLNetMixin for a brief description of XLNet.

Arguments are the same as in XLNetEncoder.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Same hyperparameters as in XLNetEncoder
    ...
    # (2) Additional hyperparameters
    "clas_strategy": "cls_time",
    "use_projection": True,
    "num_classes": 2,
    "name": "xlnet_classifier",
}

Here:

  1. Same hyperparameters as in XLNetEncoder. See the default_hparams(). An instance of XLNetEncoder is created for feature extraction.

  2. Additional hyperparameters:

    “clas_strategy”: str

    The classification strategy, one of:

    • cls_time: Sequence-level classification based on the output of the last time step (which is the CLS token). Each sequence has a class.

    • all_time: Sequence-level classification based on the output of all time steps. Each sequence has a class.

    • time_wise: Step-wise classification, i.e., make classification for each time step based on its output.

    “use_projection”: bool

    If True, an additional Linear layer is added after the summary step.

    “num_classes”: int

    Number of classes:

    • If > 0, an additional torch.nn.Linear layer is appended to the encoder to compute the logits over classes.

    • If <= 0, no dense layer is appended. The number of classes is assumed to be the final dense layer size of the encoder.

    “name”: str

    Name of the classifier.

param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]

Create parameter groups for optimizers. When lr_layer_scale is not 1.0, parameters from each layer form separate groups with different base learning rates.

The return value of this method can be used in the constructor of optimizers, for example:

model = XLNetClassifier(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)

Parameters
  • lr (float) – The learning rate. Can be omitted if lr_layer_scale is 1.0.

  • lr_layer_scale (float) – Per-layer LR scaling rate. The i-th layer will be scaled by lr_layer_scale ^ (num_layers - i - 1).

  • decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.

Returns

The parameter groups, used as the first argument for optimizers.
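The per-layer scaling rule quoted above, lr_layer_scale ^ (num_layers - i - 1), can be worked through numerically. This is a minimal sketch of the arithmetic only: layer_learning_rates is a hypothetical helper, layers are assumed to be indexed from 0 (which makes the exponent reach 0 at the top layer), and the layer count of 3 is chosen purely for readability.

```python
def layer_learning_rates(lr, lr_layer_scale, num_layers):
    """Learning rate for each layer i = 0 .. num_layers - 1 (hypothetical helper)."""
    return [lr * lr_layer_scale ** (num_layers - i - 1) for i in range(num_layers)]


rates = layer_learning_rates(lr=2e-5, lr_layer_scale=0.8, num_layers=3)
for i, rate in enumerate(rates):
    print(f"layer {i}: {rate:.3e}")
```

With a scale below 1.0, the top layer keeps the base learning rate and each layer beneath it trains with a progressively smaller one.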

forward(inputs, segment_ids=None, input_mask=None)[source]

Feeds the inputs through the network and makes classification.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • segment_ids – Shape [batch_size, max_time].

  • input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.
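Because input_mask uses 1 for positions that are masked out, it is the inverse of the more common "1 means keep" attention mask. The sketch below builds such a mask from padded token ids in plain Python; make_input_mask is a hypothetical helper, and the pad id of 0 and the token ids are made-up values for illustration.

```python
def make_input_mask(token_ids, pad_id):
    """Return a float mask with 1.0 at padding positions (hypothetical helper)."""
    return [[1.0 if tok == pad_id else 0.0 for tok in row] for row in token_ids]


# Made-up batch of two sequences; 0 is assumed to be the padding id here.
batch = [
    [0, 0, 17, 35, 4],
    [9, 21, 8, 35, 4],
]
for row in make_input_mask(batch, pad_id=0):
    print(row)
```

In practice the nested list would be converted to a float tensor of shape [batch_size, max_time] before being passed as input_mask.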

Returns

A tuple (logits, preds), containing the logits over classes and the predictions, respectively.

  • If clas_strategy is cls_time or all_time:

    • If num_classes == 1, logits and pred are both of shape [batch_size].

    • If num_classes > 1, logits is of shape [batch_size, num_classes] and pred is of shape [batch_size].

  • If clas_strategy is time_wise:

    • If num_classes == 1, logits and pred are both of shape [batch_size, max_time].

    • If num_classes > 1, logits is of shape [batch_size, max_time, num_classes] and pred is of shape [batch_size, max_time].

property output_size

The feature size of forward() output logits. If logits size is only determined by input (i.e. if num_classes == 1), the feature size is equal to -1. Otherwise, it is equal to the last dimension of the logits size.

Regressors

XLNetRegressor

class texar.torch.modules.XLNetRegressor(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Regressor based on XLNet modules. Please see PretrainedXLNetMixin for a brief description of XLNet.

Arguments are the same as in XLNetEncoder.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Same hyperparameters as in XLNetEncoder
    ...
    # (2) Additional hyperparameters
    "regr_strategy": "cls_time",
    "use_projection": True,
    "logit_layer_kwargs": None,
    "name": "xlnet_regressor",
}

Here:

  1. Same hyperparameters as in XLNetEncoder. See the default_hparams(). An instance of XLNetEncoder is created for feature extraction.

  2. Additional hyperparameters:

    “regr_strategy”: str

    The regression strategy, one of:

    • cls_time: Sequence-level regression based on the output of the last time step (which is the CLS token). Each sequence has a prediction.

    • all_time: Sequence-level regression based on the output of all time steps. Each sequence has a prediction.

    • time_wise: Step-wise regression, i.e., make regression for each time step based on its output.

    “logit_layer_kwargs”: dict

    Keyword arguments for the logit torch.nn.Linear layer constructor. Ignored if no extra logit layer is appended.

    “use_projection”: bool

    If True, an additional torch.nn.Linear layer is added after the summary step.

    “name”: str

    Name of the regressor.

param_groups(lr=None, lr_layer_scale=1.0, decay_base_params=False)[source]

Create parameter groups for optimizers. When lr_layer_scale is not 1.0, parameters from each layer form separate groups with different base learning rates.

The return value of this method can be used in the constructor of optimizers, for example:

model = XLNetRegressor(...)
param_groups = model.param_groups(lr=2e-5, lr_layer_scale=0.8)
optim = torch.optim.Adam(param_groups)

Parameters
  • lr (float) – The learning rate. Can be omitted if lr_layer_scale is 1.0.

  • lr_layer_scale (float) – Per-layer LR scaling rate. The i-th layer will be scaled by lr_layer_scale ^ (num_layers - i - 1).

  • decay_base_params (bool) – If True, treat non-layer parameters (e.g. embeddings) as if they’re in layer 0. If False, these parameters are not scaled.

Returns

The parameter groups, used as the first argument for optimizers.

forward(inputs, segment_ids=None, input_mask=None)[source]

Feeds the inputs through the network and makes regression.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • segment_ids – Shape [batch_size, max_time].

  • input_mask – Float tensor of shape [batch_size, max_time]. Note that positions with value 1 are masked out.

Returns

Regression predictions.

  • If regr_strategy is cls_time or all_time, predictions have shape [batch_size].

  • If regr_strategy is time_wise, predictions have shape [batch_size, max_time].

property output_size

The feature size of forward() output. Since output size is only determined by input, the feature size is equal to -1.

EncoderDecoders

T5EncoderDecoder

class texar.torch.modules.T5EncoderDecoder(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

The pre-trained T5 model. Please see PretrainedT5Mixin for a brief description of T5.

This module basically stacks WordEmbedder, T5Encoder, and T5Decoder.

Parameters
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., T5-Small). Please refer to PretrainedT5Mixin for all supported models. If None, the model name in hparams is used.

  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

reset_parameters()[source]

Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

  • The model arch is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.

  • Otherwise, the model arch is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.

  • If the above two are None, the encoder arch is defined by the configurations in hparams and weights are randomly initialized.

{
    "pretrained_model_name": "T5-Small",
    "embed": {
        "dim": 768,
        "name": "word_embeddings"
    },
    "vocab_size": 32128,

    "encoder": {
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": False,
            "is_decoder": False,
            "relative_attention_num_buckets": 32,
        },
        "eps": 1e-6,
        "name": "encoder",
        "num_blocks": 12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": False
                    },
                    "type": "Linear"
                },
                {"type": "ReLU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": False
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
    },

    "decoder": {
        "eps": 1e-6,
        "dim": 768,
        "embedding_dropout": 0.1,
        "multihead_attention": {
            "dropout_rate": 0.1,
            "name": "self",
            "num_heads": 12,
            "num_units": 768,
            "output_dim": 768,
            "use_bias": False,
            "is_decoder": True,
            "relative_attention_num_buckets": 32,
        },
        "name": "decoder",
        "num_blocks": 12,
        "poswise_feedforward": {
            "layers": [
                {
                    "kwargs": {
                        "in_features": 768,
                        "out_features": 3072,
                        "bias": False
                    },
                    "type": "Linear"
                },
                {"type": "ReLU"},
                {
                    "kwargs": {
                        "in_features": 3072,
                        "out_features": 768,
                        "bias": False
                    },
                    "type": "Linear"
                }
            ]
        },
        "residual_dropout": 0.1,
    },
    "hidden_size": 768,
    "initializer": None,
    "name": "t5_encoder_decoder",
}

Here:

The default parameters are values for the T5-Small model.

“pretrained_model_name”: str or None

The name of the pre-trained T5 model. If None, the model will be randomly initialized.

“embed”: dict

Hyperparameters for word embedding layer.

“vocab_size”: int

The vocabulary size of inputs in T5 model.

“encoder”: dict

Hyperparameters for the T5Encoder. See default_hparams() for details.

“decoder”: dict

Hyperparameters for the T5Decoder. See default_hparams() for details.

“hidden_size”: int

Size of the hidden layer.

“initializer”: dict, optional

Hyperparameters of the default initializer that initializes variables created in this module. See get_initializer() for details.

“name”: str

Name of the module.
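The precedence for choosing the model architecture described above (constructor argument first, then hparams, otherwise random initialization) can be sketched as follows. resolve_model_name is a hypothetical helper that illustrates only the lookup order; it does not load any weights.

```python
def resolve_model_name(pretrained_model_name=None, hparams=None):
    """Pick the model name following the documented precedence (hypothetical helper)."""
    if pretrained_model_name is not None:
        # The constructor argument wins over anything in hparams.
        return pretrained_model_name
    # Fall back to hparams; None means "configure from hparams, random weights".
    return (hparams or {}).get("pretrained_model_name")


print(resolve_model_name("T5-Small", {"pretrained_model_name": None}))  # T5-Small
print(resolve_model_name(None, {"pretrained_model_name": "T5-Small"}))  # T5-Small
print(resolve_model_name(None, {"pretrained_model_name": None}))        # None
```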

forward(inputs, sequence_length=None)[source]

Performs encoding and decoding.

Parameters
  • inputs – Either a 2D Tensor of shape [batch_size, max_time], containing the ids of tokens in input sequences, or a 3D Tensor of shape [batch_size, max_time, vocab_size], containing soft token ids (i.e., weights or probabilities) used to mix the embedding vectors.

  • sequence_length – A 1D torch.Tensor of shape [batch_size]. Input tokens beyond respective sequence lengths are masked out automatically.

Returns

A pair (encoder_output, decoder_output)

  • encoder_output: A Tensor of shape [batch_size, max_time, dim] containing the encoded vectors.

  • decoder_output: An instance of TransformerDecoderOutput which contains sample_id and logits.

property output_size

The feature size of forward() output of the encoder.

Pre-trained

PretrainedMixin

class texar.torch.modules.PretrainedMixin(hparams=None)[source]

A mixin class for all pre-trained classes to inherit.

load_pretrained_config(pretrained_model_name=None, cache_dir=None, hparams=None)[source]

Load paths and configurations of the pre-trained model.

Parameters
  • pretrained_model_name (optional) – A str with the name of a pre-trained model to load. If None, will use the model name in hparams.

  • cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory will be used.

  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

reset_parameters()[source]

Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "pretrained_model_name": None,
    "name": "pretrained_base"
}

classmethod download_checkpoint(pretrained_model_name, cache_dir=None)[source]

Download the specified pre-trained checkpoint, and return the directory in which the checkpoint is cached.

Parameters
  • pretrained_model_name (str) – Name of the model checkpoint.

  • cache_dir (str, optional) – Path to the cache directory. If None, uses the default directory (user’s home directory).

Returns

Path to the cache directory.
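A minimal sketch of the cache_dir fallback, assuming the default location is a texar_data folder under the user's home directory, as the cache_dir descriptions of the pre-trained modules in this reference state. resolve_cache_dir is a hypothetical helper; it performs no download and does not create the directory.

```python
from pathlib import Path


def resolve_cache_dir(cache_dir=None):
    """Return the directory to use for cached checkpoints (hypothetical helper)."""
    if cache_dir is None:
        # Assumed default: a texar_data folder under the user's home directory.
        return Path.home() / "texar_data"
    return Path(cache_dir)


print(resolve_cache_dir())
print(resolve_cache_dir("/tmp/my_models"))
```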

abstract classmethod _transform_config(pretrained_model_name, cache_dir)[source]

Load the official configuration file and transform it into Texar-style hyperparameters.

Parameters
  • pretrained_model_name (str) – Name of the pre-trained model.

  • cache_dir (str) – Path to the cache directory.

Returns

Texar module hyperparameters.

Return type

dict

abstract _init_from_checkpoint(pretrained_model_name, cache_dir, **kwargs)[source]

Initialize model parameters from weights stored in the pre-trained checkpoint.

Parameters
  • pretrained_model_name (str) – Name of the pre-trained model.

  • cache_dir (str) – Path to the cache directory.

  • **kwargs – Additional arguments for specific models.
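The two abstract hooks above follow a common mixin pattern: the base class defines the loading flow, and each model family supplies its own configuration translation and weight loading. The sketch below illustrates that pattern with toy classes. ToyPretrainedMixin, ToyModel, the load method, and the order in which the hooks are called are all assumptions for illustration, not Texar's implementation.

```python
from abc import ABC, abstractmethod


class ToyPretrainedMixin(ABC):
    """Toy stand-in for the mixin pattern (hypothetical class)."""

    def load(self, pretrained_model_name, cache_dir):
        # Assumed flow: translate the config, then load the weights.
        config = self._transform_config(pretrained_model_name, cache_dir)
        self._init_from_checkpoint(pretrained_model_name, cache_dir)
        return config

    @classmethod
    @abstractmethod
    def _transform_config(cls, pretrained_model_name, cache_dir):
        """Return module hyperparameters as a dict."""

    @abstractmethod
    def _init_from_checkpoint(self, pretrained_model_name, cache_dir, **kwargs):
        """Copy checkpoint weights into the module."""


class ToyModel(ToyPretrainedMixin):
    @classmethod
    def _transform_config(cls, pretrained_model_name, cache_dir):
        # A real subclass would read the official config file from cache_dir.
        return {"pretrained_model_name": pretrained_model_name, "hidden_size": 8}

    def _init_from_checkpoint(self, pretrained_model_name, cache_dir, **kwargs):
        # A real subclass would load tensors here; this toy only records the call.
        self.loaded_from = (pretrained_model_name, cache_dir)


model = ToyModel()
print(model.load("toy-model", "/tmp/cache"))
```

Instantiating ToyPretrainedMixin directly raises TypeError, because both hooks are abstract.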

PretrainedBERTMixin

class texar.torch.modules.PretrainedBERTMixin(hparams=None)[source]

A mixin class to support loading pre-trained checkpoints for modules that implement the BERT model.

Both standard BERT models and many domain specific BERT-based models are supported. You can specify the pretrained_model_name argument to pick which pre-trained BERT model to use. All available categories of pre-trained models (and names) include:

  • Standard BERT: proposed in (Devlin et al. 2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. A bidirectional Transformer language model pre-trained on large text corpora. Available model names include:

    • bert-base-uncased: 12-layer, 768-hidden, 12-heads, 110M parameters.

    • bert-large-uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters.

    • bert-base-cased: 12-layer, 768-hidden, 12-heads, 110M parameters.

    • bert-large-cased: 24-layer, 1024-hidden, 16-heads, 340M parameters.

    • bert-base-multilingual-uncased: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters.

    • bert-base-multilingual-cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters.

    • bert-base-chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters.

  • BioBERT: proposed in (Lee et al. 2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. A domain specific language representation model pre-trained on large-scale biomedical corpora. Based on the BERT architecture, BioBERT effectively transfers the knowledge from a large amount of biomedical texts to biomedical text mining models with minimal task-specific architecture modifications. Available model names include:

    • biobert-v1.0-pmc: BioBERT v1.0 (+ PMC 270K) - based on BERT-base-Cased (same vocabulary).

    • biobert-v1.0-pubmed-pmc: BioBERT v1.0 (+ PubMed 200K + PMC 270K) - based on BERT-base-Cased (same vocabulary).

    • biobert-v1.0-pubmed: BioBERT v1.0 (+ PubMed 200K) - based on BERT-base-Cased (same vocabulary).

    • biobert-v1.1-pubmed: BioBERT v1.1 (+ PubMed 1M) - based on BERT-base-Cased (same vocabulary).

  • SciBERT: proposed in (Beltagy et al. 2019) SciBERT: A Pretrained Language Model for Scientific Text. A BERT model trained on scientific text. SciBERT leverages unsupervised pre-training on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. Available model names include:

    • scibert-scivocab-uncased: Uncased version of the model trained on its own vocabulary.

    • scibert-scivocab-cased: Cased version of the model trained on its own vocabulary.

    • scibert-basevocab-uncased: Uncased version of the model trained on the original BERT vocabulary.

    • scibert-basevocab-cased: Cased version of the model trained on the original BERT vocabulary.

  • SpanBERT: proposed in (Joshi et al. 2019) SpanBERT: Improving Pre-training by Representing and Predicting Spans. As a variant of the standard BERT model, SpanBERT extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Differing from the standard BERT, the SpanBERT model does not use segmentation embedding. Available model names include:

    • spanbert-base-cased: SpanBERT using the BERT-base architecture, 12-layer, 768-hidden, 12-heads, 110M parameters.

    • spanbert-large-cased: SpanBERT using the BERT-large architecture, 24-layer, 1024-hidden, 16-heads, 340M parameters.
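
For quick reference, the architecture figures listed above for the standard BERT checkpoints can be collected into a small lookup table. This is only a convenience sketch: `BertSpec`, `BERT_SPECS`, and `describe` are hypothetical names that are not part of texar.torch, and the numbers are copied from the list above.

```python
from typing import NamedTuple


class BertSpec(NamedTuple):
    """Architecture figures as listed in the documentation above."""
    num_layers: int
    hidden_size: int
    num_heads: int


# Hypothetical lookup table; not part of the texar.torch API.
BERT_SPECS = {
    "bert-base-uncased": BertSpec(12, 768, 12),
    "bert-large-uncased": BertSpec(24, 1024, 16),
    "bert-base-cased": BertSpec(12, 768, 12),
    "bert-large-cased": BertSpec(24, 1024, 16),
    "bert-base-multilingual-uncased": BertSpec(12, 768, 12),
    "bert-base-multilingual-cased": BertSpec(12, 768, 12),
    "bert-base-chinese": BertSpec(12, 768, 12),
}


def describe(pretrained_model_name: str) -> str:
    """Format the figures of one checkpoint in the style used above."""
    spec = BERT_SPECS[pretrained_model_name]
    return (f"{spec.num_layers}-layer, {spec.hidden_size}-hidden, "
            f"{spec.num_heads}-heads")


print(describe("bert-large-cased"))
```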

We provide the following BERT classes:

PretrainedRoBERTaMixin

class texar.torch.modules.PretrainedRoBERTaMixin(hparams=None)[source]

A mixin class to support loading pre-trained checkpoints for modules that implement the RoBERTa model.

The RoBERTa model was proposed in (Liu et al. 2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. As a variant of the standard BERT model, RoBERTa trains for more iterations on more data with a larger batch size as well as other tweaks in pre-training. Differing from the standard BERT, the RoBERTa model does not use segmentation embedding. Available model names include:

  • roberta-base: RoBERTa using the BERT-base architecture, 125M parameters.

  • roberta-large: RoBERTa using the BERT-large architecture, 355M parameters.

We provide the following RoBERTa classes:

PretrainedGPT2Mixin

class texar.torch.modules.PretrainedGPT2Mixin(hparams=None)[source]

A mixin class to support loading pre-trained checkpoints for modules that implement the GPT2 model.

The GPT2 model was proposed in Language Models are Unsupervised Multitask Learners by Radford et al. from OpenAI. It is a unidirectional Transformer model pre-trained using the vanilla language modeling objective on a large corpus.

The available GPT2 models are as follows:

  • gpt2-small: Small version of GPT-2, 124M parameters.

  • gpt2-medium: Medium version of GPT-2, 355M parameters.

  • gpt2-large: Large version of GPT-2, 774M parameters.

  • gpt2-xl: XL version of GPT-2, 1558M parameters.

We provide the following GPT2 classes:

_init_from_checkpoint(pretrained_model_name, cache_dir, load_output_layer=True, **kwargs)[source]

Initialize model parameters from weights stored in the pre-trained checkpoint.

Parameters
  • pretrained_model_name (str) – Name of the pre-trained model.

  • cache_dir (str) – Path to the cache directory.

  • load_output_layer (bool) – If False, will not load weights of the output layer. Set this argument to False when loading weights into a GPT2 encoder. Defaults to True.

PretrainedXLNetMixin

class texar.torch.modules.PretrainedXLNetMixin(hparams=None)[source]

A mixin class to support loading pre-trained checkpoints for modules that implement the XLNet model.

The XLNet model was proposed in XLNet: Generalized Autoregressive Pretraining for Language Understanding by Yang et al. It is based on the Transformer-XL model, pre-trained on a large corpus using a language modeling objective that considers all permutations of the input sentence.

The available XLNet models are as follows:

  • xlnet-base-cased: 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).

  • xlnet-large-cased: 24-layer, 1024-hidden, 16-heads.

We provide the following XLNet classes:

reset_parameters()[source]

Initialize parameters of the pre-trained model. This method is only called if pre-trained checkpoints are not loaded.

PretrainedT5Mixin

class texar.torch.modules.PretrainedT5Mixin(hparams=None)[source]

A mixin class to support loading pre-trained checkpoints for modules that implement the T5 model.

The T5 model was proposed in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Raffel et al. from Google. It treats multiple NLP tasks in a similar manner by encoding the different tasks as text directives in the input stream. This enables a single model to be trained supervised on a wide variety of NLP tasks. The T5 model examines factors relevant for leveraging transfer learning at scale from pure unsupervised pre-training to supervised tasks.

The available T5 models are as follows:

  • T5-Small: Small version of T5, 60 million parameters.

  • T5-Base: Base-line version of T5, 220 million parameters.

  • T5-Large: Large Version of T5, 770 million parameters.

  • T5-3B: A version of T5 with 3 billion parameters.

  • T5-11B: A version of T5 with 11 billion parameters.

We provide the following classes:

  • T5Encoder for loading weights for the encoder stack.

  • T5Decoder for loading weights for the decoding stack.

  • T5EncoderDecoder as a raw pre-trained model.

Connectors

ConnectorBase

class texar.torch.modules.ConnectorBase(*args, **kwds)[source]

Base class inherited by all connector classes. A connector transforms inputs into outputs with a specified structure and shape. For example, it can transform the final state of an encoder into the initial state of a decoder, or perform stochastic sampling in between, as in Variational Autoencoders (VAEs).

Parameters
  • output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

property output_size

The feature size of the forward() output tensor(s). It is usually equal to the size of the last dimension of the output tensor.

ConstantConnector

class texar.torch.modules.ConstantConnector(*args, **kwds)[source]

Creates a constant tensor or (nested) tuple of Tensors that contains a constant value.

Parameters
  • output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size. If output_size is a tuple (1, 2, 3), then the output structure will be a tuple of three tensors of shapes ([batch_size, 1], [batch_size, 2], [batch_size, 3]). If output_size is torch.Size([1, 2, 3]), then the output will be a single tensor of shape [batch_size, 1, 2, 3].

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

This connector does not have trainable parameters.

Example

state_size = (1, 2, 3)
connector = ConstantConnector(state_size, hparams={"value": 1.})
one_state = connector(batch_size=64)
# `one_state` structure: (Tensor_1, Tensor_2, Tensor_3),
# Tensor_1.size() == torch.Size([64, 1])
# Tensor_2.size() == torch.Size([64, 2])
# Tensor_3.size() == torch.Size([64, 3])
# Tensors are filled with 1.0.
size = torch.Size([1, 2, 3])
connector_size = ConstantConnector(size, hparams={"value": 2.})
size_state = connector_size(batch_size=64)
# `size_state` structure: Tensor with size [64, 1, 2, 3].
# Tensor is filled with 2.0.
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "value": 0.,
    "name": "constant_connector"
}

Here:

“value”: float

The constant scalar that the output tensor(s) has.

“name”: str

Name of the connector.
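
The way missing hyperparameters fall back to their defaults can be illustrated with a plain dictionary merge. The sketch below is a stand-in under stated assumptions: `merge_with_defaults` is a hypothetical helper, rejecting unknown keys is an assumed behavior, and the library itself exposes hyperparameters as an HParams instance rather than a dict.

```python
def default_hparams():
    """Defaults copied from the ConstantConnector documentation above."""
    return {"value": 0., "name": "constant_connector"}


def merge_with_defaults(hparams=None):
    """Hypothetical helper: user values override defaults, gaps are filled."""
    merged = default_hparams()
    for key, value in (hparams or {}).items():
        if key not in merged:
            # Assumption: only hyperparameters with a default are allowed.
            raise ValueError(f"Unknown hyperparameter: {key!r}")
        merged[key] = value
    return merged


# "name" is not given, so it keeps its default value.
print(merge_with_defaults({"value": 1.}))
```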

forward(batch_size)[source]

Creates output tensor(s) that has the given value.

Parameters

batch_size – An int or int scalar tensor, the batch size.

Returns

A (structure of) tensor whose structure is the same as output_size, filled with the constant given by the "value" hyperparameter.
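
The mapping from output_size to output shapes described above can be sketched without any tensor library. This is an illustration only: `Size` is a hypothetical stand-in for torch.Size (which is likewise a tuple subclass), and `constant_output_shapes` is not part of the library.

```python
class Size(tuple):
    """Hypothetical stand-in for torch.Size, which also subclasses tuple."""


def constant_output_shapes(output_size, batch_size):
    """Return the shape(s) a constant connector would produce."""
    if isinstance(output_size, int):
        # An int `dim` yields a single tensor of shape [batch_size, dim].
        return (batch_size, output_size)
    if isinstance(output_size, Size):
        # A Size describes one tensor: prepend the batch dimension.
        return (batch_size, *output_size)
    # A plain tuple describes a structure: one tensor per element.
    return tuple(constant_output_shapes(size, batch_size)
                 for size in output_size)


print(constant_output_shapes((1, 2, 3), batch_size=64))
print(constant_output_shapes(Size([1, 2, 3]), batch_size=64))
```

The `Size` check has to come before the generic tuple branch, because a size object is itself a tuple.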

ForwardConnector

class texar.torch.modules.ForwardConnector(*args, **kwds)[source]

Transforms inputs to have specified structure.

Example:

from collections import namedtuple

state_size = namedtuple('LSTMStateTuple', ['h', 'c'])(256, 256)
# state_size == LSTMStateTuple(h=256, c=256)
connector = ForwardConnector(state_size)
output = connector([tensor_1, tensor_2])
# output == LSTMStateTuple(h=tensor_1, c=tensor_2)
Parameters
  • output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

This connector does not have trainable parameters. See forward() for the inputs and outputs of the connector. The input to the connector must have the same structure as output_size, or must have the same number of elements and be re-packable into the structure of output_size. Note that if the input is or contains a dict instance, the keys will be sorted so that packing happens in a deterministic order (see pack_sequence_as()).

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "name": "forward_connector"
}

Here:

“name”: str

Name of the connector.

forward(inputs)[source]

Transforms inputs to have the same structure as output_size. Values of the inputs are not changed. inputs must either have the same structure as output_size, or have the same number of elements as output_size.

Parameters

inputs – The input (structure of) tensor to pass forward.

Returns

A (structure of) tensors that re-packs inputs to have the specified structure of output_size.
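
The re-packing described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `flatten` and `pack_like` are hypothetical helpers, and strings stand in for tensors.

```python
from collections import namedtuple

LSTMStateTuple = namedtuple("LSTMStateTuple", ["h", "c"])


def flatten(structure):
    """Return the leaves of a nested structure in a deterministic order."""
    if isinstance(structure, dict):
        # Dict keys are visited in sorted order.
        return [leaf for key in sorted(structure)
                for leaf in flatten(structure[key])]
    if isinstance(structure, (list, tuple)):
        return [leaf for item in structure for leaf in flatten(item)]
    return [structure]


def pack_like(template, inputs):
    """Re-pack the leaves of `inputs` into the structure of `template`."""
    leaves = flatten(inputs)
    if len(leaves) != len(flatten(template)):
        raise ValueError("inputs and template have different element counts")
    remaining = iter(leaves)

    def build(node):
        if isinstance(node, dict):
            return {key: build(node[key]) for key in sorted(node)}
        if isinstance(node, tuple) and hasattr(node, "_fields"):
            # Namedtuples are rebuilt field by field.
            return type(node)(*[build(item) for item in node])
        if isinstance(node, (list, tuple)):
            return type(node)(build(item) for item in node)
        return next(remaining)

    return build(template)


state_size = LSTMStateTuple(256, 256)
print(pack_like(state_size, ["tensor_1", "tensor_2"]))
```

Values pass through unchanged; only the container around them is replaced.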

MLPTransformConnector

class texar.torch.modules.MLPTransformConnector(*args, **kwds)[source]

Transforms inputs with an MLP layer and packs the results into the specified structure and size.

Example

cell = LSTMCell(num_units=256)
# cell.state_size == LSTMStateTuple(c=256, h=256)
connector = MLPTransformConnector(cell.state_size)
inputs = torch.zeros([64, 10])
output = connector(inputs)
# output == LSTMStateTuple(c=tensor_of_shape_(64, 256),
#                          h=tensor_of_shape_(64, 256))
# Use to connect an encoder and a decoder with different state sizes
encoder = UnidirectionalRNNEncoder(...)
_, final_state = encoder(inputs=...)
decoder = BasicRNNDecoder(...)
connector = MLPTransformConnector(decoder.state_size)
_ = decoder(
    initial_state=connector(final_state),
    ...)
Parameters
  • output_size – Size of output excluding the batch dimension. For example, set output_size to dim to generate output of shape [batch_size, dim]. Can be an int, a tuple of int, a torch.Size, or a tuple of torch.Size. For example, to transform inputs to have decoder state size, set output_size=decoder.state_size.

  • linear_layer_dim (int) – Size of the final dimension of the input tensors, i.e., the input dimension of the MLP linear layer.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

The input to the connector can have arbitrary structure and size.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "activation_fn": "texar.torch.core.layers.identity",
    "name": "mlp_connector"
}

Here:

“activation_fn”: str or callable

The activation function applied to the outputs of the MLP transformation layer. Can be a function, or its name or module path.

“name”: str

Name of the connector.

forward(inputs)[source]

Transforms inputs with an MLP layer and packs the results to have the same structure as specified by output_size.

Parameters

inputs – Input (structure of) tensors to be transformed. Must be a tensor of shape [batch_size, ...] or a (nested) tuple of such Tensors. That is, the first dimension of (each) tensor must be the batch dimension.

Returns

A tensor or a (nested) tuple of tensors with the same structure as output_size.

Networks

FeedForwardNetworkBase

class texar.torch.modules.FeedForwardNetworkBase(hparams=None)[source]

Base class inherited by all feed-forward network classes.

Parameters

hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See forward() for the inputs and outputs.

static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "name": "NN"
}
forward(input)[source]

Feeds forward inputs through the network layers and returns outputs.

Parameters

input – The inputs to the network. The requirements on the inputs depend on the first layer and subsequent layers in the network.

Returns

The output of the network.

append_layer(layer)[source]

Appends a layer to the end of the network.

Parameters

layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.

has_layer(layer_name)[source]

Returns True if a layer with the given name exists. Returns False otherwise.

Parameters

layer_name (str) – Name of the layer.

layer_by_name(layer_name)[source]

Returns the layer with the name. Returns None if the layer name does not exist.

Parameters

layer_name (str) – Name of the layer.

property layers_by_name

A dictionary mapping layer names to the layers.

property layers

A list of the layers.

property layer_names

A list of uniquified layer names.
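
The bookkeeping behind these methods and properties can be sketched with a small container. This is an illustrative stand-in, not the library's code: `LayerRegistry` is a hypothetical class, the name is passed explicitly here for simplicity, and the `_1`, `_2` suffix scheme used to uniquify names is an assumption.

```python
class LayerRegistry:
    """Hypothetical stand-in for a network's layer bookkeeping."""

    def __init__(self):
        self._layers = []
        self._layer_names = []
        self._layers_by_name = {}

    def append_layer(self, layer, name):
        """Append a layer, making its name unique if it is already taken."""
        unique, suffix = name, 1
        while unique in self._layers_by_name:
            unique = f"{name}_{suffix}"  # assumed uniquifying scheme
            suffix += 1
        self._layers.append(layer)
        self._layer_names.append(unique)
        self._layers_by_name[unique] = layer
        return unique

    def has_layer(self, layer_name):
        return layer_name in self._layers_by_name

    def layer_by_name(self, layer_name):
        # Returns None when the name is unknown.
        return self._layers_by_name.get(layer_name)

    @property
    def layers(self):
        return list(self._layers)

    @property
    def layer_names(self):
        return list(self._layer_names)


registry = LayerRegistry()
registry.append_layer("first layer object", "linear")
registry.append_layer("second layer object", "linear")
print(registry.layer_names)
```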

FeedForwardNetwork

class texar.torch.modules.FeedForwardNetwork(layers=None, hparams=None)[source]

Feed-forward neural network that consists of a sequence of layers.

Parameters
  • layers (list, optional) – A list of torch.nn.Linear instances composing the network. If not given, layers are created according to hparams.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See forward() for the inputs and outputs.

Example

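import torch
from texar.torch.modules import FeedForwardNetwork

hparams = {  # Builds a two-layer dense NN
    "layers": [
        # "Linear" refers to torch.nn.Linear
        {"type": "Linear", "kwargs": {"in_features": 100, "out_features": 256}},
        {"type": "Linear", "kwargs": {"in_features": 256, "out_features": 10}}
    ]
}
nn = FeedForwardNetwork(hparams=hparams)

inputs = torch.randn([64, 100])
outputs = nn(inputs)
# outputs == Tensor of shape [64, 10]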
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    "layers": [],
    "name": "NN"
}

Here:

“layers”: list

A list of layer hyperparameters. See get_layer() for details on layer hyperparameters.

“name”: str

Name of the network.

forward(input)

Feeds forward inputs through the network layers and returns outputs.

Parameters

input – The inputs to the network. The requirements on the inputs depend on the first layer and subsequent layers in the network.

Returns

The output of the network.

append_layer(layer)

Appends a layer to the end of the network.

Parameters

layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.

has_layer(layer_name)

Returns True if a layer with the given name exists. Returns False otherwise.

Parameters

layer_name (str) – Name of the layer.

layer_by_name(layer_name)

Returns the layer with the name. Returns None if the layer name does not exist.

Parameters

layer_name (str) – Name of the layer.

property layers_by_name

A dictionary mapping layer names to the layers.

property layers

A list of the layers.

property layer_names

A list of uniquified layer names.

Conv1DNetwork

class texar.torch.modules.Conv1DNetwork(in_channels, in_features=None, hparams=None)[source]

Simple Conv-1D network that consists of a sequence of convolutional layers followed by a sequence of dense layers.

Parameters
  • in_channels (int) – Number of channels in the input tensor.

  • in_features (int) – Size of the feature dimension in the input tensor.

  • hparams (dict, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

See forward() for the inputs and outputs. If "data_format" is set to "channels_first" (this is the default), inputs must be a tensor of shape [batch_size, channels, length]. If "data_format" is set to "channels_last", inputs must be a tensor of shape [batch_size, length, channels]. For example, for sequence classification, length corresponds to time steps, and channels corresponds to embedding dim.
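
The two layouts differ only in which axis holds the channels. The following sketch makes that explicit on shape tuples alone; `describe_input` and `to_channels_first` are hypothetical helpers, not library functions.

```python
def describe_input(shape, data_format="channels_first"):
    """Name the three axes of a 3D input shape for a given data format."""
    batch_size, second, third = shape
    if data_format == "channels_first":
        channels, length = second, third
    elif data_format == "channels_last":
        length, channels = second, third
    else:
        raise ValueError(f"Unknown data_format: {data_format!r}")
    return {"batch_size": batch_size, "channels": channels, "length": length}


def to_channels_first(shape, data_format):
    """Return the equivalent [batch_size, channels, length] shape."""
    info = describe_input(shape, data_format)
    return (info["batch_size"], info["channels"], info["length"])


# A batch of 64 sequences, 256 time steps, embedding dim 20:
print(to_channels_first((64, 256, 20), "channels_last"))
```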

Example:

nn = Conv1DNetwork(in_channels=20, in_features=256) # Use the default

inputs = torch.randn([64, 20, 256])
outputs = nn(inputs)
# outputs == Tensor of shape [64, 256], because the final dense layer
# has size 256.
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values.

{
    # (1) Conv layers
    "num_conv_layers": 1,
    "out_channels": 128,
    "kernel_size": [3, 4, 5],
    "conv_activation": "ReLU",
    "conv_activation_kwargs": None,
    "other_conv_kwargs": {},
    "data_format": "channels_first",
    # (2) Pooling layers
    "pooling": "MaxPool1d",
    "pool_size": None,
    "pool_stride": 1,
    "other_pool_kwargs": {},
    # (3) Dense layers
    "num_dense_layers": 1,
    "out_features": 256,
    "dense_activation": None,
    "dense_activation_kwargs": None,
    "final_dense_activation": None,
    "final_dense_activation_kwargs": None,
    "other_dense_kwargs": None,
    # (4) Dropout
    "dropout_conv": [1],
    "dropout_dense": [],
    "dropout_rate": 0.75,
    # (5) Others
    "name": "conv1d_network"
}

Here:

  1. For convolutional layers:

    “num_conv_layers”: int

    Number of convolutional layers.

    “out_channels”: int or list

    The number of out_channels in the convolution, i.e., the dimensionality of the output space.

    • If "num_conv_layers" > 1 and "out_channels" is an int, all convolution layers will have the same number of output channels.

    • If "num_conv_layers" > 1 and "out_channels" is a list, the length must equal "num_conv_layers". The number of output channels of each convolution layer will be the corresponding element from this list.

    “kernel_size”: int or list

    Lengths of 1D convolution windows.

    • If "num_conv_layers" = 1, this can also be an int list of arbitrary length denoting differently sized convolution windows. The number of output channels of each size is specified by "out_channels". For example, the default values will create 3 convolution layers with kernel sizes 3, 4, and 5, respectively, each with 128 output channels.

    • If "num_conv_layers" > 1, this must be a list of length "num_conv_layers". Each element can be an int or an int list of arbitrary length denoting the kernel size of each layer.

    “conv_activation”: str or callable

    Activation applied to the output of the convolutional layers. Set to None to maintain a linear activation. See get_layer() for more details.

    “conv_activation_kwargs”: dict, optional

    Keyword arguments for the activation following the convolutional layer. See get_layer() for more details.

    “other_conv_kwargs”: list or dict, optional

    Other keyword arguments for torch.nn.Conv1d constructor, e.g., padding.

    • If a dict, the same dict is applied to all the convolution layers.

    • If a list, the length must equal "num_conv_layers". This list can contain nested lists. If the convolution layer at index i has multiple kernel sizes, then the corresponding element of this list can also be a list of length equal to "kernel_size" at index i. If the element at index i is instead a dict, then the same dict gets applied to all the convolution layers at index i.

    “data_format”: str, optional

    Data format of the input tensor. Defaults to channels_first, which treats the dimension right after the batch dimension as the channel dimension. Set it to channels_last to treat the last dimension as the channel dimension. This argument can also be passed to forward(), in which case the value specified here is ignored.

  2. For pooling layers:

    “pooling”: str or class or instance

    Pooling layer after each convolutional layer. Can be a pooling layer class, its name or module path, or a class instance.

    “pool_size”: int or list, optional

    Size of the pooling window. If an int, all pooling layers will have the same pool size. If a list, the list length must equal "num_conv_layers". If None and the pooling type is either MaxPool1d or AvgPool1d, the pool size will be set to the input size. That is, the output of the pooling layer is a single unit.

    “pool_stride”: int or list, optional

    Strides of the pooling operation. If an int, all layers will have the same stride. If a list, the list length must equal "num_conv_layers".

    “other_pool_kwargs”: list or dict, optional

    Other keyword arguments for pooling layer class constructor.

    • If a dict, the same dict is applied to all the pooling layers.

    • If a list, the length must equal "num_conv_layers". The pooling arguments for layer i will be the element at index i from this list.

  3. For dense layers (note that here dense layers always follow convolutional and pooling layers):

    “num_dense_layers”: int

    Number of dense layers.

    “out_features”: int or list

    Dimension of features after the dense layers. If an int, all dense layers will have the same feature dimension. If a list of int, the list length must equal "num_dense_layers".

    “dense_activation”: str or callable

    Activation function applied to the output of the dense layers except the last dense layer output. Set to None to maintain a linear activation.

    “dense_activation_kwargs”: dict, optional

    Keyword arguments for dense layer activation functions before the last dense layer.

    “final_dense_activation”: str or callable

    Activation function applied to the output of the last dense layer. Set to None to maintain a linear activation.

    “final_dense_activation_kwargs”: dict, optional

    Keyword arguments for the activation function of last dense layer.

    “other_dense_kwargs”: dict, optional

    Other keyword arguments for dense layer class constructor.

  4. For dropouts:

    “dropout_conv”: int or list

    The indices of convolutional layers (starting from 0) whose inputs have dropout applied. An index equal to num_conv_layers means dropout is applied to the output of the final convolutional layer. For example,

    {
        "num_conv_layers": 2,
        "dropout_conv": [0, 2]
    }
    

    will lead to a series of layers: -dropout-conv0-conv1-dropout-.

    The dropout mode (training or not) is controlled by self.training.

    “dropout_dense”: int or list

    Same as "dropout_conv" but applied to dense layers (index starting from 0).

    “dropout_rate”: float

    The dropout rate, between 0 and 1. For example, "dropout_rate": 0.1 would drop out 10% of elements.

  5. Others:

    “name”: str

    Name of the network.
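
The "dropout_conv" indexing rule described under item 4 can be checked with a few lines of plain Python. This is a sketch of the rule only: `conv_layer_layout` is a hypothetical helper, and pooling layers are left out for brevity.

```python
def conv_layer_layout(num_conv_layers, dropout_conv):
    """List conv layers in order, with dropout placed per `dropout_conv`."""
    if isinstance(dropout_conv, int):
        dropout_conv = [dropout_conv]
    layout = []
    for index in range(num_conv_layers):
        if index in dropout_conv:
            # Dropout is applied to the input of layer `index`.
            layout.append("dropout")
        layout.append(f"conv{index}")
    if num_conv_layers in dropout_conv:
        # Index == num_conv_layers: dropout on the final conv output.
        layout.append("dropout")
    return layout


print("-".join(conv_layer_layout(2, [0, 2])))
```

With the default values ("num_conv_layers": 1, "dropout_conv": [1]), the same rule places a single dropout after the only convolutional layer.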

forward(input, sequence_length=None, dtype=None, data_format=None)[source]

Feeds forward inputs through the network layers and returns outputs.

Parameters
  • input – The inputs to the network, which is a 3D tensor.

  • sequence_length (optional) – A torch.LongTensor of shape [batch_size] or a Python list containing the length of each element in input. If given, time steps beyond the length will first be masked out before feeding to the layers.

  • dtype (optional) – Type of the inputs. If not provided, infers from inputs automatically.

  • data_format (optional) – Data format of the input tensor. If channels_last, the last dimension is treated as the channel dimension, so the size of the input should be [batch_size, X, channel]. If channels_first, the first dimension after the batch dimension is treated as the channel dimension, so the size should be [batch_size, channel, X]. Defaults to None, in which case the value is taken from the hyperparameters.

Returns

The output of the final layer.
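
The effect of sequence_length can be illustrated on nested lists. The sketch assumes that "masked out" means set to zero, which the documentation above does not spell out; `mask_beyond_length` is a hypothetical helper operating on a channels_last layout of shape [batch_size, length, channels].

```python
def mask_beyond_length(batch, sequence_length):
    """Zero out time steps at or beyond each sequence's length.

    `batch` is a nested list of shape [batch_size, length, channels].
    Zero-filling is an assumption about what "masked out" means.
    """
    masked = []
    for sequence, length in zip(batch, sequence_length):
        masked.append([
            list(step) if t < length else [0.0] * len(step)
            for t, step in enumerate(sequence)
        ])
    return masked


batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
# The first sequence has 2 valid steps; the second uses all 3.
print(mask_beyond_length(batch, [2, 3]))
```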

append_layer(layer)

Appends a layer to the end of the network.

Parameters

layer – A subclass of torch.nn.Module, or a dict of layer hyperparameters.

has_layer(layer_name)

Returns True if a layer with the given name exists. Returns False otherwise.

Parameters

layer_name (str) – Name of the layer.

layer_by_name(layer_name)

Returns the layer with the name. Returns None if the layer name does not exist.

Parameters

layer_name (str) – Name of the layer.

property layers_by_name

A dictionary mapping layer names to the layers.

property layers

A list of the layers.

property layer_names

A list of uniquified layer names.