
Tensorflow - recurrent neural network (5) subword text classification

2022-08-06 18:09:48  plum_blossom



In the previous article, we introduced the LSTM model and used it for hands-on text classification and text generation.

Today we will use the imdb data from tensorflow_datasets and a subword-level approach to do text classification.


  • In the earlier exercise we used the imdb dataset; there we represented every sentence in the corpus with words as units, computed word embeddings, and fed the embeddings into our recurrent neural network to do text classification.

  • However, in NLP we often do not use words as units, because using words as units has two disadvantages:

    • First, the vocabulary of words is large, so we need a fairly large vocabulary and our model size becomes large;
    • Second, no matter how big the vocabulary is, it still has an upper limit, while a language contains many words; when we come across a word that is not in the vocabulary, we can only represent it as UNK.
  • So how do we solve this problem?

    • One way is to use a char-level model; char-level means character level, which for English is the 26 letters a-z plus the 10 digits 0-9.
    • The other is a subword-level model; subword-level sits between char-level and word-level. For example, 'hello' can be represented by the three subwords 'he', 'll', and 'o' (see the sketch after this list).
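As a rough illustration of the difference (the subword split below is hypothetical, simply mirroring the 'hello' example above, not necessarily what a real tokenizer would produce):

word = 'hello'

# char-level: every single character is a token
char_tokens = list(word)              # ['h', 'e', 'l', 'l', 'o']

# word-level: the whole word is one token, so the vocabulary has to be huge
word_tokens = [word]                  # ['hello']

# subword-level: frequent character chunks become tokens,
# so rare or unseen words can still be built from known pieces
subword_tokens = ['he', 'll', 'o']    # hypothetical split, as in the example above

print(char_tokens, word_tokens, subword_tokens)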

Next, let's see how to use subword-level tokens to train a text classification model.

  • 5.1 Let's start by introducing a new library: tensorflow_datasets. In this library, tensorflow defines many public datasets for us; they are all stored in dataset format and can be loaded with a single string.

  • 5.1.1 Load the data
    • as_supervised: if True, the dataset is returned as (input, label) tuples
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import pandas as pd
import matplotlib.pyplot as plt

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info = True,
                          as_supervised=True)

train_dataset, test_dataset = dataset['train'], dataset['test']
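As a quick sanity check (not part of the original steps, but harmless to run), we can peek at one element of the training set; because as_supervised=True, every element is a (text, label) pair, where the text is already a vector of subword ids:

# Take a single example before batching: text is a variable-length 1-D tensor of
# subword ids, label is 0 (negative review) or 1 (positive review).
for text, label in train_dataset.take(1):
    print(text.shape, text.dtype)
    print(label.numpy())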

5.1.2 Operate on the dataset:

  • encoder: converts words into subwords
tokenizer = info.features['text'].encoder
print('vocabulary size: {}'.format(tokenizer.vocab_size))

Output:

vocabulary size: 8185

5.1.3 Next, let's test what the tokenizer does to a sentence:

  • Use tokenizer.encode to encode a sentence
  • Use tokenizer.decode to turn it back into the original sentence
sample_string = 'Tensorflow is cool.'
tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print('Original string is {}'.format(original_string))

assert original_string == sample_string

Output:

Tokenized string is [6307, 2327, 2934, 7961, 9, 2724, 7975]
Original string is Tensorflow is cool.

Let's look at which subword each token in the sentence corresponds to:

for token in tokenized_string:
    print('{} --> {}'.format(token, tokenizer.decode([token])))

Output:

6307 --> Ten
2327 --> sor
2934 --> flow
7961 --> 
9 --> is 
2724 --> cool
7975 --> .

5.1.4 Transform the dataset, including shuffle and batch

padded_batch: pads each batch separately; it finds the longest sample in the batch and pads the other samples in that batch to the same length. Of course, you can also set the maximum length yourself (see the sketch after the code below).

buffer_size = 10000
batch_size = 64

print(tf.compat.v1.data.get_output_shapes(train_dataset))
print(tf.compat.v1.data.get_output_shapes(test_dataset))

train_dataset = train_dataset.shuffle(buffer_size)
train_dataset = train_dataset.padded_batch(batch_size, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_dataset.padded_batch(batch_size, tf.compat.v1.data.get_output_shapes(test_dataset))
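If you want to set the maximum length yourself instead of padding to the longest sample in each batch, one option is to pass an explicit padded_shapes. A minimal sketch (max_length = 200 is an arbitrary value chosen only for illustration; note that padded_batch only pads, so samples must first be truncated to at most max_length):

max_length = 200   # arbitrary cap, for illustration only

# Truncate every review to max_length ids, then pad each batch to exactly max_length.
# This would replace the shuffle/padded_batch calls above.
train_dataset_capped = (
    dataset['train']
    .map(lambda text, label: (text[:max_length], label))
    .shuffle(buffer_size)
    .padded_batch(batch_size, padded_shapes=([max_length], [])))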

5.1.5 Build the model:

vocab_size = tokenizer.vocab_size
embedding_dim = 16
batch_size = 512

bi_rnn_model = keras.models.Sequential([
    # map each subword id to a 16-dimensional dense vector
    keras.layers.Embedding(vocab_size, embedding_dim),
    # bidirectional LSTM; return_sequences=False keeps only the final output
    keras.layers.Bidirectional(
        keras.layers.LSTM(
            units = 32, return_sequences = False)),
    keras.layers.Dense(32, activation = 'relu'),
    # single sigmoid unit for binary (positive / negative) classification
    keras.layers.Dense(1, activation='sigmoid'),
])

bi_rnn_model.summary()
bi_rnn_model.compile(optimizer = 'adam',
                     loss = 'binary_crossentropy',
                     metrics = ['accuracy'])

Output:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          130960    
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                12544     
_________________________________________________________________
dense (Dense)                (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 145,617
Trainable params: 145,617
Non-trainable params: 0
_________________________________________________________________

Train the model and plot the learning curves:

history = bi_rnn_model.fit(
    train_dataset,
    epochs = 10,
    validation_data = test_dataset)
def plot_learning_curves(history, label, epochs, min_value, max_value):
    data = {}
    data[label] = history.history[label]
    data['val_'+label] = history.history['val_'+label]
    pd.DataFrame(data).plot(figsize=(8, 5))
    plt.grid(True)
    plt.axis([0, epochs, min_value, max_value])
    plt.show()
    
plot_learning_curves(history, 'accuracy', 10, 0, 1)
plot_learning_curves(history, 'loss', 10, 0, 1)

Output:

(figure: accuracy and loss learning curves)

The data imported from tfds differs somewhat from the data imported with keras, so the result cannot be compared directly with the previous model. Still, from the plots we can see that this model overfits a little less than before.

So we can see that the subword mechanism shrinks the vocabulary; a smaller vocabulary means fewer parameters, and fewer parameters mean a lower risk of overfitting.
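A quick back-of-the-envelope check of the embedding layer makes this concrete: its parameter count is vocab_size * embedding_dim, which matches the 130,960 shown in the model summary; a word-level vocabulary (10,000 here is just a hypothetical size for comparison) would cost proportionally more:

embedding_dim = 16

subword_vocab = 8185    # tokenizer.vocab_size from above
word_vocab = 10000      # hypothetical word-level vocabulary size, for comparison only

print(subword_vocab * embedding_dim)   # 130960 -> matches the Embedding layer in the summary
print(word_vocab * embedding_dim)      # 160000 -> more parameters for the same embedding_dim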


copyright notice
author[plum_blossom], please include the original link when reprinting, thank you.
https://en.cdmana.com/2022/218/202208061751323386.html
