Hello everyone. Today I'd like to share how to process text with deep learning. More technical content will follow in later posts, so keep an eye out if you're interested.
A rough timeline of how NLP technology has developed:
Common text standardization methods:
Note: some symbol tokens play a specific role or carry meaning in certain contexts, and in those cases they should not simply be removed.
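As a small sketch of my own (not from the original text), here is a standardization helper that lowercases and strips punctuation but deliberately keeps "?", which can matter for tasks such as question detection:
- import string
-
- def standardize(text, keep="?"):
-     text = text.lower()  # lowercase everything
-     drop = "".join(ch for ch in string.punctuation if ch not in keep)  # punctuation to remove
-     return "".join(ch for ch in text if ch not in drop)
-
- print(standardize("Are you kidding me?!"))  # -> "are you kidding me?"
-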
The three tokenization approaches:
The two kinds of text-processing models (bag-of-words models and sequence models):
Encode each token as a numerical representation, for example as a fixed binary vector. First, assign every token a unique integer index (pseudocode: standardize() and tokenize() stand for the standardization and tokenization steps, which the Vectorizer class further below implements):
- vocabulary = {}
- for text in dataset:  # iterate over the dataset
-     text = standardize(text)  # text standardization step
-     tokens = tokenize(text)  # produce the tokens
-     for token in tokens:  # iterate over the tokens
-         if token not in vocabulary:
-             vocabulary[token] = len(vocabulary)  # assign each new word a unique integer index
-
Then vectorize the integers, here via one-hot encoding:
- import numpy as np
-
- def one_hot_encode_token(token):
-     vector = np.zeros(len(vocabulary))  # start with an all-zero vector
-     token_index = vocabulary[token]  # the integer index of this token
-     vector[token_index] = 1  # set the value at that index to 1
-     return vector
-
Three points need attention here. For example, sequences usually have different lengths:
- [[5, 7, 8, 10, 1],
-  [4, 1, 3]]
-
After padding with the mask token (index 0) they become the same length (see the short padding sketch after this example):
- [[5, 7, 8, 10, 1],
-  [4, 1, 3, 0, 0]]
-
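A minimal padding sketch (my own illustration, assuming TensorFlow is available) that produces exactly the padded result above, with 0 as the mask index:
- import tensorflow as tf
-
- ragged = tf.ragged.constant([[5, 7, 8, 10, 1],
-                              [4, 1, 3]])  # sequences of different lengths
- print(ragged.to_tensor().numpy())  # to_tensor() pads the shorter rows with 0
- # [[ 5  7  8 10  1]
- #  [ 4  1  3  0  0]]
-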
In [1]:
- # Implement a simple Vectorizer
-
- import string
-
- class Vectorizer:
-     def standardize(self, text):
-         text = text.lower()  # convert everything to lowercase
-         result = "".join(char for char in text if char not in string.punctuation)  # strip punctuation
-         return result
-
-     def tokenize(self, text):
-         text = self.standardize(text)  # call the standardization method
-         return text.split()
-
-     def make_vocabulary(self, dataset):
-         self.vocabulary = {"": 0, "[UNK]": 1}  # reserve the mask index and the OOV index
-         for text in dataset:
-             text = self.standardize(text)  # standardize
-             tokens = self.tokenize(text)  # tokenize
-             for token in tokens:
-                 if token not in self.vocabulary:
-                     self.vocabulary[token] = len(self.vocabulary)
-         self.inverse_vocabulary = dict((v, k) for k, v in self.vocabulary.items())  # invert the vocabulary's key/value pairs
-
-     def encode(self, text):  # encoding
-         text = self.standardize(text)
-         tokens = self.tokenize(text)
-         return [self.vocabulary.get(token, 1) for token in tokens]
-
-     def decode(self, int_sequence):  # decoding
-         result = " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)
-         return result
-
-
- vectorizer = Vectorizer()
- dataset = ["I write, erase, rewrite", "Erase again and again", "A poppy blooms"]
-
- vectorizer.make_vocabulary(dataset)
-
In [2]:
- test_sequence = "I write, rewrite, and still rewrite again"
-
- encoded_sentence = vectorizer.encode(test_sequence)
- encoded_sentence
-
Out[2]:
- [2, 3, 5, 7, 1, 5, 6]
-
In [3]:
- decoded_sentence = vectorizer.decode(encoded_sentence)
- decoded_sentence  # "still" does not occur in the original corpus, so it gets the OOV index and is shown as [UNK]
-
Out[3]:
- 'i write rewrite and [UNK] rewrite again'
-
The TextVectorization layer's default behavior is to standardize text by converting it to lowercase and removing punctuation, and to tokenize by splitting on whitespace:
In [4]:
- from tensorflow.keras.layers import TextVectorization
-
- text_vectorization = TextVectorization(
-     output_mode="int"  # return word sequences encoded as integer indices
- )
-
In [5]:
- # You can supply custom standardization and tokenization functions; the defaults are equivalent to the code below
-
- import re  # regular-expression module
- import string
- import tensorflow as tf
-
- def custom_standardization_fn(string_tensor):
-     lowercase_string = tf.strings.lower(string_tensor)  # convert to lowercase
-     return tf.strings.regex_replace(lowercase_string, f'[{re.escape(string.punctuation)}]', '')  # replace punctuation with the empty string
-
- def custom_split_fn(string_tensor):
-     return tf.strings.split(string_tensor)  # split the string on whitespace
-
- text_vectorization = TextVectorization(output_mode="int",
-                                         standardize=custom_standardization_fn,  # standardization
-                                         split=custom_split_fn  # tokenization
-                                         )
-
To index the vocabulary of a text corpus, use the layer's adapt() method; its argument is a Dataset object that yields strings, or a list of Python strings.
In [6]:
- dataset = ["I write, erase, rewrite",
- "Erase again and again",
- "A poppy blooms"
- ]
-
- text_vectorization.adapt(dataset)
-
Retrieve the vocabulary with get_vocabulary(); its entries are sorted by frequency.
In [7]:
- # display the vocabulary
- text_vectorization.get_vocabulary()
-
Out[7]:
- ['',
- '[UNK]',
- 'erase',
- 'again',
- 'write',
- 'rewrite',
- 'poppy',
- 'i',
- 'blooms',
- 'and',
- 'a']
-
In [8]:
- vocabulary = text_vectorization.get_vocabulary()
-
- # test sentence
- test_sentence = "I write, rewrite, and still rewrite again"
-
- # encoding
- encoded_sentence = text_vectorization(test_sentence)
- encoded_sentence  # returns the integer index of each word
-
Out[8]:
- <tf.Tensor: shape=(7,), dtype=int64, numpy=array([7, 4, 5, 9, 1, 5, 3])>
-
In [9]:
- inverse_vocab = dict(enumerate(vocabulary))
- print("inverse_vocab", inverse_vocab) # 单词对应的索引 键-索引,值-单词
- # 解码
- decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
-
- decoded_sentence
- inverse_vocab {0: '', 1: '[UNK]', 2: 'erase', 3: 'again', 4: 'write', 5: 'rewrite', 6: 'poppy', 7: 'i', 8: 'blooms', 9: 'and', 10: 'a'}
-
Out[9]:
- 'i write rewrite and [UNK] rewrite again'
-
The TextVectorization layer mostly performs dictionary lookups, so it cannot run on a GPU or TPU; it only runs on the CPU.
Option 1: use the TextVectorization layer inside a tf.data pipeline
- int_sequence_dataset = string_dataset.map(  # string_dataset is a dataset that yields string tensors
-     text_vectorization,  # apply the vectorization layer
-     num_parallel_calls=4  # parallelize the map() calls across several CPU cores
- )
-
Option 2: make the TextVectorization layer part of the model
- text_input = keras.Input(shape=(), dtype="string")  # create a symbolic input tensor with dtype string
- vectorized_text = text_vectorization(text_input)  # vectorize it
- embedded_input = keras.layers.Embedding(...)(vectorized_text)  # keep chaining new layers, just like in the regular Functional API
- output = ...
- model = keras.Model(text_input, output)
-
Download the data from Andrew Maas's Stanford page and unpack it. The train/pos directory contains 12,500 files, each holding one positive-sentiment movie review; these are used for training. The negative-sentiment reviews are under the neg directory. A possible download sketch (with an assumed URL) follows.
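A rough download-and-unpack sketch; the URL and the removal of the unlabeled train/unsup reviews are assumptions based on the standard aclImdb release, not taken from the original post:
- import shutil, tarfile
- from tensorflow import keras
-
- archive = keras.utils.get_file(
-     fname="aclImdb_v1.tar.gz",
-     origin="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",  # assumed URL
- )
- with tarfile.open(archive) as tar:
-     tar.extractall(".")  # creates ./aclImdb
- shutil.rmtree("aclImdb/train/unsup", ignore_errors=True)  # the unlabeled reviews are not needed here
-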
In [11]:
- import os, pathlib, shutil, random
-
- # define paths for the base, validation, and training directories
- base_dir = pathlib.Path("aclImdb")
- val_dir = base_dir / "val"
- train_dir = base_dir / "train"
-
In [12]:
- base_dir
-
Out[12]:
- PosixPath('aclImdb')
-
In [13]:
- val_dir
-
Out[13]:
- PosixPath('aclImdb/val')
-
In [14]:
- train_dir
-
Out[14]:
- PosixPath('aclImdb/train')
-
In [15]:
- for category in ("neg", "pos"):
- os.makedirs(val_dir / category) # 创建两个验证集目录
- files = os.listdir(train_dir / category) # 训练集目录下的全部文件(正负)
-
- random.Random(1337).shuffle(files) # 随机打乱数据
- num_val_samples = int(0.2*len(files)) # 20%
- val_files = files[-num_val_samples:] # 倒序切片 [-n:]
-
- for fname in val_files: # 对文件遍历:将train_dir中的文件一定到val_dir中
- shutil.move(train_dir / category / fname, val_dir / category / fname)
-
Use text_dataset_from_directory to create the datasets.
In [16]:
- from tensorflow import keras
- batch_size = 32
-
- train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size)
- val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
- test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)
- Found 12800 files belonging to 2 classes.
- Found 3200 files belonging to 2 classes.
- Found 25000 files belonging to 2 classes.
-
Running this code reports, for each split, how many files belonging to 2 classes were found (here 12,800 training, 3,200 validation, and 25,000 test files).
The dataset yields inputs that are TensorFlow tf.string tensors and targets that are int32 tensors with values 0 or 1.
In [17]:
- # display the shape and dtype of the first batch
-
- for inputs, targets in train_ds:
-     print("inputs_shape", inputs.shape)
-     print("inputs_dtype", inputs.dtype)
-     print("targets.shape", targets.shape)
-     print("targets.dtype", targets.dtype)
-     print("inputs[0]", inputs[0])
-     print("targets[0]", targets[0])
-     break  # only inspect the first batch
-
A simple way to encode text is to discard word order and treat the text as a set (a "bag") of tokens. The tokens can be individual words (unigrams) or groups of consecutive words (N-grams).
The whole text is then viewed as a single vector in which each element indicates whether a given word is present.
Binary (multi-hot) encoding represents the text as a vector whose dimensionality equals the number of words in the vocabulary.
Nearly all elements of that vector are 0; only the words that are present are set to 1. A tiny hand-rolled illustration of this encoding follows.
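A toy multi-hot sketch (my own illustration of the idea, not how the Keras layer is implemented): one dimension per vocabulary word, set to 1 if the word occurs in the text:
- import numpy as np
-
- vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3, "dog": 4}  # toy vocabulary
-
- def multi_hot(tokens, vocab):
-     vec = np.zeros(len(vocab))
-     for t in tokens:
-         if t in vocab:  # unknown words are simply ignored here
-             vec[vocab[t]] = 1.0
-     return vec
-
- print(multi_hot("the cat sat on the mat".split(), vocab))  # [1. 1. 1. 1. 0.]
-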
In [18]:
- text_vectorization = TextVectorization(
-     max_tokens=20000,
-     output_mode="multi_hot",  # key point: binary (multi-hot) encoding
- )
-
In [19]:
- # prepare a dataset that contains only the text inputs, without the labels
- text_only_train_ds = train_ds.map(lambda x,y: x)
- text_vectorization.adapt(text_only_train_ds)
-
In [20]:
- binary_lgram_train_ds = train_ds.map(lambda x,y: (text_vectorization(x),y),num_parallel_calls=4)
- binary_lgram_val_ds = val_ds.map(lambda x,y: (text_vectorization(x),y),num_parallel_calls=4)
- binary_lgram_test_ds = test_ds.map(lambda x,y: (text_vectorization(x),y),num_parallel_calls=4)
-
In [21]:
- # inspect the output of the unigram binary-encoded dataset
-
- for inputs, targets in binary_lgram_train_ds:
-     print("inputs_shape", inputs.shape)
-     print("inputs_dtype", inputs.dtype)
-     print("targets.shape", targets.shape)
-     print("targets.dtype", targets.dtype)
-     print("inputs[0]", inputs[0])
-     print("targets[0]", targets[0])
-     break  # only inspect the first batch
-
This model is reused throughout this section.
In [22]:
- from tensorflow import keras
- from tensorflow.keras import layers
-
- def get_model(max_token=20000, hidden_dim=16):
-     inputs = keras.Input(shape=(max_token,))  # input layer
-     x = layers.Dense(hidden_dim, activation="relu")(inputs)  # hidden layer
-     x = layers.Dropout(0.5)(x)  # dropout layer to reduce overfitting
-     outputs = layers.Dense(1, activation="sigmoid")(x)  # output layer
-     model = keras.Model(inputs, outputs)  # instantiate the Model
-     model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])  # compile the model
-
-     return model
-
In [23]:
- model = get_model()
- model.summary()
- Model: "model"
- _________________________________________________________________
- Layer (type) Output Shape Param #
- =================================================================
- input_1 (InputLayer) [(None, 20000)] 0
-
- dense (Dense) (None, 16) 320016
-
- dropout (Dropout) (None, 16) 0
-
- dense_1 (Dense) (None, 1) 17
-
- =================================================================
- Total params: 320,033
- Trainable params: 320,033
- Non-trainable params: 0
- _________________________________________________________________
-
In [24]:
- callbacks = [keras.callbacks.ModelCheckpoint("binary_lgram_keras",
- save_best_only=True)]
-
- model.fit(binary_lgram_train_ds.cache(),
- validation_data=binary_lgram_val_ds.cache(),
- epochs=10,
- callbacks=callbacks
- )
-
- model = keras.models.load_model("binary_lgram_keras")
-
- print(f'Test acc', model.evaluate(binary_lgram_test_ds))
- Epoch 1/10
- 392/400 [============================>.] - ETA: 0s - loss: 0.4510 - accuracy: 0.7995INFO:tensorflow:Assets written to: binary_lgram_keras/assets
- 400/400 [==============================] - 7s 14ms/step - loss: 0.4488 - accuracy: 0.8008 - val_loss: 0.3165 - val_accuracy: 0.8806
- Epoch 2/10
- 398/400 [============================>.] - ETA: 0s - loss: 0.2740 - accuracy: 0.8956INFO:tensorflow:Assets written to: binary_lgram_keras/assets
- 400/400 [==============================] - 4s 10ms/step - loss: 0.2743 - accuracy: 0.8952 - val_loss: 0.2996 - val_accuracy: 0.8834
- Epoch 3/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.2313 - accuracy: 0.9175 - val_loss: 0.3097 - val_accuracy: 0.8869
- Epoch 4/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.2019 - accuracy: 0.9303 - val_loss: 0.3240 - val_accuracy: 0.8875
- Epoch 5/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1885 - accuracy: 0.9398 - val_loss: 0.3431 - val_accuracy: 0.8831
- Epoch 6/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1763 - accuracy: 0.9412 - val_loss: 0.3683 - val_accuracy: 0.8838
- Epoch 7/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1740 - accuracy: 0.9477 - val_loss: 0.3835 - val_accuracy: 0.8841
- Epoch 8/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1541 - accuracy: 0.9491 - val_loss: 0.4029 - val_accuracy: 0.8853
- Epoch 9/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1664 - accuracy: 0.9514 - val_loss: 0.4074 - val_accuracy: 0.8819
- Epoch 10/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1521 - accuracy: 0.9532 - val_loss: 0.4303 - val_accuracy: 0.8831
- 782/782 [==============================] - 6s 7ms/step - loss: 0.2969 - accuracy: 0.8810
- Test acc [0.2969053089618683, 0.8809599876403809]
-
The TextVectorization layer can return arbitrary N-grams via the ngrams=N argument; the short sketch below illustrates what the extra bigram tokens look like.
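A rough illustration (my own helper, not the layer's internal logic) of the tokens produced when bigrams are added on top of unigrams:
- def with_bigrams(text):
-     words = text.lower().split()
-     return words + [" ".join(pair) for pair in zip(words, words[1:])]
-
- print(with_bigrams("the cat sat on the mat"))
- # ['the', 'cat', 'sat', 'on', 'the', 'mat',
- #  'the cat', 'cat sat', 'sat on', 'on the', 'the mat']
-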
In [25]:
- # return bigrams as well as unigrams
-
- text_vectorization = TextVectorization(ngrams=2,
- max_tokens=20000,
- output_mode="multi_hot",
- )
-
In [26]:
- text_vectorization.adapt(text_only_train_ds)
-
- binary_2gram_train_ds = train_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
- binary_2gram_val_ds = val_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
- binary_2gram_test_ds = test_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
-
- # instantiate the model
- model = get_model()
- model.summary()
-
-
- callbacks = [keras.callbacks.ModelCheckpoint("binary_2gram_keras",
- save_best_only=True)]
-
- model.fit(binary_2gram_train_ds.cache(),
- validation_data=binary_2gram_val_ds.cache(),
- epochs=10,
- callbacks=callbacks
- )
-
- model = keras.models.load_model("binary_2gram_keras")
-
- print(f'Test acc', model.evaluate(binary_2gram_test_ds))
- Model: "model_1"
- _________________________________________________________________
- Layer (type) Output Shape Param #
- =================================================================
- input_2 (InputLayer) [(None, 20000)] 0
-
- dense_2 (Dense) (None, 16) 320016
-
- dropout_1 (Dropout) (None, 16) 0
-
- dense_3 (Dense) (None, 1) 17
-
- =================================================================
- Total params: 320,033
- Trainable params: 320,033
- Non-trainable params: 0
- _________________________________________________________________
- Epoch 1/10
- 398/400 [============================>.] - ETA: 0s - loss: 0.4216 - accuracy: 0.8215INFO:tensorflow:Assets written to: binary_2gram_keras/assets
- 400/400 [==============================] - 7s 15ms/step - loss: 0.4211 - accuracy: 0.8220 - val_loss: 0.2985 - val_accuracy: 0.8891
- Epoch 2/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.2421 - accuracy: 0.9115 - val_loss: 0.2988 - val_accuracy: 0.8888
- Epoch 3/10
- 400/400 [==============================] - 3s 8ms/step - loss: 0.1938 - accuracy: 0.9348 - val_loss: 0.3240 - val_accuracy: 0.8913
- Epoch 4/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1696 - accuracy: 0.9445 - val_loss: 0.3459 - val_accuracy: 0.8881
- Epoch 5/10
- 400/400 [==============================] - 2s 6ms/step - loss: 0.1477 - accuracy: 0.9542 - val_loss: 0.3751 - val_accuracy: 0.8894
- Epoch 6/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1381 - accuracy: 0.9601 - val_loss: 0.4036 - val_accuracy: 0.8872
- Epoch 7/10
- 400/400 [==============================] - 2s 6ms/step - loss: 0.1362 - accuracy: 0.9623 - val_loss: 0.4186 - val_accuracy: 0.8891
- Epoch 8/10
- 400/400 [==============================] - 2s 6ms/step - loss: 0.1336 - accuracy: 0.9640 - val_loss: 0.4406 - val_accuracy: 0.8863
- Epoch 9/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1357 - accuracy: 0.9641 - val_loss: 0.4575 - val_accuracy: 0.8881
- Epoch 10/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1286 - accuracy: 0.9663 - val_loss: 0.4638 - val_accuracy: 0.8884
- 782/782 [==============================] - 5s 6ms/step - loss: 0.2848 - accuracy: 0.8918
- Test acc [0.28478434681892395, 0.8918399810791016]
-
Using bigrams the test accuracy reaches 89.2%, compared with 88.1% for unigrams alone; a solid improvement.
The TextVectorization layer can also count how many times each word or N-gram occurs, i.e. build a histogram of the text:
- # count the occurrences of each bigram
-
- text_vectorization = TextVectorization(ngrams=2,
- max_tokens=20000,
- output_mode="count")
-
The drawback of raw counts: some words, such as "the" and "a", inevitably occur with high frequency yet are useless for modeling. How should this be handled?
Solution: normalize the counts, e.g. subtract the mean and divide by the variance. Better still, use TF-IDF: term frequency-inverse document frequency.
- text_vectorization = TextVectorization(ngrams=2,
-                                         max_tokens=20000,
-                                         output_mode="tf-idf")
-
The idea behind TF-IDF: how often a term appears in the current document matters, and so does how often it appears across all documents. If a term shows up in nearly every document, such as "the" or "a", it is no longer informative. TF-IDF combines these two signals.
TF (term frequency) is how often the term occurs within a document:
TF = (number of occurrences of the term in the document) / (total number of terms in the document)
IDF (inverse document frequency) needs a corpus as its reference and measures how selective the term is (in practice the logarithm of this ratio is usually taken):
IDF = (total number of documents in the corpus) / (number of documents containing the term)
A simplified implementation:
- import math
-
- def tfidf(term, document, dataset):  # simplified TF-IDF helper
-     term_freq = document.count(term)
-     doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
-     return term_freq / doc_freq
-
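Hypothetical usage of the tfidf() helper above, with each document represented as a list of tokens (a toy corpus of my own):
- docs = [
-     "a cat sat on the mat".split(),
-     "the dog ate the homework".split(),
-     "the fish and the chips".split(),
- ]
- print(round(tfidf("cat", docs[0], docs), 3))  # 1.443 -- "cat" is rare across the corpus, so it scores high
- print(round(tfidf("the", docs[0], docs), 3))  # 0.558 -- "the" appears everywhere, so it scores low
-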
In [27]:
- text_vectorization = TextVectorization(
- ngrams=2,
- max_tokens=20000,
- output_mode="tf-idf" # 选择输出模式
- )
-
In [28]:
- text_vectorization.adapt(text_only_train_ds)
-
- tfidf_2gram_train_ds = train_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
- tfidf_2gram_val_ds = val_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
- tfidf_2gram_test_ds = test_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
-
- # instantiate the model
- model = get_model()
- model.summary()
-
-
- callbacks = [keras.callbacks.ModelCheckpoint("tfidf_2gram_keras",
- save_best_only=True)]
-
- model.fit(tfidf_2gram_train_ds.cache(),
- validation_data=tfidf_2gram_val_ds.cache(),
- epochs=10,
- callbacks=callbacks
- )
-
- model = keras.models.load_model("tfidf_2gram_keras")
-
- print(f'Test acc', model.evaluate(tfidf_2gram_test_ds))
- Model: "model_2"
- _________________________________________________________________
- Layer (type) Output Shape Param #
- =================================================================
- input_3 (InputLayer) [(None, 20000)] 0
-
- dense_4 (Dense) (None, 16) 320016
-
- dropout_2 (Dropout) (None, 16) 0
-
- dense_5 (Dense) (None, 1) 17
-
- =================================================================
- Total params: 320,033
- Trainable params: 320,033
- Non-trainable params: 0
- _________________________________________________________________
- Epoch 1/10
- 396/400 [============================>.] - ETA: 0s - loss: 0.5398 - accuracy: 0.7569INFO:tensorflow:Assets written to: tfidf_2gram_keras/assets
- 400/400 [==============================] - 10s 21ms/step - loss: 0.5393 - accuracy: 0.7577 - val_loss: 0.3512 - val_accuracy: 0.8644
- Epoch 2/10
- 398/400 [============================>.] - ETA: 0s - loss: 0.3277 - accuracy: 0.8609INFO:tensorflow:Assets written to: tfidf_2gram_keras/assets
- 400/400 [==============================] - 4s 9ms/step - loss: 0.3288 - accuracy: 0.8603 - val_loss: 0.3308 - val_accuracy: 0.8775
- Epoch 3/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.2839 - accuracy: 0.8780 - val_loss: 0.3535 - val_accuracy: 0.8856
- Epoch 4/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.2462 - accuracy: 0.8916 - val_loss: 0.3800 - val_accuracy: 0.8772
- Epoch 5/10
- 400/400 [==============================] - 3s 6ms/step - loss: 0.2380 - accuracy: 0.8948 - val_loss: 0.4089 - val_accuracy: 0.8712
- Epoch 6/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.2129 - accuracy: 0.9093 - val_loss: 0.4162 - val_accuracy: 0.8863
- Epoch 7/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.2116 - accuracy: 0.9071 - val_loss: 0.4390 - val_accuracy: 0.8744
- Epoch 8/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1900 - accuracy: 0.9180 - val_loss: 0.4753 - val_accuracy: 0.8794
- Epoch 9/10
- 400/400 [==============================] - 3s 7ms/step - loss: 0.1894 - accuracy: 0.9192 - val_loss: 0.4709 - val_accuracy: 0.8766
- Epoch 10/10
- 400/400 [==============================] - 3s 8ms/step - loss: 0.1878 - accuracy: 0.9184 - val_loss: 0.5033 - val_accuracy: 0.8706
- 782/782 [==============================] - 6s 6ms/step - loss: 0.3094 - accuracy: 0.8807
- Test acc [0.30939018726348877, 0.8807200193405151]
-
-
The history of deep learning is one of progressively doing away with manual feature engineering, letting models learn features from the data alone.
A sequence model does not rely on hand-crafted order-based features; it looks at the raw sequence of words directly and discovers such features on its own.
To implement a sequence model, the text first needs to be turned into sequences of integer indices:
In [32]:
- # prepare the data for the sequence model
- from tensorflow.keras import layers
-
- max_length = 600  # truncate each review after 600 words
- max_tokens = 20000  # keep only the 20,000 most frequent words
- text_vectorization = layers.TextVectorization(  # vectorization layer
-     max_tokens=max_tokens,
-     output_mode="int",
-     output_sequence_length=max_length,
- )
-
- text_vectorization.adapt(text_only_train_ds)
-
- int_train_ds = train_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
- int_val_ds = val_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
- int_test_ds = test_ds.map(
- lambda x,y: (text_vectorization(x),y),
- num_parallel_calls=4)
-
In [33]:
- import tensorflow as tf
- inputs = keras.Input(shape=(None,), dtype="int64")
-
- embedded = tf.one_hot(inputs, depth=max_tokens)  # one-hot encode each integer as a 20,000-dimensional binary vector
- x = layers.Bidirectional(layers.LSTM(32))(embedded)  # add a bidirectional LSTM
- x = layers.Dropout(0.5)(x)
- outputs = layers.Dense(1, activation="sigmoid")(x)  # the final layer is the classifier
-
- model = keras.Model(inputs, outputs)
- model.compile(optimizer="rmsprop",
- loss="binary_crossentropy",
- metrics=["accuracy"]
- )
-
- model.summary()
- Model: "model_4"
- _________________________________________________________________
- Layer (type) Output Shape Param #
- =================================================================
- input_5 (InputLayer) [(None, None)] 0
-
- tf.one_hot_1 (TFOpLambda) (None, None, 20000) 0
-
- bidirectional_1 (Bidirectio (None, 64) 5128448
- nal)
-
- dropout_4 (Dropout) (None, 64) 0
-
- dense_7 (Dense) (None, 1) 65
-
- =================================================================
- Total params: 5,128,513
- Trainable params: 5,128,513
- Non-trainable params: 0
- _________________________________________________________________
-
-
In [*]:
- callbacks = [  # callbacks
-     keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
-                                     save_best_only=True
-                                     )]
-
- model.fit(int_train_ds,  # train the model
-           validation_data=int_val_ds,
-           epochs=10,
-           callbacks=callbacks
-           )
-
- model = keras.models.load_model("one_hot_bidir_lstm.keras")  # load the best saved model
-
This model trains very slowly here because the input is so large: each sample is encoded as a (600, 20000) matrix.
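One common way to make this cheaper (a sketch of my own, not part of the run above) is to replace the 20,000-dimensional one-hot input with a learned Embedding layer, so each token becomes a small dense vector before the LSTM:
- from tensorflow import keras
- from tensorflow.keras import layers
-
- max_tokens = 20000
-
- inputs = keras.Input(shape=(None,), dtype="int64")
- embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)  # 256-dim dense vectors instead of 20,000-dim one-hot
- x = layers.Bidirectional(layers.LSTM(32))(embedded)
- x = layers.Dropout(0.5)(x)
- outputs = layers.Dense(1, activation="sigmoid")(x)
- model = keras.Model(inputs, outputs)
- model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])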