AI Learning: Analogical Reasoning
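In this post we use pretrained GloVe vectors to implement analogical reasoning. Training word embeddings takes a large corpus, a lot of compute, and a long time, so in practice we usually use embeddings that others have already trained.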
import numpy as np
from w2v_utils import *
The code below loads the words of the vocabulary into words and returns a dictionary, word_to_vec_map, that maps each word to its GloVe vector.
words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
# Printing the whole vocabulary with print(words) exceeds the notebook's
# output rate limit ("IOPub data rate exceeded"), so print only its size
print(len(words))
print(type(word_to_vec_map))
<class 'dict'>
# print(word_to_vec_map)
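The source of read_glove_vecs in w2v_utils is not shown in this post. As a rough idea of what such a loader might do, here is a minimal sketch. It assumes the GloVe text format, where each line holds a word followed by its vector components separated by spaces. The name read_glove_vecs_sketch and the tiny in-memory sample are hypothetical and only serve as illustration.

```python
import io
import numpy as np

def read_glove_vecs_sketch(lines):
    """Parse GloVe-style text lines into (words, word_to_vec_map).

    `lines` is any iterable of strings, e.g. an open text file.
    This is an illustrative guess, not the actual w2v_utils code.
    """
    words = set()
    word_to_vec_map = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        words.add(word)
        # The remaining fields are the vector components
        word_to_vec_map[word] = np.array([float(x) for x in parts[1:]])
    return words, word_to_vec_map

# Made-up 3-dimensional sample standing in for the real 50-dimensional file
sample = io.StringIO("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
demo_words, demo_map = read_glove_vecs_sketch(sample)
print(sorted(demo_words))
print(demo_map["king"].shape)
```

With a real file, the same function could be called on an open file handle, for example `with open(path, encoding="utf-8") as f: read_glove_vecs_sketch(f)`.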
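1 - Cosine similarity
The similarity of two words is measured by the cosine of the angle between their vectors, using the formula below (u and v are the vectors of two different words):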
$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2} = \cos(\theta) \tag{1}$$
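$u \cdot v$ is the dot product (inner product) of the two vectors, $\|u\|_2$ is the norm of the vector $u$, and $\theta$ is the angle between $u$ and $v$. The more similar $u$ and $v$ are, the closer the cosine is to 1; the less similar they are, the closer it moves toward -1. The norm of $u$ is defined as: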
$$\|u\|_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$$
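As a quick check of the formula by hand, take two illustrative 2-dimensional vectors (chosen here for easy arithmetic, not taken from GloVe), $u = (3, 4)$ and $v = (4, 3)$:

$$u \cdot v = 3 \cdot 4 + 4 \cdot 3 = 24, \qquad \|u\|_2 = \sqrt{3^2 + 4^2} = 5, \qquad \|v\|_2 = \sqrt{4^2 + 3^2} = 5$$

$$\text{CosineSimilarity}(u, v) = \frac{24}{5 \cdot 5} = 0.96$$

The two vectors point in nearly the same direction, so the value is close to 1.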
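In the left panel of the figure, france and italy are both country names, so they are very similar: the angle between them is small and the cosine is large. The middle panel shows two unrelated words, whose angle is about 90 degrees. The right panel shows two difference vectors, rome - italy and france - paris. The order of country and capital is swapped between them, so they point in roughly opposite directions and the angle is close to 180 degrees.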
# Compute the cosine similarity of two word vectors using the formula above
def cosine_similarity(u, v):
    # Dot product of u and v
    dot = np.dot(u, v)
    # L2 norms of u and v
    norm_u = np.sqrt(np.sum(u * u))
    norm_v = np.sqrt(np.sum(v * v))
    # Cosine of the angle between u and v
    return dot / (norm_u * norm_v)
father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]
print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ",cosine_similarity(ball, crocodile))
print("cosine_similarity(france - paris, rome - italy) = ",cosine_similarity(france - paris, rome - italy))
cosine_similarity(father, mother) = 0.8909038442893615
cosine_similarity(ball, crocodile) = 0.2743924626137942
cosine_similarity(france - paris, rome - italy) = -0.6751479308174201
You can also change the words in the test above to see what cosine values other pairs of words produce.
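To experiment without loading the GloVe file, one option is to check the three geometric cases (same direction, 90 degrees, 180 degrees) on toy vectors. The sketch below is self-contained, so it defines its own helper, cosine_similarity_check, a hypothetical name; it uses np.linalg.norm, which computes the same L2 norm as the formula above. The vectors east and north are made up for illustration.

```python
import numpy as np

def cosine_similarity_check(u, v):
    """Cosine similarity, with the L2 norms computed by np.linalg.norm."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 2-D vectors, not real word embeddings
east = np.array([1.0, 0.0])
north = np.array([0.0, 1.0])

same = cosine_similarity_check(east, 2 * east)      # same direction
orthogonal = cosine_similarity_check(east, north)   # 90 degrees apart
opposite = cosine_similarity_check(east, -east)     # 180 degrees apart
print(same, orthogonal, opposite)  # prints: 1.0 0.0 -1.0
```

Note that the length of a vector does not matter: east and 2 * east still have similarity 1, because only the direction enters the cosine.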
2 - Analogical reasoning
# Given word_a -> word_b, infer the word that completes word_c -> ?
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    # Convert the words to lowercase
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    # Look up the GloVe vector of each word
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    words = word_to_vec_map.keys()
    max_cosine_sim = -100  # lower than any possible cosine value
    best_word = None
    # Loop over every word in the vocabulary
    for w in words:
        # Skip the three input words themselves
        if w in [word_a, word_b, word_c]:
            continue
        # Compare the direction of e_b - e_a with the direction of e_w - e_c
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        # Keep the word with the highest similarity seen so far
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
    return best_word
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))
italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger
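complete_analogy calls cosine_similarity once per vocabulary word inside a Python loop. The same search can be sketched with NumPy array operations, which is likely to be faster on a large vocabulary. The function name complete_analogy_vectorized and the toy 2-D embeddings below are hypothetical and made up for illustration; this is one possible variant, not part of the original code.

```python
import numpy as np

def complete_analogy_vectorized(word_a, word_b, word_c, word_to_vec_map):
    """Return the word w that maximizes cos(e_b - e_a, e_w - e_c)."""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    vocab = list(word_to_vec_map.keys())
    # Stack all embeddings into one matrix of shape (n_words, dim)
    E = np.stack([word_to_vec_map[w] for w in vocab])
    target = word_to_vec_map[word_b] - word_to_vec_map[word_a]
    diffs = E - word_to_vec_map[word_c]          # e_w - e_c for every w
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    norms[norms == 0] = 1.0   # avoid division by zero, e.g. the row of word_c
    sims = diffs @ target / norms
    # Exclude the three input words from the search
    for w in (word_a, word_b, word_c):
        sims[vocab.index(w)] = -np.inf
    return vocab[int(np.argmax(sims))]

# Toy 2-D embeddings (made up): second axis encodes gender, first axis age
toy_map = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "boy": np.array([2.0, 0.0]),
    "girl": np.array([2.0, 1.0]),
    "ball": np.array([0.0, -1.0]),
    "apple": np.array([5.0, 0.5]),
}
result = complete_analogy_vectorized("man", "woman", "boy", toy_map)
print(result)
```

In this toy setup, woman - man and girl - boy are the same vector, so girl reaches the maximum similarity of 1 and is returned.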
Free to repost: non-commercial, no derivatives, attribution required (Creative Commons 3.0 license).