1.11. Product Opinion Extraction
For the PDFs, see 产品观点提取1.pdf and 产品观点提取2.pdf.
The goal: determine what attitude is expressed toward which product.
Four entity types: product, comment noun, comment adjective, bank
Annotation caveats: noun boundaries are inconsistent across annotators, and the labeling is quite subjective
Two sub-tasks: BIO entity tagging (Begin / Inside / Outside) and sentiment classification
Final ranking is by the combined score of both tasks
Precision and recall can each be gamed in isolation (predict only sure cases for high precision, predict everything for high recall)
F1 combines the two
Kappa coefficient (for the sentiment task): accounts for imbalanced samples; the smaller it is, the weaker the agreement beyond chance
Accuracy is useless when the classes are imbalanced; a quick sketch follows
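A minimal sketch of how these metrics behave on imbalanced labels, using scikit-learn (the label arrays are made up for illustration):
# Made-up labels for an imbalanced 3-class sentiment task (illustration only).
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

y_true = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1]  # class 2 dominates
y_pred = [2] * 10                        # degenerate "always predict 2" model

print(accuracy_score(y_true, y_pred))             # 0.8 -- looks strong, but is not
print(f1_score(y_true, y_pred, average='macro'))  # ~0.30 -- the missed classes hurt
print(cohen_kappa_score(y_true, y_pred))          # 0.0 -- no agreement beyond chance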
NER (named entity recognition): finding whatever spans you care about in text, framed as sequence tagging
It is an important foundation for information extraction, question answering, syntactic parsing, machine translation, and much other ML work
Deep learning approaches:
LSTM / BiLSTM
LSTM-CRF
BERT-CRF
The CRF layer sits outside (on top of) the neural network: the network produces the emission scores, the CRF contributes the transition scores, and Viterbi decoding fuses the two models to find the best tag path. A tiny numeric example follows.
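Concretely, the score of a single tag path is the sum of per-step emission scores plus tag-to-tag transition scores; this is the quantity _score_sentence computes below. A sketch with made-up matrices, ignoring START/STOP for brevity:
# Illustration only: made-up scores for 3 tags over 2 timesteps.
import torch
emissions = torch.tensor([[1.0, 0.2, 0.1],   # tag scores at step 0
                          [0.3, 2.0, 0.5]])  # tag scores at step 1
transitions = torch.zeros(3, 3)              # transitions[i, j]: score of moving *to* i *from* j
path = [0, 1]                                # a candidate tag sequence
score = emissions[0, path[0]]
for t in range(1, len(path)):
    score = score + transitions[path[t], path[t - 1]] + emissions[t, path[t]]
print(score)  # tensor(3.)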
1.12. Defining a BiLSTM-CRF in PyTorch
# read data
import pandas as pd
train = pd.read_csv('./train_data_public.csv')
test = pd.read_csv('./test_public.csv')
# split the space-separated BIO annotation string into a list of tags
train['BIO_anno'] = train['BIO_anno'].apply(lambda x : x.split(' '))
# pair the character list of each text with its BIO tag list
train['train_Data'] = train.apply(lambda row: (list(row['text']), row['BIO_anno']), axis=1)
train
 | id | text | BIO_anno | class | train_Data
---|---|---|---|---|---
0 | 0 | 交行14年用过,半年准备提额,却直接被降到1K,半年期间只T过一次三千,其它全部真实消费,第... | [B-BANK, I-BANK, O, O, O, O, O, O, O, O, O, O,... | 0 | ([交, 行, 1, 4, 年, 用, 过, ,, 半, 年, 准, 备, 提, 额, ,,... |
1 | 1 | 单标我有了,最近visa双标返现活动好 | [B-PRODUCT, I-PRODUCT, O, O, O, O, O, O, B-PRO... | 1 | ([单, 标, 我, 有, 了, ,, 最, 近, v, i, s, a, 双, 标, 返,... |
2 | 2 | 建设银行提额很慢的…… | [B-BANK, I-BANK, I-BANK, I-BANK, B-COMMENTS_N,... | 0 | ([建, 设, 银, 行, 提, 额, 很, 慢, 的, …, …], [B-BANK, I... |
3 | 3 | 我的怎么显示0.25费率,而且不管分多少期都一样费率,可惜只有69k | [O, O, O, O, O, O, O, O, O, O, B-COMMENTS_N, I... | 2 | ([我, 的, 怎, 么, 显, 示, 0, ., 2, 5, 费, 率, ,, 而, 且,... |
4 | 4 | 利率不错,可以撸 | [B-COMMENTS_N, I-COMMENTS_N, B-COMMENTS_ADJ, I... | 1 | ([利, 率, 不, 错, ,, 可, 以, 撸], [B-COMMENTS_N, I-CO... |
... | ... | ... | ... | ... | ... |
7523 | 7523 | 我鼎级拒了 | [O, O, O, B-COMMENTS_ADJ, O] | 2 | ([我, 鼎, 级, 拒, 了], [O, O, O, B-COMMENTS_ADJ, O]) |
7524 | 7524 | 一打一个准,准胜,看激活信用卡时那协议,全是对银行有利的 | [O, O, O, O, O, O, O, O, O, O, B-COMMENTS_N, I... | 2 | ([一, 打, 一, 个, 准, ,, 准, 胜, ,, 看, 激, 活, 信, 用, 卡,... |
7525 | 7525 | 招行分期白80k | [B-BANK, I-BANK, B-PRODUCT, I-PRODUCT, I-PRODU... | 2 | ([招, 行, 分, 期, 白, 8, 0, k], [B-BANK, I-BANK, B-... |
7526 | 7526 | 5万,额度还行吧没毕业哦 | [O, O, O, B-COMMENTS_N, I-COMMENTS_N, O, O, O,... | 2 | ([5, 万, ,, 额, 度, 还, 行, 吧, 没, 毕, 业, 哦], [O, O, ... |
7527 | 7527 | 张家港农商、江阴农商、无锡农商试试 | [B-BANK, I-BANK, I-BANK, I-BANK, I-BANK, O, B-... | 2 | ([张, 家, 港, 农, 商, 、, 江, 阴, 农, 商, 、, 无, 锡, 农, 商,... |
7528 rows × 5 columns
# test set: punctuation is kept too (it will be tagged O)
test['test_Data'] = test.apply(lambda row: (list(row['text'])), axis=1)
train_txt = []
for i in range(len(train)):
train_txt.append(train.loc[i, 'train_Data'])
test_txt = []
for i in range(len(test)):
test_txt.append(test.loc[i, 'test_Data'])
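The two loops above can also be written more idiomatically with pandas, giving the same lists:
train_txt = train['train_Data'].tolist()
test_txt = test['test_Data'].tolist()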
Load PyTorch
# !pip install torch
Requirement already satisfied: torch in c:\users\19723\anaconda3\lib\site-packages (1.10.2)
Requirement already satisfied: typing-extensions in c:\users\19723\anaconda3\lib\site-packages (from torch) (4.1.1)
1.13. Check the PyTorch version
import torch
torch.__version__
'1.10.2+cpu'
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
torch.manual_seed(1)
# return the index of the max value in a vector, as a Python int
def argmax(vec):
_, idx = torch.max(vec, 1)
return idx.item()
# convert a sentence (a list of tokens) into a tensor of IDs
def prepare_sequence(seq, to_ix):
idxs = [to_ix[w] for w in seq]
return torch.tensor(idxs, dtype=torch.long)
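For example, with a hypothetical four-character vocabulary:
# Hypothetical toy vocabulary, just to illustrate the mapping.
toy_ix = {'利': 0, '率': 1, '不': 2, '错': 3}
prepare_sequence(list('利率不错'), toy_ix)  # tensor([0, 1, 2, 3])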
# Compute log-sum-exp in a numerically stable way for the forward algorithm
def log_sum_exp(vec):
max_score = vec[0, argmax(vec)]
max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
return max_score + \
torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
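This hand-rolled version should agree with PyTorch's built-in torch.logsumexp, which makes a quick sanity check:
# Sanity check against the built-in, also numerically stable, implementation.
vec = torch.randn(1, 5)
assert torch.allclose(log_sum_exp(vec), torch.logsumexp(vec, dim=1)[0])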
class BiLSTM_CRF(nn.Module):
def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
super(BiLSTM_CRF, self).__init__()
self.embedding_dim = embedding_dim
self.hidden_dim = hidden_dim
self.vocab_size = vocab_size
self.tag_to_ix = tag_to_ix
self.tagset_size = len(tag_to_ix)
self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
num_layers=1, bidirectional=True)
        # Map the output of the BiLSTM into tag space: these are the emission scores.
self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)
        # Matrix of transition parameters (randomly initialized).
        # Entry (i, j) is the score of transitioning *to* tag i *from* tag j.
self.transitions = nn.Parameter(
torch.randn(self.tagset_size, self.tagset_size))
        # These two statements enforce the constraints that we never transition
        # *to* the start tag and never transition *from* the stop tag.
        # Transitions into START get a very low score, so they are never chosen:
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        # Transitions out of STOP likewise get a very low score:
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000
self.hidden = self.init_hidden()
    # initialize the LSTM hidden state (h_0, c_0)
def init_hidden(self):
return (torch.randn(2, 1, self.hidden_dim // 2),
torch.randn(2, 1, self.hidden_dim // 2))
def _forward_alg(self, feats):
        # Do the forward algorithm to compute the (log) partition function
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score: only the START entry begins at 0
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.
# Wrap in a variable so that we will get automatic backprop
forward_var = init_alphas
        # Iterate through the sentence, one timestep at a time
for feat in feats:
alphas_t = [] # The forward tensors at this timestep
for next_tag in range(self.tagset_size):
# broadcast the emission score: it is the same regardless of
# the previous tag
emit_score = feat[next_tag].view(
1, -1).expand(1, self.tagset_size)
# the ith entry of trans_score is the score of transitioning to
# next_tag from i
trans_score = self.transitions[next_tag].view(1, -1)
# The ith entry of next_tag_var is the value for the
# edge (i -> next_tag) before we do log-sum-exp
                # current path score = previous timestep + transition + emission
next_tag_var = forward_var + trans_score + emit_score
# The forward variable for this tag is log-sum-exp of all the
# scores.
                # reduce over previous tags with log-sum-exp (numerically stable)
alphas_t.append(log_sum_exp(next_tag_var).view(1))
            # update forward_var for the next timestep
forward_var = torch.cat(alphas_t).view(1, -1)
        # finally account for the transition to STOP_TAG
terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        # the final score is the log partition function
alpha = log_sum_exp(terminal_var)
return alpha
def _get_lstm_features(self, sentence):
        # extract per-token features with the BiLSTM
self.hidden = self.init_hidden()
embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
lstm_out, self.hidden = self.lstm(embeds, self.hidden)
lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
lstm_feats = self.hidden2tag(lstm_out)
return lstm_feats
def _score_sentence(self, feats, tags):
        # Gives the score of a provided tag sequence (a single path)
score = torch.zeros(1)
tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
for i, feat in enumerate(feats):
            # accumulate transition and emission scores step by step
score = score + \
self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
return score
    # decode the best-scoring tag path with Viterbi
def _viterbi_decode(self, feats):
backpointers = []
# Initialize the viterbi variables in log space
init_vvars = torch.full((1, self.tagset_size), -10000.)
init_vvars[0][self.tag_to_ix[START_TAG]] = 0
# forward_var at step i holds the viterbi variables for step i-1
forward_var = init_vvars
for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
viterbivars_t = [] # holds the viterbi variables for this step
for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning from tag i to
                # next_tag: only the previous step's score and the transition
                # matter here. We don't include the emission scores because the
                # max does not depend on them (they are added in below).
next_tag_var = forward_var + self.transitions[next_tag]
best_tag_id = argmax(next_tag_var)
bptrs_t.append(best_tag_id)
viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
# Now add in the emission scores, and assign forward_var to the set
# of viterbi variables we just computed
forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
backpointers.append(bptrs_t)
# Transition to STOP_TAG
terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
best_tag_id = argmax(terminal_var)
path_score = terminal_var[0][best_tag_id]
# Follow the back pointers to decode the best path.
best_path = [best_tag_id]
for bptrs_t in reversed(backpointers):
best_tag_id = bptrs_t[best_tag_id]
best_path.append(best_tag_id)
        # Pop off the start tag (we don't want to return that to the caller)
start = best_path.pop()
assert start == self.tag_to_ix[START_TAG] # Sanity check
best_path.reverse()
return path_score, best_path
    # the loss: negative log-likelihood = log partition - gold path score
def neg_log_likelihood(self, sentence, tags):
feats = self._get_lstm_features(sentence)
forward_score = self._forward_alg(feats)
gold_score = self._score_sentence(feats, tags)
return forward_score - gold_score
    # inference: compute emission scores with the BiLSTM, then Viterbi-decode
    def forward(self, sentence):  # don't confuse this with _forward_alg above
# Get the emission scores from the BiLSTM
lstm_feats = self._get_lstm_features(sentence)
# Find the best path, given the features.
score, tag_seq = self._viterbi_decode(lstm_feats)
return score, tag_seq
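Before the real run below, a quick smoke test with a toy vocabulary and tag set (all sizes and inputs here are made up) confirms the class wires together:
# Toy smoke test: 5-word vocab, two real tags plus START/STOP.
START_TAG = "<START>"
STOP_TAG = "<STOP>"
toy_tag_to_ix = {"O": 0, "B-BANK": 1, START_TAG: 2, STOP_TAG: 3}
toy_model = BiLSTM_CRF(vocab_size=5, tag_to_ix=toy_tag_to_ix,
                       embedding_dim=4, hidden_dim=4)
with torch.no_grad():
    score, tag_seq = toy_model(torch.tensor([0, 1, 2], dtype=torch.long))
print(score, tag_seq)  # untrained, so the tags are essentially random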
Run training
# START_TAG = "<START>"
# STOP_TAG = "<STOP>"
# # embedding size and hidden-layer size
# EMBEDDING_DIM = 11
# HIDDEN_DIM = 6
# # Build the training data
# training_data = train_txt[:10000]  # effectively the full set
# word_to_ix = {}
# for sentence, tags in training_data:
# for word in sentence:
# if word not in word_to_ix:
# word_to_ix[word] = len(word_to_ix)
# len(word_to_ix)
2054
# # test data: map its characters into the same ID vocabulary
# testing_data = test_txt[:10000]  # effectively the full set
# for sentence in testing_data:
# for word in sentence:
# if word not in word_to_ix:
# word_to_ix[word] = len(word_to_ix)
# len(word_to_ix)
2205
# import pickle
# with open('./word_to_id.pkl', 'wb') as file:
# pickle.dump(word_to_ix, file)
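At inference time the same vocabulary can be restored (a sketch, assuming the file written above):
import pickle
with open('./word_to_id.pkl', 'rb') as file:
    word_to_ix = pickle.load(file)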
In this BIO annotation scheme, every character of a sentence gets one tag; the different entity types (product, comment noun, comment adjective, bank) become different tag labels, and the code below maps each label to an integer ID.
# tag_to_ix = {"B-BANK": 0, "I-BANK": 1, "B-PRODUCT": 2, "I-PRODUCT": 3, "O": 4, "B-COMMENTS_N":5, "I-COMMENTS_N":6, "B-COMMENTS_ADJ":7, "I-COMMENTS_ADJ":8, START_TAG:9, STOP_TAG:10}
# '''
# B-BANK          beginning of a bank entity
# I-BANK          inside a bank entity
# B-PRODUCT       beginning of a product entity
# I-PRODUCT       inside a product entity
# O               outside any annotated span
# B-COMMENTS_N    beginning of a user-comment noun
# I-COMMENTS_N    inside a user-comment noun
# B-COMMENTS_ADJ  beginning of a user-comment adjective
# I-COMMENTS_ADJ  inside a user-comment adjective
# '''
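When writing out predictions, the inverse mapping turns predicted tag IDs back into BIO labels (a sketch, assuming the tag_to_ix defined above):
# Invert tag_to_ix so model outputs can be decoded back into BIO labels.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}
[ix_to_tag[i] for i in [0, 1, 4]]  # ['B-BANK', 'I-BANK', 'O']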
# model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
# optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# # Check predictions before training: the parameters are random, so the output
# # is inaccurate; this just exercises the pipeline once
# with torch.no_grad():
#     # sentence characters -> ID list
#     precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
#     precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long)
#     # predict BIO tags with the model
#     print(model(precheck_sent))
(tensor(125.1090), [6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 5, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 6, 0, 7, 4, 3])
Training drives the loss down.
# from tqdm import tqdm
# # Make sure prepare_sequence from earlier in the LSTM section is loaded
# # 40+ epochs give noticeably better results; 10 keeps this run short
# for epoch in range(10):
#     for sentence, tags in tqdm(training_data):
#         # Step 1. Remember that PyTorch accumulates gradients.
#         # Zero them out before each instance so they don't carry over
#         model.zero_grad()
#         # Step 2. Get our inputs ready for the network, that is,
#         # turn the raw characters into tensors of word indices
#         sentence_in = prepare_sequence(sentence, word_to_ix)
#         targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)
#         # Step 3. Run our forward pass
#         loss = model.neg_log_likelihood(sentence_in, targets)
#         # Step 4. Backpropagate and update the parameters by
#         # calling optimizer.step()
#         loss.backward()
#         optimizer.step()
#     # checkpoint the model; with a modulus of 1 this saves every epoch
#     if (epoch+1) % 1 == 0:
#         file_name = 'model{}.pt'.format(epoch+1)
#         torch.save(model, file_name)
#         print('{} saved'.format(file_name))
# # Could also move the model and data to GPU with .cuda(), or port this to PaddlePaddle
# # Check predictions after training: the output is far more sensible now
# with torch.no_grad():
# precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
# print(model(precheck_sent))
# # We got it!
(tensor(633.8870), [0, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 6, 4, 4, 4, 4, 4, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 6, 4, 4, 4, 4, 2, 3, 4, 4, 4, 4, 5, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
To run on GPU, call .to(device) on both the model and the data, as sketched below.
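A minimal sketch, reusing the names from the training loop above (note that init_hidden would also need to create its tensors on the same device for the model to run end to end):
# Move the model and every input tensor onto the same device.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
sentence_in = prepare_sequence(sentence, word_to_ix).to(device)
targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long).to(device)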