BERT-based sentence scenario detector

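A couple of days ago I built a word-level detector model with a simple multilayer perceptron. The final layer of the model uses Softmax to turn the output layer into a classification.

For the scenario recognition problem, I have fixed the set of possible classes for now (for example Forest / Ocean / River / College / Suburb / etc.). On the one hand this simplifies the detector's workflow; on the other it suits our team's current resources (once the scenario is recognized we have to fetch a background prepared in advance, so if a new element were extracted there would be no background resource for it anyway).

Last week my idea was to first use word embeddings to turn a sentence into a sequence, then run sequence tagging on it with a Bi-LSTM, or directly with a linear CRF, to extract the words in the sentence that refer to a scenario. Finally, the word-level detector would analyze the selected words to produce a sentence-level scenario.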
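The last step of that pipeline, going from tagged words to a sentence-level scenario, could be sketched as a simple majority vote. This is only an illustration under assumptions: `word_level_detector` is a hypothetical stand-in (a dictionary lookup) for the trained word classifier, the tagger output is assumed to be given, and the words and labels are made up.

```python
from collections import Counter

# Hypothetical stand-in for the word-level detector: a real system would
# call the trained MLP here instead of looking words up in a dictionary.
TOY_WORD_SCENARIOS = {"tree": "Forest", "wolf": "Forest", "wave": "Ocean"}


def word_level_detector(word):
    """Return the scenario class of a word, or None if it is unknown."""
    return TOY_WORD_SCENARIOS.get(word.lower())


def sentence_scenario(words, is_scene_word):
    """Vote over the words that the sequence tagger marked as scene words.

    words         -- tokens of the sentence
    is_scene_word -- one boolean per token (the tagger output, assumed given)
    """
    votes = Counter()
    for word, tagged in zip(words, is_scene_word):
        scenario = word_level_detector(word) if tagged else None
        if scenario is not None:
            votes[scenario] += 1
    # Return None when no tagged word could be classified.
    return votes.most_common(1)[0][0] if votes else None


words = ["The", "wolf", "hid", "behind", "a", "tree", "near", "the", "wave"]
tags = [False, True, False, False, False, True, False, False, True]
print(sentence_scenario(words, tags))  # Forest
```

A vote is the simplest possible aggregation; weighting each word by the classifier's confidence would be a natural refinement.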

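However, after some experiments I ran into a problem: I only have fewer than 500 short children's stories on hand, and they are unlabeled. Even if I labeled all of them, that would yield fewer than 5,000 phrases. Because labeler resources are tight, I first used the first version of the vocabulary-based detector model to generate the labeling data. After feeding it into the CRF, I found that apart from person entities, the other classes basically could not be recognized effectively.

So this approach has been declared a failure for now.

On Friday night I was zoning out at the office and suddenly thought I could try brute force: use a sentence-level embedding directly as the input. I added CNN layers to this model, but in fact fully connected dense layers alone already achieve the same result on this dataset.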

# Build the model
from keras.models import Sequential
from keras.layers import (Activation, AveragePooling1D, Conv1D, Dense,
                          Dropout, Flatten)

model = Sequential([
    # The sentence embedding is treated as a 1D signal of length 768 with one channel
    Conv1D(filters=5, kernel_size=5, strides=1, padding='valid', input_shape=(768, 1), name="Convolution_Layer_1"),
    AveragePooling1D(pool_size=5, strides=1, padding="valid", name="Pooling_Layer_1"),

    Conv1D(filters=5, kernel_size=5, strides=1, padding='valid', name="Convolution_Layer_2"),
    AveragePooling1D(pool_size=5, strides=1, padding="valid", name="Pooling_Layer_2"),

    # 752 positions x 5 filters = 3760 features
    Flatten(name="Flatten_Layer"),

    Dense(256, input_dim=3760, name="Dense_Layer_1"),
    Activation('relu'),
    Dropout(0.1),

    Dense(32, input_dim=256, name="Dense_Layer_2"),
    Activation('relu'),
    Dropout(0.1),

    # 11 scenario classes
    Dense(11, input_dim=32, name="Dense_Layer_3"),
    Activation('softmax'),
])
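The `input_dim=3760` on the first dense layer follows from the layer shapes. Here is a quick check of that arithmetic, using the usual output-length formula for `padding='valid'`. The helper names are made up for this sketch, and 768 is presumably the embedding length because that is the hidden size of BERT-base.

```python
def valid_output_length(length, window, stride=1):
    """Output length of a 1D convolution or pooling with padding='valid'."""
    return (length - window) // stride + 1


def flattened_size(input_length=768, filters=5, window=5, blocks=2):
    """Flatten output size after `blocks` Conv1D + AveragePooling1D pairs."""
    length = input_length
    for _ in range(blocks):
        length = valid_output_length(length, window)  # Conv1D
        length = valid_output_length(length, window)  # AveragePooling1D
    return length * filters


print(flattened_size())  # 3760
```

Each convolution and each pooling step with a window of 5 and a stride of 1 trims 4 positions, so the length goes 768, 764, 760, 756, 752, and 752 positions times 5 filters gives 3760.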


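General-purpose scenario recognizer

Today is the first working day of the new year, and I'm happy just thinking that I only have to work three days this week.

Since sequence tagging needs a lot of labeled data and I don't have a data source for now, this afternoon I used the existing labeled dataset to build a Softmax scenario classifier first.

The principle is very simple: use Word2vec to generate word vectors (I may switch to BERT later, but BERT has not performed well for me these past two days, so I am building the demo with the tool I'm comfortable with first). Then, using the labeled dataset, all the words in the vocabulary are divided into a few major classes (the classes can be chosen to fit the concrete use case; my classification here is mainly meant to fit fairy tales).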

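# Build the model
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout

model = Sequential([
    # Input: a 200-dimensional Word2vec vector
    Dense(32, input_dim=200),
    Activation('relu'),
    Dropout(0.1),

    Dense(16, input_dim=32),
    Activation('relu'),

    # Output: 9 scenario classes (the previous layer has 16 units)
    Dense(9, input_dim=16),
    Activation('softmax'),
])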

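I built the simplest possible multilayer perceptron with Keras, added Dropout, and started feeding it data. In the end it reaches an accuracy of about 96%, which is basically usable.

test loss: 0.09276254528926478
test accuracy: 0.9666666666666667

This model can now recognize the scenario class of any word. The next step is to use sequence tagging to find where the scenario is described.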
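To make "recognizing the scenario class of a word" concrete, here is a small sketch of how a 9-way Softmax output can be decoded into a label. Everything in it is an assumption for illustration: the class names are hypothetical (the post does not list its nine categories) and the scores are made up rather than produced by the trained network.

```python
import math

# Hypothetical class names: the post does not list its nine categories.
CLASSES = ["Forest", "Ocean", "River", "College", "Suburb",
           "Castle", "Village", "Mountain", "Other"]


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(probabilities, classes=CLASSES):
    """Return the most probable class and its probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best]


# Made-up scores standing in for the network output for one word.
scores = [0.2, 3.1, 0.5, -1.0, 0.0, 0.3, -0.4, 0.1, 0.0]
label, confidence = decode(softmax(scores))
print(label)  # Ocean
```

Keeping the probability next to the label makes it possible to reject low-confidence words later instead of forcing every word into a class.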

Virtualenv pip update failure

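I have been bothered by pip updates for a long time. The pip bundled with Virtualenv is version 10.1, while pip itself has already evolved to 18, so every install reminds me to upgrade pip.

But for some reason, upgrading pip inside the virtualenv seemed to throw an error. And since skipping the upgrade doesn't affect anything in practice, I just never upgraded.

Today I finally couldn't stand it anymore, so I Googled for a fix. It turns out you only need to force the upgrade (my guess is that the problem comes from a Python type check). Here is the update command:

If you just want to update pip inside your virtualenv, skip the chatter above, type the following command, and hit Enter: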

python -m pip install -U --force-reinstall pip

Yep, it's that simple…
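If the upgrade has to be scripted, the same command can be assembled in Python. This is a minimal sketch, not part of the original fix: `pip_upgrade_command` is a made-up helper, and it relies on `sys.executable` pointing at the interpreter of the currently active virtualenv. It only prints the command; the commented line shows how it could be run.

```python
import shlex
import sys


def pip_upgrade_command(python=sys.executable):
    """Build the forced pip upgrade command for the given interpreter."""
    return [python, "-m", "pip", "install", "-U", "--force-reinstall", "pip"]


# Print the command so it can be inspected before running it.
print(" ".join(shlex.quote(part) for part in pip_upgrade_command()))

# To actually run it (raises CalledProcessError if pip fails):
# import subprocess
# subprocess.run(pip_upgrade_command(), check=True)
```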

How I recovered the Oculus Home on my Gear VR after coming back to China

It has been more than a month since I came back to my home country.

However, I didn't realize until yesterday that I had brought a Samsung Gear VR back with me. It is a fancy piece of equipment, but I ran into a bunch of trouble when I tried to start it up again in China.

At first, everything looked normal…

I entered the virtual space and found something new. The scene looked like the old one, but only two applications were shown (Samsung Gallery and Internet Browser).

The worst part was that my library and the App Store had vanished, which meant I couldn't download anything. So I uninstalled all the VR-related applications and services. In hindsight, that turned out to be a stupid idea…

After a lot of struggling, I finally got back the original version of Oculus Home (in other words, the international version). Here is what I did:

  1. If an activated Chinese SIM card is inserted, Oculus Home detects it and forces an update to the Chinese version. That is the source of the problem.
  2. So insert an overseas SIM card, deactivate your Chinese SIM card (if your phone has two SIM slots), and disable GPS.
  3. Turn on your VPN so that the relevant applications and services can be downloaded.
  4. Plug your phone into the Gear VR. All the VR-related services will be updated to the international version.
  5. The familiar screen will come back!

Start Windows

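These past two days I have started working in earnest.

Since my access to the code repository still hasn't been approved, I slacked off for the whole of yesterday.

Maybe because Microsoft is simply so huge, approvals for all kinds of processes are a bit slow. And for some reason, the onboarding guidance at check-in doesn't feel ideal either: a lot of things require calling around to ask, which undoubtedly adds communication cost.

In other respects, Microsoft's products are decent (apart from the frequent exceptions). As for the operating system, at Microsoft you obviously have to use Windows, but Windows really isn't very friendly to developers, and the shell is hard to use.

So I quietly searched for a round of Windows productivity tools… Hopefully that eases the current awkward situation a little…

Today I have a lunch date with Haoxiang, and I hope to learn from a senior's experience.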

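I never saw it coming, but I got the Microsoft offer after all…

Today, while I was revising KT, I suddenly got a call from the interviewer on the Microsoft XiaoIce team.

For some reason I picked up without thinking, expecting to be told that I had failed. I had prepared nothing for this interview, and it fell right in the chaos of exam season, so the whole interview process was really rough (I couldn't find a place for the interview and ended up doing it on a lawn full of pollen; I finally found a spot in Old Arts and then got kicked out by the security guard on night patrol…). By my own standards, I basically missed every shot throughout the interview.

On Monday I talked to the university about applying for the internship, and then messaged the Microsoft interviewer to ask how the interview had gone, but got no reply. So I imagined a hundred and eight possibilities and finally decided I had probably been rejected and had better just revise for my exams.

But I never expected him to call and tell me I had an offer. I was instantly dumbfounded. What a thrill.

I won't say more; for the next couple of days I'd better keep revising for my exams.

Finally, thanks to the Microsoft interviewer and the lovely girl from the team next door. See you after exams 🙂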

Be a survivor of a disaster

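These past two days I've been surviving entirely on Red Bull and coffee.

On November 1 I finished the first final of this semester, held in the Royal Exhibition Building. I looked at the seating chart: three thousand people sitting an exam together, what a treat. The exam hall's “service staff” were also very kind: seeing me draw a Burndown Chart by hand, one of them thoughtfully handed me a ruler (probably because my freehand drawing was too painful to look at).

This summer break is four months long. Staying home would be a lot more relaxing, but I feel that while I can still do internships I should push myself more. I originally thought the holidays were still some time away and planned to look for an internship after exams, but a couple of days ago I counted the dates and realized that if I didn't start now it would probably be too late, so I hurried to send out my résumé. Luckily, several senior people gave me a chance to try. The day after the exam, November 2, I scheduled my Microsoft interview. I had planned to take a day off to relax after the exam anyway, so I cleared the whole of the 2nd for the interview.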


891. Sum of Subsequence Widths

Given an array of integers A, consider all non-empty subsequences of A.
For any sequence S, let the width of S be the difference between the maximum and minimum element of S.
Return the sum of the widths of all subsequences of A.
As the answer may be very large, return the answer modulo 10^9 + 7.

Example 1:
Input: [2,1,3]
Output: 6
Explanation:
Subsequences are [1], [2], [3], [2,1], [2,3], [1,3], [2,1,3].
The corresponding widths are 0, 0, 0, 1, 1, 2, 2.
The sum of these widths is 6.

It is a pretty interesting question. Initially, I tried to use combinations from the itertools package. It works, but it times out…
So I realized it is really a math question.
After sorting A, there are i numbers no larger than A[i] to its left, so A[i] is the maximum of 2^i subsequences.
Meanwhile, there are len(A) - i - 1 numbers no smaller than A[i] to its right, so A[i] is the minimum of 2^(len(A) - i - 1) subsequences.
Each A[i] therefore contributes A[i] * (2^i - 2^(n - i - 1)) to the result, where n = len(A). Finally, we only need to take the sum modulo 10^9 + 7 to get the accepted answer.

def leetcode(A):
    MOD = 10 ** 9 + 7
    n = len(A)
    result = 0
    for i, a in enumerate(sorted(A)):
        # a is the maximum of 2^i subsequences and the minimum of 2^(n - i - 1)
        result += ((1 << i) - (1 << (n - i - 1))) * a
    return result % MOD
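One way to gain confidence in the formula is to compare it with the slow enumeration on small inputs. The sketch below is only a cross-check, with function names made up for this example. The closed-form variant also keeps the powers of two reduced modulo 10^9 + 7, which avoids very large intermediate integers. That is an optional tweak, not something the accepted solution needs in Python.

```python
from itertools import combinations

MOD = 10 ** 9 + 7


def width_sum_brute_force(A):
    """Enumerate every non-empty subsequence; exponential, small inputs only."""
    total = 0
    for size in range(1, len(A) + 1):
        for subset in combinations(A, size):
            total += max(subset) - min(subset)
    return total % MOD


def width_sum_closed_form(A):
    """Sorted-order formula, with powers of two kept reduced modulo MOD."""
    values = sorted(A)
    n = len(values)
    powers = [1] * n
    for i in range(1, n):
        powers[i] = powers[i - 1] * 2 % MOD
    total = 0
    for i, a in enumerate(values):
        # a is the maximum of 2^i subsequences and the minimum of 2^(n - i - 1)
        total += (powers[i] - powers[n - i - 1]) * a
    return total % MOD


print(width_sum_brute_force([2, 1, 3]))  # 6
print(width_sum_closed_form([2, 1, 3]))  # 6
```

The width of a subsequence only depends on which elements it contains, not on their order, so enumerating combinations is enough for the brute-force side.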

Ready Player One

I created the OASIS because I never felt at home in the real world. I didn’t know how to connect with the people there. I was afraid, for all of my life, right up until I knew it was ending. That was when I realized, as terrifying and painful as reality can be, it’s also the only place where you can find true happiness. Because reality is real.