This article was first published on CSDN.

The official fastText Python folder on GitHub: fastText/python at main · facebookresearch/fastText. This article is a basic tutorial for the fastText Python package, covering installation and simple usage.

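The sample Chinese text classification data used in this article comes from http://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv. Beyond the steps shown here, further preprocessing such as stop-word removal is also possible. For more fastText Python example code, see: fastText/python/doc/examples at master · facebookresearch/fastText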

1. Installing fastText

First install numpy, scipy, and pybind11.

I got numpy as a side effect of installing PyTorch, using the command `conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch`. To install numpy on its own, use `conda install numpy`.

Install SciPy: `conda install scipy` or `pip install scipy`.

Install pybind11 (see the "Installing the library" page of the official pybind11 documentation): `conda install -c conda-forge pybind11` or `pip install pybind11`.

Once the prerequisites are installed, install fastText: `pip install fasttext`

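2. Training and using a word vector model

I have done this with gensim before, and it would be worth comparing the two packages later. Note also that the baseline used in the fastText word vector paper is Google's official word2vec package: Google Code Archive - Long-term storage for Google Code Project Hosting.

2.1 Code implementation

Official detailed tutorial: Word representations · fastText (it uses an English Wikipedia corpus, whereas the experiments in this article use a Chinese corpus).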

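fastText has no built-in Chinese word segmentation, so the text has to be segmented manually beforehand. Reference code for preparing the data:

```python
import csv, jieba

# Assumes the CSV file is UTF-8 encoded.
with open('data/cls/ChnSentiCorp_htl_all.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # header row
    # Each element is a two-item list: the label (0 or 1, as an int) and the review text.
    data = [[int(row[0]), row[1]] for row in reader]

tofiledir = 'data/cls'
# Write one review per line, with jieba tokens separated by spaces.
with open(tofiledir + '/corpus.txt', 'w', encoding='utf-8') as f:
    f.writelines([' '.join(jieba.cut(row[1])) + '\n' for row in data])
```

In the resulting `corpus.txt`, each line holds one review with its tokens separated by spaces.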

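Code to learn the word vectors and display them:

```python
import fasttext

model = fasttext.train_unsupervised('data/cls/corpus.txt', model='skipgram')  # the model argument can be changed to `cbow`
print(model.words[:10])  # print the first 10 words
print(model[model.words[9]])  # print the vector of the 10th word
```

A word vector can also be obtained with `get_word_vector(word)`, which works even for words that never appeared in the data, because a word vector is in fact built from the sum of its subword vectors.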

Output:

```
Read 0M words
Number of words: 6736
Number of labels: 0
Progress: 100.0% words/sec/thread: 71833 lr: 0.000000 avg.loss: 2.396854 ETA: 0h 0m 0s
['，', '的', '。', ',', '了', '酒店', '是', '</s>', '很', '房間']
[ 1.44523270e-02 -1.14391923e-01 -1.31457284e-01 -1.59686044e-01
 -4.57017310e-02 2.04045177e-01 2.00106978e-01 1.63031772e-01
 1.71287894e-01 -2.93396801e-01 -1.01871997e-01 2.42363811e-01
 2.78942972e-01 -4.99058776e-02 -1.27043173e-01 2.87460908e-02
 3.73771787e-01 -1.69842303e-01 2.42533281e-01 -1.82482198e-01
 7.33817369e-02 2.21920848e-01 2.17794716e-01 1.68730497e-01
 2.16873884e-02 -3.15452456e-01 8.21631625e-02 -6.56387508e-02
 9.51113254e-02 1.69942483e-01 1.13980576e-01 1.15132451e-01
 3.28856230e-01 -4.43856061e-01 -5.13903908e-02 -1.74580872e-01
 4.39242758e-02 -2.22267807e-01 -1.09185934e-01 -1.62346154e-01
 2.11286068e-01 2.44934723e-01 -1.95910111e-02 2.33887792e-01
 -7.72107393e-02 -6.28366888e-01 -1.30844399e-01 1.01614185e-01
 -2.42928267e-02 4.28218693e-02 -3.78409088e-01 2.31552869e-01
 3.49486321e-02 8.70033056e-02 -4.75800633e-01 5.37340902e-02
 2.29140893e-02 3.87787819e-04 -5.77102527e-02 1.44286081e-03
 1.33415654e-01 2.14263964e-02 9.26891491e-02 -2.24226922e-01
 7.32692927e-02 -1.52607411e-01 -1.42978013e-01 -4.28122580e-02
 9.64387357e-02 7.77726322e-02 -4.48957413e-01 -6.19397573e-02
 -1.22236833e-01 -6.12100661e-02 -5.51685333e-01 -1.35704070e-01
 -1.66864052e-01 7.26311505e-02 -4.55838069e-02 -5.94963729e-02
 1.23811573e-01 6.13824800e-02 2.12341957e-02 -9.38200951e-02
 -1.40030123e-03 2.17677400e-01 -6.04508296e-02 -4.68601920e-02
 2.30288744e-01 -2.68855840e-01 7.73726255e-02 1.22143216e-01
 3.72817874e-01 -1.87924504e-01 -1.39104724e-01 -5.74962497e-01
 -2.42888659e-01 -7.35510439e-02 -6.01616681e-01 -2.18178451e-01]
```

To check the quality of the word vectors, look up a word's nearest neighbors (nn), which gives an intuitive feel for the semantic information the vectors capture. Distances between vectors are computed with cosine similarity. (In the official tutorial this works even for misspelled English words, but I could not think of a good way to try that with Chinese, so I skipped it.)

```python
print(model.get_nearest_neighbors('房間'))
```

Output:

```
[(0.804237425327301, '小房間'), (0.7725597023963928, '房屋'), (0.7687026858329773, '盡頭'), (0.7665393352508545, '第一間'), (0.7633816599845886, '但床'), (0.7551409006118774, '成舊'), (0.7520463466644287, '屋子裡'), (0.750516414642334, '壓抑'), (0.7492958903312683, '油漆味'), (0.7476236820220947, '知')]
```
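To make the "nearest neighbors by cosine similarity" idea concrete, here is a minimal pure-Python sketch. It is not fastText's implementation: the function names and the 3-dimensional toy vectors are made up for illustration, and a real model would hold vectors with `dim` components.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, k=3):
    """Return the k words most similar to `query` as (score, word) pairs."""
    scored = [
        (cosine_similarity(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query  # the query word itself is not a neighbor
    ]
    return sorted(scored, reverse=True)[:k]

# Toy vectors, invented for this example only.
toy_vectors = {
    "room": [1.0, 0.0, 0.0],
    "suite": [0.9, 0.1, 0.0],
    "breakfast": [0.0, 1.0, 0.0],
    "lobby": [0.5, 0.5, 0.0],
}

neighbors = nearest_neighbors("room", toy_vectors, k=2)
print(neighbors)
```

The result has the same shape as the list above: (similarity, word) pairs sorted from most to least similar.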

Word analogies (predict the word that relates to the third word in the same way the first two words relate to each other):

```python
print(model.get_analogies('房間','壓抑','環境'))
```

Output:

```
[(0.7665581703186035, '優越'), (0.7352521419525146, '地理位置'), (0.7330452799797058, '安靜'), (0.7157530784606934, '周邊環境'), (0.7050396800041199, '自然環境'), (0.6963807344436646, '服務到位'), (0.6960451602935791, '也好'), (0.6948464512825012, '優雅'), (0.6906660795211792, '地點'), (0.6869651079177856, '地理')]
```
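Analogy queries are commonly explained as vector arithmetic: build a query vector `a - b + c` and rank the remaining words by cosine similarity to it. The sketch below follows that common description under stated assumptions. The helper names and the toy vectors (axes loosely meaning "capital", "German", "French") are invented, and it should not be read as fastText's exact procedure.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity of two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def analogy(a, b, c, vectors, k=1):
    """Rank words by similarity to a - b + c, excluding the three inputs."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    query = [x - y + z for x, y, z in zip(va, vb, vc)]
    scored = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(scored, reverse=True)[:k]

# Toy vectors, invented for this example: [capital, German, French].
toy = {
    "berlin": [1.0, 1.0, 0.0],
    "germany": [0.0, 1.0, 0.0],
    "france": [0.0, 0.0, 1.0],
    "paris": [1.0, 0.0, 1.0],
    "hotel": [0.1, 0.7, 0.7],
}

best = analogy("berlin", "germany", "france", toy)
print(best)
```

Here "berlin" minus "germany" plus "france" lands closest to "paris", which is the kind of relation the analogy query is meant to surface.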

Other functions:

- `model.save_model(path)`
- `fasttext.load_model(path)`: returns the model

Other parameters of `fasttext.train_unsupervised()`:

- `dim`: vector dimension (default 100; values from 100 to 300 are all common)
- `minn`, `maxn`: minimum and maximum length of the subword character n-grams (defaults 3 and 6)
- `epoch` (default 5)
- `lr`: a higher learning rate converges faster but may overfit (default 0.05; the usual range is [0.01, 1])
- `thread` (stated here as 12 by default, although the parameter listing below gives the number of CPUs)

```
input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]
```

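fastText officially provides pretrained 300-dimensional word vectors for many languages: Wiki word vectors · fastText. Newer version: Word vectors for 157 languages · fastText

2.2 How it works

If you use the word vectors in a paper, cite: Enriching Word Vectors with Subword Information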

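skipgram and cbow probably need little introduction, since they are common knowledge in NLP. skipgram predicts the target word from a randomly chosen nearby word, while cbow predicts the target word from its context (the words within a window, combined for example by summing their vectors). The word vectors used by fastText also take subword information into account: a word is represented by the sum of the representations of its substrings. Compared with using word-level information alone, this captures richer semantics, is fast to compute, and can produce vectors for words that do not occur in the original corpus.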
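The subword idea can be sketched in a few lines. The sketch below wraps a word in `<` and `>` boundary markers, lists its character n-grams between `minn` and `maxn`, and combines the vectors of known n-grams by averaging. It is a simplification under stated assumptions: the function names and the toy n-gram vectors are invented, a plain dict stands in for the hashed `bucket` table, and averaging is only one way to combine the parts.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

def subword_vector(word, ngram_vectors, minn=3, maxn=6):
    """Average the vectors of the word's n-grams found in `ngram_vectors`."""
    known = [
        ngram_vectors[gram]
        for gram in char_ngrams(word, minn, maxn)
        if gram in ngram_vectors
    ]
    if not known:
        return None  # no known subword, so nothing to build a vector from
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Trigrams of "where" with boundary markers.
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Toy n-gram vectors, invented for this example only.
toy_ngram_vectors = {"<房間": [1.0, 0.0], "房間>": [0.0, 1.0]}
print(subword_vector("房間", toy_ngram_vectors))    # [0.5, 0.5]
print(subword_vector("小房間", toy_ngram_vectors))  # [0.0, 1.0]
```

The last call shows why an unseen word can still get a vector: "小房間" shares the n-gram "房間>" with "房間", so it inherits part of that word's representation.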

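3. Text classification

3.1 Code implementation

Official detailed tutorial: Text classification · fastText (the official tutorial uses an English StackExchange dataset about cooking).

This article covers the one-label case. For the multi-label setting, see the corresponding part of the official tutorial: http://fasttext.cc/docs/en/supervised-tutorial.html#multi-label-classification, as well as another post of mine: "Evaluation metrics for multi-class classification models: definitions, principles, and Python implementation".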

First, convert the raw data into fastText's classification format: the Chinese text has to be segmented manually, and every label starts with `__label__`. Since fastText only provides training and testing code, I split the data into a training set and a test set only. Reference code:

```python
import csv, jieba, random

# Assumes the CSV file is UTF-8 encoded.
with open('data/cls/ChnSentiCorp_htl_all.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # header row
    # Each element is a list of two strings: the label ('0' or '1') and the review text.
    data = [[row[0], row[1]] for row in reader]

tofiledir = 'data/cls'

# Randomly split the data into 80% training and 20% test
random.seed(14560704)
random.shuffle(data)
split_point = int(len(data) * 0.8)
with open(tofiledir + '/train.txt', 'w', encoding='utf-8') as f:
    train_data = data[:split_point]
    f.writelines([' '.join(jieba.cut(row[1])) + ' __label__' + row[0] + '\n' for row in train_data])
with open(tofiledir + '/test.txt', 'w', encoding='utf-8') as f:
    test_data = data[split_point:]
    f.writelines([' '.join(jieba.cut(row[1])) + ' __label__' + row[0] + '\n' for row in test_data])
```

Sample of the resulting file:

![Sample of the generated training file](http://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/eba24540c5e940d1975c362448bb38e6~tplv-k3u1fbpfcp-zoom-1.image)

Train the classification model, evaluate it, and print the test results:

```python
import fasttext

model = fasttext.train_supervised('data/cls/train.txt')
print(model.words[:10])
print(model.labels)

print(model.test('data/cls/test.txt'))

# Note: this review is passed in unsegmented, while the model was trained on
# jieba-segmented text; segmenting it the same way first would match the training format.
print(model.predict('酒店環境還可以，服務也很好，就是房間的衛生稍稍馬虎了一些，坐便器擦得不是十分乾淨，其它方面都還好。尤其是早餐，在我住過的四星酒店裡算是花樣比較多的了。因為游泳池是在室外，所以這個季節去了怕冷的人就沒有辦法游泳。補充點評 2007 年 11 月 16 日：服務方面忘了說一點，因為我落了一樣小東西在酒店，還以為就算了，沒想到昨天離開，今天就收到郵件提醒我說我落了東西，問我需要不需要他們給寄回來，這一點比有些酒店要好很多。'))
```

Output:

```
Read 0M words
Number of words: 26133
Number of labels: 2
Progress: 100.0% words/sec/thread: 397956 lr: 0.000000 avg.loss: 0.353336 ETA: 0h 0m 0s
['，', '的', '。', ',', '了', '酒店', '是', '</s>', '很', '房間']
['__label__1', '__label__0']
(1554, 0.8783783783783784, 0.8783783783783784)
(('__label__1',), array([0.83198541]))
```

The `test()` function returns, in order: the number of samples, precision@1, and recall@1. (precision@1 is roughly the fraction of top-scoring predicted labels that are correct; see "IR-ratio: Precision-at-1 and Reciprocal Rank". recall@1 is the fraction of correct labels that were predicted.)
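As a rough illustration of these two metrics, the sketch below computes precision@k and recall@k from scratch. It reflects my reading of the definitions (correct predictions divided by the number of predictions made, and by the number of true labels, respectively). The function name and the toy labels are made up for illustration.

```python
def precision_recall_at_k(gold_labels, ranked_predictions, k=1):
    """Compute (precision@k, recall@k) over a list of examples.

    gold_labels: list of sets of true labels, one set per example.
    ranked_predictions: list of label lists, best guess first.
    """
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for gold, ranked in zip(gold_labels, ranked_predictions):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)
        n_gold += len(gold)
    return n_correct / n_predicted, n_correct / n_gold

# Four toy examples, each with exactly one true label.
gold = [{"__label__1"}, {"__label__0"}, {"__label__1"}, {"__label__1"}]
pred = [
    ["__label__1", "__label__0"],
    ["__label__1", "__label__0"],
    ["__label__1", "__label__0"],
    ["__label__0", "__label__1"],
]

p1, r1 = precision_recall_at_k(gold, pred, k=1)
p2, r2 = precision_recall_at_k(gold, pred, k=2)
print(p1, r1)  # 0.5 0.5
print(p2, r2)  # 0.5 1.0
```

When every example has exactly one true label and k is 1, the two denominators coincide, so precision@1 and recall@1 both reduce to plain accuracy. That is consistent with the two identical values in the `test()` output above.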

`predict()` also accepts a list of strings.

The `k` argument of `test()` and `predict()` sets the number of labels to return; the default is 1.
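The effect of `k`, together with the `threshold` argument that `predict()` also accepts, can be pictured with a small stand-alone function. This is a sketch of the semantics as I understand them, not fastText code: the function name and the probabilities are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def select_labels(probabilities, k=1, threshold=0.0):
    """Return up to k (label, probability) pairs at or above `threshold`.

    Pairs are sorted from most to least probable; k=-1 means no limit.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept = [(label, prob) for label, prob in ranked if prob >= threshold]
    return kept if k == -1 else kept[:k]

# Hypothetical scores for one review.
scores = {"__label__0": 0.17, "__label__1": 0.83}

print(select_labels(scores))                        # [('__label__1', 0.83)]
print(select_labels(scores, k=-1))                  # both labels, best first
print(select_labels(scores, k=-1, threshold=0.5))   # [('__label__1', 0.83)]
```

Combining `k=-1` with a threshold is the pattern typically used for multi-label prediction, where the number of labels per example is not known in advance.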

Model files are saved and loaded in the same way as for the word vector model in section 2.

Other parameters of `train_supervised()`:

- `epoch` (default 5)
- `lr` (values in the range 0.1 to 1 work well)
- `wordNgrams`: use word n-grams instead of unigrams only (important for classification tasks where word order matters, such as sentiment analysis)
- `bucket`
- `dim`
- `loss`
    - `hs` (hierarchical softmax): used in place of the standard softmax to speed up computation. I read about it without fully understanding it. Roughly, the labels are arranged in a binary tree, so the cost grows logarithmically rather than linearly with the number of labels. fastText uses a Huffman tree, which gives the optimal average lookup time. Official fastText explanation: http://fasttext.cc/docs/en/supervised-tutorial.html#advanced-readers-hierarchical-softmax. It also links to a YouTube lecture: Neural networks [10.7] : Natural language processing - hierarchical output layer - YouTube
    - `one-vs-all` or `ova`: the multi-label setting, in which each label is modeled as a separate one-label classification task. Compared with the other loss functions, a lower learning rate is recommended. With `predict()`, pass `k=-1` to return as many predictions as possible and `threshold` to keep only labels scoring above the threshold. With `test()`, simply pass `k=-1`.

```
input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
```
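To give a feel for why a Huffman tree helps with hierarchical softmax, here is a small sketch that builds one over label frequencies and reports each label's depth, meaning the number of binary decisions needed to reach it. It is a generic Huffman construction with `heapq`, not fastText's code. The function name and the label counts are invented for the example.

```python
import heapq
import itertools

def huffman_depths(label_counts):
    """Return {label: depth in the Huffman tree} for the given frequencies."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on ties
    heap = [
        (count, next(tie_breaker), {label: 0})
        for label, count in label_counts.items()
    ]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; their labels sink one level.
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        merged = {
            label: depth + 1
            for label, depth in {**depths_a, **depths_b}.items()
        }
        heapq.heappush(heap, (count_a + count_b, next(tie_breaker), merged))
    return heap[0][2]

# Made-up label frequencies.
counts = {"baking": 8, "equipment": 4, "bread": 2, "sushi": 1, "tea": 1}
depths = huffman_depths(counts)
print(depths)

# Average number of decisions per example, weighted by label frequency.
total = sum(counts.values())
average_depth = sum(counts[label] * depths[label] for label in counts) / total
print(average_depth)
```

Frequent labels end up close to the root and rare ones deeper, so the frequency-weighted average path stays short. With these counts it is 1.875 decisions, compared with scoring all 5 labels in a flat softmax.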

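3.2 How it works

If you use the text classification functionality in a paper, cite: Bag of Tricks for Efficient Text Classification

It looks like a fairly intuitive, simple model: compute the word vectors, average them, and compute the output label from that average. Details to be added later.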
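The "average the word vectors, then compute the label" description can be sketched as follows. This is a toy forward pass only, with no training, no n-gram features, and hand-picked numbers. The function names, embeddings, and weights are all invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(tokens, embeddings, weights, labels):
    """Average the token embeddings, apply a linear layer, then softmax."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        hidden = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    else:
        hidden = [0.0] * dim  # no known token: fall back to a zero vector
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return dict(zip(labels, softmax(scores)))

# Toy 2-dimensional embeddings and one weight row per label.
toy_embeddings = {
    "clean": [1.0, 0.0],
    "friendly": [0.8, 0.1],
    "dirty": [0.0, 1.0],
    "noisy": [0.1, 0.9],
}
toy_weights = [[2.0, -2.0], [-2.0, 2.0]]
toy_labels = ["__label__1", "__label__0"]

positive = classify(["clean", "friendly", "room"], toy_embeddings, toy_weights, toy_labels)
negative = classify(["dirty", "noisy"], toy_embeddings, toy_weights, toy_labels)
print(positive)
print(negative)
```

Because the tokens are averaged, word order plays no role in this sketch, which is one reason options such as `wordNgrams` matter for tasks like sentiment analysis.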

4. Compressing the model with quantization

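```python
def print_results(N, p, r):
    # Small helper modeled on the fastText README: sample count, precision@1, recall@1.
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model :
# (`valid_data` is the path to a validation file in fastText format)
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")
```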

5. Model properties and methods

Methods:

```
get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                        # This is equivalent to `dim` property.
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
                        # This is equivalent to `labels` property.
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model reducing the size of the model and its memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.
```

```
model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()
```

```
model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
```
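The equivalences listed above are the usual Python pattern of properties and special methods delegating to named methods. The toy class below shows that pattern in isolation. It is a hypothetical stand-in, not fastText's model class, and unlike a real model it simply raises `KeyError` for unknown words.

```python
class ToyModel:
    """Toy stand-in showing how shortcuts can delegate to named methods."""

    def __init__(self, vectors):
        self._vectors = dict(vectors)

    def get_words(self):
        return list(self._vectors)

    def get_word_vector(self, word):
        return self._vectors[word]

    @property
    def words(self):
        # model.words behaves like model.get_words()
        return self.get_words()

    def __getitem__(self, word):
        # model['king'] behaves like model.get_word_vector('king')
        return self.get_word_vector(word)

    def __contains__(self, word):
        # 'king' in model behaves like 'king' in model.get_words()
        return word in self.get_words()

toy = ToyModel({"king": [0.1, 0.2], "queen": [0.3, 0.4]})
print(toy["king"])     # [0.1, 0.2]
print("king" in toy)   # True
print("hotel" in toy)  # False
print(toy.words)       # ['king', 'queen']
```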

6. Other references not mentioned in the text or footnotes

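"NLP in Practice: Chinese Text Classification with Fasttext" (vivian_ll's blog, CSDN): this post removes stop words and also shows how to compute word vectors with the gensim package.
[Original] "Using fastText for Chinese Text Classification" article collection (編碼無悔 / Intent & Focused): this one is done in Java. Because the dataset is large, the author wanted to implement it with map-reduce. The labels were obtained by calling Tencent Cloud's free text classification API.
"A Roundup of Chinese Datasets for Text Classification (Sentiment Analysis)" (櫻與刀's blog, CSDN)
"Reading CSV Data in Python3" (柿子鐳's blog, CSDN)
"Using the skipinitialspace Parameter When Reading CSV in Python" (vanlywang's blog, CSDN)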

fastText Python Tutorial