1 Support Vector Machines
1.1 Example Dataset 1
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
from sklearn import svm
Most SVM libraries automatically add the extra feature x_0 and its parameter θ_0 for you, so there is no need to add them by hand.
mat = loadmat("./data/ex6data1.mat")
print(mat.keys())
# dict_keys(["__header__", "__version__", "__globals__", "X", "y"])
X = mat["X"]
y = mat["y"]
def plotData(X, y):
    """Scatter plot of the examples, coloured by label."""
    plt.figure(figsize=(8,5))
    plt.scatter(X[:,0], X[:,1], c=y.flatten(), cmap="rainbow")
    plt.xlabel("X1")
    plt.ylabel("X2")

plotData(X, y)
def plotBoundary(clf, X):
    """Plot the decision boundary of a fitted classifier."""
    x_min, x_max = X[:,0].min()*1.2, X[:,0].max()*1.1
    y_min, y_max = X[:,1].min()*1.1, X[:,1].max()*1.1
    # evaluate the classifier on a dense grid and draw the contour between classes
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)
# C must be passed by keyword in current scikit-learn releases
models = [svm.SVC(C=C, kernel="linear") for C in [1, 100]]
clfs = [model.fit(X, y.ravel()) for model in models]
titles = ["SVM Decision Boundary with C = {} (Example Dataset 1)".format(C) for C in [1, 100]]
for model, title in zip(clfs, titles):
    plotData(X, y)  # plotData already opens a new figure
    plotBoundary(model, X)
    plt.title(title)
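As the figures above show, when C is large the model penalizes misclassification heavily: it is strict, makes few misclassifications, and ends up with a narrow margin.
When C is small the penalty is light: the model is more lenient, tolerates some misclassified points, and ends up with a wider margin.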
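The effect of C on the margin can also be checked numerically. The following is a minimal sketch, not part of the original exercise: it assumes synthetic, overlapping data (the names X_demo, y_demo and margin_width are made up for illustration) and reports the geometric margin width 2 / ||w|| of a linear SVM for several values of C.

```python
import numpy as np
from sklearn import svm

# Hypothetical synthetic data: two overlapping Gaussian blobs (not ex6data1.mat).
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(50, 2) - 1.0, rng.randn(50, 2) + 1.0])
y_demo = np.array([0] * 50 + [1] * 50)

def margin_width(C):
    """Fit a linear SVM and return the geometric margin width 2 / ||w||."""
    clf = svm.SVC(C=C, kernel="linear").fit(X_demo, y_demo)
    return 2.0 / np.linalg.norm(clf.coef_[0])

margins = {C: margin_width(C) for C in (0.001, 1.0, 100.0)}
for C, width in margins.items():
    print("C = {:<7} margin width = {:.3f}".format(C, width))
```

On data like this the width should shrink as C grows, which matches the behaviour described above.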
1.2 SVM with Gaussian Kernels
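In this part we use an SVM for nonlinear classification, with a Gaussian kernel.
To find a nonlinear decision boundary with an SVM, we first need to implement the Gaussian kernel. You can think of the Gaussian kernel as a similarity function that measures the "distance" between a pair of examples, (x^(i), x^(j)).
For the actual training we can simply use the kernel built into sklearn's svm module.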
1.2.1 Gaussian Kernel
def gaussKernel(x1, x2, sigma):
    return np.exp(- ((x1 - x2) ** 2).sum() / (2 * sigma ** 2))

gaussKernel(np.array([1, 2, 1]), np.array([0, 4, -1]), 2.)
# 0.32465246735834974
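The hand-written kernel can be cross-checked against scikit-learn. The sketch below assumes the usual parameterisation of the RBF kernel, exp(-gamma * ||x1 - x2||^2), so that gamma = 1 / (2 * sigma^2) corresponds to the Gaussian kernel with bandwidth sigma; this is the same conversion the later SVC calls in this article use. The name gauss_kernel is only a local re-statement of the function above.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gauss_kernel(x1, x2, sigma):
    """Gaussian kernel exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2.0

# scikit-learn parameterises the RBF kernel as exp(-gamma * ||x1 - x2||^2),
# so gamma = 1 / (2 * sigma^2) reproduces the Gaussian kernel above.
gamma = 1.0 / (2 * sigma ** 2)
k_manual = gauss_kernel(x1, x2, sigma)
k_sklearn = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]
print(k_manual, k_sklearn)
```

Both values should agree, which is why the training code can pass gamma to svm.SVC instead of supplying a custom kernel function.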
1.2.2 Example Dataset 2
mat = loadmat("./data/ex6data2.mat")
X2 = mat["X"]
y2 = mat["y"]
plotData(X2, y2)
sigma = 0.1
gamma = np.power(sigma, -2.) / 2  # gamma = 1 / (2 * sigma^2)
clf = svm.SVC(C=1, kernel="rbf", gamma=gamma)
model = clf.fit(X2, y2.flatten())
plotData(X2, y2)
plotBoundary(model, X2)
1.2.3 Example Dataset 3
mat3 = loadmat("data/ex6data3.mat")
X3, y3 = mat3["X"], mat3["y"]
Xval, yval = mat3["Xval"], mat3["yval"]
plotData(X3, y3)
Cvalues = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
sigmavalues = Cvalues
best_pair, best_score = (0, 0), 0

# try every (C, sigma) pair and keep the one with the best validation accuracy
for C in Cvalues:
    for sigma in sigmavalues:
        gamma = np.power(sigma, -2.) / 2
        model = svm.SVC(C=C, kernel="rbf", gamma=gamma)
        model.fit(X3, y3.flatten())
        this_score = model.score(Xval, yval)
        if this_score > best_score:
            best_score = this_score
            best_pair = (C, sigma)

print("best_pair={}, best_score={}".format(best_pair, best_score))
# best_pair=(1.0, 0.1), best_score=0.965
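The same train/validation search can be expressed with scikit-learn's own tools. The following is a sketch under stated assumptions rather than the assignment's method: it uses make_moons as hypothetical stand-in data (not ex6data3.mat) and combines GridSearchCV with PredefinedSplit so that every candidate is trained on the training rows and scored on the validation rows only.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical stand-in data (not ex6data3.mat): a training and a validation set.
X_train, y_train = make_moons(n_samples=200, noise=0.2, random_state=0)
X_val, y_val = make_moons(n_samples=100, noise=0.2, random_state=1)

values = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
param_grid = {
    "C": values,
    # Convert each sigma into the gamma that scikit-learn expects.
    "gamma": [1.0 / (2 * sigma ** 2) for sigma in values],
}

# -1 marks rows that are only ever used for training; 0 marks the validation fold.
test_fold = np.r_[-np.ones(len(X_train), dtype=int), np.zeros(len(X_val), dtype=int)]
search = GridSearchCV(
    svm.SVC(kernel="rbf"),
    param_grid,
    cv=PredefinedSplit(test_fold),
    refit=False,  # only rank the candidates; do not refit on train + validation
)
search.fit(np.vstack([X_train, X_val]), np.r_[y_train, y_val])

best_sigma = np.sqrt(1.0 / (2 * search.best_params_["gamma"]))
print(search.best_params_["C"], best_sigma, search.best_score_)
```

The explicit double loop is easier to follow for an exercise; the grid-search form mainly pays off once more parameters or several folds are involved.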
model = svm.SVC(C=1., kernel="rbf", gamma=np.power(.1, -2.) / 2)
model.fit(X3, y3.flatten())
plotData(X3, y3)
plotBoundary(model, X3)
# A plotting exercise of my own, unrelated to the assignment; included as a plotting reference.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# three linearly separable points
X = np.array([[3,3],[4,3],[1,1]])
Y = np.array([1,1,-1])

# fit the model
clf = svm.SVC(kernel="linear")
clf.fit(X, Y)

# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# plot the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# plot the line, the points, and the nearest vectors to the plane
plt.figure(figsize=(8,5))
plt.plot(xx, yy, "k-")
plt.plot(xx, yy_down, "k--")
plt.plot(xx, yy_up, "k--")

# circle the support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=150, facecolors="none", edgecolors="k", linewidths=1.5)
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.rainbow)
plt.axis("tight")
plt.show()

print(clf.decision_function(X))
[ 1. 1.5 -1. ]
2 Spam Classification
2.1 Preprocessing Emails
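In this part we build a spam classifier with an SVM. Each email has to be converted into an n-dimensional feature vector; the classifier then decides whether a given email x is spam (y=1) or not spam (y=0).
Take a look at an example from the dataset: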
with open("data/emailSample1.txt", "r") as f:
    email = f.read()
print(email)
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100.
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
[email protected]
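As you can see, the email contains a URL, an email address (at the end), numbers, and dollar amounts. Many emails contain these kinds of elements, but the specific values differ from one email to the next. A common way to handle this is to normalize them, so that all URLs are treated alike, all numbers are treated alike, and so on.
For example, we replace every URL with the single string "httpaddr" to indicate that the email contains a URL, without caring which URL it is. This usually improves the performance of a spam classifier: spammers often randomize their URLs, so the chance of seeing any particular URL again in a new spam email is very small.
We can apply the following steps: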
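1. Lower-casing: convert the entire email to lower case.
2. Stripping HTML: remove all HTML tags and keep only the content.
3. Normalizing URLs: replace every URL with the string "httpaddr".
4. Normalizing Email Addresses: replace every email address with "emailaddr".
5. Normalizing Dollars: replace every dollar sign ($) with "dollar".
6. Normalizing Numbers: replace every number with "number".
7. Word Stemming: reduce every word to its stem. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount".
8. Removal of non-words: remove all non-word characters and collapse all whitespace (tabs, newlines, spaces) into a single space.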
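As a rough illustration of steps 1 to 6 and the whitespace handling in step 8, the sketch below applies one regular expression per step to a made-up message. The function name normalize_email and the sample text are assumptions for this example only; stemming (step 7) is left out.

```python
import re

def normalize_email(text):
    """Apply steps 1-6 and the whitespace part of step 8 to a raw email string."""
    text = text.lower()                                          # 1. lower-case
    text = re.sub(r"<[^<>]+>", " ", text)                        # 2. strip HTML tags
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)    # 3. URLs
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)           # 4. email addresses
    text = re.sub(r"[$]+", "dollar", text)                       # 5. dollar signs
    text = re.sub(r"\d+", "number", text)                        # 6. numbers
    text = re.sub(r"\s+", " ", text).strip()                     # 8. collapse whitespace
    return text

# Hypothetical sample message, not taken from the dataset.
sample = "Visit <b>http://example.com/</b> NOW,\nonly $100! Mail bob@example.com"
print(normalize_email(sample))
# visit httpaddr now, only dollarnumber! mail emailaddr
```

The order matters: HTML tags are stripped before URLs are replaced, otherwise the URL pattern would swallow a closing tag that directly follows the address.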
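%matplotlib inline
import numpy as np
import pandas as pd  # used below to read vocab.txt
import matplotlib.pyplot as plt
from scipy.io import loadmat
from sklearn import svm
import re  # regular expressions for e-mail processing

# A usable English stemming algorithm (Porter stemmer)
from stemming.porter2 import stem

# This stemmer seems closer to the code used in the assignment; the results are about the same as the one above
import nltk, nltk.stem.porter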
def processEmail(email):
    """Apply every step except Word Stemming and Removal of non-words."""
    email = email.lower()
    # match "<", then anything that is not "<" or ">", up to the closing ">", i.e. <...>
    email = re.sub(r"<[^<>]+>", " ", email)
    # match everything after "//" that is not whitespace; stop at the first whitespace character
    email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)
    email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)
    email = re.sub(r"[$]+", "dollar", email)
    email = re.sub(r"[\d]+", "number", email)
    return email
The next step is to stem the words and remove non-word characters.
def email2TokenList(email):
    """Preprocess the email and return a clean list of stemmed words."""
    # I'll use the NLTK stemmer because it more accurately duplicates the
    # performance of the OCTAVE implementation in the assignment
    stemmer = nltk.stem.porter.PorterStemmer()

    email = processEmail(email)

    # split the email into individual words; re.split() accepts several delimiters at once
    tokens = re.split(r"[\s@$/#.\-:&*+=\[\]?!(){},'\">_<;%]", email)

    # go through every piece produced by the split
    tokenlist = []
    for token in tokens:
        # remove any non-alphanumeric characters
        token = re.sub(r"[^a-zA-Z0-9]", "", token)
        # skip empty strings
        if not len(token):
            continue
        # use the Porter stemmer to extract the word stem
        stemmed = stemmer.stem(token)
        tokenlist.append(stemmed)

    return tokenlist
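2.1.1 Vocabulary List
After preprocessing an email we have a list of processed words. The next step is to choose which words to use in the classifier and which to leave out.
We have a vocabulary list, vocab.txt, which stores 1899 words that occur frequently in practice.
We need to find which words of the processed email appear in vocab.txt and return their indices in vocab.txt; these are the word indices we want to train on.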
def email2VocabIndices(email, vocab):
    """Return the indices of the vocabulary words that occur in the email."""
    token = email2TokenList(email)
    index = [i for i in range(len(vocab)) if vocab[i] in token]
    return index
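To make the lookup concrete, here is a small sketch with a made-up seven-word vocabulary and a made-up token list (both hypothetical, not taken from vocab.txt). It uses a set for the membership test, a small design choice of this sketch rather than the article's code, and then builds the binary feature vector from the indices.

```python
import numpy as np

# Hypothetical vocabulary and token list, for illustration only.
vocab = ["anyon", "cost", "dollar", "host", "know", "number", "web"]
tokens = ["anyon", "know", "how", "much", "it", "cost", "to", "host", "a", "web", "portal"]

token_set = set(tokens)  # constant-time membership tests instead of scanning a list
indices = [i for i, word in enumerate(vocab) if word in token_set]
print(indices)  # [0, 1, 3, 4, 6]

# Binary feature vector: 1 where the vocabulary word occurs in the email, else 0.
vector = np.zeros(len(vocab))
vector[indices] = 1
print(vector)  # [1. 1. 0. 1. 1. 0. 1.]
```

Words that are not in the vocabulary ("how", "much", "portal" here) simply leave no trace in the vector.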
2.2 Extracting Features from Emails
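def email2FeatureVector(email):
    """
    Convert an email into a feature vector of length n, where n is the size of vocab.
    Positions of words that occur in the email are set to 1, all others to 0.
    """
    df = pd.read_table("data/vocab.txt", names=["words"])
    vocab = df["words"].values  # 1-D array of words (DataFrame.as_matrix() was removed in pandas 1.0)
    vector = np.zeros(len(vocab))  # init vector
    vocab_indices = email2VocabIndices(email, vocab)  # indices of the words that occur
    # set the entries of the words that occur to 1
    for i in vocab_indices:
        vector[i] = 1
    return vector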
vector = email2FeatureVector(email)
print("length of vector = {}\nnum of non-zero = {}".format(len(vector), int(vector.sum())))
length of vector = 1899
num of non-zero = 45
2.3 Training SVM for Spam Classification
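Load the feature vectors that have already been extracted, together with their labels. They are split into a training set and a test set.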
# Training set
mat1 = loadmat("data/spamTrain.mat")
X, y = mat1["X"], mat1["y"]

# Test set
mat2 = loadmat("data/spamTest.mat")
Xtest, ytest = mat2["Xtest"], mat2["ytest"]
clf = svm.SVC(C=0.1, kernel="linear")
clf.fit(X, y.ravel())  # flatten the (m, 1) label column to 1-D
2.4 Top Predictors for Spam
predTrain = clf.score(X, y)
predTest = clf.score(Xtest, ytest)
predTrain, predTest
(0.99825, 0.989)
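The code above reports accuracy only. One common way to read the top predictors off a linear SVM is to sort the learned weights in coef_, since a large positive weight pushes an email towards the spam class. The sketch below is an illustration on toy data, with a hypothetical three-word vocabulary and the made-up names X_toy, y_toy and clf_toy; it does not use spamTrain.mat.

```python
from itertools import product

import numpy as np
from sklearn import svm

# Hypothetical three-word vocabulary and toy data, not the course's spamTrain.mat.
vocab = ["meeting", "click", "report"]
X_toy = np.array(list(product([0, 1], repeat=3)))  # all 8 binary word patterns
y_toy = X_toy[:, 1]                                 # "spam" iff the email contains "click"

clf_toy = svm.SVC(C=0.1, kernel="linear").fit(X_toy, y_toy)

weights = clf_toy.coef_[0]           # one weight per vocabulary word
order = np.argsort(weights)[::-1]    # largest (most spam-like) weight first
top_words = [(vocab[i], weights[i]) for i in order]
for word, weight in top_words:
    print("{:<10} {:+.3f}".format(word, weight))
```

With the real classifier the same idea would presumably be applied to clf.coef_[0] together with the words from vocab.txt, listing only the first few entries of the sorted order.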
Original article: https://blog.csdn.net/Cowry5/article/details/80465922