shihpeng's blog

2015年11月16日星期一

[Java] Hashtable 的概觀、細節與實作

Hash Table

在撰寫程式時，我們經常會用到類似「表格」的資料結構來儲存資料。要完成這件事情，一個最基本的想法就是將表格的鍵值 (key，通常會是一個字串)，經過某個函數轉換為整數後，將這個整數作為索引值把該筆資料存到一個夠大的陣列中。

利用 hashing 方式儲存與取得資料的操作都必須有兩個階段：第一步，先透過某個 hash function 把鍵值轉為用作索引值的整數，我們稱為 hash value，這個 hash value 會對應到這個「陣列」中的某個位置，我們稱為 bucket；第二，用這個計算出的 hash value 把要儲存的資料存入或取出 bucket。理想上來說，我們希望每筆資料都能在陣列中找到一個屬於自己的位置，但事實上當然不可能這麼理想，假設現在我們的陣列長度只有 100，但有 200 個元素要存入這個資料結構中，那絕對會發生資料 collision 的狀況。所以上述的第二步驟的主要工作就是在於如何處理資料的 collision。

所以，組成一份 hash table 的主要元件有三：hash function、hash value 與 bucket。其中最為重要，也影響 hash table 效能最大的，就是 hash function 了。

Hash Function

為了盡量讓我們的資料都能平均的散佈在 hash table 的各個 bucket 中，我們希望 hash function 所計算出的 hash value 能盡量的平均散佈 (uniformly distributed)。而且因為計算 hash value 只是進行資料儲存取出整個過程的一小部分，所以我們也希望這個值的計算能盡量簡單。一般來說，常用的 hash function 有：
郵遞區號、身分證ID 等對資料內容有直觀意義的數字。
取餘數：假設 hash table 中有 k 個 bucket，我們可以取 m % k 作為 hash value。
字串：我們可以根據字串中的字元將整個字串計算成一個巨大的數字，然後再取 k 的餘數：
int hash = 0;
for (int i = 0; i < s.length(); i++)
hash = (R * hash + s.charAt(i)) % M;

Bucket

在 hash table 的定義中，bucket 是用來存放多個具有相同 hash value 的鍵值對資料的地方。假設現在我們有一個含有 1000 個 bucket 的 hash table，同時我們也有一個非常完美的 hash function，理想上來說，如果現在我們將這 1000 筆鍵值對資料存入這個 hash table 後，每個 bucket 會正好只有一筆鍵值對資料存在其中，如此一來存取任何一筆資料的時間複雜度為 O(1)。

但是，事實上當然不可能有那麼完美的事。如果有兩筆鍵值對資料要存入 hash table，而兩筆資料的鍵值所算出來的 hash value 相同時，那這筆資料就會以鏈結串列的方式記錄在對應的 hash value 的欄位上。所以，如果現在我們有 n 筆資料要存入 hash table 中，而我們的 hash function 又非常的爛，假設是一個只會吐出常數的函數好了，那我們的 n 筆資料所算出的 hash value 總是同一個值，總是要被放在同一個 bucket 中，最終這個 hash table 裡會只有一個bucket，而這個 bucket 中卻有一個長度為 n 的鏈結串列。當我們要尋找某筆資料時，必須窮舉這個串列的每一筆資料，造成取出時間複雜度為 O(n) 的 worst case。

Java 的 hash tables

Java 中是如何實作 hash table 的呢？在 Java standard library 中可做為 hash table 的類別共有三個，分別是 Hashtable、HashMap 以及 ConcurrentHashMap。

事實上，Java 的 HashMap 並不信任使用者類別中定義的 hashCode() 的 hash value 有多好，所以它會在實際計算 hash value 時，利用 hashCode() 所回傳的整數雜湊值作為輸入，再放入另一個名為 hash() 的靜態方法中來計算出實際用於存取資料的雜湊值。

public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {

        // put the <key, value> to the table
    }
}

static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor.)
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length-1);
}

我們可以看到要實際放入一筆資料進入 HashMap 時，會先透過物件的 hashCode() 與 HashMap 的 hash() 方法計算出 hash value，hash() 中的運算非常簡單，卻能夠有效降低 collision 的機率。

接著，由於 HashMap 中用來儲存資料的物件是個每個元素實際上都是一個 Entry 型態串列的陣列，所以我們要再透過 indexFor() 方法來算出現在要存取的鍵值對資料該放入哪個串列中。

HashMap 的 hash function

HashMap 的 get(Key k) 方法首先會呼叫鍵值物件的 hashCode()，然後將這個方法所計算出的 hash value 作為 input，再送給其所自行定義的靜態方法 hash() 中計算出要存取的 bucket 位置。不過，像是 Hashtable 與 ConcurrentHashMap 並沒有 null key 的問題，HashMap 如果遇到 null key，則總是會計算出 hash 0，也就是位置為 0 的 bucket。

現在有個有趣的問題了：明明我們就已經有了一個 hashCode()，為什麼還要再用它自己的 hash() 算一次呢？答案很簡單，因為 HashMap 並不相信自定義物件的 hash function 的品質。

Thread-Safety

Hashtable 雖然是 thread-safe 的類別，但其 thread-safe 的性質卻來自於各個被標為 synchronized 的成員方法。也就是說，Hashtable 在實際應用在多執行緒環境時，其效能會非常的低落。所以當我們必須在多執行緒環境中使用 hash table 時，我們會選擇使用 ConcurrentHashMap。ConcurrentHashMap 也是 thread-safe 的，但與 Hashtable 不同的是，ConcurrentHashMap 在處理可能發生 race condition 的操作時，其採用的策略是鎖定部分欄位 (lock the portion) 的方式。我們如果去看 ConcurrenctHashMap 的原始碼，會發現 put() 或 get() 等方法並沒有像 Hashtable 那樣被直接標為 synchronzied，而是在明確知道該處理哪個 bucket 之後，不得不進行資料的寫入讀出時，才進行同步化的工作。

Fail-Fast 與 Fail-Safe

2015年10月15日星期四

一些基本的資料分析問題

這篇文章將為大家介紹一些常見實用的資料分析問題，這些問題相當的基本，主要目的在於讓我們能對所擁有的資料開始有一些些的 insight，好讓我們在未來要進行諸如 regression 或 learning 等等工作上能更加順利。

1. 這些資料的 typical value 是什麼？(What is a typical value?)

所謂資料的 typical value，指的就是能夠代表這份資料的統計值。具備這種資格的統計值有：平均值 (mean)、中位數 (median)、眾數 (mode) 還有各式分位數等等。但因為均值具備有「能呈現資料全體變化」、便於統計方法使用等等優點，所以一般來說我們都會以均值做為某數據集的 typical value。

做法：優先給予資料的均值，搭配中位數與眾數等其他統計值。

2. 這份資料中的資料變動程度有多大？(What is the uncertainty for a typical value?)

資料的變動度 (variability) 有時也稱為分散度 (spread)。我們主要是希望能透過這樣的統計值來知道：1) 資料集中的資料與其 typical value 一般來說離得多遠？以及 2) 這份資料的極端值的分佈狀況。能用以表達這件事的最佳統計值當然非變異數 (variance) 與標準差 (standard deviation) 莫屬了，但事實上還是有其他的統計值是能夠協助我們瞭解資料變動度的，例如 range、average absolute deviation (AAD；也稱為 absolute mean deviation；MD)、median absolute deviation (MAD) 與 interquartile range (IQR) 等等。

不過，雖然我們總是習慣以標準差來標示數據的變動程度，但其能夠表達變動度的性質大多是基於母體分佈為常態的假設而來的。所以有些人建議說，如果我們懷疑母體分佈嚴重異於常態時，則我們可以用相對單純的 AAD 來評估資料的變動度會更加可靠一些 [2]。

做法：給予資料的變異數與標準差 (若資料分佈與常態差異太大，可加上 AAD)，搭配資料集的 histogram 呈現。

3. 這份資料的分佈狀況最類似於何種機率分佈？(What is a good distributional fit for a set of numbers?)

要知道機率分佈最好的方法就是透過資料的 histogram 來判讀；當資料的圖形出來後，大概會有三種可能：

1. 分佈左右對稱：資料分佈可能屬於常態分佈、logistic distribution 或 Cauchy distribution。前兩者非常的類似，但 Cauchy distribution 會在圖形的左右兩端出現非常非常極端的資料。

2. 分佈往右偏斜：資料右斜時，可能符合的分佈相當的多，像是：log-normal distribution、log-logistic distribution 與 exponential distribution (還有一些其他的右偏斜分佈像是 Weibull distribution、Pareto distribution 等，不過大概都能算是前幾個的特例)。右偏斜的分佈種類雖然多，不過差異都相當明顯，稍加判讀即可找出形狀最符合的分佈。

3. 分佈往左偏斜：左偏斜的分佈就比較少了，要是真的出現，則可判斷其是否為 square-normal distribution。在透過圖形讓我們對資料的分佈狀況有了一些 candidate 之後，即可套用 Goodness-of-Fit Test 來做進一步的確認。

做法：給予資料集分佈的 histogram，搭配可能的 candidate distributions 及其對應的 Goodness-of-Fit Test 結果。

4. 某實驗或操作是否發生了顯著的影響？(Does an engineering modification have an effect?)

過去在學術界如果要確認一項實驗是否具有影響必然使用的工具就是各項的統計檢測，像是 t-test/F-test；然後當我們計算出的 \(p\)-value 小於某個 significant level 時就宣稱我們的實驗效果顯著。不過，隨著時代進步，許多實驗的進行成本降低，一些比較容易進行的實驗中我們可以很容易的取得上萬甚至上百萬個樣本 (例如網站的 A/B testing)。但傳統的統計檢測方法用在這樣的大樣本資料上會出現某些問題，例如因為樣本數太多而造成效果總是顯著的 the \(p\)-value problem。所以以圖表的方式來確認一項試驗或操作是否造成了顯著影響會是比較的好方法。

做法：給予需進行比對的資料的 bihistogram 或用以解決 the \(p\)-value problem 的其他圖形並進行解讀；若資料量不大時則可使用 t-test/F-test。

5. 資料中的某項因子是否對資料具有顯著的影響？(Does a factor have an effect?)

做法：給予並解釋 block plot 或 bihistogram；若資料量不大時可加以使用 t-test/ANOVA 進行確認。

6. 影響資料的各項因子的重要性分別為何？(What are the most important factors?)

做法：給予並解釋 Design of Experiments (DOE) plot。

7. 能夠描述因變數與自變數之間關係的最佳函式為何？(What is the best function for relating a response variable to a set of factor variables?)

要對一組自變數與因變數進行回歸分析，我們會依循以下步驟：

1. 首先使用 linear model 作回歸分析，觀察其 \(R^2\) 與 \(adj. R^2\)（但請千萬不要僅以 \(R^2\) 與 \(adj. R^2\) 來評估 model，那有時會出現非常誤導的結果），以及其 residual plot。如果 residual plot 顯示各 residual 成隨機分佈，則代表回歸效果不錯；倘若我們看到 residual plot 上的資料點成某類型的 pattern 出現，則代表我們的 model 還沒有掌握到某部分的變數關係，此時需使用項次更高的多項式來做 regression。

2. 倘若使用一次多項式的 linear model 無法完全描述資料行為，可逐漸增加自變數的次數，然後依照 1. 的方式評估效果。不過在以多項式 linear model 進行迴歸分析時，自變數的次數不會太大，通常二次或三次即可相當準確的回歸出資料關係。

3. 如果 linear model 無法達到理想的效果時，可開始使用 nonlinear regression 的技巧。不過對於 nonlinear model 的效果我們只能透過 residual plot 與回歸標準誤差 (standard error of regression；有時也稱為 S 值) 來評估，因為我們並沒有辦法用「一般的方式」為 nonlinear model 計算 \(R^2\) 與 \(adj. R^2\)。

做法：Linear regression 優先，無法達到理想效果時才使用 nonlinear regression，並配合適當的工具評估選擇 model。

8. 資料中是否存在 outliers？(Does the data have outliers?)

對於outlier 的偵測，首先我們可以透過各式的圖表工具，像是一般的 histogram、box plot 以及 normal probability plot 來以肉眼的方式判斷資料中究竟有無 outlier。或許有人聽過許多 outliers 偵測的方法，例如 Grubb’s test 之類的，不過那些方法的限制太多，準確度也一直很有爭議，所以基本上以這些方式作為判斷 outlier 的依據。

事實上，在處理 outliers 之前，我們首先要對 outliers 有正確的認識，那就是我們眼中看出來的 outlier，究竟是不是 outliers？有的時候這些資料可能是資料輸入錯誤，那我們就該把它刪除；也有可能是這些 outliers 的生成程序本來就不屬於我們想分析的母體，那我們也該從一開始蒐集資料時過濾掉這些資料；但這些方法的最終目的，其實是要判斷是否有某些資料是來自不同的來源的。也就是說，要從根本清除 outliers 的話，其實最好的方法就是從蒐集資料的一開始就好好做資料的篩選與確認，而不是在要進行分析了才開始剔除 outliers。

做法：使用 histogram、box plot 以及 normal probability plot 等圖表法以肉眼確認是否出現 outliers。倘若確定有 outlier，則須回頭檢視資料蒐集流程，並移除這些雜訊資料的來源 (不宜對蒐集來的資料貿然刪除「看似」outliers 的資料)。

小結

嚴謹的資料分析必須連同一般檢測方法以及大量視覺化圖形一起進行，僅依賴任何一項都非常可能導致錯誤的結果。上述的各項數值與工具在各大統計軟體中都有提供，只要熟悉軟體的操作方式，上面幾個問題應該都能夠很快的回答出來。

References:

[1] http://www.itl.nist.gov/div898/handbook/eda/section3/eda32.htm
[2] Revisiting a 90-year-old debate: the advantages of the mean deviation
[3] The power of outliers (and why researchers should always check for them)

2015年5月25日星期一

最小平方法的時間複雜度 (Computational Complexity of Least Square Regression)

假設我們目前有 \(n\) 個 training samples，而我們會使用這些 samples 中的 \(p\) 個 features 來進行 OLS (Ordinary Least Squares)。則：

計算 \(X^T\) 乘以 \(X\) 的時間複雜度為 \(O(p^2 n)\)
計算 \(X^T\) 乘以 \(Y\) 的時間複雜度為 \(O(p n)\)
計算 \(X^TX\) 的 LU factorization 並使用其結果來計算 \((X^TX)^{-1}(X^TY)\) 的時間複雜度為 \(O(p^3)\)

我們現在可以來分析整體計算流程的時間複雜度了。\(O(p^2 n)\) 與 \(O(p n)\) 相較，由於\(O(p^2 n)\) dominates \(O(p n)\)，所以可不理會 \(O(p n)\)。

再者，這邊討論的是 normal equation，我們因此可以假設 \(n>p\) (否則 \(X^T X\) 會是 singular 的，並因此不可逆，找不到 \((X^TX)^{-1}\))，這也代表 \(O(p^2 n)\) dominates \(O(p^3)\)。

因此，整個運算過程的時間複雜度為 \(O(p^2 n)\)。

相較於使用 normal equation 來計算 OLS，我們也可以使用梯度下降法，由於梯度下降法僅會進行固定個迭代的矩陣相乘運算，其複雜度僅有 \(O(p n)\)，這也因此成為一般 samples 數量很多時所會採用的解法。

Reference:
http://math.stackexchange.com/questions/84495/computational-complexity-of-least-square-regression-operation

2013年1月31日星期四

[Python] BeautifulSoup 中Tag 物件的 string 與 text 屬性的差別是什麼？

　BeautifulSoup 是一款相當著名的 Python HTML/XML parser。簡單的使用方式讓幾乎任何人都能很輕易的上手並使用這個 library。唯一的問題是，BeautifulSoup 的文件實在太少，而且似乎沒有一份完整的 API 說明，造成在使用某些功能時必須三不五時的按 F3 trace library 原始碼，實在有點擾人。在這邊要說明的就是其中一個困擾我蠻久的一個問題：Tag物件中string與text屬性的差別到底是什麼？

　這兩個屬性間的差別是：Tag.string代表的是該 HTML/XML 標籤內含的文字，其資料型態為bs4.NavigableString。例如：

例一：
HTML：
<td width="20%"> Something </td>

Python code：
td = soup.find('td')
print table.string # Something
print table.text # Something

　Tag.text則是該標籤中，所有子標籤內含的文字。若其中有標籤不具備文字內容，則text會將其視為空字串一併丟出。所以，Tag.text總是有值，而不會出現為None的情況：

例二：

soup = BeautifulSoup('<img src="/img/test.jpg"/>')

print soup.string # None
print soup.text # (空字串)

例三：
HTML：
<table>
<tr>
<td width="20%"> ID </td>
<td width="30%"> 姓名 </td>
<tr/>
<tr>
<td width="20%"> 001 </td>
<td width="30%"> Shih-Peng Lin </td>
<tr/>
</table>

Python code：
table = soup.find('table')
print table.string # None
print table.text
# ID
# 姓名
# 001
# Shih-Peng Lin

　換句話說就是：如果一個Tag有string，那他就必定有text；但如果Tag有text屬性，則不一定會有string屬性。

2013年1月2日星期三

[Python] Python中的多執行緒

　在Python中如果要使用執行緒的話，其函式庫中提供了兩種方法：一種是以函式的方式使用，另一種則是將工作包裝成類別的方式。

1. 使用thread模組中的start_new_thread()來建立新的執行緒：

import time  

import thread  
 
def timer(no,interval):  #自己写的线程函数   

    while True:   

        print 'Thread :(%d) Time:%s'%(no,time.ctime())   

    time.sleep(interval)   
 
def test(): #使用thread.start_new_thread()来产生2个新的线程   

    thread.start_new_thread(timer,(1,1))     

    thread.start_new_thread(timer,(2,3))   
 
if __name__=='__main__':   

    test()  

　这个是thread.start_new_thread(function,args[,kwargs])函数原型，其中function参数是你将要调用的线程函数；args是讲传递给你的线程函数的参数，他必须是个tuple类型；而kwargs是可选的参数。

　线程的结束一般依靠线程函数的自然结束；也可以在线程函数中调用thread.exit()，他抛出SystemExit exception，达到退出线程的目的。

2、將工作包裝成threading.Thread的子類別來使用多執行緒：

import threading  

import time  

class timer(threading.Thread):     #我的timer类继承自threading.Thread类   

    def __init__(self,no,interval):    

        #在我重写__init__方法的时候要记得调用基类的__init__方法   

        threading.Thread.__init__(self)        

        self.no=no   

        self.interval=interval   

    def run(self):  #重写run()方法，把自己的线程函数的代码放到这里   

        while True:   

            print 'Thread Object (%d), Time:%s'%(self.no,time.ctime())   

        time.sleep(self.interval)   

def test():   

    threadone=timer(1,1)    #产生2个线程对象   

    threadtwo=timer(2,3)   

    threadone.start()   #通过调用线程对象的.start()方法来激活线程   

    threadtwo.start()   

if __name__=='__main__':   

    test()

[Python] 了解、args、與 kwargs

When i started learning Python, i was very confused regarding what args, kwargs, * and ** does. And i feel there are few like me who had this confusion and problem. With this post, i intend to reduce (hopefully i can eliminate) that confusion.

Throughout this post, i will be using ipython and i suggest you to try everything on ipython as well. We will intentionally make some mistakes along the way, so that we can understand this topic better.

Let's divide our work under five sections:

Understanding what '*' does from inside a function call.
Understanding what '*args' mean inside a function definition.
Understanding what '**' does from inside a function call.
Understanding what '**kwargs' mean inside a function definition.
A practical example of where we use 'args', 'kwargs' and why we use it.

Understanding what * does from inside a function call.

Let's define a function "fun" which takes three positional arguments.

In [5]: def fun(a, b, c):
   ...:     print a, b, c
   ...:     
   ...:

Call this function passing three positional arguments

In [7]: fun(1,2,3)
1 2 3                     #Output

So, calling this function by passing three positional arguments, prints the three arguments passed to the function.

Let's make a list with three integer values in it.

In [8]: l = [1,2,3]

Let's use '*' now.

In [9]: fun(*l)
1 2 3                     #Output

What did '*' do?

It unpacked the values in list 'l' as positional arguments. And then the unpacked values were passed to function 'fun' as positional arguments.

So, unpacking the values in list and changing it to positional arguments meant writing fun(*l) was equivalent to writing fun(1,2,3). Keep in mind that l=[1,2,3]. Let's try with some other value of 'l'.

In [10]: l=[5,7,9]
In [11]: fun(*l)
5 7 9                     #Output

Let's make some errors now. Let's put four values in "l".

In [12]: l=[3,5,6,9]

Now, try to call function "fun".

In [13]: fun(*l)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() takes exactly 3 arguments (4 given)

So, in our last statement which is 'fun(*l)' we did not get a proper output and a TypeError was raised. See the error, it says "TypeError: fun() takes exactly 3 arguments (4 given)".

Why did it happen?

list 'l' contains four values. So, when we tried 'fun(*l)', 'l' was unpacked so that its value could be sent as positional arguments. But, "l" has four values in it. So, writing 'fun(*l)' was equivalent to writing 'fun(3,5,6,9)'. But, 'fun' is defined to take only three positional arguments and hence we got this error. Similary, you can follow same steps with two values in list 'l' and notice the error.

In [14]: l=[5,6]
In [15]: fun(*l)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() takes exactly 3 arguments (2 given)

Let's mix '*l' with a positional argument.

In [16]: fun(1, *l)
1 5 6                  #Output.

Here we gave one positional argument which is 1 and two values i.e 5 and 6 were unpacked from "l" and hence 1,5 and 6 were passed to 'fun'.

Hope you were able to follow what '*' does when used inside a function call.

Understanding what '*args' mean inside a function definition.

Let's change the function definition now.

In [18]: def fun(*args):
   ....:     print args
   ....:     
   ....:

Call this function with one positional argument.

In [19]: fun(1)
(1,)                  #Output

Now call this function with two positional arguments or any number of positional arguments you wish.

In [20]: fun(1,2,3)
(1, 2, 3)

What does '*args' in a function definition do?

It recieves a tuple containing the positional arguments beyond the formal parameter list. So, "args" is a tuple. Don't worry about the part "formal parameter list" in our explanation, it will be clear with next few examples. In our last example when we printed "args", it printed a tuple which contained all the values we passed while calling the function.

Let's mix "*args" with some "formal parameter list". Note that in our last example we didn't have any formal parameter list. Let's redefine our function.

In [21]: def fun(a, *args):
   ....:     print "a is ", a
   ....:     print "args is ", args
   ....:     
   ....:

In this function definition, parameter "a" constitue the "formal parameter list". Let's call "fun" with four positional argument.

In [22]: fun(1, 2, 3, 4)
a is  1                                #Output
args is  (2, 3, 4)

So, we can see that 'a' is assigned 1 which was the first positional argument. There is only one parameter "*args" defined after "a". So, "args" received a tuple containing the positional arguments beyond the formal parameter list. So, args received 2, 3 and 4 as a tuple.

We can also call "fun" with just one positional argument. Let's do that.

In [23]: fun(1)
a is  1                                #Output
args is  ()                            #args received an empty tuple

Here, we passed only one argument to the function which was assigned to the formal parameter 'a'. So, 'args' received an empty tuple as can be seen from the output.

After we have "args", we can extract the values and do whatever we want. Let's redefine "fun".

In [24]: def fun(a, *args):
   ....:     print a
   ....:     print "args can receive a tuple of any number of arguments. Let's print all that."
   ....:     for arg in args:
   ....:         print arg
   ....:         
   ....:

We can call "fun" with any number of arguments.

In [25]: fun(1,5,6,7)
1                                                                         #Output
args can receive a tuple of any number of arguments. Let's print all that.
5
6
7

Since 'args' is a tuple, we could iterate over it.

Now, let's consider a case where we use whatever we saw till here. Here we need to use two functions. First function has to take an arbitrary number of arguments and it has to calculate the sum of all the arguments except the first argument. Also, this function has to make use of another function to calculate the sum. Weird use case, but we just need to recap whatever we did till here. The objective here is to see how we get a variable number of arguments in a function and pass these arguments to another function.

Let's first write the function which has to calculate the sum i.e the second function. For our use case, this function will be used in the first function which we are yet to write.

In [26]: def calculate_sum(*args):
   ....:     return sum(args)
   ....:

Here we make use of 'sum'. Function 'sum' is an inbuilt function which takes a tuple or a list and return the sum of all the elements in the tuple. From our function definition we can see that 'args' will receive a tuple containing all the positional arguments passed to this function. So, 'args' will be a tuple and can be directly used as an argument to function 'sum'. Let's write the other function which takes any number of arguments and uses previous function to calulate the sum of all arguments except the first argument.

In [29]: def ignore_firstargs_calculate_sum(a, *iargs):
   ....:     required_sum = calculate_sum(*iargs)
   ....:     print "sum is", required_sum
   ....:     
   ....:

We can pass any number of arguments to this function. First argument will be recieved by 'a' which is a formal parameter. All other arguments will be recieved by 'iargs' as a tuple. As per the case we are considering, we want to calculate the sum of all arguments except the first. So, we leave 'a' as it receives the first argument. 'iargs' is the tuple containing all arguments except the first. We will make use of function 'calculate_sum'. But 'calculate_sum' expects number of positional arguments to be sent to it which it will receive in 'args' as a tuple. So, in function 'ignore_firstargs_calculate_sum' we need to unpack 'iargs', as it is a tuple, and then send the unpacked positional arguments to 'calculate_sum'. Remember, we used '*' to unpack a list/tuple.

So, we write 'required_sum=calculate_sum(*iargs)'.

We can't write 'required_sum=calculate_sum(iargs)' because we need to unpack the values in the tuple 'iargs' before sending to 'calculate_sum'. Not using '*' will not unpack the values and hence we won't have the desired behaviour. Let's use the function we wrote.

In [34]: ignore_firstargs_calculate_sum(3,1,2,6)
sum is 9                          #Output

The output is the sum of all arguments except the first argument.

Understanding what '**' does when used from inside a function.

Let's consider a simple example first. Let's define a function which takes three arguments.

In [35]: def fun(a, b, c):
   ....:     print a, b, c
   ....:     
   ....:

Let's call this function in various ways.

In [36]: fun(1,2,3)
1 2 3                            #Output
In [37]: fun(1, b=4, c=6)
1 4 6                            #Output

Let's use "**" from inside the function call. For this we want a dictionary. Remember, while using "*" in the function call, we required a list/tuple. For using "**" in the function call, we require a dictionary.

In [38]: d={'b':5, 'c':7}

Let's call "fun" using "**" in the function call.

In [39]: fun(1, **d)
1 5 7

What "**" did while being used in a function call?

It unpacked the dictionary used with it, and passed the items in the dictionary as keyword arguments to the function. So writing "fun(1, **d)" was equivalent to writing "fun(1, b=5, c=7)".

Let's try some more examples to understand it better.

In [40]: d={'c':3}
In [42]: fun(1,2,**d)           #This is equivalent to fun(1,2,c=3)
1 2 3
In [43]: d={'a':7,'b':8,'c':9}
In [44]: fun(**d)
7 8 9                           #Output

Let's make some errors now.

In [45]: d={'a':1, 'b':2, 'c':3, 'd':4}
In [46]: fun(**d)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() got an unexpected keyword argument 'd'

Last statement was equivalent to fun(a=1, b=2, c=3, d=4). But, "fun" expected only three arguments and hence we got this error.

In [47]: d={'a':1, 'b':5, 'd':9}
In [48]: fun(**d)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() got an unexpected keyword argument 'd'

Last statement was equivalent to fun(a=1, b=5, d=9). Although it passed three arguments which is the number of arguments expected by "fun", but "fun" does not have 'd' in its parameter list. But, 'd' was passed as a keyword argument And hence we got this error.

So, "**" unpacks the dictionary i.e the key values pairs in the dictionary as keyword arguments and these are sent as keyword arguments to the function being called. "*" unpacks a list/tuple i.e the values in the list as positional arguments and these are sent as positional arguments to the function being called.

Understanding "**kwargs" in a function definition.

Let's redefine our function "fun".

In [49]: def fun(a, **kwargs):
   ....:     print a, kwargs
   ....:     
   ....:

So, this function can only take one positional argument since formal parameter list contains only one variable 'a'. But with "**kwargs", it can take any number of keyword arguments. Let's see some examples.

In [50]: fun(1, b=4, c=5)
1 {'c': 5, 'b': 4}                #Output
In [51]: fun(2, b=6, c=7, d=8)
2 {'c': 7, 'b': 6, 'd': 8}        #Output

What does "**kwargs" mean when used in a function definition?

With "**kwargs" in the function definition, kwargs receives a dictionary containing all the keyword arguments beyond the formal parameter list. Remember 'kwargs' will be a dictionary. In our previous two examples, when we printed kwargs, it printed a dictionary containing all the keyword arguments beyond the formal parameter list.

Let's again redefine our function.

In [54]: def fun(a, **kwargs):
   ....:     print "a is", a
   ....:     print "We expect kwargs 'b' and 'c' in this function"
   ....:     print "b is", kwargs['b']
   ....:     print "c is", kwargs['c']
   ....:     
   ....:

Let's call "fun" now.

In [55]: fun(1, b=3, c=5)
a is 1                  We expect kwargs 'b' and 'c' in this function
b is 3
c is 5

Let's make some errors now.

In [56]: fun(1, b=3, d=5)
a is 1
We expect kwargs 'b' and 'c' in this function
b is 3
c is---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
/home/akshar/branding_git/netconference/<ipython console> in fun(a, **kwargs)
KeyError: 'c'

We were able to call the function. First positional argument was printed. Keyword argument 'b' was printed. But the other keyword argument we passed was 'd'. Since function expected a keyword argument 'c' and tried to access it from the dictionary "kwargs". But since we did not pass any keyword argument 'c', we got this error. If we add a keyword argument 'c' in the function call, we won't get the error anymore.

In [57]: fun(1, b=3, d=5, c=7)
a is 1
We expect kwargs 'b' and 'c' in this function
b is 3
c is 7

Since having "**kwargs" in the function argument list, we can pass any number of keyword arguments. We passed 'd' but did not make any use of it in the function.

Let's make one more error.

In [58]: fun(1, {'b':2, 'c':3})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() takes exactly 1 argument (2 given)

As the error says, Function "fun" expected only one postional argument but was given two. So, although 'kwargs' receives the keyword arguments as a dictionary, you cannot pass a dictionary as a positional argument to 'kwargs'. Although you could have done somehing like:

In [59]: fun(1, **{'b':2, 'c':3})
a is 1
We expect kwargs 'b' and 'c' in this function
b is 2
c is 3

Using "**" in front of the dictionary unpacks the dictionary and passes the items in dictionary as keyword arguments.

A practical example of where we use '*args', '**kwargs' and why we use it.

Whenever we inherit a class and override some of the methods of inherited class, we should use '*args' and '**kwargs' and pass the received positional and keyword arguments to the superclass method. Can be better understood with an example.

In [4]: class Model(object):
   ...:     def __init__(self, name):
   ...:         self.name = name
   ...:     def save(self, force_update=False, force_insert=False):
   ...:         if force_update and force_insert:
   ...:             raise ValueError("Cannot perform both operations")
   ...:         if force_update:
   ...:             #Update an existing record
   ...:             print "Updated an existing record"
   ...:         if force_insert:
   ...:             #Create a new record
   ...:             print "Created a new record"
   ...:

We defined a class. We can create objects of this class and objects of this class have a method "save()". Assume that the objects of this class can be saved in a database which is being done inside the save() method. Depending on the arguments we pass to save() method, it is determined whether new records need to be created in the database or an existing record need to be updated.

We want a new class where we want 'Model' behaviour but we only want to have save the objects of this class after we have checked some conditions. So let's subclass 'Model' and override 'save()' of 'Model'.

In [6]: class ChildModel(Model):
   ...:     def save(self, *args, **kwargs):
   ...:         if self.name=='abcd':
   ...:             super(ChildModel, self).save(*args, **kwargs)
   ...:         else:
   ...:             return None
   ...:

Actual saving of object(as per our assumption, connecting with database and creating/updating) happens in the "save" method of "Model". So we need to call the "save()" method of superclass from save() method of ChildModel. Also, save() method of subclass i.e ChildModel should be able to accept any arguments that save() of superclass accepts and must pass through these arguments to the superclass save(). So we have "*args" and "**kwargs" in the argument list of subclass save() method to receive any positional arguments or keyword arguments beyond the formal parameter list.

Let's create an instance of ChildModel and save it.

In [7]: c=ChildModel('abcd')

So, we created an instance of ChildModel with name='abcd'.

In [9]: c.save(force_insert=True)
Created a new record       #Output

Here, we passed a keyword argument to save() of the object. The save() we called is the subclass save(). It received a dictionary containing the keyword argument in "kwargs". Then it used "**" to unpack this dictionary as keyword arguments and then passed it to the superclass save(). So, superclass save() got a keyword argument 'force_insert' and acted accordingly.

Let's try to pass another keyword argument.

In [10]: c.save(force_update=True)
Updated an existing record      #Output

That was all about it. Hope you liked the post.

訂閱：文章 (Atom)

2015年11月16日 星期一