2013年1月31日 星期四

[Python] BeautifulSoup 中Tag 物件的 string 與 text 屬性的差別是什麼?

 BeautifulSoup 是一款相當著名的 Python HTML/XML parser。簡單的使用方式讓幾乎任何人都能很輕易的上手並使用這個 library。唯一的問題是,BeautifulSoup 的文件實在太少,而且似乎沒有一份完整的 API 說明,造成在使用某些功能時必須三不五時的按 F3 trace library 原始碼,實在有點擾人。在這邊要說明的就是其中一個困擾我蠻久的一個問題:Tag物件中stringtext屬性的差別到底是什麼?

 這兩個屬性間的差別是:Tag.string代表的是該 HTML/XML 標籤內含的文字,其資料型態為bs4.NavigableString。例如:

例一:
HTML:
<td width="20%"> Something </td>

Python code:
td = soup.find('td')
print table.string # Something
print table.text # Something

 Tag.text則是該標籤中,所有子標籤內含的文字。若其中有標籤不具備文字內容,則text會將其視為空字串一併丟出。所以,Tag.text總是有值,而不會出現為None的情況

例二:

soup = BeautifulSoup('<img src="/img/test.jpg"/>')

print soup.string # None
print soup.text # (空字串)

例三 :
HTML:
<table>
  <tr>
    <td width="20%"> ID </td> 
    <td width="30%"> 姓名 </td>
  <tr/>
  <tr>
    <td width="20%"> 001 </td> 
    <td width="30%"> Shih-Peng Lin </td>
  <tr/>
</table>

Python code:
table = soup.find('table')
print table.string # None
print table.text  
# ID 
# 姓名 
# 001 
# Shih-Peng Lin

 換句話說就是:如果一個Tagstring,那他就必定有text;但如果Tagtext屬性,則不一定會有string屬性。

2013年1月2日 星期三

[Python] Python中的多執行緒


 在Python中如果要使用執行緒的話,其函式庫中提供了兩種方法:一種是以函式的方式使用,另一種則是將工作包裝成類別的方式。
1. 使用thread模組中的start_new_thread()來建立新的執行緒:
  1. import time  
  2. import thread  
  3.  
  4. def timer(no,interval):  #自己写的线程函数   
  5.     while True:   
  6.         print 'Thread :(%d) Time:%s'%(no,time.ctime())   
  7.     time.sleep(interval)   
  8.  
  9. def test(): #使用thread.start_new_thread()来产生2个新的线程   
  10.     thread.start_new_thread(timer,(1,1))     
  11.     thread.start_new_thread(timer,(2,3))   
  12.  
  13. if __name__=='__main__':   
  14.     test()  

 这个是thread.start_new_thread(function,args[,kwargs])函数原型,其中function参数是你将要调用的线程函数;args是讲传递给你的线程函数的参数,他必须是个tuple类型;而kwargs是可选的参数。
 线程的结束一般依靠线程函数的自然结束;也可以在线程函数中调用thread.exit(),他抛出SystemExit exception,达到退出线程的目的。
2、將工作包裝成threading.Thread的子類別來使用多執行緒:
  1. import threading  
  2. import time  
  3. class timer(threading.Thread):     #我的timer类继承自threading.Thread类   
  4.     def __init__(self,no,interval):    
  5.         #在我重写__init__方法的时候要记得调用基类的__init__方法   
  6.         threading.Thread.__init__(self)        
  7.         self.no=no   
  8.         self.interval=interval   
  9.            
  10.     def run(self):  #重写run()方法,把自己的线程函数的代码放到这里   
  11.         while True:   
  12.             print 'Thread Object (%d), Time:%s'%(self.no,time.ctime())   
  13.         time.sleep(self.interval)   
  14.                
  15. def test():   
  16.     threadone=timer(1,1)    #产生2个线程对象   
  17.     threadtwo=timer(2,3)   
  18.     threadone.start()   #通过调用线程对象的.start()方法来激活线程   
  19.     threadtwo.start()   
  20.        
  21. if __name__=='__main__':   
  22.     test()  

[Python] 了解 *、*args、** 與 **kwargs


When i started learning Python, i was very confused regarding what args, kwargs, * and ** does. And i feel there are few like me who had this confusion and problem. With this post, i intend to reduce (hopefully i can eliminate) that confusion.
Throughout this post, i will be using ipython and i suggest you to try everything on ipython as well. We will intentionally make some mistakes along the way, so that we can understand this topic better.

Let's divide our work under five sections:

  1. Understanding what '*' does from inside a function call.
  2. Understanding what '*args' mean inside a function definition.
  3. Understanding what '**' does from inside a function call.
  4. Understanding what '**kwargs' mean inside a function definition.
  5. A practical example of where we use 'args', 'kwargs' and why we use it.

Understanding what * does from inside a function call.

Let's define a function "fun" which takes three positional arguments.
In [5]: def fun(a, b, c):
   ...:     print a, b, c
   ...:     
   ...:
Call this function passing three positional arguments
In [7]: fun(1,2,3)
1 2 3                     #Output
So, calling this function by passing three positional arguments, prints the three arguments passed to the function.
Let's make a list with three integer values in it.
In [8]: l = [1,2,3]
Let's use '*' now.
In [9]: fun(*l)
1 2 3                     #Output

What did '*' do?

It unpacked the values in list 'l' as positional arguments. And then the unpacked values were passed to function 'fun' as positional arguments.
So, unpacking the values in list and changing it to positional arguments meant writing fun(*l) was equivalent to writing fun(1,2,3). Keep in mind that l=[1,2,3]. Let's try with some other value of 'l'.
In [10]: l=[5,7,9]
In [11]: fun(*l)
5 7 9                     #Output
Let's make some errors now. Let's put four values in "l".
In [12]: l=[3,5,6,9]
Now, try to call function "fun".
In [13]: fun(*l)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() takes exactly 3 arguments (4 given)
So, in our last statement which is 'fun(*l)' we did not get a proper output and a TypeError was raised. See the error, it says "TypeError: fun() takes exactly 3 arguments (4 given)".

Why did it happen?

list 'l' contains four values. So, when we tried 'fun(*l)', 'l' was unpacked so that its value could be sent as positional arguments. But, "l" has four values in it. So, writing 'fun(*l)' was equivalent to writing 'fun(3,5,6,9)'. But, 'fun' is defined to take only three positional arguments and hence we got this error. Similary, you can follow same steps with two values in list 'l' and notice the error.
In [14]: l=[5,6]
In [15]: fun(*l)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() takes exactly 3 arguments (2 given)
Let's mix '*l' with a positional argument.
In [16]: fun(1, *l)
1 5 6                  #Output.
Here we gave one positional argument which is 1 and two values i.e 5 and 6 were unpacked from "l" and hence 1,5 and 6 were passed to 'fun'.
Hope you were able to follow what '*' does when used inside a function call.

Understanding what '*args' mean inside a function definition.

Let's change the function definition now.
In [18]: def fun(*args):
   ....:     print args
   ....:     
   ....:
Call this function with one positional argument.
In [19]: fun(1)
(1,)                  #Output
Now call this function with two positional arguments or any number of positional arguments you wish.
In [20]: fun(1,2,3)
(1, 2, 3)

What does '*args' in a function definition do?

It recieves a tuple containing the positional arguments beyond the formal parameter list. So, "args" is a tuple. Don't worry about the part "formal parameter list" in our explanation, it will be clear with next few examples. In our last example when we printed "args", it printed a tuple which contained all the values we passed while calling the function.
Let's mix "*args" with some "formal parameter list". Note that in our last example we didn't have any formal parameter list. Let's redefine our function.
In [21]: def fun(a, *args):
   ....:     print "a is ", a
   ....:     print "args is ", args
   ....:     
   ....:
In this function definition, parameter "a" constitue the "formal parameter list". Let's call "fun" with four positional argument.
In [22]: fun(1, 2, 3, 4)
a is  1                                #Output
args is  (2, 3, 4)
So, we can see that 'a' is assigned 1 which was the first positional argument. There is only one parameter "*args" defined after "a". So, "args" received a tuple containing the positional arguments beyond the formal parameter list. So, args received 2, 3 and 4 as a tuple.
We can also call "fun" with just one positional argument. Let's do that.
In [23]: fun(1)
a is  1                                #Output
args is  ()                            #args received an empty tuple
Here, we passed only one argument to the function which was assigned to the formal parameter 'a'. So, 'args' received an empty tuple as can be seen from the output.
After we have "args", we can extract the values and do whatever we want. Let's redefine "fun".
In [24]: def fun(a, *args):
   ....:     print a
   ....:     print "args can receive a tuple of any number of arguments. Let's print all that."
   ....:     for arg in args:
   ....:         print arg
   ....:         
   ....:
We can call "fun" with any number of arguments.
In [25]: fun(1,5,6,7)
1                                                                         #Output
args can receive a tuple of any number of arguments. Let's print all that.
5
6
7
Since 'args' is a tuple, we could iterate over it.
Now, let's consider a case where we use whatever we saw till here. Here we need to use two functions. First function has to take an arbitrary number of arguments and it has to calculate the sum of all the arguments except the first argument. Also, this function has to make use of another function to calculate the sum. Weird use case, but we just need to recap whatever we did till here. The objective here is to see how we get a variable number of arguments in a function and pass these arguments to another function.
Let's first write the function which has to calculate the sum i.e the second function. For our use case, this function will be used in the first function which we are yet to write.
In [26]: def calculate_sum(*args):
   ....:     return sum(args)
   ....:
Here we make use of 'sum'. Function 'sum' is an inbuilt function which takes a tuple or a list and return the sum of all the elements in the tuple. From our function definition we can see that 'args' will receive a tuple containing all the positional arguments passed to this function. So, 'args' will be a tuple and can be directly used as an argument to function 'sum'. Let's write the other function which takes any number of arguments and uses previous function to calulate the sum of all arguments except the first argument.
In [29]: def ignore_firstargs_calculate_sum(a, *iargs):
   ....:     required_sum = calculate_sum(*iargs)
   ....:     print "sum is", required_sum
   ....:     
   ....:
We can pass any number of arguments to this function. First argument will be recieved by 'a' which is a formal parameter. All other arguments will be recieved by 'iargs' as a tuple. As per the case we are considering, we want to calculate the sum of all arguments except the first. So, we leave 'a' as it receives the first argument. 'iargs' is the tuple containing all arguments except the first. We will make use of function 'calculate_sum'. But 'calculate_sum' expects number of positional arguments to be sent to it which it will receive in 'args' as a tuple. So, in function 'ignore_firstargs_calculate_sum' we need to unpack 'iargs', as it is a tuple, and then send the unpacked positional arguments to 'calculate_sum'. Remember, we used '*' to unpack a list/tuple.
So, we write 'required_sum=calculate_sum(*iargs)'.
We can't write 'required_sum=calculate_sum(iargs)' because we need to unpack the values in the tuple 'iargs' before sending to 'calculate_sum'. Not using '*' will not unpack the values and hence we won't have the desired behaviour. Let's use the function we wrote.
In [34]: ignore_firstargs_calculate_sum(3,1,2,6)
sum is 9                          #Output
The output is the sum of all arguments except the first argument.

Understanding what '**' does when used from inside a function.

Let's consider a simple example first. Let's define a function which takes three arguments.
In [35]: def fun(a, b, c):
   ....:     print a, b, c
   ....:     
   ....:
Let's call this function in various ways.
In [36]: fun(1,2,3)
1 2 3                            #Output
In [37]: fun(1, b=4, c=6)
1 4 6                            #Output
Let's use "**" from inside the function call. For this we want a dictionary. Remember, while using "*" in the function call, we required a list/tuple. For using "**" in the function call, we require a dictionary.
In [38]: d={'b':5, 'c':7}
Let's call "fun" using "**" in the function call.
In [39]: fun(1, **d)
1 5 7

What "**" did while being used in a function call?

It unpacked the dictionary used with it, and passed the items in the dictionary as keyword arguments to the function. So writing "fun(1, **d)" was equivalent to writing "fun(1, b=5, c=7)".
Let's try some more examples to understand it better.
In [40]: d={'c':3}
In [42]: fun(1,2,**d)           #This is equivalent to fun(1,2,c=3)
1 2 3
In [43]: d={'a':7,'b':8,'c':9}
In [44]: fun(**d)
7 8 9                           #Output
Let's make some errors now.
In [45]: d={'a':1, 'b':2, 'c':3, 'd':4}
In [46]: fun(**d)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() got an unexpected keyword argument 'd'
Last statement was equivalent to fun(a=1, b=2, c=3, d=4). But, "fun" expected only three arguments and hence we got this error.
In [47]: d={'a':1, 'b':5, 'd':9}
In [48]: fun(**d)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() got an unexpected keyword argument 'd'
Last statement was equivalent to fun(a=1, b=5, d=9). Although it passed three arguments which is the number of arguments expected by "fun", but "fun" does not have 'd' in its parameter list. But, 'd' was passed as a keyword argument And hence we got this error.
So, "**" unpacks the dictionary i.e the key values pairs in the dictionary as keyword arguments and these are sent as keyword arguments to the function being called. "*" unpacks a list/tuple i.e the values in the list as positional arguments and these are sent as positional arguments to the function being called.

Understanding "**kwargs" in a function definition.

Let's redefine our function "fun".
In [49]: def fun(a, **kwargs):
   ....:     print a, kwargs
   ....:     
   ....:
So, this function can only take one positional argument since formal parameter list contains only one variable 'a'. But with "**kwargs", it can take any number of keyword arguments. Let's see some examples.
In [50]: fun(1, b=4, c=5)
1 {'c': 5, 'b': 4}                #Output
In [51]: fun(2, b=6, c=7, d=8)
2 {'c': 7, 'b': 6, 'd': 8}        #Output

What does "**kwargs" mean when used in a function definition?

With "**kwargs" in the function definition, kwargs receives a dictionary containing all the keyword arguments beyond the formal parameter list. Remember 'kwargs' will be a dictionary. In our previous two examples, when we printed kwargs, it printed a dictionary containing all the keyword arguments beyond the formal parameter list.
Let's again redefine our function.
In [54]: def fun(a, **kwargs):
   ....:     print "a is", a
   ....:     print "We expect kwargs 'b' and 'c' in this function"
   ....:     print "b is", kwargs['b']
   ....:     print "c is", kwargs['c']
   ....:     
   ....:
Let's call "fun" now.
In [55]: fun(1, b=3, c=5)
a is 1                  We expect kwargs 'b' and 'c' in this function
b is 3
c is 5
Let's make some errors now.
In [56]: fun(1, b=3, d=5)
a is 1
We expect kwargs 'b' and 'c' in this function
b is 3
c is---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
/home/akshar/branding_git/netconference/<ipython console> in fun(a, **kwargs)
KeyError: 'c'
We were able to call the function. First positional argument was printed. Keyword argument 'b' was printed. But the other keyword argument we passed was 'd'. Since function expected a keyword argument 'c' and tried to access it from the dictionary "kwargs". But since we did not pass any keyword argument 'c', we got this error. If we add a keyword argument 'c' in the function call, we won't get the error anymore.
In [57]: fun(1, b=3, d=5, c=7)
a is 1
We expect kwargs 'b' and 'c' in this function
b is 3
c is 7
Since having "**kwargs" in the function argument list, we can pass any number of keyword arguments. We passed 'd' but did not make any use of it in the function.
Let's make one more error.
In [58]: fun(1, {'b':2, 'c':3})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/akshar/branding_git/netconference/<ipython console> in <module>()
TypeError: fun() takes exactly 1 argument (2 given)
As the error says, Function "fun" expected only one postional argument but was given two. So, although 'kwargs' receives the keyword arguments as a dictionary, you cannot pass a dictionary as a positional argument to 'kwargs'. Although you could have done somehing like:
In [59]: fun(1, **{'b':2, 'c':3})
a is 1
We expect kwargs 'b' and 'c' in this function
b is 2
c is 3
Using "**" in front of the dictionary unpacks the dictionary and passes the items in dictionary as keyword arguments.

A practical example of where we use 'args', '*kwargs' and why we use it.

Whenever we inherit a class and override some of the methods of inherited class, we should use '*args' and '**kwargs' and pass the received positional and keyword arguments to the superclass method. Can be better understood with an example.
In [4]: class Model(object):
   ...:     def __init__(self, name):
   ...:         self.name = name
   ...:     def save(self, force_update=False, force_insert=False):
   ...:         if force_update and force_insert:
   ...:             raise ValueError("Cannot perform both operations")
   ...:         if force_update:
   ...:             #Update an existing record
   ...:             print "Updated an existing record"
   ...:         if force_insert:
   ...:             #Create a new record
   ...:             print "Created a new record"
   ...:
We defined a class. We can create objects of this class and objects of this class have a method "save()". Assume that the objects of this class can be saved in a database which is being done inside the save() method. Depending on the arguments we pass to save() method, it is determined whether new records need to be created in the database or an existing record need to be updated.
We want a new class where we want 'Model' behaviour but we only want to have save the objects of this class after we have checked some conditions. So let's subclass 'Model' and override 'save()' of 'Model'.
In [6]: class ChildModel(Model):
   ...:     def save(self, *args, **kwargs):
   ...:         if self.name=='abcd':
   ...:             super(ChildModel, self).save(*args, **kwargs)
   ...:         else:
   ...:             return None
   ...:
Actual saving of object(as per our assumption, connecting with database and creating/updating) happens in the "save" method of "Model". So we need to call the "save()" method of superclass from save() method of ChildModel. Also, save() method of subclass i.e ChildModel should be able to accept any arguments that save() of superclass accepts and must pass through these arguments to the superclass save(). So we have "*args" and "**kwargs" in the argument list of subclass save() method to receive any positional arguments or keyword arguments beyond the formal parameter list.
Let's create an instance of ChildModel and save it.
In [7]: c=ChildModel('abcd')
So, we created an instance of ChildModel with name='abcd'.
In [9]: c.save(force_insert=True)
Created a new record       #Output
Here, we passed a keyword argument to save() of the object. The save() we called is the subclass save(). It received a dictionary containing the keyword argument in "kwargs". Then it used "**" to unpack this dictionary as keyword arguments and then passed it to the superclass save(). So, superclass save() got a keyword argument 'force_insert' and acted accordingly.
Let's try to pass another keyword argument.
In [10]: c.save(force_update=True)
Updated an existing record      #Output
That was all about it. Hope you liked the post.