python modules

[TOC]
Python features a dynamic type system and automatic memory management and supports multiple programming paradigms.

module

Python regular expressions: the re module

asyncio

ref1: A Beginner's Guide to Python Async/Await (Python Async/Await入门指南)

Note: asynchronous generators and asynchronous comprehensions were not added until Python 3.6.

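The four common kinds of functions in Python:

  1. regular function
  2. generator function
    Since Python 3.5, async def turns a regular function into an asynchronous function; since Python 3.6, a generator function defined with async def becomes an asynchronous generator.
  3. asynchronous function (coroutine)
  4. asynchronous generator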
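A minimal sketch of the four kinds, using the standard inspect module to tell them apart. The function names and the classify helper are made up for illustration.

```python
import inspect

def regular():
    return 1

def generator():
    yield 1

async def coroutine_function():
    return 1

async def async_generator():
    yield 1

def classify(func):
    """Name the kind of function, using the predicates from inspect."""
    if inspect.isasyncgenfunction(func):
        return "asynchronous generator"
    if inspect.iscoroutinefunction(func):
        return "asynchronous function"
    if inspect.isgeneratorfunction(func):
        return "generator function"
    return "regular function"

for f in (regular, generator, coroutine_function, async_generator):
    print(f.__name__, "->", classify(f))
```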

coroutine

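  1. Calling an asynchronous function directly does not return its result; it returns a coroutine object:

    print(async_function())
  2. A coroutine has to be driven by something else, so we can use the coroutine object's send method to send it a value:
    print(async_function().send(None))
    Unfortunately, this call raises an exception:
    StopIteration: 1
    When a generator or coroutine returns normally it raises StopIteration, and the original return value is stored in the value attribute of the StopIteration object. Catching the exception as follows recovers the coroutine's real return value:

    def run(coroutine):
        try:
            coroutine.send(None)
        except StopIteration as e:
            return e.value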
  3. await
    await: suspends the execution of the coroutine until the awaitable it takes completes and returns the result.

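The await syntax may only appear inside a function defined with async; anywhere else it raises a SyntaxError.

How it works:
The object after await must be an Awaitable, or implement the corresponding protocol.
The source of the Awaitable abstract class shows that any class implementing the __await__ method produces instances that are Awaitable.
The Coroutine class inherits from Awaitable and also implements the send, throw and close methods, so awaiting the coroutine object returned by calling an asynchronous function is legal.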
For more details see ref1.
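To make the __await__ protocol concrete, here is a small sketch. The Answer class is hypothetical, and delegating to asyncio.sleep(0) is just one simple way to obtain the iterator that __await__ must return.

```python
import asyncio
from collections.abc import Awaitable

class Answer:
    """Hypothetical awaitable: implementing __await__ is all it takes."""

    def __init__(self, value):
        self.value = value

    def __await__(self):
        # __await__ must return an iterator. Delegating to a coroutine's own
        # __await__ hands control to the event loop once, then we return.
        yield from asyncio.sleep(0).__await__()
        return self.value

async def main():
    return await Answer(42)

is_awaitable = isinstance(Answer(0), Awaitable)
result = asyncio.run(main())  # asyncio.run needs Python 3.7+
print(is_awaitable, result)
```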

event

  • run_until_complete

    loop = asyncio.get_event_loop()
    ...
    loop.run_until_complete(future)  # blocking: returns only when the future is done
  • Event loop is closed
    In Jupyter, one project is one process, and by default a process has only one event loop. For example:

    >>> import asyncio
    >>> asyncio.get_event_loop().close()
    >>> asyncio.get_event_loop().is_closed()
    True
    >>> asyncio.get_event_loop().run_until_complete(asyncio.sleep(1))
    .....
    RuntimeError: Event loop is closed

The event loop machinery in asyncio is complicated. To quote the well-known article "I don’t understand Python’s Asyncio" by the author of Flask:

On the surface it looks like each thread has one event loop but that’s not really how it works.

  • if you are the main thread an event loop is created when you call asyncio.get_event_loop()
  • if you are any other thread, a runtime error is raised from asyncio.get_event_loop()
  • You can at any point asyncio.set_event_loop() to bind an event loop with the current thread. Such an event loop can be created with the asyncio.new_event_loop() function.
  • Event loops can be used without being bound to the current thread.
  • asyncio.get_event_loop() returns the thread bound event loop, it does not return the currently running event loop.

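See the original article for a more detailed discussion; in short, there are many pitfalls here.

To avoid the problem in the earlier example, see the article "Using the event loop multiple times in one Python program" (在一段Python程序中使用多次事件循环). You can proceed as follows:

Create a new event loop with asyncio.new_event_loop and install it as the global event loop with asyncio.set_event_loop; the event loop can then be run multiple times. It is best, however, to save the default loop returned by asyncio.get_event_loop and restore it once the loops have finished.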

import asyncio

async def doAsync():
    await asyncio.sleep(0)
    #...

def runEventLoop():
    # a fresh loop for every run, so closing it does not affect later runs
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(doAsync())
    loop.close()

if __name__ == "__main__":
    oldloop = asyncio.get_event_loop()
    runEventLoop()
    runEventLoop()
    asyncio.set_event_loop(oldloop)
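On Python 3.7 and later, asyncio.run does the same bookkeeping itself: each call creates a new event loop and closes it at the end, so it can be called repeatedly. A short sketch, where do_async is an illustrative coroutine:

```python
import asyncio

async def do_async(n):
    await asyncio.sleep(0)
    return n * 2

# Each asyncio.run call gets its own fresh event loop and closes it afterwards.
first = asyncio.run(do_async(1))
second = asyncio.run(do_async(2))
print(first, second)
```

Note that asyncio.run cannot be used where a loop is already running, such as inside a Jupyter kernel.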

gather

ref: Getting values from functions that run as asyncio tasks

loop = asyncio.get_event_loop()
tasks = func_normal(), func_infinite()
a, b = loop.run_until_complete(asyncio.gather(*tasks))
# **vars() passes the local variables as keyword arguments, so {a} and {b} are filled from a and b
print("func_normal()={a}, func_infinite()={b}".format(**vars()))
loop.close()

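In addition, run_until_complete returns the result of whatever it runs; with asyncio.wait the return values are read from the completed futures:

# note: on Python 3.11+ asyncio.wait requires Task/Future objects, not bare coroutines
done, _ = loop.run_until_complete(asyncio.wait(tasks))
for fut in done:
    print("return value is {}".format(fut.result()))
loop.close()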
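The fragments above depend on coroutines defined elsewhere. A self-contained sketch of collecting return values with gather might look like this; func_normal and func_other are stand-ins invented here, not the functions from the referenced answer.

```python
import asyncio

async def func_normal():
    await asyncio.sleep(0.01)
    return "normal"

async def func_other():
    await asyncio.sleep(0)
    return "other"

async def main():
    # gather returns results in the order the awaitables were passed in,
    # regardless of which one finishes first
    return await asyncio.gather(func_normal(), func_other())

a, b = asyncio.run(main())
print("func_normal()={}, func_other()={}".format(a, b))
```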

Tips

  • asyncio.run() was added in Python 3.7
  • asyncio.sleep(1)
    This is the sleep() of asynchronous I/O, and it is itself a coroutine. Do not use time.sleep() in asynchronous code: time.sleep() blocks the whole thread.
  • RuntimeError: This event loop is already running
    This problem happens because:

    The kernel itself runs on an event loop, and as of Tornado 5.0, it’s using the asyncio event loop. So the asyncio event loop is always running in the kernel.

Can’t invoke asyncio event_loop after tornado 5.0 update
Solution:
pip3 install tornado==4.5.3 and restart the notebook

aiohttp

ref: Asynchronous crawlers: using async/await with aiohttp, with examples (异步爬虫: async/await 与 aiohttp的使用,以及例子)

  • get

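    # note: aiohttp.get is a module-level helper from old aiohttp releases
    async with aiohttp.get('https://github.com') as r:
        await r.text()
  • timeout

    # note: aiohttp.Timeout also comes from old aiohttp releases
    with aiohttp.Timeout(0.001):
        async with aiohttp.get('https://github.com') as r:
            await r.text()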
  • session

    A session supports many operations, such as post, get, put, head, and so on.

import aiohttp

async def getPage(url, res_list):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    ### example of proxy and cookie
    # conn = aiohttp.ProxyConnector(proxy="http://127.0.0.1:8087")
    # session.get(url,headers=headers, connector=conn)
    # async with ClientSession({'cookies_are': 'working'}) as session:
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            assert resp.status == 200
            res_list.append(await resp.text())

atexit

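ref: Understanding Python's atexit module in depth (深入理解python中的atexit模块). A very good introduction; consult it if you need to know more.

Introduction to the atexit module
The Python atexit module defines a register function that registers an exit function with the interpreter. The function runs automatically when the interpreter terminates normally and is typically used to clean up resources. atexit runs these functions in the reverse of the order in which they were registered: if A, B and C are registered, they run in the order C, B, A when the interpreter terminates.
Note: if the program crashes abnormally, or exits through os._exit(), the registered exit functions are not called.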
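The reverse-order behaviour can be observed with a small sketch. Exit functions only run when an interpreter shuts down, so this example (an illustrative setup, not taken from the referenced article) registers them in a child interpreter and captures what it prints.

```python
import subprocess
import sys

# Code run in a child interpreter: register print("A"), print("B"), print("C").
child_code = """
import atexit
for name in ("A", "B", "C"):
    atexit.register(print, name)
"""

completed = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
order = completed.stdout.split()
print(order)  # the functions run last-registered first
```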

heapq (Heap queue algorithm)

  • nlargest/nsmallest
    import heapq
    nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
    heapq.nlargest(3, nums)  # [42, 37, 23]
    # or more complicated:
    portfolio = [
        {'name': 'IBM', 'shares': 100, 'price': 91.1},
        {'name': 'AAPL', 'shares': 50, 'price': 543.22},
        {'name': 'FB', 'shares': 200, 'price': 21.09},
        {'name': 'HPQ', 'shares': 35, 'price': 31.75},
        {'name': 'YHOO', 'shares': 45, 'price': 16.35},
        {'name': 'ACME', 'shares': 75, 'price': 115.65}
    ]
    cheap = heapq.nsmallest(3, portfolio, key=lambda s: s['price'])

re module

compile

[Python Regex Flags](http://xahlee.info/python/python_regex_flags.html)

  1. compile(r'[0-9]+')
    Usage:

    import re
    pattern = re.compile(r'[0-9]+')
    first_number = pattern.match(your_str).group(0)  # match() only matches at the start of your_str
  2. another example

    import re
    mys = 'abe(ac)ad)'
    p1 = re.compile(r'[(](.*?)[)]', re.S)
    match_list = re.findall(p1, mys) # findall returns a list of the matched strings

other func

  1. replace(str1, str2)

  2. split()

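    The split() method of string objects only suits very simple splitting. It does not allow multiple delimiters, or an uncertain amount of whitespace around the delimiters.

    >>> line = 'asdf fjdk; afed, fjek,asdf, foo'
    >>> re.split(r'[;,\s]\s*', line)
    ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

    When using re.split(), pay particular attention to whether the regular expression contains a parenthesized capture group. If it does, the matched delimiter text also appears in the result list.

    >>> fields = re.split(r'(;|,|\s)\s*', line)
    >>> fields
    ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']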
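If the parentheses are needed only for grouping, a non-capturing group (?:...) keeps the delimiters out of the result. The sketch below also shows how to separate values from delimiters when a capture group is used; the variable names are illustrative.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# Non-capturing group: delimiters are not included in the result.
plain = re.split(r'(?:;|,|\s)\s*', line)

# Capturing group: values sit at even indexes, delimiters at odd indexes.
captured = re.split(r'(;|,|\s)\s*', line)
values = captured[::2]
delimiters = captured[1::2]

print(plain)
print(delimiters)
```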

sys module

  • sys.version / sys.version_info (version_info is a named tuple object)

os module

1. about path

The os.path module

  • about path

    os.getcwd()
    os.chdir('../')
    os.listdir()
    os.chdir('../')
    os.path.exists(path)  # determine whether a file or directory exists
  • os.remove
    Remove a file; if the file does not exist, FileNotFoundError is raised.

  • os.rmdir
    Remove an empty directory.

  • os.path.splitext(path)
    Split the pathname path into a pair (root, ext)

  • os.path.basename(your_path)
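A quick sketch of splitext and basename on an illustrative path (the path is made up and does not need to exist, since both functions only manipulate strings):

```python
import os.path

path = '/data/report.tar.gz'

# splitext splits off only the last extension
root, ext = os.path.splitext(path)

# basename returns the final path component
name = os.path.basename(path)

print(root, ext, name)
```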

2. os.popen

  1. The shell command is executed in a background subprocess, and you cannot change that.

  2. It is implemented with subprocess.Popen, so why not use subprocess directly?

python doc

time & calendar

Python dates and times (runoob)

  • 'datetime' object & 'timedelta' object

    >>> from datetime import datetime
    >>> FMT = "%Y-%m-%dT%H:%M:%S.%fZ"
    >>> datetime.strptime("2018-11-28T19:57:01.522Z",FMT)
    datetime.datetime(2018, 11, 28, 19, 57, 1, 522000)
    >>> datetime.strptime("2018-11-28T19:56:55.557Z",FMT) - datetime.strptime("2018-11-28T19:57:01.522Z",FMT)
    datetime.timedelta(-1, 86394, 35000)
  • time.time()
    Return the current time in seconds since the Epoch.

  • mktime(tupletime)

  • localtime()
    Convert seconds since the Epoch to a time tuple expressing local time.
    When ‘seconds’ is not passed in, convert the current time instead.

  • strftime()
    strftime(format[, tuple]) # -> string

  • strptime()
    strptime(string, format) # -> struct_time
    Parse a string to a time tuple according to a format specification.

  • time.sleep(secs)

  • calendar.isleap(year)

  • calendar.weekday(year,month,day)
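The functions listed above can be combined in a short round trip. This is a sketch with an arbitrary timestamp and format string chosen for illustration:

```python
import calendar
import time

FMT = "%Y-%m-%d %H:%M:%S"

# strptime: string -> struct_time; strftime: struct_time -> string
parsed = time.strptime("2018-11-28 19:57:01", FMT)
text = time.strftime(FMT, parsed)

leap = calendar.isleap(2020)
# weekday() counts from Monday = 0
weekday = calendar.weekday(2018, 11, 28)

print(parsed.tm_year, text, leap, weekday)
```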

subprocess module

python doc

1. For commands whose stdout is not needed

subprocess.run("cp standard_py/*py .", shell=True, check=True)

  • shell=True
    The command can be given as a single string instead of a sequence of arguments.
  • check=True
    Raise CalledProcessError if the shell command exits with a non-zero status.

2. For commands whose stdout is needed

An example:

ret = subprocess.run("ls standard_py/*py", shell=True, check=True, universal_newlines=True, stdout=subprocess.PIPE)
print(ret.stdout, end="")
  • universal_newlines=True
    From the Python doc, on the stdout attribute of the result:
    Captured stdout from the child process. A bytes sequence, or a string if run() was called with universal_newlines=True. None if stdout was not captured.

  • stdout=PIPE
    Without this argument, the command's output goes to the stdout of the Python script instead of being captured. From the Python doc:

    This (i.e. run()) does not capture stdout or stderr by default. To do so, pass PIPE for the stdout and/or stderr arguments.
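A portable sketch of both options, using the current Python interpreter as the child command instead of a shell command so that it runs anywhere. The child programs are made up for illustration.

```python
import subprocess
import sys

# Capture stdout as text: stdout=PIPE plus universal_newlines=True.
ret = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    check=True,
    universal_newlines=True,
    stdout=subprocess.PIPE,
)
print(ret.stdout, end="")

# check=True turns a non-zero exit status into CalledProcessError.
failed = False
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], check=True)
except subprocess.CalledProcessError as err:
    failed = err.returncode == 3
```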

glob module

The glob module finds all the pathnames matching a specified pattern

  • glob.glob(pathname, *, recursive=False)
  • glob.iglob(pathname, recursive=False)
    Return an iterator which yields the same values as glob()
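A small sketch of glob and iglob on a throwaway directory tree. The file names are invented for the example; with recursive=True the "**" pattern matches any number of directories, including none.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "pkg"))
    for rel in ("a.py", "b.txt", os.path.join("pkg", "c.py")):
        open(os.path.join(top, rel), "w").close()

    # Only the top level
    flat = sorted(os.path.relpath(p, top)
                  for p in glob.glob(os.path.join(top, "*.py")))

    # Top level and all subdirectories, produced lazily by iglob
    deep = sorted(os.path.relpath(p, top)
                  for p in glob.iglob(os.path.join(top, "**", "*.py"), recursive=True))

print(flat, deep)
```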

collections module

Liao Xuefeng (廖雪峰)

  • Counter

    >>> from collections import Counter
    >>> Counter([1,2,2,2,2,3,3,3,4,4,4,4])
    Counter({2: 4, 4: 4, 3: 3, 1: 1})
  • deque (double-ended queue, pronounced 'deck')
    doc: deque([iterable[, maxlen]])
    method: pop (popleft) / append / extend; clear / copy (a shallow copy) / insert / remove; count / index; reverse / rotate

  • python doc: OrderedDict
    This dict records insertion order. Since Python 3.7 the built-in dict does too; OrderedDict additionally offers move_to_end() and order-sensitive equality comparison.
    NOTE: an OrderedDict may look like a list of tuples, but it is definitely not one!

    import collections
    dic = collections.OrderedDict()
    dic['k1'] = 'v1'

Standard Usage:

from collections import OrderedDict
# dictionary sorted by key
OrderedDict(sorted(d.items(), key=lambda t: t[0]))
# dictionary sorted by value
od = OrderedDict(sorted(mydict.items(), key=lambda t: t[1], reverse=True))
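A compact sketch exercising a few of the methods named above for Counter, deque and OrderedDict. The sample data is arbitrary.

```python
from collections import Counter, OrderedDict, deque

# Counter: tally items and ask for the most common one
counts = Counter("abracadabra")
most_common = counts.most_common(1)

# deque with maxlen: appending to a full deque drops items from the other end
recent = deque([1, 2, 3], maxlen=3)
recent.append(4)   # now holds 2, 3, 4
recent.rotate(1)   # rotate right by one: 4, 2, 3

# OrderedDict: reordering and order-sensitive equality
od = OrderedDict([("k1", "v1"), ("k2", "v2")])
od.move_to_end("k1")
same_as_reordered = od == OrderedDict([("k1", "v1"), ("k2", "v2")])

print(most_common, list(recent), list(od), same_as_reordered)
```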

PIL

  • img = Image.open('origin.png') # many formats are supported
    Note: just like .htm and .html, .jpg and .jpeg are the same thing, merely two spellings.
  • font
    font = ImageFont.truetype("FreeMono.ttf", 28, encoding="unic")
  • img2.save('./test_image_data/cat_001_blur.jpeg','jpeg') # save('path', 'format')
  • resize(), rotate(), convert(mode='your_mode')
  • The coordinate origin (0, 0) is the upper left corner.
  • draw = ImageDraw.Draw(img)
    # img is from:
    # img = Image.open('./test_image_data/cat_001.jpg')
    draw.text((width - add_width, 0), number, font=font, fill=fillcolor) # the first parameter is the start point of the text

random

doc

  • random.seed(a=None, version=2)

functions for integers

  • random.randrange(start, stop[, step])
  • random.randint(a, b)

    Return a random integer N such that a <= N <= b. Alias for randrange(a, b+1).

real-valued distributions

  • random.random()

    Return the next random floating point number in the range [0.0, 1.0).

  • random.uniform(a, b)
  • random.gauss(mu, sigma)
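One practical use of random.seed is reproducibility: reseeding with the same value replays the same sequence. A minimal sketch, with an arbitrary seed value:

```python
import random

random.seed(42)
first = [random.randint(1, 6) for _ in range(5)]

random.seed(42)  # same seed, so the same sequence is produced again
second = [random.randint(1, 6) for _ in range(5)]

sample = random.random()        # float in [0.0, 1.0)
between = random.uniform(2, 3)  # float between 2 and 3

print(first == second)
```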

shutil

The shutil module offers a number of high-level operations on files and collections of files. In particular, functions are provided which support file copying and removal.

shutil.copyfile(src, dst)
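A minimal sketch of copyfile on temporary files (the file names are invented for the example). It copies the file contents only and returns the destination path.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    dst = os.path.join(tmp, "dst.txt")
    with open(src, "w") as f:
        f.write("payload")

    returned = shutil.copyfile(src, dst)  # returns the path of the new file

    with open(dst) as f:
        copied = f.read()

print(copied)
```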

logging[1] [2]

  • A simple example:[2]
    import logging
    logging.warning('Watch out!') # will print a message to the console
    logging.info('I told you so') # will not print anything

The result is: WARNING:root:Watch out!

  • Used in program:
import logging
logger = logging.getLogger()
handler = logging.StreamHandler()
formatter = logging.Formatter(
    '%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.debug('often makes a very good meal of %s', 'visiting tourists')
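  • logging.warn() vs. logging.warning()
    logging.warning() logs a message at the WARNING level; logging.warn() is a deprecated alias for it. warnings.warn(), by contrast, issues a warning through the warnings module, which can be filtered or turned into an exception.
    ref: Python warnings.warn() vs. logging.warning()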

  • filter
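    The usual approach is to subclass logging.Filter and write a filter method.
    The official docs say that since version 3.2 a plain function can be used as a filter, but they give no example program.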
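A possible sketch of the function-as-filter approach: the callable receives the LogRecord and returns a false value to drop it. The logger name, the messages and the drop_noisy function are made up for illustration.

```python
import io
import logging

def drop_noisy(record):
    # Return False to drop the record, True to keep it.
    return "noisy" not in record.getMessage()

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(drop_noisy)  # a plain function works as a filter since 3.2

logger = logging.getLogger("filter_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("noisy heartbeat")
logger.info("useful message")

output = stream.getvalue()
print(output, end="")
```

The class-based equivalent would subclass logging.Filter and put the same test in its filter(self, record) method.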

functools

wraps

ref: What does functools.wraps do?
An ordinary decorator changes the attributes of the function it wraps (for example, after decorating func_1 its __name__ becomes that of the wrapper); functools.wraps copies them back.

from functools import wraps
def logged(func):
    @wraps(func)
    def with_logging(*args, **kwargs):
        print(func.__name__ + " was called")
        return func(*args, **kwargs)
    return with_logging

In short: every decorator should use functools.wraps.
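The difference is easy to see side by side. In this sketch, plain and preserving are two hypothetical decorators that differ only in whether they apply functools.wraps.

```python
from functools import wraps

def plain(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def preserving(func):
    @wraps(func)  # copies __name__, __doc__ and more onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@plain
def add_one(x):
    """Return x + 1."""
    return x + 1

@preserving
def add_two(x):
    """Return x + 2."""
    return x + 2

print(add_one.__name__, add_one.__doc__)
print(add_two.__name__, add_two.__doc__)
```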

pickle

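pickle.load reads data from a file-like object; correspondingly, pickle.dump writes to a file-like object.[3]

  • TIPS
    An object that holds bs4.element.Tag instances cannot always be pickled.
    My guess: pickle has to traverse the data recursively, and bs4.element.Tag objects may contain reference cycles? Or perhaps my data simply held too many bs4.element.Tag objects?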
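A minimal round trip through a file-like object, using an in-memory io.BytesIO in place of a file on disk. The sample data is arbitrary.

```python
import io
import pickle

data = {"name": "example", "values": [1, 2, 3], "nested": {"ok": True}}

buffer = io.BytesIO()      # any binary file-like object works
pickle.dump(data, buffer)  # write the pickled bytes into it

buffer.seek(0)             # rewind before reading back
restored = pickle.load(buffer)

print(restored == data)
```

With a real file, open it in binary mode ("wb" for dump, "rb" for load).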

BeautifulSoup4

  • beautifulsoup4
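    • Documentation
    • find_all(text="your_text") does not work well!
    • The 'html5lib' engine is slower but more robust than 'lxml'!
    • soup.prettify() and <tag>.prettify()
      Format the HTML

The four kinds of objects[4]

  • Tag: put simply, each individual tag in the HTML
    How do you extract hyperlinks?
    A hyperlink is not part of the tag's visible text; it lives in the href attribute. To get the link as a string:
    if isinstance(i, bs4.element.Tag) and i.has_attr('href'):
        con_str.append(i['href'])
  • NavigableString: print(type(soup.p.string)), that's it.
  • BeautifulSoup: the entire content of a document. Most of the time it can be treated as a special Tag
  • Comment: a comment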

Selenium

  • initialize

    chromePath = r'/usr/local/bin/chromedriver'
    wd = webdriver.Chrome(executable_path=chromePath)
  • usage

    wd.find_element_by_id('login_pwd')
    wd.find_element_by_class_name('radiocheck').click()
    wd.find_element_by_xpath('//*[@id="sendpck"]/img')

NOTE:
1. You can search for an element and then click() it.
2. You can get an element's XPath from the page source view.

  • login website
    Please refer to the login function in download_voice.ipynb

TIPS

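  • Network disconnection
    Sometimes, after the network drops, selenium can no longer reconnect, even though you can open the site manually in a browser.
    In that case, manually refreshing the page selenium is trying to visit may solve the problem.
    My guess is that selenium keeps the failed state of the last visit, so it still does not work properly after the network comes back.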

reference


  1. The Hitchhiker’s Guide to Python

  2. Official Logging HOWTO

  3. about pickle

  4. Beautiful Soup usage (Beautiful Soup 的用法)