[O'REILLY - Data Science] Starting the Study & Chapter 2: Building the Basics

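I bought this book as an introduction to data science.

Technical content aside, the author has a sense of humor and the writing is very easy to follow.

That alone makes me think this is a great book for data science beginners. (Isn't the best teacher the one who makes studying fun?)

That said, I think the book is written on the assumption that the reader already knows the basics of coding.

If you know nothing at all about programming languages, it may not be an easy read.

The book recommends using Python 2.7. (Python 3.8 is the latest version as of this writing.)

The reasons given are that it is stable and that many important libraries only work on 2.7.

So I uninstalled 3.8 (newest is always best, though...) and installed 2.7 instead.

With Python 2.7 freshly installed, let's check the Zen of Python.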

In [1]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

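I'm only on page 16, but I'm liking this book more and more.

In Python 2.7, dividing two integers with / performs integer division by default, so to get true division you have to import division from the __future__ module.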

In [2]: val = 5/2

In [3]: val
Out[3]: 2

In [4]: from __future__ import division

In [5]: val = 5/2

In [6]: val
Out[6]: 2.5
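
To avoid depending on which division rule is active, the two behaviors can also be requested explicitly. This is a minimal sketch (the variable names are only for illustration): `//` always floors, and making one operand a float always gives true division, on both Python 2.7 and Python 3.

```python
# Floor division: always rounds down (toward negative infinity).
floor_result = 5 // 2

# True division: a float operand forces a fractional result on any version.
true_result = 5 / 2.0

# Floor division of a negative number rounds down, not toward zero.
negative_floor = -5 // 2

print(floor_result, true_result, negative_floor)
```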

Python has first-class functions, and since I've used JavaScript the concept isn't hard. (Everything is connected?)
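
As a rough illustration of what "first-class" means here, a function can be stored in a variable and passed to another function like any other value. The names `double` and `apply_twice` below are made up for this sketch and are not taken from the book.

```python
def double(x):
    """Return twice the input."""
    return x * 2


def apply_twice(f, x):
    """Call the function f on x, then call f again on the result."""
    return f(f(x))


# A function can be assigned to a variable like any other value.
my_double = double

result = apply_twice(my_double, 3)                 # double(double(3))
result_lambda = apply_twice(lambda v: v + 10, 1)   # anonymous function as argument
print(result, result_lambda)
```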

Examples of using a Python list

In [1]: x = range(10)

In [2]: x
Out[2]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [3]: x[-1]
Out[3]: 9

In [4]: x[-2]
Out[4]: 8

In [5]: x[:3]
Out[5]: [0, 1, 2]

In [6]: x[1:3]
Out[6]: [1, 2]

In [7]: x[6:]
Out[7]: [6, 7, 8, 9]

In [8]: x[-3:]
Out[8]: [7, 8, 9]

In [9]: x[1:-1]
Out[9]: [1, 2, 3, 4, 5, 6, 7, 8]

In [10]: x[:]
Out[10]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

-- Membership test (True/False)

This checks the items in the list one by one, so if speed matters, test it before relying on it!

In [11]: 1 in [1,2,3]
Out[11]: True

-- Joining lists (extend)

In [12]: x = [1,2,3]

In [13]: x.extend([4,5,6])

In [14]: x
Out[14]: [1, 2, 3, 4, 5, 6]

-- Adding an item to a list (append)

In [15]: x.append(7)

In [16]: x
Out[16]: [1, 2, 3, 4, 5, 6, 7]

-- Unpacking a list (unpack)

If the list has more items (3) than there are variables (x, y: 2), unpacking raises a ValueError.

In [17]: x,y = [1,2]

In [18]: x
Out[18]: 1

In [19]: y
Out[19]: 2

In [20]: x,y = [1,2,3]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-c393a66ccb78> in <module>()
----> 1 x,y = [1,2,3]

ValueError: too many values to unpack

Items to be discarded are written as _

In [23]: _,_,z = [1,2,3]

In [24]: z
Out[24]: 3

tuple

A tuple is like a list that cannot be modified.

In [25]: tup1 = (3,4)

In [26]: tup2 = 5,6

In [27]: tup1[0]
Out[27]: 3

In [28]: tup2[0]
Out[28]: 5
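
What "cannot be modified" looks like in practice can be sketched as follows; the variable names are illustrative only. Assigning to an index works on a list but raises a TypeError on a tuple.

```python
my_list = [1, 2]
my_tuple = (1, 2)

my_list[1] = 3          # lists can be modified in place

try:
    my_tuple[1] = 3     # tuples cannot
    tuple_error = None
except TypeError as err:
    tuple_error = str(err)

print(my_list)
print("cannot modify a tuple:", tuple_error)
```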

Multiple assignment with tuples and lists

Amazing.. no temp variable needed!!!

In [29]: x,y = 1,2

In [30]: x,y = y,x

In [31]: x
Out[31]: 2

In [32]: y
Out[32]: 1

dict

How to create a dict

In [33]: empty_dict={}

In [34]: grades = { 'gury': 90, 'doyoung': 15 }

In [35]: grades['gury']
Out[35]: 90

The in check is much faster on a dict than on a list.
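
A short sketch of key lookups on the `grades` dict from above. The key 'kate' is a hypothetical missing key added for illustration. `in` checks the keys, `get` returns a default instead of failing, and `[]` on a missing key raises KeyError.

```python
grades = {'gury': 90, 'doyoung': 15}

has_gury = 'gury' in grades        # 'in' checks the keys
has_kate = 'kate' in grades        # hypothetical key that is not present

# get returns a default instead of raising when the key is missing.
kate_grade = grades.get('kate', 0)
nobody_grade = grades.get('nobody')    # None when no default is given

try:
    grades['kate']                 # [] on a missing key raises KeyError
    raised = False
except KeyError:
    raised = True

print(has_gury, has_kate, kate_grade, nobody_grade, raised)
```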

Writing all of this up is getting tedious. None of it is difficult, so just read the book.

defaultdict

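The only difference between a defaultdict and a dict is this:

when a key that does not exist is looked up, a defaultdict adds it as a new entry with a default value.

(With a plain dict, accessing a missing key with [] raises a KeyError, while accessing it with get returns None.)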

In [1]: from collections import defaultdict

In [2]: word_counts = defaultdict(int)   # int() returns 0, so a missing key starts at 0

In [3]: word_counts
Out[3]: defaultdict(int, {})

In [4]: word = ['a', 'b', 'b', 'c', 'c', 'd']

In [5]: for w in word:
   ...:     word_counts[w] += 1
   ...:

In [6]: word_counts
Out[6]: defaultdict(int, {'a': 1, 'b': 2, 'c': 2, 'd': 1})

The argument passed when creating a defaultdict is a function that is called to produce the default value for a missing key.

In [1]: from collections import defaultdict

In [2]: dd_list = defaultdict(list)

In [3]: dd_list['list_1'].append(1)

In [4]: dd_list['list_1'].append(3)

In [5]: dd_list
Out[5]: defaultdict(list, {'list_1': [1, 3]})

In [6]: print(dd_list)
defaultdict(<type 'list'>, {'list_1': [1, 3]})
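
Because the default comes from calling a function, any zero-argument callable can be used. The sketch below uses made-up names (`groups`, `labels`) and a lambda as the factory. Note that merely looking up a missing key inserts it.

```python
from collections import defaultdict

# The factory is called once per missing key, so each key gets its own list.
groups = defaultdict(list)
groups['a'].append(1)
groups['b'].append(2)

# Any zero-argument callable works; this lambda makes 'unknown' the default.
labels = defaultdict(lambda: 'unknown')
labels['x'] = 'known'
missing_label = labels['y']     # looking up 'y' also inserts it

print(dict(groups))
print(missing_label, sorted(labels.keys()))
```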

Counter

In [7]: from collections import Counter

In [8]: c = Counter([0,1,2,3,3,0])

In [9]: c
Out[9]: Counter({0: 2, 1: 1, 2: 1, 3: 2})
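
A Counter can also replace the defaultdict(int) word-counting loop shown earlier. This is only a sketch with sample data chosen for illustration; `most_common` returns the most frequent items first.

```python
from collections import Counter

words = ['a', 'b', 'b', 'c', 'c', 'c', 'd']   # sample data for illustration
word_counts = Counter(words)

# most_common(n) returns the n most frequent items as (item, count) pairs.
top_two = word_counts.most_common(2)

# A missing key counts as 0 instead of raising KeyError.
z_count = word_counts['z']

print(top_two, z_count)
```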

set

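A structure that represents a collection of items. Because it is a set, duplicates are handled for you (they are removed).

The in check is very fast on a set.

When you need to check whether a particular item exists among a large number of items, use a set rather than a list.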
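
A small sketch of the points above, with illustrative variable names: building a set from a list drops the duplicates, and `in` then tests membership.

```python
items = [1, 2, 3, 1, 2, 3]

unique_items = set(items)          # duplicates are dropped
num_unique = len(unique_items)

unique_items.add(4)                # adding a new item
unique_items.add(1)                # adding an existing item changes nothing

has_two = 2 in unique_items
has_nine = 9 in unique_items
print(sorted(unique_items), num_unique, has_two, has_nine)
```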

Control flow

Writing an if statement on one line

In [1]: x = 10

In [2]: parity = "even" if x % 2 == 0 else "odd"

In [3]: parity
Out[3]: 'even'

True & False

Python uses 'None' rather than Null to represent a missing value.

In [8]: x
Out[8]: 9

In [9]: print x is None
False

In [10]: x = None

In [11]: print x is None
True

- Things that are falsy

False

None

[] (empty list)

{} (empty dict)

set()

0 or 0.0
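
The list above can be checked with bool(). In this sketch the empty string "" is added as one more falsy value, and the `name or "anonymous"` fallback is just an illustrative use of truthiness.

```python
falsy_values = [False, None, [], {}, "", set(), 0, 0.0]

# bool() shows how each value behaves in an if statement.
all_falsy = not any(bool(v) for v in falsy_values)

# A common use: fall back to a default when a value is empty.
name = ""
display_name = name or "anonymous"

# Non-empty containers and non-zero numbers are truthy.
truthy_values = [[0], {"k": None}, "0", 1, -0.5]
all_truthy = all(bool(v) for v in truthy_values)
print(all_falsy, display_name, all_truthy)
```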

Generator & Iterator

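When you need values one at a time in sequence, storing them all in a list first is wasteful of memory, so it is better to build a Generator with yield.

You can "also" create a Generator with parentheses (a generator expression).

One drawback of a Generator is said to be that it can be iterated only once.. Why? (Needs more research..)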

In [14]: def lazy_range(n):
    ...:     i=0
    ...:     while i < n:
    ...:         yield i
    ...:         i += 1
    ...:

In [15]: lazy_evens_below_20 = (i for i in lazy_range(20) if i%2 == 0 )  # create a Generator

In [16]: lazy_evens_below_20
Out[16]: <generator object <genexpr> at 0x04850E68>
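
On the question of why a generator can be iterated only once: as far as I understand it, a generator produces each value on demand and keeps only its current position, not the values it has already produced, so there is nothing to rewind to. A small sketch of that behavior, reusing lazy_range with a smaller bound:

```python
def lazy_range(n):
    """Yield 0, 1, ..., n-1 one at a time (same as the function above)."""
    i = 0
    while i < n:
        yield i
        i += 1


evens = (i for i in lazy_range(10) if i % 2 == 0)

first_pass = list(evens)    # consumes every value
second_pass = list(evens)   # the generator is already exhausted

# To iterate again, create a fresh generator.
fresh_pass = list(i for i in lazy_range(10) if i % 2 == 0)
print(first_pass, second_pass, fresh_pass)
```

If the values are needed more than once, one option is to store them in a list after all, at the cost of the memory the generator was saving.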

enumerate

Use it when you need both the item and its index while iterating over a list

In [6]: arr = ['A', 'B', 'C']

In [7]: for i, val in enumerate(arr):
   ...:     print(i, val)
   ...:
(0, 'A')
(1, 'B')
(2, 'C')
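
The output above is printed as tuples because in Python 2.7, without `from __future__ import print_function`, `print(i, val)` prints the tuple `(i, val)`. The sketch below collects the pairs instead of printing them, so it behaves the same on 2.7 and 3, and shows the optional start argument. The contents of `arr` are assumed from the output above.

```python
arr = ['A', 'B', 'C']   # assumed sample data

# enumerate yields (index, item) pairs.
pairs = list(enumerate(arr))

# The optional start argument changes the first index.
numbered = ["%d: %s" % (i, val) for i, val in enumerate(arr, start=1)]
print(pairs)
print(numbered)
```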
