Artificial Neural Networks

Artificial Neural Networks, Natural Language Processing Laboratory, 황규백 (Hwang Kyu-Baek)

Artificial neural network (ANN) • General, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples • BACKPROPAGATION algorithm • Uses gradient descent to tune network parameters to best fit a training set of input-output pairs • ANN learning • Robust to errors in the training examples • Applied to interpreting visual scenes, speech recognition, and learning robot control strategies

Biological motivation • Similarities to biological neurons • parallel computation • distributed representation • Differences from biological neurons • output of the processing unit (neuron)

ALVINN system

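Problems suited to neural network learning • The phenomenon to be learned is described by many attributes • The output can take whatever type of value suits the problem (discrete, real-valued, or a vector) • The training examples may contain errors (noise) • Long training times are acceptable • The learned result must be applied quickly • Humans do not need to understand the learned result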

Perceptrons • vector of real-valued inputs • weights & threshold • learning: choosing values for the weights
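The perceptron described on this slide can be sketched in a few lines. This is a minimal illustration, assuming the common convention that the threshold is folded into a bias weight w0 and that the unit outputs +1 or -1; the function name and the example weights are hypothetical.

```python
def perceptron_output(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.

    weights[0] is the bias weight w0 (the negated threshold).
    """
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else -1


# Hypothetical weights that make the unit behave like Boolean AND on {0, 1} inputs.
and_weights = [-0.8, 0.5, 0.5]
results = [perceptron_output(and_weights, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(results)  # [-1, -1, -1, 1]
```

Learning then amounts to choosing the entries of `weights`; the inputs and the thresholding step stay fixed.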

Hypothesis space of perceptron learning • n: dimension of the input vector

Representational power of perceptrons • hyperplane decision surface for linearly separable examples • many boolean functions (but not XOR) • m-of-n functions • disjunctive normal form: requires multiple units

Perceptron rule • Conditions for finding correct weights after a finite number of updates • the training examples are linearly separable • the learning rate is sufficiently small
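A minimal sketch of the perceptron training rule, assuming the standard update w_i <- w_i + eta * (t - o) * x_i with targets in {-1, +1}. The function names, the learning rate eta = 0.1, and the OR data set are illustrative assumptions.

```python
def predict(weights, x):
    """Thresholded output; weights[0] is the bias weight."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """examples: list of (inputs, target) pairs with target in {-1, +1}."""
    n = len(examples[0][0])
    weights = [0.0] * (n + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                errors += 1
                # w_i <- w_i + eta * (t - o) * x_i, with x_0 = 1 for the bias
                weights[0] += eta * (t - o)
                for i, xi in enumerate(x):
                    weights[i + 1] += eta * (t - o) * xi
        if errors == 0:  # every example classified correctly
            break
    return weights


or_examples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_examples)
print(all(predict(w, x) == t for x, t in or_examples))  # True
```

OR is linearly separable, so the loop stops after a few passes; on XOR the same loop would simply run until `max_epochs`.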

Gradient descent & delta rule • for training examples that are not linearly separable • uses an unthresholded (linear) unit • o_d is a function of w
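A common way to write the unthresholded unit and its training error is shown here. It assumes that D denotes the set of training examples and that t_d and o_d are the target and the unit's output for example d; these symbols follow the usual textbook convention rather than anything stated on the slide.

```latex
o(\vec{x}) = \vec{w} \cdot \vec{x},
\qquad
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
```

Because o_d depends on the weights, E is a function of w alone once the training set is fixed.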

Hypothesis space

Gradient descent • gradient: direction of steepest increase in E, so the weights are moved in the opposite direction

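Gradient descent (cont’d) • For a linear unit the error surface has a single global minimum, which gradient descent finds whether or not the training examples are linearly separable. • A large learning rate risks overstepping the minimum -> the learning rate is sometimes reduced gradually.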
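Batch gradient descent for a single linear unit can be sketched as follows. This assumes the squared-error measure E = 1/2 * sum (t - o)^2, so each weight changes by eta * sum (t - o) * x_i per pass; the function name, the learning rate, and the toy data (t = 2x + 1) are illustrative assumptions.

```python
def gradient_descent(examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in examples:
            xs = (1.0,) + tuple(x)  # x_0 = 1 carries the bias weight
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi  # accumulate over all examples
        # Weights change only after the whole training set has been seen.
        w = [wi + di for wi, di in zip(w, delta)]
    return w


data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]  # t = 2x + 1
w = gradient_descent(data)
print([round(wi, 4) for wi in w])  # [1.0, 2.0]
```

With a much larger `eta` the same loop diverges, which is the overstepping problem mentioned above.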

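Stochastic approximation to gradient descent • Requirements for using gradient descent • the hypothesis space is continuously parameterized • the error must be differentiable with respect to the hypothesis parameters • Drawbacks of gradient descent • convergence can be slow • with multiple local minima, there is no guarantee of finding the global minimum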

Stochastic approximation to gradient descent (cont’d) • Compute E for a single training example and update the weights immediately • approximates true gradient descent • uses a smaller learning rate • may avoid falling into some of the local minima • Delta rule
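The incremental (stochastic) version differs from the batch version only in when the weights change. The sketch below assumes the delta rule w_i <- w_i + eta * (t - o) * x_i applied after every example; the function name, learning rate, and toy data are hypothetical.

```python
def stochastic_delta_rule(examples, eta=0.02, epochs=2000):
    """Incremental gradient descent for a linear unit (delta rule)."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in examples:
            xs = (1.0,) + tuple(x)  # x_0 = 1 carries the bias weight
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # Update right away, using the error of this one example only.
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w


data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]  # t = 2x + 1
w = stochastic_delta_rule(data)
print([round(wi, 4) for wi in w])  # [1.0, 2.0]
```

Each step follows the gradient of a single example's error, so the path is noisier than the batch path but the updates are far cheaper.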

Remark • Perceptron rule • thresholded output • converges to exact weights • requires linearly separable examples • Delta rule • unthresholded output • weights that asymptotically minimize the error • also works when the examples are not linearly separable

Multilayer networks • Nonlinear decision surface

Differentiable threshold unit • Sigmoid function • nonlinear, differentiable
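A small sketch of the sigmoid function and its derivative, assuming the standard logistic form sigma(y) = 1 / (1 + e^(-y)). The function names are illustrative; the two-branch form is only there to avoid overflow for large negative inputs.

```python
import math


def sigmoid(y):
    """Logistic function 1 / (1 + exp(-y)), written to avoid overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def sigmoid_derivative(y):
    """d sigma / dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)


print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25
```

The derivative is expressed through the unit's own output, which is what makes the unit convenient for gradient-based training.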

BACKPROPAGATION algorithm • New definition of the error: summed over all network output units

BACKPROPAGATION algorithm (cont’d) • Multiple local minima • Termination conditions • fixed number of iterations • error threshold on the training examples • error on a separate validation set

BACKPROPAGATION algorithm (cont’d) • Adding momentum • the weight update from the previous iteration influences the current update • Learning in arbitrary acyclic networks • downstream(r)
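The momentum idea can be illustrated on a toy error surface. This sketch assumes the usual form delta(n) = -eta * gradient + alpha * delta(n-1); the function name, the constants eta and alpha, and the quadratic error function are all hypothetical choices for illustration.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Return (new weights, update); the update reuses part of the previous update."""
    delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta


# Toy error surface E(w) = 0.5 * (w0^2 + 10 * w1^2) with gradient (w0, 10 * w1).
w, delta = [1.0, 1.0], [0.0, 0.0]
for _ in range(300):
    grad = [w[0], 10.0 * w[1]]
    w, delta = momentum_step(w, grad, delta, eta=0.05, alpha=0.5)
print(all(abs(wi) < 1e-6 for wi in w))  # True
```

Setting `alpha=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same surface.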

BACKPROPAGATION rule

BACKPROPAGATION rule (cont’d) • Training rule for output units

BACKPROPAGATION rule (cont’d) • Training rule for hidden units
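The two training rules can be sketched together for a network with one hidden layer of sigmoid units. This assumes the standard error terms delta_k = o_k (1 - o_k)(t_k - o_k) for output units and delta_h = o_h (1 - o_h) * sum_k w_kh delta_k for hidden units, with stochastic updates. The function names, the network size, and eta = 0.5 are illustrative assumptions.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def forward(w_hidden, w_out, x):
    """Return (hidden outputs, network outputs); index 0 of each weight row is the bias."""
    xs = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_hidden]
    hs = [1.0] + hidden
    out = [sigmoid(sum(w * v for w, v in zip(row, hs))) for row in w_out]
    return hidden, out


def backprop_step(w_hidden, w_out, x, t, eta=0.5):
    """One in-place stochastic update on a single example (x, t)."""
    hidden, out = forward(w_hidden, w_out, x)
    # Output units: delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Hidden units: delta_h = o_h (1 - o_h) * sum_k w_kh delta_k
    delta_hidden = [
        h * (1 - h) * sum(w_out[k][j + 1] * delta_out[k] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    # Every weight: w_ji <- w_ji + eta * delta_j * x_ji
    hs = [1.0] + hidden
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * delta_out[k] * hs[i]
    xs = [1.0] + list(x)
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[j] * xs[i]


def squared_error(w_hidden, w_out, x, t):
    _, out = forward(w_hidden, w_out, x)
    return 0.5 * sum((tk - o) ** 2 for tk, o in zip(t, out))


rng = random.Random(0)
w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(2)]  # 2 inputs, 2 hidden units
w_out = [[rng.uniform(-0.05, 0.05) for _ in range(3)]]  # 2 hidden units, 1 output
x, t = (1.0, 0.0), (0.9,)
before = squared_error(w_hidden, w_out, x, t)
for _ in range(100):
    backprop_step(w_hidden, w_out, x, t)
after = squared_error(w_hidden, w_out, x, t)
print(after < before)  # True
```

The hidden-unit error terms are computed before any output weight is changed, so both layers are updated from the same forward pass.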

Convergence and local minima • Convergence is guaranteed only to a local minimum • In practice this problem is not severe • The algorithm is highly effective • the more weights, the less severe the local-minima problem • weights are initialized to values near 0 • Remedies • momentum, stochastic gradient descent, training multiple networks

Representational power of feedforward networks • Boolean functions • with two layers • disjunctive normal form • one hidden unit per possible input vector • Continuous functions (bounded) • with two layers • Arbitrary functions • with three layers • linear combination of small (localized) functions

Hypothesis space search • continuous hypothesis space -> more useful for search than a discrete one • Inductive bias • difficult to characterize • smooth interpolation between data points

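Hidden layer representation • The network can discover characteristics of the input values on its own and represent them in the hidden layer. • This is more flexible than using only features fixed in advance by a human, and is useful for capturing characteristics that cannot be known beforehand.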

Generalization, overfitting, stopping criterion • Terminating condition • relying on a training-error threshold alone is risky (it invites overfitting) • Take generalization accuracy into account • Weight decay • Validation data • Cross-validation approach • K-fold cross-validation
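The validation-based stopping idea can be sketched with two small helpers. This is an illustration under assumptions: the function names are hypothetical, folds are formed by simple striding, and the list of validation errors stands in for measurements that a real training loop would record after each iteration.

```python
def k_fold_splits(examples, k):
    """Yield (training set, validation set) pairs; each example is validated exactly once."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, validation


def best_iteration(validation_errors):
    """Index of the training iteration with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)


examples = list(range(10))
splits = list(k_fold_splits(examples, 5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 8 2

# Made-up validation errors: they fall, then rise again as the network overfits.
print(best_iteration([0.50, 0.30, 0.20, 0.25, 0.40]))  # 2
```

One way to combine the two is to find the best iteration count on each fold, average those counts, and then train on all the data for that many iterations.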

Face recognition

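Input image: 120×128 -> 30×32 • reduces computational complexity • each coarse pixel is the mean value of the corresponding pixels (cf. ALVINN) • 1-of-n output encoding • more weights • helps resolve ambiguity • target values <0.9, 0.1, 0.1, 0.1> • 2 layers, 3 hidden units -> 90% success • learned hidden units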
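The input and output encodings on this slide can be sketched as follows. The sketch assumes that the 120×128 image is reduced by averaging non-overlapping 4×4 blocks and that the four output classes use 0.9 for the correct class and 0.1 elsewhere; the function names and the synthetic image are hypothetical.

```python
def downsample_mean(image, factor=4):
    """Replace each factor x factor block of pixels with its mean intensity.

    Assumes both image dimensions are multiples of factor.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            / (factor * factor)
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def one_of_n_target(index, n=4, high=0.9, low=0.1):
    """Target vector with `high` at the correct class and `low` elsewhere."""
    return [high if i == index else low for i in range(n)]


image = [[(r + c) % 256 for c in range(128)] for r in range(120)]  # synthetic 120 x 128 image
coarse = downsample_mean(image)
print(len(coarse), len(coarse[0]))  # 30 32
print(one_of_n_target(0))  # [0.9, 0.1, 0.1, 0.1]
```

Targets of 0.9 and 0.1 rather than 1 and 0 are reachable by sigmoid output units with finite weights.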

Alternative error functions • Used to add new constraints to the weight-tuning rule • Penalty term for weight magnitude • reduces the risk of overfitting • Derivative of the target function • Minimizing cross-entropy • for probabilistic functions • Weight sharing • speech recognition
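Two of these alternatives are commonly written as below. The notation is assumed rather than taken from the slide: D is the training set, t_kd and o_kd are the target and output of output unit k on example d, gamma is a hypothetical constant weighting the penalty, and in the cross-entropy form t_d and o_d refer to a single probabilistic output.

```latex
% Squared error with a penalty on weight magnitude
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in \mathit{outputs}} (t_{kd} - o_{kd})^2
             + \gamma \sum_{i,j} w_{ji}^2

% Cross-entropy for a probabilistic target function (to be minimized)
-\sum_{d \in D} \bigl[ t_d \log o_d + (1 - t_d) \log (1 - o_d) \bigr]
```

The penalty term shrinks every weight slightly on each update, which is why it is often described as weight decay.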

Alternative error minimization procedures • Line search • direction: same as backpropagation • distance: the minimum of the error function along this line • the resulting step may be very large or very small • Conjugate gradient • new direction: chosen so that the component of the error gradient that has just been made zero remains zero

Recurrent networks

Dynamically modifying network structure • Goal: improve generalization accuracy and training efficiency • Growing (starting with no hidden units) • CASCADE-CORRELATION • shorter training time, but a risk of overfitting • Pruning • “optimal brain damage” • shorter training time
