1 / 15

Embedded Systems Programming

Embedded Systems Programming. Writing Optimised C code for ARM. Why write optimised C code?. For embedded system size and/or speed are of key importance The compiler optimisation phase can only do so much

yvette-hyde
Download Presentation

Embedded Systems Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Embedded Systems Programming Writing Optimised C code for ARM

  2. Why write optimised C code? • For embedded system size and/or speed are of key importance • The compiler optimisation phase can only do so much • In order to write optimal C code you need to know details of the underlying hardware and the compiler

  3. What compilers can’t do • void memclr( char * data, int N) • { • for (; N > 0; N--) • { • *data=0; • data++; • } • } • Is N == on first loop? • 0 – 1 is dangerous! • Is data array 4 byte aligned? • Can store using int • Is N a multiple of 4? • Could do 4 word blocks at a time • Compilers have to be conservative!

  4. An example Program The program might seem fine – even resource friendly Using a char saves space for loops make good assembler Lets look at the assembler code • /* program showing inefficient • * variable and loop • * usage craig Nov 04 • */ • int checksum_1(int *data) • { • char i; int sum = 0; • for (i =0; i < 64; i++) • sum += data[i]; • return sum; • }

  5. .text .align 2 .global checksum_1 .type checksum_1,functionchecksum_1: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 1, current_function_anonymous_args = 0 mov ip, sp stmfd sp!, {fp, ip, lr, pc} sub fp, ip, #4 mov r1, r0 mov r0, #0 @ sum = 0 mov r2, r0 @ i = 0.L6: ldr r3, [r1, r2, asl #2] @ data[i] add r0, r0, r3 @ sum = data[i] add r3, r2, #1 @ i ++ and r2, r3, #255 cmp r2, #63 @ i < 64 bls .L6 ldmea fp, {fp, sp, pc}.Lfe1: .size checksum_1,.Lfe1-checksum_1

  6. What is wrong? • The use of char means that the compiler has to cast to look at 8 bits – using • and r2, r3, #255 • The loop variable requires a register and initialisation • If the loop is called often then the tests and branch is quite an overhead

  7. Variable sizes • In general the compiler will use 32bit registers for local variables but will have to cast them when used as 8 or 16 bit values • If you can, use unsigned ints, if you can’t explicitly cast • Using signed shorts can be quite a problem for compilers

  8. Watch your shorts! short add( short a, short b) { return a + (b >> 1); } • The above C code turns into the rather nasty assembler • The gnu C compiler is very cautious when confronted with short variables Becomes …. mov ip, sp stmfd sp!, {fp, ip, lr, pc} sub fp, ip, #4 mov r1, r1, asl #16 mov r0, r0, asl #16 mov r0, r0, asr #16 add r0, r0, r1, asr #17 mov r0, r0, asl #16 mov r0, r0, asr #16 ldmea fp, {fp, sp, pc

  9. Loops #1 • As well as using a char for a loop counter the loop counter could be redundant • Terminate loops by counting down to 0 the reduces register usage and means no initialisation • Use do..while instead of for loops

  10. Efficient loop C */ * Program to show efficient use of * variables and loops */ int checksum_2(int *data) { int sum = 0, i = 64; do { sum += *(data++); } while ( --i != 0 ); return sum; }

  11. Efficient loop assembler checksum_2: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 1, current_function_anonymous_args = 0 mov ip, sp stmfd sp!, {fp, ip, lr, pc} sub fp, ip, #4 mov r1, r0 mov r0, #0 @ sum = 0 mov r2, #64 @ i = 64.L6: ldr r3, [r1], #4 @ *(data++) add r0, r0, r3 @ sum = *(data++) subs r2, r2, #1 @ --i bne .L6 ldmea fp, {fp, sp, pc}

  12. Loop unrolling • If a loop is going to be repeated often then the test and branch can be quite an overhead • If the loop is a multiple of 4 and is done quite a lot then the loop can be unrolled • This increases code a size but is more speed efficient • Sizes that are not multiples of 4 can be done but are less efficient.

  13. An unrolled loop * Program to show efficient use of * variables and loops & loop unrolling */ int checksum_2(int *data) { int sum = 0, i = 64; do { sum += *(data++); sum += *(data++); sum += *(data++); sum += *(data++); i -= 4; } while ( i != 0 ); return sum; }

  14. checksum_2: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 1, current_function_anonymous_args = 0 mov ip, sp stmfd sp!, {fp, ip, lr, pc} sub fp, ip, #4 mov r2, r0 mov r0, #0 mov r1, #64.L6: ldr r3, [r2], #4 add r0, r0, r3 ldr r3, [r2], #4 add r0, r0, r3 ldr r3, [r2], #4 add r0, r0, r3 ldr r3, [r2], #4 add r0, r0, r3 subs r1, r1, #4 bne .L6 ldmea fp, {fp, sp, pc}

  15. Loop unrolling ! = 4 /* Program to show use of * loop unrolling */ int checksum_2(int *data, unsigned int N) { int sum = 0; unsigned int i; for ( i = N/4; i != 0; i--) { sum += *(data++); sum += *(data++); sum += *(data++); sum += *(data++); } for ( i = N&3; i != 0; i--) sum += *(data++); return sum; }

More Related