EzDevInfo.com

arm interview questions

Top arm frequently asked interview questions

What is the difference between FIQ and IRQ interrupt system?

I want to know the difference between FIQ and IRQ interrupt system in any microprocessor, e.g: ARM926EJ.

What's the difference between hard and soft floating point numbers?

When I compile C code with my cross toolchain, the linker prints pages of warnings saying that my executable uses hard floats but my libc uses soft floats. What's the difference?

Source: (StackOverflow)

What is the use of ARM EABI v7a System image in android?

What for do we need ARM EABI v7a System image in Android development? What is the purpose of that particular image?

Source: (StackOverflow)

Test NEON-optimized cv::threshold() on mobile device [closed]

I have been writing some optimizations for the OpenCV's threshold function, for ARM devices (mobile phones). It should be working on both Android and iPhone.

However, I do not have a device to test it on, so I am looking for volunteers to give me a little help. If that motivates you more, I am planning to send it to OpenCV to be integrated into the main repository.

I would be interested in code correctness, and if it happens to work as intended, some statistics for original/optimized performance. Do not forget to look at all scenarios.

So, here is the code. To run it, paste in on opencv/modules/imgproc/src/thresh.cpp, at line 228 (as of 2.4.2) - just below SSE block, and recompile OpenCV.

Also, add this line at the top of the file

#include <arm_neon.h>

Main code body:

#define CV_USE_NEON 1
#if CV_USE_NEON
    //if( checkHardwareSupport(CV_CPU_ARM_NEON) )
    if( true )
    {
        uint8x16_t thresh_u = vdupq_n_u8(thresh);
        uint8x16_t maxval_ = vdupq_n_u8(maxval);

        j_scalar = roi.width & -8;

        for( i = 0; i < roi.height; i++ )
        {
            const uchar* src = (const uchar*)(_src.data + _src.step*i);
            uchar* dst = (uchar*)(_dst.data + _dst.step*i);

            switch( type )
            {
            case THRESH_BINARY:
                for( j = 0; j <= roi.width - 32; j += 32 )
                {
                    uint8x16_t v0, v1;
                    v0 = vld1q_u8 ( src + j );
                    v1 = vld1q_u8 ( src + j + 16 );
                    v0 = vcgtq_u8 ( v0, thresh_u );
                    v1 = vcgtq_u8 ( v1, thresh_u );
                    v0 = vandq_u8 ( v0, maxval_ );
                    v1 = vandq_u8 ( v1, maxval_ );
                    vst1q_u8 ( dst + j, v0 );
                    vst1q_u8 ( dst + j + 16, v1 );
                }


                for( ; j <= roi.width - 8; j += 8 )
                {
                    uint8x8_t v2;
                    v2 = vld1_u8( src + j );
                    v2 = vcgt_u8 ( v2, vget_low_s8 ( thresh_u ) );
                    v2 = vand_u8 ( v2, vget_low_s8 ( maxval_ ) );
                    vst1_u8 ( dst + j, v2 );                    
                }
                break;

            case THRESH_BINARY_INV:         
                for( j = 0; j <= roi.width - 32; j += 32 )
                {
                    uint8x16_t v0, v1;
                    v0 = vld1q_u8 ( src + j );
                    v1 = vld1q_u8 ( src + j + 16 );
                    v0 = vcleq_u8 ( v0, thresh_u );
                    v1 = vcleq_u8 ( v1, thresh_u );
                    v0 = vandq_u8 ( v0, maxval_ );
                    v1 = vandq_u8 ( v1, maxval_ );
                    vst1q_u8 ( dst + j, v0 );
                    vst1q_u8 ( dst + j + 16, v1 );
                }


                for( ; j <= roi.width - 8; j += 8 )
                {
                    uint8x8_t v2;
                    v2 = vld1_u8( src + j );
                    v2 = vcle_u8 ( v2, vget_low_s8 ( thresh_u ) );
                    v2 = vand_u8 ( v2, vget_low_s8 ( maxval_ ) );
                    vst1_u8 ( dst + j, v2 );                    
                }
                break;

            case THRESH_TRUNC:
                for( j = 0; j <= roi.width - 32; j += 32 )
                {
                    uint8x16_t v0, v1;
                    v0 = vld1q_u8 ( src + j );
                    v1 = vld1q_u8 ( src + j + 16 );
                    v0 = vminq_u8 ( v0, thresh_u );
                    v1 = vminq_u8 ( v1, thresh_u );                 
                    vst1q_u8 ( dst + j, v0 );
                    vst1q_u8 ( dst + j + 16, v1 );
                }


                for( ; j <= roi.width - 8; j += 8 )
                {
                    uint8x8_t v2;
                    v2 = vld1_u8( src + j );
                    v2 = vmin_u8  ( v2, vget_low_s8 ( thresh_u ) );                 
                    vst1_u8 ( dst + j, v2 );                    
                }
                break;

            case THRESH_TOZERO:         
                for( j = 0; j <= roi.width - 32; j += 32 )
                {
                    uint8x16_t v0, v1;
                    v0 = vld1q_u8 ( src + j );
                    v1 = vld1q_u8 ( src + j + 16 );             
                    v0 = vandq_u8 ( vcgtq_u8 ( v0, thresh_u ), vmaxq_u8 ( v0, thresh_u ) );
                    v1 = vandq_u8 ( vcgtq_u8 ( v1, thresh_u ), vmaxq_u8 ( v1, thresh_u ) );
                    vst1q_u8 ( dst + j, v0 );
                    vst1q_u8 ( dst + j + 16, v1 );
                }


                for( ; j <= roi.width - 8; j += 8 )
                {
                    uint8x8_t v2;
                    v2 = vld1_u8 ( src + j );                    
                    v2 = vand_u8 ( vcgt_u8 ( v2, vget_low_s8(thresh_u) ), vmax_u8 ( v2, vget_low_s8(thresh_u) ) );
                    vst1_u8 ( dst + j, v2 );                    
                }
                break;

            case THRESH_TOZERO_INV:
                for( j = 0; j <= roi.width - 32; j += 32 )
                {
                    uint8x16_t v0, v1;
                    v0 = vld1q_u8 ( src + j );
                    v1 = vld1q_u8 ( src + j + 16 );             
                    v0 = vandq_u8 ( vcleq_u8 ( v0, thresh_u ), vminq_u8 ( v0, thresh_u ) );
                    v1 = vandq_u8 ( vcleq_u8 ( v1, thresh_u ), vminq_u8 ( v1, thresh_u ) );
                    vst1q_u8 ( dst + j, v0 );
                    vst1q_u8 ( dst + j + 16, v1 );
                }


                for( ; j <= roi.width - 8; j += 8 )
                {
                    uint8x8_t v2;
                    v2 = vld1_u8 ( src + j );                    
                    v2 = vand_u8 ( vcle_u8 ( v2, vget_low_s8(thresh_u) ), vmin_u8 ( v2, vget_low_s8(thresh_u) ) );
                    vst1_u8 ( dst + j, v2 );                    
                }
                break;
            }
        }
    }
#endif

Source: (StackOverflow)

How to affect Delphi XEx code generation for Android/ARM targets?

Embarcadero's Delphi compilers use an LLVM backend to produce native ARM code for Android devices. I have large amounts of Pascal code I need to compile into Android applications, and I would like to know how to make Delphi generate more efficient code. Right now I'm not even talking about advanced features like automatic SIMD optimizations, just about producing reasonable code. Surely there must be a way to pass parameters to the LLVM side, or somehow affect the result? Usually, any compiler will have many options to affect code compilation and optimization, but Delphi's ARM targets seem to be just "optimization on/off" and that's it.

LLVM is supposed to be capable of producing reasonably tight and sensible code, but it seems that Delphi is using its facilities in a weird way. Delphi wants to use the stack very heavily, and it generally only utilizes the processor's registers r0-r3 as temporary variables. Perhaps the craziest of all, it seems to be loading normal 32 bit integers as four 1-byte load operations. How to make Delphi produce better ARM code, and without the byte-by-byte hassle it is making for Android?

Edit: at first I thought the byte-by-byte loading was for swapping byte order from big-endian, but that was not the case, it is really just loading a 32 bit number with 4 byte loads.

Edit2: Notlikethat's comment would give a sensible explanation for that: it might be to load the full 32 bits without doing an unaligned word-sized memory load. (whether it SHOULD avoid that is another thing, which would hint to the whole thing being a compiler bug)

Let's look at this simple function:

function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;

Even with optimizations switched on, Delphi XE7 with update pack 1, as well as XE6, produce the following ARM assembly code for that function:

Disassembly of section .text._ZN16Uarmcodetestform11ReadIntegerEPi:

00000000 <_ZN16Uarmcodetestform11ReadIntegerEPi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   78c1        ldrb    r1, [r0, #3]
   a:   7882        ldrb    r2, [r0, #2]
   c:   ea42 2101   orr.w   r1, r2, r1, lsl #8
  10:   7842        ldrb    r2, [r0, #1]
  12:   7803        ldrb    r3, [r0, #0]
  14:   ea43 2202   orr.w   r2, r3, r2, lsl #8
  18:   ea42 4101   orr.w   r1, r2, r1, lsl #16
  1c:   9101        str r1, [sp, #4]
  1e:   9000        str r0, [sp, #0]
  20:   4608        mov r0, r1
  22:   b003        add sp, #12
  24:   bd80        pop {r7, pc}

Just count the number of instructions and memory accesses Delphi needs for that. And constructing a 32 bit integer from 4 byte-loads... If I change the function a little bit and use a var parameter instead of a pointer, it is slightly less convoluted:

Disassembly of section .text._ZN16Uarmcodetestform14ReadIntegerVarERi:

00000000 <_ZN16Uarmcodetestform14ReadIntegerVarERi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   6801        ldr r1, [r0, #0]
   a:   9101        str r1, [sp, #4]
   c:   9000        str r0, [sp, #0]
   e:   4608        mov r0, r1
  10:   b003        add sp, #12
  12:   bd80        pop {r7, pc}

I won't include the disassembly here, but for iOS, Delphi produces identical code for the pointer and var parameter versions, and they are almost but not exactly the same as the Android var parameter version. Edit: to clarify, the byte-by-byte loading is only on Android. And only on Android, the pointer and var parameter versions differ from each other. On iOS both versions generate exactly the same code.

For comparison, here's what FPC 2.7.1 (SVN trunk version from March 2014) thinks of the function with optimization level -O2. The pointer and var parameter versions are exactly the same.

Disassembly of section .text.n_p$armcodetest_$$_readinteger$pinteger$$longint:

00000000 <P$ARMCODETEST_$$_READINTEGER$PINTEGER$$LONGINT>:

   0:   6800        ldr r0, [r0, #0]
   2:   46f7        mov pc, lr

I also tested an equivalent C function with the C compiler that comes with the Android NDK.

int ReadInteger(int *APInteger)
{
    return *APInteger;
}

And this compiles into essentially the same thing FPC made:

Disassembly of section .text._Z11ReadIntegerPi:

00000000 <_Z11ReadIntegerPi>:
   0:   6800        ldr r0, [r0, #0]
   2:   4770        bx  lr

Source: (StackOverflow)

How does the ARM architecture differ from x86? [closed]

I have been hearing much about the ARM and x86 Architectures. Is the x86 Architecture specially designed to work with a keyboard while ARM expects to be mobile? What are the key differences between the two?

Source: (StackOverflow)

Mono on Raspberry Pi

I've seen a lot of talk about running Mono/.NET code on the Raspberry Pi. Has there been any succceses in actually running any Mono code on the Raspberri Pi?

On their site, they list several Linux distributions that work on the device and some of these distributions include Mono. However, none detail whether Mono works on it.

Is there a working implementation?

Source: (StackOverflow)

Windows Phone 7 and native C++/CLI

Microsoft recently released tools and documentation for its new Phone 7 platform, which to the dismay of those who have a big C++ codebase (like me) doesn't support native development anymore. Although I've found speculation about this decision being reversed, I doubt it. So I was thinking how viable would be to make this codebase available to Phone 7 by adapting it to compile under C++/CLI. Of course the user interface parts couldn't be ported, but I'm not sure about the rest. Anyone had a similar experience? I'm not talking about code that does heavy low-level stuff - but there's a quite frequent use of templates and smart pointers.

Source: (StackOverflow)

Quickly find whether a value is present in a C array?

I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check if a value matches the arrays contents. A bool will be set to true is this is the case. MCU is a NXP LPC4357, ARM Cortex M4 core, compiler is GCC. I already have combined optimisation level 2 (3 is slower) and placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i!=0 is faster than checking if i<256). All in all, I end up with a duration of 12.5us which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:

uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;

for (i=256; i!=0; i--)
{
    if (compareVal == *array_ptr++)
    {
         validFlag = true;
         break;
     }
}

What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks also allowed.

Source: (StackOverflow)

Why use armeabi-v7a code over armeabi code?

In my current project I make use of multiple .so files. These are located at the armeabi and armeabi-v7a folder. Unfortunately one of the .so files is a 6MB and I need to reduce file size. Instead of having a fat APK file, I would like to use just the armeabi files and remove the armeabi-v7a folder.

According to the NDK documentation, armeabi-v7a code is extended armeabi code which can contain extra CPU instructions. This all goes beyond my expertise, but I question why one would like to have both armeabi-v7a and armeabi code. There must be a good reason to have both, right?

On my test devices this all seem to work fine. These have ARM v7 CPU's. Is it safe to assume that everything works now?

Source: (StackOverflow)

How does the linux kernel manage less than 1GB physical memory?

I'm learning the linux kernel internals and while reading "Understanding Linux Kernel", quite a few memory related questions struck me. One of them is, how the Linux kernel handles the memory mapping if the physical memory of say only 512 MB is installed on my system.

As I read, kernel maps 0(or 16) MB-896MB physical RAM into 0xC0000000 linear address and can directly address it. So, in the above described case where I only have 512 MB:

How can the kernel map 896 MB from only 512 MB ? In the scheme described, the kernel set things up so that every process's page tables mapped virtual addresses from 0xC0000000 to 0xFFFFFFFF (1GB) directly to physical addresses from 0x00000000 to 0x3FFFFFFF (1GB). But when I have only 512 MB physical RAM, how can I map, virtual addresses from 0xC0000000-0xFFFFFFFF to physical 0x00000000-0x3FFFFFFF ? Point is I have a physical range of only 0x00000000-0x20000000.
What about user mode processes in this situation?
Every article explains only the situation, when you've installed 4 GB of memory and the kernel maps the 1 GB into kernel space and user processes uses the remaining amount of RAM.

I would appreciate any help in improving my understanding.

Thanks..!

Source: (StackOverflow)

Installing Raspberry Pi Cross-Compiler

I am attempting to get cross-compiling for Raspberry Pi working on my Ubuntu machine.

During my initial attempts I was using the arm-linux-gnueabi compiler, which is available in the Ubuntu repo. I got this working. I was able to build all my dependencies and use the cross-compiler in my cmake project.

However, I believe I should be using the hf version, so I switched to arm-linux-gnueabihf. Then I realized that this does not work with Raspberry Pi since it is armv6.

After some Googling, I then found the pre-built toolchain from GitHub: https://github.com/raspberrypi/tools.

I downloaded the toolchain, but I don't really understand how to "install" it. I extracted the files to my home directory. The directory structure looks like this:

/gcc-linearo-arm-linux-gnueabihf-raspbian
    /arm-linux-gnueabihf
        /bin
            (contains g++, gcc, etc)
        /lib
            (contains libstdc++ library)
    /bin
        (contains arm-linux-gnueabihf-g++, arm-linux-gnueabihf-...)
    /lib
        (gcc lib stuff)

If I change directory to the INNER bin folder I am able to compile a test program from the terminal without any problems.

~/tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian/
arm-linux-gnueabihf/bin$ g++ test.cpp -o test

I then tried to compile a test program in the OUTER bin folder, which contains the prefixed versions of the tools.

 ~/tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian/bin$ 
 arm-linux-gnueabihf-g++ test.cpp -o test

However, when I try to use the compiler now (from outside the inner bin directory), it is unable to find the libstdc++ shared library that comes with the toolchain:

arm-linux-gnueabihf-gcc: error while loading shared libraries: 
libstdc++.so.6: cannot open shared object file: No such file or directory.

Furthermore, I want to be able to use the compiler without having to navigate to the bin directory. So I tried adding the OUTER bin directory (since I want the prefixed versions) and both lib directories to my PATH:

export PATH=$PATH:~/tools/.../bin
export PATH=$PATH:~/tools/.../lib
export PATH=$PATH:~/tools/.../.../lib

However, this results in the same error. How should I "install" the toolchain so that I can use the toolchain from everywhere, just like I can when I use the cross-compilers from the Ubuntu repo?

Source: (StackOverflow)

Cross-compilation for Raspberry Pi in GCC. Where to start?

TL/DR: Where can I find more information about building a GCC 4.7.0 cross-compiling toolchain for ARM (gnueabi) platform (intended to run on a Raspberry Pi device)?

I have just got a brand new Raspberry Pi and I am very eager to start programming for it. I've managed to install the GCC toolchain (I am using the Arch Linux system image) and compiled some basic programs, all working fine.

I've also tried to compile the Boost libraries because I often use them in my projects and everything seemed to work fine by following the instructions (./bootstrap.sh + ./b2) except for the fact that the compilation was painfully slow. I left it on for a few hours but it barely got past the first few source files. After I left it running for the night, I discovered that the build process aborted due to RAM shortage.

So, my guess is that Rasp Pi is simply underpowered for compiling something of such size as Boost. So, cross-compilation comes to my mind. However, even though there is a lot of information about ARM cross compilation available online, I find it confusing. Where does one even start?

I have a recent GCC version (4.7.0) available on my Raspberry Pi, so I would ideally like to cross-compile with the same version. Where can I get the GCC 4.7.0 toolchain for ARM? (I will be compiling on x86 CentOS 6.2)

Edit:

I deallocated unneeded GPU memory and set up a 4GB swap partition on a USB drive, while build files are on a NFS share. Boost is now compiling much much faster, so it is manageable. I would still like to know how can I set up a GCC 4.7 toolchain for cross compilation on my x86 PC though, since I intend to do a lot of compiling and I would like it to be as fast as possible.

Edit 2:

Since GCC 4.7.0 is relatively new, there does not seem to be a pre-built cross-compiler (i386->ARM). I will probably have to build one myself, which seems an non-trivial task (I've tried and failed). Does anyone know of a tutorial to follow for building a GCC cross-compiler, hopefully for one of the recent versions?

I've tried with this great shell script (which worked great for building a same-arch compiler) and I've successfully built binutils and GCC's prerequisites, but then GCC build kept failing with many cryptic errors. I am really lost here, so I would greatly appreciate your help.

GCC on Raspberry Pi was configured with

--prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib 
--mandir=/usr/share/man --infodir=/usr/share/info 
--with-bugurl=https://bugs.archlinux.org/ 
--enable-languages=c,c++,fortran,lto,objc,obj-c++ --enable-shared 
--enable-threads=posix --with-system-zlib --enable-__cxa_atexit 
--disable-libunwind-exceptions --enable-clocale=gnu 
--disable-libstdcxx-pch --enable-libstdcxx-time 
--enable-gnu-unique-object --enable-linker-build-id --with-ppl 
--enable-cloog-backend=isl --enable-lto --enable-gold 
--enable-ld=default --enable-plugin --with-plugin-ld=ld.gold 
--with-linker-hash-style=gnu --disable-multilib --disable-libssp 
--disable-build-with-cxx --disable-build-poststage1-with-cxx 
--enable-checking=release --host=arm-unknown-linux-gnueabi 
--build=arm-unknown-linux-gnueabi

Edit 3:

I managed to build a 4.7 GCC toolchain for ARM (yay!) using this shell script as suggested by user dwelch in the comments. I also built newlib and libstdc++ using this article as a guide. The toolchain works fine, but hen I run the executable on my Raspberry Pi, it fails with Illegal instruction. What could be the cause of that?

Source: (StackOverflow)

what is the difference between ELF files and bin files

The final images produced by compliers contain both bin file and extended loader format ELf file ,what is the difference between too , especially the utility of ELF file.

Source: (StackOverflow)

What is the difference between arm-linux-gcc and arm-none-linux-gnueabi

What is the difference between arm-linux-gcc and arm-none-linux-gnueabi and arm-linux-gnueabi toolchains?

Do they compile differently?

Source: (StackOverflow)