Most of the best integrated circuits today come only in hard-to-solder BGA packages. Because a BGA package has its connections under the chip, it is harder to solder and has to be reflowed in an oven or on a hot plate. Designing the PCB is another problem: vias and traces need to be small enough to fit between the solder balls, and the board usually needs quite a few layers to make room for all the closely packed traces. A cheap Chinese two-layer board doesn't have enough room, and adding layers increases the cost of the board dramatically when ordering only a few copies.

I wanted to try designing a board with BGA chips to see how hard soldering them could be, so I decided to design a small ARM embedded system that can run Linux. The processor I chose was the AT91SAM9N12 in a 217-ball LFBGA package, simply because it was the cheapest ARM processor with a memory management unit, which is required to run Linux. Originally I wanted to have only one BGA chip, but RAM in a BGA package was significantly cheaper than in other packages, so I decided to use DDR2 memory in a BGA package as well.

Positioning the via for the maximum amount of space.

Finding a manufacturer for the board turned out to be a slightly challenging task. Two layers wouldn't be enough, so at least four layers would be needed. The ball diameter of the 217-LFBGA package is 0.4 mm and the distance between adjacent balls is 0.8 mm. To make a little more room for the vias, the ball land pattern can be made slightly smaller than the solder ball. I used 0.36 mm pads. Putting the via in the middle of four balls maximises the available space. The manufacturer needs to be able to make a via that fits in a space of about 0.8 mm. Almost any manufacturer can make a via with this diameter, but the problem is that this distance has to include the via drill diameter, two times the annular ring around the via and two times the minimum distance between via and trace. For example, iTead's four-layer board has a minimum via drill diameter of 0.3 mm, a minimum annular ring width of 0.15 mm and a minimum of 0.15 mm between via and trace. This adds up to 0.9 mm, which means that even a minimum-sized via can't fit between the BGA balls. The only reasonably priced manufacturer I found was OSH Park. Their four-layer board has smaller limits, and a via can just barely fit between the BGA balls. As a bonus it's also cheaper than iTead for small boards.
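The via arithmetic above can be checked with a few lines of Python. This is only a sketch: the iTead numbers are the ones quoted in the text, while the OSH Park numbers (10 mil drill, 4 mil annular ring, 5 mil clearance) are an assumption quoted from memory and should be checked against the manufacturer's current design rules. The available space is taken as the diagonal distance between two pad centres minus one pad diameter.

```python
import math

MIL = 0.0254  # millimetres per mil


def via_footprint_mm(drill, annular_ring, clearance):
    """Space a via needs: drill + 2 * annular ring + 2 * clearance."""
    return drill + 2 * annular_ring + 2 * clearance


def diagonal_gap_mm(pitch, pad_diameter):
    """Copper-free distance between two diagonally opposite pads."""
    return pitch * math.sqrt(2) - pad_diameter


# 0.8 mm ball pitch and 0.36 mm pads, as used on this board.
gap = diagonal_gap_mm(pitch=0.8, pad_diameter=0.36)

# iTead four-layer limits as quoted in the text.
itead = via_footprint_mm(drill=0.3, annular_ring=0.15, clearance=0.15)

# ASSUMPTION: OSH Park four-layer limits, not stated in the text.
osh_park = via_footprint_mm(drill=10 * MIL, annular_ring=4 * MIL, clearance=5 * MIL)

for name, needed in (("iTead", itead), ("OSH Park (assumed rules)", osh_park)):
    verdict = "fits" if needed <= gap else "does not fit"
    print(f"{name}: needs {needed:.3f} mm, {gap:.3f} mm available -> {verdict}")
```

With these inputs the iTead via needs more room than the diagonal gap offers, while the assumed OSH Park via fits with only a few hundredths of a millimetre to spare.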

Minimum via with OSH Park design rules. It fits, but just barely.

Even though the via can fit between the BGA balls, there are still some problems. There is not enough room for a trace to go between two vias, so it's not possible to route the BGA using standard escape routing where every pad has its own via. The board therefore needs to have enough unconnected pads on the perimeter so that traces from the inside can be routed out. Fortunately the processor has many general-purpose I/O pins that are left unconnected.

... but a trace doesn't fit between two vias without violating the design rules. The CAS trace doesn't have enough room to fit between the DQM0 and D15 vias.
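The same kind of check shows why a trace cannot pass between two neighbouring vias. The sketch below assumes the vias sit on the same 0.8 mm grid as the balls and again uses assumed OSH Park four-layer limits (10 mil drill, 4 mil annular ring, 5 mil trace width and clearance); these values are not given in the text.

```python
MIL = 0.0254  # millimetres per mil


def gap_between_vias_mm(pitch, drill, annular_ring):
    """Edge-to-edge copper gap between two vias placed one pitch apart."""
    via_diameter = drill + 2 * annular_ring
    return pitch - via_diameter


def channel_needed_mm(trace_width, clearance):
    """Width a trace occupies including clearance on both sides."""
    return trace_width + 2 * clearance


# ASSUMPTION: OSH Park four-layer design rules, quoted from memory.
gap = gap_between_vias_mm(pitch=0.8, drill=10 * MIL, annular_ring=4 * MIL)
needed = channel_needed_mm(trace_width=5 * MIL, clearance=5 * MIL)

print(f"gap between vias: {gap:.3f} mm, trace channel needs: {needed:.3f} mm")
print("trace fits" if needed <= gap else "trace does not fit")
```

Under these assumptions the channel is short by a few hundredths of a millimetre, which matches the design rule violation in the picture.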

With the manufacturing concerns solved, it's time to start thinking about what components should be on the board. I don't really care about the usefulness of the board; this whole project is more of a learning experience. To keep the costs down the board needs to be kept small, which means there won't be space for any extra interfaces like Ethernet, serial ports or an SD card.

Besides the processor and RAM, the other essential components are mass memory, voltage regulators and a supervisor circuit for handling the reset of the chip. The processor can boot from NAND, but I decided to also have Dataflash for the bootloader just in case, though this ended up being left unpopulated. For mass memory NAND flash is a good choice because it's cheap and has a big capacity. Having it in a BGA package too would have been cheaper, but I already had enough trouble with two BGA packages, so I decided to use a 4 Gb NAND in a 48-pin TSOP package.

Connecting the components is explained well in the processor's datasheet, but because the document is over 1000 pages long it can be hard to find all the details. Atmel also publishes schematics of its evaluation boards, which are very helpful when designing a board.

Some liberties need to be taken with the DDR2 traces. Normally the traces should be length matched, have controlled impedance and use termination or series resistors. The reference design from the development board uses series resistors on all of the DDR2 signals. I don't have enough room for them, so I just decided to leave them out. The impedance isn't 50 ohms either, because I had to use narrower traces to fit everything. I hoped that because the RAM is so close to the processor, the lack of series resistors and the impedance mismatch wouldn't matter. All traces from the CPU to the RAM are about 25 mm long. The usual rule of thumb is that if the trace length is over 10 % of the wavelength of the signal, transmission line effects should be taken into account. In this case that would mean frequencies roughly above 1 GHz. The RAM clock frequency is only 133 MHz and even its first few harmonics are under 1 GHz, which suggests that this should work fine. Just to be sure I matched the trace lengths to within a few millimeters, but this might have been unnecessary.
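The 10 % rule of thumb can be worked through numerically. The sketch below computes the frequency at which a 25 mm trace is one tenth of a wavelength long, once with the free-space speed of light and once with an assumed effective dielectric constant of 4 for FR4, which roughly halves the propagation speed. The dielectric constant is an assumption and not a measured property of this board.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def critical_frequency_hz(trace_length_m, fraction=0.10, eps_eff=1.0):
    """Frequency at which the trace is `fraction` of a wavelength long."""
    velocity = C / eps_eff ** 0.5          # propagation speed in the dielectric
    wavelength = trace_length_m / fraction  # wavelength that makes the trace 10 %
    return velocity / wavelength


f_free = critical_frequency_hz(0.025)              # free-space estimate
f_fr4 = critical_frequency_hz(0.025, eps_eff=4.0)  # ASSUMPTION: eps_eff of about 4

# Odd harmonics carry most of the energy of a square-wave clock.
harmonics_mhz = [133 * n for n in (1, 3, 5)]

print(f"free space: {f_free / 1e6:.0f} MHz")
print(f"FR4 (assumed eps_eff = 4): {f_fr4 / 1e6:.0f} MHz")
print("133 MHz clock harmonics (MHz):", harmonics_mhz)
```

The free-space figure comes out at about 1.2 GHz, in line with the estimate above. With the assumed dielectric constant the limit drops to around 600 MHz, so the fifth harmonic of the clock would already sit slightly above it and the margin is smaller than the free-space number suggests.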

The power supplies are a bit complicated. The processor's core voltage is 1.0 V, the RAM needs 1.8 V and the NAND needs 3.3 V. Because the input voltage is 5 V from USB, the board needs three different voltage regulators. Normally it would be good to reserve one layer of the board for power supplies and keep it free of signal traces to lower the power supply impedance, but the board only has four layers and one of them is used as a ground plane. That would leave only two layers for signals, which wasn't enough. So I didn't have a separate power supply layer and instead made several planes on different layers for the different supplies.

In the worst case the losses with linear regulators alone would have been too big for a USB-powered board, so I decided to make the 3.3 V regulator a more efficient switching regulator. The 1.0 V and 1.8 V regulators are linear regulators that take 3.3 V as their input. Because a linear regulator's losses depend on the difference between input and output voltage, using 3.3 V as the input instead of 5 V increases the efficiency.
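To put rough numbers on this, the efficiency of an ideal linear regulator is simply the output voltage divided by the input voltage, and the power it burns is the voltage drop times the load current. The sketch below ignores quiescent current and the losses of the 3.3 V switching regulator itself, and the 100 mA load current is a hypothetical value chosen only for illustration.

```python
def linear_efficiency(v_in, v_out):
    """Ideal linear regulator efficiency (quiescent current ignored)."""
    return v_out / v_in


def linear_loss_w(v_in, v_out, i_load):
    """Power dissipated in the regulator as heat."""
    return (v_in - v_out) * i_load


I_LOAD = 0.1  # HYPOTHETICAL load current in amperes, for illustration only

for v_out in (1.0, 1.8):
    for v_in in (5.0, 3.3):
        eff = linear_efficiency(v_in, v_out)
        loss = linear_loss_w(v_in, v_out, I_LOAD)
        print(f"{v_in} V -> {v_out} V: efficiency {eff:.0%}, loss {loss * 1000:.0f} mW")
```

For the 1.0 V rail the ideal efficiency rises from 20 % with a 5 V input to about 30 % with a 3.3 V input, and for the 1.8 V rail from 36 % to about 55 %.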

Schematic. Click for a pdf version.

PCB layout. Copper pours are not filled.

Soldering

Bare boards.

Solder paste and components applied to the backside. The paste was applied manually with a toothpick. The components here are 1 mm (0.04 in) long. I placed only the passives first to see how they would reflow; if the results weren't good I could use another board.

Reflowing with a toaster oven and custom controller.

After reflowing. Three components were soldered in the wrong place. I just ended up taking them off; there are enough decoupling capacitors on the board that a few missing ones have no negative effect. I also mistakenly placed a capacitor on the upper left footprint, where there should be a resistor instead.

For the topside I got a stencil from OSH Stencils so I wouldn't need to apply paste manually to the BGA footprints. I secured the board and stencil by taping them to the table.

Stencil lines up very well.

This might look excessive, but almost all of the solder paste is recovered. Some extra solder paste is needed to get even fill.

After spreading the paste and removing the stencil. Much better than the backside paste I applied by hand.

I started with non-BGA components. They are placed manually with tweezers and a steady hand.

CPU and my fingertips. Ball spacing is 0.8 mm. Many new BGAs use even smaller 0.5 mm spacing.

BGAs placed on the board. A component needs to be placed with less than 0.4 mm of error, or otherwise it might be soldered one row off, and because the solder balls are under the chip the alignment cannot be checked. Without the border drawn on the silk screen it would have been almost impossible to place the chips with the required precision; with it, it's easy to just line up the component's edges with the silk screen border.

Reflowing the top side. Geiger counter PCBs are used to raise the board so that the bottom side components don't touch anything. Solder surface tension will keep the bottom side components from falling off.

After the oven. The solder joints look very nice and all components are still in their places.

Soldering the NAND flash. My soldering iron tip is bigger than the pins, and soldering one pin at a time was too difficult. An easier way is to flood the pins with solder and then remove the excess with solder wick.

After removing the excess solder the solder joints are very high quality.

Finished board after adding the headers for power supply and debug serial port.

Final product with hand for scale.

The other side. Empty footprint is for Dataflash.

After plugging the USB cable into the USB device port nothing exploded, and I could see that a new serial port, /dev/ttyACM0, had appeared. When I opened it with the SAM-BA program, which is used to program the bootloader and kernel, everything seemed to be working. Many people say that soldering BGAs is hard, but based on this experience I can't agree. Maybe I just got lucky, but I didn't have any problems with them.

Opening SAM-BA. At91sam9n12ek is Atmel's development kit for this processor and its configuration also works for this board.

DDR2 works, executing programs works and writing to NAND works. In other words, everything works.

Software

Boot flow graph

Bootloader

The boot process starts with the internal ROM bootloader trying to find a valid program in different memories. It probes SPI flash, SD card, NAND flash, a second SPI flash and I2C EEPROM for a valid program. If it finds one it starts it; otherwise it enters the SAM-BA monitor, a debugging mode where the processor listens on the serial and USB ports for commands. This mode allows programming the bootloader.
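The probing order described above can be illustrated with a small simulation. This is only a model of the behaviour as described in the text, not code that runs on the chip, and the function and variable names are made up for the example.

```python
# Probe order as listed in the text.
BOOT_ORDER = [
    "SPI flash",
    "SD card",
    "NAND flash",
    "second SPI flash",
    "I2C EEPROM",
]


def rom_boot(valid_images):
    """Return what the ROM bootloader does, given the memories holding a valid program."""
    for source in BOOT_ORDER:
        if source in valid_images:
            return f"boot from {source}"
    # Nothing bootable found: fall back to the SAM-BA monitor.
    return "SAM-BA monitor"


print(rom_boot(set()))            # blank board
print(rom_boot({"NAND flash"}))   # after programming the bootloader to NAND
```

On a freshly soldered board no memory holds a valid program, so the model ends in the SAM-BA monitor. Once a bootloader is written to NAND the same loop stops at the NAND flash.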

The ROM bootloader can't boot the Linux kernel directly, so a second stage bootloader is needed. It will initialize the RAM and clocks and then load the Linux kernel. AT91 Bootstrap is a ready-made bootloader that does all of this. It will be stored at the beginning of the NAND flash; it could also be placed in Dataflash if I had populated its footprint. Even though AT91 Bootstrap could boot Linux directly, it's more useful for debugging to have the U-boot bootloader after it. U-boot is its own mini operating system with a command line, and it can read USB sticks, use Ethernet, read and write NAND and of course boot Linux. Using U-boot makes it easier, for example, to erase the NAND or change the Linux boot parameters.

To compile the bootloader an ARM cross compiler is needed. I'm using Sourcery CodeBench Lite Edition because it's very easy to set up and works well. It's easiest to start by loading the configuration file for the AT91SAM9N12EK development board. Modifying this configuration file saves a lot of trouble compared to writing a new configuration file from scratch.

To make it work on this custom board some changes are needed: the RAM size needs to be configured to 64 MB, the number of banks changed to 4 and some latencies tweaked (the evaluation board has 128 MB of RAM with 8 banks). The NAND initialization function also needs to be modified, because this board has the NAND flash connected differently from the development board and the bootloader needs to be told about it.
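The relationship between bank count and RAM size can be sanity checked from the memory geometry. The sketch below assumes a 16-bit data bus with 13 row bits and 10 column bits for both chips, which is typical for DDR2 parts of this size but is not stated in the text; the real values must come from the memory datasheet.

```python
def sdram_bytes(row_bits, col_bits, banks, bus_width_bits):
    """Capacity of an SDRAM array from its addressing geometry."""
    locations = (1 << row_bits) * (1 << col_bits) * banks
    return locations * bus_width_bits // 8


MIB = 1024 * 1024

# ASSUMPTION: 13 row bits, 10 column bits and a 16-bit bus for both parts.
custom_board = sdram_bytes(row_bits=13, col_bits=10, banks=4, bus_width_bits=16)
eval_board = sdram_bytes(row_bits=13, col_bits=10, banks=8, bus_width_bits=16)

print(f"custom board: {custom_board // MIB} MiB")
print(f"evaluation board: {eval_board // MIB} MiB")
```

With identical row and column settings, halving the bank count from 8 to 4 halves the capacity from 128 MB to 64 MB, which is why the bank count has to change together with the size.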

U-boot

Configuring U-boot is very straightforward now that AT91 Bootstrap has initialized the hardware. U-boot also has a configuration file for at91sam9n12ek, but it's set up to boot from SD card by default. There aren't many other required changes, because the hardware is already configured. Some optional ones are enabling the UBIFS tools for creating and editing partitions on NAND flash and enabling support for reading ext4-formatted USB sticks. USB support makes it possible to boot the Linux kernel from a USB stick, which makes it easy to experiment with different kernel configurations.

Linux and root filesystem

Installing Linux isn't as easy as on a regular x86 PC. The kernel needs to be configured to support all the required devices and the root filesystem image needs to be built. This could be done by hand, but it's easier to use buildroot, which is a set of makefiles for building the root filesystem and kernel. The process can still be a little hard because of the massive number of options in the kernel and buildroot.

Buildroot doesn't have a configuration file for the at91sam9n12ek development board, but it has one for another Atmel board, at91sam9260ek. Using this configuration as a base makes the configuration a little easier. These settings can be loaded with "make at91sam9260ek_defconfig".

First we want to have a relatively new kernel version, because there have been some minor changes related to this processor. So let's use version 3.15.3, which was released last week.

Linux is configured in buildroot with the "make linux-menuconfig" command, which opens the usual Linux menu configuration window. The most important part of the kernel configuration is the system type menu. We need to check AT91SAM9N12 support and the "Atmel AT91SAM Evaluation Kits with device-tree support" option. The device tree is an external binary file, loaded along with the kernel, that describes the hardware available on the board. This makes it possible to use the same kernel on different boards, and the device configuration for each board can be written as a text file instead of a slightly different C file for every board. Luckily a device tree file exists for at91sam9n12ek, which can be used as a base. The required changes are basically removing the devices that are not available. The path to this modified device tree file needs to be added to the buildroot configuration so that it knows to compile it and build an image out of it.

Other options worth enabling in the kernel are:

USB host support AT91_USB.
NAND flash support and processor's internal NAND ECC controller (PMECC) support.
Support for UBIFS, which is going to be used as a root filesystem.
Ext4 support for reading USB stick.

In the buildroot configuration we need to choose which programs we want on the root filesystem and the options for generating the root filesystem image. This board has raw NAND memory without a controller, so the usual desktop filesystems such as ext4 can't be used. UBIFS is the usual choice and it's fine in this case too.

UBIFS has several options that depend on the NAND flash type, and if they are wrong Linux can't read the resulting filesystem. These options could be figured out from the NAND flash datasheet, but an easier way is to boot Linux from a USB stick and create the UBI partition from there. Alternatively, U-boot's "ubi info" command will read the NAND and output the required configuration values.
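For reference, the datasheet route boils down to a little arithmetic. The sketch below derives the minimum I/O size, logical erase block (LEB) size and erase block count, which correspond to the -m, -e and -c options of mkfs.ubifs. It assumes 64 pages per erase block and that the whole 4 Gb chip is one partition; neither is stated in the text, so the values reported by "ubi info" remain the authoritative ones. Whether sub-page writes are available depends on the NAND driver, so both cases are shown.

```python
def ubi_geometry(page_size, pages_per_block, flash_bytes, subpage_size=None):
    """Derive the UBI/UBIFS sizes that depend on the NAND geometry."""
    peb_size = page_size * pages_per_block    # physical erase block size
    header_unit = subpage_size or page_size   # smallest unit a header can occupy
    # The EC header takes the first header unit and the VID header the second.
    # Data starts at the next full page boundary after both headers.
    headers_end = 2 * header_unit
    data_offset = -(-headers_end // page_size) * page_size  # round up to a page
    return {
        "min_io_size": page_size,
        "leb_size": peb_size - data_offset,
        "peb_count": flash_bytes // peb_size,
    }


FLASH_BYTES = 512 * 1024 * 1024  # 4 Gb NAND

# ASSUMPTION: 64 pages per erase block (128 KiB blocks).
no_subpage = ubi_geometry(2048, 64, FLASH_BYTES)
with_subpage = ubi_geometry(2048, 64, FLASH_BYTES, subpage_size=512)

print("without sub-pages:", no_subpage)
print("with 512-byte sub-pages:", with_subpage)
```

The erase block count is only an upper bound for the maximum LEB count, since the bootloader and kernel partitions and the blocks reserved for bad block handling reduce what is available to the filesystem.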

After typing "make", buildroot will download the cross compiler, the Linux kernel and all of the other packages, build them and output the kernel, device tree and root filesystem images. These can be transferred to the board with the SAM-BA program. The same program is needed to program the NAND ECC controller parameters. The same parameters should also be configured in AT91 Bootstrap, U-boot and the Linux kernel, or otherwise they will report that the NAND is corrupted. In this case the NAND has 2048-byte pages with 512-byte sectors, and the ECC should be able to correct 4 bits per sector. The NAND addresses for storing all the images can be found in the AT91 Bootstrap and U-boot configuration files.
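The ECC parameters also determine how many spare bytes per page the error correction code occupies. The sketch below uses the general BCH rule that correcting t bits costs m * t parity bits, where m is the order of the Galois field. The field orders in the table (13 for 512-byte sectors, 14 for 1024-byte sectors) are an assumption about how a BCH engine such as PMECC is typically set up and should be checked against the processor datasheet.

```python
import math

# ASSUMPTION: Galois field order per sector size for a PMECC-style BCH engine.
FIELD_ORDER = {512: 13, 1024: 14}


def ecc_bytes_per_sector(sector_bytes, correctable_bits):
    """Parity bytes needed to correct `correctable_bits` errors in one sector."""
    m = FIELD_ORDER[sector_bytes]
    return math.ceil(m * correctable_bits / 8)


def ecc_bytes_per_page(page_bytes, sector_bytes, correctable_bits):
    """Parity bytes needed for all sectors of one page."""
    sectors = page_bytes // sector_bytes
    return sectors * ecc_bytes_per_sector(sector_bytes, correctable_bits)


per_sector = ecc_bytes_per_sector(512, 4)
per_page = ecc_bytes_per_page(2048, 512, 4)

print(f"{per_sector} ECC bytes per sector, {per_page} ECC bytes per page")
```

Under these assumptions the configuration used here needs 7 bytes per sector, or 28 bytes for the four sectors of a page, which have to fit in the spare area of the NAND page.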

After programming the board and resetting, the ROM bootloader should find the AT91 bootstrap on the NAND and start the boot process:

If you are interested in learning more, the hardware and software files are available on GitHub.

UPDATE 2014-08-10:

I ran a memory test on the RAM and there were some errors. Rarely, about once per memtester loop, there was an error in bit 19 during the random write tests.

# memtester 20M
memtester version 4.3.0 (32-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffff000
want 20MB (20971520 bytes)
got  20MB (20971520 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 2:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : |FAILURE: 0xfdb30157 != 0xfdbb0157 at offset 0x001892e4.
  16-bit Writes       : ok

Loop 3:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : |FAILURE: 0x7df4005f != 0x7dfc005f at offset 0x000b7778.

This probably means that the problem is with the D3 trace. It's the longest data trace and runs parallel to some address lines for short stretches. It's hard to say for sure what the cause is, but it is related to signal integrity, as this chip should be able to run at 133 MHz reliably. Despite the errors, this configuration worked well enough that no kernel panics or anything else related to bad memory was observed during normal operation. Issues seem to arise only when the memory is stressed.
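The failing bit can be read directly from the memtester output by XORing the two values of each failure line. The sketch below does this for both failures shown above. Mapping the bit number to a data line assumes a 16-bit data bus, where a 32-bit word is transferred as two half-words, so that bit 19 travels on the same line as bit 3.

```python
def flipped_bits(a, b):
    """Return the bit positions in which two values differ."""
    diff = a ^ b
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def data_line(bit, bus_width=16):
    """Data line carrying the given bit. ASSUMPTION: 16-bit data bus."""
    return f"D{bit % bus_width}"


# Value pairs copied from the two FAILURE lines in the memtester output.
failures = [
    (0xFDB30157, 0xFDBB0157),
    (0x7DF4005F, 0x7DFC005F),
]

for a, b in failures:
    for bit in flipped_bits(a, b):
        print(f"{a:#010x} vs {b:#010x}: bit {bit} -> {data_line(bit)}")
```

Both failures differ only in bit 19, which under the 16-bit bus assumption corresponds to D3.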

If the clock speed is decreased to 100 MHz the memory seems to work fine. I have run memtester for over 10 hours continuously and no errors have been found.