Index

Arduino Multi-Processor

Hardware Design

Bus Organisation

Library Software

  1. Port Registers

The Bus Controller

Node Identifiers

Miscellaneous Functions

Programming Techniques

Schematics

Firmware

images/1-1.png

16 Processor Arduino Compatible Multi-Processor



images/1-2.png

Summary



This article describes a 16 processor, Arduino compatible multi-processor. Programs designed to run on multi-processor systems are called multi-programs. This multi-processor is designed to support experiments with multi-programs that organise many processors cooperating on the solution of a single problem. It allows a multi-program to be uploaded to define the operation of the whole system as easily as loading a single program into a single processor, i.e. it is not necessary to write 16 separate programs and upload them into the 16 processors individually, as this would be a tedious process.

The 16 processors are ATmega328s, running at 16MHz, as used in the Arduino UNO. These communicate via an 8-bit, interrupt driven, communication bus administered by a further 328 based bus controller. Each processor can receive data on one of four ports, and can send data to a specific port belonging to any other processor. Software can be uploaded using the Arduino IDE, or any other compatible software. A simple library implements send and receive functions for a range of data sizes, byte, int, float etc., so that programs can be written without having to understand how the communication bus works.

The following illustrates how easy it is to upload software into the multi-processor. Using the standard Arduino IDE, the ubiquitous “blink” example can be loaded into the IDE with a serial line connecting the host computer to the multi-processor. Clicking the upload button in the IDE will cause the blink example to be compiled and uploaded to all 16 processors. The LEDs on all the boards will then all blink in synchronism. No changes to the blink example or the IDE are required to make this possible. All that is required to run a real multi-program, where the individual processors perform different tasks, is a mechanism which allows each processor to determine independently which part of the uploaded software to execute. The simple mechanism that does this is described later in the article.

The overall specifications of the system are as follows.

• Processing nodes 16.
• Program space 32k bytes.
• RAM space 32k bytes equally divided between the 16 processing nodes.
• Communication bus speed > 400K bits per second.
• Supply current approximately 300 milliamps.

An example program



The following program implements the Sieve of Eratosthenes for finding prime numbers. It does not represent a very useful prime number finder, but it is simple enough to demonstrate how the multi-processor may be used. One processor will be used to generate a sequence of successive numbers starting from two and a further processor will be used to print all the prime numbers discovered on a serial monitor. All of the other processors will be used to “sieve out” non-prime numbers. The first “sieve” will receive its numbers from the number generator processor. All the other sieves will be organised in a pipeline such that the output from one is the input to its successor. As each sieve finds a new prime number, it will send it to the output processor.

images/1-3.png
Figure 1

When a “sieve” processor is sent its first number, it can be assumed to be prime because it will not have been found to be divisible by any of the primes found by preceding sieves. A sieve will remember the first number sent to it and only send on further numbers via its output if they are not divisible by the initial prime. The definition of a sieve is shown below.

#include <DataFlow.h>

typedef class sieve: public node
{
  public:
    const byte input_port = 1;
    connection output, printer;

    sieve(connection sieve_output, connection printer_output): 
      output(sieve_output), printer(printer_output) 
    {
    }
    
    void run()
    {
      byte prime = receive_on(input_port);
      send_to(prime, printer);
      while (true)
      {
        byte value = receive_on(input_port);
        if (value % prime != 0 && this_node() < max_node) send_to(value, output);
      }
    }
};


The action to be performed by each processor is described by such a class definition. The first line includes the support library called “Dataflow”. This library defines the communication operations and the mechanism used to assign a specific task, such as that show above, to a specific processor. Classes to be used to describe the behaviour of a processor must inherit from the class node defined by the library.

typedef class node
{
  public:
    node(){}
    virtual void initialise(){}
    virtual void run(){}
};


In the case of the sieve class, no initialisation is required, but the run function is overridden to define the behaviour of a sieve. The constructor for the sieve class allows the output connections of the sieve to be defined. This will be described in more detail later.

The run function first arranges to receive a byte from its input port, port 1 in this case. The value received can be assumed to be prime as described above. This prime is output straight away via the connection to the printer. The sieve then repeatedly receives further numbers and tests to see if they are divisible by the original prime. If they are, they are ignored as non-prime, otherwise they are sent via the output connection to the next sieve in the pipeline. The exception to this is if the sieve is the last in the chain, in which case there is no further sieve to send to. The this_node function returns the processor number of the processor executing the run function, and it can be compared with the max_node constant defined by the DataFlow library to find out if this sieve is at the end of the pipeline.

The following class describes the number generator.

#include <DataFlow.h>

typedef class numbers: public node
{
  public:
    connection output;

    numbers(connection numbers_output): output(numbers_output) {} 
    
    void run()
    {
      byte number = 2;
      while (true)
      {
        send_to(number, output);
        number = number + 1;
      }
    }
};


Again, this has a constructor for defining its output, and a run function which simply sends out successive numbers starting from 2 via the output connection. It will send out numbers for as long as the output keeps accepting them.

The final class describes the behaviour of the printer.

#include <DataFlow.h>

typedef class printer: public node
{
  public:
    const byte input_port = 1;
    
    void initialise()
    {
      Serial.begin(9600);
      while (!Serial);
    }
    
    void run()
    {
      Serial.println("Primes");
      while (true)
      {
        byte prime = receive_on(input_port);
        Serial.println(prime);
      }
    }
};


This initialises the serial connection to operate at 9600 baud using the standard Arduino Serial library, and waits for it to start. The run function then prints out the string “Primes” before repeatedly receiving byte values and printing them. It can be seen from this that the functions initialise and run defined by the node class have much the same roles here as do the setup and loop functions used in a standard Arduino sketch.

One thing remains to be done to complete the description of the primes system, and that is to assign the operations described by the preceding class definitions to specific processors. The main program used to do this has the same structure as a standard Arduino sketch and is shown below.

#include <DataFlow.h>

#include "numbers.hpp"
#include "sieve.hpp"
#include "printer.hpp"

const byte printer_node =  min_node + 0;
const byte numbers_node =  min_node + 1;
const byte first_sieve  =  min_node + 2;

void setup()
{
  init_dataflow();
  switch (this_node())
  {
    case printer_node: activate_node(new printer());                                        break;
    case numbers_node: activate_node(new numbers(link(first_sieve)));                       break;
    default          : activate_node(new sieve(link(this_node() + 1), link(printer_node))); break;
  }
}

void loop()
{
}


This description should be stored in a primes.ino file to be opened using the Arduino IDE. It assumes that the previous class definitions are stored in files called numbers.hpp, sieve.hpp and printer.hpp [1]. If all these files are stored in a folder called primes, opening the primes.ino file with the IDE will cause all the files to open in separate tabs, and the program can be uploaded to all the processors simultaneously via a single serial connection, usually via a suitable USB adaptor. The same serial connection can be used with a serial monitor to see the list of primes printed by the printer class. In this case it will output the primes 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41 and 43 before running out of sieves[2].

The first few lines in the program are used to include the Dataflow library and the numbers, sieve and printer classes. To understand what the rest of the program does, it is necessary to know that each processor has a unique number called its node identifier. The 16 processors in the current hardware have node identifiers from 0 to 15. The DataFlow library defines this range of node identifiers using the constants min_node and max_node. For the purposes of the primes system, processors will be allocated to tasks starting from min_node. The node identifiers are associated with the 3 constants at the top of the program.

The setup and loop functions are those used by a standard Arduino sketch. The loop function is not used by the primes system, but its presence is required by the Arduino IDE.

The setup function must first initialise the DataFlow library by calling its init_dataflow operation. The same setup function will run in all 16 processors simultaneously when they are reset, so what happens next depends on each processor being able to read its own node identifier using the this_node function. The switch statement is used to activate different activities on different processors. If the setup function is running on the processor designated to be the printer, a new instance of the printer class is created and passed to the DataFlow library's activate_node function. This will first call its initialise function and then its run function. If the run function terminates, the main program will eventually call the standard loop function, which in this case does nothing. It is quite normal for individual processors to execute operations that do not terminate as is the case for all the components of the primes system.

If the processor executing setup is the processor designated to run the number generator, it will apply activate_node to a new instance of the numbers class passing it a link to the first_sieve so that it knows where to send its output. The full version of the link function takes 2 arguments, and combines them to form a connection structure. The first argument is a node identifier and the second is a port number. If the port number is omitted, port 1 is assumed.

As all the processors other than the printer and the number generator are to be used as sieves, the default clause in the switch statement applies activate_node to new instances of the sieve class. These are passed 2 connections each. The first is to the following sieve in the pipeline and the second is to the printer.

Dynamic allocation of class objects is not often employed on the Arduino because of the very limited amount of RAM available. If many objects are allocated and deallocated in an arbitrary fashion, it is too easy for the RAM to become so fragmented that no further allocations are possible. The way dynamic allocation is used here does not cause this problem because the new class instances are only created once by the setup function and remain allocated for the duration of the system's execution. The advantage of dynamic allocation here is that if all the working variables for a particular activity are contained within a class, the RAM of the processor allocated to that task only has to support those variables, and not the variables of all the other classes in the system, i.e. although every processor must have a copy of the code for the entire system, it only needs to allocate variables for the task it is to perform. A 328 processor only has 2k of RAM, but the 16 processor system may use all 32k of RAM available across the 16 processors.

Download the Complete Sieve Program

Next: Hardware Design
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. .hpp files are used here because they are more than just simple header files, The Arduino IDE will however cope if .h files are used instead, and there is no need to use include guards because the files will only ever be included once in the main program.

2. Although the prime number sieve program presented here is not very useful, because it can only find the first 14 primes, it is fairly straightforward to extend it. Each sieve could be made to store primes it finds in an array. Each potential prime sent to a sieve by its predecessor can then be tested against the primes in the array, and only when the array is full do further potential primes get sent on to on its successor. If each sieve “sieves out” just 100 primes, that would enable the system presented here to find the first 1,400 primes much faster than a single node. Also a single node could not store 1,400 primes in its limited RAM as each prime would have to be represented as an integer, i.e.2 bytes. This illustrates one strategy for dealing with larger problems in a way that exploits multiple processors. Download the Upgraded Sieve Program