yes | sudo apt-get install g++ libgmp3-dev libmpfr-dev libxml2-dev bison libmpfi-dev flex cmake libboost-all-dev libgsl0-dev &&
wget http://perso.ens-lyon.fr/damien.stehle/downloads/libfplll-3.0.12.tar.gz &&
tar xzf libfplll-3.0.12.tar.gz && cd libfplll-3.0.12/ && ./configure && make -j2 && sudo make install && cd .. &&
wget https://gforge.inria.fr/frs/download.php/28571/sollya-3.0.tar.gz &&
tar xzf sollya-3.0.tar.gz && cd sollya-3.0/ && ./configure && make -j2 && sudo make install && cd .. &&
wget https://gforge.inria.fr/frs/download.php/31858/flopoco-2.4.0.tgz &&
tar xzf flopoco-2.4.0.tgz && cd flopoco-2.4.0/ && cmake . && make -j2 && ./flopoco
We would welcome similar installation scripts for other platforms.
FloPoCo may be compiled using either CMake or the autotools. The recommended way is CMake, which is included in mainstream Linux/Unix distributions and available for other operating systems, including Windows. If you prefer to use the autotools, read the README.autotools file.
FloPoCo also depends on the MPFR library, on the C++ interface to GMP (which may or may not be a dependency of MPFR) and on flex++, all of which are probably available in your favourite Linux/Unix distribution.
You really want to link FloPoCo against Sollya: this enables the more advanced operators (HOTBM, FunctionEvaluator). For this purpose, you must download, compile and install Sollya. FloPoCo is known to work with version 3.0 of Sollya and should work with future releases.
Compilation is a two-step process:
cmake .
make
The adventurous may get FloPoCo from its subversion repository.
FloPoCo is a command-line tool. The general syntax is
flopoco <options> <operator specification list>
FloPoCo will generate a single VHDL file (named flopoco.vhdl by default) containing synthesisable descriptions of all the operators listed in <operator specification list>, plus possibly sub-operators instantiated by them. To use these operators in your design, just add this generated file to your project.
FloPoCo will also issue a report with useful information about the generated operators, such as the pipeline depth. In addition, three levels of verbosity are available.
./flopoco IntConstMult 16 12345
produces a file flopoco.vhdl containing a single operator for the integer multiplication of a 16-bit input by the constant 12345. The VHDL entity is named after the operator specification, here IntConstMult_16_12345.
./flopoco IntConstMult 16 12345 IntConstMult 16 54321
produces a file flopoco.vhdl containing two VHDL entities and their architectures, for the two given constant multipliers.
./flopoco FPConstMult 8 23 8 23 0 -50 1768559438007110
produces a file flopoco.vhdl containing two VHDL entities, one for the specified constant floating-point multiplier by 1768559438007110 * 2^-50, and the other for a needed sub-component (an integer multiplier for the significand multiplication).
Several transversal options are available and will typically change the operators occurring after them in the list. For instance, -frequency=300 sets the target frequency. The -name=UserProvidedName option replaces the (ugly and parameter-dependent) name generated by FloPoCo for the next operator with a user-provided one. This makes it possible, in particular, to change parameters while keeping the same entity name, so that these changes are transparent to the rest of the project. Options related to pipelining are reviewed below.
The -target option selects the target FPGA family. We try to optimize for the highest speed grade available for this family (see below for pipelining options).
To obtain a concise list of the available operators and options, simply type
./flopoco
In addition, this help may be more in sync with the code than this file, especially if you are using an svn snapshot.
The FloPoCo distribution also includes useful programs for converting the binary string of a floating-point number to human-readable form (bin2fp) and back (fp2bin). The longacc2fp utility converts the fixed-point output of the LongAcc operator (see below) to human-readable form.
The floating-point format used in FloPoCo is identical to the one used in FPLibrary. It is inspired by the IEEE-754 standard.
An FP number is a bit vector consisting of 4 fields. From left to right: a 2-bit exception field (encoding zero, normal numbers, infinities and NaN), a sign bit, an exponent field of wE bits, and a fraction field of wF bits.
The format is therefore parameterized by two positive integers wE and wF which define the sizes of the exponent and fraction fields respectively.
The utilities fp2bin and bin2fp will allow you to get familiar with the format and set up test benches.
There are two main differences between the format (wE=8, wF=23) and the IEEE-754 single precision format (the same holds for double).
Exceptional cases (zeroes, infinities and Not a Number, or NaN) are encoded as separate bits in FloPoCo, instead of being encoded as special exponent values as in IEEE-754. This saves quite a lot of decoding/encoding logic. The main drawback of this format appears when results have to be stored in memory, where they consume two more bits. However, FPGA embedded memory can accommodate 36-bit data, so adding two bits to a 32-bit IEEE-754 format is harmless as long as the data resides within the FPGA.
As a side effect, the exponent can take two more values in FloPoCo than in IEEE-754 (one for very large numbers, one for very small ones).
Note that, in any case, FloPoCo provides conversion operators from and to the IEEE-754 formats (single and double precision).
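To get a feel for the format, here is an illustrative software sketch (not FloPoCo code) that encodes a Python float into a bit string, assuming the layout exception(2) | sign(1) | exponent(wE) | fraction(wF), exception codes 00=zero, 01=normal, 10=infinity, 11=NaN, an IEEE-style biased exponent and an implicit leading 1 -- check the output of fp2bin against it.

```python
# Illustrative sketch, NOT FloPoCo code. Assumed field layout:
#   exception(2 bits) | sign(1 bit) | exponent(wE bits) | fraction(wF bits)
# with exception codes 00=zero, 01=normal, 10=infinity, 11=NaN,
# an IEEE-style biased exponent and an implicit leading 1.
import math

def flopoco_encode(x, wE, wF):
    """Return the (2 + 1 + wE + wF)-bit string representing x."""
    if x == 0:
        return "00" + "0" * (1 + wE + wF)
    sign = "1" if x < 0 else "0"
    if math.isinf(x):
        return "10" + sign + "0" * (wE + wF)
    if math.isnan(x):
        return "11" + "0" * (1 + wE + wF)
    m, e = math.frexp(abs(x))        # abs(x) = m * 2**e, with 0.5 <= m < 1
    exponent = e - 1                 # so abs(x) = (2*m) * 2**(e-1), 1 <= 2*m < 2
    bias = (1 << (wE - 1)) - 1
    frac = round((2 * m - 1) * (1 << wF))   # fraction bits (rounding overflow ignored)
    return "01" + sign + format(exponent + bias, "0%db" % wE) + format(frac, "0%db" % wF)

# 1.0 in the (wE=8, wF=23) format: 34 bits, two more than IEEE binary32
print(flopoco_encode(1.0, 8, 23))
```

Note how the 34-bit result is exactly the IEEE binary32 encoding with the two exception bits prepended, which is why the decoding logic saved by this format is so cheap.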
Numbers in the Logarithmic Number System used in FloPoCo have an encoding similar to the floating-point format. It is also the same as the one used in FPLibrary. Its fields are the exception and sign bits described above, followed by a fixed-point exponent with wE integral bits and wF fractional bits.
Reasonable values are 4 to 8 for wE, and 8 to 20 for wF. Other values are still allowed, including negative wE. Use at your own risk.
An operator may be combinatorial, or pipelined. A combinatorial operator has pipeline depth 0. An operator of pipeline depth 1 is obtained by inserting one and only one register on any path from an input to an output. Hopefully, this divides the critical path delay by almost 2. An operator of pipeline depth 2 is obtained by inserting two register levels, etc.
It should be noted that, according to this definition, pipelined operators usually buffer neither their inputs nor their outputs directly. For instance, connecting the input of a 400MHz operator to the output of another 400MHz operator may well lead to a circuit working at 200MHz only. It is the responsibility of the user or calling program to insert one more level of registers between two FloPoCo operators. This convention may be felt as a burden by the user, but it is the most sensible choice: it makes it possible to assemble sub-components without inserting registers in many situations, thus reducing the latency of complex components. Besides, different application contexts may have different policies (registers on outputs, or registers on inputs).
Two command-line options control the pipelining of the FloPoCo operators that follow them.
-pipeline=[yes|no] (default yes)
-frequency=[frequency in MHz] (default 300)
If -pipeline is set to no, the operator will be combinatorial. If it is set to yes, registers may be inserted as needed to reach the target frequency: FloPoCo will try to pipeline the operator to the given frequency, and will report a warning if it fails -- or if frequency-directed pipelining is not yet implemented for this operator.
The philosophy of FloPoCo's approach to pipelining is the following: you do not specify a pipeline depth, you specify a target frequency, and FloPoCo inserts as many register levels as needed to reach it. Different operators on the same command line may target different frequencies:
flopoco -frequency=200 FPAdder 11 52 -frequency=300 FPMultiplier 8 23
Requesting a lower frequency with the -frequency option may save resources. If synthesis does not reach the requested frequency, adjust -frequency accordingly. Note that not all operators support pipelining (ultimately they all will); whether they do is mentioned in the command-line help.
Here is the list of operators that can be generated by FloPoCo. This list may not be fully up to date... the code is the reference.
LeftShifter wIn MaxShift
RightShifter wIn MaxShift
LZOC wIn wOut
LZOCShifterSticky wIn wOut computeSticky countType
IntAdder wIn
MyIntAdder wIn optimizeType srl implementation bufferedInputs
optimizeType=<0,1,2,3> where 0=LUT, 1=REG, 2=SLICE, 3=LATENCY selects among the different optimization criteria. srl=<0,1> allows generating architectures optimized for the use of hardware shift registers. The architecture can also adapt to whether the inputs of the adder are already buffered or not, using the option bufferedInputs=<0,1>. Automatic design space exploration is performed by setting implementation=-1. Forcing architecture selection can be done by setting implementation=<0,1,2> where 0=Classical, 1=Alternative, 2=Short-Latency. Please check out this article for more details.
IntDualSub wIn opType
IntNAdder wIn N
IntCompressorTree wIn N
IntMultiplier wInX wInY wOut signedIO ratio enableSupertiles
IntSquarer wInX wInY
IntKaratsuba wIn
FixComplexAdder wI wO
FixComplexMultiplier wI wO ratio
FPMultiplier wE wF
FPAdder wE wF
FPAdderDualPath wE wF
FPAdder3Input wE wF
FPAddSub wE wF
FPDiv wE wF
FPSqrt wE wF
FPSqrtPoly wE wF CorrectlyRounded Degree
FPSquarer wE wF
FPPipeline filename wE wF
(Example pipeline specifications are provided for a Jacobi 1D stencil, a Horner polynomial evaluation, and a 2D norm.)
LongAcc wE_in wF_in MaxMSB_in LSB_acc MSB_acc
By tailoring the MaxMSB_in, LSB_acc and MSB_acc parameters to a given application, it allows one to bring the rounding error to a provably arbitrarily small level (and in some cases to avoid any rounding), for a very small hardware cost compared to using a floating-point adder for accumulation.
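The principle can be modelled in software as follows (an illustrative sketch, not the generated hardware; the lsb_acc parameter mirrors LSB_acc):

```python
# Illustrative software model, NOT the generated hardware: a long
# fixed-point accumulator. Each input is quantized once onto a grid of
# step 2**lsb_acc, then all additions are exact integer additions, so
# no rounding error accumulates during the summation itself.
from fractions import Fraction

def long_acc(values, lsb_acc=-20):
    """Accumulate floating-point values exactly on a 2**lsb_acc grid."""
    step = Fraction(2) ** lsb_acc
    acc = 0                                # integer count of grid steps
    for v in values:
        acc += round(Fraction(v) / step)   # the only rounding, once per input
    return float(acc * step)

# 0.5 is exactly representable on the grid: the sum is exact
print(long_acc([0.5] * 4))                 # 2.0
```

Contrast this with a floating-point adder, whose rounding error grows with every addition; here the error is bounded by one quantization per input, whatever the number of terms.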
DotProduct wE_in wF_X wF_Y MaxMSB_in LSB_acc MSB_acc
LongAcc2FP MaxMSB_in LSB_acc MSB_acc wE_out wF_out
We provide two techniques for building a multiplier by a constant. One is the good old KCM technique described by Chapman in 1994. It builds an operator whose size is independent of the constant, but grows with the size of the input. It is very efficient for very small input bit sizes and arbitrary constants. The other is based on shift-and-add graphs, and is described in all the gory details in this article. It will be more efficient for some constants. Some day we will be able to provide a uniform interface to these two families; in the meantime you may want to synthesize both and pick the best.
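To see why shift-and-add graphs work at all, here is the simplest possible decomposition, driven by the binary representation of the constant (an illustrative sketch; FloPoCo's IntConstMult searches for much cheaper graphs than this naive recoding):

```python
# Illustrative sketch, NOT FloPoCo's algorithm: naive shift-and-add
# decomposition, one addition per set bit of the constant. FloPoCo's
# IntConstMult searches for much cheaper shift-and-add graphs that
# share intermediate results.
def const_mult_shift_add(x, c):
    """Compute x * c using only left shifts and additions."""
    result = 0
    shift = 0
    while c:
        if c & 1:
            result += x << shift   # add x, shifted to the weight of this bit
        c >>= 1
        shift += 1
    return result

print(const_mult_shift_add(7, 12345) == 7 * 12345)  # True
```

The cost of this naive version is one adder per set bit of the constant; better recodings and sub-expression sharing, as in the article cited above, reduce it further.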
IntConstMult w c
IntIntKCM w c signedInput
FixRealKCM lsbIn msbIn signedInput lsbOut constant
FPConstMult wE_in wF_in wE_out wF_out cst_sgn cst_exp
cst_int_sig
FPConstMultParser wE_in wF_in wE_out wF_out wF_C constant_expr
FPRealKCM wE wF constantExpression
FPConstMultRational wE_in wF_in wE_out wF_out a b
IntConstDiv w d alpha
FPConstDiv wE wF d
FPConstDivExpert wE wF d e alpha
FixSinCos w
FixSinOrCos w d
CordicSinCos wIn wOut reduced
FixFIR p useBitheap taps [coeff list]
FixDCT p taps current_index
FunctionEvaluator function wI wO degree
HOTBM function wI wO degree
FloPoCo provides two generic operators, HOTBM and FunctionEvaluator, for evaluating an arbitrary function in fixed point. They offer (almost) the same interface: the description of a function between quotes (like "sin(x)^2" or "sqrt(1+x)"), input and output precisions (with a difference of interpretation for the outputs), and a polynomial degree. The function is assumed to have its input in [0,1]; if you need a function on a different domain, you need to scale the input, e.g. use "sin(pi/2*x)" for a sine between 0 and pi/2.
Both methods use piecewise polynomial approximation, the polynomial being "computed just right". The differences are the following.
HOTBM (best described in this article) evaluates the polynomial in parallel, and precomputes as much of the computation as it can, to tabulate it. As a consequence the operator has a very short delay, but doesn't scale well beyond 20 bits. In addition, HOTBM is actually pre-FloPoCo code and the resulting VHDL is not pipelined.
FunctionEvaluator (best described in this article) evaluates the polynomial sequentially, using the Horner scheme. The latency is larger, but it scales to much larger precisions (64 bits), making efficient use of the embedded multipliers through the IntMultiplier* classes. The resulting operator is pipelined and can run at high frequencies.
As both methods use a polynomial approximation, they work well for functions which are regular enough. In mathematical terms, they should be defined and n times continuously differentiable on [0,1]. The code is well tested for monotonic functions only. Do not hesitate to contact us for help on a given function.
For both HOTBM and FunctionEvaluator the input operand is interpreted as a positive fixed-point number, with the point before the leftmost bit. The function to be implemented is assumed to be well defined on [0,1[.
For HOTBM the output is a fixed-point number, where the first bit is the sign and the point is placed right after it. Note that the output is in fact wO+1 bits wide. For FunctionEvaluator, wO defines the weight of the least significant bit of the result, and the actual output size depends on the range of the function on [0,1]. Just try it.
Example:
flopoco HOTBM "sin(x*Pi/2)" 16 16 3
flopoco FunctionEvaluator "sin(x*Pi/2)" 32 32 4
HOTBMFX func wE_in wF_in wE_out wF_out degree
HOTBMRange func wI wO degree xmin xmax scale
wE_in, wF_in, wE_out, wF_out are the widths of the integral and fractional parts of the input and the output, respectively. [xmin, xmax] is the input domain of the function, and scale is a scaling factor to be applied to the output.
The HOTBMFX version allows selecting arbitrary fixed-point representations for the input and output. Negative values are allowed.
HOTBMRange uses HOTBM after mapping [xmin, xmax[ to [0,1[, then multiplies the output by the scaling factor scale.
Example:
flopoco HOTBMFX "log2(1+2^(-x))" 2 8 -1 8 1
Note that the HOTBM* operators perform an exploration process that typically takes a few minutes for 16 bits, and may take hours for 24 bits.
FPExp wE wF
Exponential of a floating-point number with wE bits of exponent and wF bits of significand.
FPLog wE wF TableInsize
Logarithm of a floating-point number with wE bits of exponent and wF bits of significand. The third parameter allows for performance tuning; if in doubt, set it to 0, which will default to something sensible. Otherwise, it defines the input size of the tables used by the operator, and should be between 6 and 15. See this article.
FPPow wE wF
Power function of floating-point numbers with wE bits of exponent and wF bits of significand. This function (including its exceptional case management) is fully compatible with the C99 standard, hence with current default libms.
FPPowr wE wF
Power function powr of floating-point numbers with wE bits of exponent and wF bits of significand. This function is a novelty of IEEE 754-2008; the difference with the good old pow is that it is purely defined as powr(x,y)=e^(y*ln(x)), in particular in the definition of its exceptional cases. For instance, pow is defined for negative integer x, while powr returns NaN in such cases.
Fix2FP LSB MSB wE wF
For instance, Fix2FP 0 31 8 23 converts an input integer into a single-precision number.
InputIEEE wEI wFI wEO wFO
Use InputIEEE 8 23 wEO wFO to convert from the single-precision (or binary32) format, or InputIEEE 11 52 wEO wFO to convert from the double-precision (or binary64) format. You may convert to a larger internal format or to a narrower one. Conversions are always correctly rounded.
OutputIEEE wEI wFI wEO wFO
FP2FP wEI wFI wEO wFO
Collision wE wF
These operators compute in the Logarithmic Number System (LNS). They are mostly useful for low-precision systems performing few additions and many multiplications, divisions or square roots.
LNSAddSub wE wF
LNSMul wE wF
LNSDiv wE wF
LNSSqrt wE wF
All four operators take operands with wE integral bits and wF fractional bits in their exponents.
The implementation of LNS addition/subtraction is based on HOTBM to compute sums, and uses cotransformation to evaluate differences. It is described in this article.
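Why multiplications, divisions and square roots are cheap in LNS, while additions need function evaluation, can be seen in this illustrative sketch (not FloPoCo hardware):

```python
# Illustrative sketch, NOT FloPoCo hardware: in the Logarithmic Number
# System a value v is represented by its sign and the fixed-point number
# e = log2(|v|). Multiplication and square root then reduce to an
# addition and a halving of e; only addition/subtraction of values is
# hard, which is why LNSAddSub relies on function evaluation.
import math

def to_lns(v):
    return (v < 0, math.log2(abs(v)))

def from_lns(x):
    sign, e = x
    return -(2.0 ** e) if sign else 2.0 ** e

def lns_mul(a, b):
    return (a[0] != b[0], a[1] + b[1])   # multiply = add log-exponents

def lns_sqrt(a):
    assert not a[0], "sqrt of a negative LNS number"
    return (False, a[1] / 2)             # sqrt = halve the log-exponent

print(from_lns(lns_mul(to_lns(3.0), to_lns(4.0))))   # ~12.0
```

In hardware the halving is just a one-bit right shift of the fixed-point exponent, which is why LNSSqrt is so much cheaper than a floating-point square root.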
The TestBench and TestBenchFile operators generate a test bench for the operator which precedes them in the command line. The test vectors are generated from the specification of the operator (see the developer documentation of the Operator::emulate() method).
Test cases include both standard tests and random tests. The single parameter n specifies the number of random tests to generate. The pseudo-random number generator is initialised with n as the seed, so that the test bench will be deterministic for a given n.
We strongly advise that you test operators before using them, and we await your bug reports.
TestBench n
Example:
flopoco HOTBM "sin(x*Pi/2)" 16 16 3 TestBench 1000
TestBenchFile n
TestBenchFile is similar to TestBench but moves the test vectors to a separate file called test.input. Thus the VHDL itself is very short, as is the compilation time. The simulation time is proportional to the number of tests. This scales to millions of test vectors, but is slightly less convenient in the debugging phase.
If n=-2, an exhaustive test is generated. If n=-1, no file is generated.
Example:
flopoco FPAdder 8 16 TestBenchFile 20000
Wrapper