Also, make sure you have serialization disabled (set ICTRL to 0x7).
To get maximum performance, you need to enable the copyback data cache. It can
be disabled in order to make the standard Linux/PPC libraries work without
recompiling. If you build your own glibc as described under RuntimeLibrary, you
can enable copyback. Look for a make config option, or grep for DC_SFWT in
arch/ppc/kernel/head.S and change the #if 0 to #if 1.
The BogoMIPS value on 8xx processors should be within 1% or so of the actual
CPU core frequency, allowing for rounding and minor timing calculation errors.
This makes it a useful sanity check to verify that the internal clock
multiplier is set correctly and that the I-cache is turned on. However, note
that the calculation of the BogoMIPS value is still tied to the external clock
source and internal prescaler settings, so it shouldn't be solely relied on to
verify that the core frequency really is what you think it should be. A simple
cross-check is to run 'sleep 10' at the shell prompt and time it with a watch
to check that you're at least in the ballpark. It's wise to measure your
system more accurately than this with an oscilloscope (CRO) at least once.
Also, beware that the BogoMIPS rating should not be used as a general CPU
performance measure. See http://linuxdoc.org/HOWTO/mini/BogoMips.html
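The 1% sanity check described above can be sketched as a small script. This is
only an illustration: the function names, the sample /proc/cpuinfo line and the
50 MHz expected frequency are assumptions, not taken from any particular board.

```python
import re

def bogomips_from_cpuinfo(cpuinfo_text):
    """Return the first BogoMIPS value in /proc/cpuinfo-style text, or None."""
    match = re.search(r"^bogomips\s*:\s*([0-9.]+)", cpuinfo_text,
                      re.IGNORECASE | re.MULTILINE)
    return float(match.group(1)) if match else None

def within_tolerance(bogomips, expected_mhz, tolerance=0.01):
    """True if bogomips is within `tolerance` (a fraction) of expected_mhz."""
    return abs(bogomips - expected_mhz) <= tolerance * expected_mhz

# Illustrative input only; on a target you would read /proc/cpuinfo instead.
SAMPLE_CPUINFO = "processor\t: 0\nbogomips\t: 49.86\n"
EXPECTED_MHZ = 50.0  # assumption: the core frequency you believe you configured

value = bogomips_from_cpuinfo(SAMPLE_CPUINFO)
plausible = value is not None and within_tolerance(value, EXPECTED_MHZ)
print("BogoMIPS:", value, "plausible:", plausible)
```

A passing check only says the multiplier and I-cache settings look consistent;
it does not replace the independent timing measurements mentioned above.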
There are numerous options available for system profiling, depending on what
you wish to measure, and how invasive you are prepared to be.
/proc/profile is a standard kernel feature which provides simple kernel
profiling based on Instruction Pointer sampling in the periodic timer interrupt
routine. It's simplistic but effective, and low overhead since the interrupt is
going to happen anyway. The data is processed with readprofile, which looks up
the System.map to show which kernel functions are using the most CPU time. It
doesn't work for modules yet, so at present you need to compile them in for
profiling.
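The symbol lookup can be pictured with the following sketch. It is a
simplified illustration of resolving a sampled address against System.map, not
the actual readprofile source; the function names and the sample addresses are
hypothetical.

```python
import bisect

def parse_system_map(text):
    """Return a sorted list of (address, name) for text symbols (type T or t)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("T", "t"):
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def symbol_for(symbols, address):
    """Return the name of the last symbol starting at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    return symbols[index][1] if index >= 0 else None

# Hypothetical System.map excerpt, for illustration only.
SAMPLE_MAP = """\
c0000000 T _stext
c0002000 T do_IRQ
c0004000 t example_helper
c0100000 D example_data
"""

symbols = parse_system_map(SAMPLE_MAP)
print(symbol_for(symbols, 0xC0002010))
```

Each sampled Instruction Pointer is charged to the function whose start
address is the closest one at or below it, which is why the tool needs a
System.map that matches the running kernel.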
You need to enable this at boot time by passing profile=2 on the command
line. The number gives the power-of-2 granularity used for the counters, so 2
will give you a separate counter for each PowerPC instruction (each 4 bytes).
Higher numbers consume less memory and give less precise results. The data from
/proc/profile will be in target byte order, so if you're cross-developing you
may need to either byte swap it, or compile readprofile to run on your target.
The PowerPC branch of the Linux kernel has been slow to implement the
Instruction Pointer sampling function necessary to generate the /proc/profile
data. If it isn't implemented in your kernel, you'll see that readprofile
always shows zero time for every kernel function.
http://www.opersys.com/LTT
The Linux Trace Toolkit works with an instrumented Linux kernel by saving
time-stamped records of important kernel events to a binary data file. A data
decoder converts the binary data to text and calculates statistical summaries,
such as percent processor utilization by each process. The toolkit also
includes an integrated environment that graphically displays the results and
provides search capability.
A version for embedded PowerPC targets is now available from
ftp://ftp.mvista.com/pub/LTT
All the usual Linux user mode profiling tools like gprof are available.
This project aims to make full gprof profiling available for the kernel.
However, it hasn't been ported to the PowerPC architecture yet.
Beware that IDMA on the 860 is not designed for high performance, and the CPU
gets better throughput with explicit cache-bursted programmed I/O. Search for
IDMA for more discussion.
Confusion sometimes arises because DMA transfers in most systems are faster
than CPU transfers, whereas here the reverse is generally true. Furthermore,
IDMA transfers eat into CPM processing time, limiting throughput on other
communications modules at the same time.
To get good TCP/IP performance, you need a fast CPU. Using the FEC, a 50 MHz
860P will run about 30 Mbits/sec TCP/IP, and a 100 MHz 860P will run about
60 Mbits/sec TCP/IP. The bottleneck is the protocol and application processing
in the PPC core. The performance of a TCP/IP connection scales nearly linearly
with the processor speed.
If you need to go faster, use the 8260.
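The two 860P figures above work out to roughly 0.6 Mbits/sec per MHz. As a
rough planning aid only, the near-linear scaling can be written down as
follows; the function name is hypothetical, and the result is an estimate
extrapolated from those two data points, not a measurement.

```python
def estimate_tcp_mbits(cpu_mhz, ref_mhz=50.0, ref_mbits=30.0):
    """Linearly scale the 50 MHz / 30 Mbits/sec reference point to cpu_mhz."""
    return ref_mbits * cpu_mhz / ref_mhz

for mhz in (50, 66, 80, 100):
    print(mhz, "MHz ->", round(estimate_tcp_mbits(mhz), 1), "Mbits/sec (estimate)")
```

Real throughput also depends on the application and protocol load on the
core, so treat the output as an upper-end guess for an 860P with the FEC.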
Optimizing everything for space using gcc's -Os option is likely to provide
both the smallest code size and the best performance, because it inhibits loop
unrolling, an optimization which tends to have a negative effect on embedded
processors with relatively small cache sizes. Furthermore, PowerPC processors
can speculatively execute branches overlapped with other loop instructions,
making the branch effectively execute in zero cycles, so loop unrolling is
unnecessary in many circumstances.