%SECTION0{name=Performance}% Performance

%SECTION1{name=CPUcore}% CPU core

%SECTION2{name=Cache}% Cache

To get maximum performance, you need to enable the copyback data cache. Copyback can be disabled in order to make the standard Linux/PPC libraries work without recompiling. If you build your own glibc as described under [[RuntimeLibrary]], you can enable copyback. Look for a =make config= option, or *grep* for =DC_SFWT= in =arch/ppc/kernel/head.S= and change the =#if 0= to =#if 1=.

Also, make sure you have serialization disabled (set =ICTRL= to 0x7).

%SECTION2{name=BogoMIPS}% BogoMIPS

The BogoMIPS value on 8xx processors should be within 1% or so of the actual CPU core frequency (in MHz), allowing for rounding and minor timing calculation errors. This makes it a useful sanity check to verify that the internal clock multiplier is set correctly and that the I-cache is turned on.

However, note that the calculation of the BogoMIPS value is still tied to the external clock source and internal prescaler settings, so it should not be relied on by itself to verify that the core frequency really is what you think it is. A simple cross-check is to run =sleep 10= at the shell prompt and time it with a watch to check that you are at least in the ballpark. It is wise to measure your system more accurately than this, with an oscilloscope (CRO), at least once.

Also, beware that the BogoMIPS rating should not be used as a general CPU performance measure. See [[http://linuxdoc.org/HOWTO/mini/BogoMips.html]]

%SECTION1{name=Profiling}% Profiling

There are numerous options available for system profiling, depending on what you wish to measure and how invasive you are prepared to be.

%SECTION2{name=ProcProfile}% /proc/profile

=/proc/profile= is a standard kernel feature which provides simple kernel profiling based on Instruction Pointer sampling in the periodic timer interrupt routine. It is simple but effective, and has low overhead because the timer interrupt is going to happen anyway.
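The sampling idea can be illustrated with a small host-side model. This is only a sketch in Python, not the kernel's code: the function name, the =text_start= address, the sample values, and the =shift= parameter (the power-of-2 bucket size) are all assumptions made for illustration.

```python
from collections import Counter

def sample_histogram(sampled_ips, text_start, shift):
    """Bucket sampled instruction pointers into counters.

    Each bucket covers 2**shift bytes of kernel text, so shift=2 means
    one counter per 4-byte PowerPC instruction.
    """
    counters = Counter()
    for ip in sampled_ips:
        if ip < text_start:
            continue  # ignore samples below the start of the profiled text
        counters[(ip - text_start) >> shift] += 1
    return counters

# Hypothetical instruction pointers seen on successive timer ticks.
TEXT_START = 0xC0000000
samples = [0xC0000100, 0xC0000104, 0xC0000104, 0xC0000210]

hist = sample_histogram(samples, TEXT_START, shift=2)
for bucket, hits in sorted(hist.items()):
    # Convert the bucket index back to the address it starts at.
    print(hex(TEXT_START + (bucket << 2)), hits)
```

With a larger =shift=, neighbouring instructions share one counter, which is why coarser settings need less memory but blur the result.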
The data is processed with =readprofile=, which looks up =System.map= to show which kernel functions are using the most CPU time. It doesn't work for modules yet, so at present you need to compile them into the kernel for profiling.

You need to enable this at boot time by passing =profile<nop>=2= on the kernel command line. The number gives the power-of-2 granularity used for the counters: 2 gives you a separate counter for each PowerPC instruction (each 4 bytes). Higher numbers consume less memory and give less precise results.

The data from =/proc/profile= is in target byte order, so if you're cross-developing you may need to either byte-swap it or compile =readprofile= to run on your target.

The PowerPC branch of the Linux kernel has been slow to implement the Instruction Pointer sampling function necessary to generate the =/proc/profile= data. If it isn't implemented in your kernel, =readprofile= will always show zero time for every kernel function.

%SECTION2{name=LinuxTraceToolkit}% Linux Trace Toolkit

   * [[http://www.opersys.com/LTT]]

The Linux Trace Toolkit works with an instrumented Linux kernel by saving time-stamped records of important kernel events to a binary data file. A data decoder converts the binary data to text and calculates statistical summaries, such as percent processor utilization by each process. The toolkit also includes an integrated environment that graphically displays the results and provides search capability.

A version for embedded PowerPC targets is now available from [[ftp://ftp.mvista.com/pub/LTT]]

%SECTION2{name=gprof}% gprof

All the usual Linux user mode profiling tools like =gprof= are available.

%SECTION2{name=kernprof}% kernprof

   * [[http://oss.sgi.com/projects/kernprof]]

This project aims to make full =gprof= profiling available for the kernel. However, it hasn't been ported to the PowerPC architecture yet.
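Returning to the byte-order and granularity notes for =/proc/profile=, a host-side sketch may help. It assumes the dump is a flat array of 32-bit unsigned counters written in big-endian (target) order, which you should check against your own kernel source. The function names and the 1 MiB text size are made up for illustration.

```python
import struct

def decode_counters(raw, byteorder=">"):
    """Unpack a raw profile dump into a tuple of 32-bit counters.

    byteorder is a struct prefix: ">" for big-endian, "<" for little-endian.
    Any trailing bytes that do not fill a whole counter are ignored.
    """
    count = len(raw) // 4
    return struct.unpack(f"{byteorder}{count}I", raw[:count * 4])

def counters_needed(text_bytes, shift):
    """Number of counters needed to cover text_bytes at 2**shift bytes each."""
    return (text_bytes + (1 << shift) - 1) >> shift

# Three counters as a big-endian target would write them.
raw = bytes.fromhex("00000004" "00000000" "0000002a")
print(decode_counters(raw))          # (4, 0, 42)

# Counters required for a hypothetical 1 MiB of kernel text.
print(counters_needed(1 << 20, 2))   # 262144
print(counters_needed(1 << 20, 4))   # 65536
```

Reading the same bytes with the wrong byte order turns a count of 4 into 67108864, which is a quick way to recognise a dump that still needs swapping.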
%SECTION1{name=IDMA}% IDMA

Beware that [[IDMA]] on the 860 is not designed for high performance, and the [[PowerPCCPU][CPU]] gets better throughput with explicit cache-bursted programmed I/O. Search the linuxppc-embedded list archive for [[http://lists.linuxppc.org/cgi-bin/wilma/wilma_glimpse/linuxppc-embedded?query=IDMA][IDMA]] for more discussion.

Confusion sometimes arises because [[DMA]] transfers in most systems are faster than [[PowerPCCPU][CPU]] transfers, whereas here the reverse is generally true. Furthermore, [[IDMA]] transfers eat into [[CPM]] processing time, limiting throughput on other communications modules at the same time.

%SECTION1{name=Network}% Network

To get good TCP/IP performance, you need a fast [[PowerPCCPU][CPU]]. Using the [[FEC]], a 50 MHz 860P will run about 30 Mbits/sec TCP/IP, and a 100 MHz 860P will run about 60 Mbits/sec TCP/IP. The bottleneck is the protocol and application processing in the PPC core, so the performance of a TCP/IP connection scales nearly linearly with the processor speed. If you need to go faster, use the 8260.

%SECTION1{name=Optimization}% Optimization

Optimizing everything for space using gcc's =-Os= option is likely to provide both the smallest code size and the best performance, because it inhibits loop unrolling, which tends to have a negative effect on embedded processors with relatively small caches. Furthermore, PowerPC processors can speculatively execute branches overlapped with other loop instructions, making the branch effectively execute in zero cycles, so loop unrolling is unnecessary in many circumstances.
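As a rough worked example of the Network figures quoted earlier on this page: the two data points (50 MHz giving about 30 Mbits/sec, 100 MHz giving about 60 Mbits/sec) correspond to roughly 0.6 Mbits/sec per MHz. The sketch below simply extrapolates that ratio. The function name and the 66 and 80 MHz inputs are made up for illustration, and the result is only a ballpark for an 860P using the FEC, not a measurement.

```python
# Data points quoted in the Network section: (core MHz, TCP/IP Mbits/sec).
QUOTED = [(50, 30), (100, 60)]

def estimate_tcp_mbits(core_mhz):
    """Linear estimate of TCP/IP throughput from the quoted figures."""
    ratios = [mbits / mhz for mhz, mbits in QUOTED]
    return core_mhz * sum(ratios) / len(ratios)

print(estimate_tcp_mbits(66))  # roughly 39.6
print(estimate_tcp_mbits(80))  # roughly 48
```

Because scaling is only "nearly" linear, treat the estimate as an upper-end guess and measure on real hardware before relying on it.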