
Counting CPU Clock Cycles

Can anyone recommend a reliable way to count CPU cycles? Should we be using the facilities in std::chrono instead of the TSC register?

I've read about the TSC register, which can be read using the RDTSC instruction. However, I see on Wikipedia and Stack Overflow that RDTSC has a lot of pitfalls on modern CPUs. For example, this quote:

"The Time Stamp Counter was once an excellent high-resolution, low-overhead way for a program to get CPU timing information. With the advent of multi-core/hyper-threaded CPUs, systems with multiple CPUs, and hibernating operating systems, the TSC cannot be relied upon to provide accurate results"
 
Bumping this thread in hopes of getting more visibility for a response.
 
I don't do much of this in general, apart from coarse-grained parallelism ballpark stuff ...

What about Boost? Just a guess.

Plan B ... a commercial product, and those don't grow on trees.
 
This could potentially work, thanks for the link. I’ll dive more into it to make sure it’s useful for my requirements.
 
Any chance you're doing the build on Linux? If you are, you can use perf.

You can install it with the following command:

sudo apt-get install linux-tools-common linux-tools-generic

Say your executable can be run by doing:

./mike_program

You can then run:

perf stat ./mike_program

and this will give you something like the below:

$ perf stat ./mike_program

 Performance counter stats for './mike_program':

         5,834,104      cycles                #    0.733 GHz
         2,973,423      instructions          #    0.51  insn per cycle
           634,204      cache-references
            14,305      cache-misses          #    2.255 % of all cache refs
            10,502      branch-misses

       0.007991768 seconds time elapsed

       0.006552000 seconds user
       0.001444000 seconds sys

If you want just the cycles and nothing else, you can do:

perf stat -e cycles ./mike_program

A quick rundown of what the above output means:


  • Cycles: The number of CPU cycles consumed by mike_program. This is a direct measure of CPU work.
  • Instructions: The number of instructions executed.
  • Insn per cycle: The average number of instructions executed per cycle (IPC), an indicator of how efficiently the CPU is being used.
  • Cache-references: The number of times the CPU accessed the cache.
  • Cache-misses: The number of cache accesses that were not satisfied by the cache.
  • Branch-misses: The number of times the CPU mispredicted a branch.
The final part of the output gives the elapsed (wall-clock) time, user time, and system time for the command.
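
One more tip: cycle counts can jitter between runs (cache state, frequency scaling, scheduler noise), so it often helps to average over repeated runs. perf stat supports this with the -r/--repeat option, which runs the command N times and reports means with standard deviations:

perf stat -r 10 -e cycles ./mike_program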

Hope it is useful! Our systems programmers and assembly champs at work love using perf.
 
Are you using fine-grained or coarse-grained parallelism?
If anything, coarse-grained, though I'm not even sure what I'm doing would count as that.

Effectively, I am designing a system around what I know as an "Engine Model", where the entire system is parallelized: each thread (aka engine) is responsible for processing events within a given symbol/instrument range.

So the system receives a book update from a venue and Engine 1 (aka thread 1) processes it because it was for AAPL. It receives another book update and Engine 2 (aka thread 2) processes it because it was for MSFT.
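
Roughly, the routing looks like the sketch below. This is a simplified, hypothetical version (a plain mutex-and-condition-variable queue stands in for whatever the real inter-thread channel would be, and there's no clean shutdown), just to show the shape of it:

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// A book update from a venue (simplified).
struct BookUpdate {
    std::string symbol;
    double bid;
    double ask;
};

// One engine = one long-lived thread draining its own event queue.
class Engine {
public:
    void start() {
        // Detached for brevity; a real system would join on shutdown.
        std::thread([this] { run(); }).detach();
    }

    void post(BookUpdate u) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(u));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            BookUpdate u = std::move(q_.front());
            q_.pop();
            lk.unlock();
            // ... process the update for this engine's symbol range ...
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<BookUpdate> q_;
};

// Route each update to a fixed engine by hashing its symbol, so every
// AAPL update lands on the same thread, every MSFT update on another.
void dispatch(std::vector<Engine>& engines, BookUpdate u) {
    std::size_t idx = std::hash<std::string>{}(u.symbol) % engines.size();
    engines[idx].post(std::move(u));
}

int main() {
    std::vector<Engine> engines(4);
    for (auto& e : engines) e.start();

    dispatch(engines, {"AAPL", 189.10, 189.12});
    dispatch(engines, {"MSFT", 402.50, 402.55});

    // Give the detached workers a moment before the process exits.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}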
 
Is this a compute-intensive parallel program or some kind of async event-flow app, like a data-feed pipeline?
So, what is the _problem_?

edit: is throughput more important? Thread start/stop is costly.
 