I save my employer millions using techniques I learned in the QuantNet Advanced C++ course. AMA

I recently completed the QuantNet Advanced C++ course and shared my testimonial here. Several members have expressed interest in how I apply specific modern techniques in my job. I hope this will spur more interest in the course and show some things you can do to reduce latency and running costs, which at scale turn into huge savings.

I work for a large advertising business on the auction side. We receive a stream of bids from advertisers and have to show ads that help advertisers and creators while not being overly annoying to users.

The business operates at giant scale, with hundreds of billions of ad queries daily. The latency requirement is not quite as crazy as HFT, but ads still have to be served within 1 second, and within that 1 second a lot of memory, CPU, and GPU are used. Because of that scale, every tiny saving of just milliseconds gets multiplied by hundreds of billions of queries.

We have a system that calculates several statistics for each ad candidate based on the history of ads the user has seen in the past (to avoid repeating ads). A lot of these counts are written as plain for loops that were sufficient when the system did not include heavy ML models but are now a significant latency hog. I transformed the for loops into ranges-based code and devised a thread-based scheme to calculate these stats in parallel (the stats are independent across candidates). Doing so reduced latency, which saved several SWE-years for the company and freed up milliseconds for another, more resource-intensive project to launch.
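To give a flavor of the shape of the change (this is an illustrative sketch with made-up names like AdCandidate and repeat_count, not our production code):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical candidate type: each candidate carries the ad IDs from the
// user's history that are relevant to it.
struct AdCandidate {
    int ad_id{};
    std::vector<int> seen_ad_ids;
};

// One independent statistic per candidate, ranges-based instead of a raw loop:
// how often this ad already appeared in the user's history.
long repeat_count(const AdCandidate& c) {
    return std::ranges::count(c.seen_ad_ids, c.ad_id);
}

// Because the stats are independent across candidates, chunks of candidates
// can be processed on separate threads and stitched together at the end.
std::vector<long> repeat_counts_parallel(const std::vector<AdCandidate>& cands,
                                         std::size_t num_tasks = 4) {
    if (num_tasks == 0) num_tasks = 1;
    std::vector<long> out(cands.size());
    std::vector<std::future<void>> tasks;
    const std::size_t chunk = (cands.size() + num_tasks - 1) / num_tasks;

    for (std::size_t begin = 0; begin < cands.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, cands.size());
        tasks.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = repeat_count(cands[i]);   // each slot written by one thread
        }));
    }
    for (auto& t : tasks) t.get();   // wait for all workers
    return out;
}
```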

There are a lot of smaller savings as well. One time, I literally just changed a struct copy to a reference; because that struct is so big and so widely used, that alone was a huge saving. Another time I changed a map key from a concatenated string to a tuple. The savings are easy to monitor because the company does an excellent job tracking resource usage, down to every individual function call. Now that I know all these excellent C++ features, I expect many more savings in the future.
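As illustrations only (AdProfile, score, and bid_cache are invented names; the real code is different), those two changes look like this:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// A hypothetical "big and widely used" struct.
struct AdProfile {
    std::string advertiser;
    std::vector<double> features;   // large payload
    // ... many more fields in reality
};

// Before: double score(AdProfile profile);   // copies the whole struct
// After:  borrow it instead, same call sites, no copy.
double score(const AdProfile& profile) {
    return profile.features.empty() ? 0.0 : profile.features.front();
}

// Before: the cache key was built by string concatenation,
//         e.g. campaign_id + "|" + locale + "|" + slot.
// After: a tuple key avoids building a temporary string on every lookup.
// std::map works out of the box because std::tuple has operator<;
// std::unordered_map would additionally need a custom hash.
using CacheKey = std::tuple<std::int64_t /*campaign*/,
                            std::string  /*locale*/,
                            int          /*slot*/>;

std::map<CacheKey, double> bid_cache;

void record_bid() {
    bid_cache[{1234, "en_US", 2}] = 0.42;   // no concatenation needed
}
```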

I hope the examples above illustrate the many ways one can leverage modern C++ techniques to improve legacy code in significant ways. Even outside of quant, learning C++ optimization techniques is still extremely useful, especially for large-scale systems that perform a lot of small tasks.

I'm happy to answer any questions you have about these features or other parts of my work.
 
Thank you for sharing your story @sonmanutd
It's great to learn how you are able to make a real impact in your work from what you learned in our courses. From reading this, it looks like even a small change to make the code more efficient can make a big difference the larger the code base gets.
And it also seems like you have only touched the tip of the iceberg.
Would love to hear more as you go through the system and find ways to optimize things.
 
Thank you for sharing your story @sonmanutd
It's great to learn how you are able to make a real impact in your work from what you learned in our courses. From reading this, it looks like even a small change to make the code more efficient can make a big difference the larger the code base gets.
And it also seems like you have only touched the tip of the iceberg.
Would love to hear more as you go through the system and find ways to optimize things.

These big advertising platforms tend to have great tools for tracking the efficiency of each piece of code. There are platforms showing how much a function contributes to latency/resource usage, so it is really easy to see which piece of code is inefficient. Opportunities for gains basically boil down to:

1. How familiar you are with the code base
2. How well tested the code you plan to change is
3. Whether the change would create any inconvenience for ongoing development

I am lucky that in my case I own the piece of code, so it is easier to institute change. As for number 2, if the code has a lot of unit tests, the owners of the code will be more comfortable allowing the change. Also, if a feature is still under development, we usually don't optimize the code right away, so the project owner can develop quickly.

There are a lot of things to consider, but I have often found my changes accepted in well-tested code that has been running in the system for a long time but is not yet up to date with the newest techniques. I made a lot of small gains just by converting inefficient loops into ranges.
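For example (made-up numbers and names, but representative of the kind of conversion), a helper that used to build a temporary vector can become a lazy pipeline:

```cpp
#include <ranges>
#include <vector>

// Before: copy the valid bids into a temporary vector, then sum them.
double sum_valid_old(const std::vector<double>& bids) {
    std::vector<double> valid;
    for (double b : bids)
        if (b > 0.0) valid.push_back(b);   // extra allocation and copies
    double total = 0.0;
    for (double b : valid) total += b;
    return total;
}

// After: a lazy ranges view, one pass, no temporary container.
double sum_valid(const std::vector<double>& bids) {
    double total = 0.0;
    for (double b : bids | std::views::filter([](double x) { return x > 0.0; }))
        total += b;
    return total;
}
```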
 
I need to figure out what the ranges-based approach is, because I was asked an array sorting question for a technical screening, and even though my multiple for loops would have worked for all cases, they wouldn't run in time for three of them (under two seconds).

This gives a great practical reason to move into the Advanced course when I'm able/finished with the first, thanks.

As for a question, what was your:
favorite/least favorite sections in the advanced course
hardest/easiest
most useful/most niche
 
I need to figure out what the ranges-based approach is, because I was asked an array sorting question for a technical screening, and even though my multiple for loops would have worked for all cases, they wouldn't run in time for three of them (under two seconds).
What was the technical interview question that you had? I don't think ranges actually help with time complexity; they just take advantage of certain aspects of the loop to make the code slightly more optimized. For array sorting, the issue is probably that your algorithm is not optimal. I am really curious about your screening question.

favorite/least favorite sections in the advanced course
I think Level 3 on Threads and Processes is great. It was especially helpful and powered much of the improvement I did. The template part is great and comprehensive too. Regex is the part I found quite tedious and too technical, in my opinion.

hardest/easiest
Exercise 7 of Module 7 was quite challenging and took me a bit of time to complete (though I liked the challenge).
No part felt really easy, in my opinion.

most useful/most niche
From the context of working in a big corporation, I would definitely say ranges and threads. There is so much room to make improvements, both big and small, using this knowledge. The most niche is probably concepts. It is quite interesting, and I have yet to grasp its full benefits. But the ability to classify types by what they can do and, I think, have the right method chosen at compile time thanks to templates (@Daniel Duffy, please correct me on this) means you save instructions and jumps, which adds up to a lot of time savings at scale.
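As a toy illustration of what I mean (invented names, not work code): a concept classifies candidate types by what they can do, and the compiler picks the right overload with no virtual call or runtime branch.

```cpp
#include <concepts>
#include <numeric>
#include <vector>

// A concept: does this candidate type already carry a precomputed score?
template <typename T>
concept HasCachedScore = requires(const T& t) {
    { t.cached_score() } -> std::convertible_to<double>;
};

struct SimpleCandidate {
    std::vector<double> signals;
};

struct CachedCandidate {
    double score = 0.0;
    double cached_score() const { return score; }
};

// Fallback: compute the score from raw signals.
template <typename C>
double score(const C& c) {
    return std::accumulate(c.signals.begin(), c.signals.end(), 0.0);
}

// More constrained overload: chosen at compile time whenever the concept
// holds, so the cached value is used with no recomputation.
template <HasCachedScore C>
double score(const C& c) {
    return c.cached_score();
}

int main() {
    SimpleCandidate a{{0.1, 0.2}};
    CachedCandidate b{0.3};
    double x = score(a);   // picks the accumulate version
    double y = score(b);   // picks the cached version
    (void)x; (void)y;
    return 0;
}
```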
 
What was the technical interview question that you had? I don't think ranges actually help with time complexity; they just take advantage of certain aspects of the loop to make the code slightly more optimized. For array sorting, the issue is probably that your algorithm is not optimal. I am really curious about your screening question.
I was given an array of positive integers and had to count the number of pairs. So less sorting than matching.
I believe my method was to take each element of the array and then loop through the rest of the array starting from that element (testing each new element with a boolean). If I found a match, I assigned the matched element the value -1, since I knew that wouldn't match anything else and mess me up later. Whether or not a pair was found, each new element was checked for a positive value and otherwise skipped. Pretty straightforward.

I did something like that anyway. The speed improvement I can think of right now would be to remove the element that paired with the current one as soon as I find it, so that I'm not looping through, checking, and then moving past a bunch of -1's. But I didn't think of a way to do that during the screening. Afterward I saw that you can just shift each subsequent element of the array back one position from that point on, but I would think that would take more time than my method anyway. It was a plain array, not an STL container, so I don't know if there is a function like remove() out there that I could have used.
 
I was given an array of positive integers and had to count the number of pairs. So less sorting than matching.
I believe my method was to take each element of the array and then loop through the rest of the array starting from that element (testing each new element with a boolean). If I found a match, I assigned the matched element the value -1, since I knew that wouldn't match anything else and mess me up later. Whether or not a pair was found, each new element was checked for a positive value and otherwise skipped. Pretty straightforward.

I did something like that anyway. The speed improvement I can think of right now would be to remove the element that paired with the current one as soon as I find it, so that I'm not looping through, checking, and then moving past a bunch of -1's. But I didn't think of a way to do that during the screening. Afterward I saw that you can just shift each subsequent element of the array back one position from that point on, but I would think that would take more time than my method anyway. It was a plain array, not an STL container, so I don't know if there is a function like remove() out there that I could have used.
Wouldn't that be O(N^2)?

You can count the occurrences of each number first and put them into a hash map.
For example {1: 5, 2: 6, 3: 8, 4: 1}, in which the key is the number and the value is the count.

Building this hash map is O(N).
Then for each number, you just divide its count by 2 to get the number of pairs.
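In code, a minimal version of this idea (assuming std::unordered_map for the counts) is:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Count the pairs of equal values in the array: O(N) on average.
std::size_t count_pairs(const std::vector<int>& a) {
    std::unordered_map<int, std::size_t> freq;   // value -> number of occurrences
    for (int x : a) ++freq[x];                   // one pass to build the counts

    std::size_t pairs = 0;
    for (const auto& entry : freq)
        pairs += entry.second / 2;               // every 2 occurrences form a pair
    return pairs;
}
```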

Let's bring this convo to DM :D
 
Wouldn't that be O(N^2)?

You can count the occurrences of each number first and put them into a hash map.
For example {1: 5, 2: 6, 3: 8, 4: 1}, in which the key is the number and the value is the count.

Building this hash map is O(N).
Then for each number, you just divide its count by 2 to get the number of pairs.

Let's bring this convo to DM :D
Average or worst-case O(..)?

std::unordered_map is an associative container that contains key-value pairs with unique keys. Search, insertion, and removal of elements have average constant-time complexity.

Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its key. Keys with the same hash code appear in the same bucket. This allows fast access to individual elements, since once the hash is computed, it refers to the exact bucket the element is placed into.
 
Average or worst-case O(..)?

std::unordered_map is an associative container that contains key-value pairs with unique keys. Search, insertion, and removal of elements have average constant-time complexity.

Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its key. Keys with the same hash code appear in the same bucket. This allows fast access to individual elements, since once the hash is computed, it refers to the exact bucket the element is placed into.
Average O(N).

Worst case is O(N^2), but that only happens if every element inserted hashes to the same bucket, meaning each insert has to iterate the bucket. That is really rare in practice.
 
Average O(N).

Worst case is O(N^2), but that only happens if every element inserted hashes to the same bucket, meaning each insert has to iterate the bucket. That is really rare in practice.
??
That contradicts the official documentation.
 

The official documentation doesn't say anything about worst-case complexity.

I searched several sources and they all say the worst-case insertion time is O(N). You have to insert once for each of the N elements of the array, which would make the whole thing O(N^2) in the worst case.
 
An unordered_map is a data structure that stores data in the form of key-value pairs. The best case and the average case complexity for all the operations in an unordered_map is O(1). While in the worst case, the time complexity for all the operations in an unordered_map is O(n).
 
An unordered_map is a data structure that stores data in the form of key-value pairs. The best case and the average case complexity for all the operations in an unordered_map is O(1). While in the worst case, the time complexity for all the operations in an unordered_map is O(n).
I think we are saying the same thing. I was referring to O(N^2) as the worst-case complexity of the whole pair-counting algorithm, not just a single insertion into the unordered_map (you need a loop to insert all the elements into the unordered_map too).

I agree that on average a single insertion is O(1), but because we insert N times, the whole problem is O(N) on average.
Worst case, a single insertion is O(N); since we insert N times, the whole problem is O(N^2) in that (unlikely) worst case.

I guess we are turning this thread into leet code lol.
 
Are there other new C++2X features that you are interested in implementing at work?
I already mentioned concepts and ranges. There is still much to optimize with those features.

For other features like the spaceship operator, map::contains, string prefix/suffix checks, or the calendar library, I have found the benefits very incremental. Even when a new feature is genuinely better, there are often just too many instances of the old code, and the owner is unwilling to make a single exception if it hurts the clarity of the code or creates different-looking code that does the same thing. For example, if string prefixes are checked a certain way in the code base, the owner would rather either keep it the old way or replace every instance with the new method for the sake of "consistency". In terms of clarity, I have found that people are really receptive to std::accumulate but a bit afraid of std::transform_reduce. It is a constant battle to decide between better code, impact, and suitability for the organization.
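To show the clarity trade-off concretely (illustrative only, with a made-up Bid struct), here is the same sum written both ways:

```cpp
#include <functional>
#include <numeric>
#include <vector>

// A made-up bid record, just for illustration.
struct Bid { double price = 0.0; };

// Familiar and widely accepted: std::accumulate with a lambda.
double total_price(const std::vector<Bid>& bids) {
    return std::accumulate(bids.begin(), bids.end(), 0.0,
                           [](double acc, const Bid& b) { return acc + b.price; });
}

// Same intent with std::transform_reduce: a transform step (extract the price)
// followed by a reduce step (sum). It can also take an execution policy, but
// teammates may find it less familiar.
double total_price_tr(const std::vector<Bid>& bids) {
    return std::transform_reduce(bids.begin(), bids.end(), 0.0,
                                 std::plus<>{},
                                 [](const Bid& b) { return b.price; });
}
```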

Finally, after this round of optimization, I plan to focus more on parallelizing offline pipelines that process logs. Because the logs are so big, there are pipelines that run every day and cost a lot of resources. These have looser latency requirements but higher memory requirements. Threads will be really useful here.
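The rough shape I have in mind (a sketch with invented names, not an actual pipeline) is one worker per log shard with a cheap merge at the end:

```cpp
#include <cstddef>
#include <numeric>
#include <string>
#include <thread>
#include <vector>

// A shard of one day's logs (invented representation).
using LogShard = std::vector<std::string>;

// Count lines mentioning a given ad across many shards: one thread per shard,
// each writing to its own slot, then a cheap final merge.
std::size_t count_mentions(const std::vector<LogShard>& shards,
                           const std::string& ad_id) {
    std::vector<std::size_t> partial(shards.size(), 0);
    {
        std::vector<std::jthread> workers;
        for (std::size_t i = 0; i < shards.size(); ++i) {
            workers.emplace_back([&, i] {
                for (const std::string& line : shards[i])
                    if (line.find(ad_id) != std::string::npos)
                        ++partial[i];              // no sharing, no locks
            });
        }
    }   // std::jthread joins automatically when it goes out of scope
    return std::accumulate(partial.begin(), partial.end(), std::size_t{0});
}
```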
 
Finally, after this round of optimization, I plan to focus more on parallelizing offline pipelines that process logs. Because the logs are so big, there are pipelines that run every day and cost a lot of resources. These have looser latency requirements but higher memory requirements. Threads will be really useful here.
One funny story: someone forgot to turn off a pipeline that was no longer useful. It kept running and costing resources. When someone found out, we turned off the gas stove and saved resources equal to a few percent of our annual savings goal lol.
 
How did you go about calculating all these millions saved and developer years saved? I’d be curious to learn how to quantify this better for my own purposes as it seems like a valuable career thing to be able to measure concretely.
The company has excellent infrastructure for measuring the gains. Every binary job keeps a good record of CPU, RAM, GPU, and TPU usage, from when to when, down to the nanosecond. Based on this massive amount of data, an entire team is dedicated to data analysis, understanding the types of jobs that tend to perform poorly or hog resources. They then issue directives and policies on logging and resource costs and provide tools to help other engineers measure their gains. One of these guidelines is how to convert CPU, RAM, GPU, and TPU usage into a single unit: SWE-years.

In my own case, because parallelizing the ad candidate processing makes a query run some tens of milliseconds faster, the load balancer was able to push more queries onto each server as throughput increased. Therefore, fewer servers were needed for specific locales around the world. We added up all the reductions in server, RAM, and TPU usage, and that amounted to dozens of SWE-years, hence millions.

Even the compiler was modified for some in-house use cases. The binary also has something like a DEBUG mode on steroids, where it stores the clock time used by every single function. A small percentage of real-world queries are duplicated and rerouted to these binaries for the sole purpose of investigating resource hogging, which makes it easy to see which functions are costing a significant amount of resources. For smaller individual projects, something similar, where you simply record each function's total time, could be pretty useful.
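For a smaller project, even a minimal RAII timer (nothing like the in-house tooling, just the idea; this one records wall-clock time of a scope) gets you surprisingly far:

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// An RAII timer: records how long a scope took and reports it on exit.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start_).count();
        std::printf("%s took %lld us\n", name_.c_str(),
                    static_cast<long long>(us));
    }
private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

double expensive_stat() {
    ScopedTimer t("expensive_stat");   // reported when the function returns
    double acc = 0.0;
    for (int i = 0; i < 1'000'000; ++i) acc += i * 1e-6;
    return acc;
}
```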
 
I need to figure out what the ranges-based approach is, because I was asked an array sorting question for a technical screening, and even though my multiple for loops would have worked for all cases, they wouldn't run in time for three of them (under two seconds).

This gives a great practical reason to move into the Advanced course when I'm able/finished with the first, thanks.

As for a question, what was your:
favorite/least favorite sections in the advanced course
hardest/easiest
most useful/most niche
Learning algorithm design techniques like dynamic programming, max flow/min cut, greedy algorithms, divide and conquer, etc. is really helpful for finding optimized solutions.
 
Learning algorithm design techniques like dynamic programming, max flow/min cut, greedy algorithms, divide and conquer, etc. is really helpful for finding optimized solutions.

Yes. This is why, to this day, I still encourage people to practice often on LeetCode.

A lot of people laugh at LeetCode for asking overly specific algorithm questions that they never get to use in their real jobs. That being said, I still think practicing LeetCode (up to a certain point) is a great way to understand algorithm complexity and get regular practice with it. That really helps in identifying these inefficiencies and contributing positively to whatever company you work for.
 