Performance and Scalability in Software Engineering
What’s the difference?
“Performance is an indication of the responsiveness of a system to execute any action within a given time interval, while scalability is the ability of a system either to handle increases in load without impact on performance or for the available resources to be readily increased. Cloud applications typically encounter variable workloads and peaks in activity. Predicting these, especially in a multi-tenant scenario, is almost impossible. Instead, applications should be able to scale out within limits to meet peaks in demand, and scale in when demand decreases. Scalability concerns not just compute instances, but other elements such as data storage, messaging infrastructure, and more.” -Microsoft
Performance
What are some things to look for?
No one can create a blanket rule for all software that says, “if x process takes longer than y seconds, it’s too slow.” Instead, think about what type of tool is being built and who the intended audience is. I like to think about how I feel when using a website, watching TV, waiting at a traffic light, or any other activity where I am forced to wait.
How long does it take before becoming distracted? Depending on the application, that distraction point may be a good place to start when considering an entire workflow, like navigating from one page to the next. Then, work backwards to individual components that enable that workflow to happen and decide how each should perform.
Use logging systems like Application Insights, Google Analytics, and others to give an overall picture. These tools can help identify bottlenecks in your software.
I personally like to identify at least one performance concern per sprint to see if it can be improved in a meaningful way. Depending on the concern being addressed, a gain of only a few milliseconds per operation can add up to a significant impact over the course of a day, week, or month.
I have a performance problem now and can’t wait for a software solution.
This unfortunately happens a lot. A few questions I like to ask myself are:
· Is the problem transient, and will it solve itself in short order?
High network volume or other network issues may be causing a temporary bottleneck.
A hard disk failure resulting in an array being rebuilt, i.e., temporarily high disk I/O.
· Can I throw hardware at the problem as a short-term solution?
If you are in the cloud or a virtualized environment, it may be relatively simple to increase the scale of a resource for a few days to buy the engineering team the time they need.
· Can I disable the feature?
It’s becoming more common for features to be wrapped in on/off switches. If this is the case, can you disable the feature temporarily? Often it may be better to not have a feature than to have a poorly behaving one.
· Is a rollback to a previous version an option?
This of course assumes that the problem occurred because of a new software release.
· Is the problem environmental?
Is the system properly scaled?
Was there a security update that didn’t install correctly?
Was there a 3rd party software update?
Did a system get an OS update, become virtualized, move to the cloud, etc.?
· Was there a new marketing campaign and a previously under-utilized feature is now gaining prominence?
Obviously, this is not an exhaustive list, but it illustrates that there are a lot of questions to ask. Is this a technical problem? Will it resolve itself? Is there a temporary solution? Does it need a longer-term engineering solution? And what is the impact to my reputation worth?
Scaling
My application performs terribly in production, but when the engineering team tests it, it meets or exceeds expectations.
This is a very common scenario and a good lead-in to scalability. Another way of saying it is, “I can’t reproduce these problems in my test environment because I can’t create enough activity.”
This is where good logging is critical, and where testing has room for improvement. That is not to say your test cases are bad, but that more focus should be placed on these non-functional test cases.
There are a lot of things to look for, so, as with the performance discussion above, what follows is a set of considerations, though perhaps not a comprehensive list.
In this situation, I like to work from the bottom up. The lower in the stack a fix can be made, the more impactful it tends to be.
Some of these issues will look like performance fixes, and they may improve performance. The most important thing is to keep a careful eye on object scoping and on reducing the amount of time a process spends waiting, or, more importantly, on making good use of your available threads. That is what ultimately improves the number of requests the application can service without degrading performance.
· Database
Check for blocking transactions. Large tables with a lot of activity tend to get blocked. Use the slimmest lock you can, and keep in mind how the database server organizes data: page locks sound better, but might not be if the most active data is all stored in the same page. CONSULT YOUR DBA.
Sometimes a dirty read is acceptable. If so, query hints like NOLOCK or READUNCOMMITTED tell SQL Server to bypass a lot of internal checks and just give you what’s in the table right now. This can reduce the impact any transactions or other blocks have on your ability to read from the database.
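If you go this route, a minimal sketch of the hint from C# with ADO.NET might look like the following; the table and column names (dbo.Orders, Status) are hypothetical:

```csharp
// Hedged sketch: a dirty read with WITH (NOLOCK) via ADO.NET.
// The table and column (dbo.Orders, Status) are hypothetical.
using Microsoft.Data.SqlClient;

var connectionString = "...";  // your connection string

// WITH (NOLOCK) behaves like READ UNCOMMITTED for this statement:
// the query will not take or wait on shared locks, so it can return
// uncommitted rows. Only use it where stale data is acceptable.
const string sql =
    "SELECT COUNT(*) FROM dbo.Orders WITH (NOLOCK) WHERE Status = @status;";

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

await using var command = new SqlCommand(sql, connection);
command.Parameters.AddWithValue("@status", "Pending");

var pendingCount = (int)(await command.ExecuteScalarAsync() ?? 0);
Console.WriteLine($"Pending orders (possibly dirty): {pendingCount}");
```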
What technology is your application using to interact with the database?
ORM tools are great, but may not give you the level of control you need for tuning your database.
A great example of this is Entity Framework. As of this writing, if you insert a lot of data, EF will issue one INSERT per row, which is very inefficient against a large or busy table.
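If per-row inserts become the bottleneck, one workaround is to bypass the ORM for that hot path. Below is a hedged sketch using SqlBulkCopy against SQL Server; the destination table (dbo.Items) and row shape are made up:

```csharp
// Hedged sketch: bypassing per-row ORM inserts with SqlBulkCopy.
// The destination table (dbo.Items) and its shape are hypothetical.
using System.Data;
using Microsoft.Data.SqlClient;

var connectionString = "...";  // your connection string

// Stage the rows in memory first.
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Name", typeof(string));
for (var i = 0; i < 100_000; i++)
    table.Rows.Add(i, $"Item {i}");

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.Items",
    BatchSize = 5_000  // one round trip per 5,000 rows, not per row
};
await bulkCopy.WriteToServerAsync(table);
```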
Server configuration
If you are managing the database server yourself, paying attention to best practices for configuration can make a big difference. Again, consult with your DBA.
· Coding Practices (Did I do that?)
If you are working with C#, or really any development language that exposes something like the async/await pattern, a rule I like to follow is “if you need to ‘await’, it shouldn’t be in a loop.”
The reasoning is that async/await generally exists because you are trying to access some external resource that can block. “Await” signals the application to wait for the response before moving on, and if you’re in a loop, how many times will you be waiting? I can’t tell you how many times this loop is small in my test environment and large to monumental in production.
So, what to do? Go to the external resource before your loop, get all the data you need ahead of time, and put it into a collection of some type. (I’ll talk about collections next; they are a pet peeve of mine.) One trip to get 100 records may be 500ms, but 100 individual trips add overhead - opening, closing, disposing, async/await scaffolding, TCP/IP, etc. - to each call. Pay that penalty as few times as possible. The sketch below illustrates the difference.
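Here is a minimal, self-contained sketch of the rule; SimulatedRepository and its ~20ms delay are stand-ins for a real database or web service:

```csharp
// Minimal sketch of "no await in a loop". SimulatedRepository is a
// made-up stand-in for a real database or web service; each call
// pays a fixed ~20ms of simulated network latency.
using System.Diagnostics;

var repository = new SimulatedRepository();
var ids = Enumerable.Range(1, 100).ToList();

// Anti-pattern: one awaited round trip per id (100 waits).
var sw = Stopwatch.StartNew();
var slow = new List<string>();
foreach (var id in ids)
    slow.Add(await repository.GetRecordAsync(id));
Console.WriteLine($"Per-item: {sw.ElapsedMilliseconds}ms");

// Better: one round trip for the whole set, then work in memory.
sw.Restart();
var fast = await repository.GetRecordsAsync(ids);
Console.WriteLine($"Batched:  {sw.ElapsedMilliseconds}ms");

class SimulatedRepository
{
    public async Task<string> GetRecordAsync(int id)
    {
        await Task.Delay(20);  // simulated per-call overhead
        return $"record-{id}";
    }

    public async Task<List<string>> GetRecordsAsync(IEnumerable<int> ids)
    {
        await Task.Delay(20);  // the overhead is paid once
        return ids.Select(id => $"record-{id}").ToList();
    }
}
```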
You mentioned collections?
Yep, collections are awesome if you use them correctly. .NET exposes more collections than you can shake the proverbial stick at, and each is tuned for a specific purpose.
List<T> - A simple, insertion-ordered list. Great for adding things, or just looping through the whole list, but searching for something means scanning every item… not so much.
ISet<T>, or HashSet<T> - Great for reading and searching, since a lookup is a single hash away, but every add has to hash the item, and as the collection grows the set periodically has to resize and rehash, which makes bulk population slower than appending to a List<T>.
Dictionary<TKey, TValue> - Adding and reading are quick, but watch out for restrictions like unique keys, and for exceptions being thrown if you don’t check for existence correctly.
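As a small sketch of that existence check (the dictionary contents are made up):

```csharp
// Minimal sketch: safe Dictionary access.
var prices = new Dictionary<string, decimal> { ["apple"] = 1.25m };

// Indexing a missing key throws KeyNotFoundException:
// var oops = prices["banana"];

// TryGetValue avoids both the exception and the double lookup of
// ContainsKey followed by the indexer.
if (prices.TryGetValue("banana", out var price))
    Console.WriteLine($"banana costs {price}");
else
    Console.WriteLine("banana is not in the dictionary");
```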
Anecdote – I worked on a project once where performance and scale were critical. I spent days optimizing, looking for a handful of milliseconds wherever I could find them. One thing I found was that it was several times faster to create a List<T>, add all my records, and then convert that List<T> to a HashSet<T> to be read downstream than it was to either start with a HashSet<T> or leave the collection as a List<T>. The gain was something like 10x the cost of converting between the two types. So, to me, it’s always worth knowing the collection type you are using, why you’re using it, and how it will be used downstream.
Oh, one last note… watch out for LINQ. It’s incredibly powerful, but it is limited by the underlying structure you are querying. The same query, say a .Contains() or .FirstOrDefault(), can perform very differently on a List<T> than on a HashSet<T>.
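To make that concrete, here is a rough sketch, a Stopwatch timing rather than a rigorous benchmark:

```csharp
// Rough sketch: the same lookup against a List<T> and a HashSet<T>
// built from the same data.
using System.Diagnostics;

var list = Enumerable.Range(0, 100_000).ToList();
var set = list.ToHashSet();

var sw = Stopwatch.StartNew();
for (var i = 0; i < 1_000; i++)
    list.Contains(99_999);        // linear scan on every call
Console.WriteLine($"List<T>.Contains:    {sw.ElapsedMilliseconds}ms");

sw.Restart();
for (var i = 0; i < 1_000; i++)
    set.Contains(99_999);         // hash lookup on every call
Console.WriteLine($"HashSet<T>.Contains: {sw.ElapsedMilliseconds}ms");
```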
· Caching
Are you using any? If not, can you?
Be careful with the type of cache you are using. An in-memory Dictionary can be nice, but keeping it in sync across instances gets complicated when you scale out.
Distributed caching is ideal. Modern solutions like ScaleOut State Server or Redis are highly optimized and can be economical solutions.
I like to start my caching with data that doesn’t change much and then fine-tune from there based on the type of application I am working with. Countries, states, web page output, etc.
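As a starting point, here is a minimal in-process sketch using IMemoryCache from Microsoft.Extensions.Caching.Memory; the loader method is a hypothetical stand-in, and the same shape applies to a distributed cache like Redis:

```csharp
// Hedged sketch: caching slow-changing reference data in-process.
// LoadCountriesFromDatabaseAsync is a hypothetical stand-in for a
// real query.
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions());

// First call hits the "database"; later calls are served from memory
// until the entry expires.
var countries = await cache.GetOrCreateAsync("countries", async entry =>
{
    entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(24);
    return await LoadCountriesFromDatabaseAsync();
});

Console.WriteLine(string.Join(", ", countries!));

static Task<List<string>> LoadCountriesFromDatabaseAsync() =>
    Task.FromResult(new List<string> { "Canada", "Mexico", "United States" });
```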
· Async/Await or Asynchronous Programming
Using async/await is not primarily about improving raw performance, although you can get that as a side benefit if it’s done right. It’s about making better use of the hosting platform’s available resources.
In .NET, and especially under IIS, awaiting an asynchronous operation releases the request thread back to the pool and parks the in-flight work in a “waiting” bucket. The pool of threads servicing active requests is then free to either pick up a new request or resume a task whose wait has completed. Individual calls to an async method may be a tad slower because of the thread management involved; however, your application makes much better use of its available resources and can juggle considerably more requests than before while maintaining a reasonably consistent response rate.
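As a rough illustration (not IIS-specific), compare a blocking call with its asynchronous equivalent; HttpClient stands in here for any I/O-bound dependency:

```csharp
// Hedged sketch: the same I/O-bound work done blocking vs async.
using System.Net.Http;

var http = new HttpClient();

// Blocking: the calling thread sits idle for the entire round trip
// and cannot service any other request in the meantime.
string GetSync(string url) =>
    http.GetStringAsync(url).GetAwaiter().GetResult();

// Async: the thread is returned to the pool while the request is in
// flight and only comes back to run the continuation.
async Task<string> GetAsync(string url) =>
    await http.GetStringAsync(url);

Console.WriteLine(GetSync("https://example.com").Length);
Console.WriteLine((await GetAsync("https://example.com")).Length);
```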
I would encourage you to read up on the details of whatever programming language you are using. Each is a bit different, and the descriptions in this document barely scratch the surface.
· Web Server Farm
Being able to have multiple web servers running your application code has a lot of advantages.
CPU- or memory-bound processes have more CPUs/memory to work with.
Maintenance becomes easier because you can take a server down without also taking your application down.
Application deployments are easier because you can deploy new code, move users to the new code, and finish deploying on the remaining systems.
There are some challenges as well to running multiple servers.
If you must maintain session state, you will need an external system to store it, or you can configure ARR affinity to keep requests “sticky” to the same server for the duration of the session.
If your application uses improperly scoped variables, like class-level or static fields, these can fall out of sync between running instances (see the sketch below).
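Here is a minimal sketch of the problem; the counter and the per-server numbers are purely illustrative:

```csharp
// Hedged sketch: static state on a web farm. Every server process
// gets its own copy of this field, so no single instance holds the
// true total (and the bare increment is not thread-safe either).
for (var i = 0; i < 3; i++)
    VisitCounter.Record();

// In this one process the count is 3; a second server running the
// same code has its own independent count. If the value must be
// consistent, move it to shared storage (a database, Redis, etc.).
Console.WriteLine(VisitCounter.Count);

public static class VisitCounter
{
    public static int Count;

    public static void Record() => Count++;
}
```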
What to do?
First, start in your test environment and set up a 2nd or 3rd server. Figure out what breaks and start to fix it.
Conduct a code review if time allows. As you walk through your code, ask yourself, “What happens if my user’s next request lands on a different machine?”
Are you using a design pattern or architecture that lends itself to being able to scale?
Microservices architecture is a great example of a design that, when done right, can and does scale well.
Containerization is also a good way to get started and to quickly roll out, or destroy, additional instances when necessary.
Is a read-only database available, and can I take advantage of it to improve performance?
Azure SQL makes this very easy on its premium tier: you just use a connection string that “tells” SQL the ApplicationIntent, and it will handle the routing for you.
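A minimal sketch of such a connection string; the server and database names are placeholders and credentials are omitted:

```csharp
// Hedged sketch: routing reads to a readable secondary with the
// ApplicationIntent connection string keyword.
using Microsoft.Data.SqlClient;

var readOnlyConnectionString =
    "Server=tcp:myserver.database.windows.net,1433;" +
    "Database=mydb;" +
    "ApplicationIntent=ReadOnly;" +  // ask the gateway for a read replica
    "Encrypt=True;";

// Use this connection for read-heavy work (reports, dashboards) and
// the normal read-write connection string everywhere else.
using var readConnection = new SqlConnection(readOnlyConnectionString);
```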
Start looking for bottlenecks:
Poorly sized components.
Single points of failure.
High-latency components.
Spend time on resources that need to scale. Don’t get stuck working on a component that wakes up every few hours, runs for a minute, and goes back to sleep just because it’s an interesting technical challenge.