HPC student here, still learning. If I understood your problem statement, users of the cluster reserve resources far greater than what was needed for their computation. They fear that if the allocated resources are not enough, then their program will crash and lose partial results.
Can you give an example of typical execution on the cluster? Is it a problem of number of hours allocated or number of compute cores?
If I'm running a PDE simulation and I allocate n machines, I want to use all of them, so there is no risk of idle machines. It's not trivial to estimate a priori the amount of time required for my simulation to complete, so I overestimate. But when the simulation completes (even before the deadline), the resources are freed and can be used right away for another job.
Maybe the problem arises when many users are greedy. Also, MPI simulations are difficult (if not impossible, correct me) to resize dynamically: once a simulation has started with a given number of ranks, I can't add new ranks at will, even if resources become available.
Thanks to everyone who answers, and for your patience.
mbreese 6 hours ago [-]
Many HPC jobs aren’t simulations that are CPU bound. In fact, most of the jobs on the clusters I’ve used have been single-node jobs (so technically HTC, but that term is rarely used).
I do genomics work and my jobs tend to be bursty. They may use a lot of CPU initially, but the second half of the job is writing results. This takes only one core, but still the max amount of memory. Or, I can have jobs that are CPU light, but need the max amount of memory for only a fraction of their wall-time.
Here is an example for you. Let’s say I’m processing a genome sequencing experiment. This requires about 8 different steps between preprocessing the data, alignment, post filtering, QC stat collection, etc. These are large input files, so my jobs end up being IO bound. If I were reading and writing at each step, it would add days of time to the pipeline. Instead, what we do is read the data once, and pipe the data from program to program. But each program has different CPU and memory requirements. We need to reserve the $MAX requirements for each. As the data moves through the pipeline though, we eventually end up with max utilization for only a portion of the walltime. If I optimize for efficient walltime, I leave CPUs and memory idle for a large portion of the job.
People also tend to like to manage fewer jobs. So instead of splitting a job into multiple dependent submissions that are tailored to each program, people will write a bash script that runs, but is not efficient.
Many times, these patterns are difficult to predict and you can’t submit a job that says I need 20 cores for an hour, but only 2 for the last two hours. It’s difficult to balance utilization vs wall-time. No one likes waiting, but there is usually little incentive to have high utilization rates. And sometimes the balance is total walltime. Sometimes it’s execution complexity.
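To put a number on the "20 cores for an hour, but only 2 for the last two hours" case, here is a minimal sketch of the core-hour arithmetic. The function name and the phase representation are made up for illustration, and it assumes the whole job runs under a single reservation sized for its peak.

```python
def utilization(phases, reserved_cores):
    """Fraction of reserved core-hours actually used.

    phases: list of (cores_used, hours) tuples, run back to back
    under one reservation of `reserved_cores` cores (an assumption).
    """
    total_hours = sum(hours for _, hours in phases)
    used = sum(cores * hours for cores, hours in phases)
    reserved = reserved_cores * total_hours
    return used / reserved


# 20 cores for 1 hour, then 2 cores for 2 hours.
phases = [(20, 1.0), (2, 2.0)]
peak_cores = max(cores for cores, _ in phases)

single_job = utilization(phases, reserved_cores=peak_cores)
print(f"{single_job:.0%}")  # 40%
```

Splitting the work into two dependent jobs, each sized for its own phase, would bring this close to 100% on paper, at the cost of a second queue wait and more jobs to manage, which is the trade-off described above.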
This is the problem this group is trying to solve - dynamically adapting the scheduler to know when a job isn’t going to use its full allocation of resources. I’m not sure there is a good way to do it. HPC users are concerned only with getting their jobs done fast. HPC admins want to see resources used efficiently. This is a classic pipelining problem: do you optimize for individual task time or overall system throughput?
I think the only way to really do this well is to make HPC jobs a market system where resources cost money to the users. When money is involved, people are incentivized to optimize their workloads. But that’s rarely the case for large HPC clusters and I’d personally hate it if I had to deal with a HPC processing budget.
In lieu of this, a common way to handle this lack of efficiency is to do “fair share” scheduling. This means that a user's prior workload is taken into account when prioritizing their queue position. So, if I did a lot of work last week, jobs from a user who didn’t run anything last week would get a priority boost over mine. This doesn’t address the utilization efficiency directly, but it does make access to the cluster seem more “fair”.
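As a rough illustration of the fair-share idea, the sketch below turns recent usage into a priority factor that decays as a user consumes more than their share. The exponential shape is one common choice; the function name, the formula, and the numbers are assumptions for illustration, not any particular scheduler's configuration.

```python
def fair_share_factor(usage_fraction, share_fraction):
    """Priority factor in (0, 1].

    usage_fraction: user's portion of recent cluster usage (0..1)
    share_fraction: portion of the cluster the user is entitled to (0..1]
    Returns 1.0 for no recent usage and 0.5 when usage equals the share.
    """
    if share_fraction <= 0:
        raise ValueError("share_fraction must be positive")
    return 2.0 ** (-usage_fraction / share_fraction)


# Two users with the same 25% share: one ran a lot last week, one ran nothing.
heavy_user = fair_share_factor(usage_fraction=0.60, share_fraction=0.25)
idle_user = fair_share_factor(usage_fraction=0.00, share_fraction=0.25)

print(heavy_user < idle_user)  # True
```

Nothing here looks at how efficiently the heavy user's jobs ran; the factor only reflects how much was consumed, which is why fair share changes perceived fairness rather than utilization.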
ray__ 1 days ago [-]
This is a cool idea. I know from snooping on submit scripts and node utilization on the HPC cluster that I use at my institution that most submissions leave some compute on the table (and many of them are egregiously bad). I'd probably vote in favor of sending every submitted sbatch script through an LLM (at least for everyone else; I would prefer tuning my own usage myself :) ).
Presumably the underlying model here is also an LLM? To what degree is it "fine-tuned", or is it just given a set of tools to build a good picture of cluster usage?
ismaeel_bashir 1 days ago [-]
Nope :) the core model isn’t an LLM. It’s a custom architecture built from the ground up. We natively accept multimodal inputs such as source code, submission scripts and hardware topologies. The LLMs in the post are the baselines we beat.
This is also why fine-tuning matters for us. We train a cluster-specific model that gets better as more jobs run on your cluster, because the same code behaves differently on different topologies. An LLM reasons about code and scripts in a vacuum, with no native sense of how your nodes actually perform.
DoctorOetker 14 hours ago [-]
What kind of non-LLM machine learning is applied to source code? Number of lines and other features?
Without language modeling (rendering it an xLM), how does one process computer language files (source code)? Or are you saying it's an SLM, not an LLM?
ray__ 24 hours ago [-]
I see, very interesting, thanks!
iroddis 21 hours ago [-]
This is really cool, and definitely needed.
Do you do any tracking of resource consumption over the runtime of a job? We have many jobs that use the requested memory only for a portion of the runtime, and are otherwise compute bound. It would be nice to be able to learn the profiles through time of jobs and layer them to get better resource utilization.
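The "layering" idea can be sketched as follows: if two jobs hit their memory peaks at different times, the node needs the peak of the summed profiles, not the sum of the individual peaks. The profiles and function names below are invented for illustration, and the sketch assumes equal-length profiles sampled at the same time steps.

```python
def sum_of_peaks(profiles):
    """Memory needed if every job reserves its own peak for its whole runtime."""
    return sum(max(profile) for profile in profiles)


def peak_of_sum(profiles):
    """Memory needed if jobs are co-scheduled using their profiles over time.

    profiles: equal-length lists of memory use (GB) per time step.
    """
    return max(sum(step) for step in zip(*profiles))


# Hypothetical profiles: job A is memory heavy early, job B late.
job_a = [60, 60, 10, 10]
job_b = [10, 10, 60, 60]
profiles = [job_a, job_b]

print(sum_of_peaks(profiles))  # 120
print(peak_of_sum(profiles))   # 70
```

The gap between the two numbers is the memory that peak-based reservations leave idle; exploiting it safely depends on the profiles being predictable, which is the hard part.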
ismaeel_bashir 20 hours ago [-]
Yes :)
This is actually a really cool feature of the platform. We ingest DCGM, CUPTI, and cgroups to give users granular telemetry of what exactly is going on in the hardware they allocated when running jobs on it.
We also have a profiler with single-digit overhead that correlates stack frames with hardware metrics. This means you will not only be able to see whether your job was compute bound or memory bound at time x, but also be able to correlate this with areas in your code [currently only supported in Python - other languages coming soon :) ]
Would love to show you a demo of this live. Feel free to email me at ismaeel@expanse.org.uk
flounder3 1 days ago [-]
One traditional enterprise goal of 40% utilization was to cover DR/failovers, so one region could take on 100% of traffic from another, with 20% headroom.
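The arithmetic behind that 40% target, as a small sketch. It assumes two regions of equal capacity, each running at the same steady-state utilization; the function name is illustrative.

```python
def failover_load(own_utilization, absorbed_utilization):
    """Utilization of the surviving region after absorbing the other's traffic.

    Assumes both regions have the same capacity.
    """
    return own_utilization + absorbed_utilization


steady_state = 0.40
after_failover = failover_load(steady_state, steady_state)
headroom = 1.0 - after_failover

print(f"{after_failover:.0%} used, {headroom:.0%} headroom")  # 80% used, 20% headroom
```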
I'm curious about the granularity of contracts around granting/selling excess capacity. Are they short term? Can the owner evict those workloads (with a penalty)?
ismaeel_bashir 1 days ago [-]
Good point - people do set capacity aside, reserving it for later.
But our utilisation measurements capture waste within a user's allocation. It’s waste in what users are actually requesting and running, not in any reserved idle capacity.
For now we sit only on the prediction/intelligence layer; we don’t do any scheduling. We don’t grant or sell capacity, we just tell the scheduler (and user) what a job actually needs.
boringperson 1 days ago [-]
> Datacenters run at roughly 30% to 40% effective utilisation
I wonder what is stopping datacenters from passing this benefit to customers by launching better tuned plans. For example, t series EC2 instances on AWS.
aleksiy123 1 days ago [-]
Doesn’t the fact that you just referenced it indicate that they do?
I feel like it’s probably just complexity.
Different workloads benefit from specific types of optimisations.
nostrebored 16 hours ago [-]
this is true of most actual clouds/neoclouds. oversubscription and intelligent placement of workloads is already something they do. I’ve known a few people at AWS who have offset unbelievable amounts of cost by optimizing placement.
keremimo 1 days ago [-]
Greed
rjpruitt16 1 days ago [-]
I have been working on an open source traffic shaper for agents. I think it may help you with prediction if requests don’t stampede you.
I'm writing a book on perf optimization and would love to ask you questions sometime. Email me (in my bio here) if interested. Thanks!
ismaeel_bashir 1 days ago [-]
Sure would be happy to :)
I’ll send you an email, good luck with the book!
mike_d 19 hours ago [-]
Your "OS Wastage Scanner" is grammatically incorrect. It's "waste."
mike_d 19 hours ago [-]
From a security perspective this is a non-starter. If you leave your MongoDB instance open and I steal the telemetry you are collecting, I can reverse engineer the data into meaningful insights into cluster workloads. So all your potential national security customers or IP sensitive customers (finance, biotech, etc) are immediately out.
Any competent enterprise risk team is going to give a hard no to a SaaS application being in the critical path for on-prem business critical workloads. So there goes Fortune 100 too.
If you are successful and schedule workloads better, you are just deferring upgrades and expansions. The customer's Dell/HPE/etc. sales rep is going to freak out, some vice presidents are going to go golfing together, and all the remaining high value customers don't renew.
What you are really left with is the "small and medium business" clusters that are purpose specific. They are running 100% on a handful of tasks that can probably be hand tuned.
This sounds like really cool technology, I just don't see the business. Hopefully you'll consider open sourcing it soon.
ismaeel_bashir 18 hours ago [-]
Thanks, the security point is valid, so let me be specific about how deployment works for us!
There's no telemetry egress. Deployments are air-gapped and run in the customer's VPC, on their own hardware. We don't ship telemetry out to a SaaS backend to reverse-engineer; the data never leaves their environment, and for on-prem/air-gapped customers there's zero egress and full audit logging. We are doing all this because finance, biotech, and national-scale customers are the design target for us - we all worked in the space and understand what security measures need to be in place for this to work.
For example, the "open MongoDB" failure you mentioned isn't something that would concern us, because there's no central store of their data to leak.
On "SaaS being in the critical path": we agree, and that's why we're not in it. We're not a scheduler or a runtime. Our daemon is passive, and if it falls over, jobs still submit and run exactly as they do today. We sit alongside as a prediction/recommendation layer, not in the path that has to be up for the cluster to work.
For upgrades and expansions with increasing utilisation, most large scale compute users are capacity constrained and growing faster than they can buy GPUs. If anything we are delaying the expansion not killing it. In terms of unit economics, being able to serve more users with tighter user allocations is a net positive for cloud providers and is something they actively try and pursue :)
mike_d 16 hours ago [-]
Probably the most helpful advice I can give you is pointing out that I wrote my comment after reading your homepage and docs. :)
I used to run security for building-sized computers, if you want any feedback. My email is in my profile.
Of course, it's impossible to know for sure what was LLM processed or not, but your posts are getting classified that way and, on inspection, it does seem justified.
https://www.linkedin.com/posts/rahmi-pruitt-a1bb4a127_agentn...