Dedicated Endpoints
Endpoints are GPU instances that let you deploy AI models on dedicated hardware. This enables private inference that automatically scales up or down based on traffic.
Overview
By using Endpoints, Segmind users can unlock the full potential of AI models, enabling them to build innovative applications that drive business growth and solve real-world challenges easily and reliably.
Blazingly fast: Endpoints with one or more baseline GPUs are the fastest way to run inference on the Segmind platform for any model.
Private by default: Endpoints allow developers to perform private inference on their models. This ensures that latency and other runtime metrics are not affected by other users' requests, unlike public endpoints, which utilise a shared queue for all users.
Scalable by design: Endpoints offer scalable inference by handling large volumes of requests concurrently. Developers can specify up to 10 max workers to process multiple requests simultaneously. Endpoints also provide full control over scaling policies, enabling developers to manage complex and high-throughput workloads efficiently.
Cost-effective: With an option to specify a maximum limit for autoscaling workers, developers can keep endpoint costs predictable. For 0-n setups, there is no upfront commitment, allowing seamless scaling with increasing traffic on a pay-as-you-go basis.
Choose your own hardware: Developers can choose from a wide range of GPUs to balance price and performance. See https://docs.segmind.com/home/dedicated-endpoints#pricing for a list of available GPUs.
How it works
To create an Endpoint, you specify the model, the GPU type and a URL slug for exposing your endpoint. Once created, Segmind allocates a cluster of GPU instance(s) for the specified model. These GPU instances act as dedicated computing resources responsible solely for executing inference requests on the specified slug.
When an inference request is received, the Endpoint routes it to the appropriate instance in a round-robin fashion, ensuring fast and reliable processing.
All user endpoints are private, meaning only the user who created the endpoint can run inference on it using their API key. Segmind Models are all hosted on public endpoints.
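As an illustration, a request to a dedicated endpoint from Python might look like the sketch below. The URL pattern, header name and payload fields here are assumptions for illustration; use the exact slug, authentication details and input schema shown for your endpoint and model.

```python
import requests

# Hypothetical values: substitute your own API key, endpoint slug and model inputs.
API_KEY = "YOUR_SEGMIND_API_KEY"
ENDPOINT_URL = "https://api.segmind.com/v1/my-endpoint-slug"  # assumed URL pattern

payload = {
    "prompt": "a photo of an astronaut riding a horse"  # model-specific inputs
}

response = requests.post(
    ENDPOINT_URL,
    json=payload,
    headers={"x-api-key": API_KEY},
)
response.raise_for_status()

# Depending on the model, the response may be JSON or raw bytes (e.g. an image).
with open("output.bin", "wb") as f:
    f.write(response.content)
```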
Pricing
Endpoints are charged hourly based on total usage (baseline + autoscaling). There is no additional per-request cost for inference on endpoints.
| GPU | Baseline | Autoscaling |
|---|---|---|
| L40 | $2.67/hr | $0.00424/s |
| A40 | $1.73/hr | $0.00272/s |
| A100 | $4.32/hr | $0.00608/s |
| H100 | $9.00/hr | $0.01240/s |
A credit balance equal to 24 hours of runtime cost is required to create or update an endpoint. Users receive a warning email when their balance drops below 10 hours of runtime, and all endpoints are turned off once the credit balance falls below $10.
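As a back-of-the-envelope illustration, the monthly cost of an endpoint can be estimated from the rates above; the A40 rates come from the table, while the traffic figures in this sketch are assumptions for the example.

```python
# Rough monthly cost estimate for a 2-baseline A40 endpoint, using the rates above.
BASELINE_RATE_PER_HR = 1.73      # A40 baseline, $/hr
AUTOSCALE_RATE_PER_S = 0.00272   # A40 autoscaling, $/s

baseline_gpus = 2
hours_per_month = 24 * 30
autoscaled_seconds = 50_000      # assumed busy time on autoscaling workers per month

baseline_cost = baseline_gpus * BASELINE_RATE_PER_HR * hours_per_month
autoscale_cost = autoscaled_seconds * AUTOSCALE_RATE_PER_S

print(f"Baseline:    ${baseline_cost:,.2f}/month")
print(f"Autoscaling: ${autoscale_cost:,.2f}/month")
print(f"Total:       ${baseline_cost + autoscale_cost:,.2f}/month")
```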
Autoscaling
When creating an endpoint, the number of baseline and autoscaling GPUs must be specified.
Baseline instances are always-on machines, providing a solid low-latency starting point for handling consistent workloads.
Autoscaling instances, on the other hand, dynamically spin up and down based on demand, ensuring that resources are allocated only when necessary, albeit with a cold-start latency of around 5-500s.
Baseline workers specify the minimum number of GPUs that are always on. Autoscaling workers are turned on only when required and are billed by the second. To determine your required configuration, we suggest starting with a 1-2 setup and running a load test to check response times with increasing traffic, as sketched below.
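A minimal load-test sketch along these lines is shown here; the endpoint URL, header and payload are placeholders, and a dedicated tool such as locust or k6 will give more thorough results.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint and payload; replace with your own slug and model inputs.
ENDPOINT_URL = "https://api.segmind.com/v1/my-endpoint-slug"
API_KEY = "YOUR_SEGMIND_API_KEY"
PAYLOAD = {"prompt": "a lighthouse at dusk"}

def timed_request(_):
    # Send one request and return its latency and HTTP status.
    start = time.monotonic()
    r = requests.post(ENDPOINT_URL, json=PAYLOAD, headers={"x-api-key": API_KEY})
    return time.monotonic() - start, r.status_code

# Ramp concurrency up and watch how response times change.
for concurrency in (1, 2, 5, 10):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_request, range(concurrency * 3)))
    latencies = [t for t, _ in results]
    print(f"concurrency={concurrency:>2}  "
          f"p50={statistics.median(latencies):.2f}s  "
          f"max={max(latencies):.2f}s")
```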
Advanced Configuration
Endpoints offer advanced autoscaling capabilities through idle timeout, execution timeout and scaling policy.
Idle timeout: How long a scaled worker continues to run after it has processed a request. Defaults to 10s; this can be increased up to 300s to avoid cold starts.
Execution timeout: The maximum time allowed for an inference request, after which the request is terminated. Defaults to 600s.
This scaling approach ensures efficient resource utilisation and peak performance, while giving users full control over managing their costs and usage.
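Putting these options together, an endpoint configuration can be pictured roughly as the dict below. The field names are illustrative only (the actual settings are made in the endpoint creation form), but the values map directly onto the concepts above.

```python
# Illustrative endpoint configuration; field names are hypothetical.
endpoint_config = {
    "model": "faceswap-v2",
    "gpu": "A40",
    "slug": "my-faceswap-endpoint",  # URL slug exposed for inference
    "baseline_workers": 2,           # always-on GPUs
    "max_workers": 5,                # ceiling including autoscaling workers
    "idle_timeout_s": 120,           # keep scaled workers warm after a request (10-300s)
    "execution_timeout_s": 600,      # terminate requests that run longer than this
}
```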
Examples
Development/Testing (0-1 setup): Basic serverless setup, good for testing models. High latency, as almost all requests require a cold start. Low and predictable cost (0 commitment, 1 max).
Serverless autoscaling setup (0-5 setup): Good for cases where cold starts are not a concern, such as batch jobs. Can scale up to 5 machines. Low cost (0 commitment, 5 max).
Production setup (2-5 setup): Production-grade autoscaling inference setup. 2 dedicated GPU machines run all the time and scale up to 5 during peak traffic. Low latency.
RPM calculation
Consider faceswap-v2, for example. The average latency for this model is ~6s, so the RPM supported by each machine can be estimated as 60/6 = 10. One baseline machine can therefore serve 10 requests every minute, and each autoscaling worker can handle 10 additional requests per minute on demand.
Assume an app expects 20 requests per minute, which can go up to 100 during peak times. The minimum recommended configuration is then 2-10. For even lower response times, we suggest setting a higher number of baseline GPUs. Note that a 0-10 setup would also work, though response times would be higher for an endpoint without any baseline GPU.
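The same estimate can be written out in a few lines of Python; the latency and traffic figures are the ones from the example above.

```python
import math

avg_latency_s = 6                       # average latency of faceswap-v2
rpm_per_worker = 60 / avg_latency_s     # ~10 requests per minute per GPU

steady_rpm = 20                         # typical traffic
peak_rpm = 100                          # peak traffic

baseline_workers = math.ceil(steady_rpm / rpm_per_worker)  # 2
max_workers = math.ceil(peak_rpm / rpm_per_worker)         # 10

print(f"Recommended configuration: {baseline_workers}-{max_workers}")
```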
Colocation
Contact Segmind support to deploy multiple models on a single endpoint or increase the max limits for workers.
Getting started
Head over to https://docs.segmind.com/home/dedicated-endpoints/getting-started for detailed instructions on creating and managing your endpoints.