Dedicated Endpoints

Endpoints are GPU instances that let you deploy AI models on dedicated hardware. This enables private inference that automatically scales up or down based on traffic.

Overview

By using Endpoints, Segmind users can unlock the full potential of AI models, enabling them to build innovative applications that drive business growth and solve real-world challenges easily and reliably.

  • Private by default: Endpoints allow developers to perform private inference on their models. This ensures that latency and other runtime metrics are not affected by other users’ requests, unlike Segmind public endpoints, which utilise a shared queue for all users.

  • Scalable by design: Endpoints offer scalable inference by handling large volumes of requests concurrently. Developers can specify up to 10 max workers to process multiple requests simultaneously. Endpoints provide full control over scaling policies, enabling developers to manage complex and high-throughput workloads efficiently.

  • Cost-effective: With an option to specify a maximum limit for autoscaling workers, developers can keep endpoint costs predictable. For 0-n setups, there is no upfront commitment, enabling seamless scaling with increasing traffic on a pay-as-you-go basis.

How it works

To create an Endpoint, you specify the model, the GPU type, and a URL slug for exposing your endpoint. Once created, Segmind allocates a cluster of one or more GPU instances for the specified model. These GPU instances act as dedicated computing resources responsible solely for executing inference requests on the specified slug.

When an inference request is received, the Endpoint routes it to the appropriate instance in a round-robin fashion, ensuring fast and reliable processing.

All user endpoints are private, meaning only the user who created an endpoint can run inference on it using their API key. Segmind Models are all hosted on public endpoints.
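
As a rough illustration, here is how an inference call to a dedicated endpoint might look from Python. The URL slug, header name and payload fields below are assumptions made for the sake of the example; refer to the API reference for your endpoint for the exact request format.

```python
# Minimal sketch of calling a dedicated endpoint with an API key.
# The URL slug, header name, and payload fields are illustrative
# assumptions; consult the Segmind API reference for the exact format.
import requests

API_KEY = "SG_xxxxxxxxxxxxxxxx"                               # placeholder API key
ENDPOINT_URL = "https://api.segmind.com/v1/my-endpoint-slug"  # hypothetical slug

payload = {"prompt": "a photo of a red vintage car"}          # model-specific inputs

response = requests.post(
    ENDPOINT_URL,
    headers={"x-api-key": API_KEY},
    json=payload,
    timeout=600,  # align with the default execution timeout
)
response.raise_for_status()
print(response.headers.get("content-type"), len(response.content))
```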

Pricing

Endpoints are charged based on total usage: an hourly rate for baseline instances plus per-second billing for autoscaling instances. There is no additional cost for inference requests to endpoints.

GPU     Baseline    Autoscaling
L40     $2.67/hr    $0.00424/s
A40     $1.73/hr    $0.00272/s
A100    $4.32/hr    $0.00608/s
H100    $9.00/hr    $0.01240/s

A credit balance covering 24 hours of runtime is required to create or update an endpoint. Users receive a warning email when their balance drops below 10 hours of runtime, and all endpoints are turned off once the user's credit balance falls below $10.
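
As a worked example of the pricing above, the sketch below estimates the baseline cost and the minimum credit balance (24 hours of runtime) for a given configuration. The rates are taken from the table; the helper function itself is purely illustrative.

```python
# Illustrative cost estimate based on the baseline rates in the table above.
BASELINE_RATE_PER_HR = {"L40": 2.67, "A40": 1.73, "A100": 4.32, "H100": 9.00}

def baseline_cost(gpu: str, baseline_workers: int, hours: float) -> float:
    """Cost of always-on baseline workers over a given number of hours."""
    return BASELINE_RATE_PER_HR[gpu] * baseline_workers * hours

# Example: a 2-5 setup on A100 (2 baseline workers, up to 5 total).
min_credit = baseline_cost("A100", 2, 24)       # credit required to create/update
monthly = baseline_cost("A100", 2, 24 * 30)     # baseline-only monthly estimate
print(f"Minimum credit: ${min_credit:.2f}, monthly baseline: ${monthly:.2f}")
# Autoscaling A100 workers add $0.00608/s each, only while they are running.
```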

Autoscaling

When creating an endpoint, you specify the number of baseline and autoscaling GPUs.

Baseline instances are always-on machines, providing a solid low-latency starting point for handling consistent workloads.

Autoscaling instances, on the other hand, dynamically spin up and down based on demand, ensuring that resources are allocated only when necessary, albeit with a cold-start latency of around 5-500s.

Baseline workers set the minimum number of GPUs that are always on. Autoscaling workers are turned on only when required and are billed by the second. To determine the configuration you need, we suggest starting with a 0-2 setup and running a load test to check response times with increasing traffic, as in the sketch below.
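
The following sketch shows one way such a load test might look. The endpoint URL, header name and payload are placeholders, and the concurrency levels are arbitrary; adapt them to your own endpoint and expected traffic.

```python
# Rough load-test sketch for sizing an endpoint (e.g. a 0-2 setup).
# ENDPOINT_URL, the header name and the payload are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT_URL = "https://api.segmind.com/v1/my-endpoint-slug"  # hypothetical
API_KEY = "SG_xxxxxxxxxxxxxxxx"                               # placeholder

def one_request(_):
    start = time.perf_counter()
    r = requests.post(ENDPOINT_URL,
                      headers={"x-api-key": API_KEY},
                      json={"prompt": "load test"},
                      timeout=600)
    return time.perf_counter() - start, r.status_code

# Ramp up concurrency and watch how response times grow.
for concurrency in (1, 2, 5, 10):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(concurrency)))
    latencies = [t for t, _ in results]
    print(f"{concurrency:>2} concurrent: "
          f"avg {sum(latencies)/len(latencies):.1f}s, max {max(latencies):.1f}s")
```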

Advanced Configuration

Endpoints offer advanced autoscaling capabilities through idle timeouts, execution timeouts and scaling policies.

  • Queue delay: Scale up one instance as soon as the queue delay reaches the specified time (in seconds). Defaults to 4s.

  • Request count: Scale up one instance as soon as the number of queued requests reaches the specified value. Defaults to 1.

  • Idle timeout: How long a scaled worker keeps running after it has processed a request. Defaults to 10s.

  • Execution timeout: Maximum time allowed for an inference request, after which the request is terminated. Defaults to 600s.
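
For reference, the sketch below collects these knobs and their defaults in a plain Python dict. This only mirrors the list above for readability; it is not the actual schema used to configure an endpoint.

```python
# Illustrative summary of the advanced autoscaling knobs and their defaults.
# Not the real configuration schema; it only mirrors the list above.
scaling_policy = {
    "queue_delay_s": 4,         # scale up once queued requests wait this long
    "request_count": 1,         # scale up once this many requests are queued
    "idle_timeout_s": 10,       # keep a scaled worker alive this long after its last request
    "execution_timeout_s": 600, # terminate any single inference request after this long
}
```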

This scaling approach ensures efficient resource utilisation, minimising costs while maintaining peak performance, and gives users full control over managing their costs and usage.

Examples

Development/Testing (0-1 setup): Basic serverless setup, good for testing models. High latency, as almost all requests require a cold start. Low and predictable cost (0 baseline, 1 max worker).

Serverless autoscaling setup (0-5 setup): Good for cases where cold starts are not a concern, such as batch jobs. Can scale up to 5 machines. Low cost (0 baseline, 5 max workers).

Production setup (2-5 setup): Production-grade autoscaling inference setup. 2 dedicated GPU machines run all the time and scale up to 5 during peak traffic. Low latency.

RPM calculation

Consider faceswap-v2, for example. The average latency for this model is ~6s, so the average RPM supported by each machine can be estimated as 60/6 = 10. One machine can therefore support 10 requests every minute, and each autoscaling worker can handle 10 additional requests per minute on demand.

Assume an app expects 10 requests per minute, which can go up to 100 during peak times. The minimum recommended configuration is then 1-10. For even lower response times, we suggest setting a higher number of baseline GPUs. A 0-10 setup would also work, although response times are significantly higher for endpoints without any baseline GPU. The sketch below reproduces this sizing arithmetic.
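
The same back-of-the-envelope sizing can be written out explicitly. The latency and traffic figures below are the ones assumed above.

```python
# Back-of-the-envelope worker sizing from the RPM estimate above.
import math

avg_latency_s = 6                     # ~6s average latency for faceswap-v2
rpm_per_worker = 60 / avg_latency_s   # ≈ 10 requests per minute per machine

steady_rpm, peak_rpm = 10, 100
baseline_workers = math.ceil(steady_rpm / rpm_per_worker)  # 1
max_workers = math.ceil(peak_rpm / rpm_per_worker)         # 10
print(f"Recommended configuration: {baseline_workers}-{max_workers}")
```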

Colocation

Contact Segmind support to deploy multiple models on a single endpoint or increase the max limits for workers.

Getting started

Head over to https://docs.segmind.com/home/dedicated-endpoints/getting-started for detailed instructions regarding creating and managing your endpoints.
