HTTP Client

The SDK routes all HTTP communication through a unified HttpClient abstraction. This lets you customize authentication and rate limiting, and makes the SDK easy to test with mock clients.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          Your Application                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   RemoteQuickCommand ─────┐                                         │
│                           ├──► HttpClient ──► StackSpot AI API     │
│   Agent ──────────────────┘                                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Built-in Implementations

EnvironmentAwareHttpClient (Default)

Automatically detects the runtime environment and uses the appropriate client:

  1. CLI available → Uses StkCLIHttpClient
  2. Credentials configured → Uses StandaloneHttpClient
  3. Neither → Raises ValueError with clear instructions

from stkai import RemoteQuickCommand

# Works automatically in any environment
rqc = RemoteQuickCommand(slug_name="my-command")

The detection happens lazily on the first request, allowing you to call STKAI.configure() after import.
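
For example, this ordering works because nothing is resolved until the first request goes out (a sketch that reuses the STKAI.configure() call shown later on this page):

from stkai import STKAI, RemoteQuickCommand

# Constructing the command does not resolve the environment yet.
rqc = RemoteQuickCommand(slug_name="my-command")

# Configuring credentials after construction is still fine: the underlying
# client is only chosen when the first request is made.
STKAI.configure(
    auth={
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    }
)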

Zero Configuration

With EnvironmentAwareHttpClient, you don't need to worry about which client to use:

  • Development: Install CLI and run stk login
  • Production/CI: Set STKAI_AUTH_CLIENT_ID and STKAI_AUTH_CLIENT_SECRET

StkCLIHttpClient

Explicitly delegates authentication to the StackSpot CLI (oscli):

from stkai import RemoteQuickCommand, StkCLIHttpClient

# Explicit CLI usage
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=StkCLIHttpClient(),
)

Requirements

StackSpot CLI must be installed and authenticated:

stk login

StandaloneHttpClient

Explicitly uses client credentials for environments without StackSpot CLI:

from stkai import (
    RemoteQuickCommand,
    StandaloneHttpClient,
    ClientCredentialsAuthProvider,
)

# Create auth provider
auth_provider = ClientCredentialsAuthProvider(
    client_id="your-client-id",
    client_secret="your-client-secret",
)

# Create HTTP client
http_client = StandaloneHttpClient(auth_provider=auth_provider)

# Use with RQC
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=http_client,
)

Or use the global configuration:

from stkai import STKAI, create_standalone_auth, StandaloneHttpClient

# Configure credentials globally
STKAI.configure(
    auth={
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    }
)

# Create client using global config
auth_provider = create_standalone_auth()
http_client = StandaloneHttpClient(auth_provider=auth_provider)

TokenBucketRateLimitedHttpClient

Wraps another client with Token Bucket rate limiting:

from stkai import TokenBucketRateLimitedHttpClient, EnvironmentAwareHttpClient

http_client = TokenBucketRateLimitedHttpClient(
    delegate=EnvironmentAwareHttpClient(),
    max_requests=30,      # Requests per window
    time_window=60.0,     # Window in seconds
)

AdaptiveRateLimitedHttpClient

Adds adaptive rate control with AIMD algorithm:

from stkai import AdaptiveRateLimitedHttpClient, EnvironmentAwareHttpClient

http_client = AdaptiveRateLimitedHttpClient(
    delegate=EnvironmentAwareHttpClient(),
    max_requests=100,
    time_window=60.0,
    min_rate_floor=0.1,      # Never below 10%
    penalty_factor=0.2,      # Reduce by 20% on 429
    recovery_factor=0.05,    # Increase by 5% on success
)

429 Handling

When the server returns HTTP 429, AdaptiveRateLimitedHttpClient applies the AIMD penalty (reduces rate) and raises ServerSideRateLimitError. The actual retry logic is handled by the Retrying class, which respects the Retry-After header.

Rate Limiting

Terminology: Rate Limiting vs Throttling

The SDK uses "rate limiting" terminology, but the actual behavior is closer to throttling:

Concept        Side    Behavior                                           Philosophy
Rate Limiting  Server  Rejects requests exceeding the limit (HTTP 429)    Reactive/Punitive
Throttling     Client  Delays requests to stay under the limit            Proactive/Preventive

Why "Rate Limiting" terminology?

  1. Industry convention: AWS SDK, Google Cloud, and other popular SDKs use "rate limit" for client-side features
  2. Discoverability: Developers search for "rate limiting" when facing quota issues
  3. Problem alignment: The problem being solved is "don't violate the server's rate limit"

The actual behavior is hybrid:

┌─────────────────────────────────────────────────────────────────────────────┐
│                       Client-Side Rate Control Behavior                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   1. THROTTLING (Primary)                                                   │
│      Delays requests by waiting for tokens in a queue.                      │
│      Requests are NOT rejected — they wait their turn.                      │
│                                                                              │
│   2. REJECTION (Secondary)                                                  │
│      Raises exceptions when:                                                │
│      • Wait time exceeds max_wait_time → TokenAcquisitionTimeoutError       │
│      • Server returns HTTP 429 → ServerSideRateLimitError (adaptive only)  │
│                                                                              │
│   This means: requests wait in queue first, exceptions are last resort.     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

This hybrid approach maximizes successful requests while providing clear failure modes when limits can't be respected.

Why Rate Limiting Matters for StackSpot AI

Agents and Remote Quick Commands (RQCs) make calls to LLM models that have shared quotas and costs per request. Without proper rate control:

  • HTTP 429 errors flood your logs and waste retry cycles
  • Service degradation affects other teams sharing the same quota
  • Unexpected costs from runaway batch jobs
  • Thundering herd when multiple processes start simultaneously

Rate limiting is especially important for:

Scenario                                        Risk Without Rate Limiting
Batch processing (e.g., processing 1000 files)  Burst of requests exhausts quota in seconds
Multiple CI/CD pipelines                        Pipelines compete for shared quota
Microservices with multiple replicas            Each replica thinks it has full quota
Development + Production sharing quota          Dev experiments impact production

Understanding 429 Errors: Why They Still Happen

Important: Rate Limiting Does Not Eliminate 429 Errors

Even with perfectly configured rate limiting, your application will still receive HTTP 429 responses. This is normal and expected behavior, not a bug.

Why 429s are inevitable:

  • Fixed vs Sliding Windows: The server resets its quota in fixed intervals (e.g., every 60s), while the client token bucket refills continuously. Timing mismatches cause 429s at window boundaries.
  • Clock Skew: Client and server clocks are never perfectly synchronized. A request sent "within quota" may arrive after the server's window reset.
  • Network Latency: Variable network delays mean requests don't arrive in the order or timing they were sent.
  • Burst Patterns: Even with a correct average rate, Poisson-distributed arrivals have bursts that temporarily exceed limits.
  • Shared Quotas: Multiple processes/services sharing a quota cannot perfectly coordinate without a central arbiter.

The correct mental model:

┌─────────────────────────────────────────────────────────────────────────────┐
│                     Rate Limiting Mental Model                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ❌ WRONG: "Rate limiting prevents 429 errors"                             │
│                                                                              │
│   ✅ RIGHT: "Rate limiting REDUCES 429 errors and makes retry effective"    │
│                                                                              │
│   Without rate limiting:  429 rate = 300%+  (chaos, retry exhaustion)       │
│   With rate limiting:     429 rate = 5-10%  (manageable, retry succeeds)    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

This is why the SDK combines three mechanisms:

  1. Rate Limiting — Reduces 429 frequency from catastrophic to manageable
  2. Retry with Backoff — Recovers from the 429s that still occur
  3. Retry-After Header — Server tells client exactly when to retry

Industry perspective: This behavior is well-documented in distributed systems literature:

"We design our systems to reduce the probability of failure, but impossible to build systems that never fail. [...] Retries allow clients to survive these random partial failures and short-lived transient failures."

AWS Builders Library: Timeouts, retries, and backoff with jitter

"Race conditions everywhere. Two requests at the exact same millisecond? You're in trouble. [...] ~5% error margin compared to true sliding window."

The Hidden Complexity of Distributed Rate Limiting

Key insight: Rate limiting algorithms (token bucket, sliding window) are deterministic in isolation, but become non-deterministic in distributed systems due to race conditions, clock skew, and network jitter. Practical implementations accept a ~5% error margin, trading perfect accuracy for performance. This is why rate limiting is best understood as "best-effort" rather than a guarantee — the goal is not zero 429s, but graceful handling of inevitable 429s through retry with exponential backoff and jitter.

Practical Implication

Don't disable retry logic thinking "my rate limiting is perfect." Always keep retry enabled — it's your safety net for the 429s that will inevitably occur.

Strategies: Token Bucket vs Adaptive

The SDK offers two rate limiting strategies, each suited for different scenarios.

Token Bucket Strategy

When to use:

  • You have a fixed, known quota (e.g., "100 requests/minute")
  • Your process runs alone or has a dedicated quota allocation
  • You want predictable, simple rate limiting

How it works:

┌──────────────────────────────────────────────────────────────────┐
│                      Token Bucket Algorithm                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│   Bucket: [●●●●●○○○○○]  (5 tokens available, 5 used)             │
│                                                                   │
│   • Tokens refill over time at: max_requests / time_window       │
│   • Each POST request consumes 1 token                           │
│   • When empty, requests wait until tokens available             │
│   • If waiting exceeds max_wait_time → TokenAcquisitionTimeoutError     │
│   • GET requests (polling) pass through without consuming tokens │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘
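
In code, the refill arithmetic described above looks roughly like this (a simplified, illustrative model, not the SDK's internal implementation):

import time

class SimpleTokenBucket:
    """Illustrative token bucket: max_requests tokens refilled over time_window seconds."""

    def __init__(self, max_requests: int, time_window: float) -> None:
        self.capacity = float(max_requests)
        self.refill_rate = max_requests / time_window  # tokens per second
        self.tokens = self.capacity
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill continuously based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # each POST consumes one token
            return True
        return False  # caller waits and retries, up to max_wait_time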

Points of attention:

  • Does not react to HTTP 429 from the server — it only controls outgoing rate
  • If your quota is shared with other processes, you may still get 429s
  • Best combined with retry logic (which the SDK provides automatically)

Adaptive Strategy (AIMD)

When to use:

  • Multiple processes share the same quota
  • Your quota is unpredictable or varies over time
  • You want the SDK to automatically adjust based on server feedback

How it works:

┌──────────────────────────────────────────────────────────────────┐
│                         AIMD Algorithm                            │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│   On SUCCESS:                                                     │
│     effective_rate += max_requests × recovery_factor × jitter    │
│     (gradual increase, jittered ±20%)                            │
│                                                                   │
│   On HTTP 429:                                                    │
│     effective_rate *= (1 - penalty_factor × jitter)              │
│     (aggressive decrease, jittered ±20%)                         │
│     raise ServerSideRateLimitError → Retrying handles retry       │
│                                                                   │
│   On TOKEN WAIT:                                                  │
│     sleep(wait_time × jitter)                                     │
│     (sleep jitter ±20% to spread workers)                        │
│                                                                   │
│   Constraints:                                                    │
│     • Floor: effective_rate ≥ max_requests × min_rate_floor      │
│     • Ceiling: effective_rate ≤ max_requests                     │
│                                                                   │
│   Anti-Thundering Herd:                                           │
│     • Structural jitter: penalty/recovery vary ±20% per process  │
│     • Sleep jitter: token wait varies ±20%                       │
│     • Deterministic RNG per process (hostname+pid seed)          │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘
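
The adjustments above boil down to a small update rule. This is an illustrative sketch of one AIMD step, not the SDK's internal code:

import random

def aimd_update(
    effective_rate: float,
    max_requests: float,
    got_429: bool,
    penalty_factor: float = 0.2,
    recovery_factor: float = 0.05,
    min_rate_floor: float = 0.1,
) -> float:
    """One AIMD step: multiplicative decrease on 429, additive increase on success."""
    jitter = random.uniform(0.8, 1.2)  # structural jitter, ±20%
    if got_429:
        effective_rate *= 1.0 - penalty_factor * jitter
    else:
        effective_rate += max_requests * recovery_factor * jitter
    # Clamp between the floor and the configured ceiling.
    floor = max_requests * min_rate_floor
    return max(floor, min(effective_rate, max_requests))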

Points of attention:

  • Convergence time: After a 429, it takes several successful requests to recover full rate
  • Cold start: Starts at max_requests and decreases on first 429 — may cause initial burst
  • recovery_factor too low = slow recovery after 429 spike
  • penalty_factor too low = doesn't back off enough, keeps hitting 429s

429 Handling: When the server returns HTTP 429 (Too Many Requests), the AdaptiveRateLimitedHttpClient:

  1. Applies AIMD penalty - Reduces effective rate by penalty_factor
  2. Raises ServerSideRateLimitError - Contains the response for Retry-After parsing

The Retrying class then handles the retry:

  1. Parses Retry-After header - If present and ≤ 60s, uses it as wait time
  2. Calculates wait time - Uses max(Retry-After, exponential backoff)
  3. Adds jitter (0-30%) - Prevents thundering herd
  4. Retries the request - Up to max_retries times

Protection against abusive Retry-After

The Retrying class ignores Retry-After values greater than 60 seconds to protect against buggy or malicious servers. In such cases, it falls back to exponential backoff.
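
Putting those steps together, the wait-time calculation can be sketched like this (illustrative only; the SDK's Retrying class is the actual implementation):

import random

def compute_wait_time(attempt: int, retry_after: float | None, base_delay: float = 1.0) -> float:
    """Derive the wait before the next retry, mirroring the steps above."""
    backoff = base_delay * (2 ** attempt)           # exponential backoff
    if retry_after is not None and retry_after <= 60:
        wait = max(retry_after, backoff)            # honor a sane Retry-After
    else:
        wait = backoff                              # missing or abusive Retry-After: ignore it
    return wait * (1.0 + random.uniform(0.0, 0.3))  # add 0-30% jitter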

Strategy Comparison

Scenario                                           Recommended Strategy                   Why
Single process, known quota                        token_bucket                           Simple, predictable, no overhead
Multiple processes sharing quota                   adaptive                               Automatically adjusts based on 429s + jitter prevents sync
API returns 429 frequently                         adaptive                               Learns from server feedback
Stable workload, dedicated quota                   token_bucket                           No need for dynamic adjustment
CI/CD with variable load                           adaptive                               Handles concurrent pipeline runs with jitter desync
Server degrades gracefully (latency before 429s)   adaptive + CongestionAwareHttpClient   Latency-based concurrency control

Congestion Aware (EXPERIMENTAL)

Experimental Feature

CongestionAwareHttpClient is experimental. In most scenarios, the adaptive rate limiter alone provides equivalent or better results. This decorator MAY be useful in specific edge cases described below.

When to Consider
  1. Server degrades gracefully: If your server's latency increases significantly before returning 429s, latency-based detection can provide earlier backpressure.

  2. Standalone concurrency control: If you don't need rate limiting but want to prevent overwhelming a slow server.

  3. Long-running requests: For workflows where concurrency matters more than rate (e.g., Agent::chat() with 10-30s requests).

Why It Often Doesn't Help

In most API scenarios with quotas:

  • The server returns 429s quickly (before latency degrades noticeably)
  • The adaptive rate limiter reacts to 429s faster than latency-based detection
  • Combining both provides minimal additional benefit

Simulations showed that adaptive alone achieves similar or better success rates than adaptive + CongestionAwareHttpClient.

How It Works

Uses Little's Law (pressure = throughput × latency) to detect congestion:

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CongestionAwareHttpClient                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   CONCURRENCY (Semaphore)                                                   │
│   ──────────────────────                                                    │
│   Limits in-flight requests. Adjusts based on pressure:                     │
│                                                                              │
│       pressure = throughput × latency (Little's Law)                        │
│                                                                              │
│   • pressure > threshold → reduce concurrency                               │
│   • pressure < threshold → cautiously increase concurrency                  │
│                                                                              │
│   LATENCY (EMA Tracking)                                                    │
│   ──────────────────────                                                    │
│   Tracks latency via Exponential Moving Average for stable signal.          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
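
In pseudocode, the pressure signal and the resulting concurrency adjustment look roughly like this (an illustrative sketch that ignores the growth_probability gate; not the decorator's actual code):

def update_pressure(latency_ema: float, new_latency: float, throughput: float, alpha: float = 0.2) -> tuple[float, float]:
    """Smooth latency with an EMA and compute pressure via Little's Law."""
    latency_ema = alpha * new_latency + (1 - alpha) * latency_ema
    pressure = throughput * latency_ema  # Little's Law: work in flight = throughput x latency
    return latency_ema, pressure

def adjust_concurrency(concurrency: int, pressure: float, threshold: float = 2.0, max_concurrency: int = 8) -> int:
    if pressure > threshold:
        return max(1, concurrency - 1)            # reduce concurrency under congestion
    return min(max_concurrency, concurrency + 1)  # cautiously increase otherwise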

Composition Pattern

CongestionAwareHttpClient is designed to be composed with rate limiters:

from stkai import STKAI, RemoteQuickCommand, EnvironmentAwareHttpClient
from stkai._rate_limit import CongestionAwareHttpClient, AdaptiveRateLimitedHttpClient

# Disable global rate limiting (we'll configure manually)
STKAI.configure(rate_limit={"enabled": False})

# Layer 1: Base HTTP client
base = EnvironmentAwareHttpClient()

# Layer 2: Concurrency control (inner)
congestion = CongestionAwareHttpClient(
    delegate=base,
    max_concurrency=8,       # Max in-flight requests
    pressure_threshold=2.0,  # Backpressure when pressure > 2.0
)

# Layer 3: Rate limiting (outer) - optional
client = AdaptiveRateLimitedHttpClient(
    delegate=congestion,
    max_requests=100,
    time_window=60.0,
)

# Use with RQC
rqc = RemoteQuickCommand(slug_name="my-command", http_client=client)

Configuration Parameters

Parameter           Type   Default  Description
max_concurrency     int    8        Maximum concurrent in-flight requests
pressure_threshold  float  2.0      Backpressure when pressure exceeds this
latency_alpha       float  0.2      EMA smoothing factor (lower = more stable)
growth_probability  float  0.3      Probability of increasing concurrency

Recommendation

For most use cases, use the adaptive strategy alone:

Scenario                          Recommendation
RQC, any number of processes      adaptive (balanced preset)
Agent::chat()                     adaptive (balanced preset)
Server with graceful degradation  Consider adaptive + CongestionAwareHttpClient
Experimentation/research          CongestionAwareHttpClient standalone

Presets (Adaptive Strategy)

Presets provide pre-tuned configurations for the adaptive strategy. Instead of manually tuning penalty_factor, recovery_factor, etc., choose a preset that matches your use case:

from dataclasses import asdict
from stkai import STKAI, RateLimitConfig

# Conservative: stability over throughput
STKAI.configure(rate_limit=asdict(RateLimitConfig.conservative_preset(max_requests=20)))

# Balanced: sensible middle-ground (recommended for most cases)
STKAI.configure(rate_limit=asdict(RateLimitConfig.balanced_preset(max_requests=50)))

# Optimistic: throughput over stability
STKAI.configure(rate_limit=asdict(RateLimitConfig.optimistic_preset(max_requests=80)))

Preset Comparison

Preset                 max_wait_time   min_rate_floor  penalty_factor    recovery_factor
conservative_preset()  120s (patient)  0.05 (5%)       0.5 (aggressive)  0.02 (slow)
balanced_preset()      45s             0.1 (10%)       0.3 (moderate)    0.05 (medium)
optimistic_preset()    20s             0.3 (30%)       0.15 (light)      0.1 (fast)

Validated via Simulations

These presets were tuned and validated via simulations for both workloads. Key difference: RQC (~200ms latency) benefits from rate limiting at 3+ processes, while Agent (~15s latency) only needs it at 7+ processes due to natural throughput limits. See simulations/results/rqc/reference/ and simulations/results/agent/reference/ for detailed findings.

When to Use Each Preset

Conservative — Stability over throughput

  • Critical batch jobs that cannot fail
  • Many concurrent processes (5+) sharing a tight quota
  • Jobs that run overnight and can afford to be slow
  • When 429 errors have significant business impact

Balanced — Sensible default

  • General batch processing
  • 2-5 concurrent processes sharing quota
  • When you want reasonable throughput with good stability
  • Recommended starting point for most applications

Optimistic — Throughput over stability

  • Interactive CLI tools that need fast feedback
  • Single process with dedicated quota
  • When you have external retry logic or can tolerate failures
  • Short-lived scripts where waiting is unacceptable

Calculating max_requests

Presets accept max_requests and time_window as parameters. Calculate based on your environment:

max_requests = (API quota) / (number of concurrent processes)

Example: Your team has a quota of 100 req/min. You run 3 batch jobs concurrently:

# Each process gets ~33 req/min
STKAI.configure(rate_limit=asdict(
    RateLimitConfig.balanced_preset(max_requests=33)
))

Be conservative with the division

It's better to underestimate than overestimate. If unsure, divide by a higher number. You can always increase later.

Practical Scenarios

Scenario 1: CI/CD Pipeline (Single Process)

A GitHub Actions job that processes code files. Runs alone, predictable workload.

from stkai import STKAI

# Simple token bucket - we know we're the only consumer
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "token_bucket",
        "max_requests": 60,      # Our full quota
        "time_window": 60.0,
        "max_wait_time": 120.0,  # Patient - job can wait
    }
)

Scenario 2: Multiple Batch Jobs Sharing Quota

Three Python processes running simultaneously, each processing different data. They share a 100 req/min quota.

from dataclasses import asdict
from stkai import STKAI, RateLimitConfig

# Adaptive with conservative settings - let processes coordinate via 429s
STKAI.configure(rate_limit=asdict(
    RateLimitConfig.conservative_preset(
        max_requests=30,  # 100 / 3 ≈ 33, round down for safety
    )
))

Scenario 3: Interactive CLI Tool

A developer tool that needs fast feedback. User is waiting for response.

from dataclasses import asdict
from stkai import STKAI, RateLimitConfig

# Optimistic - fail fast, let user retry manually
STKAI.configure(rate_limit=asdict(
    RateLimitConfig.optimistic_preset(
        max_requests=50,
    )
))

Scenario 4: Batch Processing with execute_many()

Processing 500 files using execute_many() with 8 workers. Note that all workers share the same rate limiter.

from dataclasses import asdict
from stkai import STKAI, RateLimitConfig, RemoteQuickCommand

STKAI.configure(rate_limit=asdict(
    RateLimitConfig.balanced_preset(max_requests=40)
))

rqc = RemoteQuickCommand(
    slug_name="analyze-code",
    max_workers=8,  # 8 threads, but all share the same rate limiter
)

# The rate limiter ensures we don't exceed 40 req/min
# regardless of how many workers are active
responses = rqc.execute_many(requests)

Important Considerations

Rate Limiting Only Applies to POST Requests

The SDK only rate-limits POST requests (which create executions). GET requests (used for polling) are not rate-limited:

POST /executions → Rate limited (consumes token)
GET /executions/{id} → NOT rate limited (polling is free)

This means your effective quota consumption depends on how many new executions you create, not on polling frequency.

Multiple Processes = Divide the Quota

Rate limiting is per-process. If you have 3 processes, each needs its own allocation:

# WRONG: Each process thinks it has full quota
STKAI.configure(rate_limit={"max_requests": 100})  # All 3 processes do this!

# RIGHT: Divide quota among processes
STKAI.configure(rate_limit={"max_requests": 33})   # 100 / 3

max_wait_time Can Block Your Application

If max_wait_time is too high, threads may block for a long time waiting for tokens:

# This can block a thread for up to 5 minutes!
STKAI.configure(rate_limit={"max_wait_time": 300})

# Better: fail faster and let retry logic handle it
STKAI.configure(rate_limit={"max_wait_time": 30})

Rate Limiter is Per-Instance

Each RemoteQuickCommand or Agent instance has its own rate limiter (via its HttpClient). They don't share state by default:

# These have SEPARATE rate limiters - combined they may exceed quota!
rqc1 = RemoteQuickCommand(slug_name="command-1")
rqc2 = RemoteQuickCommand(slug_name="command-2")
agent = Agent(agent_id="my-agent")

To share a rate limiter, pass the same HttpClient instance:

from stkai import RemoteQuickCommand, Agent, EnvironmentAwareHttpClient

# Create a single HTTP client (rate limiter included via STKAI.configure)
shared_client = EnvironmentAwareHttpClient()

# All instances share the same rate limiter
rqc1 = RemoteQuickCommand(slug_name="command-1", http_client=shared_client)
rqc2 = RemoteQuickCommand(slug_name="command-2", http_client=shared_client)
agent = Agent(agent_id="my-agent", http_client=shared_client)

Decentralized Best-Effort Approach

All rate limiting strategies in this SDK are decentralized — each process operates independently with minimal or no coordination between them.

How It Works

  • Each process maintains its own local rate limiter state
  • No shared state or communication between processes
  • AIMD adjusts rate based on local observations (429 responses)
  • Jitter (±20%) helps desynchronize processes sharing a quota
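
A per-process jitter source of the kind described above could be derived like this (names here are illustrative, not SDK API):

import os
import random
import socket

# Deterministic per-process RNG: each process gets a stable but distinct seed.
seed = hash((socket.gethostname(), os.getpid()))
rng = random.Random(seed)

def jittered(value: float, spread: float = 0.2) -> float:
    """Scale a value by a factor in [1 - spread, 1 + spread] (±20% by default)."""
    return value * rng.uniform(1.0 - spread, 1.0 + spread)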

Trade-offs

Aspect        Decentralized (this SDK)               Centralized (Redis, etc.)
Coordination  None — best effort                     Full — precise quota enforcement
Complexity    Simple — no infrastructure             Complex — requires Redis/DB
Latency       Zero overhead                          Network round-trip per request
Failure mode  Graceful — continues if others fail    Dependent — fails if Redis fails
Accuracy      Approximate — processes may overshoot  Exact — global counter

This is one of the reasons why 429 errors are inevitable even with rate limiting enabled.

When to Consider Centralized Solutions

If your scenario requires precise quota enforcement across many processes, consider centralized rate limiting with:

  • Redis — INCR + EXPIRE for a sliding window, or the Redis Cell module
  • Database — Postgres advisory locks or atomic counters
  • Distributed rate limiters — Envoy, Kong, or cloud API gateways

The SDK's decentralized approach is ideal for:

  • Single process applications
  • Small clusters (2-5 processes) with shared quotas
  • Scenarios where simplicity and resilience matter more than precision

Practical Guidance

For most StackSpot AI SDK use cases, the decentralized approach is sufficient. The adaptive strategy with AIMD and jitter handles multi-process scenarios well without requiring external infrastructure. Only consider centralized solutions if you have strict compliance requirements or consistently observe quota violations despite proper configuration.

The easiest way to enable rate limiting is via STKAI.configure(). When enabled, EnvironmentAwareHttpClient automatically wraps requests with rate limiting:

from stkai import STKAI, RemoteQuickCommand, Agent

# Enable rate limiting globally
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "token_bucket",
        "max_requests": 30,
        "time_window": 60.0,
    }
)

# All clients now use rate limiting automatically
rqc = RemoteQuickCommand(slug_name="my-command")
agent = Agent(agent_id="my-agent")

Configuration via Code

from stkai import STKAI

# Token Bucket (simple, predictable)
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "token_bucket",
        "max_requests": 30,
        "time_window": 60.0,
        "max_wait_time": 60.0,  # Timeout after 60s waiting
    }
)

# Adaptive (dynamic, handles 429)
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "adaptive",
        "max_requests": 100,
        "time_window": 60.0,
        "min_rate_floor": 0.1,       # Never below 10%
        "penalty_factor": 0.2,       # Reduce by 20% on 429
        "recovery_factor": 0.05,     # Increase by 5% on success
    }
)

# Unlimited wait time (wait indefinitely for token)
STKAI.configure(rate_limit={"max_wait_time": None})  # or "unlimited"

Configuration via Environment Variables

STKAI_RATE_LIMIT_ENABLED=true
STKAI_RATE_LIMIT_STRATEGY=adaptive
STKAI_RATE_LIMIT_MAX_REQUESTS=50
STKAI_RATE_LIMIT_TIME_WINDOW=60.0
STKAI_RATE_LIMIT_MAX_WAIT_TIME=unlimited  # or "none", "null"
STKAI_RATE_LIMIT_MIN_RATE_FLOOR=0.1
STKAI_RATE_LIMIT_PENALTY_FACTOR=0.2
STKAI_RATE_LIMIT_RECOVERY_FACTOR=0.05

RateLimitConfig Fields

Field            Type                          Default          Description
enabled          bool                          False            Enable rate limiting
strategy         "token_bucket" | "adaptive"   "token_bucket"   Rate limiting algorithm
max_requests     int                           100              Max requests per time window
time_window      float                         60.0             Time window in seconds
max_wait_time    float | None                  45.0             Max wait for token (None = unlimited)
min_rate_floor   float                         0.1 (adaptive)   Min rate as fraction of max
penalty_factor   float                         0.3 (adaptive)   Rate reduction on 429
recovery_factor  float                         0.05 (adaptive)  Rate increase on success

Manual Configuration

For more control, you can manually create rate-limited clients:

from stkai import TokenBucketRateLimitedHttpClient, EnvironmentAwareHttpClient, RemoteQuickCommand

# Create rate-limited client manually
http_client = TokenBucketRateLimitedHttpClient(
    delegate=EnvironmentAwareHttpClient(),
    max_requests=30,
    time_window=60.0,
)

# Use with specific client
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=http_client,
)

Exception Hierarchy

The SDK provides a clear exception hierarchy for rate limiting errors:

RetryableError (base - automatically retried)
├── ClientSideRateLimitError      # Base for client-side rate limit errors
│   └── TokenAcquisitionTimeoutError     # Timeout waiting for token (max_wait_time exceeded)
└── ServerSideRateLimitError      # HTTP 429 from server (contains response for Retry-After)

Exception                     Raised By                                                         When
TokenAcquisitionTimeoutError  TokenBucketRateLimitedHttpClient, AdaptiveRateLimitedHttpClient   Token wait exceeds max_wait_time
ServerSideRateLimitError      AdaptiveRateLimitedHttpClient                                     Server returns HTTP 429
requests.HTTPError            requests (raised directly)                                        HTTP 429 without the adaptive strategy

All exceptions inherit from RetryableError, which the Retrying class automatically retries with exponential backoff.

Timeout Handling

Both strategies raise TokenAcquisitionTimeoutError when a thread waits too long for a token:

from stkai import TokenBucketRateLimitedHttpClient, TokenAcquisitionTimeoutError, StkCLIHttpClient

http_client = TokenBucketRateLimitedHttpClient(
    delegate=StkCLIHttpClient(),
    max_requests=10,
    time_window=60.0,
    max_wait_time=45.0,  # Give up after 45 seconds
)

try:
    response = http_client.post(url, data=payload)
except TokenAcquisitionTimeoutError as e:
    print(f"Timeout after {e.waited:.1f}s (max: {e.max_wait_time}s)")
    # Handle timeout: retry later, skip request, or fail gracefully

Value                Behavior
60.0 (default)       Wait up to 60 seconds for a token
None or "unlimited"  Wait indefinitely (no timeout)
0.1                  Fail-fast mode (almost immediate timeout)

Choosing max_wait_time

A good rule of thumb is to set max_wait_time equal to time_window. This ensures at least one full rate limit cycle can complete before timing out.
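
For example, with the token bucket settings used earlier on this page:

from stkai import STKAI

# Rule of thumb: max_wait_time matches time_window, so a waiting request
# survives at least one full refill cycle before timing out.
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "token_bucket",
        "max_requests": 30,
        "time_window": 60.0,
        "max_wait_time": 60.0,
    }
)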

Thread Safety

Both rate-limiting strategies are thread-safe and work correctly with:

  • execute_many() concurrent workers
  • Multi-threaded applications
  • Shared client instances

from stkai import STKAI, RemoteQuickCommand

STKAI.configure(
    rate_limit={"enabled": True, "max_requests": 30}
)

rqc = RemoteQuickCommand(
    slug_name="my-command",
    max_workers=16,  # 16 concurrent workers, still rate-limited
)

Custom HTTP Client

Implement the HttpClient interface for custom behavior:

from typing import Any
import requests
from stkai import HttpClient, RemoteQuickCommand

class MyCustomHttpClient(HttpClient):
    def get(
        self,
        url: str,
        headers: dict[str, str] | None = None,
        timeout: int = 30,
    ) -> requests.Response:
        # Custom GET logic
        return requests.get(url, headers=headers, timeout=timeout)

    def post(
        self,
        url: str,
        data: dict[str, Any] | None = None,
        headers: dict[str, str] | None = None,
        timeout: int = 30,
    ) -> requests.Response:
        # Custom POST logic
        return requests.post(url, json=data, headers=headers, timeout=timeout)

# Use custom client
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=MyCustomHttpClient(),
)

Decorator Pattern

Rate limiting clients use the decorator pattern - they wrap another client:

from stkai import (
    AdaptiveRateLimitedHttpClient,
    TokenBucketRateLimitedHttpClient,
    StandaloneHttpClient,
    ClientCredentialsAuthProvider,
)

# Build a decorated chain
auth_provider = ClientCredentialsAuthProvider(
    client_id="id",
    client_secret="secret",
)

# Base client with authentication
base_client = StandaloneHttpClient(auth_provider=auth_provider)

# Add fixed rate limiting
rate_limited = TokenBucketRateLimitedHttpClient(
    delegate=base_client,
    max_requests=50,
    time_window=60.0,
)

# Add adaptive rate limiting on top
adaptive_client = AdaptiveRateLimitedHttpClient(
    delegate=rate_limited,  # Wrap the rate-limited client
    max_requests=100,
    time_window=60.0,
)

Testing with Mock Client

Create a mock client for testing:

from unittest.mock import Mock, MagicMock
import requests
from stkai import HttpClient, RemoteQuickCommand

# Create mock
mock_client = Mock(spec=HttpClient)

# Configure POST response
mock_response = MagicMock(spec=requests.Response)
mock_response.status_code = 200
mock_response.json.return_value = {"execution_id": "exec-123"}
mock_response.raise_for_status.return_value = None
mock_client.post.return_value = mock_response

# Configure GET response
get_response = MagicMock(spec=requests.Response)
get_response.status_code = 200
get_response.json.return_value = {
    "progress": {"status": "COMPLETED"},
    "result": {"data": "test"},
}
get_response.raise_for_status.return_value = None
mock_client.get.return_value = get_response

# Use in tests
rqc = RemoteQuickCommand(
    slug_name="test-command",
    http_client=mock_client,
)

Thread Safety

All built-in HTTP clients are thread-safe:

  • EnvironmentAwareHttpClient - Delegates to thread-safe clients
  • StkCLIHttpClient - Stateless, safe
  • StandaloneHttpClient - Auth provider handles token caching
  • TokenBucketRateLimitedHttpClient - Uses threading.Lock()
  • AdaptiveRateLimitedHttpClient - Uses threading.Lock()

Safe to share across threads and with execute_many():

# Thread-safe: shared client with concurrent workers
http_client = TokenBucketRateLimitedHttpClient(...)

rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=http_client,
    max_workers=16,  # 16 concurrent threads
)

responses = rqc.execute_many(requests)

Next Steps