HTTP Client¶
The SDK routes all HTTP communication through a unified HttpClient abstraction. This lets you customize authentication, apply rate limiting, and swap in mocks for testing.
Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│ Your Application │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ RemoteQuickCommand ─────┐ │
│ ├──► HttpClient ──► StackSpot AI API │
│ Agent ──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Built-in Implementations¶
EnvironmentAwareHttpClient (Default)¶
Automatically detects the runtime environment and uses the appropriate client:
- CLI available → Uses `StkCLIHttpClient`
- Credentials configured → Uses `StandaloneHttpClient`
- Neither → Raises `ValueError` with clear instructions
from stkai import RemoteQuickCommand
# Works automatically in any environment
rqc = RemoteQuickCommand(slug_name="my-command")
The detection happens lazily on the first request, allowing you to call STKAI.configure() after import.
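The lazy-detection flow can be pictured with a small self-contained sketch. The helper `cli_available()` and the string placeholders standing in for real client classes are illustrative assumptions, not the SDK's actual code:

```python
import os

def cli_available() -> bool:
    # Stub for illustration; the real SDK would probe for the StackSpot CLI.
    return False

class LazyEnvironmentClient:
    """Toy model of lazy environment detection."""

    def __init__(self) -> None:
        self._delegate = None  # nothing is resolved at construction time

    def _resolve(self) -> str:
        if cli_available():
            return "StkCLIHttpClient"
        if os.getenv("STKAI_AUTH_CLIENT_ID") and os.getenv("STKAI_AUTH_CLIENT_SECRET"):
            return "StandaloneHttpClient"
        raise ValueError("No CLI found and no credentials configured")

    def post(self, url: str) -> str:
        if self._delegate is None:  # detection happens on the first request
            self._delegate = self._resolve()
        return self._delegate
```

Because nothing is resolved in `__init__`, credentials configured after import (e.g., via `STKAI.configure()`) are still picked up on the first request.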
Zero Configuration
With EnvironmentAwareHttpClient, you don't need to worry about which client to use:
- Development: Install CLI and run `stk login`
- Production/CI: Set `STKAI_AUTH_CLIENT_ID` and `STKAI_AUTH_CLIENT_SECRET`
StkCLIHttpClient¶
Explicitly delegates authentication to the StackSpot CLI (oscli):
from stkai import RemoteQuickCommand, StkCLIHttpClient
# Explicit CLI usage
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=StkCLIHttpClient(),
)
StandaloneHttpClient¶
Explicitly uses client credentials for environments without StackSpot CLI:
from stkai import (
    RemoteQuickCommand,
    StandaloneHttpClient,
    ClientCredentialsAuthProvider,
)

# Create auth provider
auth_provider = ClientCredentialsAuthProvider(
    client_id="your-client-id",
    client_secret="your-client-secret",
)

# Create HTTP client
http_client = StandaloneHttpClient(auth_provider=auth_provider)

# Use with RQC
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=http_client,
)
Or use the global configuration:
from stkai import STKAI, create_standalone_auth, StandaloneHttpClient
# Configure credentials globally
STKAI.configure(
    auth={
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    }
)
# Create client using global config
auth_provider = create_standalone_auth()
http_client = StandaloneHttpClient(auth_provider=auth_provider)
TokenBucketRateLimitedHttpClient¶
Wraps another client with Token Bucket rate limiting:
from stkai import TokenBucketRateLimitedHttpClient, EnvironmentAwareHttpClient
http_client = TokenBucketRateLimitedHttpClient(
    delegate=EnvironmentAwareHttpClient(),
    max_requests=30,   # Requests per window
    time_window=60.0,  # Window in seconds
)
AdaptiveRateLimitedHttpClient¶
Adds adaptive rate control with AIMD algorithm:
from stkai import AdaptiveRateLimitedHttpClient, EnvironmentAwareHttpClient
http_client = AdaptiveRateLimitedHttpClient(
    delegate=EnvironmentAwareHttpClient(),
    max_requests=100,
    time_window=60.0,
    min_rate_floor=0.1,    # Never below 10%
    penalty_factor=0.2,    # Reduce by 20% on 429
    recovery_factor=0.05,  # Increase by 5% on success
)
429 Handling
When the server returns HTTP 429, AdaptiveRateLimitedHttpClient applies the AIMD penalty (reduces rate) and raises ServerSideRateLimitError. The actual retry logic is handled by the Retrying class, which respects the Retry-After header.
Rate Limiting¶
Terminology: Rate Limiting vs Throttling¶
The SDK uses "rate limiting" terminology, but the actual behavior is closer to throttling:
| Concept | Side | Behavior | Philosophy |
|---|---|---|---|
| Rate Limiting | Server | Rejects requests exceeding the limit (HTTP 429) | Reactive/Punitive |
| Throttling | Client | Delays requests to stay under the limit | Proactive/Preventive |
Why "Rate Limiting" terminology?
- Industry convention: AWS SDK, Google Cloud, and other popular SDKs use "rate limit" for client-side features
- Discoverability: Developers search for "rate limiting" when facing quota issues
- Problem alignment: The problem being solved is "don't violate the server's rate limit"
The actual behavior is hybrid:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Client-Side Rate Control Behavior │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. THROTTLING (Primary) │
│ Delays requests by waiting for tokens in a queue. │
│ Requests are NOT rejected — they wait their turn. │
│ │
│ 2. REJECTION (Secondary) │
│ Raises exceptions when: │
│ • Wait time exceeds max_wait_time → TokenAcquisitionTimeoutError │
│ • Server returns HTTP 429 → ServerSideRateLimitError (adaptive only) │
│ │
│ This means: requests wait in queue first, exceptions are last resort. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
This hybrid approach maximizes successful requests while providing clear failure modes when limits can't be respected.
Why Rate Limiting Matters for StackSpot AI¶
Agents and Remote Quick Commands (RQCs) make calls to LLM models that have shared quotas and costs per request. Without proper rate control:
- HTTP 429 errors flood your logs and waste retry cycles
- Service degradation affects other teams sharing the same quota
- Unexpected costs from runaway batch jobs
- Thundering herd when multiple processes start simultaneously
Rate limiting is especially important for:
| Scenario | Risk Without Rate Limiting |
|---|---|
| Batch processing (e.g., processing 1000 files) | Burst of requests exhausts quota in seconds |
| Multiple CI/CD pipelines | Pipelines compete for shared quota |
| Microservices with multiple replicas | Each replica thinks it has full quota |
| Development + Production sharing quota | Dev experiments impact production |
Understanding 429 Errors: Why They Still Happen¶
Important: Rate Limiting Does Not Eliminate 429 Errors
Even with perfectly configured rate limiting, your application will still receive HTTP 429 responses. This is normal and expected behavior, not a bug.
Why 429s are inevitable:
| Factor | Explanation |
|---|---|
| Fixed vs Sliding Windows | Server resets quota in fixed intervals (e.g., every 60s), while client token bucket refills continuously. Timing mismatches cause 429s at window boundaries. |
| Clock Skew | Client and server clocks are never perfectly synchronized. A request sent "within quota" may arrive after the server's window reset. |
| Network Latency | Variable network delays mean requests don't arrive in the order or timing they were sent. |
| Burst Patterns | Even with correct average rate, Poisson-distributed arrivals have bursts that temporarily exceed limits. |
| Shared Quotas | Multiple processes/services sharing a quota cannot perfectly coordinate without a central arbiter. |
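The "Burst Patterns" factor is easy to demonstrate: even when the average request rate exactly matches the quota, Poisson-distributed arrivals overflow individual windows. A short, self-contained simulation with illustrative numbers (not SDK code):

```python
import math
import random

random.seed(1)  # deterministic run for illustration
QUOTA = 10      # server allows 10 requests per window
WINDOWS = 1000

def poisson(lam: float) -> int:
    # Knuth's algorithm for sampling a Poisson-distributed count
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

# Client sends at an average rate exactly equal to the quota...
counts = [poisson(QUOTA) for _ in range(WINDOWS)]
average = sum(counts) / WINDOWS             # ...and the average is indeed ~10...
overflows = sum(c > QUOTA for c in counts)  # ...yet many windows still exceed it
```

For a Poisson process with mean 10, roughly 40% of windows exceed 10 arrivals, even though the long-run average respects the quota. Those overflowing windows are exactly where 429s come from.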
The correct mental model:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Rate Limiting Mental Model │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ❌ WRONG: "Rate limiting prevents 429 errors" │
│ │
│ ✅ RIGHT: "Rate limiting REDUCES 429 errors and makes retry effective" │
│ │
│ Without rate limiting: 429 rate = 300%+ (chaos, retry exhaustion) │
│ With rate limiting: 429 rate = 5-10% (manageable, retry succeeds) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
This is why the SDK combines three mechanisms:
- Rate Limiting — Reduces 429 frequency from catastrophic to manageable
- Retry with Backoff — Recovers from the 429s that still occur
- Retry-After Header — Server tells client exactly when to retry
Industry perspective: This behavior is well-documented in distributed systems literature:
"We design our systems to reduce the probability of failure, but impossible to build systems that never fail. [...] Retries allow clients to survive these random partial failures and short-lived transient failures."
— AWS Builders Library: Timeouts, retries, and backoff with jitter
"Race conditions everywhere. Two requests at the exact same millisecond? You're in trouble. [...] ~5% error margin compared to true sliding window."
Key insight: Rate limiting algorithms (token bucket, sliding window) are deterministic in isolation, but become non-deterministic in distributed systems due to race conditions, clock skew, and network jitter. Practical implementations accept a ~5% error margin, trading perfect accuracy for performance. This is why rate limiting is best understood as "best-effort" rather than a guarantee — the goal is not zero 429s, but graceful handling of inevitable 429s through retry with exponential backoff and jitter.
Practical Implication
Don't disable retry logic thinking "my rate limiting is perfect." Always keep retry enabled — it's your safety net for the 429s that will inevitably occur.
Strategies: Token Bucket vs Adaptive¶
The SDK offers two rate limiting strategies, each suited for different scenarios.
Token Bucket Strategy¶
When to use:
- You have a fixed, known quota (e.g., "100 requests/minute")
- Your process runs alone or has a dedicated quota allocation
- You want predictable, simple rate limiting
How it works:
┌──────────────────────────────────────────────────────────────────┐
│ Token Bucket Algorithm │
├──────────────────────────────────────────────────────────────────┤
│ │
│ Bucket: [●●●●●○○○○○] (5 tokens available, 5 used) │
│ │
│ • Tokens refill over time at: max_requests / time_window │
│ • Each POST request consumes 1 token │
│ • When empty, requests wait until tokens available │
│ • If waiting exceeds max_wait_time → TokenAcquisitionTimeoutError │
│ • GET requests (polling) pass through without consuming tokens │
│ │
└──────────────────────────────────────────────────────────────────┘
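The core of the algorithm above fits in a few lines of plain Python. This is a sketch, not the SDK's implementation (the SDK additionally handles waiting, timeouts, and thread safety):

```python
import time

class TokenBucket:
    """Tokens refill continuously at max_requests / time_window per second."""

    def __init__(self, max_requests: int, time_window: float) -> None:
        self.capacity = float(max_requests)
        self.refill_rate = max_requests / time_window  # tokens per second
        self.tokens = self.capacity                    # bucket starts full
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_acquire(self) -> bool:
        """Consume one token (a POST request) if available; else the caller waits."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Note that a full bucket permits an initial burst up to `max_requests` before the refill rate becomes the binding constraint.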
Points of attention:
- Does not react to HTTP 429 from the server — it only controls outgoing rate
- If your quota is shared with other processes, you may still get 429s
- Best combined with retry logic (which the SDK provides automatically)
Adaptive Strategy (AIMD)¶
When to use:
- Multiple processes share the same quota
- Your quota is unpredictable or varies over time
- You want the SDK to automatically adjust based on server feedback
How it works:
┌──────────────────────────────────────────────────────────────────┐
│ AIMD Algorithm │
├──────────────────────────────────────────────────────────────────┤
│ │
│ On SUCCESS: │
│ effective_rate += max_requests × recovery_factor × jitter │
│ (gradual increase, jittered ±20%) │
│ │
│ On HTTP 429: │
│ effective_rate *= (1 - penalty_factor × jitter) │
│ (aggressive decrease, jittered ±20%) │
│ raise ServerSideRateLimitError → Retrying handles retry │
│ │
│ On TOKEN WAIT: │
│ sleep(wait_time × jitter) │
│ (sleep jitter ±20% to spread workers) │
│ │
│ Constraints: │
│ • Floor: effective_rate ≥ max_requests × min_rate_floor │
│ • Ceiling: effective_rate ≤ max_requests │
│ │
│ Anti-Thundering Herd: │
│ • Structural jitter: penalty/recovery vary ±20% per process │
│ • Sleep jitter: token wait varies ±20% │
│ • Deterministic RNG per process (hostname+pid seed) │
│ │
└──────────────────────────────────────────────────────────────────┘
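Ignoring jitter, the AIMD update above reduces to a few lines. This sketch reuses the documented parameter names; the SDK additionally applies the ±20% jitter shown in the diagram:

```python
def aimd_update(
    rate: float,
    event: str,                 # "success" or "429"
    max_requests: float = 100.0,
    min_rate_floor: float = 0.1,
    penalty_factor: float = 0.2,
    recovery_factor: float = 0.05,
) -> float:
    if event == "429":
        rate *= 1 - penalty_factor              # multiplicative decrease
    elif event == "success":
        rate += max_requests * recovery_factor  # additive increase
    floor = max_requests * min_rate_floor
    return min(max_requests, max(floor, rate))  # clamp between floor and ceiling
```

Note the asymmetry: a single 429 cuts the rate by 20%, while recovering that loss takes several successful requests. That asymmetry is the convergence-time caveat listed under Points of attention.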
Points of attention:
- Convergence time: After a 429, it takes several successful requests to recover full rate
- Cold start: Starts at `max_requests` and decreases on first 429 — may cause initial burst
- `recovery_factor` too low = slow recovery after 429 spike
- `penalty_factor` too low = doesn't back off enough, keeps hitting 429s
429 Handling: When the server returns HTTP 429 (Too Many Requests), the AdaptiveRateLimitedHttpClient:
1. Applies AIMD penalty: reduces effective rate by `penalty_factor`
2. Raises `ServerSideRateLimitError`: contains the response for `Retry-After` parsing
The Retrying class then handles the retry:
1. Parses `Retry-After` header: if present and ≤ 60s, uses it as wait time
2. Calculates wait time: uses max(Retry-After, exponential backoff)
3. Adds jitter (0-30%): prevents thundering herd
4. Retries the request: up to `max_retries` times
Protection against abusive Retry-After
The Retrying class ignores Retry-After values greater than 60 seconds to protect against buggy or malicious servers. In such cases, it falls back to exponential backoff.
Strategy Comparison¶
| Scenario | Recommended Strategy | Why |
|---|---|---|
| Single process, known quota | `token_bucket` | Simple, predictable, no overhead |
| Multiple processes sharing quota | `adaptive` | Automatically adjusts based on 429s + jitter prevents sync |
| API returns 429 frequently | `adaptive` | Learns from server feedback |
| Stable workload, dedicated quota | `token_bucket` | No need for dynamic adjustment |
| CI/CD with variable load | `adaptive` | Handles concurrent pipeline runs with jitter desync |
| Server degrades gracefully (latency before 429s) | `adaptive` + `CongestionAwareHttpClient` | Latency-based concurrency control |
Congestion Aware (EXPERIMENTAL)¶
Experimental Feature
CongestionAwareHttpClient is experimental. In most scenarios, the adaptive rate limiter alone provides equivalent or better results. This decorator MAY be useful in specific edge cases described below.
When to Consider¶
- Server degrades gracefully: If your server's latency increases significantly before returning 429s, latency-based detection can provide earlier backpressure.
- Standalone concurrency control: If you don't need rate limiting but want to prevent overwhelming a slow server.
- Long-running requests: For workflows where concurrency matters more than rate (e.g., `Agent::chat()` with 10-30s requests).
Why It Often Doesn't Help¶
In most API scenarios with quotas:
- The server returns 429s quickly (before latency degrades noticeably)
- The `adaptive` rate limiter reacts to 429s faster than latency-based detection
- Combining both provides minimal additional benefit
Simulations showed that adaptive alone achieves similar or better success rates than adaptive + CongestionAwareHttpClient.
How It Works¶
Uses Little's Law (pressure = throughput × latency) to detect congestion:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CongestionAwareHttpClient │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CONCURRENCY (Semaphore) │
│ ────────────────────── │
│ Limits in-flight requests. Adjusts based on pressure: │
│ │
│ pressure = throughput × latency (Little's Law) │
│ │
│ • pressure > threshold → reduce concurrency │
│ • pressure < threshold → cautiously increase concurrency │
│ │
│ LATENCY (EMA Tracking) │
│ ────────────────────── │
│ Tracks latency via Exponential Moving Average for stable signal. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
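The pressure signal is just Little's Law applied to observed metrics. A tiny numeric illustration (the threshold value matches the documented default of 2.0):

```python
PRESSURE_THRESHOLD = 2.0  # default backpressure threshold

def pressure(throughput_rps: float, avg_latency_s: float) -> float:
    """Little's Law: average in-flight requests = arrival rate x time in system."""
    return throughput_rps * avg_latency_s

# Healthy service: 10 req/s completing in 100 ms -> ~1 request in flight.
# Congested service: same 10 req/s but 500 ms latency -> ~5 in flight.
```

Because latency enters the product directly, rising latency pushes pressure over the threshold before the server ever returns a 429, which is exactly the "graceful degradation" signal this decorator exists for.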
Composition Pattern¶
CongestionAwareHttpClient is designed to be composed with rate limiters:
from stkai import STKAI, RemoteQuickCommand, EnvironmentAwareHttpClient
from stkai._rate_limit import CongestionAwareHttpClient, AdaptiveRateLimitedHttpClient
# Disable global rate limiting (we'll configure manually)
STKAI.configure(rate_limit={"enabled": False})
# Layer 1: Base HTTP client
base = EnvironmentAwareHttpClient()
# Layer 2: Concurrency control (inner)
congestion = CongestionAwareHttpClient(
    delegate=base,
    max_concurrency=8,       # Max in-flight requests
    pressure_threshold=2.0,  # Backpressure when pressure > 2.0
)
# Layer 3: Rate limiting (outer) - optional
client = AdaptiveRateLimitedHttpClient(
    delegate=congestion,
    max_requests=100,
    time_window=60.0,
)
# Use with RQC
rqc = RemoteQuickCommand(slug_name="my-command", http_client=client)
Configuration Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_concurrency` | `int` | `8` | Maximum concurrent in-flight requests |
| `pressure_threshold` | `float` | `2.0` | Backpressure when pressure exceeds this |
| `latency_alpha` | `float` | `0.2` | EMA smoothing factor (lower = more stable) |
| `growth_probability` | `float` | `0.3` | Probability of increasing concurrency |
Recommendation¶
For most use cases, use the adaptive strategy alone:
| Scenario | Recommendation |
|---|---|
| RQC, any number of processes | adaptive (balanced preset) |
| Agent::chat() | adaptive (balanced preset) |
| Server with graceful degradation | Consider adaptive + CongestionAwareHttpClient |
| Experimentation/research | CongestionAwareHttpClient standalone |
Presets (Adaptive Strategy)¶
Presets provide pre-tuned configurations for the adaptive strategy. Instead of manually tuning penalty_factor, recovery_factor, etc., choose a preset that matches your use case:
from dataclasses import asdict
from stkai import STKAI, RateLimitConfig
# Conservative: stability over throughput
STKAI.configure(rate_limit=asdict(RateLimitConfig.conservative_preset(max_requests=20)))
# Balanced: sensible middle-ground (recommended for most cases)
STKAI.configure(rate_limit=asdict(RateLimitConfig.balanced_preset(max_requests=50)))
# Optimistic: throughput over stability
STKAI.configure(rate_limit=asdict(RateLimitConfig.optimistic_preset(max_requests=80)))
Preset Comparison¶
| Preset | `max_wait_time` | `min_rate_floor` | `penalty_factor` | `recovery_factor` |
|---|---|---|---|---|
| `conservative_preset()` | 120s (patient) | 0.05 (5%) | 0.5 (aggressive) | 0.02 (slow) |
| `balanced_preset()` | 45s | 0.1 (10%) | 0.3 (moderate) | 0.05 (medium) |
| `optimistic_preset()` | 20s | 0.3 (30%) | 0.15 (light) | 0.1 (fast) |
Validated via Simulations
These presets were tuned and validated via simulations for both workloads. Key difference: RQC (~200ms latency) benefits from rate limiting at 3+ processes, while Agent (~15s latency) only needs it at 7+ processes due to natural throughput limits. See simulations/results/rqc/reference/ and simulations/results/agent/reference/ for detailed findings.
When to Use Each Preset¶
Conservative — Stability over throughput
- Critical batch jobs that cannot fail
- Many concurrent processes (5+) sharing a tight quota
- Jobs that run overnight and can afford to be slow
- When 429 errors have significant business impact
Balanced — Sensible default
- General batch processing
- 2-5 concurrent processes sharing quota
- When you want reasonable throughput with good stability
- Recommended starting point for most applications
Optimistic — Throughput over stability
- Interactive CLI tools that need fast feedback
- Single process with dedicated quota
- When you have external retry logic or can tolerate failures
- Short-lived scripts where waiting is unacceptable
Calculating max_requests¶
Presets accept max_requests and time_window as parameters. Calculate based on your environment:
Example: Your team has a quota of 100 req/min. You run 3 batch jobs concurrently:
from dataclasses import asdict
from stkai import STKAI, RateLimitConfig

# Each process gets ~33 req/min
STKAI.configure(rate_limit=asdict(
    RateLimitConfig.balanced_preset(max_requests=33)
))
Be conservative with the division
It's better to underestimate than overestimate. If unsure, divide by a higher number. You can always increase later.
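The division can be wrapped in a small helper. This is illustrative, not part of the SDK; the `safety` factor encodes the "round down, underestimate" advice above:

```python
def per_process_quota(team_quota: int, processes: int, safety: float = 0.9) -> int:
    """Split a shared quota across processes, rounding down with a safety margin."""
    return int(team_quota / processes * safety)

# 100 req/min shared by 3 batch jobs -> give each process 30 req/min, not 33.
```

Scaling the process count up later only requires recomputing this one number.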
Practical Scenarios¶
Scenario 1: CI/CD Pipeline (Single Process)¶
A GitHub Actions job that processes code files. Runs alone, predictable workload.
from stkai import STKAI
# Simple token bucket - we know we're the only consumer
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "token_bucket",
        "max_requests": 60,      # Our full quota
        "time_window": 60.0,
        "max_wait_time": 120.0,  # Patient - job can wait
    }
)
Scenario 2: Multiple Batch Jobs Sharing Quota¶
Three Python processes running simultaneously, each processing different data. They share a 100 req/min quota.
from dataclasses import asdict
from stkai import STKAI, RateLimitConfig
# Adaptive with conservative settings - let processes coordinate via 429s
STKAI.configure(rate_limit=asdict(
    RateLimitConfig.conservative_preset(
        max_requests=30,  # 100 / 3 ≈ 33, round down for safety
    )
))
Scenario 3: Interactive CLI Tool¶
A developer tool that needs fast feedback. User is waiting for response.
from dataclasses import asdict
from stkai import STKAI, RateLimitConfig
# Optimistic - fail fast, let user retry manually
STKAI.configure(rate_limit=asdict(
    RateLimitConfig.optimistic_preset(
        max_requests=50,
    )
))
Scenario 4: Batch Processing with execute_many()¶
Processing 500 files using execute_many() with 8 workers. Note that all workers share the same rate limiter.
from dataclasses import asdict
from stkai import STKAI, RateLimitConfig, RemoteQuickCommand
STKAI.configure(rate_limit=asdict(
    RateLimitConfig.balanced_preset(max_requests=40)
))

rqc = RemoteQuickCommand(
    slug_name="analyze-code",
    max_workers=8,  # 8 threads, but all share the same rate limiter
)
# The rate limiter ensures we don't exceed 40 req/min
# regardless of how many workers are active
responses = rqc.execute_many(requests)
Important Considerations¶
Rate Limiting Only Applies to POST Requests¶
The SDK only rate-limits POST requests (which create executions). GET requests (used for polling) are not rate-limited:
POST /executions → Rate limited (consumes token)
GET /executions/{id} → NOT rate limited (polling is free)
This means your effective quota consumption depends on how many new executions you create, not on polling frequency.
Multiple Processes = Divide the Quota¶
Rate limiting is per-process. If you have 3 processes, each needs its own allocation:
# WRONG: Each process thinks it has full quota
STKAI.configure(rate_limit={"max_requests": 100}) # All 3 processes do this!
# RIGHT: Divide quota among processes
STKAI.configure(rate_limit={"max_requests": 33}) # 100 / 3
max_wait_time Can Block Your Application¶
If max_wait_time is too high, threads may block for a long time waiting for tokens:
# This can block a thread for up to 5 minutes!
STKAI.configure(rate_limit={"max_wait_time": 300})
# Better: fail faster and let retry logic handle it
STKAI.configure(rate_limit={"max_wait_time": 30})
Rate Limiter is Per-Instance¶
Each RemoteQuickCommand or Agent instance has its own rate limiter (via its HttpClient). They don't share state by default:
# These have SEPARATE rate limiters - combined they may exceed quota!
rqc1 = RemoteQuickCommand(slug_name="command-1")
rqc2 = RemoteQuickCommand(slug_name="command-2")
agent = Agent(agent_id="my-agent")
To share a rate limiter, pass the same HttpClient instance:
from stkai import RemoteQuickCommand, Agent, EnvironmentAwareHttpClient
# Create a single HTTP client (rate limiter included via STKAI.configure)
shared_client = EnvironmentAwareHttpClient()
# All instances share the same rate limiter
rqc1 = RemoteQuickCommand(slug_name="command-1", http_client=shared_client)
rqc2 = RemoteQuickCommand(slug_name="command-2", http_client=shared_client)
agent = Agent(agent_id="my-agent", http_client=shared_client)
Decentralized Best-Effort Approach¶
All rate limiting strategies in this SDK are decentralized — each process operates independently with minimal or no coordination between them.
How It Works¶
- Each process maintains its own local rate limiter state
- No shared state or communication between processes
- AIMD adjusts rate based on local observations (429 responses)
- Jitter (±20%) helps desynchronize processes sharing a quota
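The deterministic per-process jitter can be sketched like this. The exact seed derivation is an assumption based on the "hostname+pid seed" note above:

```python
import os
import random
import socket
import zlib

def process_rng() -> random.Random:
    """RNG seeded from hostname + pid: stable within a process, different
    across processes, so co-located workers naturally desynchronize."""
    seed = zlib.crc32(f"{socket.gethostname()}:{os.getpid()}".encode())
    return random.Random(seed)

def jittered(value: float, rng: random.Random, spread: float = 0.2) -> float:
    return value * (1 + rng.uniform(-spread, spread))  # vary by up to ±20%
```

A dedicated `random.Random` instance also keeps the SDK's jitter from disturbing the application's global random state.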
Trade-offs¶
| Aspect | Decentralized (this SDK) | Centralized (Redis, etc.) |
|---|---|---|
| Coordination | None — best effort | Full — precise quota enforcement |
| Complexity | Simple — no infrastructure | Complex — requires Redis/DB |
| Latency | Zero overhead | Network round-trip per request |
| Failure mode | Graceful — continues if others fail | Dependent — fails if Redis fails |
| Accuracy | Approximate — processes may overshoot | Exact — global counter |
This is one of the reasons why 429 errors are inevitable even with rate limiting enabled.
When to Consider Centralized Solutions¶
If your scenario requires precise quota enforcement across many processes, consider centralized rate limiting with:
- Redis — `INCR` + `EXPIRE` for sliding window, or the Redis Cell module
- Database — Postgres advisory locks or atomic counters
- Distributed rate limiters — Envoy, Kong, or cloud API gateways
The SDK's decentralized approach is ideal for:
- Single process applications
- Small clusters (2-5 processes) with shared quotas
- Scenarios where simplicity and resilience matter more than precision
Practical Guidance
For most StackSpot AI SDK use cases, the decentralized approach is sufficient. The adaptive strategy with AIMD and jitter handles multi-process scenarios well without requiring external infrastructure. Only consider centralized solutions if you have strict compliance requirements or consistently observe quota violations despite proper configuration.
Global Configuration (Recommended)¶
The easiest way to enable rate limiting is via STKAI.configure(). When enabled, EnvironmentAwareHttpClient automatically wraps requests with rate limiting:
from stkai import STKAI, RemoteQuickCommand, Agent
# Enable rate limiting globally
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "token_bucket",
        "max_requests": 30,
        "time_window": 60.0,
    }
)
# All clients now use rate limiting automatically
rqc = RemoteQuickCommand(slug_name="my-command")
agent = Agent(agent_id="my-agent")
Configuration via Code¶
from stkai import STKAI
# Token Bucket (simple, predictable)
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "token_bucket",
        "max_requests": 30,
        "time_window": 60.0,
        "max_wait_time": 60.0,  # Timeout after 60s waiting
    }
)

# Adaptive (dynamic, handles 429)
STKAI.configure(
    rate_limit={
        "enabled": True,
        "strategy": "adaptive",
        "max_requests": 100,
        "time_window": 60.0,
        "min_rate_floor": 0.1,    # Never below 10%
        "penalty_factor": 0.2,    # Reduce by 20% on 429
        "recovery_factor": 0.05,  # Increase by 5% on success
    }
)
# Unlimited wait time (wait indefinitely for token)
STKAI.configure(rate_limit={"max_wait_time": None}) # or "unlimited"
Configuration via Environment Variables¶
STKAI_RATE_LIMIT_ENABLED=true
STKAI_RATE_LIMIT_STRATEGY=adaptive
STKAI_RATE_LIMIT_MAX_REQUESTS=50
STKAI_RATE_LIMIT_TIME_WINDOW=60.0
STKAI_RATE_LIMIT_MAX_WAIT_TIME=unlimited # or "none", "null"
STKAI_RATE_LIMIT_MIN_RATE_FLOOR=0.1
STKAI_RATE_LIMIT_PENALTY_FACTOR=0.2
STKAI_RATE_LIMIT_RECOVERY_FACTOR=0.05
RateLimitConfig Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | `bool` | `False` | Enable rate limiting |
| `strategy` | `"token_bucket" \| "adaptive"` | `"token_bucket"` | Rate limiting algorithm |
| `max_requests` | `int` | `100` | Max requests per time window |
| `time_window` | `float` | `60.0` | Time window in seconds |
| `max_wait_time` | `float \| None` | `45.0` | Max wait for token (None = unlimited) |
| `min_rate_floor` | `float` | `0.1` | (adaptive) Min rate as fraction of max |
| `penalty_factor` | `float` | `0.3` | (adaptive) Rate reduction on 429 |
| `recovery_factor` | `float` | `0.05` | (adaptive) Rate increase on success |
Manual Configuration¶
For more control, you can manually create rate-limited clients:
from stkai import TokenBucketRateLimitedHttpClient, EnvironmentAwareHttpClient, RemoteQuickCommand
# Create rate-limited client manually
http_client = TokenBucketRateLimitedHttpClient(
    delegate=EnvironmentAwareHttpClient(),
    max_requests=30,
    time_window=60.0,
)

# Use with specific client
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=http_client,
)
Exception Hierarchy¶
The SDK provides a clear exception hierarchy for rate limiting errors:
RetryableError (base - automatically retried)
├── ClientSideRateLimitError # Base for client-side rate limit errors
│ └── TokenAcquisitionTimeoutError # Timeout waiting for token (max_wait_time exceeded)
└── ServerSideRateLimitError # HTTP 429 from server (contains response for Retry-After)
| Exception | Raised By | When |
|---|---|---|
| `TokenAcquisitionTimeoutError` | `TokenBucketRateLimitedHttpClient`, `AdaptiveRateLimitedHttpClient` | Token wait exceeds `max_wait_time` |
| `ServerSideRateLimitError` | `AdaptiveRateLimitedHttpClient` | Server returns HTTP 429 |
| `requests.HTTPError` | Direct from `requests` | HTTP 429 without Adaptive strategy |
All exceptions inherit from RetryableError, which the Retrying class automatically retries with exponential backoff.
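To illustrate why the hierarchy matters, here are toy stand-ins for those classes (they shadow the SDK names for demonstration only): one `except` clause on the base class covers both the client-side and the server-side failure families.

```python
class RetryableError(Exception):
    """Base class: the retry machinery treats these as retryable."""

class ClientSideRateLimitError(RetryableError):
    """Base for client-side rate limit errors."""

class TokenAcquisitionTimeoutError(ClientSideRateLimitError):
    """Waited longer than max_wait_time for a token."""

class ServerSideRateLimitError(RetryableError):
    """Server answered HTTP 429."""

def classify(exc: Exception) -> str:
    # A single handler on the base class catches every rate-limit failure.
    try:
        raise exc
    except RetryableError as e:
        return type(e).__name__
```

Application code that only cares about "should I retry?" can therefore catch `RetryableError` alone, without enumerating subclasses.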
Timeout Handling¶
Both strategies raise TokenAcquisitionTimeoutError when a thread waits too long for a token:
from stkai import TokenBucketRateLimitedHttpClient, TokenAcquisitionTimeoutError, StkCLIHttpClient
http_client = TokenBucketRateLimitedHttpClient(
    delegate=StkCLIHttpClient(),
    max_requests=10,
    time_window=60.0,
    max_wait_time=45.0,  # Give up after 45 seconds
)

try:
    response = http_client.post(url, data=payload)
except TokenAcquisitionTimeoutError as e:
    print(f"Timeout after {e.waited:.1f}s (max: {e.max_wait_time}s)")
    # Handle timeout: retry later, skip request, or fail gracefully
| Value | Behavior |
|---|---|
| `60.0` (default) | Wait up to 60 seconds for a token |
| `None` or `"unlimited"` | Wait indefinitely (no timeout) |
| `0.1` | Fail-fast mode (almost immediate timeout) |
Choosing max_wait_time
A good rule of thumb is to set max_wait_time equal to time_window. This ensures at least one full rate limit cycle can complete before timing out.
Thread Safety¶
Both rate-limiting strategies are thread-safe and work correctly with:
- `execute_many()` concurrent workers
- Multi-threaded applications
- Shared client instances
from stkai import STKAI, RemoteQuickCommand
STKAI.configure(
    rate_limit={"enabled": True, "max_requests": 30}
)

rqc = RemoteQuickCommand(
    slug_name="my-command",
    max_workers=16,  # 16 concurrent workers, still rate-limited
)
Custom HTTP Client¶
Implement the HttpClient interface for custom behavior:
from typing import Any
import requests
from stkai import HttpClient, RemoteQuickCommand

class MyCustomHttpClient(HttpClient):
    def get(
        self,
        url: str,
        headers: dict[str, str] | None = None,
        timeout: int = 30,
    ) -> requests.Response:
        # Custom GET logic
        return requests.get(url, headers=headers, timeout=timeout)

    def post(
        self,
        url: str,
        data: dict[str, Any] | None = None,
        headers: dict[str, str] | None = None,
        timeout: int = 30,
    ) -> requests.Response:
        # Custom POST logic
        return requests.post(url, json=data, headers=headers, timeout=timeout)

# Use custom client
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=MyCustomHttpClient(),
)
Decorator Pattern¶
Rate limiting clients use the decorator pattern - they wrap another client:
from stkai import (
    AdaptiveRateLimitedHttpClient,
    TokenBucketRateLimitedHttpClient,
    StandaloneHttpClient,
    ClientCredentialsAuthProvider,
)

# Build a decorated chain
auth_provider = ClientCredentialsAuthProvider(
    client_id="id",
    client_secret="secret",
)

# Base client with authentication
base_client = StandaloneHttpClient(auth_provider=auth_provider)

# Add fixed rate limiting
rate_limited = TokenBucketRateLimitedHttpClient(
    delegate=base_client,
    max_requests=50,
    time_window=60.0,
)

# Add adaptive rate limiting on top
adaptive_client = AdaptiveRateLimitedHttpClient(
    delegate=rate_limited,  # Wrap the rate-limited client
    max_requests=100,
    time_window=60.0,
)
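Stripped of SDK specifics, the pattern is simply "same interface, wraps a delegate, adds behavior." A self-contained sketch with toy classes (not the SDK's):

```python
from typing import Protocol

class HttpClientLike(Protocol):
    def post(self, url: str) -> str: ...

class BaseClient:
    def post(self, url: str) -> str:
        return f"POST {url}"

class CountingDecorator:
    """Implements the same interface, delegates the real work, adds behavior."""

    def __init__(self, delegate: HttpClientLike) -> None:
        self.delegate = delegate
        self.calls = 0

    def post(self, url: str) -> str:
        self.calls += 1                 # added behavior (here: counting)
        return self.delegate.post(url)  # then delegate

# Decorators nest freely, exactly like the rate-limiting chain above:
client = CountingDecorator(CountingDecorator(BaseClient()))
```

Because every layer exposes the same interface, callers never know (or care) how deep the chain is.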
Testing with Mock Client¶
Create a mock client for testing:
from unittest.mock import Mock, MagicMock
import requests
from stkai import HttpClient, RemoteQuickCommand
# Create mock
mock_client = Mock(spec=HttpClient)
# Configure POST response
mock_response = MagicMock(spec=requests.Response)
mock_response.status_code = 200
mock_response.json.return_value = {"execution_id": "exec-123"}
mock_response.raise_for_status.return_value = None
mock_client.post.return_value = mock_response
# Configure GET response
get_response = MagicMock(spec=requests.Response)
get_response.status_code = 200
get_response.json.return_value = {
    "progress": {"status": "COMPLETED"},
    "result": {"data": "test"},
}
get_response.raise_for_status.return_value = None
mock_client.get.return_value = get_response
# Use in tests
rqc = RemoteQuickCommand(
    slug_name="test-command",
    http_client=mock_client,
)
Thread Safety¶
All built-in HTTP clients are thread-safe:
- `EnvironmentAwareHttpClient` - Delegates to thread-safe clients
- `StkCLIHttpClient` - Stateless, safe
- `StandaloneHttpClient` - Auth provider handles token caching
- `TokenBucketRateLimitedHttpClient` - Uses `threading.Lock()`
- `AdaptiveRateLimitedHttpClient` - Uses `threading.Lock()`
Safe to share across threads and with execute_many():
# Thread-safe: shared client with concurrent workers
http_client = TokenBucketRateLimitedHttpClient(...)
rqc = RemoteQuickCommand(
    slug_name="my-command",
    http_client=http_client,
    max_workers=16,  # 16 concurrent threads
)
responses = rqc.execute_many(requests)
Next Steps¶
- RQC Rate Limiting - Detailed rate limiting examples for RQC
- Configuration - Global SDK configuration
- Getting Started - Quick setup guide