Distributed Bulk Rename Pipeline with BullMQ and Redis - Case Study

Background

RenameHub is a web platform for bulk renaming files using template-based rules and AI-assisted document classification. Users connect cloud storage providers — starting with Dropbox — or upload files directly, then run rename jobs that can touch hundreds of files in a single batch.

Early in development, rename jobs ran synchronously inside the HTTP request cycle. A user would submit a job, the server would process every file sequentially, and the response would arrive only after the last file was renamed. This worked fine for small batches but revealed serious problems as job sizes grew:

HTTP timeouts on large batches — the connection would drop before the job finished
No way to report partial progress to the user
A single failing file could abort the entire batch
No retry logic — transient cloud API errors caused silent data loss
Concurrency was limited to one file at a time per request

We needed a proper distributed job queue that could process files asynchronously, track progress in real time, retry on failure, and respect cloud provider rate limits — without requiring complex infrastructure.

The Challenge

Batch file renaming across cloud providers introduces a class of problems common to any distributed, I/O-bound workload:

Reliability under partial failure

A job renaming 500 Dropbox files will encounter transient failures — network blips, token expiry, provider rate limits. The pipeline needed to retry individual file operations independently without restarting the whole batch, and surface a clear success/failure summary at the end.

Real-time progress

Users submitted jobs and immediately asked "how many files are done?" Polling a REST endpoint every second was wasteful. We needed a push-based mechanism that streamed progress updates as each file completed.

Cloud API rate limits

Dropbox enforces per-user request-rate limits. Hammering the API with 500 concurrent rename calls leads to 429 Too Many Requests errors. The queue needed a concurrency cap and backoff strategy tuned to each provider's limits.

Horizontal scalability

The rename worker should be deployable as multiple instances so the system can process many jobs in parallel without any single instance becoming a bottleneck.

The Solution

We built the rename pipeline on BullMQ, a Node.js job queue library built on Redis. BullMQ handles job persistence, worker distribution, retries, and progress events out of the box, and Redis provides the in-memory data store that backs all of it. Job state survives a Redis restart only when Redis persistence (for example AOF) is enabled.

1. Queue Design

Every rename job submitted by a user produces one parent job and N child jobs (one per file) using BullMQ's built-in FlowProducer. The parent job tracks overall job state; child jobs represent individual file rename operations.

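// Creating a rename flow
import { FlowProducer } from 'bullmq';

const flowProducer = new FlowProducer({ connection: redisConnection });

const flow = await flowProducer.add({
  name: 'rename-job',
  queueName: 'rename-jobs',
  data: { jobId, userId, totalFiles: files.length },
  children: files.map((file) => ({
    name: 'rename-file',
    queueName: 'rename-files',
    data: { jobId, fileId: file.id, sourcePath: file.path, targetName: file.targetName },
    opts: {
      attempts: 4,
      backoff: { type: 'exponential', delay: 1000 },
      // Let the parent proceed even if this child fails permanently
      // (requires a BullMQ release that supports this option)
      ignoreDependencyOnFailure: true,
    },
  })),
});

By default, a BullMQ parent job stays in the waiting-children state until every child has completed successfully, so a single permanently failed child would block it indefinitely. With ignoreDependencyOnFailure set on each child, the parent starts processing once all child jobs have either completed or exhausted their retries. This gives us a natural aggregation point for the final job summary.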

2. Worker Pool & Concurrency

A dedicated rename-files worker processes child jobs. The worker's concurrency is set to 8 simultaneous files per instance — a value tuned to stay comfortably within Dropbox's API rate limits while still processing jobs quickly.

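import { Worker } from 'bullmq';

const fileWorker = new Worker(
  'rename-files',
  async (job) => {
    const { jobId, fileId, sourcePath, targetName } = job.data;

    // Rename the file via the cloud provider SDK.
    // Dropbox expects to_path to be a full destination path, so targetName
    // is assumed here to already include the folder.
    await dropboxClient.filesMove({ from_path: sourcePath, to_path: targetName });

    // Emit a progress event for this file; jobId lets listeners match it to the batch
    const result = { jobId, fileId, status: 'success' };
    await job.updateProgress(result);

    // The return value is what the parent job later reads via getChildrenValues()
    return result;
  },
  {
    connection: redisConnection,
    concurrency: 8,
  }
);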

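Each worker instance is stateless and connects to the same Redis instance, so scaling horizontally is as simple as starting additional worker processes, with no application-level coordination required. Because concurrency is set per instance, N instances allow up to 8 × N simultaneous provider calls, so the aggregate request rate still needs a queue-wide cap (see section 5).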
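
The filesMove call in the worker needs a full destination path in to_path. The source does not say whether targetName holds a bare file name or a full path; if it is only a file name, the destination could be derived from the source folder along these lines. This is a minimal sketch: buildTargetPath is a hypothetical helper, not part of the Dropbox SDK or BullMQ.

```javascript
// Hypothetical helper: derive the full destination path for a rename that
// keeps the file in its current folder. Dropbox paths use forward slashes.
function buildTargetPath(sourcePath, targetName) {
  if (targetName.includes('/')) {
    throw new Error(`targetName must be a bare file name, got "${targetName}"`);
  }
  const lastSlash = sourcePath.lastIndexOf('/');
  // Everything before the last slash is the folder ('' for the root folder)
  const folder = lastSlash === -1 ? '' : sourcePath.slice(0, lastSlash);
  return `${folder}/${targetName}`;
}

console.log(buildTargetPath('/invoices/2023/scan_001.pdf', 'acme-invoice-0001.pdf'));
// /invoices/2023/acme-invoice-0001.pdf
console.log(buildTargetPath('/scan_002.pdf', 'contract.pdf'));
// /contract.pdf
```

Rejecting names that contain a slash keeps a rename from silently turning into a move to another folder.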

3. Retry & Error Handling

Child jobs are configured with 4 attempts and exponential backoff starting at 1 second. This handles transient network errors and short rate-limit windows without manual intervention.
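
To make the retry timing concrete, the sketch below computes the wait before each retry. It assumes BullMQ's built-in exponential strategy doubles the delay on every attempt (delay × 2^(attemptsMade − 1)) with no jitter configured; backoffSchedule is a hypothetical helper written for illustration.

```javascript
// Delay before each retry under exponential backoff, assuming the
// delay * 2^(attemptsMade - 1) formula with no jitter.
function backoffSchedule(attempts, baseDelayMs) {
  const delays = [];
  // attemptsMade runs from 1 (after the first failure) to attempts - 1,
  // because the final failed attempt is not followed by another retry.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(2 ** (attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}

const delays = backoffSchedule(4, 1000);
const totalWaitMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalWaitMs); // [ 1000, 2000, 4000 ] 7000
```

Under these assumptions a file is given up on after roughly 7 seconds of cumulative waiting, which covers short rate-limit windows but not a provider outage lasting minutes.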

If a file exhausts all retries, BullMQ moves it to the failed set and the parent job's aggregation logic counts it as a failed rename. The user sees a final summary that distinguishes successful renames from failed ones — partial success is always reported honestly rather than silently swallowed.

// Parent job aggregation (runs when all children settle)
const parentWorker = new Worker('rename-jobs', async (job) => {
  // Maps each completed child's job key to the value its processor returned;
  // assumes a successful child returns an object with status: 'success'
  const children = await job.getChildrenValues();
  const succeeded = Object.values(children).filter((c) => c?.status === 'success').length;
  // Children that exhausted their retries return no value, so they count as failed
  const failed = job.data.totalFiles - succeeded;

  await db.renameJob.update({
    where: { id: job.data.jobId },
    data: { status: failed === 0 ? 'completed' : 'partial', succeeded, failed },
  });
}, { connection: redisConnection });
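
The aggregation above only counts failures. To tell users which files failed, the same data can be turned into a per-file summary. The sketch below is one possible approach, assuming each successful child returns an object such as { fileId, status: 'success' } and that the list of submitted file IDs is available to the parent (neither is confirmed by the code above); summarizeRename is a hypothetical helper.

```javascript
// Hypothetical helper: build a per-file summary from the children's return values.
// `childrenValues` mimics the shape returned by job.getChildrenValues(): an object
// keyed by child job key whose values are the processors' return values.
function summarizeRename(submittedFileIds, childrenValues) {
  const succeededIds = new Set(
    Object.values(childrenValues)
      .filter((value) => value && value.status === 'success')
      .map((value) => value.fileId)
  );
  // Any submitted file without a success record is treated as failed
  const failedIds = submittedFileIds.filter((id) => !succeededIds.has(id));
  return {
    status: failedIds.length === 0 ? 'completed' : 'partial',
    succeeded: succeededIds.size,
    failed: failedIds.length,
    failedIds,
  };
}

// Illustrative input: three files submitted, two children reported success
const summary = summarizeRename(['a', 'b', 'c'], {
  'bull:rename-files:1': { fileId: 'a', status: 'success' },
  'bull:rename-files:3': { fileId: 'c', status: 'success' },
});
console.log(summary.status, summary.failedIds); // partial [ 'b' ]
```

Storing failedIds alongside the counts is what would let a user re-submit only the failed subset.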

4. Real-Time Progress via Server-Sent Events

Workers call job.updateProgress() after each file completes. The API layer subscribes to BullMQ's QueueEvents for the rename-files queue and forwards progress events to the user's browser over a Server-Sent Events (SSE) connection.

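// SSE endpoint
// A single shared QueueEvents instance: each instance holds its own blocking
// Redis connection, so creating one per request would not scale.
const queueEvents = new QueueEvents('rename-files', { connection: redisConnection });

app.get('/jobs/:jobId/progress', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders();

  // `data` is the payload passed to job.updateProgress(); this assumes the
  // worker includes the batch's jobId in that payload
  const onProgress = ({ data }) => {
    if (data && data.jobId === req.params.jobId) {
      res.write(`data: ${JSON.stringify(data)}\n\n`);
    }
  };
  queueEvents.on('progress', onProgress);

  // Detach only this client's listener when the browser disconnects
  req.on('close', () => queueEvents.off('progress', onProgress));
});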

This means the dashboard updates in real time as each file renames — no polling, no stale counts, no manual refresh.
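
The string written to the response follows the Server-Sent Events wire format: optional event and id fields, a data field, and a blank line that ends the frame. The helper below is a small sketch of that framing; formatSseEvent and the 'file-done' event name are hypothetical, introduced only for illustration.

```javascript
// Format one Server-Sent Events frame. `event` and `id` are optional fields
// of the SSE wire format; the payload is JSON-encoded onto a single data line.
function formatSseEvent(payload, { event, id } = {}) {
  const lines = [];
  if (event) lines.push(`event: ${event}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  // JSON.stringify never emits raw newlines, so one data line is enough
  lines.push(`data: ${JSON.stringify(payload)}`);
  // A blank line terminates the frame
  return lines.join('\n') + '\n\n';
}

const frame = formatSseEvent({ fileId: 'f1', status: 'success' }, { event: 'file-done', id: 3 });
console.log(JSON.stringify(frame));
// "event: file-done\nid: 3\ndata: {\"fileId\":\"f1\",\"status\":\"success\"}\n\n"
```

In the endpoint, this would replace the inline template string, for example res.write(formatSseEvent(data)). Naming the event lets the browser attach a dedicated listener with EventSource's addEventListener instead of handling everything in onmessage.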

5. Rate Limiting with BullMQ Throttle

For cloud providers with strict per-minute API limits, we configure BullMQ's built-in rate limiter through the worker's limiter option. The limit is enforced for the queue as a whole, across all worker instances, rather than per process. When the limit is reached, job pickup pauses and automatically resumes after the window resets, without any application-level sleep loops.

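// `processor` is the async file-rename handler shown in section 2
const fileWorker = new Worker('rename-files', processor, {
  connection: redisConnection,
  concurrency: 8,
  limiter: {
    max: 100,         // at most 100 jobs
    duration: 60_000, // per 60 seconds, across all workers on this queue
  },
});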
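
To see what this limit implies for a large batch, here is a rough back-of-the-envelope sketch. It assumes the limiter behaves like a fixed window and ignores per-file processing time and retries, so it gives only a lower bound; minDrainTimeMs is a hypothetical helper, not part of BullMQ.

```javascript
// Lower bound on the time until the last job in a batch is allowed to start,
// assuming at most `max` jobs may start in each fixed window of `durationMs`.
function minDrainTimeMs(totalJobs, max, durationMs) {
  if (totalJobs <= 0) return 0;
  const windows = Math.ceil(totalJobs / max);
  // The first window opens immediately; each further window adds one full duration
  return (windows - 1) * durationMs;
}

console.log(minDrainTimeMs(500, 100, 60_000)); // 240000
```

With the settings above, a 500-file batch therefore needs at least about four minutes regardless of how many worker instances are running, which is worth weighing when choosing max and duration for a given provider.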

Architecture Overview

The full pipeline looks like this:

User submits a rename job via the API.
The API creates a BullMQ flow: one parent job + N child file-rename jobs, all stored in Redis.
The rename-files worker picks up child jobs with a concurrency of 8, calling the cloud provider API for each file.
Each completed file emits a progress event that the SSE endpoint forwards to the user's browser; files that exhaust their retries are reported in the final summary.
When all children settle, the rename-jobs parent worker writes the final summary to the database.
Failed files are retained in BullMQ's failed set for inspection and optional replay.

Redis acts as the single source of truth for in-flight job state; the application database only receives the final summary. No database polling is required, because BullMQ's flow and event mechanisms handle fan-out and aggregation.

Results

No more HTTP timeouts

Decoupling job execution from the HTTP request cycle eliminated timeouts entirely. The API returns a job ID immediately; processing happens asynchronously regardless of how many files are in the batch.

Reliable partial-failure handling

With per-file retry logic and honest aggregation, users now receive clear feedback on exactly which files succeeded and which failed — enabling them to re-submit only the failed subset rather than repeating the whole job.

Real-time dashboard updates

The SSE progress stream gives users a live count of completed files, making large jobs feel responsive rather than opaque. Perceived wait time dropped significantly even though actual processing time remained the same.

Horizontal scalability

Because workers are stateless and coordinate through Redis, spinning up additional worker instances during peak load is trivial. The queue automatically distributes pending jobs across all available workers.

Operational visibility

BullMQ's job states (waiting, active, completed, failed, delayed) are stored in Redis and queryable at any time. This gives the ops team a clear window into queue health without custom instrumentation.

Conclusion

BullMQ and Redis gave us a production-grade distributed job queue with minimal infrastructure overhead. By modelling each rename batch as a parent-child flow, we got per-file retry isolation, reliable aggregation, and real-time progress — all of which are critical for a user-facing batch operation where partial failure is a normal condition, not an edge case.

The same pattern — a BullMQ flow with concurrency-capped workers, exponential backoff, and SSE progress streaming — applies to any I/O-bound distributed workload: document conversion, data ingestion pipelines, multi-step workflow automation, or API fan-out. Redis and BullMQ are the simplest stack we've found for getting this right.