Crawler
Automation is at the core of CVSA’s technical architecture. The crawler is built to efficiently orchestrate data collection tasks using a message queue system powered by BullMQ. This design enables concurrent processing across multiple stages of the data collection lifecycle.
State management and data persistence are handled using a combination of Redis (for caching and real-time data) and PostgreSQL (as the primary database).
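To make the queue setup concrete, here is a minimal sketch of how a producer might enqueue a job, assuming BullMQ's standard Queue API; the queue name, job name, and payload are hypothetical examples, not the crawler's actual identifiers.

```typescript
// Minimal sketch of enqueuing a named job on a Redis-backed BullMQ queue.
// "latestVideos", "getVideoInfo", and the payload are illustrative only.
import { Queue } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // Redis backs the queues

const videoQueue = new Queue("latestVideos", { connection });

// Producers add named jobs; BullMQ workers elsewhere consume them concurrently.
await videoQueue.add("getVideoInfo", { aid: 170001 });
```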
crawler/db
This module handles all database interactions for the crawler, including creation, updates, and data retrieval.
- init.ts: Initializes the PostgreSQL connection pool.
- redis.ts: Sets up the Redis client.
- withConnection.ts: Exports withDatabaseConnection, a helper that provides a database context to any function (see the sketch after this list).
- Other files: Contain table-specific functions, with each file corresponding to a database table.
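The following is a hedged sketch of what withDatabaseConnection might look like, assuming a node-postgres pool; the actual implementation in withConnection.ts may differ.

```typescript
// Illustrative shape of the withDatabaseConnection helper (an assumption,
// not the actual code): acquire a client, run the task, always release.
import { Pool, PoolClient } from "pg";

const pool = new Pool(); // in the real codebase this is set up in init.ts

export async function withDatabaseConnection<T>(
  task: (client: PoolClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    return await task(client);
  } finally {
    client.release(); // return the connection to the pool even on error
  }
}
```

Table-specific functions can then accept a client from this helper instead of managing connections themselves.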
crawler/ml
This module handles machine learning tasks, such as content classification.
- manager.ts: Defines a base class AIManager for managing ML models.
- akari.ts: Implements our primary classification model, AkariProto, which extends AIManager. It filters videos to determine if they should be included as songs.
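The class relationship can be pictured as follows; everything except the names AIManager and AkariProto is an assumption for illustration.

```typescript
// Illustrative sketch of the manager/model relationship. Only the class
// names come from this documentation; the members are hypothetical.
abstract class AIManager {
  // Subclasses load their model and expose inference.
  abstract init(): Promise<void>;
}

class AkariProto extends AIManager {
  async init(): Promise<void> {
    // load the classification model's weights here
  }

  // Decides whether a video should be included as a song.
  async isSong(title: string, description: string): Promise<boolean> {
    // run inference; placeholder result for the sketch
    return false;
  }
}
```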
crawler/mq
This module manages task queuing and processing through BullMQ.
crawler/mq/exec
Contains the functions executed by BullMQ workers. Examples include getVideoInfoWorker and takeBulkSnapshotForVideosWorker.
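As a hypothetical illustration of the shape these functions take (the name getVideoInfoWorker comes from this page; the body is an assumption):

```typescript
// Hypothetical body for a function in crawler/mq/exec. It receives the
// BullMQ job and performs the actual work for that job name.
import { Job } from "bullmq";

export async function getVideoInfoWorker(job: Job): Promise<void> {
  const { aid } = job.data; // payload enqueued by the producer
  // fetch metadata via crawler/net, then persist it via crawler/db ...
}
```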
Terminology note: In this documentation:
- Functions in crawler/mq/exec are called workers.
- Functions in crawler/mq/workers are called BullMQ workers.
Design detail:
Since BullMQ requires one handler per queue, we use a switch statement inside each BullMQ worker to route jobs based on their name to the correct function in crawler/mq/exec.
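A minimal sketch of this routing pattern, using BullMQ's Worker API; the queue name, job names, and import path are assumptions:

```typescript
// One BullMQ worker per queue; a switch on job.name dispatches to the
// matching function in crawler/mq/exec. Names here are illustrative.
import { Job, Worker } from "bullmq";
import { getVideoInfoWorker, takeBulkSnapshotForVideosWorker } from "../exec";

export const videoWorker = new Worker(
  "latestVideos",
  async (job: Job) => {
    switch (job.name) {
      case "getVideoInfo":
        return await getVideoInfoWorker(job);
      case "snapshotVideos":
        return await takeBulkSnapshotForVideosWorker(job);
      default:
        throw new Error(`Unknown job name: ${job.name}`);
    }
  },
  { connection: { host: "localhost", port: 6379 } },
);
```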
crawler/mq/workers
Houses the BullMQ worker functions. Each function handles jobs for a specific queue.
crawler/mq/task
To keep worker functions clean and focused, reusable logic is extracted into this directory as tasks. These tasks are then imported and used by the worker functions.
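For example (all names here are invented for illustration), a task might encapsulate a single persistence step so that the worker stays focused on orchestration:

```typescript
// Hypothetical task in crawler/mq/task: one reusable, self-contained step.
interface VideoInfo {
  aid: number;
  title: string;
}

export async function insertVideoInfo(info: VideoInfo): Promise<void> {
  // write the row through the crawler/db helpers
}

// A worker in crawler/mq/exec would then compose tasks like this:
//   const info = await fetchVideoInfo(job.data.aid);
//   await insertVideoInfo(info);
```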
crawler/net
This module handles all data fetching operations. Its core component is the NetworkDelegate, defined in net/delegate.ts.
crawler/net/delegate.ts
Implements robust network request handling (a conceptual sketch follows the list), including:
- Rate limiting by task type and proxy
- Support for serverless functions to dynamically rotate requesting IPs
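The following is a conceptual sketch only, assuming a simple per-task-type minimum-interval limiter; the real NetworkDelegate likely differs in both structure and policy.

```typescript
// Conceptual sketch of per-task-type rate limiting. The class members and
// the interval policy are assumptions, not the actual delegate's design.
class NetworkDelegate {
  private lastRequestAt = new Map<string, number>();

  constructor(private minIntervalMs: Record<string, number>) {}

  async request(taskType: string, url: string): Promise<Response> {
    const last = this.lastRequestAt.get(taskType) ?? 0;
    const wait = (this.minIntervalMs[taskType] ?? 0) - (Date.now() - last);
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    this.lastRequestAt.set(taskType, Date.now());
    // A serverless proxy endpoint could be substituted for the direct
    // fetch here to rotate the requesting IP.
    return fetch(url);
  }
}
```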
crawler/utils
A collection of utility functions shared across the crawler modules.
crawler/src
Contains the main entry point of the crawler.
We use concurrently to run multiple scripts in parallel, so the crawler's long-running processes can be started and managed from a single command.