Crawler


Automation is at the core of CVSA’s technical architecture. The crawler is built to efficiently orchestrate data collection tasks using a message queue system powered by BullMQ. This design enables concurrent processing across multiple stages of the data collection lifecycle.

State management and data persistence are handled using a combination of Redis (for caching and real-time data) and PostgreSQL (as the primary database).

crawler/db

This module handles all database interactions for the crawler, including record creation, updates, and retrieval.

  • init.ts: Initializes the PostgreSQL connection pool.

  • redis.ts: Sets up the Redis client.

  • withConnection.ts: Exports withDatabaseConnection, a helper that provides a database context to any function (see the sketch after this list).

  • Other files: Contain table-specific functions, with each file corresponding to a database table.
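
For illustration, withDatabaseConnection might look like the minimal sketch below, assuming the pg driver; the actual signature and pool setup in the repository may differ.

```typescript
import { Pool, PoolClient } from "pg";

// Shared pool, as set up by init.ts.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Runs `task` with a pooled client and always returns the connection
// to the pool, even if the task throws.
export async function withDatabaseConnection<T>(
  task: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    return await task(client);
  } finally {
    client.release();
  }
}
```

A table-specific function can then be written as withDatabaseConnection((client) => client.query(...)) without managing connections itself.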

crawler/ml

This module handles machine learning tasks, such as content classification.

  • manager.ts: Defines a base class AIManager for managing ML models.

  • akari.ts: Implements our primary classification model, AkariProto, which extends AIManager. It filters videos to determine if they should be included as songs.
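
As a rough sketch of this relationship (AIManager and AkariProto are the real class names; every other member below is hypothetical):

```typescript
// AIManager and AkariProto are the real class names; all members shown
// here are hypothetical.
abstract class AIManager {
  // Subclasses load their model(s) here.
  public abstract init(): Promise<void>;
}

export class AkariProto extends AIManager {
  public async init(): Promise<void> {
    // Load the classification model, e.g. an ONNX session.
  }

  // Decides whether a video should be included as a song.
  public async classifyVideo(
    title: string,
    description: string,
    tags: string[]
  ): Promise<boolean> {
    const score = await this.infer(title, description, tags);
    return score > 0.5; // illustrative threshold
  }

  private async infer(
    title: string,
    description: string,
    tags: string[]
  ): Promise<number> {
    // Placeholder: real code would embed the inputs and run inference.
    return 0;
  }
}
```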

crawler/mq

This module manages task queuing and processing through BullMQ.

crawler/mq/exec

Contains the functions executed by BullMQ workers. Examples include getVideoInfoWorker and takeBulkSnapshotForVideosWorker.

Terminology note: In this documentation:

  • Functions in crawler/mq/exec are called workers.

  • Functions in crawler/mq/workers are called BullMQ workers.

Design detail: Since BullMQ requires one handler per queue, we use a switch statement inside each BullMQ worker to route jobs based on their name to the correct function in crawler/mq/exec.
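
A sketch of this pattern using BullMQ's Worker API (the queue name, job names, and import path here are illustrative, not the repository's exact ones):

```typescript
import { Job, Worker } from "bullmq";
// Hypothetical import path; the actual handlers live in crawler/mq/exec.
import { getVideoInfoWorker, takeBulkSnapshotForVideosWorker } from "../exec";

// One BullMQ worker per queue; the switch routes each job by name
// to the matching function from crawler/mq/exec.
const videoQueueWorker = new Worker(
  "video", // queue name is illustrative
  async (job: Job) => {
    switch (job.name) {
      case "getVideoInfo":
        return getVideoInfoWorker(job);
      case "takeBulkSnapshotForVideos":
        return takeBulkSnapshotForVideosWorker(job);
      default:
        throw new Error(`Unknown job name: ${job.name}`);
    }
  },
  { connection: { host: "localhost", port: 6379 } }
);
```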

crawler/mq/workers

Houses the BullMQ worker functions. Each function handles jobs for a specific queue.

crawler/mq/task

To keep worker functions clean and focused, reusable logic is extracted into this directory as tasks. These tasks are then imported and used by the worker functions.
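
For example, the pattern might look like this (file paths, names, and the URL are hypothetical):

```typescript
import { Job } from "bullmq";

// crawler/mq/task/getVideoInfo.ts (hypothetical): reusable logic lives here.
export async function getVideoInfo(aid: number): Promise<unknown> {
  const res = await fetch(`https://api.example.com/video/${aid}`); // placeholder URL
  return await res.json();
}

// crawler/mq/exec/getVideoInfoWorker.ts (hypothetical): the worker stays
// thin and simply delegates to the task.
export async function getVideoInfoWorker(job: Job): Promise<void> {
  const info = await getVideoInfo(job.data.aid);
  // Persist `info`, enqueue follow-up jobs, etc.
}
```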

crawler/net

This module handles all data fetching operations. Its core component is the NetworkDelegate, defined in net/delegate.ts.

crawler/net/delegate.ts

Implements robust network request handling, including:

  • Rate limiting by task type and proxy

  • Support for serverless functions to dynamically rotate requesting IPs
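
Conceptually, the rate limiting might be structured like the sketch below; the real NetworkDelegate's API and limiter configuration will differ.

```typescript
// Conceptual sketch only; method and field names are assumptions.
class NetworkDelegate {
  private lastRequestAt = new Map<string, number>();

  constructor(private minIntervalMs: number) {}

  // Enforces a minimum interval between requests of the same task type,
  // then issues the request.
  async request(url: string, taskType: string): Promise<Response> {
    const last = this.lastRequestAt.get(taskType) ?? 0;
    const waitMs = last + this.minIntervalMs - Date.now();
    if (waitMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
    this.lastRequestAt.set(taskType, Date.now());
    // A proxy or serverless endpoint could be selected here to rotate
    // the requesting IP.
    return fetch(url);
  }
}
```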

crawler/utils

A collection of utility functions shared across the crawler modules.

crawler/src

Contains the main entry point of the crawler.

We use concurrently to run multiple scripts in parallel, enabling efficient execution of the crawler's various processes.
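
For instance, a package.json script along these lines would start several processes at once (the script names are placeholders, not the repository's):

```json
{
  "scripts": {
    "start": "concurrently \"npm:worker\" \"npm:scheduler\"",
    "worker": "node dist/worker.js",
    "scheduler": "node dist/scheduler.js"
  }
}
```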
