CyberOrigen Data Flow Documentation

Overview

This document provides a comprehensive analysis of data flow through the CyberOrigen application, covering all major user interactions, security boundaries, and system integrations. The application is a multi-tenant SaaS platform for autonomous security scanning, compliance assessment, and GRC (Governance, Risk & Compliance) management.

System Architecture Overview

Frontend (React/Vite)     ←→     Backend (FastAPI/Python)     ←→     External Services
├─ Admin Portal (5173)           ├─ API Gateway                      ├─ AI Providers
├─ Marketing Site (5175)         ├─ Authentication                   │  ├─ AWS Bedrock
└─ Mobile Apps (Future)          ├─ Business Logic Services          │  ├─ OpenAI
                                 ├─ Database (PostgreSQL)            │  ├─ Anthropic (Claude)
                                 └─ Background Workers               │  └─ Google Gemini
                                                                     ├─ Email (Resend)
                                                                     ├─ Billing (Stripe)
                                                                     ├─ Ticketing (Peppermint)
                                                                     ├─ Threat Intel (OTX)
                                                                     ├─ Malware Scanning (ClamAV)
                                                                     └─ Monitoring/Alerts

Core Data Entities

Primary Entities

  • User: Platform administrators and customers
  • Organization: Multi-tenant isolation boundary
  • Scan: Security assessment jobs
  • Vulnerability: Security findings
  • Asset: Scanned targets (domains, IPs, etc.)
  • Evidence: GRC compliance artifacts
  • Control: Compliance framework controls
  • Risk: Risk register entries

Security Entities

  • QuarantinedFile: Malware-detected files
  • AuditLog: Security audit trail
  • APIKey: Customer BYOK (Bring Your Own Keys)
  • PlatformAdminKey: Platform AI provider keys

1. Authentication and Authorization Flow

User Authentication Process

mermaid
sequenceDiagram
    participant U as User
    participant F as Frontend
    participant A as Auth API
    participant DB as Database
    participant JWT as JWT Service

    U->>F: Login Request (username/password)
    F->>A: POST /api/v1/auth/login
    A->>DB: Validate Credentials
    DB-->>A: User Data + Organization
    A->>JWT: Generate JWT Token
    JWT-->>A: Signed Token
    A-->>F: JWT + User Context
    F->>F: Store Token (localStorage)
    F-->>U: Redirect to Dashboard

    Note over A: Multi-tenancy enforced at DB level
    Note over JWT: Token includes org_id for tenant isolation

Authorization & Multi-Tenancy

Security Boundaries:

  1. Platform Admin vs Customer: UserType enum distinguishes Bonum staff from customers
  2. Organization Isolation: All queries filtered by organization_id
  3. Role-Based Access: Owner/Admin/Member/Viewer permissions within organizations
  4. Quota Enforcement: Subscription tier limits enforced per organization

Data Isolation Mechanisms:

  • Database row-level security via organization_id foreign keys
  • UserContext object carries tenant information through all requests
  • API endpoints automatically filter by current user's organization
  • Platform admins bypass isolation for support purposes

JWT Token Structure

json
{
  "sub": "username",
  "exp": 1640995200,
  "iat": 1640908800,
  "organization_id": 123,
  "user_type": "CUSTOMER",
  "role": "ADMIN"
}
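The token above can be minted and validated with standard HS256 signing. The sketch below uses only the Python standard library to make the structure concrete; the actual auth service presumably uses a JWT library, and only the claim names are taken from the example above.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_jwt(claims: dict, secret: str) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"


def verify_jwt(token: str, secret: str) -> dict:
    """Check the signature and expiry, then return the claims."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid signature")
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

Because `organization_id` travels inside the signed payload, a client cannot tamper with its tenant binding without invalidating the signature.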

2. Security Scan Initiation and Processing Flow

Scan Lifecycle State Machine

mermaid
stateDiagram-v2
    [*] --> IDLE
    IDLE --> DISCOVERY: Start Scan
    DISCOVERY --> ENUMERATION: Targets Found
    ENUMERATION --> VULN_SCAN: Services Enumerated
    VULN_SCAN --> CORRELATION: Vulnerabilities Found
    CORRELATION --> THREAT_INTEL: Duplicates Removed
    THREAT_INTEL --> EXPLOIT_CHECK: Intel Gathered
    EXPLOIT_CHECK --> PRIORITIZATION: Exploits Checked
    PRIORITIZATION --> REMEDIATION: Risks Prioritized
    REMEDIATION --> VERIFICATION: Fixes Generated
    VERIFICATION --> REPORTING: Fixes Verified
    REPORTING --> COMPLETED: Report Generated

    DISCOVERY --> FAILED: Error
    ENUMERATION --> FAILED: Error
    VULN_SCAN --> FAILED: Error
    CORRELATION --> FAILED: Error
    THREAT_INTEL --> FAILED: Error
    EXPLOIT_CHECK --> FAILED: Error
    PRIORITIZATION --> FAILED: Error
    REMEDIATION --> FAILED: Error
    VERIFICATION --> FAILED: Error
    REPORTING --> FAILED: Error
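The transitions above form a linear pipeline in which every active phase can also fail. A minimal encoding of that state machine (names follow the diagram; the worker's real implementation is not shown in this document):

```python
from enum import Enum


class ScanPhase(str, Enum):
    IDLE = "IDLE"
    DISCOVERY = "DISCOVERY"
    ENUMERATION = "ENUMERATION"
    VULN_SCAN = "VULN_SCAN"
    CORRELATION = "CORRELATION"
    THREAT_INTEL = "THREAT_INTEL"
    EXPLOIT_CHECK = "EXPLOIT_CHECK"
    PRIORITIZATION = "PRIORITIZATION"
    REMEDIATION = "REMEDIATION"
    VERIFICATION = "VERIFICATION"
    REPORTING = "REPORTING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# The happy path is strictly ordered; any active phase may drop to FAILED.
PIPELINE = [ScanPhase.IDLE, ScanPhase.DISCOVERY, ScanPhase.ENUMERATION,
            ScanPhase.VULN_SCAN, ScanPhase.CORRELATION, ScanPhase.THREAT_INTEL,
            ScanPhase.EXPLOIT_CHECK, ScanPhase.PRIORITIZATION, ScanPhase.REMEDIATION,
            ScanPhase.VERIFICATION, ScanPhase.REPORTING, ScanPhase.COMPLETED]


def next_phase(current: ScanPhase, failed: bool = False) -> ScanPhase:
    """Advance one step along the pipeline, or drop to FAILED on error."""
    if failed:
        return ScanPhase.FAILED
    if current in (ScanPhase.COMPLETED, ScanPhase.FAILED):
        raise ValueError(f"{current.value} is a terminal state")
    return PIPELINE[PIPELINE.index(current) + 1]
```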

Scan Processing Pipeline

mermaid
sequenceDiagram
    participant U as User
    participant F as Frontend
    participant S as Scan API
    participant SW as Scan Worker
    participant AI as AI Service
    participant DB as Database
    participant EXT as External Tools

    U->>F: Create New Scan
    F->>S: POST /api/v1/scans
    S->>DB: Create Scan Record (PENDING)
    S->>SW: Queue Background Job
    S-->>F: Scan Created Response
    F-->>U: Show Scan Progress

    SW->>SW: Start Processing (RUNNING)
    SW->>DB: Update Status: DISCOVERY
    SW->>EXT: Run Nuclei/Nmap Scans
    EXT-->>SW: Raw Scan Results
    SW->>DB: Update Status: VULN_SCAN
    SW->>AI: Analyze Vulnerabilities
    AI-->>SW: Enriched Data + Remediation
    SW->>DB: Store Vulnerabilities
    SW->>DB: Update Status: COMPLETED
    SW->>S: Trigger Notifications

    Note over SW: Each phase updates progress %
    Note over AI: PII redaction + prompt injection protection

Background Worker Architecture

The scan worker runs as a separate background process:

  • Threat Scanner Service: Coordinates external security tools
  • State Machine: Manages scan phase transitions
  • AI Orchestrator: Enriches findings with threat intelligence
  • Notification Dispatcher: Sends real-time updates

External Security Tools:

  • Nuclei: Vulnerability scanner for web applications
  • Nmap: Network discovery and port scanning
  • ClamAV: Malware detection for uploaded files
  • OTX (Open Threat Exchange): Threat intelligence enrichment
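Raw tool output must be normalized before it is stored as findings. As an illustration, Nmap's `-oX` XML output can be reduced to open-port records roughly like this (a sketch with an inline sample document; the worker's actual parsing logic may differ):

```python
import xml.etree.ElementTree as ET

# Abbreviated example of nmap -oX output for illustration only.
SAMPLE = """\
<nmaprun>
  <host>
    <address addr="203.0.113.5" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="443"><state state="open"/><service name="https"/></port>
      <port protocol="tcp" portid="25"><state state="filtered"/></port>
    </ports>
  </host>
</nmaprun>
"""


def parse_nmap_xml(xml_text: str) -> list:
    """Extract open ports and detected services from nmap XML output."""
    findings = []
    root = ET.fromstring(xml_text)
    for host in root.iter("host"):
        addr = host.find("address").get("addr")
        for port in host.iter("port"):
            if port.find("state").get("state") != "open":
                continue  # closed/filtered ports are not findings
            service = port.find("service")
            findings.append({
                "host": addr,
                "port": int(port.get("portid")),
                "protocol": port.get("protocol"),
                "service": service.get("name") if service is not None else None,
            })
    return findings
```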

3. GRC Compliance Assessment and Evidence Collection Workflow

Compliance Framework Mapping

mermaid
graph TD
    V[Vulnerability] --> CM[Compliance Mapper]
    CM --> SOC2[SOC 2 Controls]
    CM --> PCI[PCI-DSS Requirements]
    CM --> ISO[ISO 27001 Controls]
    CM --> HIPAA[HIPAA Safeguards]
    CM --> GDPR[GDPR Articles]

    SOC2 --> CC71[CC7.1 - System Boundaries]
    PCI --> REQ6[Req 6.5.1 - Injection Flaws]
    ISO --> A811[A.8.11 - Data Masking]

    CC71 --> EV1[Evidence Collection]
    REQ6 --> EV1
    A811 --> EV1

    EV1 --> AUTO[Auto-Evidence Service]
    AUTO --> SCAN[Scan Reports]
    AUTO --> CONFIG[Config Snapshots]
    AUTO --> POLICY[Policy Documents]
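The mapper can be pictured as a lookup from vulnerability category to the controls each framework associates with it. The single entry below mirrors the diagram; it is illustrative, not the platform's full mapping table:

```python
# Illustrative subset of the compliance mapping; real coverage is far broader.
CONTROL_MAP = {
    "injection": {
        "SOC2": ["CC7.1"],
        "PCI-DSS": ["6.5.1"],
        "ISO27001": ["A.8.11"],
    },
}


def map_finding_to_controls(category: str) -> dict:
    """Return the framework controls a finding category maps to, if any."""
    return CONTROL_MAP.get(category.lower(), {})
```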

Evidence Collection Workflow

mermaid
sequenceDiagram
    participant A as Auditor
    participant GRC as GRC Dashboard
    participant AE as Auto-Evidence
    participant FS as File Scanner
    participant Q as Quarantine
    participant DB as Database

    A->>GRC: Request Evidence for Control
    GRC->>AE: Trigger Auto-Collection
    AE->>DB: Query Related Vulnerabilities
    AE->>AE: Generate Evidence Report
    AE->>FS: Submit for Malware Scan

    alt File is Clean
        FS-->>AE: Scan Result: CLEAN
        AE->>DB: Store Evidence
        AE-->>GRC: Evidence Ready
    else File is Malicious
        FS->>Q: Quarantine File
        FS-->>AE: Scan Result: MALICIOUS
        AE->>DB: Log Security Alert
        AE-->>GRC: Evidence Collection Failed
    end

    Note over FS: ClamAV + OTX threat intel
    Note over Q: Admin review required

Audit Engagement Process

Phases:

  1. Planning: Define scope, controls, and evidence requirements
  2. Fieldwork: Collect evidence, perform testing, document findings
  3. Review: Analyze evidence, validate controls, identify exceptions
  4. Reporting: Generate audit report with findings and recommendations
  5. Completed: Final report delivery and follow-up planning

Evidence Types:

  • Policy: Written procedures and policies
  • Procedure: Step-by-step implementation guides
  • Screenshot: Visual evidence of controls
  • Configuration: System configuration exports
  • Log: Audit trails and access logs
  • Scan Report: Automated vulnerability assessments
  • Attestation: Management assertions and certifications
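These evidence types can be captured as an enum plus a small record, roughly as follows (field names here are assumptions for illustration; the actual model lives in the backend):

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceType(str, Enum):
    POLICY = "policy"
    PROCEDURE = "procedure"
    SCREENSHOT = "screenshot"
    CONFIGURATION = "configuration"
    LOG = "log"
    SCAN_REPORT = "scan_report"
    ATTESTATION = "attestation"


@dataclass
class EvidenceRecord:
    evidence_id: str
    control_id: str          # e.g. "CC7.1"
    framework: str           # e.g. "SOC2"
    evidence_type: EvidenceType
    organization_id: int     # tenant isolation applies to evidence too
```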

4. AI-Powered Analysis Pipeline and Chat Integration

AI Service Architecture

mermaid
graph TD
    UI[User Interface] --> CHAT[Chat API]
    UI --> SCAN[Scan Analysis]

    CHAT --> AIS[AI Service]
    SCAN --> AIS

    AIS --> PII[PII Redaction Layer]
    PII --> GUARD[Prompt Injection Guards]
    GUARD --> ROUTE[Provider Router]

    ROUTE --> BEDROCK[AWS Bedrock]
    ROUTE --> OPENAI[OpenAI GPT]
    ROUTE --> CLAUDE[Anthropic Claude]
    ROUTE --> GEMINI[Google Gemini]

    BEDROCK --> ZDR[Zero Data Retention]
    OPENAI --> ZDR
    CLAUDE --> ZDR
    GEMINI --> ZDR

    ZDR --> SANITIZE[Output Sanitization]
    SANITIZE --> RAG[Knowledge Base]
    RAG --> RESPONSE[Final Response]

Chat Interaction Flow

mermaid
sequenceDiagram
    participant U as User
    participant C as Chat Interface
    participant AI as AI Service
    participant PII as PII Redactor
    participant P as AI Provider
    participant RAG as Knowledge Base
    participant DB as Database

    U->>C: Ask Security Question
    C->>AI: Process Chat Message
    AI->>PII: Redact Sensitive Data
    PII-->>AI: Cleaned Input
    AI->>AI: Check Prompt Injection
    AI->>RAG: Query Knowledge Base
    RAG-->>AI: Relevant Context
    AI->>P: Send to AI Provider (Bedrock/OpenAI/Claude/Gemini)
    P-->>AI: AI Response
    AI->>PII: Redact Response
    PII-->>AI: Clean Response
    AI->>DB: Log Interaction (Audit Trail)
    AI-->>C: Final Response
    C-->>U: Display Answer

    Note over PII: Removes emails, SSNs, API keys, etc.
    Note over AI: Zero data retention compliance
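The redaction step can be approximated with pattern substitution before text crosses the provider boundary. The patterns below are a deliberately simplified sketch of the idea; production redaction presumably covers many more identifier types:

```python
import re

# Simplified patterns; a real redactor covers many more PII classes.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:sk|pk)[-_][A-Za-z0-9]{16,}\b"), "[API_KEY]"),
]


def redact(text: str) -> str:
    """Replace detected PII with placeholders before text leaves the trust boundary."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

The same function runs on the provider's response, so leaked identifiers are scrubbed in both directions.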

AI Provider Selection & Configuration

Provider Priority (Configurable per Organization):

  1. AWS Bedrock (Highest Security): HIPAA compliance, data sovereignty
  2. Anthropic Claude: Strong reasoning, safety features
  3. OpenAI GPT: General-purpose, cost-effective
  4. Google Gemini: Multimodal capabilities

Zero Data Retention (ZDR) Compliance:

  • All providers configured to prevent training on customer data
  • Ephemeral processing with minimal retention periods
  • Audit trails for all AI interactions
  • Customer BYOK (Bring Your Own Keys) support

Knowledge Base (RAG) System

Document Types:

  • Static: Policies, procedures, architecture docs (manually created)
  • Dynamic: Scan summaries, asset profiles, trend insights (auto-generated)

RAG Pipeline:

  1. Document ingestion and chunking
  2. Embedding generation (AI-powered)
  3. Vector storage and indexing
  4. Semantic similarity search
  5. Context injection into AI prompts
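Step 4 reduces to nearest-neighbor search over embedding vectors. A dependency-free sketch of cosine-similarity retrieval (the platform's actual vector store and embedding model are not specified in this document):

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, index, k=3):
    """index: (doc_id, embedding) pairs; returns the k most similar doc ids."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```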

5. Real-Time Updates via WebSocket and Notification Flow

Real-Time Communication Architecture

mermaid
graph TD
    SCAN[Scan Worker] --> EVENTS[Event Publisher]
    AI[AI Analysis] --> EVENTS
    GRC[GRC Updates] --> EVENTS

    EVENTS --> WS[WebSocket Manager]
    EVENTS --> EMAIL[Email Service]
    EVENTS --> SLACK[Slack Integration]

    WS --> CLIENTS[Connected Clients]
    EMAIL --> RESEND[Resend API]
    SLACK --> WEBHOOK[Slack Webhooks]

    CLIENTS --> ADMIN[Admin Portal]
    CLIENTS --> MOBILE[Mobile App]

    RESEND --> SMTP[Email Delivery]
    WEBHOOK --> CHANNELS[Slack Channels]

Notification Event Types

json
{
  "scan_started": {
    "scan_id": "scan_123",
    "target": "example.com",
    "organization_id": 123,
    "timestamp": "2024-01-01T10:00:00Z"
  },
  "scan_progress": {
    "scan_id": "scan_123",
    "phase": "VULN_SCAN",
    "progress": 45,
    "organization_id": 123
  },
  "vulnerability_found": {
    "vuln_id": "vuln_456",
    "severity": "HIGH",
    "title": "SQL Injection Vulnerability",
    "scan_id": "scan_123",
    "organization_id": 123
  },
  "evidence_collected": {
    "evidence_id": "ev_789",
    "control_id": "CC7.1",
    "framework": "SOC2",
    "organization_id": 123
  }
}
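Every event carries an organization_id, and the WebSocket layer must route on it so one tenant never receives another's events. A transport-agnostic sketch of that routing (the real manager wraps live WebSocket connections rather than plain callables):

```python
class ConnectionManager:
    """Delivers events only to clients belonging to the event's organization."""

    def __init__(self):
        self._clients = []  # (organization_id, send_callable) pairs

    def connect(self, organization_id: int, send) -> None:
        self._clients.append((organization_id, send))

    def broadcast(self, event: dict) -> None:
        target_org = event["organization_id"]
        for org_id, send in self._clients:
            if org_id == target_org:  # tenant isolation at the transport layer
                send(event)
```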

Email Notification Flow

mermaid
sequenceDiagram
    participant S as System Event
    participant N as Notification Dispatcher
    participant E as Email Service
    participant R as Resend API
    participant U as User

    S->>N: Critical Vulnerability Found
    N->>N: Check User Preferences
    N->>E: Queue Email Notification
    E->>R: Send via Resend
    R-->>U: Email Delivered
    R-->>E: Delivery Confirmation
    E->>N: Log Delivery Status

    Note over N: Respects user notification preferences
    Note over R: Template-based emails with branding

6. Multi-Tenant Data Isolation

Database-Level Isolation

Row-Level Security (RLS):

sql
-- Example: Organizations can only see their own scans
CREATE POLICY org_isolation_scans ON scans
  FOR ALL TO app_role
  USING (organization_id = current_setting('app.current_org_id')::int);

-- Example: Platform admins bypass all restrictions
CREATE POLICY admin_bypass_scans ON scans
  FOR ALL TO platform_admin_role
  USING (true);

Foreign Key Enforcement:

  • All tenant data tables have organization_id foreign key
  • Database constraints prevent cross-tenant data access
  • Queries automatically filtered by organization context
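For the RLS policies above to take effect, the application must populate app.current_org_id before a request's queries run. A sketch of the per-transaction binding (the exact mechanism depends on the connection layer):

```sql
-- Per-request tenant binding; SET LOCAL scopes the value to this transaction
BEGIN;
SET LOCAL app.current_org_id = '123';
SELECT * FROM scans;   -- RLS now restricts rows to organization 123
COMMIT;
```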

Application-Level Isolation

UserContext Enforcement:

python
# Every API endpoint enforces tenant isolation
@router.get("/vulnerabilities")
async def get_vulnerabilities(
    current_user: UserContext = Depends(get_current_user),
    db: Session = Depends(get_db)
):
    # Query is always filtered by the caller's organization_id
    return db.query(Vulnerability).filter(
        Vulnerability.organization_id == current_user.organization_id
    ).all()

Subscription Tier Limits:

  • Scan quotas enforced per organization
  • User limits by subscription tier
  • Feature access controlled by tier configuration
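Quota enforcement then reduces to comparing usage against the tier's configured limit before new work is queued. Tier names and numbers below are placeholders, not the platform's actual pricing:

```python
# Placeholder limits; real values come from tier configuration.
TIER_LIMITS = {
    "starter": {"scans_per_month": 10, "users": 5},
    "business": {"scans_per_month": 100, "users": 50},
}


class QuotaExceeded(Exception):
    pass


def check_scan_quota(tier: str, scans_used_this_month: int) -> None:
    """Raise before a new scan is queued if the organization is at its limit."""
    limit = TIER_LIMITS[tier]["scans_per_month"]
    if scans_used_this_month >= limit:
        raise QuotaExceeded(f"'{tier}' tier allows {limit} scans per month")
```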

7. Security Boundaries and Data Protection

Encryption at Rest

  • Database: Sensitive fields encrypted using EncryptedString/EncryptedJSON
  • File Storage: Evidence attachments encrypted in S3
  • Credentials: API keys encrypted in database

Encryption in Transit

  • API Communication: HTTPS with TLS 1.2+
  • Database Connections: SSL/TLS encrypted
  • AI Provider APIs: Provider-native encryption

Input Validation and Sanitization

  • API Validation: Pydantic models for all inputs
  • SQL Injection Prevention: Parameterized queries only
  • XSS Protection: HTML escaping and CSP headers
  • File Upload Scanning: ClamAV malware detection

Audit Logging

  • User Actions: All API calls logged with user context
  • Data Changes: Change tracking for sensitive entities
  • Security Events: Failed logins, permission escalations
  • AI Interactions: All AI queries and responses logged

8. Caching and Performance Optimization

Caching Layers

mermaid
graph TD
    USER[User Request] --> CDN[CDN Cache]
    CDN --> LB[Load Balancer]
    LB --> APP[Application Cache]
    APP --> DB[Database]

    APP --> REDIS[Redis Cache]
    APP --> MEMORY[In-Memory Cache]

    REDIS --> SESSIONS[User Sessions]
    REDIS --> SCAN[Scan Results]
    REDIS --> AI[AI Responses]

    MEMORY --> CONFIG[Configuration]
    MEMORY --> STATIC[Static Data]
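The Redis layer follows a cache-aside pattern: look up, on miss compute, then store with a TTL. A single-process stand-in that shows the shape of the logic (in production Redis provides the same get/set-with-expiry semantics shared across app instances):

```python
import time


class TTLCache:
    """Cache-aside with expiry; Redis plays this role across app instances."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get_or_compute(self, key, compute, ttl_seconds=300):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]          # cache hit, still fresh
        value = compute()            # cache miss: do the expensive work
        self._store[key] = (value, time.time() + ttl_seconds)
        return value
```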

Performance Optimizations

Database:

  • Connection pooling (20 base, 30 overflow)
  • Query optimization with indexes
  • Read replicas for reporting queries
  • Automated vacuum and reindex

API Layer:

  • Response compression (gzip)
  • Pagination for large datasets
  • Async request processing
  • Background job queuing

Frontend:

  • Code splitting and lazy loading
  • Asset compression and minification
  • Service worker caching
  • Progressive Web App features

9. External Service Integrations

AI Providers

  • AWS Bedrock: Primary for sensitive data (HIPAA/SOC2 compliant)
  • OpenAI: Cost-effective general analysis
  • Anthropic Claude: Advanced reasoning and safety
  • Google Gemini: Multimodal analysis capabilities

Communication Services

  • Resend: Transactional email delivery
  • Slack: Team notifications and alerts
  • Peppermint: Helpdesk and ticketing integration

Security Services

  • ClamAV: File malware scanning
  • AlienVault OTX: Threat intelligence feeds
  • Nuclei: Vulnerability scanning engine

Business Services

  • Stripe: Subscription billing and payments
  • AWS S3: File storage and backups
  • Railway/Digital Ocean: Infrastructure hosting

10. Error Handling and Monitoring

Error Handling Strategy

  • Graceful Degradation: System continues functioning with reduced capability
  • Circuit Breakers: Prevent cascade failures in external integrations
  • Retry Logic: Exponential backoff for transient failures
  • User-Friendly Messages: Technical errors translated to user-friendly language
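The retry policy above is typically implemented as exponential backoff with jitter, capped so waits stay bounded. A minimal sketch of that pattern:

```python
import random
import time


def retry(fn, attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn, retrying transient failures with full-jitter exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise               # retries exhausted: surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds
```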

Monitoring and Observability

  • Prometheus Metrics: API performance and business metrics
  • Structured Logging: JSON logs with correlation IDs
  • Health Checks: Endpoint monitoring for all services
  • Alert Management: Automated alerts for critical issues

Disaster Recovery

  • Database Backups: Automated daily backups with point-in-time recovery
  • File Storage: Cross-region replication for evidence attachments
  • Infrastructure: Blue-green deployment capability
  • Data Export: Organization data export for compliance

Security Considerations Summary

  1. Multi-Tenancy: Strict organization isolation at database and application layers
  2. Authentication: JWT-based with role-based access control
  3. Data Protection: Encryption at rest and in transit
  4. Input Validation: Comprehensive sanitization and validation
  5. AI Security: PII redaction, prompt injection protection, zero data retention
  6. Audit Trail: Complete logging of all user actions and system events
  7. Compliance: SOC2, ISO27001, HIPAA, GDPR alignment
  8. Incident Response: Automated threat detection and quarantine

This data flow architecture ensures secure, scalable, and compliant operation of the CyberOrigen platform while maintaining performance and user experience.
