Discovery Engine

JobScout

Job boards are noisy by design. JobScout monitors 130+ company career pages directly, bypasses the aggregator middleman, and ranks results with H1B sponsorship awareness built in from the start. It runs as an independent discovery layer that feeds clean, high-quality job data into the AutoApply AI workflow.

130+ career pages · 6 ATS platforms · Sponsorship-aware · Live dashboard · AutoApply AI integration · FastAPI backend
Python · FastAPI · BeautifulSoup · Playwright · PostgreSQL · Docker · GitHub Pages · REST API
130+ career pages monitored
6 ATS platforms covered
H1B sponsorship-aware ranking
0 job board intermediaries

The Problem

Job boards aggregate listings from hundreds of sources, add duplicates, and serve them with ads. By the time a listing appears on LinkedIn or Indeed, it has already been posted on the company's own career page - sometimes days earlier. Worse, job boards don't know or care whether a company sponsors H1B visas. That signal has to be inferred manually, role by role, company by company.

For an international candidate in a competitive market, working from noisy job board data means burning application effort on roles that were never realistic. The fix isn't a better filter on the same bad data - it's going to the source. JobScout monitors company career pages directly, so the data is fresher, the signal is cleaner, and the sponsorship context is baked in from the start.

System Architecture

Three layers: monitoring, enrichment, and delivery. Each has a clean boundary.

Monitoring Layer: Scheduled Crawler → 130+ Company Career Pages
Detection + Enrichment: ATS Platform Detector → Sponsorship Signal Enricher → Dedup Engine
Storage + Delivery: PostgreSQL (job listings) → FastAPI (REST endpoints)
Consumers: Live Dashboard, AutoApply AI Integration

Monitoring layer

A scheduled crawler visits 130+ company career pages on a configurable cadence. Dynamic pages (SPA-heavy sites like Workday-hosted boards) are handled with Playwright for JavaScript rendering; static pages use BeautifulSoup. The crawler detects which ATS platform each company uses and routes accordingly.
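A minimal sketch of that routing step, assuming the ATS tag has already been detected. The DYNAMIC_ATS set, function names, and the CSS selector are illustrative, not the actual JobScout internals:

    # Route each career page to the right fetch strategy.
    import requests
    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    DYNAMIC_ATS = {"workday", "icims"}  # assumed to need JavaScript rendering

    def fetch_listings_html(url: str, ats: str) -> str:
        """Return page HTML, using Playwright only when the ATS is SPA-heavy."""
        if ats in DYNAMIC_ATS:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page()
                page.goto(url, wait_until="networkidle")
                html = page.content()
                browser.close()
                return html
        return requests.get(url, timeout=30).text

    def extract_titles(html: str) -> list[str]:
        # Placeholder selector; each real ATS parser uses its own.
        soup = BeautifulSoup(html, "html.parser")
        return [a.get_text(strip=True) for a in soup.select("a.job-title")]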

Enrichment layer

Each discovered listing is enriched with: ATS platform tag (Workday, Greenhouse, Lever, Ashby, LinkedIn, iCIMS), known H1B sponsorship history for the company, and a deduplication check against existing records. Sponsorship data is maintained as a separate lookup table seeded from public H1B disclosure data.
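A sketch of what one enriched record might look like. The field names, the sha256 dedup key, and the in-memory sponsor lookup are assumptions for illustration; the real schema lives in PostgreSQL:

    # Illustrative enrichment step -- not the actual JobScout schema.
    import hashlib
    from dataclasses import dataclass

    @dataclass
    class EnrichedListing:
        company: str
        title: str
        location: str
        url: str
        ats: str                  # e.g. "greenhouse", "workday", "lever"
        sponsor_likely: bool      # from the H1B disclosure lookup table
        dedup_key: str            # stable hash used for the duplicate check

    def enrich(raw: dict, sponsor_lookup: dict[str, bool]) -> EnrichedListing:
        key = hashlib.sha256(
            f"{raw['company']}|{raw['title']}|{raw['location']}".lower().encode()
        ).hexdigest()
        return EnrichedListing(
            company=raw["company"],
            title=raw["title"],
            location=raw["location"],
            url=raw["url"],
            ats=raw["ats"],
            sponsor_likely=sponsor_lookup.get(raw["company"].lower(), False),
            dedup_key=key,
        )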

API and dashboard

FastAPI exposes search endpoints: filter by role type, location, ATS platform, and sponsorship status. The live dashboard renders job listings with one-click apply links that hand off to AutoApply AI. Results are ranked so sponsor-likely roles surface at the top - the specific use case this tool was built to solve.
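A hedged sketch of what such an endpoint could look like. The /jobs route, parameter names, Listing model, and the query_listings stub are assumptions, not the actual API surface:

    # Search endpoint with sponsorship-aware ordering (illustrative).
    from fastapi import FastAPI, Query
    from pydantic import BaseModel

    app = FastAPI()

    class Listing(BaseModel):
        company: str
        title: str
        location: str
        ats: str
        sponsor_likely: bool
        url: str

    def query_listings(**filters) -> list[Listing]:
        # Stand-in for the PostgreSQL query; returns filtered rows in production.
        return []

    @app.get("/jobs", response_model=list[Listing])
    def search_jobs(
        role: str | None = Query(None),
        location: str | None = Query(None),
        ats: str | None = Query(None),
        sponsorship: bool | None = Query(None),
    ):
        rows = query_listings(role=role, location=location, ats=ats,
                              sponsorship=sponsorship)
        # Sponsor-likely roles surface first.
        return sorted(rows, key=lambda r: not r.sponsor_likely)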

Why It's a Separate System

JobScout was deliberately extracted as a standalone service rather than baked into AutoApply AI. The reasons are practical: discovery and application run on different cadences. Discovery needs to run continuously on a schedule regardless of whether anyone is actively applying; application is event-driven. Coupling them would mean the crawler is only active when someone opens the Chrome extension.

The API boundary between JobScout and AutoApply AI is also the right abstraction for future growth. Any application system can call the API - a CLI, another browser extension, a mobile app. The discovery layer doesn't need to know what consumes it. This is the same reasoning behind extracting tailor-resume as a separate PyPI package rather than keeping it inside AutoApply AI.
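Because the boundary is plain REST, a consumer can be a few lines of code. A hypothetical client call, assuming the endpoint and parameters sketched above:

    # Any consumer -- CLI, extension backend, script -- hits the same endpoint.
    import requests

    resp = requests.get(
        "http://localhost:8000/jobs",
        params={"role": "backend engineer", "sponsorship": True},
        timeout=10,
    )
    for job in resp.json():
        print(job["company"], "-", job["title"], "-", job["url"])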

JobScout live dashboard - job listings from 130+ company career pages with sponsorship indicators, ATS platform tags, and one-click hand-off to AutoApply AI.

ATS Platform Coverage

Different companies use different applicant tracking systems, and each ATS renders its job listings differently. Greenhouse uses a JSON API. Workday renders client-side. Lever has a public API. Ashby has its own schema. Each requires its own parsing strategy - one generic crawler fails at the first SPA it encounters.

JobScout covers 6 ATS platforms: Greenhouse, Workday, Lever, Ashby, LinkedIn Jobs, and iCIMS. The platform detector runs first and selects the right parser. Adding a new ATS is a new parser module - the rest of the pipeline doesn't change.
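A minimal sketch of the detect-then-parse pattern. The hostname heuristics and parser stubs are illustrative; the real detector and parser modules are more involved:

    # Detect the ATS first, then dispatch to its parser (illustrative).
    from urllib.parse import urlparse

    def detect_ats(url: str) -> str:
        host = urlparse(url).netloc
        if "greenhouse.io" in host:
            return "greenhouse"
        if "myworkdayjobs.com" in host:
            return "workday"
        if "lever.co" in host:
            return "lever"
        if "ashbyhq.com" in host:
            return "ashby"
        if "icims.com" in host:
            return "icims"
        return "unknown"

    # Each parser is a pure function over HTML/JSON; adding an ATS means
    # adding one entry here without touching the rest of the pipeline.
    def parse_greenhouse(payload): ...   # JSON board API
    def parse_workday(payload): ...      # client-side rendered pages
    def parse_lever(payload): ...        # public postings API
    def parse_ashby(payload): ...        # own schema
    def parse_linkedin(payload): ...
    def parse_icims(payload): ...

    PARSERS = {
        "greenhouse": parse_greenhouse,
        "workday": parse_workday,
        "lever": parse_lever,
        "ashby": parse_ashby,
        "linkedin": parse_linkedin,
        "icims": parse_icims,
    }

    def parse(url: str, payload: str):
        return PARSERS.get(detect_ats(url), lambda _: [])(payload)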

Honest Assessment: What Worked and What Didn't

What worked

Going direct to source. Career page data is consistently fresher than job board aggregates. Several listings appeared on the career page 2-3 days before showing on LinkedIn, with fewer duplicate entries and more accurate job details.

What failed

Bot detection on Workday pages. Workday's client-side rendering triggers bot detection on some domains even with Playwright. The mitigation was randomized request timing and user-agent rotation, which slowed the crawler significantly on Workday-hosted sites.
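A sketch of that mitigation: jittered delays plus a rotating user-agent header applied per request. Delay ranges and UA strings are illustrative:

    # Jittered pacing and UA rotation for Workday-hosted pages (illustrative).
    import random
    import time

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def polite_fetch(page, url: str) -> str:
        """Fetch one page with randomized pacing (expects a Playwright page)."""
        time.sleep(random.uniform(4.0, 12.0))  # jittered delay between hits
        page.set_extra_http_headers({"User-Agent": random.choice(USER_AGENTS)})
        page.goto(url, wait_until="networkidle")
        return page.content()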

What worked

Sponsorship-aware ranking. Surfacing sponsorship-likely companies at the top of results cut irrelevant application attempts significantly. The H1B lookup table, even when incomplete, dramatically improves signal quality for international candidates.

What failed

Closed job detection. Detecting when a job listing is no longer active requires re-crawling every URL periodically. Without a job state machine (open → closed → archived), stale listings persist in the database longer than they should.
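A minimal sketch of the kind of state machine that is missing today. The states, allowed transitions, and names are assumptions, not existing JobScout code:

    # Hypothetical job lifecycle: open -> closed -> archived, with timestamps.
    from datetime import datetime, timezone
    from enum import Enum

    class JobState(str, Enum):
        OPEN = "open"
        CLOSED = "closed"
        ARCHIVED = "archived"

    ALLOWED = {
        JobState.OPEN: {JobState.CLOSED},
        JobState.CLOSED: {JobState.ARCHIVED, JobState.OPEN},  # reposted roles reopen
        JobState.ARCHIVED: set(),
    }

    def transition(current: JobState, new: JobState) -> tuple[JobState, datetime]:
        if new not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current} -> {new}")
        return new, datetime.now(timezone.utc)  # timestamp every change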

What worked

Modular ATS parsers. The platform-detect-then-parse pattern made adding new ATS support straightforward. Each parser is isolated - breaking one doesn't break others, and testing is simple because parsers are pure functions over HTML/JSON.

What failed

Rate limiting from smaller companies. Large tech companies rarely rate-limit their career pages. Smaller companies sometimes return 429s on repeated visits, especially from a single IP. A rotating proxy layer would solve this but adds operational cost.
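Short of proxies, a 429-aware retry with exponential backoff softens the problem at the cost of throughput. A sketch with illustrative limits:

    # Retry on 429, honoring Retry-After when present (illustrative).
    import time
    import requests

    def fetch_with_backoff(url: str, max_retries: int = 4) -> requests.Response:
        delay = 5.0
        for _ in range(max_retries):
            resp = requests.get(url, timeout=30)
            if resp.status_code != 429:
                return resp
            wait = float(resp.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2
        return resp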

What I Would Build Next

  • Job state machine: track open → closed → archived transitions with timestamps, alert when monitored listings change status
  • Proxy rotation: solve the rate-limiting problem without having to slow the crawler for detection avoidance
  • Similarity deduplication: vector-based dedup to catch listings that are semantically identical but have slightly different titles or locations across multiple career page variants
  • Relevance scoring: score each listing against a stored target role profile (skills, seniority, domain) so the ranking reflects fit, not just sponsorship status
  • Alert pipeline: notify via webhook or email when new listings matching a saved search appear, without requiring the user to actively check the dashboard