Job boards are noisy by design. JobScout monitors 130+ company career pages directly, bypasses the aggregator middleman, and ranks results with H1B sponsorship awareness built in from the start. It runs as an independent discovery layer that feeds clean, high-quality job data into the AutoApply AI workflow.
Job boards aggregate listings from hundreds of sources, add duplicates, and serve them with ads. By the time a listing appears on LinkedIn or Indeed, it has already been posted on the company's own career page - sometimes days earlier. Worse, job boards don't know or care whether a company sponsors H1B visas. That signal has to be inferred manually, role by role, company by company.
For an international candidate in a competitive market, working from noisy job board data means burning application effort on roles that were never realistic. The fix isn't a better filter on the same bad data - it's going to the source. JobScout monitors company career pages directly, so the data is fresher, the signal is cleaner, and the sponsorship context is baked in from the start.
The system has three layers - monitoring, ranking, and delivery - each with a clean boundary.
A scheduled crawler visits 130+ company career pages on a configurable cadence. Dynamic pages (SPA-heavy platforms like Workday) are handled with Playwright for JavaScript rendering; static pages use BeautifulSoup, and platforms that expose JSON endpoints (such as Greenhouse and Lever) are fetched directly. The crawler detects which ATS platform each company uses and routes accordingly.
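The detect-then-route step can be sketched roughly as follows. The URL markers and platform names here are illustrative assumptions, not JobScout's actual detection rules:

```python
# Hypothetical sketch of ATS detection and routing; the URL markers
# below are assumptions, not JobScout's real heuristics.

ATS_MARKERS = {
    "greenhouse": "boards.greenhouse.io",
    "workday": "myworkdayjobs.com",
    "lever": "jobs.lever.co",
    "ashby": "jobs.ashbyhq.com",
}

# Platforms that render client-side need a real browser (Playwright);
# everything else is fetched statically and parsed with BeautifulSoup.
DYNAMIC_PLATFORMS = {"workday"}


def detect_platform(career_url: str) -> str:
    """Map a career-page URL to an ATS platform tag."""
    for platform, marker in ATS_MARKERS.items():
        if marker in career_url:
            return platform
    return "generic"


def needs_browser(platform: str) -> bool:
    """Route dynamic platforms to the Playwright-based crawler."""
    return platform in DYNAMIC_PLATFORMS
```

Keeping detection as a pure function over the URL makes the routing decision trivial to test without any network access.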
Each discovered listing is enriched with: ATS platform tag (Workday, Greenhouse, Lever, Ashby, LinkedIn, iCIMS), known H1B sponsorship history for the company, and a deduplication check against existing records. Sponsorship data is maintained as a separate lookup table seeded from public H1B disclosure data.
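A minimal sketch of the enrichment step, assuming an in-memory sponsor lookup and a hash-based dedup key; the field names and helper functions are hypothetical, not the real schema:

```python
from dataclasses import dataclass
from typing import Optional
import hashlib


@dataclass
class Listing:
    company: str
    title: str
    url: str
    ats: str = "unknown"
    sponsors_h1b: Optional[bool] = None  # None = company not in lookup table


# Hypothetical lookup table seeded from public H1B disclosure data.
SPONSOR_TABLE = {"acme": True, "globex": False}


def dedup_key(listing: Listing) -> str:
    # Same company + title collapses reposts of the same role.
    raw = f"{listing.company.lower()}|{listing.title.lower()}"
    return hashlib.sha1(raw.encode()).hexdigest()


def enrich(listing: Listing, seen: set) -> Optional[Listing]:
    """Attach sponsorship history; drop duplicates of existing records."""
    key = dedup_key(listing)
    if key in seen:
        return None  # duplicate of an existing record
    seen.add(key)
    listing.sponsors_h1b = SPONSOR_TABLE.get(listing.company.lower())
    return listing
```

Keeping "unknown" (`None`) distinct from "does not sponsor" (`False`) matters later, because the ranking layer treats unconfirmed companies differently from confirmed non-sponsors.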
FastAPI exposes search endpoints: filter by role type, location, ATS platform, and sponsorship status. The live dashboard renders job listings with one-click apply links that hand off to AutoApply AI. Rankings surface sponsor-likely roles at the top - the specific use case this tool was built to solve.
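The ranking logic behind the endpoint can be sketched as a plain sort; the field names (`sponsors_h1b`, `posted_ts`) and the `rank` helper are illustrative assumptions, not the real API surface:

```python
# Hypothetical ranking helper behind the search endpoint.
def rank(listings, sponsorship_only=False):
    """Sort sponsor-likely roles first, then newest postings."""
    rows = [l for l in listings
            if not sponsorship_only or l.get("sponsors_h1b")]
    # `is not True` sorts confirmed sponsors (key False) ahead of both
    # unknown (None) and confirmed non-sponsoring companies.
    return sorted(rows, key=lambda l: (l.get("sponsors_h1b") is not True,
                                       -l.get("posted_ts", 0)))
```

Doing the sort in one pure function keeps it testable independently of FastAPI's request handling.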
JobScout was deliberately extracted as a standalone service rather than baked into AutoApply AI. The reasons are practical: discovery and application are different cadences. Discovery needs to run continuously on a schedule regardless of whether anyone is actively applying. Application is event-driven. Coupling them would mean the crawler is only active when someone opens the Chrome extension.
The API boundary between JobScout and AutoApply AI is also the right abstraction for future growth. Any application system can call the API - a CLI, another browser extension, a mobile app. The discovery layer doesn't need to know what consumes it. This is the same reasoning behind extracting tailor-resume as a separate PyPI package rather than keeping it inside AutoApply AI.
Different companies use different applicant tracking systems, and each ATS renders its job listings differently. Greenhouse uses a JSON API. Workday renders client-side. Lever has a public API. Ashby has its own schema. Each requires its own parsing strategy - one generic crawler fails at the first SPA it encounters.
JobScout covers 6 ATS platforms: Greenhouse, Workday, Lever, Ashby, LinkedIn Jobs, and iCIMS. The platform detector runs first and selects the right parser. Adding a new ATS is a new parser module - the rest of the pipeline doesn't change.
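The detect-then-parse pattern can be sketched as a small registry. The registry itself is an illustrative pattern rather than JobScout's exact code; the Greenhouse field names (`jobs`, `title`, `absolute_url`) follow its public board JSON:

```python
# Hypothetical parser registry; each ATS module registers one parser
# and the pipeline dispatches by platform tag.
PARSERS = {}


def ats_parser(platform):
    """Register a parser function under an ATS platform tag."""
    def wrap(fn):
        PARSERS[platform] = fn
        return fn
    return wrap


@ats_parser("greenhouse")
def parse_greenhouse(payload):
    # Greenhouse boards expose listings as JSON under a "jobs" key.
    return [{"title": j["title"], "url": j["absolute_url"]}
            for j in payload.get("jobs", [])]
```

Under this pattern, supporting a new ATS is one new module with one `@ats_parser` call; the dispatch code and the rest of the pipeline stay untouched.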
Going direct to the source. Career page data is consistently fresher than job board aggregates. Several listings appeared on the career page 2-3 days before showing up on LinkedIn, with fewer duplicate entries and more accurate job details.
Bot detection on Workday pages. Workday's client-side rendering triggers bot detection on some domains even with Playwright. Had to add randomized request timing and user-agent rotation, which slowed the crawler significantly on Workday-hosted sites.
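The mitigation can be sketched as below; the delay bounds and user-agent strings are assumptions, not the tuned values actually used against Workday:

```python
import random

# Illustrative sketch of randomized request timing + user-agent rotation.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


def next_request_params(min_delay=2.0, max_delay=8.0):
    """Pick a randomized inter-request delay and user agent for the next fetch."""
    delay = random.uniform(min_delay, max_delay)
    return delay, random.choice(USER_AGENTS)
```

The slowdown mentioned above follows directly from this: every Workday request now waits a random multi-second delay instead of firing immediately.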
Sponsorship-aware ranking. Surfacing sponsorship-likely companies at the top of results cut irrelevant application attempts significantly. The H1B lookup table, even when incomplete, dramatically improves signal quality for international candidates.
Closed job detection. Detecting when a job listing is no longer active requires re-crawling every URL periodically. Without a job state machine (open → closed → archived), stale listings persist in the database longer than they should.
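The missing state machine could look something like this; the transition table matches the open → closed → archived lifecycle described above, while the enforcement code is a hypothetical sketch:

```python
from enum import Enum


class JobState(Enum):
    OPEN = "open"
    CLOSED = "closed"
    ARCHIVED = "archived"


# Legal transitions for the open -> closed -> archived lifecycle.
TRANSITIONS = {
    JobState.OPEN: {JobState.CLOSED},
    JobState.CLOSED: {JobState.ARCHIVED},
    JobState.ARCHIVED: set(),
}


def advance(state, target):
    """Move a listing to `target`, rejecting illegal jumps (e.g. open -> archived)."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.value} -> {target.value}")
    return target
```

With explicit states, the periodic re-crawl only needs to flip listings from open to closed; a later sweep archives them, and stale rows stop masquerading as live jobs.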
Modular ATS parsers. The platform-detect-then-parse pattern made adding new ATS support straightforward. Each parser is isolated - breaking one doesn't break others, and testing is simple because parsers are pure functions over HTML/JSON.
Rate limiting from smaller companies. Large tech companies rarely rate-limit their career pages. Smaller companies sometimes return 429s on repeated visits, especially from a single IP. A rotating proxy layer would solve this but adds operational cost.
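A cheaper alternative to a proxy layer is backing off on 429s. A minimal sketch, where the base and cap values are assumptions rather than measured numbers:

```python
import random

# Capped exponential backoff with jitter for 429 responses.
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-indexed)."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)
```

The jitter term keeps retries from the same IP from landing in lockstep, which is often enough for small career sites even without rotating proxies.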