Protocol v5.1.0

THE HARBOR STRATEGY.

Don't just publish. Engineer. Harbor replaces the "spray and pray" SEO model with a high-fidelity, agentic research protocol.

The 4-Stage Agentic Loop.

Most AI tools are just wrappers around a single prompt. Harbor is a multi-agent ecosystem that performs iterative research loops with real-time scraping, structured extraction, and intelligent deduplication.

Real-time Scraping

Jina + BrightData Fallback

JSON Schema Mode

Strict output formatting

Zero Cannibalization

4-layer deduplication

STEP_01_INDEX_PARSING

Sitemap Index Intelligence

Our agents don't blindly crawl. They analyze sitemap indexes to identify 'money sitemaps' (products, collections) and skip overhead such as image sitemaps and foreign-language variants (/de/, /fr/, /es/).

• Detects if sitemap.xml is an index with sub-sitemaps
• AI selects 1-5 most relevant sitemaps from the index
• Filters out: image sitemaps, video sitemaps, admin pages
• Prioritizes: products, collections, categories, pages
• Fetches up to 300 URLs per selected sitemap

client.responses.create({ model: 'gpt-5-nano', text: { format: { type: 'json_schema', strict: true } } })
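The pre-filtering described above can be sketched without any AI call. This is a minimal illustration, not Harbor's implementation: the helper names (isSitemapIndex, extractLocs, preFilterSitemaps) and the overhead patterns are assumptions, and a regex stands in for a real XML parser.

```typescript
// Hypothetical helpers: none of these names come from Harbor's codebase.

/** True when the XML document is a sitemap index rather than a plain URL set. */
function isSitemapIndex(xml: string): boolean {
  return /<sitemapindex[\s>]/i.test(xml);
}

/** Collect every <loc> value from a sitemap or sitemap index. */
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    const loc = match[1];
    if (loc) locs.push(loc);
  }
  return locs;
}

// Assumed heuristics for "overhead" sitemaps; a real filter would be tuned per site.
const OVERHEAD_PATTERNS: RegExp[] = [
  /image/i,
  /video/i,
  /admin/i,
  /\/(de|fr|es)\//i, // the foreign-language variants named in the text
];

/** Drop overhead sitemaps so only candidates reach the AI ranking step. */
function preFilterSitemaps(sitemapUrls: readonly string[]): string[] {
  return sitemapUrls.filter((url) => {
    // Match against the path only, so a domain like "imageshop.com" is not penalized.
    const path = url.replace(/^https?:\/\/[^/]+/i, "");
    return !OVERHEAD_PATTERNS.some((pattern) => pattern.test(path));
  });
}
```

Filtering before the model call keeps the candidate list short, so the AI only has to rank sitemaps that could plausibly matter.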

STEP_02_URL_SELECTION

Smart URL Sampling

From a pool of thousands, Harbor shuffles and samples 300 URLs, then uses AI reasoning to pick the 50 pages most relevant to your target keyword, the pages that anchor your site's topical authority.

• Shuffles URLs randomly for variety (no alphabetical bias)
• Samples 300 URLs from the total pool
• AI selects top 50 most relevant for target keyword
• Homepage always included automatically
• Filters out: login, cart, checkout, admin, legal pages
• Classifies each URL by page type (product, collection, about)

const shuffledUrls = [...urls].sort(() => Math.random() - 0.5).slice(0, 300)
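One caveat on the one-liner above: sorting with a random comparator does not produce a uniform shuffle, and the bias depends on the engine's sort algorithm. A Fisher-Yates shuffle avoids that. The sketch below is an assumption-laden illustration: sampleUrls, EXCLUDED_PATH, and the choice to count the homepage toward the 300-URL cap are hypothetical, not taken from Harbor.

```typescript
/** Return a new array shuffled with Fisher-Yates; the input is not mutated. */
function shuffle<T>(items: readonly T[], random: () => number = Math.random): T[] {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1)); // 0 <= j <= i
    const tmp = result[i] as T;
    result[i] = result[j] as T;
    result[j] = tmp;
  }
  return result;
}

// Assumed exclusion pattern covering the page types listed in the text.
const EXCLUDED_PATH = /\/(login|cart|checkout|admin|legal)(\/|$|\?)/i;

/**
 * Filter out utility pages, shuffle the remainder, and return at most
 * `sampleSize` URLs with the homepage always in first position.
 */
function sampleUrls(
  urls: readonly string[],
  homepage: string,
  sampleSize = 300,
): string[] {
  const candidates = urls.filter(
    (url) => url !== homepage && !EXCLUDED_PATH.test(url),
  );
  return [homepage, ...shuffle(candidates).slice(0, sampleSize - 1)];
}
```

Passing a seeded generator as the `random` argument makes a sample reproducible, which helps when comparing runs.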

STEP_03_MASS_EXTRACTION

Parallel Entity Extraction

Harbor scrapes pages five at a time and settles each batch with Promise.allSettled(), so a single failed request does not abort the rest. The structured data it extracts (pricing, images, offerings) forms a temporary knowledge base for your content.

• 5x parallel concurrency (configurable)
• Promise.allSettled() for resilient batch processing
• Jina for standard pages, BrightData for protected sites
• Extracts: titles, headings, descriptions, pricing
• Extracts: images with alt text and context
• Extracts: offerings, contact info, internal links
• Each page scored 0-1 for relevance

await Promise.allSettled(batch.map(url => scrapeAndExtract(url)))
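To show how the batching around that call might look, here is a rough sketch. The scraper is passed in as a function, so nothing here assumes how Jina or BrightData are actually invoked; extractInBatches, withFallback, and the result shape are hypothetical names.

```typescript
interface ExtractionOutcome<T> {
  succeeded: { url: string; data: T }[];
  failed: { url: string; reason: unknown }[];
}

/** Wrap a primary scraper so a second one is tried when the first rejects. */
function withFallback<T>(
  primary: (url: string) => Promise<T>,
  fallback: (url: string) => Promise<T>,
): (url: string) => Promise<T> {
  return async (url) => {
    try {
      return await primary(url);
    } catch {
      return fallback(url);
    }
  };
}

/** Process URLs in fixed-size batches; a rejected page never aborts its batch. */
async function extractInBatches<T>(
  urls: readonly string[],
  scrapeAndExtract: (url: string) => Promise<T>,
  concurrency = 5,
): Promise<ExtractionOutcome<T>> {
  const outcome: ExtractionOutcome<T> = { succeeded: [], failed: [] };
  for (let start = 0; start < urls.length; start += concurrency) {
    const batch = urls.slice(start, start + concurrency);
    const settled = await Promise.allSettled(
      batch.map((url) => scrapeAndExtract(url)),
    );
    settled.forEach((result, index) => {
      const url = batch[index] as string;
      if (result.status === "fulfilled") {
        outcome.succeeded.push({ url, data: result.value });
      } else {
        outcome.failed.push({ url, reason: result.reason });
      }
    });
  }
  return outcome;
}
```

Fixed batches are simple but wait for the slowest page in each group. A worker pool would keep all five slots busy at the cost of slightly more code.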

STEP_04_STRATEGY_GEN

Contextual Synthesis

The OpenAI Responses API synthesizes the extracted entities into a master strategy, ensuring your new content is perfectly nested within your existing site graph with zero keyword cannibalization.

• Sorts results by relevance score descending
• Generates 2-3 sentence summary focused on keyword
• Maps internal links from actual scraped URLs
• Queries existing titles to prevent duplication
• 4-layer anti-cannibalization enforcement
• Returns structured analysis with internalLinks array

const synthesis = await client.responses.create({ model: 'gpt-5-nano', input: synthesisPrompt })
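The sorting and link-mapping bullets above can be illustrated before any model call is made. This sketch assumes a shape for the internalLinks entries and invents the maxLinks and minScore parameters; Harbor's actual thresholds are not stated in this document.

```typescript
interface ScoredPage {
  url: string;
  title: string;
  relevanceScore: number; // 0-1, as assigned during extraction
}

// Assumed shape for entries of the internalLinks array.
interface InternalLink {
  url: string;
  anchorText: string;
  relevanceScore: number;
}

/**
 * Build link candidates only from pages that were actually scraped,
 * ordered by relevance score descending.
 */
function buildInternalLinks(
  pages: readonly ScoredPage[],
  maxLinks = 10, // hypothetical cap
  minScore = 0.5, // hypothetical relevance floor
): InternalLink[] {
  return pages
    .filter((page) => page.relevanceScore >= minScore)
    .sort((a, b) => b.relevanceScore - a.relevanceScore) // filter() returned a copy, so the input is untouched
    .slice(0, maxLinks)
    .map((page) => ({
      url: page.url,
      anchorText: page.title,
      relevanceScore: page.relevanceScore,
    }));
}
```

Because every link comes from a scraped URL, the synthesis step cannot reference a page that does not exist on the site.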

4-Layer Anti-Cannibalization

Unlike blind AI wrappers, Harbor implements database-level deduplication to prevent keyword cannibalization.

Domain-Scoped Query

Before generating any keyword, Harbor queries all previously generated titles from your specific domain hostname.

getAllPreviousSiteSeekerTitles({ sitemapUrl })

Status Filtering

Only completed, non-generating records are included. In-progress articles won't block new topics, but finished content creates a permanent exclusion zone.

status === 'completed' && siteSeeker.keywords
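The first two layers amount to a scoped, filtered query. As an in-memory sketch only: Harbor runs this against its database, while the record shape, the status values other than 'completed', and the function names below are assumptions made for illustration.

```typescript
// Hypothetical record shape; only `status` and `siteSeeker.keywords` appear in the text.
interface ArticleRecord {
  sitemapUrl: string;
  status: "pending" | "generating" | "completed" | "failed";
  title: string;
  siteSeeker?: { keywords?: string[] };
}

/** Extract a lowercase hostname, or null when the URL is not http(s). */
function hostnameOf(url: string): string | null {
  const match = /^https?:\/\/([^/:?#]+)/i.exec(url);
  const host = match?.[1];
  return host ? host.toLowerCase() : null;
}

/** Layers 1 and 2: same hostname, completed records with keywords only. */
function previousTitlesForDomain(
  records: readonly ArticleRecord[],
  sitemapUrl: string,
): string[] {
  const host = hostnameOf(sitemapUrl);
  if (host === null) return [];
  return records
    .filter((record) => hostnameOf(record.sitemapUrl) === host)
    .filter(
      (record) =>
        record.status === "completed" && Boolean(record.siteSeeker?.keywords),
    )
    .map((record) => record.title);
}
```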

Prompt Injection

The AI receives an explicit list of existing titles with instructions to avoid identical, similar, or semantically overlapping topics.

previousTitlesSection in systemPrompt
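One plausible way to assemble that section is shown below. The wording of the instructions and both function names are illustrative assumptions; only the idea of a previousTitlesSection placed in the system prompt comes from the text.

```typescript
/** Render the exclusion list; returns an empty string when there is nothing to exclude. */
function buildPreviousTitlesSection(previousTitles: readonly string[]): string {
  if (previousTitles.length === 0) return "";
  const numbered = previousTitles
    .map((title, index) => `${index + 1}. ${title}`)
    .join("\n");
  return [
    "PREVIOUSLY GENERATED TITLES (do not repeat):",
    numbered,
    "Avoid topics that are identical, similar, or semantically overlapping with any title above.",
  ].join("\n");
}

/** Append the exclusion section to a base system prompt when it is non-empty. */
function buildSystemPrompt(
  basePrompt: string,
  previousTitles: readonly string[],
): string {
  const section = buildPreviousTitlesSection(previousTitles);
  return section ? `${basePrompt}\n\n${section}` : basePrompt;
}
```

Skipping the section entirely for new domains keeps the prompt short and avoids an empty "do not repeat" list that could confuse the model.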

Semantic Distinctness

For pillar generation, the AI must create 15 distinct subniches with zero semantic overlap: no two pillars can cover similar ground.

NEVER repeat topics or create pillars that are semantically similar

Structured Data Extraction

Every scraped page is parsed into a consistent JSON schema, enabling intelligent content synthesis and internal linking.

Field	Type	Description
title	string	Page title
headings	array	All H1/H2/H3 headings
descriptions	array	Meta and content descriptions
pricing	array	Entries with item, price, currency
images	array	URLs with alt text and context
offerings	array	Products/services with descriptions
links	array	Internal links with anchor text
contactInfo	object	Email, phone, address
relevanceScore	number	0-1 relevance to keyword

// Example extraction output

{
  "title": "Premium Running Shoes",
  "headings": ["Features", "Sizing"],
  "pricing": [{
    "item": "Air Max Pro",
    "price": "189.99",
    "currency": "USD"
  }],
  "images": [{
    "url": "/shoes/air-max.jpg",
    "alt": "Air Max Pro side view",
    "context": "Product hero image"
  }],
  "offerings": [{
    "name": "Air Max Pro",
    "description": "Cushioned running"
  }],
  "relevanceScore": 0.92
}
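For consumers of this output, the schema could be mirrored as TypeScript types with a runtime guard. This is a sketch under assumptions: fields absent from the example above (descriptions, links, contactInfo) are treated as optional, and the shapes of links and contactInfo are inferred from the table descriptions rather than confirmed.

```typescript
interface PricingItem { item: string; price: string; currency: string }
interface PageImage { url: string; alt: string; context: string }
interface Offering { name: string; description: string }
interface PageLink { url: string; anchorText: string } // assumed shape
interface ContactInfo { email?: string; phone?: string; address?: string } // assumed shape

interface ExtractedPage {
  title: string;
  headings: string[];
  descriptions?: string[];
  pricing?: PricingItem[];
  images?: PageImage[];
  offerings?: Offering[];
  links?: PageLink[];
  contactInfo?: ContactInfo;
  relevanceScore: number; // 0-1 relevance to keyword
}

/** Minimal runtime check of the required fields; nested arrays are not validated. */
function isExtractedPage(value: unknown): value is ExtractedPage {
  if (typeof value !== "object" || value === null) return false;
  const page = value as Record<string, unknown>;
  const headings = page["headings"];
  const score = page["relevanceScore"];
  return (
    typeof page["title"] === "string" &&
    Array.isArray(headings) &&
    headings.every((heading: unknown) => typeof heading === "string") &&
    typeof score === "number" &&
    score >= 0 &&
    score <= 1
  );
}

/** Parse raw model output, returning null for malformed or out-of-range data. */
function parseExtractedPage(json: string): ExtractedPage | null {
  try {
    const parsed: unknown = JSON.parse(json);
    return isExtractedPage(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

A guard like this is a second line of defense: strict JSON schema mode constrains the model's output, while the runtime check protects code paths that read stored or cached results.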

SITEMAP-AWARE SYNTHESIS.

DOMAIN-SCOPED DEDUPLICATION

The agent queries all previously generated titles from your specific domain hostname before generating new topics.

SEMANTIC LINK MAPPING

Internal links are selected from your actual sitemap URLs, scored for relevance, and placed at semantically appropriate positions.

BRAND VOICE EXTRACTION

Autonomous extraction of your brand's unique voice across scraped pages to ensure tone consistency in generated content.

REWRITE THE RULES.

Stop using 2022 tools for a 2026 market.

