Protocol v5.1.0

THE HARBOR STRATEGY.

Don't just publish. Engineer. Harbor replaces the "spray and pray" SEO model with a high-fidelity, agentic research protocol.

The 4-Stage Agentic Loop.

Most AI tools are just wrappers around a single prompt. Harbor is a multi-agent ecosystem that performs iterative research loops with real-time scraping, structured extraction, and intelligent deduplication.

  • Real-time Scraping: Jina + BrightData fallback
  • JSON Schema Mode: strict output formatting
  • Zero Cannibalization: 4-layer deduplication
STEP_01_INDEX_PARSING

Sitemap Index Intelligence

Our agents don't blindly crawl. They analyze sitemap indexes to identify 'money sitemaps' (products, collections) while ignoring overhead such as image sitemaps and foreign-language variants (/de/, /fr/, /es/).

  • Detects if sitemap.xml is an index with sub-sitemaps
  • AI selects 1-5 most relevant sitemaps from the index
  • Filters out: image sitemaps, video sitemaps, admin pages
  • Prioritizes: products, collections, categories, pages
  • Fetches up to 300 URLs per selected sitemap
client.responses.create({ model: 'gpt-5-nano', text: { format: { type: 'json_schema', strict: true } } })
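
The deterministic part of this step, detecting an index and dropping obvious overhead before the model makes its pick, might look roughly like the sketch below. It is an illustration under assumptions, not Harbor's actual code: `parseSitemap`, `prefilterSitemaps`, and the exclusion patterns are hypothetical, and the regex-based `<loc>` extraction is a shortcut where a real XML parser would be more robust. The AI selection of the 1-5 most relevant sitemaps is deliberately left out.

```javascript
// Hypothetical helpers; names and patterns are illustrative assumptions.

// Returns { isIndex, locs }: whether the document is a sitemap index,
// plus every <loc> value it contains.
function parseSitemap(xml) {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const locs = [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/gi)].map((m) => m[1]);
  return { isIndex, locs };
}

// Assumed exclusion rules: media sitemaps, admin paths, and locale prefixes.
const EXCLUDE = [/image/i, /video/i, /admin/i, /\/(de|fr|es)\//i];

// Drops overhead sitemaps before the model picks the most relevant ones.
function prefilterSitemaps(locs) {
  return locs.filter((loc) => !EXCLUDE.some((re) => re.test(loc)));
}

const sampleIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap_products_1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_images.xml</loc></sitemap>
  <sitemap><loc>https://example.com/de/sitemap_products_1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_collections_1.xml</loc></sitemap>
</sitemapindex>`;

const { isIndex, locs } = parseSitemap(sampleIndex);
const candidates = prefilterSitemaps(locs);
console.log(isIndex, candidates);
```

Here only the products and collections sitemaps survive the pre-filter; the image sitemap and the /de/ variant are dropped before any model call is made.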
STEP_02_URL_SELECTION

Smart URL Sampling

From a pool of thousands, Harbor shuffles and samples 300 URLs, then uses AI reasoning to pick the 50 pages most relevant to your target keyword, the ones that best represent your site's topical authority.

  • Shuffles URLs randomly for variety (no alphabetical bias)
  • Samples 300 URLs from the total pool
  • AI selects top 50 most relevant for target keyword
  • Homepage always included automatically
  • Filters out: login, cart, checkout, admin, legal pages
  • Classifies each URL by page type (product, collection, about)
const shuffledUrls = [...urls].sort(() => Math.random() - 0.5).slice(0, 300)
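
One caveat about the one-liner above: sorting with a random comparator is a common shorthand, but it does not produce a uniformly random order, and the result depends on the JavaScript engine's sort algorithm. A Fisher-Yates shuffle is the usual unbiased alternative. The sketch below is an assumption about how the sampling rules in the list could be combined; `shuffle`, `sampleUrls`, the `EXCLUDED` pattern, and the injectable `rng` parameter are all hypothetical names.

```javascript
// Hypothetical sketch; function names and the exclusion pattern are assumptions.

// Assumed pattern for utility pages that carry no topical signal.
const EXCLUDED = /\/(login|cart|checkout|admin|legal)(\/|$)/i;

// Unbiased Fisher-Yates shuffle on a copy of the input.
// `rng` is injectable so the behavior can be made deterministic in tests.
function shuffle(items, rng = Math.random) {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Filters utility pages, samples up to `size` URLs, and always keeps the homepage.
function sampleUrls(urls, homepage, size = 300, rng = Math.random) {
  const pool = urls.filter((u) => u !== homepage && !EXCLUDED.test(new URL(u).pathname));
  const sample = shuffle(pool, rng).slice(0, size - 1);
  return [homepage, ...sample];
}

const urls = [
  'https://example.com/',
  'https://example.com/products/trail-shoe',
  'https://example.com/cart',
  'https://example.com/collections/running',
  'https://example.com/login',
];
const sampled = sampleUrls(urls, 'https://example.com/', 300);
console.log(sampled);
```

The homepage is placed first and counted against the sample size, so the result never exceeds `size` entries.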
STEP_03_MASS_EXTRACTION

Parallel Entity Extraction

Running up to 5 requests in parallel with Promise.allSettled(), we scrape each selected page and extract structured data (pricing, images, offerings) to build a temporary knowledge base for your content.

  • Up to 5 parallel requests (configurable)
  • Promise.allSettled() for resilient batch processing
  • Jina for standard pages, BrightData for protected sites
  • Extracts: titles, headings, descriptions, pricing
  • Extracts: images with alt text and context
  • Extracts: offerings, contact info, internal links
  • Each page scored 0-1 for relevance
await Promise.allSettled(batch.map(url => scrapeAndExtract(url)))
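
To make the batching concrete, here is a minimal sketch of how the line above could sit inside a loop. `processInBatches` and `fakeScrapeAndExtract` are hypothetical names, and the stub scraper simply simulates one blocked page; the real scraping and fallback logic is not shown.

```javascript
// Hypothetical sketch; `processInBatches` and the stub scraper are assumptions.

// Runs `worker` over `items` in fixed-size batches. Promise.allSettled never
// rejects, so one failed page cannot abort the rest of its batch.
async function processInBatches(items, worker, concurrency = 5) {
  const succeeded = [];
  const failed = [];
  for (let i = 0; i < items.length; i += concurrency) {
    const batch = items.slice(i, i + concurrency);
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    settled.forEach((result, k) => {
      if (result.status === 'fulfilled') succeeded.push(result.value);
      else failed.push({ item: batch[k], reason: String(result.reason) });
    });
  }
  return { succeeded, failed };
}

// Stub standing in for the real scrape-and-extract call.
async function fakeScrapeAndExtract(url) {
  if (url.includes('blocked')) throw new Error('HTTP 403');
  return { url, title: `Title for ${url}` };
}

const demoUrls = [
  'https://example.com/a',
  'https://example.com/blocked',
  'https://example.com/b',
];
const demoRun = processInBatches(demoUrls, fakeScrapeAndExtract, 2);
demoRun.then(({ succeeded, failed }) => {
  console.log(succeeded.length, failed.length);
});
```

Fixed-size batches are the simplest way to cap concurrency, at the cost of each batch waiting for its slowest page. A worker pool would keep all slots busy, but that is a refinement beyond what the text describes.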
STEP_04_STRATEGY_GEN

Contextual Synthesis

The OpenAI Responses API synthesizes the extracted entities into a master strategy, so new content slots into your existing site graph and does not compete with pages you already have for the same keywords.

  • Sorts results by relevance score descending
  • Generates 2-3 sentence summary focused on keyword
  • Maps internal links from actual scraped URLs
  • Queries existing titles to prevent duplication
  • 4-layer anti-cannibalization enforcement
  • Returns structured analysis with internalLinks array
const synthesis = await client.responses.create({ model: 'gpt-5-nano', input: synthesisPrompt })
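
The non-model parts of this step, ranking by relevance and deriving link candidates from real scraped URLs, could be sketched as below. Only `relevanceScore` and the `internalLinks` array come from the description above; `buildAnalysis`, `maxLinks`, and the `url`, `title`, and `anchorText` fields are assumptions made for illustration.

```javascript
// Hypothetical sketch; names other than `relevanceScore` and `internalLinks` are assumptions.

// Sorts pages by relevance (highest first) and derives internal-link
// candidates only from URLs that were actually scraped.
function buildAnalysis(pages, maxLinks = 5) {
  const ranked = [...pages].sort((a, b) => b.relevanceScore - a.relevanceScore);
  const internalLinks = ranked.slice(0, maxLinks).map((page) => ({
    url: page.url,
    anchorText: page.title,
    relevanceScore: page.relevanceScore,
  }));
  return { ranked, internalLinks };
}

const pages = [
  { url: 'https://example.com/about', title: 'About Us', relevanceScore: 0.31 },
  { url: 'https://example.com/products/trail-shoe', title: 'Trail Shoe', relevanceScore: 0.92 },
  { url: 'https://example.com/collections/running', title: 'Running Collection', relevanceScore: 0.77 },
];
const analysis = buildAnalysis(pages, 2);
console.log(analysis.internalLinks.map((link) => link.url));
```

Because the links are built from the scraped set, a generated article cannot point at a URL that does not exist on the site.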

4-Layer Anti-Cannibalization

Unlike blind AI wrappers, Harbor implements database-level deduplication to prevent keyword cannibalization.

1. Domain-Scoped Query

Before generating any keyword, Harbor queries every title it has previously generated for your domain's hostname.

getAllPreviousSiteSeekerTitles({ sitemapUrl })
2. Status Filtering

Only completed, non-generating records are included. In-progress articles won't block new topics, but finished content creates a permanent exclusion zone.

status === 'completed' && siteSeeker.keywords
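
Taken together, the domain scope and the status filter amount to one query over stored records. The sketch below is an in-memory stand-in, not the actual database call: `previousTitlesForDomain` is a hypothetical name, and the record shape (`sitemapUrl`, `title`, `status`, `siteSeeker.keywords`) is inferred from the snippets above rather than confirmed.

```javascript
// Hypothetical in-memory stand-in for the database query; the record shape is assumed.
function previousTitlesForDomain(records, sitemapUrl) {
  const hostname = new URL(sitemapUrl).hostname;
  return records
    // Layer 1: keep only records that belong to the same hostname.
    .filter((r) => new URL(r.sitemapUrl).hostname === hostname)
    // Layer 2: keep only finished records that carry keywords.
    .filter((r) => r.status === 'completed' && Boolean(r.siteSeeker && r.siteSeeker.keywords))
    .map((r) => r.title);
}

const records = [
  {
    sitemapUrl: 'https://example.com/sitemap.xml',
    title: 'Best Trail Shoes',
    status: 'completed',
    siteSeeker: { keywords: ['trail shoes'] },
  },
  {
    sitemapUrl: 'https://example.com/sitemap.xml',
    title: 'Draft In Progress',
    status: 'generating',
    siteSeeker: { keywords: ['draft'] },
  },
  {
    sitemapUrl: 'https://other.com/sitemap.xml',
    title: 'Other Site Post',
    status: 'completed',
    siteSeeker: { keywords: ['other'] },
  },
];
const existingTitles = previousTitlesForDomain(records, 'https://example.com/sitemap_index.xml');
console.log(existingTitles);
```

Matching on hostname rather than the full sitemap URL means two different sitemap paths on the same domain share one exclusion list.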
3. Prompt Injection

The AI receives an explicit list of existing titles with instructions to avoid identical, similar, or semantically overlapping topics.

previousTitlesSection in systemPrompt
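
As an illustration of what a `previousTitlesSection` might contain, the sketch below turns a list of titles into a block of prompt text. The wording of the instructions, the base prompt, and the name `buildPreviousTitlesSection` are assumptions; only the idea of listing existing titles with an avoid-overlap instruction comes from the description above.

```javascript
// Hypothetical sketch; the prompt wording is an assumption, not Harbor's actual prompt.

// Renders existing titles as a numbered exclusion list for the system prompt.
// Returns an empty string when there is nothing to exclude.
function buildPreviousTitlesSection(titles) {
  if (titles.length === 0) return '';
  const list = titles.map((title, i) => `${i + 1}. ${title}`).join('\n');
  return [
    'PREVIOUSLY GENERATED TITLES (do not repeat):',
    list,
    'Avoid topics that are identical, similar, or semantically overlapping with any title above.',
  ].join('\n');
}

const knownTitles = ['Best Trail Shoes', 'How to Choose Running Socks'];
const previousTitlesSection = buildPreviousTitlesSection(knownTitles);
const systemPrompt = ['You are an SEO strategist.', previousTitlesSection]
  .filter(Boolean)
  .join('\n\n');
console.log(systemPrompt);
```

Returning an empty string for a new domain keeps the system prompt free of an empty, confusing exclusion header.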
4. Semantic Distinctness

For pillar generation, the AI must create 15 distinct subniches with zero semantic overlap: no two pillars may cover similar ground.

NEVER repeat topics or create pillars that are semantically similar
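
The rule above is enforced through the prompt, but the output can also be spot-checked afterwards. The sketch below is a crude lexical stand-in, not a semantic check and not something the description says Harbor does: it flags pillar pairs whose word sets overlap heavily (Jaccard similarity). The stopword list, the 0.5 threshold, and all function names are assumptions; an embedding-based comparison would be closer to true semantic similarity.

```javascript
// Hypothetical post-check; a lexical proxy only, with an assumed threshold.

const STOPWORDS = new Set(['a', 'an', 'the', 'for', 'and', 'of', 'to', 'in', 'on', 'with']);

// Lowercases, splits on non-alphanumerics, and drops stopwords.
function tokens(text) {
  return new Set(
    text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w && !STOPWORDS.has(w))
  );
}

// Jaccard similarity: shared words divided by total distinct words.
function jaccard(a, b) {
  const setA = tokens(a);
  const setB = tokens(b);
  const shared = [...setA].filter((w) => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : shared / union;
}

// Returns every pair of pillars at or above the similarity threshold.
function findOverlappingPairs(pillars, threshold = 0.5) {
  const pairs = [];
  for (let i = 0; i < pillars.length; i++) {
    for (let j = i + 1; j < pillars.length; j++) {
      if (jaccard(pillars[i], pillars[j]) >= threshold) pairs.push([pillars[i], pillars[j]]);
    }
  }
  return pairs;
}

const pillars = [
  'Trail running shoes for beginners',
  'Beginner trail running shoes',
  'Marathon nutrition planning',
];
const overlapping = findOverlappingPairs(pillars);
console.log(overlapping);
```

In this sample the first two pillars share three of five distinct words and are flagged, while the nutrition pillar passes.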

Structured Data Extraction

Every scraped page is parsed into a consistent JSON schema, enabling intelligent content synthesis and internal linking.

Field            Type      Description
title            string    Page title
headings         array     All H1/H2/H3 headings
descriptions     array     Meta and content descriptions
pricing          array     Items with name, price, currency
images           array     URLs with alt text and context
offerings        array     Products/services with descriptions
links            array     Internal links with anchor text
contactInfo      object    Email, phone, address
relevanceScore   number    0-1 relevance to keyword

// Example extraction output
{
  "title": "Premium Running Shoes",
  "headings": ["Features", "Sizing"],
  "pricing": [{
    "item": "Air Max Pro",
    "price": "189.99",
    "currency": "USD"
  }],
  "images": [{
    "url": "/shoes/air-max.jpg",
    "alt": "Air Max Pro side view",
    "context": "Product hero image"
  }],
  "offerings": [{
    "name": "Air Max Pro",
    "description": "Cushioned running"
  }],
  "relevanceScore": 0.92
}
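
For the `json_schema` format with `strict: true` used in the Responses API call in step 1, the request also needs a schema name and the schema itself, and strict mode expects every object to list all of its properties as required and to set `additionalProperties: false`. The sketch below shows one way the table above could be expressed that way. The helper `strictObject`, the schema name `page_extraction`, the input text, and the sub-field names for `links` and `contactInfo` are assumptions; the request object is only constructed here, not sent.

```javascript
// Hypothetical schema sketch; sub-field names not shown in the example output are assumed.

// Builds a closed object schema: every property required, no extras allowed.
function strictObject(properties) {
  return {
    type: 'object',
    properties,
    required: Object.keys(properties),
    additionalProperties: false,
  };
}

const str = { type: 'string' };
const arrayOf = (items) => ({ type: 'array', items });

const extractionSchema = strictObject({
  title: str,
  headings: arrayOf(str),
  descriptions: arrayOf(str),
  pricing: arrayOf(strictObject({ item: str, price: str, currency: str })),
  images: arrayOf(strictObject({ url: str, alt: str, context: str })),
  offerings: arrayOf(strictObject({ name: str, description: str })),
  links: arrayOf(strictObject({ url: str, anchorText: str })),
  contactInfo: strictObject({ email: str, phone: str, address: str }),
  relevanceScore: { type: 'number' },
});

// Shape of a strict structured-output request body (constructed only, not sent).
const extractionRequest = {
  model: 'gpt-5-nano',
  input: 'Extract structured data from the following page content.',
  text: {
    format: {
      type: 'json_schema',
      name: 'page_extraction',
      strict: true,
      schema: extractionSchema,
    },
  },
};
console.log(JSON.stringify(extractionRequest.text.format.schema.required));
```

Passing `extractionRequest` to `client.responses.create(...)` would then constrain the model's output to this shape, so downstream code can parse it without defensive checks for missing fields.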

SITEMAP-AWARE SYNTHESIS.

Harbor Graph

DOMAIN-SCOPED DEDUPLICATION

The agent queries every title previously generated for your domain's hostname before generating new topics.

SEMANTIC LINK MAPPING

Internal links are selected from your actual sitemap URLs, scored for relevance, and placed at semantically appropriate positions.

BRAND VOICE EXTRACTION

Harbor autonomously extracts your brand's voice from the scraped pages so generated content keeps a consistent tone.

REWRITE THE RULES.

Stop using 2022 tools for a 2026 market.