Extract and Parse Web Page Content in Kotlin Using Jsoup

Modern applications often need to extract structured data from web pages that do not expose a public API. This is common for dashboards, content aggregators, internal tools, or automation scripts. In JVM-based ecosystems, Kotlin combined with Jsoup offers a clean and efficient solution for HTML scraping and parsing.

This article explains how to extract ranked list data from a public web page using Kotlin, covering:

HTTP fetching
HTML parsing
CSS selectors
Data modeling
Common pitfalls
Best practices for maintainability and legality

All examples are anonymized and apply to any ranking-style or list-based web page.

When Is HTML Parsing the Right Approach?

HTML parsing is appropriate when:

No official REST or GraphQL API exists
Data is publicly visible
You need lightweight read-only access
The page structure is relatively stable

⚠️ Always verify the website’s Terms of Service and robots.txt before scraping.

Technology Stack

Kotlin (JVM)
Jsoup – HTML parser with jQuery-like selectors
Optional: OkHttp (for advanced networking)

Adding Jsoup to Your Kotlin Project

Gradle (Kotlin DSL)

dependencies {
    implementation("org.jsoup:jsoup:1.17.2")
}

Understanding the Page Structure

Most ranking pages follow a predictable HTML pattern:

A container for the list
Repeating elements for each item
Nested tags for:
- position
- title
- author / artist / description

Using browser DevTools (Inspect Element), identify:

Common parent container
Repeating row structure
Semantic tags (h2, h3, span, etc.)
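
The same inspection can be done in code. The following is a minimal sketch that parses a made-up HTML fragment and lists the tag and class of every child element in each repeating row. The markup and all class names (`ranking-list`, `ranking-item`, `position`, `title`, `subtitle`) are assumptions for illustration; substitute whatever DevTools shows for your target page.

```kotlin
import org.jsoup.Jsoup

// Hypothetical markup; the class names are placeholders, not taken from a real site.
val sampleHtml = """
    <div class="ranking-list">
      <div class="ranking-item">
        <span class="position">1</span>
        <h3 class="title">First title</h3>
        <span class="subtitle">First author</span>
      </div>
      <div class="ranking-item">
        <span class="position">2</span>
        <h3 class="title">Second title</h3>
        <span class="subtitle">Second author</span>
      </div>
    </div>
""".trimIndent()

// Lists the tag and class of every child element of each repeating row.
fun describeRows(html: String): List<List<String>> {
    val container = Jsoup.parse(html).selectFirst("div.ranking-list") ?: return emptyList()
    return container.select("div.ranking-item").map { row ->
        row.children().map { child -> "${child.tagName()}.${child.className()}" }
    }
}

fun main() {
    describeRows(sampleHtml).forEach { println(it) }
    // [span.position, h3.title, span.subtitle]  (printed once per row)
}
```

If every row prints the same shape, the structure is regular enough to map onto a single data model.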

Data Model (Best Practice)

Create a clean domain model to decouple parsing from business logic.

data class RankedItem(
    val position: Int,
    val title: String,
    val subtitle: String
)
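
One way to get this decoupling is to keep parsing in a pure function that takes an already-loaded `Document` and returns domain objects, so it can be exercised against a fixed HTML string without any network access. The sketch below assumes a hypothetical helper named `parseRanking` and placeholder selectors; neither comes from a real site.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

data class RankedItem(val position: Int, val title: String, val subtitle: String)

// Hypothetical helper: maps each row to a RankedItem and skips rows
// that are missing a required field. Selectors are placeholders.
fun parseRanking(document: Document): List<RankedItem> =
    document.select("div.ranking-item").mapNotNull { row ->
        val position = row.selectFirst(".position")?.text()?.toIntOrNull()
        val title = row.selectFirst(".title")?.text()
        val subtitle = row.selectFirst(".subtitle")?.text()
        if (position != null && title != null && subtitle != null) {
            RankedItem(position, title, subtitle)
        } else {
            null
        }
    }

fun main() {
    val html = """
        <div class="ranking-item"><span class="position">1</span>
          <h3 class="title">Alpha</h3><span class="subtitle">Someone</span></div>
        <div class="ranking-item"><h3 class="title">No position here</h3></div>
    """.trimIndent()
    println(parseRanking(Jsoup.parse(html)))
    // [RankedItem(position=1, title=Alpha, subtitle=Someone)]
}
```

Business logic then depends only on `List<RankedItem>` and never touches Jsoup types.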

Fetching and Parsing the Page

Basic Kotlin + Jsoup Example

import org.jsoup.Jsoup

fun main() {
    val document = Jsoup
        .connect("https://example.com/ranking-page")
        .userAgent("Mozilla/5.0")
        .timeout(10_000)
        .get()

    val items = mutableListOf<RankedItem>()

    val rows = document.select("div.ranking-item") // example selector

    for (row in rows) {
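        // selectFirst returns null when nothing matches, so every field is nullable here
        val position = row.selectFirst(".position")?.text()?.toIntOrNull()
        val title = row.selectFirst(".title")?.text()
        val subtitle = row.selectFirst(".subtitle")?.text()

        // Skip rows with a missing field instead of failing the whole run
        if (position != null && title != null && subtitle != null) {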
            items.add(
                RankedItem(
                    position = position,
                    title = title,
                    subtitle = subtitle
                )
            )
        }
    }

    items.forEach { println(it) }
}

Choosing the Right CSS Selectors

Jsoup supports most of the CSS selector syntax, plus a few extensions of its own such as `:contains(text)`:

Selector	Description
`.class`	Elements with class
`#id`	Element by ID
`div > h2`	Direct child
`div h2`	Any descendant
`:nth-child(n)`	Positional selection

Example

document.select("div.chart-entry h2")
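
To see how the child, descendant, and positional selectors differ, here is a small sketch run against a made-up fragment. The markup and the helper name `textsFor` are assumptions for illustration.

```kotlin
import org.jsoup.Jsoup

// Hypothetical fragment: one h2 is a direct child of the div, the other is nested deeper.
val selectorDemoHtml = """
    <div class="chart-entry">
      <h2>Direct</h2>
      <section><h2>Nested</h2></section>
    </div>
    <ul id="top"><li>One</li><li>Two</li><li>Three</li></ul>
""".trimIndent()

// Returns the text of every element matched by the given CSS query.
fun textsFor(cssQuery: String): List<String> =
    Jsoup.parse(selectorDemoHtml).select(cssQuery).eachText()

fun main() {
    println(textsFor("div.chart-entry > h2"))  // [Direct]
    println(textsFor("div.chart-entry h2"))    // [Direct, Nested]
    println(textsFor("#top li:nth-child(2)"))  // [Two]
}
```

The child combinator (`>`) is the stricter of the two, which makes it less likely to pick up unrelated headings nested further down the tree.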

Handling Encoding and Performance

Recommended Settings

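val document = Jsoup.connect(url)
    .timeout(10_000)           // request timeout in milliseconds
    .maxBodySize(2_000_000)    // read at most about 2 MB of the response body
    .userAgent("Mozilla/5.0")  // user agent string sent with the request
    .get()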
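
On the encoding side, `connect(...).get()` normally works out the charset on its own from the HTTP `Content-Type` header or a `<meta>` tag in the page. When you already hold the raw bytes (for example from a cache or another HTTP client) and know their charset, you can pass it explicitly. The sketch below uses made-up ISO-8859-1 bytes; the helper name `parseWithCharset` is hypothetical.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import java.io.ByteArrayInputStream

// Hypothetical helper: parses raw bytes using an explicitly named charset.
fun parseWithCharset(bytes: ByteArray, charsetName: String, baseUri: String): Document =
    Jsoup.parse(ByteArrayInputStream(bytes), charsetName, baseUri)

fun main() {
    // Made-up page content, encoded as ISO-8859-1 rather than UTF-8.
    val bytes = "<p class=\"title\">Café Müller</p>".toByteArray(Charsets.ISO_8859_1)
    val doc = parseWithCharset(bytes, "ISO-8859-1", "https://example.com/")
    println(doc.selectFirst("p.title")?.text())
}
```

Decoding the same bytes with the wrong charset is the usual cause of garbled accented characters in scraped text.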

Avoid

Parsing entire pages repeatedly
Blocking the UI thread (Android!)
Hardcoding brittle selectors
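
To avoid fetching and parsing the same page on every call, the result can be kept in memory for a limited time. The following is a minimal sketch of such a cache; the class name `TimedCache`, its parameters, and the injectable clock are all assumptions for illustration, not part of Jsoup.

```kotlin
// Hypothetical in-memory cache: reuses a value until it is older than ttlMillis.
class TimedCache<T>(
    private val ttlMillis: Long,
    private val clock: () -> Long = { System.currentTimeMillis() },
    private val loader: () -> T
) {
    private var value: T? = null
    private var loadedAt = 0L
    private var loaded = false

    @Synchronized
    fun get(): T {
        val now = clock()
        if (!loaded || now - loadedAt >= ttlMillis) {
            value = loader()   // in real use: fetch the page and parse it
            loadedAt = now
            loaded = true
        }
        @Suppress("UNCHECKED_CAST")
        return value as T
    }
}

fun main() {
    var fetches = 0
    var now = 0L   // fake clock so the example runs instantly
    val cache = TimedCache(ttlMillis = 60_000, clock = { now }) { fetches++; "parsed result" }
    cache.get()
    cache.get()        // served from the cache, loader is not called again
    now = 60_000
    cache.get()        // entry has expired, loader runs a second time
    println(fetches)   // 2
}
```

In a real application the loader lambda would wrap the fetch-and-parse step, and the time-to-live would reflect how often the source page actually changes.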

Error Handling

Always assume the structure can change.

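// Needs imports: org.jsoup.HttpStatusException, org.jsoup.nodes.Document, java.io.IOException
val doc: Document? = try {
    Jsoup.connect(url).get()
} catch (ex: HttpStatusException) {
    null // server answered with an error status (ex.statusCode): log + fallback
} catch (ex: IOException) {
    null // timeout, DNS, or connection failure: log + fallback
}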
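
Network exceptions are only half of the picture. When the markup changes, `select` does not throw; it simply returns an empty result, and the scraper quietly produces no data. A minimal sketch of making that failure explicit follows; the function `requireRows` and the exception type are hypothetical names introduced for illustration.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.select.Elements

// Hypothetical exception signalling that the expected markup was not found.
class PageStructureChangedException(selector: String) :
    IllegalStateException("No elements matched '$selector'; the page layout may have changed")

// Returns the matched rows, or throws if the selector no longer matches anything.
fun requireRows(document: Document, selector: String): Elements {
    val rows = document.select(selector)
    if (rows.isEmpty()) throw PageStructureChangedException(selector)
    return rows
}

fun main() {
    // Made-up page whose rows were renamed from "ranking-item" to "chart-row".
    val doc = Jsoup.parse("<div class=\"chart-row\">renamed markup</div>")
    try {
        requireRows(doc, "div.ranking-item")
    } catch (ex: PageStructureChangedException) {
        println(ex.message) // log and fall back, for example to previously cached data
    }
}
```

Failing loudly here turns a silent data gap into an error you can monitor and alert on.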

Android-Specific Notes

Use Dispatchers.IO
Never run Jsoup on the main thread
Consider caching results

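// withContext is a suspend function: call it from a coroutine or another suspend function
suspend fun fetchDocument(url: String): Document =
    withContext(Dispatchers.IO) {
        Jsoup.connect(url).get()
    }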

SEO & Content Aggregation Use Cases

This approach is commonly used for:

Music or media rankings
News headlines aggregation
Public statistics dashboards
Market monitoring tools
Research automation

Legal and Ethical Considerations

✔️ Public data only
✔️ Respect rate limits
✔️ No authentication bypass
❌ No personal data scraping
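
Respecting rate limits can be as simple as enforcing a minimum pause between requests. Below is a minimal sketch of such a throttle; the class name `Throttle`, the interval, and the injectable clock and sleep functions are assumptions for illustration. A suitable interval depends on the target site's published limits, if any.

```kotlin
// Hypothetical throttle: guarantees at least minIntervalMillis between calls.
class Throttle(
    private val minIntervalMillis: Long,
    private val clock: () -> Long = { System.currentTimeMillis() },
    private val sleep: (Long) -> Unit = { Thread.sleep(it) }
) {
    private var lastCall: Long? = null

    @Synchronized
    fun <T> execute(block: () -> T): T {
        val last = lastCall
        if (last != null) {
            val wait = minIntervalMillis - (clock() - last)
            if (wait > 0) sleep(wait)   // pause until the interval has elapsed
        }
        lastCall = clock()
        return block()
    }
}

fun main() {
    var now = 0L                       // fake clock so the example runs instantly
    val slept = mutableListOf<Long>()  // records how long the throttle waited
    val throttle = Throttle(1_000, clock = { now }, sleep = { slept += it; now += it })
    throttle.execute { "first request" }
    now += 300                         // only 300 ms pass before the next request
    throttle.execute { "second request" }
    println(slept)                     // [700]
}
```

In production the defaults use the system clock and `Thread.sleep`, and the block passed to `execute` would perform the actual page fetch.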

If long-term stability is required, request an official API whenever possible.

Final Thoughts

Using Kotlin + Jsoup provides a clean, expressive, and maintainable way to extract structured data from public web pages. With proper abstraction, selector hygiene, and error handling, this approach works for both backend services and Android applications.

When implemented responsibly, HTML parsing remains a powerful tool in a modern developer’s toolbox.
