Extract and Parse Web Page Content in Kotlin Using Jsoup

Modern applications often need to extract structured data from web pages that do not expose a public API. This is common for dashboards, content aggregators, internal tools, or automation scripts. In JVM-based ecosystems, Kotlin combined with Jsoup offers a clean and efficient solution for HTML scraping and parsing.

This article explains how to extract ranked list data from a public web page using Kotlin, covering:

  • HTTP fetching
  • HTML parsing
  • CSS selectors
  • Data modeling
  • Common pitfalls
  • Best practices for maintainability and legality

All examples are anonymized and apply to any ranking-style or list-based web page.


When Is HTML Parsing the Right Approach?

HTML parsing is appropriate when:

  • No official REST or GraphQL API exists
  • Data is publicly visible
  • You need lightweight read-only access
  • The page structure is relatively stable

⚠️ Always verify the website’s Terms of Service and robots.txt before scraping.


Technology Stack

  • Kotlin (JVM)
  • Jsoup – HTML parser with jQuery-like selectors
  • Optional: OkHttp (for advanced networking)

Adding Jsoup to Your Kotlin Project

Gradle (Kotlin DSL)

dependencies {
    implementation("org.jsoup:jsoup:1.17.2")
}

Understanding the Page Structure

Most ranking pages follow a predictable HTML pattern:

  • A container for the list
  • Repeating elements for each item
  • Nested tags for:
    • position
    • title
    • author / artist / description

Using browser DevTools (Inspect Element), identify:

  • Common parent container
  • Repeating row structure
  • Semantic tags (h2, h3, span, etc.)
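
To make the container-and-rows pattern concrete, here is a minimal sketch that parses an inline HTML string rather than a live page. The markup and the class names (ranking-list, ranking-item, position, title, subtitle) are hypothetical and chosen only for illustration; a real page will use its own names.

```kotlin
import org.jsoup.Jsoup

// Hypothetical markup: the class names are illustrative, not taken from a real site.
val sampleHtml = """
    <div class="ranking-list">
      <div class="ranking-item">
        <span class="position">1</span>
        <h3 class="title">First Title</h3>
        <span class="subtitle">First Author</span>
      </div>
      <div class="ranking-item">
        <span class="position">2</span>
        <h3 class="title">Second Title</h3>
        <span class="subtitle">Second Author</span>
      </div>
    </div>
""".trimIndent()

fun countRows(html: String): Int {
    val document = Jsoup.parse(html)
    // Scope the search to the common parent container first...
    val container = document.selectFirst("div.ranking-list") ?: return 0
    // ...then match the repeating row elements inside it.
    return container.select("div.ranking-item").size
}

fun main() {
    println(countRows(sampleHtml)) // prints 2
}
```

Working against a saved HTML sample like this makes it easier to experiment with selectors before sending any requests to the real site.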

Data Model (Best Practice)

Create a clean domain model to decouple parsing from business logic.

data class RankedItem(
    val position: Int,
    val title: String,
    val subtitle: String
)

Fetching and Parsing the Page

Basic Kotlin + Jsoup Example

import org.jsoup.Jsoup

fun main() {
    val document = Jsoup
        .connect("https://example.com/ranking-page")
        .userAgent("Mozilla/5.0")
        .timeout(10_000)
        .get()

    val items = mutableListOf<RankedItem>()

    val rows = document.select("div.ranking-item") // example selector

    for (row in rows) {
        val position = row.selectFirst(".position")?.text()?.toIntOrNull()
        val title = row.selectFirst(".title")?.text()
        val subtitle = row.selectFirst(".subtitle")?.text()

        if (position != null && title != null && subtitle != null) {
            items.add(
                RankedItem(
                    position = position,
                    title = title,
                    subtitle = subtitle
                )
            )
        }
    }

    items.forEach { println(it) }
}
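
The loop above can also be written as a pure function that takes an HTML string, which keeps the parsing logic testable without network access. This is one possible sketch, not the only way to structure it; the names toRankedItemOrNull and parseRanking are hypothetical, and the selectors are the same example selectors used above.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

data class RankedItem(val position: Int, val title: String, val subtitle: String)

// Returns null when any expected field is missing or the position is not a number.
fun Element.toRankedItemOrNull(): RankedItem? {
    val position = selectFirst(".position")?.text()?.toIntOrNull() ?: return null
    val title = selectFirst(".title")?.text() ?: return null
    val subtitle = selectFirst(".subtitle")?.text() ?: return null
    return RankedItem(position, title, subtitle)
}

// Parses every row and silently skips the ones that do not match the expected shape.
fun parseRanking(html: String): List<RankedItem> =
    Jsoup.parse(html)
        .select("div.ranking-item") // example selector, as above
        .mapNotNull { it.toRankedItemOrNull() }
```

With this split, main() only has to fetch the page and pass document.html() (or the Document itself, with a small overload) to the parser.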

Choosing the Right CSS Selectors

Jsoup supports most standard CSS selector syntax, plus a few extensions of its own:

Selector         Description
.class           Elements with class
#id              Element by ID
div > h2         Direct child
div h2           Any descendant
:nth-child(n)    Positional selection

Example

document.select("div.chart-entry h2")
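
The difference between the descendant, direct-child, and positional selectors is easiest to see side by side. The sketch below runs three queries against a small inline document; the markup and the helper name textsFor are hypothetical.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical markup used only to compare selector behavior.
val chartHtml = """
    <div id="chart">
      <div class="chart-entry"><h2>Alpha</h2></div>
      <div class="chart-entry"><section><h2>Beta</h2></section></div>
      <div class="chart-entry"><h2>Gamma</h2></div>
    </div>
""".trimIndent()

// Returns the text of every element matched by the given CSS query.
fun textsFor(document: Document, cssQuery: String): List<String> =
    document.select(cssQuery).map { it.text() }

fun main() {
    val document = Jsoup.parse(chartHtml)
    // Descendant: matches h2 at any depth inside a chart entry.
    println(textsFor(document, "div.chart-entry h2"))           // [Alpha, Beta, Gamma]
    // Direct child: skips the h2 wrapped in <section>.
    println(textsFor(document, "div.chart-entry > h2"))         // [Alpha, Gamma]
    // Positional: only the second child element of #chart.
    println(textsFor(document, "#chart > div:nth-child(2) h2")) // [Beta]
}
```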

Handling Encoding and Performance

Recommended Settings

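val document = Jsoup.connect(url)
    .timeout(10_000)          // give up after 10 seconds (value is in milliseconds)
    .maxBodySize(2_000_000)   // read at most about 2 MB of the response body
    .userAgent("Mozilla/5.0")
    .get()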
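
On the encoding side, Jsoup normally picks the character set from the HTTP Content-Type header or a meta tag and falls back to UTF-8. When a page declares its charset incorrectly, one option is to parse the raw bytes with an explicit charset. The sketch below assumes you already have the response bytes (for example from OkHttp); the helper name parseWithCharset is hypothetical.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import java.io.ByteArrayInputStream

// Parses raw bytes with an explicit charset; pass null to let Jsoup detect it.
fun parseWithCharset(bytes: ByteArray, charsetName: String?, baseUri: String): Document =
    ByteArrayInputStream(bytes).use { stream ->
        Jsoup.parse(stream, charsetName, baseUri)
    }

fun main() {
    val html = "<html><body><p class=\"title\">Café Olé</p></body></html>"
    // Simulate a server that sends ISO-8859-1 bytes without declaring the charset.
    val latin1Bytes = html.toByteArray(Charsets.ISO_8859_1)
    val document = parseWithCharset(latin1Bytes, "ISO-8859-1", "https://example.com/")
    println(document.selectFirst("p.title")?.text())
}
```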

Avoid

  • Parsing entire pages repeatedly
  • Blocking the UI thread (Android!)
  • Hardcoding brittle selectors
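
One way to avoid fetching and parsing the same page repeatedly is a small in-memory cache with a time-to-live. The class below is a minimal sketch under that assumption, not a production cache; TtlCache and its parameters are hypothetical names, and the clock is injectable so the behavior can be tested.

```kotlin
// Caches a single value and reloads it once it is older than ttlMillis.
class TtlCache<V : Any>(
    private val ttlMillis: Long,
    private val now: () -> Long = { System.currentTimeMillis() }
) {
    private var value: V? = null
    private var storedAt: Long = 0L

    @Synchronized
    fun getOrLoad(loader: () -> V): V {
        val cached = value
        // Serve the cached value while it is still fresh.
        if (cached != null && now() - storedAt < ttlMillis) return cached
        // Otherwise load a new value and remember when it was stored.
        return loader().also {
            value = it
            storedAt = now()
        }
    }
}
```

A caller could wrap the fetch-and-parse step, for example cache.getOrLoad { parsePage() }, where parsePage stands for whatever function produces the parsed items.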

Error Handling

Always assume the structure can change.

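import org.jsoup.Jsoup
import java.io.IOException

// Keep the result outside the try block so it can be used afterwards.
val doc = try {
    Jsoup.connect(url).get()
} catch (ex: IOException) {
    // Covers timeouts and HTTP errors (HttpStatusException): log + fallback
    null
}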
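
Network errors are only half of the picture: when a site is redesigned, the request succeeds but the selectors match nothing. A possible sketch that separates the two cases is shown below. The ScrapeResult type and scrapeTitles function are hypothetical, the selector is the example one used earlier, and the fetch step is passed in as a lambda so the logic can be exercised without a network call.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import java.io.IOException

sealed interface ScrapeResult {
    data class Success(val titles: List<String>) : ScrapeResult
    data class NetworkFailure(val cause: IOException) : ScrapeResult
    object StructureChanged : ScrapeResult
}

fun scrapeTitles(fetch: () -> Document): ScrapeResult {
    val document = try {
        fetch()
    } catch (ex: IOException) {
        // Timeouts, DNS failures, and non-2xx responses end up here.
        return ScrapeResult.NetworkFailure(ex)
    }
    val titles = document.select("div.ranking-item .title").map { it.text() }
    // An empty result on a successful fetch suggests the markup has changed.
    return if (titles.isEmpty()) ScrapeResult.StructureChanged else ScrapeResult.Success(titles)
}
```

In production code the call would look like scrapeTitles { Jsoup.connect(url).get() }, with the StructureChanged case logged loudly so broken selectors are noticed early.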

Android-Specific Notes

  • Use Dispatchers.IO
  • Never run Jsoup on the main thread
  • Consider caching results

// Must be called from a coroutine or another suspend function.
withContext(Dispatchers.IO) {
    Jsoup.connect(url).get()
}
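
Because withContext is a suspend function, it is usually wrapped in a suspend function of your own. The sketch below assumes the kotlinx-coroutines library is on the classpath; RankingRepository and loadTitles are hypothetical names, and the fetch step is injectable so the class can be tested without a network call.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

class RankingRepository(
    // Default fetcher performs a blocking HTTP request with a 10 second timeout.
    private val fetch: (String) -> Document = { url ->
        Jsoup.connect(url).timeout(10_000).get()
    }
) {
    // Safe to call from the main thread: the blocking work runs on Dispatchers.IO.
    suspend fun loadTitles(url: String): List<String> =
        withContext(Dispatchers.IO) {
            fetch(url).select(".title").map { it.text() }
        }
}
```

On Android this would typically be called from a coroutine scope tied to the screen's lifecycle, such as viewModelScope.launch { ... } in a ViewModel.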

SEO & Content Aggregation Use Cases

This approach is commonly used for:

  • Music or media rankings
  • News headlines aggregation
  • Public statistics dashboards
  • Market monitoring tools
  • Research automation

Legal and Ethical Considerations

✔️ Public data only
✔️ Respect rate limits
✔️ No authentication bypass
❌ No personal data scraping

If long-term stability is required, request an official API whenever possible.
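
The "respect rate limits" point above can be enforced in code by spacing requests out. The class below is a minimal sketch of a fixed-interval throttle; RequestThrottle and its parameters are hypothetical, and the appropriate interval depends on the site's own policy, which this code cannot know.

```kotlin
// Ensures at least minIntervalMillis elapse between the starts of consecutive requests.
class RequestThrottle(
    private val minIntervalMillis: Long,
    private val now: () -> Long = { System.currentTimeMillis() },
    private val sleep: (Long) -> Unit = { Thread.sleep(it) }
) {
    private var lastRequestAt: Long? = null

    @Synchronized
    fun <T> run(request: () -> T): T {
        val last = lastRequestAt
        if (last != null) {
            // Wait out whatever remains of the minimum interval.
            val waitMillis = minIntervalMillis - (now() - last)
            if (waitMillis > 0) sleep(waitMillis)
        }
        lastRequestAt = now()
        return request()
    }
}
```

A scraper would share one throttle per host and wrap each fetch in it, for example throttle.run { Jsoup.connect(url).get() }. Since the default sleep blocks the calling thread, this belongs on a background thread.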


Final Thoughts

Using Kotlin + Jsoup provides a clean, expressive, and maintainable way to extract structured data from public web pages. With proper abstraction, selector hygiene, and error handling, this approach scales well for both backend services and Android applications.

When implemented responsibly, HTML parsing remains a powerful tool in a modern developer’s toolbox.
