Modern applications often need to extract structured data from web pages that do not expose a public API. This is common for dashboards, content aggregators, internal tools, or automation scripts. In JVM-based ecosystems, Kotlin combined with Jsoup offers a clean and efficient solution for HTML scraping and parsing.
This article explains how to extract ranked list data from a public web page using Kotlin, covering:
- HTTP fetching
- HTML parsing
- CSS selectors
- Data modeling
- Common pitfalls
- Best practices for maintainability and legality
All examples are anonymized and apply to any ranking-style or list-based web page.
When Is HTML Parsing the Right Approach?
HTML parsing is appropriate when:
- No official REST or GraphQL API exists
- Data is publicly visible
- You need lightweight read-only access
- The page structure is relatively stable
⚠️ Always verify the website’s Terms of Service and robots.txt before scraping.
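As a rough illustration of the robots.txt check mentioned above, the sketch below tests a path against the Disallow rules that apply to all crawlers. It is deliberately simplified: Allow rules, wildcards, and agent-specific groups are ignored, and the function name `isDisallowed` is made up for this example. Treat it as a first-pass sanity check rather than a complete robots.txt parser.

```kotlin
// Simplified robots.txt check: only "User-agent: *" groups and plain Disallow prefixes.
// Allow rules, wildcards (*, $), and agent-specific groups are intentionally ignored.
fun isDisallowed(robotsTxt: String, path: String): Boolean {
    var inWildcardGroup = false
    var lastWasUserAgent = false
    val disallowed = mutableListOf<String>()

    for (rawLine in robotsTxt.lines()) {
        val line = rawLine.substringBefore('#').trim() // drop comments
        if (line.isEmpty()) continue

        val key = line.substringBefore(':').trim().lowercase()
        val value = line.substringAfter(':', "").trim()

        if (key == "user-agent") {
            // Consecutive User-agent lines share one group of rules.
            inWildcardGroup = (value == "*") || (lastWasUserAgent && inWildcardGroup)
            lastWasUserAgent = true
        } else {
            if (key == "disallow" && inWildcardGroup && value.isNotEmpty()) {
                disallowed += value
            }
            lastWasUserAgent = false
        }
    }
    return disallowed.any { path.startsWith(it) }
}
```

The robots.txt content itself would be fetched separately (for example from the site root) and passed in as a string, which keeps this check free of network code.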
Technology Stack
- Kotlin (JVM)
- Jsoup – HTML parser with jQuery-like selectors
- Optional: OkHttp (for advanced networking)
Adding Jsoup to Your Kotlin Project
Gradle (Kotlin DSL)
dependencies {
    implementation("org.jsoup:jsoup:1.17.2")
}
Understanding the Page Structure
Most ranking pages follow a predictable HTML pattern:
- A container for the list
- Repeating elements for each item
- Nested tags for:
  - position
  - title
  - author / artist / description
Using browser DevTools (Inspect Element), identify:
- Common parent container
- Repeating row structure
- Semantic tags (h2, h3, span, etc.)
Data Model (Best Practice)
Create a clean domain model to decouple parsing from business logic.
data class RankedItem(
    val position: Int,
    val title: String,
    val subtitle: String
)
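To show what that decoupling buys, here is a small sketch of business logic that works only on `RankedItem` and never touches HTML. The helper name `topN` is hypothetical, and the data class is repeated so the snippet compiles on its own.

```kotlin
data class RankedItem(
    val position: Int,
    val title: String,
    val subtitle: String
)

// Business logic depends only on the model, so it can be unit-tested without any HTML.
fun topN(items: List<RankedItem>, n: Int): List<RankedItem> =
    items.sortedBy { it.position }.take(n)
```

If the page markup changes later, only the parsing code needs to be touched; functions like this one stay as they are.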
Fetching and Parsing the Page
Basic Kotlin + Jsoup Example
import org.jsoup.Jsoup

fun main() {
    // Fetch and parse the page; get() throws an IOException on network or HTTP errors
    val document = Jsoup
        .connect("https://example.com/ranking-page")
        .userAgent("Mozilla/5.0")
        .timeout(10_000) // milliseconds
        .get()

    val items = mutableListOf<RankedItem>()
    val rows = document.select("div.ranking-item") // example selector

    for (row in rows) {
        // selectFirst returns null when nothing matches, hence the safe calls
        val position = row.selectFirst(".position")?.text()?.toIntOrNull()
        val title = row.selectFirst(".title")?.text()
        val subtitle = row.selectFirst(".subtitle")?.text()

        // Keep only rows where every field was found
        if (position != null && title != null && subtitle != null) {
            items.add(
                RankedItem(
                    position = position,
                    title = title,
                    subtitle = subtitle
                )
            )
        }
    }

    items.forEach { println(it) }
}
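One possible refinement, sketched below, is to split fetching from parsing so the parsing step can be exercised offline. The function name `parseRankedItems` and the sample markup are assumptions made for illustration; the selectors are the same example ones used above.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

data class RankedItem(val position: Int, val title: String, val subtitle: String)

// Pure parsing step: takes an already-loaded Document, so no network is involved.
fun parseRankedItems(document: Document): List<RankedItem> =
    document.select("div.ranking-item").mapNotNull { row ->
        val position = row.selectFirst(".position")?.text()?.toIntOrNull()
        val title = row.selectFirst(".title")?.text()
        val subtitle = row.selectFirst(".subtitle")?.text()
        if (position != null && title != null && subtitle != null) {
            RankedItem(position, title, subtitle)
        } else {
            null // skip rows that do not match the expected structure
        }
    }

// Hypothetical markup used only to exercise the parser without a network call.
val sampleHtml = """
    <div class="ranking-item">
      <span class="position">1</span><h2 class="title">First</h2><span class="subtitle">Someone</span>
    </div>
    <div class="ranking-item">
      <span class="position">n/a</span><h2 class="title">Broken</h2><span class="subtitle">Nobody</span>
    </div>
""".trimIndent()
```

Calling `parseRankedItems(Jsoup.parse(sampleHtml))` yields a single item here, because the second row has a non-numeric position and is skipped. In production the same function would receive the `Document` returned by `Jsoup.connect(...).get()`.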
Choosing the Right CSS Selectors
Jsoup supports most of the CSS selector syntax, along with a few jQuery-style extensions such as :contains(text):
| Selector | Description |
|---|---|
| .class | Elements with class |
| #id | Element by ID |
| div > h2 | Direct child |
| div h2 | Any descendant |
| :nth-child(n) | Positional selection |
Example
document.select("div.chart-entry h2")
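The difference between the descendant and direct-child combinators is easy to miss, so here is a small sketch that runs both against inline markup. The HTML and the helper name `headingTexts` are invented for this example.

```kotlin
import org.jsoup.Jsoup

// Hypothetical markup: one h2 is a direct child of the entry, the other is nested deeper.
val chartHtml = """
    <div class="chart-entry">
      <h2>Direct</h2>
      <section><h2>Nested</h2></section>
    </div>
""".trimIndent()

// Returns the text of every element matched by the given selector.
fun headingTexts(selector: String): List<String> =
    Jsoup.parse(chartHtml).select(selector).map { it.text() }
```

With this markup, `headingTexts("div.chart-entry h2")` matches both headings, while `headingTexts("div.chart-entry > h2")` matches only the one that sits directly under the entry.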
Handling Encoding and Performance
Recommended Settings
Jsoup.connect(url)
    .timeout(10_000)          // milliseconds
    .maxBodySize(2_000_000)   // bytes; larger responses are truncated
    .userAgent("Mozilla/5.0")
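The settings above do not touch encoding. Jsoup normally picks the charset from the Content-Type header or a meta tag, falling back to UTF-8. When a page declares its encoding incorrectly, one option is to parse the raw bytes with an explicit charset, as in this sketch. The helper name `parseWithCharset` is hypothetical, and obtaining the bytes is left to whatever HTTP client is in use.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import java.io.ByteArrayInputStream

// Parses raw bytes using the given charset instead of relying on auto-detection.
// baseUri is used by Jsoup to resolve relative links in the document.
fun parseWithCharset(bytes: ByteArray, charsetName: String, baseUri: String): Document =
    Jsoup.parse(ByteArrayInputStream(bytes), charsetName, baseUri)
```

For example, bytes produced with `"<p>café</p>".toByteArray(Charsets.ISO_8859_1)` can be parsed with `charsetName = "ISO-8859-1"` to recover the accented character intact.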
Avoid
- Parsing entire pages repeatedly
- Blocking the UI thread (Android!)
- Hardcoding brittle selectors
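To avoid fetching and parsing the same page over and over, results can be kept for a short time. The sketch below is one minimal way to do that; the class name `TimedCache` and its parameters are assumptions for illustration, and it is not thread-safe.

```kotlin
// Minimal time-based cache: reuses the last loaded value until it is older than ttlMillis.
// Not thread-safe; the clock is injectable so the behavior can be tested deterministically.
class TimedCache<T>(
    private val ttlMillis: Long,
    private val clock: () -> Long = { System.currentTimeMillis() },
    private val loader: () -> T
) {
    private var value: T? = null
    private var loadedAt: Long = 0
    private var hasValue = false

    fun get(): T {
        val now = clock()
        if (!hasValue || now - loadedAt >= ttlMillis) {
            value = loader() // the expensive step, e.g. fetch + parse
            loadedAt = now
            hasValue = true
        }
        @Suppress("UNCHECKED_CAST")
        return value as T
    }
}
```

A typical use would be something like `TimedCache(ttlMillis = 60_000) { Jsoup.connect(url).get() }`, where the trailing lambda is the loader. The right lifetime depends on how often the source page actually changes.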
Error Handling
Always assume the structure can change.
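// requires: import java.io.IOException
val doc = try {
    Jsoup.connect(url).get()
} catch (ex: IOException) {
    // Covers timeouts and HTTP error statuses; log the failure, then fall back
    null
}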
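Network errors are only half of the picture: when a site changes its markup, the request still succeeds but the selectors silently match nothing. A sanity check on the parsed result can catch that. The sketch below assumes a ranking whose positions run 1, 2, 3, ... in order, which will not hold for every page; the function name `findProblems` is hypothetical.

```kotlin
data class RankedItem(val position: Int, val title: String, val subtitle: String)

// Returns a list of human-readable problems; an empty list means the result looks plausible.
fun findProblems(items: List<RankedItem>): List<String> {
    val problems = mutableListOf<String>()
    if (items.isEmpty()) {
        problems += "no rows matched: the selectors may be out of date"
        return problems
    }
    // Assumption: positions start at 1 and increase by one in list order.
    val positions = items.map { it.position }
    if (positions != (1..items.size).toList()) {
        problems += "positions are not a contiguous 1..${items.size} sequence"
    }
    if (items.any { it.title.isBlank() }) {
        problems += "at least one item has a blank title"
    }
    return problems
}
```

Logging these messages, or failing loudly when the list is non-empty, makes a layout change visible early instead of letting empty data flow downstream.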
Android-Specific Notes
- Use Dispatchers.IO
- Never run Jsoup on the main thread
- Consider caching results
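// Inside a suspend function (requires kotlinx.coroutines)
val document = withContext(Dispatchers.IO) {
    Jsoup.connect(url).get() // blocking call, moved off the main thread
}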
SEO & Content Aggregation Use Cases
This approach is commonly used for:
- Music or media rankings
- News headlines aggregation
- Public statistics dashboards
- Market monitoring tools
- Research automation
Legal and Ethical Considerations
✔️ Public data only
✔️ Respect rate limits
✔️ No authentication bypass
❌ No personal data scraping
If long-term stability is required, request an official API whenever possible.
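Respecting rate limits can be as simple as enforcing a minimum pause between requests. The following is a minimal sketch under that assumption; the class name `MinIntervalLimiter` is made up, the appropriate interval depends on the site, and the class is not thread-safe.

```kotlin
// Enforces a minimum gap between consecutive requests.
// Clock and sleep are injectable so the logic can be tested without real waiting.
class MinIntervalLimiter(
    private val minIntervalMillis: Long,
    private val clock: () -> Long = { System.currentTimeMillis() },
    private val sleep: (Long) -> Unit = { Thread.sleep(it) }
) {
    private var lastRequestAt: Long? = null

    // Blocks (via sleep) until at least minIntervalMillis has passed since the previous call.
    fun acquire() {
        val last = lastRequestAt
        val now = clock()
        if (last != null) {
            val waitMillis = minIntervalMillis - (now - last)
            if (waitMillis > 0) sleep(waitMillis)
        }
        lastRequestAt = clock()
    }
}
```

Calling `acquire()` right before each `Jsoup.connect(...).get()` keeps requests spaced out. Because the default implementation blocks the calling thread, it belongs on a background thread, never on the Android main thread.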
Final Thoughts
Using Kotlin + Jsoup provides a clean, expressive, and maintainable way to extract structured data from public web pages. With proper abstraction, selector hygiene, and error handling, this approach scales well for both backend services and Android applications.
When implemented responsibly, HTML parsing remains a powerful tool in a modern developer’s toolbox.


