GoLang simple scraper

Learn how to build a simple web scraper in Go using the Colly package. This tutorial will guide you through setting up a Go project, installing Colly, and writing code to scrape data from websites like Hacker News.


When it comes to scraping data with Golang, the go-to framework for most developers is Colly. Its popularity stems from its efficiency and ease of use. If you're only interested in scraping data from a single page, GoQuery is also a great tool to consider.
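For a quick taste, a minimal single-page scrape with GoQuery could look roughly like this (a sketch only, not part of this tutorial's project; it assumes you've run go get github.com/PuerkitoBio/goquery, and the span.titleline selector is an assumption about the current Hacker News front-page markup):

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch the page with the standard library HTTP client.
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Parse the response body into a queryable document.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Print the text of every story title link.
	doc.Find("span.titleline > a").Each(func(i int, s *goquery.Selection) {
		fmt.Println(s.Text())
	})
}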

Requirements:

  • Golang installed
  • IDE of your choice

Create a new project in your root directory

mkdir hacker-news-scraper

Now let's initialize the Go module

go mod init hacker-news-scraper

Install Colly

go get -u github.com/gocolly/colly/v2

Next, let's create main.go

package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

type Comment struct {
	User    string
	Comment string
}

func main() {
	// Create a collector restricted to news.ycombinator.com.
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
	)
	comments := make([]Comment, 0)

	// For every comment row, extract the username and comment text.
	c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
		comment := Comment{
			User:    e.ChildText("a.hnuser"),
			Comment: e.ChildText("span.commtext"),
		}
		comments = append(comments, comment)
	})

	err := c.Visit("https://news.ycombinator.com/item?id=40396005")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(comments)
}

What we did here is create a new Colly collector (the scraper), restricted to the news.ycombinator.com domain.

c.OnHTML registers a callback that runs for every tr.athing element on the page. Inside it, ChildText extracts the username (a.hnuser) and the comment text (span.commtext), and we append each result to the comments slice.
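By the way, fmt.Println prints the slice with Go's default formatting; if you'd rather have structured output, one option is to marshal the slice with the standard library's encoding/json package, for example:

	// Requires "encoding/json" in the import block.
	out, err := json.MarshalIndent(comments, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))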

With a small improvement, we can provide the URL on the command line. I will add a flag parser for a -url parameter

	var url string
	flag.StringVar(&url, "url", "https://news.ycombinator.com/item?id=40396005", "URL to scrape")
	flag.Parse()

And now our code looks like this.

package main

import (
	"flag"
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

type Comment struct {
	User    string
	Comment string
}

func main() {
	// Read the target URL from the -url flag, with a sensible default.
	var url string
	flag.StringVar(&url, "url", "https://news.ycombinator.com/item?id=40396005", "URL to scrape")
	flag.Parse()

	// Create a collector restricted to news.ycombinator.com.
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
	)
	comments := make([]Comment, 0)

	// For every comment row, extract the username and comment text.
	c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
		comment := Comment{
			User:    e.ChildText("a.hnuser"),
			Comment: e.ChildText("span.commtext"),
		}
		comments = append(comments, comment)
	})

	err := c.Visit(url)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(comments)
}

You can run it with

go run ./main.go -url=https://news.ycombinator.com/item?id=40396005

That's it; a simple scraper is working.

Of course, you can still improve it, for example by trimming the comment text to remove redundant whitespace and newlines.
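A small helper using the standard library's strings package is one way to do that (a sketch; the clean name is just an example):

// Requires "strings" in the import block.
// strings.Fields splits on any whitespace (including newlines);
// joining the pieces with single spaces collapses the runs.
func clean(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

You would then use it when building each Comment, e.g. Comment: clean(e.ChildText("span.commtext")).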