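I recently came across a quote library site: https://www.goodreads.com
That led me to Colly, a Go scraping framework with more than 6K stars on GitHub.
The plan is to first crawl all of the quotes on goodreads and save them to a file, then parse the scraped quotes and format each one as markdown so that it renders nicely on a web page, like this:
Each quote has three elements: the quote type, the quote body, and the author or source.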
“We are what we pretend to be, so we must be careful about what we pretend to be.”
Kurt Vonnegut, Mother Night

“Sometimes you wake up. Sometimes the fall kills you. And sometimes, when you fall, you fly.”
Neil Gaiman, Fables & Reflections
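The three elements listed above can be modelled as a small Go type. This is only a minimal sketch: the `Quote` type, its field names, and the `Markdown` method are hypothetical, and it assumes the blog theme provides an `admonition` shortcode that takes the quote type followed by a quoted title.

```go
package main

import (
	"fmt"
	"strings"
)

// Quote models the three elements of a quote.
// The type and field names are hypothetical, chosen for illustration.
type Quote struct {
	Kind   string // admonition type, e.g. "quote"
	Text   string // the quote body
	Source string // author or work the quote comes from
}

// Markdown renders the quote as an admonition shortcode block.
// Assumption: the theme understands {{% admonition <type> "<title>" %}}.
func (q Quote) Markdown() string {
	var b strings.Builder
	// %% produces a literal percent sign in the output.
	fmt.Fprintf(&b, "{{%% admonition %s \"%s\" %%}}\n", q.Kind, q.Source)
	b.WriteString(q.Text)
	b.WriteString("\n{{% /admonition %}}\n")
	return b.String()
}

func main() {
	q := Quote{
		Kind:   "quote",
		Text:   "“We are what we pretend to be, so we must be careful about what we pretend to be.”",
		Source: "Kurt Vonnegut, Mother Night",
	}
	fmt.Print(q.Markdown())
}
```

Keeping the rendering in a method that returns a string makes it possible to check the markdown output without touching the network or the file system.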
Lightning Fast and Elegant Scraping Framework for Gophers.
Colly provides a clean interface to write any kind of crawler/scraper/spider.
With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
gocolly/colly : https://github.com/gocolly/colly
$ go get -u github.com/gocolly/colly/...
$ go version
go version go1.12.8 linux/amd64
Optionally, you can export GO111MODULE=on.
Draft:
package main

import (
	"fmt"
	"os"
	"strings"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	// Create (or truncate) the markdown file that collects the quotes.
	fileName := "quote.md"
	file, errFile := os.Create(fileName)
	if errFile != nil {
		println("operating system create file error:", errFile.Error())
		panic(errFile)
	}
	defer func() {
		if err := file.Close(); err != nil {
			println("file close error")
		}
	}()

	c := colly.NewCollector()
	// Send all requests through a local proxy.
	errProxy := c.SetProxy("http://127.0.0.1:1080/")
	if errProxy != nil {
		println("colly set proxy error:", errProxy.Error())
		panic(errProxy)
	}
	// AllowedDomains expects host names, not URLs.
	// c.AllowedDomains = []string{"www.goodreads.com"}
	c.AllowURLRevisit = true
	extensions.RandomUserAgent(c)

	// Each .quoteText element holds the quote body, then "―", then the attribution.
	c.OnHTML(".quoteText", func(e *colly.HTMLElement) {
		text := strings.TrimSpace(strings.Split(e.Text, "―")[0])
		author := TrimSpaceNewlineInString(strings.TrimSpace(e.ChildText(".authorOrTitle")))
		fileWriteForMarkdown(file, text, author)
	})

	// Follow the "next page" link until there is none left.
	c.OnHTML(".next_page", func(e *colly.HTMLElement) {
		next := e.Request.AbsoluteURL(e.Attr("href"))
		println("visit: ", next)
		if errHrefVisit := c.Visit(next); errHrefVisit != nil {
			panic(errHrefVisit)
		}
	})

	errVisit := c.Visit("https://www.goodreads.com/quotes/tag/philosophy")
	if errVisit != nil {
		panic(errVisit)
	}
}

// TrimSpaceNewlineInString replaces every newline in s with a space.
func TrimSpaceNewlineInString(s string) string {
	return strings.ReplaceAll(s, "\n", " ")
}

// fileWriteForMarkdown writes one quote wrapped in an admonition shortcode.
// lines[0] is the quote text, lines[1] is the author or source.
func fileWriteForMarkdown(file *os.File, lines ...string) {
	admonitionBot := "\n{{% /admonition %}}\n"
	head := fmt.Sprintf("\n{{%% admonition quote \"%s\" %%}}\n", lines[1])
	_, err := file.WriteString(head)
	if err != nil {
		println("file write error ", err.Error())
	}
	_, err = file.WriteString(lines[0])
	if err != nil {
		println("file write error ", err.Error())
	}
	_, err = file.WriteString(admonitionBot)
	if err != nil {
		println("file write error ", err.Error())
	}
}

// fileWriteDirect writes lines[0] and lines[1] to the file without any markup.
func fileWriteDirect(file *os.File, lines ...string) {
	_, err := file.WriteString(lines[0])
	if err != nil {
		println("file write error ", err.Error())
	}
	_, err = file.WriteString(lines[1])
	if err != nil {
		println("file write error ", err.Error())
	}
}
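The text handling inside the `.quoteText` callback can also be tried out on its own, without Colly or a network connection. The sketch below is an assumption-laden variant rather than the draft's exact logic: `splitQuote` is a hypothetical helper that takes both the quote body and the attribution from the raw element text (the draft reads the attribution from `.authorOrTitle` instead), and it assumes the page separates the two with a "―" character.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuote separates the raw text of a quote element into the quote body
// and its attribution. Hypothetical helper: it assumes a "―" sits between
// the two parts of the text.
func splitQuote(raw string) (text, source string) {
	parts := strings.SplitN(raw, "―", 2)
	text = strings.TrimSpace(parts[0])
	if len(parts) == 2 {
		// strings.Fields splits on any run of whitespace, so joining with a
		// single space collapses newlines and indentation in the attribution.
		source = strings.Join(strings.Fields(parts[1]), " ")
	}
	return text, source
}

func main() {
	// Made-up input that imitates the whitespace found in scraped HTML text.
	raw := "\n  “Sometimes you wake up.”\n  ―\n  Neil Gaiman,\n    Fables & Reflections\n"
	text, source := splitQuote(raw)
	fmt.Printf("%s\n%s\n", text, source)
}
```

If the separator is missing, `splitQuote` returns the trimmed text and an empty source, so the caller can decide whether to skip the entry.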