Puppeteer is a Node.js library that provides an API to automate Chrome or Chromium browsers. It can be used to get the inner text of any element on the page; however, the approach differs slightly depending on the type of element.
Let’s explore how we can scrape the inner text of headings, links, paragraphs, lists, tables, buttons, inputs, and text areas using Puppeteer.
We will be using the following test page, which contains all of these types of HTML elements.
https://gulshansainis.github.io/portfolio/
I will be using the code below as a starting point:
// file index.js
const puppeteer = require('puppeteer')

const options = {
  executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  headless: false,
  defaultViewport: null,
  args: ['--window-size=1920,1080'],
}

;(async () => {
  const browser = await puppeteer.launch(options)
  const page = await browser.newPage()
  await page.setDefaultNavigationTimeout(0)
  await page.goto('https://gulshansainis.github.io/portfolio/')

  // rest of code goes below

  await browser.close()
})()
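The executablePath in the starting code points at a macOS install of Chrome, so it will not work as written on other systems. As a rough sketch, assuming you would rather fall back to the browser that the puppeteer package downloads by default, the options could be built by a small helper. The helper name buildLaunchOptions and the CHROME_PATH environment variable are hypothetical and only used for illustration.

```javascript
// Hypothetical helper: builds the launch options and only sets
// executablePath when a path is supplied, so Puppeteer can otherwise
// use its own downloaded browser.
function buildLaunchOptions(executablePath) {
  const options = {
    headless: false,
    defaultViewport: null,
    args: ['--window-size=1920,1080'],
  }
  if (executablePath) {
    options.executablePath = executablePath
  }
  return options
}

// CHROME_PATH is an assumed variable name chosen for this sketch.
const launchOptions = buildLaunchOptions(process.env.CHROME_PATH)
console.log(launchOptions)
```

The result can be passed to puppeteer.launch in place of the options object used above.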
First, we will get the text of the h1 heading, which is displayed inside the Hero element.
The selector of the h1 element is "body > div > div > h1".
To get the text of the heading, we need to read the element.textContent property.
const heading1 = await page.$eval("body > div > div > h1", el => el.textContent);
console.log(heading1)
After you put the above code just below the comment // rest of code goes below and execute the file with the node index.js command, it should print the text Hey 👋, I'm Gulshan Saini to the terminal.
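One thing to keep in mind is that page.$eval throws an error when no element matches the selector. If a missing element is an expected case, a small wrapper can return null instead. The sketch below is one possible approach and not part of the original script; the name getTextOrNull is hypothetical. It relies on page.$ resolving to null when nothing matches.

```javascript
// Hypothetical helper: returns the element's textContent, or null when
// nothing matches, instead of letting page.$eval throw.
async function getTextOrNull(page, selector) {
  const handle = await page.$(selector) // resolves to null if nothing matches
  if (handle === null) {
    return null
  }
  return handle.evaluate((el) => el.textContent)
}

// Usage inside the async function of index.js:
//   const heading1 = await getTextOrNull(page, 'body > div > div > h1')
```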
Getting link text is very similar to getting the text of a heading. We will get the text of the first item in the navigation list, which is Portfolio at the time of writing.
The CSS selector of the element is #nav-menu > li:nth-child(1)
const navListFirstItem = await page.$eval("#nav-menu > li:nth-child(1)", el => el.textContent);
console.log(navListFirstItem)
Once you save the above code and run the script again, you should see the text Portfolio on the console.
Next, we will target the paragraph element displayed inside the Hero element.
The CSS selector of the paragraph is body > div > div > p
const heroParagraph = await page.$eval("body > div > div > p", el => el.textContent);
console.log(heroParagraph)
You should get the output below after saving the above code and running index.js:
I’m fullstack developer at Exponential, a global provider of advertising intelligence and digital media solutions to brand advertisers.
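Because textContent returns the text as it sits in the markup, longer elements such as paragraphs may come back with leading indentation or line breaks, depending on how the page's HTML is formatted. A minimal sketch of cleaning that up is shown below; the helper name normalizeWhitespace is hypothetical, and the sample string is made up for the demonstration.

```javascript
// Hypothetical helper: collapses runs of whitespace (including newlines)
// into single spaces and trims both ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim()
}

// Made-up sample input that mimics indented markup.
console.log(normalizeWhitespace('\n    Hey,   I am\n  a paragraph  \n'))
// prints: Hey, I am a paragraph
```

It can be applied to any scraped value in Node, for example normalizeWhitespace(heroParagraph).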
So far everything was simple, and the same technique worked for every element. Let’s now see how we can iterate over list elements and print the text of each item.
We will select the list items in the Services section, which match the selector #services ul li
The code to get their text looks like the following:
const services = await page.evaluate(() =>
  Array.from(
    document.querySelectorAll('#services ul li'),
    (element) => element.textContent
  )
)
console.log(services)
Let’s understand what is happening here:
- document.querySelectorAll('#services ul li') selects all matching nodes.
- Array.from converts those nodes to an array, because document.querySelectorAll returns a NodeList instead of an array.
The whole expression runs inside the page through the page.evaluate method. Array.from takes the NodeList for the matching selector, i.e. document.querySelectorAll("#services ul li"), and its second argument maps each element to its textContent.
The services variable holds the text of all list items in the order they appear on the page. After saving the above code, you should see the output below on the console:
[ 'Web Development', 'Technical Fesibility', 'Infrastructure Set-up', 'Architecture Design Review', 'Amazon Web Service Migration' ]
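Puppeteer also offers page.$$eval, which collects every element matching a selector into an array and passes that array to the callback, so the Array.from step is not needed. The sketch below shows an equivalent way to read the list items with it; the wrapper name getAllTexts is hypothetical.

```javascript
// Hypothetical helper built on page.$$eval, which passes an array of all
// matching elements to the callback.
async function getAllTexts(page, selector) {
  return page.$$eval(selector, (elements) =>
    elements.map((element) => element.textContent)
  )
}

// Usage inside the async function of index.js:
//   const services = await getAllTexts(page, '#services ul li')
```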
Getting the text of an input element, including an input of type submit (i.e. a button), works differently because the text is contained in the value attribute.
To get the text of an input element, we need to use element.value instead of element.textContent.
We will use the code below to get the text of the input button of type submit:
const formSubmitButton = await page.$eval(
  '#contact-form > div > input[type=submit]',
  (el) => el.value
)
console.log(formSubmitButton)
Once you save the above code and run node index.js, it should print the text of the button, i.e. Submit
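The choice between value and textContent can be folded into a single callback, which is handy when the same code has to read both form fields and regular elements. The sketch below assumes that input and textarea elements are the only ones whose text lives in value; the name readText is hypothetical. The callback refers to nothing outside itself, which matters because Puppeteer runs it inside the page rather than in Node.

```javascript
// Hypothetical callback for page.$eval: input and textarea elements keep
// their text in `value`, other elements in `textContent`.
const readText = (el) =>
  ['INPUT', 'TEXTAREA'].includes(el.tagName) ? el.value : el.textContent

// Plain objects stand in for DOM elements in this demonstration.
console.log(readText({ tagName: 'INPUT', value: 'Submit', textContent: '' }))
console.log(readText({ tagName: 'H1', textContent: 'Hello' }))
```

In the script it would be passed as the second argument, for example page.$eval('#contact-form > div > input[type=submit]', readText).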
Below is the final code, which covers all the scenarios for getting the text of an element:
const puppeteer = require('puppeteer')

const options = {
  executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  headless: false,
  defaultViewport: null,
  args: ['--window-size=1920,1080'],
}

;(async () => {
  const browser = await puppeteer.launch(options)
  const page = await browser.newPage()
  await page.setDefaultNavigationTimeout(0)
  await page.goto('https://gulshansainis.github.io/portfolio/')
  await page.waitForSelector('body > div > div > h1')

  const heading1 = await page.$eval(
    'body > div > div > h1',
    (el) => el.textContent
  )
  console.log(heading1)

  const navListFirstItem = await page.$eval(
    '#nav-menu > li:nth-child(1)',
    (el) => el.textContent
  )
  console.log(navListFirstItem)

  const heroParagraph = await page.$eval(
    'body > div > div > p',
    (el) => el.textContent
  )
  console.log(heroParagraph)

  const services = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll('#services ul li'),
      (element) => element.textContent
    )
  )
  console.log(services)

  const formSubmitButton = await page.$eval(
    '#contact-form > div > input[type=submit]',
    (el) => el.value
  )
  console.log(formSubmitButton)

  await browser.close()
})()
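In the final code, browser.close() is never reached if any of the earlier calls throws, for example when a selector no longer matches. One way to guard against that, sketched here under the assumption that you want the browser closed in every case, is a try/finally wrapper. The name withBrowser and its launch parameter are hypothetical.

```javascript
// Hypothetical wrapper: `launch` is any function that resolves to a browser,
// e.g. () => puppeteer.launch(options).
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    return await task(browser)
  } finally {
    await browser.close() // runs whether task resolved or threw
  }
}

// Usage in index.js:
//   withBrowser(() => puppeteer.launch(options), async (browser) => {
//     const page = await browser.newPage()
//     // scraping code goes here
//   })
```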