
How to get the text of an element using Puppeteer?

By Gulshan Saini
Published in Puppeteer
September 25, 2020
2 min read

Puppeteer is a Node.js library that provides an API to automate Chrome or Chromium browsers. It can be used to get the inner text of any element on a page; however, the approach differs slightly depending on the type of element.

Let’s explore how we can scrape the inner text of headings, links, paragraphs, lists, tables, buttons, inputs, and text areas using Puppeteer.

We will be using the following test page, which contains a variety of HTML elements.

https://gulshansainis.github.io/portfolio/

Boilerplate code

I will use the code below as a starting point.

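// file index.js
const puppeteer = require('puppeteer')

const options = {
  // Path to a locally installed Chrome (macOS path; adjust for your system)
  executablePath:
    '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  // Show the browser window instead of running headless
  headless: false,
  // Let the page use the full window size instead of the default viewport
  defaultViewport: null,
  args: ['--window-size=1920,1080'],
}

;(async () => {
  const browser = await puppeteer.launch(options)
  const page = await browser.newPage()
  // A timeout of 0 disables the navigation timeout
  await page.setDefaultNavigationTimeout(0)
  await page.goto('https://gulshansainis.github.io/portfolio/')

  // rest of code goes below

  await browser.close()
})()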
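
One thing the starting point above does not handle is a failure partway through: if any step throws, browser.close() is never reached and the Chrome window stays open. Below is a minimal sketch of one way to guard against that. It assumes a hypothetical helper named withBrowser, which is not part of Puppeteer and receives the launch function as a parameter.

```javascript
// Hypothetical helper, not part of Puppeteer: runs `task` with a browser
// and always closes the browser afterwards, even if `task` throws.
// `launch` is any function returning a promise of an object with close(),
// for example () => puppeteer.launch(options).
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    return await task(browser)
  } finally {
    await browser.close()
  }
}
```

With the boilerplate above, this could be called as withBrowser(() => puppeteer.launch(options), async (browser) => { /* scraping code */ }).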

Getting Heading text

First, we will get the text of the h1 heading displayed inside the Hero element.

The CSS selector of the h1 element is "body > div > div > h1"

To get the text of the heading, we read the element's textContent property (it is a property, not a method).

const heading1 = await page.$eval("body > div > div > h1", el => el.textContent);
console.log(heading1)

After you place the code above just below the comment // rest of code goes below and run the file with the node index.js command, it should print the text Hey 👋, I'm Gulshan Saini to the terminal.
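
Note that page.$eval throws an error when nothing matches the selector, which can happen if the page layout changes. Below is a minimal sketch of a more forgiving helper. It assumes a hypothetical function name getTextOrNull (not a Puppeteer API) and relies only on page.$ resolving to null when there is no match and on evaluate being available on the returned element handle.

```javascript
// Hypothetical helper: returns the trimmed text of the first element
// matching `selector`, or null when no element matches.
async function getTextOrNull(page, selector) {
  const handle = await page.$(selector) // resolves to null if nothing matches
  if (handle === null) {
    return null
  }
  // The callback runs in the browser context against the matched element.
  const text = await handle.evaluate((el) => el.textContent)
  return text.trim()
}
```

Inside the boilerplate it could be used as const heading1 = await getTextOrNull(page, 'body > div > div > h1').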

Getting link text

Getting link text is very similar to getting the text of a heading. We will get the text of the first item in the navigation list, which is Portfolio at the time of writing.

The CSS selector of the element is #nav-menu > li:nth-child(1)

const navListFirstItem = await page.$eval("#nav-menu > li:nth-child(1)", el => el.textContent);
console.log(navListFirstItem)

Once you save the code above and run the script again, you should see the text Portfolio on the console.

Scraping paragraph element text

Next, we will target the paragraph element displayed inside the Hero element.

The CSS selector of the paragraph is body > div > div > p

const heroParagraph = await page.$eval("body > div > div > p", el => el.textContent);
console.log(heroParagraph)

You should get the output below after saving the code above and running index.js.

          I’m fullstack developer at Exponential, a global provider of
          advertising intelligence and digital media solutions to brand
          advertisers.
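
The leading spaces and line breaks in this output come from the page's HTML source, because textContent returns whitespace exactly as it is written in the markup. If a single clean line is preferred, one option is to collapse the whitespace after reading it. Here is a small sketch, where normalizeWhitespace is a hypothetical helper name:

```javascript
// Hypothetical helper: collapses runs of whitespace (spaces, tabs,
// newlines) into single spaces and trims both ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim()
}

// Example with an indented, multi-line string like the one above.
const raw = '\n          first line\n          second line\n        '
console.log(normalizeWhitespace(raw)) // prints "first line second line"
```

In the script, this would be applied as normalizeWhitespace(heroParagraph) before logging.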

Getting the text of all list elements

So far everything was simple, and the same technique worked for each element. Let’s now see how we can iterate over list elements and print the text of each item.

We will select the list items in the Services section, which match the selector #services ul li

The code to get the text of every item looks like this:

const services = await page.evaluate(() =>
  Array.from(
    document.querySelectorAll('#services ul li'),
    (element) => element.textContent
  )
)
console.log(services)

Let’s understand what is happening here

  • document.querySelectorAll('#services ul li') selects all matching nodes
  • Array.from converts the result to an array, because document.querySelectorAll returns a NodeList rather than an array; its second argument is a mapping function, which here returns the textContent of each element

page.evaluate runs the callback inside the browser page. There, Array.from takes the NodeList returned by document.querySelectorAll("#services ul li") and maps each element to its text. The resulting array of strings is then returned to the Node.js script.

The services variable holds the text of all list items in the order they appear on the page. After saving the code above, you should see the output below on the console

[ 'Web Development',
  'Technical Fesibility',
  'Infrastructure Set-up',
  'Architecture Design Review',
  'Amazon Web Service Migration' ]
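
Puppeteer also offers page.$$eval, which runs document.querySelectorAll for the given selector and passes the matches to the callback as an array, so the Array.from step is not needed. Below is a sketch of the same list extraction written that way, wrapped in a hypothetical helper named getAllTexts for reuse. The trim() call is an addition that is not in the code above.

```javascript
// Hypothetical helper: returns the trimmed text of every element
// matching `selector`, in document order.
async function getAllTexts(page, selector) {
  // page.$$eval passes an array of the matching elements to the callback,
  // which runs inside the browser page.
  return page.$$eval(selector, (elements) =>
    elements.map((element) => element.textContent.trim())
  )
}
```

It could replace the page.evaluate call with const services = await getAllTexts(page, '#services ul li').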

Getting the text of input element

Getting the text of an input element, including an input of type submit (that is, a button), works differently because the text is stored in the value attribute.

To get the text of the input element, we need to read element.value instead of element.textContent.

We will use the code below to get the text of the input button of type submit

const formSubmitButton = await page.$eval(
  '#contact-form > div > input[type=submit]',
  (el) => el.value
)
console.log(formSubmitButton)

Once you save the code above and run node index.js, it should print the text of the button, i.e. Submit
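
If a script has to handle both kinds of element, the choice between value and textContent can be made inside the callback itself. Below is a minimal sketch that assumes a hypothetical function named readElementText. Because it does not reference any outer variables, it can be passed directly to page.$eval, which serializes the function and runs it in the page.

```javascript
// Hypothetical callback: input and textarea elements keep their text in
// `value`; other elements keep it in `textContent`.
function readElementText(el) {
  const valueElements = ['INPUT', 'TEXTAREA']
  return valueElements.includes(el.tagName) ? el.value : el.textContent
}
```

For the submit button, this would be called as await page.$eval('#contact-form > div > input[type=submit]', readElementText).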

Final code & Test

Below is the final code, which combines all the scenarios above for getting the text of an element.

const puppeteer = require('puppeteer')

const options = {
  executablePath:
    '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  headless: false,
  defaultViewport: null,
  args: ['--window-size=1920,1080'],
}

;(async () => {
  const browser = await puppeteer.launch(options)
  const page = await browser.newPage()
  await page.setDefaultNavigationTimeout(0)
  await page.goto('https://gulshansainis.github.io/portfolio/')

  await page.waitForSelector('body > div > div > h1')
  const heading1 = await page.$eval(
    'body > div > div > h1',
    (el) => el.textContent
  )
  console.log(heading1)

  const navListFirstItem = await page.$eval(
    '#nav-menu > li:nth-child(1)',
    (el) => el.textContent
  )
  console.log(navListFirstItem)

  const heroParagraph = await page.$eval(
    'body > div > div > p',
    (el) => el.textContent
  )
  console.log(heroParagraph)

  const services = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll('#services ul li'),
      (element) => element.textContent
    )
  )
  console.log(services)

  const formSubmitButton = await page.$eval(
    '#contact-form > div > input[type=submit]',
    (el) => el.value
  )
  console.log(formSubmitButton)

  await browser.close()
})()
