Practice

Real-time Crawling of Pull Request Count

In this lesson, we will crawl the number of Pull Requests from the Django repository page on GitHub and display it on the screen.

A Pull Request is a proposal to merge your changes into a repository, often one that belongs to another user.


Step 1

Fetching HTML from the Web Page
import requests
url = "https://github.com/django/django"  # Django repository page on GitHub
response = requests.get(url)
html_content = response.text
  • requests.get(url): Sends an HTTP GET request to the given URL and returns a response object. Here, the URL is the Django repository page on GitHub.
  • response.text: The body of the response as a string. For a web page, this is its HTML content.
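
The request in Step 1 can fail or hang, so it may help to wrap it in a small function. The following is a minimal sketch, not part of the lesson's code: the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML of `url`, raising an exception if the request fails.

    `fetch_html` and the default timeout are illustrative choices,
    not names taken from the lesson.
    """
    # A timeout keeps the script from waiting forever on a slow connection.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://github.com/django/django") would then return the same string that the lesson stores in html_content, or raise an exception instead of silently returning an error page.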

Step 2

Parsing HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
  • BeautifulSoup(html_content, "html.parser"): Parses the downloaded HTML (html_content) with Python's built-in html.parser and returns a BeautifulSoup object. Through this object, you can search for and access individual elements of the HTML document.
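
To see what parsing makes possible without touching the network, here is a small sketch that parses a hand-written HTML string. The markup in sample_html is hypothetical and only stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
sample_html = """
<html>
  <head><title>Sample repository</title></head>
  <body>
    <a href="/pulls">Pull requests</a>
    <a href="/issues">Issues</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is the <title> element; get_text() returns its text.
page_title = soup.title.get_text()

# find_all("a") returns every <a> element; a["href"] reads its href attribute.
links = [a["href"] for a in soup.find_all("a")]

print(page_title)
print(links)
```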

Step 3

Extracting Information
count = soup.find(id="pull-requests-repo-tab-count").get_text()
  • soup.find(id="pull-requests-repo-tab-count"): Searches the parsed HTML for the element whose ID is pull-requests-repo-tab-count and returns it, or returns None if no such element exists. On the GitHub repository page, this ID identifies the element that displays the number of pull requests. GitHub can change its markup, so the ID may not stay the same.
  • .get_text(): Extracts the text content (in this case, the number of pull requests) from the found element.
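
Because soup.find returns None when nothing matches, calling .get_text() on the result directly raises an AttributeError if the element is missing. The sketch below checks for that case first. The function name extract_pr_count is an assumption for illustration, and the test markup in the usage note is hypothetical.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_pr_count(html: str) -> Optional[str]:
    """Return the pull request count text, or None if the element is absent.

    `extract_pr_count` is an illustrative helper, not part of the lesson.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id="pull-requests-repo-tab-count")
    if element is None:
        # The page layout changed or the page did not load as expected.
        return None
    # strip=True removes surrounding whitespace from the extracted text.
    return element.get_text(strip=True)
```

For example, extract_pr_count('<span id="pull-requests-repo-tab-count"> 123 </span>') returns the string "123", and a page without that ID returns None. The result is kept as a string because the counter text is not guaranteed to be a plain integer.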

Note: Before crawling a website, check its robots.txt file and terms of service, and make sure your crawler complies with them.
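
One way to check robots.txt rules programmatically is Python's standard urllib.robotparser module. The sketch below is an illustration under assumptions: the helper is_allowed, the user agent name "my-crawler", and the sample rules are all hypothetical, and it does not cover a site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`.

    `is_allowed` is an illustrative helper, not part of the lesson.
    """
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is allowed except paths under /private/.
sample_rules = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(sample_rules, "my-crawler", "https://example.com/public/page"))
print(is_allowed(sample_rules, "my-crawler", "https://example.com/private/page"))
```

To check a live site instead of a string, RobotFileParser also provides set_url() and read(), which download and parse the site's own robots.txt file.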


Practice Exercise

  • Run the code above with the URLs of other GitHub repositories.

  • Practice targeting different HTML tags and extracting data from those tags.
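
As a starting point for the second exercise, the sketch below shows three common ways to target elements with BeautifulSoup: by tag name, by tag name and class, and by CSS selector. The markup is hypothetical and does not reflect GitHub's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used here instead of a live page.
sample_html = """
<div class="repo">
  <h1 class="name">example/project</h1>
  <ul>
    <li class="topic">python</li>
    <li class="topic">web</li>
  </ul>
  <span id="star-count">42</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# By tag name: the first <h1> in the document.
name = soup.find("h1").get_text()

# By tag name and class: every <li class="topic">.
topics = [li.get_text() for li in soup.find_all("li", class_="topic")]

# By CSS selector: the element whose id is "star-count".
stars = soup.select_one("#star-count").get_text()

print(name, topics, stars)
```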
