Crawling Stars and Forks Count from a Repository
In this lesson, we'll walk step by step through the logic for crawling and displaying the Stars (likes) and Forks (project clones) counts of a repository.
Step 1
Fetch HTML from Web Page
import requests

url = "https://github.com/django/django"  # Django's GitHub repository page
response = requests.get(url)
html_content = response.text
- requests.get(url): Sends an HTTP GET request to the given URL and returns a response object. In this context, it targets the GitHub repository page of Django.
- response.text: The body of the response obtained by requests.get, decoded as a string. Here it is the HTML of the page.
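In practice a request can fail or hang, so it can help to add a timeout and a status check. The following is a minimal sketch rather than part of the lesson's code: the helper name fetch_html, the 10-second timeout, and the optional session parameter are assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the HTML of `url` as a string (hypothetical helper)."""
    # Use the given session if provided, otherwise the requests module itself;
    # both expose a compatible get() method.
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, Step 1 could be written as html_content = fetch_html(url).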
Step 2
Parse HTML
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
- BeautifulSoup(html_content, 'html.parser'): Uses BeautifulSoup to parse html_content, enabling easy access to and manipulation of HTML elements.
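To see what the parsed object offers without making a network request, the sketch below parses a small HTML string. The sample markup is made up for illustration and is much simpler than GitHub's actual page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real GitHub page is far larger.
sample_html = """
<html>
  <head><title>django/django</title></head>
  <body><span id="repo-stars-counter-star">82.3k</span></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes of the soup object.
page_title = soup.title.get_text()
# find() returns the first matching element, or None if nothing matches.
star_text = soup.find(id="repo-stars-counter-star").get_text()

print(page_title)
print(star_text)
```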
Step 3
Locate Stars and Forks Count
ids_to_find = ['repo-stars-counter-star', 'repo-network-counter']
- This list holds the IDs of HTML elements that display the stars and forks count. These IDs are used to locate the information on the webpage.
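GitHub can change its markup, so these IDs may stop matching at some point. One way to check which IDs a page actually contains is to list every element that has an id attribute. This is a minimal sketch on made-up HTML; list_ids is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def list_ids(html):
    """Return the id of every element that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # id=True matches any tag that carries an id attribute, whatever its value.
    return [tag["id"] for tag in soup.find_all(id=True)]


# Made-up markup standing in for a repository page.
sample_html = """
<span id="repo-stars-counter-star">82.3k</span>
<span id="repo-network-counter">32.1k</span>
<span class="no-id">ignored</span>
"""
print(list_ids(sample_html))
```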
Step 4
Extract Information
found_contents = {}  # maps each ID to the text found for it
for id_value in ids_to_find:
    element_content = soup.find(id=id_value)
    found_contents[id_value] = element_content.get_text() if element_content else "No content"
- soup.find(id=id_value): Finds the first HTML element with the specified ID in the parsed HTML content, or returns None if no such element exists.
- element_content.get_text(): Extracts the text content from the found element. If the element doesn't exist, "No content" is stored instead.
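The extraction logic can also be wrapped in a function and tried on a small HTML string, which makes the "element not found" case easy to observe. This is a sketch under stated assumptions: extract_by_ids is a hypothetical helper, the markup is made up, and strip=True is added to trim whitespace around the text.

```python
from bs4 import BeautifulSoup


def extract_by_ids(html, ids, default="No content"):
    """Map each id to the stripped text of its element, or `default` if absent."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for id_value in ids:
        element = soup.find(id=id_value)
        # strip=True trims the whitespace that often surrounds text in real markup.
        found[id_value] = element.get_text(strip=True) if element else default
    return found


# Made-up markup: the forks counter is deliberately missing.
sample_html = '<span id="repo-stars-counter-star">\n  82.3k\n</span>'
result = extract_by_ids(sample_html, ["repo-stars-counter-star", "repo-network-counter"])
print(result)
```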
Step 5
Output
for id_value, content in found_contents.items():
    print(f"ID '{id_value}': {content}")
- found_contents.items(): Returns each ID together with its corresponding text content. The loop prints every pair, allowing users to see the stars and forks counts.
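The printed values are text, and large counts may appear in abbreviated form such as 82.3k. If numbers are needed for sorting or comparison, the text can be converted first. The sketch below assumes only the suffixes k and m occur; parse_count is a hypothetical helper, not part of the lesson's code.

```python
from decimal import Decimal

# Multipliers for the suffixes assumed to appear in abbreviated counts.
SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert text such as '82.3k' or '1,234' to an int (hypothetical helper)."""
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned and cleaned[-1] in SUFFIXES:
        multiplier = SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal avoids the float rounding error of e.g. 82.3 * 1000.
    return int(Decimal(cleaned) * multiplier)
```

Non-numeric text such as "No content" raises an exception here, so it should be checked for before calling the function.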
Practical Exercise
- Execute the code above with a different repository URL on GitHub.
- Practice extracting various data by using different IDs or classes.
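For the second exercise, elements can be located by class as well as by ID. The following sketch shows three common ways to do so with BeautifulSoup. The markup and class names are made up for illustration and are not taken from GitHub's page.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names below are illustrative, not GitHub's.
sample_html = """
<div class="repo-meta">
  <span class="counter stars">82.3k</span>
  <span class="counter forks">32.1k</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ is used because `class` is a reserved word in Python.
first_counter = soup.find(class_="counter").get_text()
# select_one() takes a CSS selector; "span.counter.forks" requires both classes.
forks = soup.select_one("span.counter.forks").get_text()
# find_all() returns every match rather than only the first.
all_counters = [tag.get_text() for tag in soup.find_all("span", class_="counter")]

print(first_counter, forks, all_counters)
```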