Sunday, December 4, 2022

Web Scraping: Collect all Urls from a Webpage

This Python program uses Selenium to find all valid links (internal links only) on a page and save them to a CSV file. I used DVWA as the target; you must log in first so that the index (home) page is displayed.

This program is useful when a project needs to scan a given domain for vulnerabilities. If there are thousands of links, it is a good automation tool compared to collecting them manually.

The source code has comments describing the purpose of each line to make it readable for beginners in Python programming. Please note that this is not a tutorial; it is meant to share my Python programming journey.
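The script depends on Selenium Wire and pandas. Assuming a standard pip setup, they can be installed with:

pip install selenium-wire pandas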

[Sample output: screenshot of the collected DVWA URLs]
Here is the source code:

##  Import webdriver from Selenium Wire instead of plain Selenium
from seleniumwire import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

##  Open the login page
driver = webdriver.Chrome()  # chromedriver.exe must be in the same folder as this script (or on PATH)
driver.get("http://localhost/dvwa/login.php")

# find the username field and type the username
driver.find_element(by=By.NAME, value="username").send_keys('admin')
# find the password field and type the password
driver.find_element(by=By.NAME, value="password").send_keys('test1')
# click the login button
driver.find_element(by=By.NAME, value="Login").click()

# wait until the page's anchor tags are present, then collect them all
elems = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, 'a'))
)
data = []
for elem in elems:
    # read the url from the link; href can be None, so guard before checking it
    x = elem.get_attribute("href")
    if x and x.startswith('http://localhost/dvwa'):  # keep internal links only
        data.append(x)

# end the browser session
driver.quit()

data = list(dict.fromkeys(data))  # remove duplicates while preserving order
df = pd.DataFrame(data)
pd.set_option('display.max_colwidth', None)  # print the whole column, not a truncated view
pd.set_option('display.max_rows', None)      # print all rows
print('last row : ' + df.iloc[-1, 0])  # print the last collected url
print('')
print(df)  # display the collected urls
df.to_csv('url.csv')
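The startswith() check above is tied to the fixed DVWA address. A more general way to decide whether a link is internal is to compare hostnames with urllib.parse from the standard library. This is a minimal sketch, not part of the original script; the BASE constant and the is_internal() helper are hypothetical names used only for this example:

from urllib.parse import urlparse

BASE = "http://localhost/dvwa/"    # hypothetical base url for this example
base_host = urlparse(BASE).netloc  # 'localhost'

def is_internal(url):
    # an empty or missing href is not a link we can follow
    if not url:
        return False
    parsed = urlparse(url)
    # relative links have an empty netloc and are internal by definition;
    # otherwise the link is internal only when the hostnames match
    return parsed.netloc in ('', base_host)

print(is_internal("http://localhost/dvwa/vulnerabilities/sqli/"))  # True
print(is_internal("https://example.com/"))                         # False

Note that comparing hostnames treats everything on localhost as internal; the path-prefix check used in the script above is stricter, since it also requires the /dvwa path.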

