Today we learn how to fetch all the links on a web page. We will write a small script for that purpose. I came across this while reading a bash guide.
Example: microsoft.com. I want a list of all the domains that are linked from microsoft.com, and then I want to fetch their IP addresses.
Step 1 : Fetch microsoft.com
Save the page as HTML from a browser, or download it with wget.
root@ETHICALHACKX:~# wget microsoft.com
--2020-02-16 22:51:31-- http://microsoft.com/
Resolving microsoft.com (microsoft.com)... 13.77.161.179, 40.113.200.201, 40.112.72.205, ...
Connecting to microsoft.com (microsoft.com)|13.77.161.179|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.microsoft.com/ [following]
--2020-02-16 22:51:32-- https://www.microsoft.com/
Resolving www.microsoft.com (www.microsoft.com)... 106.51.146.24, 2600:140f:4:1a1::356e, 2600:140f:4:186::356e
Connecting to www.microsoft.com (www.microsoft.com)|106.51.146.24|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://www.microsoft.com/en-in/ [following]
--2020-02-16 22:51:32-- https://www.microsoft.com/en-in/
Reusing existing connection to www.microsoft.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
index.html [ <=> ] 153.15K --.-KB/s in 0.1s
2020-02-16 22:51:32 (1.48 MB/s) - ‘index.html’ saved [156821]
root@ETHICALHACKX:~# ls -l index.html
-rw-r--r-- 1 root root 156821 Feb 16 22:51 index.html
root@ETHICALHACKX:~#

Step 2 : We need the links, which appear in the page in this form: <li><a href="http://xbox.microsoft.com/">XBox</a></li>
Step 4 : Pick out the lines with links
We use grep to keep only the lines that contain a link, that is, the lines with “href=” in them. The command is grep "href=" index.html.
root@ETHICALHACKX:~# grep "href=" index.html
<link rel="dns-prefetch" href="https://assets.onestore.ms" />
<link rel="preconnect" href="https://assets.onestore.ms" />
<link rel="dns-prefetch" href="https://web.vortex.data.microsoft.com" />
<link rel="preconnect" href="https://web.vortex.data.microsoft.com" />
<link rel="dns-prefetch" href="https://mem.gfx.ms" />
<link rel="preconnect" href="https://mem.gfx.ms" />
<link rel="dns-prefetch" href="https://img-prod-cms-rt-microsoft-com.akamaized.net" />
<link rel="preconnect" href="https://img-prod-cms-rt-microsoft-com.akamaized.net" />
<link rel="dns-prefetch" href="https://microsoftwindows.112.2o7.net" />
<link rel="preconnect" href="https://microsoftwindows.112.2o7.net" />
<link rel="SHORTCUT ICON" href="https://c.s-microsoft.com/favicon.ico?v2" type="image/x-icon" />
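The grep above prints whole lines, so a lot of markup comes along with each link. As a rough sketch of a narrower first pass, both GNU and BSD grep support -o, which prints only the part of the line that matched. The function name list_hrefs and the sample input are made up for illustration.

```shell
# Hypothetical helper: read HTML on stdin and print only the href="..."
# attributes, one per line, instead of the whole matching line.
list_hrefs() {
    grep -o 'href="[^"]*"'
}

# Made-up sample standing in for index.html.
list_hrefs <<'EOF'
<link rel="preconnect" href="https://assets.onestore.ms" />
<li><a href="http://xbox.microsoft.com/">XBox</a></li>
<p>No link on this line</p>
EOF
```

On the saved page this would be run as list_hrefs < index.html. When one line holds several links, each one is printed on its own line.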

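Step 5 : We clean the output further by removing the surrounding text. In a URL such as https://assets.onestore.ms the host name is the third field when the line is split on the “/” character, so let's use “/” as the delimiter and keep field 3.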
root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3
assets.onestore.ms"
assets.onestore.ms"
web.vortex.data.microsoft.com"
web.vortex.data.microsoft.com"
mem.gfx.ms"
mem.gfx.ms"
img-prod-cms-rt-microsoft-com.akamaized.net"
img-prod-cms-rt-microsoft-com.akamaized.net"
microsoftwindows.112.2o7.net"
microsoftwindows.112.2o7.net"
c.s-microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
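To see why field 3 is the host name, it can help to print the first few “/”-separated fields of a single line. This is only an illustration; the variable name line is made up, and the sample is one of the lines from the grep output.

```shell
line='<link rel="preconnect" href="https://assets.onestore.ms" />'

# Splitting on "/" gives:
#   field 1: everything up to and including "https:"
#   field 2: the empty string between the two slashes of "//"
#   field 3: the host name plus whatever follows it up to the next "/"
for n in 1 2 3; do
    printf 'field %s: [%s]\n' "$n" "$(printf '%s\n' "$line" | cut -d "/" -f "$n")"
done
```

Field 3 still carries the closing quote and a space, which is why the host names in the output end with a stray quote character.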

Step 6 : The output is now better than before; let's put in a bit more effort to remove the stray words that still appear in the list.
We keep only the lines that contain a period “.”, since the domain names we are after all contain one.
root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\."
assets.onestore.ms"
assets.onestore.ms"
web.vortex.data.microsoft.com"
web.vortex.data.microsoft.com"
mem.gfx.ms"
mem.gfx.ms"
img-prod-cms-rt-microsoft-com.akamaized.net"
img-prod-cms-rt-microsoft-com.akamaized.net"
microsoftwindows.112.2o7.net"
microsoftwindows.112.2o7.net"
c.s-microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com" aria-label="Microsoft" data-m='{"cN":"GlobalNav_Logo_cont","cT":"Container","id":"c3c2m1r1a1","sN":3,"aN":"c2m1r1a1"}'>
products.office.com
www.microsoft.com
www.microsoft.com
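The backslash in "\." matters: in a regular expression an unescaped period matches any single character, so it would filter nothing out. A small sketch on made-up input (the variable name candidates and its contents are illustrative, imitating what cut leaves behind for absolute and relative links):

```shell
# Two host names and one leftover word of the kind a relative link produces.
candidates='www.microsoft.com
windows">Windows<
mem.gfx.ms"'

# Unescaped "." matches any character, so every non-empty line is counted.
printf '%s\n' "$candidates" | grep -c "."

# "\." matches a literal period only, so the leftover word is dropped.
printf '%s\n' "$candidates" | grep -c "\."

# -F treats the pattern as a fixed string, which is another way to say the same thing.
printf '%s\n' "$candidates" | grep -c -F "."
```

The three counts printed are 3, 2 and 2.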

Step 7 : We make the output cleaner by splitting each line on the ‘"’ character and keeping only the first field, which drops the trailing quote and anything after it.
root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1
assets.onestore.ms
assets.onestore.ms
web.vortex.data.microsoft.com
web.vortex.data.microsoft.com
mem.gfx.ms
mem.gfx.ms
img-prod-cms-rt-microsoft-com.akamaized.net
img-prod-cms-rt-microsoft-com.akamaized.net
microsoftwindows.112.2o7.net
microsoftwindows.112.2o7.net
c.s-microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
products.office.com
www.microsoft.com
www.microsoft.com
www.xbox.com
support.microsoft.com
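One detail that makes this step safe: cut passes a line through unchanged when it does not contain the delimiter at all (unless -s is given), so entries that are already clean are not lost. A minimal sketch, with strip_after_quote as a made-up helper name:

```shell
# Hypothetical helper: keep only the text before the first double quote.
strip_after_quote() {
    cut -d '"' -f1
}

# One entry with a trailing quote and attributes, one that is already clean.
printf '%s\n' 'www.microsoft.com" aria-label="Microsoft"' 'products.office.com' |
    strip_after_quote
```

This prints www.microsoft.com and products.office.com, one per line.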

Step 8: We now have a clean list, but with duplicates; let's sort it with the -u (unique) option.
root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1 | sort -u
account.microsoft.com
assets.onestore.ms
azure.microsoft.com
careers.microsoft.com
channel9.msdn.com
choice.microsoft.com
c.s-microsoft.com
developer.microsoft.com
docs.microsoft.com
go.microsoft.com
img-prod-cms-rt-microsoft-com.akamaized.net
mem.gfx.ms
microsoftwindows.112.2o7.net
msdn.microsoft.com
news.microsoft.com
onedrive.live.com
outlook.live.com
privacy.microsoft.com
products.office.com
store.office.com
support.microsoft.com
technet.microsoft.com
twitter.com
visualstudio.microsoft.com
web.vortex.data.microsoft.com
www.facebook.com
www.microsoft.com
www.onenote.com
www.skype.com
www.xbox.com
www.youtube.com
root@ETHICALHACKX:~#
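sort -u throws away the information about how often each domain was linked. If that is of interest, a common variation is to count the duplicates with uniq -c instead. This is a sketch on made-up input, and count_hosts is a hypothetical name:

```shell
# Hypothetical helper: count how often each line occurs on stdin,
# most frequent first. uniq -c only merges adjacent lines, hence the first sort.
count_hosts() {
    sort | uniq -c | sort -rn
}

printf '%s\n' www.microsoft.com mem.gfx.ms www.microsoft.com \
    www.xbox.com www.microsoft.com mem.gfx.ms | count_hosts
```

Each output line is a count followed by the host name, starting with www.microsoft.com, which occurs three times in the sample.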

Step 9 : Export this list to a text file.
root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1 | sort -u > microsoft.txt
root@ETHICALHACKX:~#
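Since the pipeline is now complete, it can be wrapped in a small function so it is easy to reuse on other saved pages. This is just one way to package it; the name extract_domains is made up, and the pipeline inside is the one built up in the steps above.

```shell
# Hypothetical wrapper around the pipeline built up so far.
# Usage: extract_domains FILE
extract_domains() {
    grep "href=" "$1" |   # lines that contain a link
        cut -d "/" -f3 |  # third "/"-separated field: the host name
        grep "\." |       # keep only entries with a literal period
        cut -d '"' -f1 |  # drop a trailing quote and anything after it
        sort -u           # sort and remove duplicates
}

# Example run against the page saved in Step 1:
#   extract_domains index.html > microsoft.txt
```

Keep in mind that this line-based approach only picks up the first absolute link on each line of the HTML file.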
Step 10 : We now look up the IP address of each domain in the saved file with the host command, looping over the file with: for url in $(cat microsoft.txt); do host $url; done

Step 11 : This again produces a lot of irrelevant output that we want to trim down. Let's grep the output for “has address”, cut out the fourth space-separated field (the IP address), and sort unique again.
root@ETHICALHACKX:~# for url in $(cat microsoft.txt); do host $url; done | grep "has address" | cut -d " " -f 4 | sort -u
104.121.243.97
104.121.246.126
104.122.12.200
104.244.42.65
106.51.144.228
106.51.144.82
106.51.146.105
106.51.146.24
13.107.42.11
13.107.42.13
13.107.42.16
13.235.141.20
13.235.224.156
13.92.199.137
172.217.160.142
172.217.163.110
172.217.163.174
172.217.163.206
172.217.163.46
172.217.163.78
172.217.166.110
172.217.167.142
172.217.26.174
172.217.26.206
172.217.31.206
184.29.11.224
192.237.225.141
202.83.22.200
202.83.22.218
216.58.196.174
216.58.197.46
216.58.197.78
216.58.200.142
23.8.183.228
23.8.185.225
23.8.187.90
23.8.188.96
31.13.79.35
40.77.226.250
52.109.120.67
52.109.56.1
52.113.194.133
65.52.210.213
root@ETHICALHACKX:~#
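Plain sort -u orders the addresses as text, which is why 104.x comes before 13.x in the list above. A possible variation, sketched below, sorts on the four octets numerically and reads the file with a while loop instead of $(cat ...). The function names are made up, and the sample host output uses reserved documentation names and addresses rather than real lookups.

```shell
# Hypothetical helper: reduce `host` output on stdin to unique IPv4 addresses,
# sorted numerically octet by octet.
ips_from_host_output() {
    grep "has address" | cut -d " " -f 4 |
        sort -u -t . -k1,1n -k2,2n -k3,3n -k4,4n
}

# Hypothetical helper: look up every name listed in FILE (one per line).
# Needs network access and the host command, so it is not run here.
resolve_all() {
    while IFS= read -r name; do
        if [ -n "$name" ]; then
            host "$name"
        fi
    done < "$1" | ips_from_host_output
}

# Demonstration of the parsing step on made-up host output.
ips_from_host_output <<'EOF'
example.com has address 192.0.2.10
example.com has IPv6 address 2001:db8::1
example.org has address 192.0.2.9
example.net has address 192.0.2.10
EOF
```

The demonstration prints 192.0.2.9 followed by 192.0.2.10. With the file from Step 9, the lookup itself would be run as resolve_all microsoft.txt.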

So we got all the IP addresses for the domains appearing on microsoft.com without much trouble. The same approach can be applied in various other scenarios.