I am using the tld Python library to grab the first-level domain from proxy request logs using an apply function. When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I get an error like the following:
TldBadUrl: Is not a valid URL http:1 con!
TldBadUrl Traceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)
/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()
/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url,
fail_silently, fix_protocol, search_public, search_private, **kwargs)
385 fix_protocol=fix_protocol,
386 search_public=search_public,
--> 387 search_private=search_private
388 )
389
/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
289 return None, None, parsed_url
290 else:
--> 291 raise TldBadUrl(url=url)
292
293 domain_parts = domain_name.split('.')
To overcome this, it was suggested that I wrap the function in a try-except clause, so that the rows that error out return NaN and can be queried afterwards:
import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
This seems to work for some of the "requests", like "http:1 con" and "http:/login.cgi%00", but then fails for "http://urnt12.knhc..txt/" with another error like the one above:
TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!
This is what the dataframe, called "request", looks like (240,000 "requests" in total):
                                     request    count
0          https://login.microsoftonline.com    24521
1             https://dt.adsafeprotected.com    11521
2        https://googleads.g.doubleclick.net     6252
3                  https://fls-na.amazon.com    65225
4  https://v10.vortex-win.data.microsoft.com  7852222
5                       https://ib.adnxs.com       12
6                                 http:1 CON        6
7                         http:/login.cgi%00    45822
8                   http://urnt12.knhc..txt/        1
My code:
import tld
from tld import get_tld
from tld import get_fld
import pandas as pd
import numpy as np

# Read back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')

# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]

# Find the urls that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]

# Reset index
request = request.reset_index(drop=True)

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)
Answer
It fails because it's a different exception: you expect a tld.exceptions.TldBadUrl but get a tld.exceptions.TldDomainNotFound. You can either be less specific in your except clause and catch both exceptions with one clause, or add another except clause for the second exception type:
try:
    return get_fld(x)
except tld.exceptions.TldBadUrl:
    return np.nan
except tld.exceptions.TldDomainNotFound:
    print("Domain not found!")
    return np.nan