Skip to content

Conversation

@skytreader
Copy link

Fixes #8.

Instead of passing around the domain, a cleaner way would be to create an object that abstracts the parsing of a document. That way, the domain can be extracted from the constructor into a field and then used by methods that need it. But I think this PR is large enough as it is for #8 so I guess I'll owe the refactor to another PR. :)

@edsu
Copy link
Owner

edsu commented Jul 12, 2019

I apologize for not seeing this sooner. It seems that the pull request has conflicts now. If this is something you still want to get in let me know if you can address the conflicts and I promise to apply it sooner this time!

def __init__(self, string):
self.string = string
def __init__(self, string, domain=""):
if string.startswith("http://") or string.startswith("https://"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it's duplicating the stdlib urljoin semantics:

>>> from urlparse import urljoin
>>> urljoin('https://example.com/foo/bar', '/baaz')
'https://example.com/baaz'
>>> urljoin('https://example.com/foo/bar', 'http://quux/baaz')
'http://quux/baaz'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions:

  1. Judging by the .travis.yml this project is officially on Python 3. As such I should use urllib.parse right?
  2. Sorry I'm not sure I agree with you that this bit is duplicating urljoin semantics. This conditional just ensures that the string starts with either "http" or "https". urljoin does not do that:
Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin("github.com", "edsu")
'edsu'

Can you clarify if I'm missing something here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urljoin expects its first parameter to be a URL:

>>> from urllib.parse import urljoin
>>> urljoin("https://github.com", "edsu")
'https://github.com/edsu'

return self.string

@staticmethod
def get_domain(url_string):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be replaced with something using urlparse, which returns a handy named tuple:

>>> urlparse('https://example.com/foo/bar')
ParseResult(scheme='https', netloc='example.com', path='/foo/bar', params='', query='', fragment='')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used urlparse inside this method instead of manually splitting and joining but urlparse can't be a drop-in replacement for what I want to achieve here:

>>> urlparse("github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='', netloc='', path='github.com/edsu/microdata/pull/32/files', params='', query='', fragment='')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urlparse expects a URL:

>>> urlparse("https://github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='https', netloc='github.com', path='/edsu/microdata/pull/32/files', params='', query='', fragment='')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really knowledgeable about the field here but my point is, are we sure to never encounter malformed-URLs ? I get it that urllib expects well-formed URLs but shouldn't a microdata parser be more forgiving? Hence I manually check and adapt in case the string leaves out the protocol part.

@skytreader
Copy link
Author

Hey there. This is not relevant to my work anymore but let me just complete the job; besides, I know others who might benefit from this. That said, I'll get back to this this weekend.

@skytreader
Copy link
Author

Just got back to this. Resolving conflicts and addressing comments took a while longer since I'm no longer familiar with the domain (knowledge). Anyway I see that tests are failing :(. I'll address that next but in the meantime, feel free to comment further on my clarifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

make relative URLs absolute

3 participants