I have a list of urls and a table that contains urls. I want to only insert if the url is not in the table.
Data in the Table:

| id | url | ... |
|---|---|---|
| 1 | example.com | ... |

    List<String> urls = new ArrayList<>();
    urls.add("example.com/");
    urls.add("example.com/#");
    urls.add("www.example.com/");
    urls.add("https://www.example.com/");
    urls.add("example.net");

After the insert, the table should contain:

Data in the Table:

| id | url | ... |
|---|---|---|
| 1 | example.com | ... |
| 2 | example.net | ... |
My current attempt is a method findByURL(url): List that I call for every URL in the list. If the returned list is empty, I insert the URL into the table. Unfortunately, my query treats example.com and example.com/# as different URLs.

    @Table(name = "url_to_edit")
    @NamedQueries({
        @NamedQuery(name = UrlToEdit.FIND_BY_URL,
                    query = "select urlToEdit from UrlToEdit urlToEdit where urlToEdit.url = :url")
    })
    @NoArgsConstructor
    public class UrlToEdit { ... }

With my current solution, the table contains the following rows:

Data in the Table:

| id | url | ... |
|---|---|---|
| 1 | example.com | ... |
| 2 | example.com/ | ... |
| 3 | example.com/# | ... |
| 4 | www.example.com/ | ... |
| 5 | https://www.example.com/ | ... |
| 6 | example.net | ... |
How can I tell SQL to treat these URLs as the same? Or do I need some kind of pre-parser? And is it possible to do a bulk insert? My current code inserts the URLs one after the other.
EDIT: I have multiple URLs per host, so I can't match on hostnames alone, e.g. example.com/test/, example.com/test/# and example.com/#.
Answer
I think you should normalize the URLs before storing them in the database; that way all your data is in one canonical form and you don't have to compare every row manually. Adding a UNIQUE constraint on the url column would help too, since the database would then reject duplicates itself.
As for the transformation, I think (though I have not verified it) that the following regex might work:
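
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Optional scheme, optional "www.", then the host (group 2); the rest is ignored.
    Pattern URL_REGEX = Pattern.compile("(?:https?://)?(www\\.)?([^/]+).*");
    String url = "http://www.example.com/xxx";
    Matcher matcher = URL_REGEX.matcher(url);
    if (matcher.matches()) {
        url = matcher.group(2); // "example.com"
    }
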
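To see what this regex does with the list from the question, here is a minimal sketch that normalizes each entry and drops duplicates with a `LinkedHashSet`. The class and method names (`HostNormalizer`, `normalize`, `normalizeAll`) are made up for illustration. Note that this keeps only the host, so `example.com/test/` would also collapse to `example.com`.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the question's code base.
class HostNormalizer {
    // The answer's pattern: optional scheme, optional "www.", host in group 2.
    private static final Pattern URL_REGEX =
            Pattern.compile("(?:https?://)?(www\\.)?([^/]+).*");

    // Returns the host part, or the input unchanged if the pattern does not match.
    static String normalize(String url) {
        Matcher matcher = URL_REGEX.matcher(url);
        return matcher.matches() ? matcher.group(2) : url;
    }

    // Normalizes every URL and drops duplicates while keeping first-seen order.
    static Set<String> normalizeAll(List<String> urls) {
        Set<String> result = new LinkedHashSet<>();
        for (String url : urls) {
            result.add(normalize(url));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("example.com/", "example.com/#", "www.example.com/",
                "https://www.example.com/", "example.net");
        System.out.println(normalizeAll(urls)); // prints [example.com, example.net]
    }
}
```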
NOTE: I adapted the regular expression to fit your data, but I wouldn't consider example.com and www.example.com to be the same URL.
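Since the question's EDIT mentions several URLs per host, a variant that keeps the path may be closer to what is needed. The sketch below is only one possible set of rules, and every name in it (`UrlKeyNormalizer`, `normalize`, `findNew`) is hypothetical: it drops the scheme, a leading `www.`, the fragment and trailing slashes, then filters the candidates against the set of URLs already stored. It assumes the existing rows were normalized the same way and fit in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; the normalization rules are assumptions, adjust them to your data.
class UrlKeyNormalizer {
    // Canonical form: no scheme, no leading "www.", no fragment,
    // no trailing slash, lower-cased host. The path is kept.
    static String normalize(String url) {
        String s = url.trim();
        // Drop the fragment ("#..."), which does not identify a different resource here.
        int hash = s.indexOf('#');
        if (hash >= 0) {
            s = s.substring(0, hash);
        }
        // Drop an http:// or https:// scheme, case-insensitively.
        s = s.replaceFirst("(?i)^https?://", "");
        // Drop a leading "www." (an assumption that the answer's NOTE questions).
        s = s.replaceFirst("(?i)^www\\.", "");
        // Drop trailing slashes so "example.com/test/" equals "example.com/test".
        s = s.replaceAll("/+$", "");
        // Lower-case only the host; paths may be case-sensitive.
        int slash = s.indexOf('/');
        String host = slash < 0 ? s : s.substring(0, slash);
        String rest = slash < 0 ? "" : s.substring(slash);
        return host.toLowerCase(Locale.ROOT) + rest;
    }

    // Returns the normalized URLs that are not yet in 'existing', without duplicates.
    static List<String> findNew(List<String> candidates, Set<String> existing) {
        Set<String> fresh = new LinkedHashSet<>();
        for (String candidate : candidates) {
            String key = normalize(candidate);
            if (!existing.contains(key)) {
                fresh.add(key);
            }
        }
        return new ArrayList<>(fresh);
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("example.com");
        List<String> urls = List.of("example.com/", "example.com/#", "www.example.com/",
                "https://www.example.com/", "example.net",
                "example.com/test/", "example.com/test/#");
        System.out.println(findNew(urls, existing)); // prints [example.net, example.com/test]
    }
}
```

The list returned by `findNew` could then be persisted in a single transaction instead of one query and one insert per URL. The UNIQUE constraint still matters as a safety net if several writers insert at the same time.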