Vexen Crabtree 2015

vexen

Vexen Crabtree's Live Journal

Sociology, Theology, Anti-Religion and Exploration: Forcing Humanity Forwards


Previous Entry Share Next Entry
Vexen Crabtree 2015
vexen

Counting URLs (SQL, in Access environment)

I'm trying to do something quite simple in SQL but can't get it right. I suddenly found myself creating 3 queries just to do this one thing, and I'm sure I should be able to do it in one query. I have a list of URLs in a table (T_URLs including the field URL_URL), and I simply want to return a list of URLs, ordered by domain, with a count of how many URLs are in each domain.

So,

1. Extract the Domain part of the URL. For example the domain of http://www.vexen.co.uk/countries/best.html is http://www.vexen.co.uk/:
SELECT Mid(URL_URL,1,InStr(8,URL_URL,"/")) AS URL_Domain FROM T_URLs;
2. I need to Group By and Order By this URL_Domain field, which is where my problems start. So I need extra SQL like:
Count(URL_Domain)
and
GROUP BY Mid(URL_URL,1,InStr(8,URL_URL,"/")) ORDER BY URL_Domain
3. I then want to count how many URLs are in that domain. The result should be a list of URLs, with an extra "Domain_Q" field that counts how many other URLs are also in that domain:

In total:
SELECT Mid(URL_URL,1,InStr(8,URL_URL,"/")) AS Domain_URL, Count(URL_URL) AS Domain_Q

FROM T_URLs

GROUP BY Mid(URL_URL,1,InStr(8,URL_URL,"/")), URL_URL

ORDER BY URL_URL;
This will later let me iterate through every URL that exists in a domain which has more than 10 URLs in it.

But the SQL so far returns a (correct) list of URL_URLs and Domain_URLs, but, the Domain_Q says "1" for every line in the results.

It works if I put the Count_Q in a different SQL query, but then I end up creating a circular reference when I try to bring the Domain_Q back into the URL list. Can anyone work out (a) what I'm talking about, and (b) what nifty little trick I'm missing to get this done sensibly?

  • 1
At a quick glance, have you tried removing URL_URL from the GROUP BY?

That would be the sensible approach, I agree, but, it gives an error if I do that:

"You tried to execute a query that does not include the specificed expression 'URL_URL' as part of an aggregate function."

Which is an age-old error message that has demonized me for years, and I've never understood, but, the way to get rid of it is to include the offending field in the Group By list.

Shouldn't you be counting Domain_URL 's , and not URL_URL 's? That line creates a URL entry for a domain, and I'm guessing if the same old domain doesn't come up there'll be no new entry in the database table?

Also a sensible question; the problem is that there is no field called Domain_URLs so it causes an error, it is worked-out from URL_URL. This is why I've had to put that part in a seperate query - although it still seems like there should be a way round that, I can't find it!

Hmm, and if you ORDER BY URL_Domain. Or cheat and add max(URL_URL) or something. I really should sit down and think about this.

Well don't worry, I've used a 3-query solution so I'm just going to forget about it. One query does a SELECT URL_URL, URL_Domain, another does "COUNT(URL_Domain) GROUP BY URL_Domain) and then a third brings them together in a LEFT JOIN.

Max(URL_URL) would (if it works for non-numeric) give the last URL_URL from an alphabetical list (i.e., MAX('A','B','C','D','E','F') would be 'F' so it wouldn't work as a count.

I prefer to have things in as few quesries as possible (my db, Ziggy, now has 50), but it is always a trade-off between time-taken (too much, on this issue!) and efficiency.

Thanks for the help though!

I'd gladly try to help, if only it didn't mean scraping the rust of such old SQL that I'd almost be starting from scratch. So I merely wish you luck.

More saliently—would you please put the gory details of such posts behind a cut?

Footpad, you must come visit us sometime once we're settled back in England (near Salisbury).

Gladly! Have I blown it for coming to see you in Monchengladbach?

Where do you fit the screwdriver?

I'm guessing there's no function to count up your Domain_URL factors. Never touched SQL so I'm thinking BASIC but if there's no count-up you'll just get individual units, not ' Σ of ' in olde engineeryey type stuff.

Fundamentally, you're asking for a non-normalised table - a join between count-of-domains and urls-for-a-domain on domain.

I suspect there's no elegant way to do this.

Something like this?

mysql> select * from urls;
+---------------------+
| url |
+---------------------+
| http://foo.com/bar |
| http://foo.com/mew |
| http://foo.com/rawr |
| http://bar.com/rawr |
| http://bar.com/mew |
+---------------------+
5 rows in set (0.00 sec)

mysql> select url, count from (select substring(url, 8, locate('/',url,8)-8) as domain, count(url) as count from urls group by domain) as t1 join (select substring(url, 8, locate('/',url,8)-8) as domain, url from urls) as t2 on t1.domain = t2.domain order by t2.domain;
+---------------------+-------+
| url | count |
+---------------------+-------+
| http://bar.com/rawr | 2 |
| http://bar.com/mew | 2 |
| http://foo.com/bar | 3 |
| http://foo.com/mew | 3 |
| http://foo.com/rawr | 3 |
+---------------------+-------+
5 rows in set (0.00 sec)

mysql>

Re: Something like this?

select url from (select substring(url, 8, locate('/',url,8)-8) as domain, count(url) as count from urls group by domain) as t1 join (select substring(url, 8, locate('/',url,8)-8) as domain, url from urls) as t2 on t1.domain = t2.domain where count > 10 order by t2.domain;

Re: Something like this?

OK I see, you're awesome old friend, I never thought of putting the two complementary queries in as subqueries... I've done with this for now (the thing works - it writes sitemap.xml files correctly for each domain I happen to have URLs on), I'll play with subqueries next time I have a go at editting that part of Ziggy!

Yeah, you might be right but it doesn't seem right!

  • 1
?

Log in

No account? Create an account