Newznab - config help etc. Not an invites thread. :)

Posted this in the NZBMatrix thread, but figured it would probably be better off in its own thread.

Before we start, this thread is aimed at people running their own Newznab indexes at home. It's not a thread for people looking to go public; it's solely about home indexing folk.

Therefore, and as above, this is not a place to ask for invites to people's indexes. :D

-

Anyway, I'll start. I've got a NN index set up indexing around 40 groups, and I'm starting to backfill each group when I can spare the processing power and disk I/O on the main rig. When that's done, I'll move the VM back onto its permanent and slightly less powerful home. Things have gone well, and the docs out there are fairly helpful.

The problem I have is:

Has anyone managed to get Sickbeard talking to NewzNab via HTTPS yet? I keep getting a bunch of auth errors sadly.

I noticed a bunch of homespun newznab servers out there, but they're all http only. I'd rather not back off to http to get the API working, but until Sickbeard is updated to allow auth to Newznab servers I fear this is my only option. :(
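For what it's worth, my plan for narrowing it down is to hit the API directly with curl and take Sickbeard out of the equation - a rough sketch, where the hostname and API key are placeholders and -k just accepts the self-signed cert:

# Hedged sketch: exercise the Newznab API over HTTPS independently of Sickbeard.
curl -k "https://newznab.example.local/api?t=caps&apikey=YOUR_API_KEY"

# A basic search, to confirm authenticated calls work too:
curl -k "https://newznab.example.local/api?t=search&q=ubuntu&apikey=YOUR_API_KEY"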

Any takers? Happy to offer advice/assistance to anyone with config issues (apart from the one above. :D)
 
I played with Newznab over the weekend - installed it on Ubuntu (in a VM) following this guide (http://www.howtogeek.com/120285/how-to-build-your-own-usenet-indexer/), which was fine (had the site up and was selecting groups etc.), but decided to buy the plus version.

The problem I have: I've selected some of the groups from the admin page and set the backfill to, say, 100 days.

I ran the update commands in the misc folder (which took a fair few hours) but I still have no NZBs on the main site.

So installed and running fine.
Run update_releases.php
Run update_binaries.php

Navigate to the site and there are no results.

I don't see any errors when running the two commands above, and they take a while.
 

Wrong way round.

Run update_binaries and then update_releases. Binaries gets the current headers; releases turns those binary headers into releases and NZBs.

Run update_backfill to get past posts, but you need to set the backfill number of days in Newznab's groups config screen.

Another interesting thing I came across is that you can also call update_binaries [news group] to update just a single group, but the name has to match the name configured in Newznab. This was helpful in discovering that some of the groups in Newznab had names that did not match what the news server had, which explained why those groups had no updates.
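To make the order concrete, a minimal sketch of the manual sequence (assuming the default nnplus layout under /var/www/nnplus; the group name is just an example):

# Rough sketch of the manual update sequence, in the correct order.
cd /var/www/nnplus/misc/update_scripts

php update_binaries.php              # 1. fetch current headers for all active groups
php update_releases.php              # 2. turn those headers into releases and NZBs
php update_backfill.php              # 3. pull older posts, up to the backfill days set per group

# Or update a single group - the name must match the one configured in Newznab:
php update_binaries.php alt.binaries.example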

The binaries and backfill threaded scripts also seem to work quite well and are faster at getting headers (seen over 200KB/s max so far) as they update multiple groups in parallel. This is a big plus for backfilling where groups may have a few hundred million headers in the retention period of the server.

My problem now is that, with the eu-astraweb server crapping out, I need to reset the first and last counters for all the groups, as the server supplied corrupt headers but Newznab believes it has them so will not redownload. Purge seems to make no difference. I deleted a group and recreated it, but it seemed to pick up the group's old settings and so will not grab the headers again. Not sure if a clean install is the only way to go.

Oh, and the regex on the plus version still misses massive amounts of releases. The raw search feature is a good spot check, but then it seems you have to create the regex to catch what has been missed in order to get the release, as you cannot just tag something from the raw search and have it create an NZB from all the parts.

I may look at putting some filtering in at the MySQL backend as my SQL is much stronger than my regex.

RB
 
Once you've done the update_binaries and backfill, just use the "newznab.sh" script in "update_scripts/nix_scripts", which will keep checking for you in the background.

Then simply use an init script for this, as below (just change the directories - this was working on RHEL for me):

#!/bin/sh
#
# Ian - 16/11/2011
# /etc/init.d/newznab: start and stop the newznab update script
#
# run update-rc.d newznab_ubuntu.sh defaults


### BEGIN INIT INFO
# Provides: Newznab
# Required-Start: $remote_fs $syslog
# Required-Stop: $remote_fs $syslog
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Start newznab at boot time
# Description: Enable newznab service provided by daemon.
### END INIT INFO

RED=$(tput setaf 1)
GREEN=$(tput setaf 2)
NORMAL=$(tput sgr0)

col=20


# Newznab variables
NN_PATH="/var/www/nnplus/misc/update_scripts"
NN_BINUP="update_binaries.php"
NN_RELUP="update_releases.php"
NN_SLEEP_TIME="10" # in seconds . 10sec is good for 100s of groups. 600sec might be a good start for fewer.
NN_PID_PATH="/var/run/"
SCREEN_NAME="newznab"
PGREP_SEARCH="SCREEN -dmS $SCREEN_NAME"
PIDFILE="newznab.pid"
PRETTY_NAME="Newznab binaries update"

test -f /lib/lsb/init-functions || exit 1
. /lib/lsb/init-functions


do_start() {
    if pgrep -f "$PGREP_SEARCH" > /dev/null
    then
        echo "$PRETTY_NAME is already running."
        return 1
    fi
    echo -n "Starting $PRETTY_NAME ... "
    screen -dmS ${SCREEN_NAME} /var/www/nnplus/misc/update_scripts/nix_scripts/newznab_local.sh &
    PID=$!
    echo $PID > ${NN_PID_PATH}${PIDFILE}
    sleep 1
    if pgrep -f "$PGREP_SEARCH" > /dev/null
    then
        printf '%s%*s%s\n' "$GREEN" $col '[OK]' "$NORMAL"
    else
        printf '%s%*s%s\n' "$RED" $col '[FAILED]' "$NORMAL"
    fi
}

do_stop() {
    echo -n "Stopping $PRETTY_NAME ... "
    if pgrep -f "$PGREP_SEARCH" > /dev/null
    then
        kill -9 `pgrep -f "$PGREP_SEARCH"`
        screen -wipe
    fi
    printf '%s%*s%s\n' "$GREEN" $col '[OK]' "$NORMAL"
}

do_status() {
    if pgrep -f "$PGREP_SEARCH" > /dev/null
    then
        echo "$PRETTY_NAME is running."
    else
        echo "$PRETTY_NAME is not running."
    fi
}


case "$1" in
    start)
        do_start
        ;;
    stop)
        do_stop
        ;;
    status)
        do_status
        ;;
    restart)
        do_stop
        do_start
        ;;
    *)
        echo "Usage: $0 [start|stop|status|restart]"
        exit 1
        ;;
esac

Then "chmod +x /etc/init.d/newznab" and "/etc/init.d/newznab start" and you can leave it. Then simply enable groups via the Http://newznab/admin/groups..." page, and it will be picked up by this.
 
Anyone had a dig around the DB?

The groups table is handy for updating retention days, for example. I updated all active groups back 10 days at a time with one update command rather than using the front end one group at a time, and it is so much easier.
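For example, something along these lines does it in one statement - a sketch only, as the database name and the column names (backfill_target, active) are assumptions from my install, so check DESCRIBE groups first:

# Hedged sketch: push the backfill target on every active group back another 10 days in one go.
mysql -u newznab -p newznab -e \
  "UPDATE groups SET backfill_target = backfill_target + 10 WHERE active = 1;"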

One thing that came to light, though, is the part and partrepair tables. Any idea what is in these tables? IIRC they have a header and body field, and the body looks like UUencoded or yEnc data. This gives me the impression it is actually downloading the full messages along with the headers. It cannot be the full body for all the releases, as I am using only 30GB of disk space and some of the releases are listed as over 30GB on their own. The dev doc does not go into any real detail on the database tables and their use. The binaries class mentions the scan function, which will "Download a range of usenet messages. Store binaries with subjects matching a specific pattern in the database." I am just hoping it is not grouping full releases and then downloading the bodies as well before making NZBs, then discarding the bodies one release at a time.

Any ideas? It could be used for the password-protected check, which makes sense, but not if it gets the body even when the check for password protected is disabled (haven't checked myself).

Update: Has anyone tried rotating config.php (/var/www/nnplus/www/config.php on my install), as this is where the NNTP server details are stored? In theory you could set up multiple servers (one per config.php file) and then rotate through the config files to try and collect headers that are missing on one server from another. As multiple servers can be set up in downloaders like SABnzbd, that should not pose a problem for them. Have three config.php files (config_1.php, config_2.php etc.) which are all the same apart from the server name, plus a file which stores the current config number, and a script that runs after update_binaries.php finishes, removes config.php, renames a config_x.php to config.php, records the latest number in the current_config file and then re-runs update_binaries.php. Of course, you could just have a sed or awk script that modifies the original file to change the server details rather than using separate files. May have to give this a go over the weekend.
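Something like this is the rotation I have in mind - a rough sketch only; config_1.php to config_3.php and the current_config state file are the hypothetical names from the description above, and I have used cp rather than the delete-and-rename so the numbered copies stay intact:

#!/bin/sh
# Rotate NNTP server configs between runs of update_binaries.php (sketch, paths assumed).
WWW=/var/www/nnplus/www
SCRIPTS=/var/www/nnplus/misc/update_scripts
STATE=$WWW/current_config
NUM_CONFIGS=3

# work out which numbered config to use next
last=$(cat "$STATE" 2>/dev/null || echo 0)
next=$(( (last % NUM_CONFIGS) + 1 ))

# swap it into place and record where we are
cp "$WWW/config_$next.php" "$WWW/config.php"
echo "$next" > "$STATE"

# pull headers from this server; the next run rotates to the following one
php "$SCRIPTS/update_binaries.php"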

RB
 
Any MySQL nuggets you can impart RB? Be interested to have a dig around the DB. In particular I was after something to purge and delete all groups from the command line instead of having to point and click my way through the lot on the GUI.

Have you trusted the self-signed certificate? I've not tried this, but issues with SSL and self-signed certs (in general) are usually down to needing to add the cert into the trusted users / publishers container for the computer account (on Windows at least).

Yep, it's trusted. I'll back it off to HTTP tonight and see if I can get it working. Other than that I think it's going to be a case of getting the API URL crafted correctly.

Finally, I've started backfilling my groups. Doing it in blocks of 30 days, but it's terribly slow. My update_binaries used to thrash the DB, CPU and disk I/O with the php/mysql processes being top talkers, now I'm lucky if it uses 100k/sec bandwidth and it's only making one connection per group to get the headers.

Does this sound about right to you all?
 
Oh and another thing, the threaded scripts are very handy but you have no idea what they're doing. I'm trying to diagnose why my backfill is running at 5kb/sec and I've really not got a lot to go on. :(
 

Have you tried running the non-threaded version? Any errors reported, or any disk/DB/network contention? Mine was running slow, but the threaded backfill is now running at between 200-300KB/s with AstraWeb. The single backfill was running pretty slow.

Any MySQL nuggets you can impart RB? Be interested to have a dig around the DB. In particular I was after something to purge and delete all groups from the command line instead of having to point and click my way through the lot on the GUI.

The groups table holds all group details including the first and last post, backfill target etc. The thing is that clearing from there is not enough; you also need to clear the part and partrepair tables, or when you try to redownload parts you will get DB errors as the part ID (presumably a key field, or with unique constraints at the very minimum) reports it already exists, so the insert fails. You need to clear out the two part tables based on the groupID, I imagine. You can also link category and releases, so you can update any releases from a particular group to one category and then go and manually fine-tune.
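For the category/release link, something like this is the sort of one-liner I mean - a sketch only, as the column names (categoryID, groupID) and the category ID value are assumptions, so check your own schema before running anything:

# Hedged sketch: move every release from one (hypothetical) group into one category in a single statement.
mysql -u newznab -p newznab -e \
  "UPDATE releases SET categoryID = 5000
   WHERE groupID = (SELECT ID FROM groups WHERE name = 'alt.binaries.example');"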

I do notice that a number of releases with NFO files do not rename the release to the name in the NFO, so a way of sorting that out would be good.

Finally, I've started backfilling my groups. Doing it in blocks of 30 days, but it's terribly slow. My update_binaries used to thrash the DB, CPU and disk I/O with the php/mysql processes being top talkers, now I'm lucky if it uses 100k/sec bandwidth and it's only making one connection per group to get the headers.

Does this sound about right to you all?

I am hitting 300K using the threaded backfill with compressed headers (AstraWeb). CPU can be high, disk IO is tiny as is DB use. I am running on a midrange SATA III Intel SSD and could quite easily drop to a mechanical HDD with little if any penalty.

RB
 
Any MySQL nuggets you can impart RB? Be interested to have a dig around the DB. In particular I was after something to purge and delete all groups from the command line instead of having to point and click my way through the lot on the GUI.

I have spent a lot of today trying to work out if I can force processing of releases not matching the regex by renaming them and having a specific regex set up to catch the renamed releases. This way I can manage any items not captured via the standard regex and be sure that they get processed the way I want, rather than trying to write some regex that may also catch other items not intended to be caught. I am just testing the process and fine-tuning.

For the purging or deleting, you can purge from binary and parts fairly easily (groups.ID -> binary.groupID, part.binaryID -> binary.ID & groups.ID -> partrepair.groupID). For releases it is likely to be a bit more tricky, as there are a few release tables and no published data dictionary.
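As a sketch of that purge, based on the joins above - the table and column names are taken from the post and from poking around, so treat them as assumptions, and back the DB up first (some installs use plural table names like binaries/parts, so check SHOW TABLES):

#!/bin/sh
# Hedged sketch: purge all header data for one group so update_binaries can re-fetch it from scratch.
mysql -u newznab -p newznab <<'SQL'
SET @gid = (SELECT ID FROM groups WHERE name = 'alt.binaries.example');  -- hypothetical group

DELETE p FROM part p JOIN `binary` b ON p.binaryID = b.ID WHERE b.groupID = @gid;
DELETE FROM `binary` WHERE groupID = @gid;
DELETE FROM partrepair WHERE groupID = @gid;

-- reset the article counters so the headers get pulled again (column names assumed)
UPDATE groups SET first_record = 0, last_record = 0 WHERE ID = @gid;
SQL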

Note, from what I can work out so far, binary holds a release's parts (i.e. [1/24], [2/24] etc.). Parts are the posts that go to make up the individual parts of the release in the binary table.

So for each item in the binary table you will have one or more items in the parts table, and multiple rows from the binary table will be needed to make a release (a single item in binary is also valid, but will usually have multiple parts in the part table).

I have also noticed that the datatype for the ID in binary is bigger than it is where it appears in other tables as binaryID.

RB
 
How do you see the speed that you're getting?

I've got Newznab plus installed in a ubuntu VM. All seems to be working but I've not dabbled with the backfill yet!

Speed is OK, around 300KB/s multithreaded.

I have over 500,000 headers which have not made it to releases for one reason or another.

After setting the retention back to 1500 days and almost finishing the backfills, some of the groups have over 80,000 releases. The thing is that most are useless as they are just code numbers or incorrect regex matches. I am using around 41GB storage but still need to do a bit more backfill to hit the 1500 days on all the groups.

I was also quite surprised that they are not using stored procedures fired with passed parameters, but are instead building SQL in the PHP scripts and then firing the SQL commands.

The biggest issue, apart from DMCA takedowns, is the volume of crap on Usenet. That and the number of posts containing 'applications'.

RB
 
Sorry, my question was: how do you see what speed you're getting? It seems to take a fair old time to run the update_binaries, so I'm wondering how fast it's actually pulling them.

Ahh, I see :). I run it on a virtual machine. The ESXi host allows me to monitor what the virtual machines are doing (processor, RAM, disk, networking). I just check the network bandwidth logged against the Newznab VM to get current, minimum and average.

RB
 
Had it installed on an Atom machine on Debian, with just a few popular groups as a test. As someone who uses a non-binary NNTP lib for forum software, I was more interested in the grouping/threading/parts methodology than the actual binary content outcome.

Both update_releases and update_binaries just kept stalling on my Atom box. Couldn't leave it in screen, it happened that often. The process would start, print a few lines of messages, then just sit there doing nothing for days. I think there is something broken in their PHP code that's not compatible with machines already running other tasks via CLI PHP (my Atom box is running ZoneMinder CCTV). They don't do enough checks to see if the process loaded into memory has died or timed out on the NNTP server end. Anyway, moved to another server I have in the US with just Apache services and it has run fine for the last couple of weeks, but:
- some highly, highly (I can't stress how highly) populated groups have zero releases even after backfilling with 365 days of retention, which suggests the default regexes for the generic binary groups are completely broken
- foreign groups are hit and miss, with definitely more of the latter
- releases are being assigned to the wrong titles and categories all over the place - some generic HD music videos appear as lossless audio, foreign titles are bundled with completely unrelated posters/covers, etc.
- password filtering is very broken. Even with unrar and deep checks, the most obvious fakes and spam just get through. I thought the backfill stage was at fault, maybe checks were removed to save time, but the day-to-day update_binaries/releases still pulls rubbish. I think it uses the same broken method as SABnzbd+, and doesn't check the most obvious thing: not whether the rar made out of the parts is protected with a password, but whether the rar made out of the parts contains another rar inside. In which case, for say a media category, it's an obvious fake.
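A rough sketch of that check, for what it's worth - the path is a placeholder and it assumes unrar is installed; "lb" just lists the bare file names inside the archive:

# Hedged sketch: flag an archive as a likely fake if it contains another rar inside.
ARCHIVE="/tmp/sample-part.rar"   # hypothetical downloaded first part
if unrar lb "$ARCHIVE" | grep -qi '\.rar$'; then
    echo "Nested rar found - likely a fake or passworded release"
else
    echo "No nested rar - looks sane"
fi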
 

The problem is more with human nature and the desire for everyone to be different and not conform.

Binaries posted as binaries quite commonly have no set structure to the headers, and so are pretty hard to catch with regex without also catching a lot of false positives.

Someone posts;
[NASA Release] "Great space station videos" [00/30]

Other people post
Great space station [NASA Release] (part 1)
"Great space station videos" [NASA Release] - [00/30]
"NASA Release" - part 1 of "space station videos"
"Space station.mp4"
"Derp haha" - funny family video [It's great] [00/30]

Getting some regex to correctly capture, name and work out how many parts are required whilst still being generic and not getting any wrong names or false positives is quite difficult.

The regex does an ok job but it really is best efforts and we can always add our own.
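If anyone wants to roll their own, this is roughly how I test a candidate pattern against sample subjects before adding it in the admin screen - the named capture groups follow the style I have seen in the shipped regexes, but treat that as an assumption, and it needs GNU grep with PCRE support (-P):

# Hedged sketch: dry-run a candidate regex against example subject lines.
PATTERN='^\[.*?\] "(?P<name>.+?)" \[(?P<parts>\d+/\d+)\]$'
printf '%s\n' \
  '[NASA Release] "Great space station videos" [00/30]' \
  '"Space station.mp4"' \
| grep -P "$PATTERN"
# only the first subject should be printed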

Another issue is heavy cross-posting. There is one group I would like to have the releases allocated to, but as the releases are cross-posted to other groups and the 'official' regex filters on those groups first, I never get anything in the group I want the posts in. I therefore cannot just browse by group to look at the relevant releases; instead I have to try to sort them out of the cross-post mess of the other group.

I have been doing quite a bit of work over the last three days on renaming headers and then having generic regex to process them, but the cross-posting means they have already been processed into releases. I may now have to write some code to remove the cross-post info before the releases are processed.

I now have the SQL (untested) for deleting releases (and all associated links / previews / thumbnails) and for renaming some header formats.

I did manage to rename 2,000 headers correctly in one go at one stage, but changed the other 498,000 names to NULL, which was a bit of a pain :). After sorting the SQL it now all works as expected, although it probably needs a bit of optimisation. All the work I am doing is backend SQL though, as I have no knowledge of PHP and no time or desire to learn it. I will probably wrap it in shell scripts if I take it any further.

RB
 
Hi all, any tips on how to create a new default website in Apache? I created a file "/etc/apache2/sites-available/newznab" and populated it with:

<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    # You might want to change this
    ServerName localhost

    # These paths should be fine
    DocumentRoot /var/www/newznab/www
    ErrorLog /var/log/apache2/error.log
    LogLevel warn
</VirtualHost>

but when I run "sudo a2ensite newznab" I get ERROR: site newznab doesn't exist!
 