MusicBrainz andmebaasi importimine PostgreSQL-i28. May '14
Käesolevas artiklis püüame importida MusicBrainz andmebaasi oma isiklikus käsutuses olevasse Ubunu/Debian masinasse.
Teeme väljavõtte MusicBrainz serveri Git repositooriumist:
git clone https://github.com/metabrainz/musicbrainz-server/
Loome mbrainz nimelise kasutaja Postgre andmebaasi, selleks kõigepealt logida administratiivkasutajana andmebaasi:
psql postgres postgres
Loome MusicBrainz andmebaasi ning lisame selle kasutamise jaoks ka rolli:
-- Loo roll nimega mbrainz ning sea selle parooliks salakala
CREATE ROLE mbrainz LOGIN PASSWORD 'salakala';
-- Loo andmebaas musicbrainz
CREATE DATABASE musicbrainz;
-- Anna mbrainz rollile kõik õigused (INSERT, SELECT, jms) selles andmebaasis
GRANT ALL ON DATABASE musicbrainz TO mbrainz;
GRANT ALL ON SCHEMA musicbrainz TO mbrainz;
-- Lülitu andmebaasilt postgres üle andmebaasile musicbrainz
\connect musicbrainz;
-- Luba musicbrainz andmebaasis cube laiendus
CREATE EXTENSION cube;
-- Loome nimeruumi musicbrainz kuhu istutatakse hiljem mõned
-- funktsioonid
CREATE SCHEMA musicbrainz;
-- Seame musicbrainz andmebaasil vaikimisi nimeruumi
-- otsingusse musicbrainz schema
ALTER DATABASE musicbrainz SET search_path TO musicbrainz,public;
-- Loome tabelid musicbrainz-server repositooriumist
\i musicbrainz-server/admin/sql/CreateTables.sql
\q
Väljume musicbrainz-server kataloogist:
cd ..
Laadime alla MusicBrainz andmebaasi dump-i. Tegu on sisuliselt hunniku kokku pakitud CSV failidega:
wget http://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport/20130713-003116/mbdump.tar.bz2
Pakime lahti:
tar xvjf mbdump.tar.bz2
cd mbdump
Impordime:
mkdir -p ../done
for t in * ; do
echo `date` $t ; echo "\\copy $t from ./$t" | \
psql musicbrainz postgres && mv $t ../done/ ;
done ;
echo `date` Done
Võtame lahti uuesti Postgres käsurea:
cd musicbrainz-server
psql musicbrainz postgres
Lisame primaarvõtmed ning indeksid, muidu on üleüldse lootusetu nendest andmekogustest midagi välja filtreerida:
-- Lisa primaarvõtmed
\i admin/sql/CreatePrimaryKeys.sql
-- Lisa funktsioonid
\i admin/sql/CreateFunctions.sql
-- Kommenteeri välja järgnevas failis indeksid mis kasutavad
-- paigaldamata laiendusi musicbrainz_collate, page_index jms
\i admin/sql/CreateIndexes.sql
-- Lisa välised võtmed
\i admin/sql/CreateFKConstraints.sql
Nüüd on kõik andmebaasi kitsendused paigas, kuid kettal valitseb paras segadus, selle vastu aitab VACUUM käsk:
VACUUM ANALYZE;
Kui osaliselt on opereeritud postgres kasutajana, võib juhtuda, et mõni tabel pole kättesaadav kasutajale mbrainz:
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA musicbrainz TO mbrainz;
Nüüdseks on kõige hullem möödas ja võib ennast lõbustada päringute tegemisega sellel andmebaasil
Leiame artistide nimede tabelist vaste bändile Queen:
SELECT
id, comment
FROM artist
WHERE name = (
SELECT id
FROM artist_name
WHERE NAME LIKE 'Queen');
Käitatud päring annab artistide nimede tabelist vasteks rea identifikaatoriga 192:
id | gid | comment
--------+--------------------------------------+---------------
701492 | 420ca290-76c5-41af-999e-564d7c71f1a7 | 板野友美
192 | 0383dadf-2a4e-4d10-a46a-e9e041da8eb3 | UK rock group
992994 | 5eecaf18-02ec-47af-a4f2-7831db373419 | US rapper
(3 rows)
Siinkohal kommentaariks, et saadud nimekiri on sisuliselt sama kui kasutaja kirjutab MusicBrainz-i lehel otsingusse 'Queen' ning peale seda valib otsingutulemustest meelepärase kirje:
Järgnevalt otsime artistide tabelist vasteid kus oleks viide samale nimele:
SELECT
release_name.name AS album_name
FROM release
JOIN release_name
ON release_name.id = release.name
WHERE release.artist_credit = 192
LIMIT 20;
Albumite nimekiri:
album_name
----------------------
The Works
Jazz
The Game
A Kind of Magic
A Night at the Opera
A Day at the Races
News of the World
Made in Heaven
The Game
Flash Gordon
Jazz
Sheer Heart Attack
Live Killers
The Works
The Game
Live at Wembley ’86
Flash Gordon
Greatest Hits II
Made in Heaven
Headlong
(20 rows)
Keerukama ühendpäringuga saab klapitada kokku lugude nimed ja albumite nimed:
SELECT DISTINCT
release.id AS "Release mbid",
release_name.name AS "Album name",
medium.position AS "Disc number",
track.position AS "Track number",
track_name.name AS "Track name"
FROM release
JOIN release_name
ON release_name.id = release.name
JOIN medium
ON medium.release = release.id
JOIN track
ON track.medium = medium.id
JOIN track_name
ON track.name = track_name.id
WHERE release.artist_credit = 192
ORDER BY
release_name.name, release.id, medium.position, track.position
LIMIT 18;
Annab vasteks:
Release mbid | Album name | Disc number | Track number | Track name
--------------+----------------------------------+-------------+--------------+-------------------------------------
727641 | 18 Greatest Hits Live From Queen | 1 | 1 | Killer Queen
727641 | 18 Greatest Hits Live From Queen | 1 | 2 | Now I'm Here
727641 | 18 Greatest Hits Live From Queen | 1 | 3 | In the Lap of the Gods... Revisited
727641 | 18 Greatest Hits Live From Queen | 1 | 4 | Bohemian Rhapsody
727641 | 18 Greatest Hits Live From Queen | 1 | 5 | Love of My Life
727641 | 18 Greatest Hits Live From Queen | 1 | 6 | Somebody to Love
727641 | 18 Greatest Hits Live From Queen | 1 | 7 | We Will Rock You
727641 | 18 Greatest Hits Live From Queen | 1 | 8 | We Are the Champions
727641 | 18 Greatest Hits Live From Queen | 1 | 9 | Crazy Little Thing Called Love
727641 | 18 Greatest Hits Live From Queen | 1 | 10 | Another One Bites the Dust
727641 | 18 Greatest Hits Live From Queen | 1 | 11 | Play the Game
727641 | 18 Greatest Hits Live From Queen | 1 | 12 | Save Me
727641 | 18 Greatest Hits Live From Queen | 1 | 13 | Under Pressure
727641 | 18 Greatest Hits Live From Queen | 1 | 14 | Radio Ga Ga
727641 | 18 Greatest Hits Live From Queen | 1 | 15 | I Want to Break Free
727641 | 18 Greatest Hits Live From Queen | 1 | 16 | One Vision
727641 | 18 Greatest Hits Live From Queen | 1 | 17 | A Kind of Magic
727641 | 18 Greatest Hits Live From Queen | 1 | 18 | Who Wants to Live Forever
(18 rows)
Kasutades gruppeerimist ning aggregaatfunktsioone:
SELECT
max(track.position) as "Lugude arv",
track.medium as "Plaadi id",
max(release_name.name) as "Album"
FROM track
JOIN medium
ON medium.id = track.medium
JOIN release
ON release.id = medium.release
JOIN release_name
ON release_name.id = release.name
WHERE track.artist_credit=192
GROUP BY medium
ORDER BY max(track.position) desc
LIMIT 20;
Saame tabeli kus on albumid millel on kõige enim lugusid ning mille esitajate hulgas on Queen:
Lugude arv | Plaadi id | Album
------------+-----------+-----------------------------------------------------
38 | 848290 | School Disco
37 | 789068 | After the Goldrush
32 | 496633 | Extraordinary
29 | 1008179 | James Hyman: DJ Mission Vol 007 (License To Thrill)
29 | 810285 | Highlander
28 | 810284 | Highlander
27 | 1252727 | Hungarian Rhapsody: Queen Live in Budapest ’86
26 | 1139952 | England Expects
26 | 1225309 | Rock Montreal
25 | 460813 | We Will Rock You (Deutsche Originalaufnahme)
25 | 858280 | Wedding in a Box
25 | 1243577 | We Will Rock You (Deutsche Originalaufnahme)
24 | 923754 | 70 Greatest Songs (disc 1)
24 | 1010885 | Queen II / Sheer Heart Attack
24 | 922278 | Hangmania
24 | 1330320 | Hungarian Rhapsody: Live in Budapest
24 | 1257319 | 1979-01-24: Deutschlandhalle, Berlin, Germany
23 | 1238682 | Jock Rock, Volume 2
23 | 1272429 | Hungarian Rhapsody: Queen Live in Budapest ’86
23 | 1257306 | 1977-03-13: Seattle Center Arena, Seattle, WA, USA
(20 rows)
Python-iga liidestamine on üsna lihtne kasutades openpyxl teeki Excel 2007 vormingus failide lugemiseks/kirjutamiseks ning psycopg2 teeki PostgreSQL andmebaasiga liidestamiseks.
#!/usr/bin/python
# coding: utf-8
# Sõltuvuste paigaldamiseks:
# sudo apt-get install python-openpyxl python-psycopg2
QUERY_QUEEN_ALBUMS = """
SELECT DISTINCT
release.gid AS "Release mbid",
release_name.name AS "Album name",
medium.position AS "Disc number",
track.position AS "Track number",
track_name.name AS "Track name"
FROM release
JOIN release_name
ON release_name.id = release.name
JOIN medium
ON medium.release = release.id
JOIN track
ON track.medium = medium.id
JOIN track_name
ON track.name = track_name.id
WHERE release.artist_credit = 192
ORDER BY
release_name.name, release.gid, medium.position, track.position
"""
# Ühendume PostgreSQL andmebaasi
import psycopg2
conn = psycopg2.connect(dbname="musicbrainz", user="mbrainz")
cur = conn.cursor()
cur.execute(QUERY_QUEEN_ALBUMS)
# Loome uue Exceli workbook-i
from openpyxl import Workbook
workbook = Workbook()
worksheet = wb.get_active_sheet()
# Kirjutame päise
for index, header in enumerate(cur.description):
worksheet.cell(column=index, row=0).value = header.name
# Kirjutame andmed
row_index = 1
while True:
row = cur.fetchone()
if not row:
break
print "Kirjutan rea:", row
for column_index, cell in enumerate(row):
ws.cell(column=column_index, row=row_index).value = cell
row_index += 1
# Sulgeme ühenduse PostgreSQL andmebaasi
cur.close()
conn.close()
# Salvestame Exceli workbook-i kettale
workbook.save("albumid.xlsx")
Käesolev juhend on kokku lapitud mitmest allikast tuginedes peaasjalikult MusicBrainzi vikile 1.