CSIS 340 - Project Description

Project Description

Organization of the Project Teams
The Domain of the Data
The Hardware and Software Platform
Project Details
Schedule of Reports and Deliverables
The Final Report
Project Grading
Sample Queries
Final Queries

The project will be done in groups of 3 students (maybe one group of 2, depending on enrollment). Each group will design and populate a database and perform queries over their completed database. Each group will need at least one capable programmer (Java, Python, etc.) - this is more than a suggestion and I will take no pity on groups that struggle with the project because they failed to include a savvy programmer.

The Domain:

The database will consist of information about movies. The raw data is encoded in a number of HTML web pages (or XML files) and can be accessed starting with this page that describes the data. The HTML data is embedded in tables that you will have to read and parse programmatically to extract the data and load your database. The XML data may be easier to parse using an XML parser available in many modern languages. Note that some fields will have multiple values, some have "codes", and some encode relationships to information in other files.

Some of the fields in these "records" can be ignored. Specifically, you may ignore the following pages (files) and fields:

The entire "Synonyms" and "Quotes" files can be ignored
Ignore locations in the "lc" attribute of Movies that are not a valid location code
Ignore the "notes" field in Actors and Casts

Be aware of the "codes" listed in Section 4. You may want to enter these by-hand. Many database applications make use of "codes" and you should consider carefully how you will handle these codes.

Appendix B gives some hints about converting these files to other formats (such as RDBMS).

Finally, like most non-trivial, real-world situations, this dataset is incomplete and inconsistent. As you encounter problems, feel free to ask about them. You should think about what sort of inconsistencies might arise and how you will deal with them.

Platform:

You will be using the public-domain PostgreSQL DBMS. A few demonstration databases are available for you to query. One is called company - a database that resembles the database used by Elmasri & Navathe. In addition, a database called teach_yourself is available that corresponds to the database used by the Teach Yourself SQL book. If you are interested, you can look at the SQL scripts I used to create and populate the company and teach_yourself databases.

You can create an account in PostgreSQL via the PostgreSQL Help page mentioned earlier. You are free to create individual accounts, but in order to avoid having to grant access to others you may choose to share one account for the project.

Project Steps:

Your group will create a conceptual design of the database using ER notation.
Your conceptual design will be converted to PostgreSQL table definitions and the tables created in a database you create.
Populate the database with the data described above. Database population will occur in two steps:
1. To verify your database design and implementation, an initial set of 10 movies may be entered by hand. You may also need to add related data from other files.
2. Formulate a plan for populating the database with the rest of the data. You will need to design and implement a program for populating your database from the raw data.
Formulate queries given in English into SQL and use them to query your database. Some example queries are given below. A final set of queries I will expect you to translate and execute will be given later in the semester.

Deliverables:

There will be 5 project-related assignments in the form of "milestones", along with a polished formal final report. The schedule for these deliverables will be given in the schedule of homework assignments. Some milestones may be due together as appropriate.

A description of the deliverables follows:

An ER diagram for your database design, along with any annotations needed to understand it.
A listing of PostgreSQL SQL table definitions for your database, along with any comments needed to explain its connection to the conceptual design.
Evidence that you have created the database tables and loaded a small subset of the data - you may load the data by hand. Also include a short description of how you plan to load the rest of the data into the database (e.g., language used, rough sketch of the approach, etc.).
Evidence that your database has been loaded (i.e., execute a query to obtain the number of rows in each table)
Your translations of the final queries into PostgreSQL SQL along with a count of how many rows each query returned.
Final Report
(see description below)

Final Report:

The final report is basically a compendium of earlier material, revised to reflect the actual implementation. For each of the tasks listed above you should describe the steps that you tried or considered and the rationale you used to make your final decisions, design and implementation. You must also describe the final result.

Your report should contain an ER diagram of your final schema and PostgreSQL table definitions. You should discuss alternatives you considered and why you rejected them as well as how your design evolved over the course of the project

You should discuss the data loading process, what sort of things you had to consider or problems you encountered and how you dealt with them. In addition, you should also describe any optimizations you made to the database schema and justify your choices.

For each of the queries you should provide your translation into a single PostgreSQL SQL query along with a description of your approach. It may be the case that you could not translate all the queries to SQL due to the limitations of the language, limitations on PostgreSQL's implementation, or your database design. If you were unable to translate a query, you should explain why the query was not expressible. NOTE: I do not give brownie points for highly-convoluted non-general queries written to answer a particular question. Your queries should be short, elegant, and work for all cases. Do not attempt to "cook" the design of your database to answer a particularly difficult query.

The final report should also provide a short section summarizing your experience with the project (e.g., difficulties you experienced, lessons learned, novel solutions you employed, PostgreSQL experiences, and anything else you think interesting or notable). The final report should be relatively short, on the order of 6 - 10 pages, and not include code for populating the database nor database table listings.

Grading:

The milestones are "check-off" items (i.e., merely marked as 1/0 - completed or not).
Each of the 5 is 12% of the project grade.
The final report is 40% of the project grade.

Sample Queries:

The following queries are meant to give you an idea of the sort of queries I will provide to you near the end of the project. The actual set of questions will have values filled in for the X's, and there will be more queries than this.

What movies have <X> in their titles?
List the title and year of movies that <X> has appeared in ordered by the year.
List the names of all actors that have every appeared in a movie with <X>. Names should appear only once and be in sorted order.
How many times did <X> co-star with <Y>?
Which pair of actors have appeared together the most times?
List the names of movies that are remakes of <X>.
List the genres (i.e., categories) of movies and a count of the number of movies in each genre - note that I want the full name of the category, not just the "code".
Has any actor directed a movie they've appeared in? If so, tell me their name, the title of the movie they directed, and the role they played.
Which films, actors, or other people won Academy Awards in 1970? Return only one column with the film title or actor/person name.

Last modified: , by David M. Hansen