Groups and Users | Create and Read

From a previous installment, we had already defined the user table:

CREATE TABLE `user` (
  -- An unique identifier used across the system
  `user_id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  -- The display name
  `user_name` VARCHAR(255) NOT NULL
);

The group table is going to be strikingly similar:

CREATE TABLE `group` (
  -- An unique identifier used across the system
  `group_id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  -- The display name
  `group_name` VARCHAR(255) NOT NULL
);

Additional properties (such as creation date and other status elements) can be added later on, all that matters right now is to provide group and user properties for our items.

The Database Tables

A row in the item table represents the current scalar properties of an item. Non-scalar properties (such as comments) and the item history are going to be saved in other tables. The item table itself looks like this:

CREATE TABLE `item`
(
  -- An unique identifier used across the system
  `item_id`    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,

  -- The creator of this item, to join with the user table
  `author_id`  INT NOT NULL REFERENCES `user`(`user_id`),

  -- The user the item is assigned to (if any)
  `user_id`    INT NULL REFERENCES `user`(`user_id`),

  -- The group the item is assigned to (if any)
  `group_id`   INT NULL REFERENCES `group`(`group_id`),

  -- The creation date
  `created_on` DATETIME NOT NULL,

  -- Last modification date, indexed
  `altered_on` DATETIME NOT NULL,

  -- Is it solved or not?
  `is_solved`  INT(1) NOT NULL DEFAULT 0,

  -- Is it deleted or not (deleted data is never gone)
  `is_deleted` INT(1) NOT NULL DEFAULT 0,

  -- Is it public or not
  `is_public`  INT(1) NOT NULL DEFAULT 0,

  -- Is it archived or not
  `is_archive` INT(1) NOT NULL DEFAULT 0,

  -- The text in the item
  `item_text`  TEXT NULL,

  -- The number of attached files
  `attached`   INT NOT NULL DEFAULT 0,

  -- The number of replies and comments
  `comments`   INT NOT NULL DEFAULT 0
);
exclamationI could have used a “MEDIUMTEXT” field to store the item text. This would have allowed users to store up to 16 million characters, as opposed to the current 65 thousand. This would have created two problems: first, without reasonable limits on what can be stored, either a developer or an user could have had a stupid idea, like storing the contents of entire files in the database, which would be a bad idea for performance—files should be stored on the disk anyway, and there’s no point in serving heaps of data to the user as part of the item. The other problem is how the data is extracted from the database: our mysqli interface based on statements has a fairly naive approach to receiving data from the database by allocating a buffer as large as the maximum size of the input data. This means the text field would have been a 16 million bytes buffer, which could easily exceed the maximum memory allowed by the hosting provider. By using a “TEXT”, this buffer remains at a reasonable 65 thousand bytes.

Then, there’s the matter of keeping track of everything that ever happened to the item. This is actually quite complex, because a lot of things can happen to an item:

  1. Be created (this information actually appears inside the item, but I still need to remember it for optimization reasons that will become obvious later on).
  2. Be assigned to an user (or unassigned).
  3. Be assigned to a group (or unassigned).
  4. Be shown to an user.
  5. Be tagged with a relationship to another item (copy, duplicate, see also, reference).
  6. Be commented on.
  7. Be associated with an attached file.
  8. Be solved (or unsolved).
  9. Be deleted (or undeleted).
  10. Be edited.
  11. Be made public (anyone is a contributor).
  12. Be archived (or unarchived).
  13. Be shaken (updates modification time, like the touch command line tool).

And that’s only the current low-feature version—future versions could involve workflows or mail interaction or other things. Until these are implemented, however, a simple enumeration will suffice:

  class ItemHistoryModel
  {
      const CREATED           =  'created';       
      const ASSIGNED_TO_USER  =  'assigned_to_user';  // user_id
      const ASSIGNED_TO_GROUP =  'assigned_to_group'; // group_id
      const SHOWN_TO_USER     =  'shown_to_user';     // user_id
      const RELATED_TO_ITEM   =  'related_to_item';   // rel_type, item_id
      const COMMENTED_ON      =  'commented_on';      // text
      const FILE_ATTACHED     =  'file_attached';     // text
      const SOLVED            =  'solved';            // status
      const DELETED           =  'deleted';           // status
      const EDITED            =  'edited';            // text
      const PUBLISHED         =  'public';  
      const SHAKEN            =  'shaken';  

      const REL_SEE_ALSO      =  'see_also'; 
      const REL_DUPLICATE     =  'duplicate'; 
  }

The comments to the right of every constant indicate the kind of data that is expected to be associated with that event. For instance, a COMMENTED_ON history line must know what the text of the comment was, so it comes with a text variable. Similarly, an EDITED history line must know what the original, pre-edit text used to be, so that we never lose any data. A TAGGED history line contains the comma-separated tags.

I could have used integer enumerations. This would have made for faster PHP code and faster SQL processing as well, but it would have decreased the readability of the underlying SQL database. If I use normal SQL enumerations, I will end up having the above enumeration text appearing in my fields as-is when I query it manually. If this ends up being a real performance problem, I can always change it back by altering the database schema and the above enumeration definitions—as long as I make sure no other part of the code assumes anything about the value of my constants, which should certainly be the case.

In addition to the event-specific data, history lines must also contain some other pieces of information:

  1. What item the history line relates to.
  2. When it was performed.
  3. Who performed it (an user identifier, or null if done by the system).
  4. If it’s currently disabled, when and by whom.
  5. A revision number to uniquely identify the history line.

So:

CREATE TABLE `revision` (
  -- The unique revision number
  `revision_id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,

  -- The item this applies to
  `item_id` INT NOT NULL REFERENCES `item`(`item_id`),

  -- The user who did this
  `user_id` INT NOT NULL REFERENCES `user`(`user_id`),

  -- When this happened
  `date` DATETIME NOT NULL,

  -- What happened (an enumeration value)
  `type` ENUM(
    'created',
    'assigned_to_user',
    'assigned_to_group',
    'shown_to_user',
    'related_to_item',
    'commented_on',
    'file_attached',
    'solved',
    'deleted',
    'edited',
    'public'
  ) NOT NULL,

  -- The user referenced by the event (shown to, assigned to)
  `user_to` INT NULL REFERENCES `user`(`user_id`),

  -- The group referenced by the event (assigned to)
  `group_to` INT NULL REFERENCES `group`(`group_id`),

  -- The item referenced by the event (related to)
  `item_to` INT NULL REFERENCES `item`(`item_id`),

  -- The type of relationship (to be determined later)
  `relationship` ENUM(
     'see_also',
     'duplicate'
  ) NULL,

  -- The text of the event (if any)
  `text` TEXT NULL,

  -- The resulting status (on/off)
  `status` INT(1) NULL,

  -- The last moderation date
  `moderated_on` DATETIME NOT NULL,

  -- The moderator
  `moderated_by` INT NULL REFERENCES `user`(`user_id`)
);

The combination of these two tables provide all the necessary information about items that we need so far. Constructing high-level information about an item is merely a matter of indexing into the item table (this data is displayed in item lists, and is mostly a summary), while detailed information can be gathered by collecting item-specific data from the revision table.

What is missing is the storage of user rights. First, there’s the group-level rights:

CREATE TABLE `group_users` (
  -- The group and user
  `group_id` INT NOT NULL REFERENCES `group`(`group_id`),
  `user_id` INT NOT NULL REFERENCES `user`(`user_id`),
  PRIMARY KEY (`group_id`,`user_id`),

  -- The level of rights (1 : contributor, 2 : moderator)
  `rights` INT,
);

Once you have this, you can determine whether someone has moderator rights over an item:

  • Is the author of the item.
  • Is the user the item is assigned to.
  • Is a moderator in the group the item is assigned to.

You can also determine whether someone has contributor rights over an item:

  • Has moderator rights over the item.
  • The item is public.
  • Has been shown the item.
  • Is a contributor in the group the item is assigned to.

So, there is no need to store the user rights over items. In fact, storing values in a permanent table is an outright bad idea precisely because it can be deduced this way: keeping the table updated would be slow and require efforts everywhere (an user becoming a group moderator would have to be updated as being a moderator over all items in that group).

The real question is, where do these seven rules for moderator and contributor status end up being written? Is it in the PHP code, or in the SQL code?

If they are written in PHP, they will be slow: for every item in a set, a query will have to be made to determine whether the appropriate rights are available over that item.

If they are written in the SQL queries performed from PHP, we will have to duplicate the code on every query, sometimes in subtly different manners (you cannot join on updates, for instance). This is brittle and ends up being very painful.

If they are written as an SQL view, however, things are looking much better. The rules are automatically executed from SQL, but they are not duplicated anywhere in PHP code. And if they become too complex for SQL to handle, you can turn the view into a table and let PHP insert the values there as appropriate.

Warning: evil SQL code ahead!

CREATE VIEW `item_users` AS
  SELECT `item_id`, `user_id`, MAX(`rights`) AS `rights`
  FROM (   SELECT `item_id`, `user_id`, 2 AS `rights`
           FROM `item`
         UNION
           SELECT `item_id`, `author_id` AS `user_id`, 2 AS `rights`
           FROM `item`
         UNION
           SELECT `item_id`, `user_to` AS `user_id`, 1 AS `rights`
           FROM `revision` WHERE `type` = 'show_to'
         UNION
           SELECT `item_id`, `user`.`user_id` AS `user_id`, 1 AS `rights`
           FROM `user` INNER JOIN `revision` ON `type` = 'public'
         UNION
           SELECT `item_id`, `user_id`, `rights`
           FROM `item` NATURAL JOIN `group_user`
       )
  GROUP BY (`item_id`, `user_id`)

Oops. UNION doesn’t let us use a merge algorithm for this view.

exclamationThere are two MySQL algorithms for views: Merge and Temporary Table. We don’t want to use a temporary table, because that would create one brand new temporary table whenever a query involves accessing user privilegies, which can get very, very, very bad given the expected size of that table once you have enough users and items. What we want to do is use the merge algorithm, which inserts the query SQL into any query that uses the view, but this is not possible if the view query uses an UNION, or the aggregate MAX, or GROUP BY.

So, while in theory a view could have solved this issue cleanly, we end up with technical limitations that prevent MySQL from optimizing the view appropriately. But how much effort, exactly, are we looking at as far as testing for user rights goes? Could we use knowledge about the model to optimize this abit?

Most of the time, all that we need to know is whether an user has Collaborator access to a large number of items (because Moderator access is only required when working on a single item, or a small collection of items, certainly not when reading them).

Now, consider the typical access-large-number-of-items request:

SELECT {fields} FROM `item` {joins} WHERE {conditions} {other clauses}

And somewhere inside the joins list will be the item table, because that’s what we’re reading items from (of course, when reading only from the revisions table, this might be a bit more complicated, but all we have to do is decide that all revisions table reading should join with the item table). And since we’re joining with the item table, we can replace that table with a subquery that eliminates invisible elements—and if we place it at the leftmost among the other joins, we also know where the parameters will be in the PHP function call!

Except that now, we know what the user identifier is, which makes the job a lot easier by eliminating a lot of elements every time.

(SELECT * FROM `item`
 WHERE `author_id` = ?
    OR `user_id` = ?
    OR `public` = 1
    OR `group_id` IN (SELECT `group_id` FROM `group_users`
                      WHERE `user_id` = ?)
    OR `item_id` IN (SELECT `item_id` FROM `revision
                      WHERE `type` = 'show_to'
                        AND `moderated_by` IS NULL
                        AND `user_to` = ?)
) `item`

Since we want to handle all of this transparently, and prepared statements notoriously have no use for array arguments, I will modify the query functions so that they can take array arguments, and automatically expand their arguments. The basic idea is to treat:

Query("foo ? bar ? ? qux ?", $a, array( $b, $c ), $d);

As if the user had written:

Query("foo ? bar ? ? qux ?", $a, $b, $c, $d );

Why? Because this lets us manipulate subqueries in a much more elegant (and reusable) way by sticking together the arguments to the subquery. In essence, we can write:

list($q_visible, $a_visible) = Model::VisibleBy($user);
Query("SELECT * FROM $q_visible `item` WHERE `group_id` = ? LIMIT 20",
      $a_visible, $group);

And this will work regardless of the number of parameters inside the “q_visible” sub-query, because the “visibleby” function makes sure that every parameter in “q_visible” is matched by a value in the “a_visible” array.

The appropriate modifications happen in the DBUtils class:

<?php // utils/db.php

  class DBUtils
  {
      private static function _query(mysqli $mysqli, $query, $input)
      {
          $statement = $mysqli->prepare($query);

          if (!$statement) {
              throw new exception($mysqli->error);
          }

          array_shift($input);
          array_shift($input);
          $input = ArrayUtils::Flatten($input); // new!

          if (count($input) > 1)
              call_user_func_array(array($statement, 'bind_param'), $input);

With a new utility class, ArrayUtils:

<?php // utils/array.php

  class ArrayUtils
  {
      private static function _rec_flatten(&$into, &$from)
      {
          foreach ($from as & $element) {
              if (is_array($element))
                  self::_rec_flatten($into, $element);
              else
                  $into[] = $element;
          }
      }

      public static function Flatten(array $array)
      {
          $result = array();
          self::_rec_flatten($result, $array);
          return $result;
      }
  }

This will allow us to write the PHP side of the data model, which will contain all the behavior.

Groups and Users | Create and Read

0 Responses to “17. Item Model”


  1. No Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>



1170 feed subscribers
(readers who polled a feed this week)