
October 04, 2006

Some background into the Bunk name....

History may be Bunk but here is some bunk history. With the last name of Bunk, I've been asked many times if the name got shortened on the boat over to the US. Well, it turns out it did. My great-grandparents Josef and Agnieszka Bak came over to America and at some point their name got changed to Bunk. Ellis Island has a great website where you can look up the records of passengers coming over. Below is the one for my great-grandma.
Arrival Record of Adnieszka Bunk
It is interesting to note on the manifest that she had not previously been in the US, so she was coming to join Josef at that time. His last name is also given in the list as Bak. I can find no record of his arrival; he probably came before Ellis Island was constructed. She came from Komorow, which I believe is located west of Krakow in what was then Silesia (a part of Germany, since Poland didn't exist at that time). My 2nd cousin Jeff remembers that Mom (my great-grandma) said she remembered her father hating the Prussians (i.e. Germans?), which would make sense if they both came from somewhere in Silesia.

July 27, 2006

Using State Machines and Semantic Technology to Build Decision Assistants

I've often heard about decision assistants and form generation being an area where Semantic Technology can be used to implement a more effective solution. I often struggled to really understand why. Is the implementation faster? Is more functionality possible? Is it more easily maintainable? To answer these questions I decided to map out a design approach that uses a semantic technology and then see how it stacks up with the traditional approach. For the purposes of this example I'll choose a first aid response decision assistant.

So what does my first aid decision assistant need to do? In a nutshell, it needs to help the user make a decision. The system will ask the user some questions about the incident, and based on those answers it will either give a response or ask for additional information, and so on. Let's model the system as a state machine, with each node representing a point in the decision-making process (for example, testing vitals). The arcs of the model represent the conditions or responses that move the user to a different state (for example, an irregular heartbeat).

Let’s assume that we have accomplished the daunting task of creating an ontology that maps out all of the states a first aid responder could be in for a given situation. A portion of it, for treating a burn victim, could look like:

example first aid ontology

The next step would be to apply attributes to each of the nodes detailing what information needs to be provided to the user at each state. For the top node you would need to tell the user how to determine what degree of burn it is. You would also need to define what information you need to collect from the user, in this case the degree of burn, and tie the input of this information to an arc on the graph. If this ontology were in a triple store (a semantic database), you could query it to determine which state you should go to if you are currently in state X (a node) and event Y occurred (an arc). The triple store would return the next state along with the information that should be presented to the user and the information that needs to be collected.
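To make that query concrete, here is a minimal sketch in Python. It assumes a plain in-memory list of triples standing in for a real triple store, and every state name, predicate name, and prompt in it is a made-up placeholder for illustration (not first aid guidance and not any particular product's API).

```python
# Hypothetical stand-in for a triple store: (subject, predicate, object).
# Predicates starting with "on:" are arcs (events); the rest are node attributes.
TRIPLES = [
    ("AssessBurn", "prompt", "Determine the degree of the burn."),
    ("AssessBurn", "collects", "degree of burn"),
    ("AssessBurn", "on:first degree", "CoolBurn"),
    ("AssessBurn", "on:third degree", "TreatForShock"),
    ("CoolBurn", "prompt", "Placeholder instructions for cooling the burn."),
    ("TreatForShock", "prompt", "Placeholder instructions for treating shock."),
    ("TreatForShock", "collects", "pulse"),
]


def objects(subject, predicate):
    """Return every object matching the (subject, predicate, ?) pattern."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def next_state(state, event):
    """Given the current state X and event Y, return the next state plus
    what to present to the user and what to collect from them."""
    targets = objects(state, "on:" + event)
    if not targets:
        return None  # no arc for this event out of this state
    target = targets[0]
    return {
        "state": target,
        "present": objects(target, "prompt"),
        "collect": objects(target, "collects"),
    }


if __name__ == "__main__":
    print(next_state("AssessBurn", "third degree"))
```

Adding a new state or arc in this sketch means adding triples, not editing the lookup code, which is the property the ontology-driven design is after.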

How does this compare to a traditional design? One way of doing a traditional design would be to create a bunch of web pages with information and forms that direct the user to the next appropriate page. This solution would definitely be faster to implement. The functionality in this simple example would be the same. The only disadvantage of this approach shows up when you talk about maintenance, and in an example this small it would be minimal. Down the road, if I wanted to add a new state or change the state change conditions, I would have to add a new web page and alter the content of one or more existing pages. In a larger model, the chance of missing a link or event that needs adjusting would be high, possibly causing quality issues. In this simple example the traditional approach seems to be the better design.

This example highlights why Semantic Technology offers few convincing arguments for adoption in an isolated project. But what if I’m running a website like WebMD and have a boatload of information about the medical field? Let’s say I already had a repository of lessons learned and had structured those using semantic standards. One of the fields in the model tied lessons to a step in a process related to first aid. With very little work I could add a check for relevant lessons learned to my query for a given state.

Now let’s imagine that “treat for shock” was a common step in numerous first aid procedures, with common things that must be told to and collected from the user but with different next steps. In Semantic Modeling all you would have to do is create a “shock treatment” parent class with one set of information on how to treat for shock, then make “hypothermia shock treatment” and “broken bone shock treatment” subclasses with the different next steps that can occur. Now if I have a lesson learned tied to treatment for shock, it can be inferred that it ties to the subclasses as well, while I only have to maintain one actual link.
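The inference described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names mirror the ones above, while the dictionaries, the lesson text, and the function names are hypothetical and stand in for what a semantic reasoner would do over a real model.

```python
# Hypothetical subclass links: child class -> parent class.
SUBCLASS_OF = {
    "HypothermiaShockTreatment": "ShockTreatment",
    "BrokenBoneShockTreatment": "ShockTreatment",
}

# The lesson learned is linked once, to the parent class only.
LESSONS = {
    "ShockTreatment": ["(placeholder lesson learned about treating for shock)"],
}


def ancestors(cls):
    """Return cls followed by each of its parent classes, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def lessons_for(cls):
    """Collect lessons attached to cls or inherited from any parent class."""
    found = []
    for c in ancestors(cls):
        found.extend(LESSONS.get(c, []))
    return found


if __name__ == "__main__":
    # Both subclasses pick up the lesson without a link of their own.
    print(lessons_for("HypothermiaShockTreatment"))
    print(lessons_for("BrokenBoneShockTreatment"))
```

The single entry in `LESSONS` is the "one actual link" being maintained; each subclass picks it up by walking its parent chain.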

To accomplish this with a traditional approach, there would be a large number of links and a lot of repeated data I would have to worry about entering. When making changes, there could potentially be tons of data I would have to change across many pages. Not so simple anymore.

So with this broader vision in mind we can again compare to a traditional solution. The semantic approach would likely be faster to implement because information would not have to be duplicated for similar areas, and links between relevant information would not always have to be explicitly stated. The traditional approach might be able to offer all the same functionality a semantic solution would provide, but as the complexity of the system grows, the time needed to implement it would grow rapidly, so available resources would end up limiting functionality. Maintenance might have been only a little better with a semantic tech approach in the small example before, but it leaves the traditional approach in the dust for complex systems.

The benefits of a semantic approach increase as system complexity and available information increase. Keep this in mind when you consider whether or not semantic technology is right for you.

July 18, 2006

The National Information Exchange Model

The Government has released the National Information Exchange Model Beta 1.0. Initially this will help government agencies to exchange information covered by this model more effectively. The long term benefits of this model will be the ability to perform concept aware searches across vast repositories of information and enable greater data visibility. This can be made possible with the introduction of semantic technologies to help define the interrelationships of the information.

I was going to write up a sample use case but found one in the Introduction documentation instead. It reads,

"U.S. Border Patrol agents view a map of the area, displaying fixed locations such as landmarks and roads, agent locations, and the status of seismic sensors, on a vehicle‑mounted or, when away from their vehicle, handheld device. When sensor activation is displayed on the map, the nearest agent indicates that he will respond to it. The responding agent approaches the location and encounters a group of suspected undocumented migrants. He identifies himself as a U.S. Border Patrol agent and apprehends the majority of the group, but two men in the group escape. The agent radios a description of the two men and their direction, and the approximate last known position of the “got aways” is entered on the map so that other agents in the field can view it. A search is coordinated for the two migrants. Meanwhile, information from a citizen’s call about two suspected undocumented migrants loading into a pickup is entered on the map. The closest mobile unit pulls in behind the pickup. The agent immediately runs searches concerning the vehicle license plates, and after receiving positive results on the records checks, a traffic stop is affected. The agent begins to question the driver and the two passengers and notices that the passengers match the description of the two “got aways” reported earlier. He runs the name and identification of each passenger in a federated query against local, state, and federal databases. The rapid response comes back with a positive history of immigration violations, as well as records of criminal violations. With probable cause established, the three men are taken into custody for further processing. When their fingerprints are run and other national databases are checked, one of the prisoners is found to be on a Terrorist Watch List, under a different name and identification."

Implementing these high-value use cases requires large amounts of information to be formatted properly with the correct meta attributes. Creating a model like NIEM is a big start to making that a possibility.

January 31, 2006

Data Mining Enron's Email

My boss pointed me to a Wired article talking about the "Enron Corpus", the collection of 619,446 emails publicly available as a result of the trial.

What a great source of data to practice data mining on. In particular, it is a good fit for a new visualization engine I heard about called Prefuse, from Jeffrey Heer. Check out the Prefuse toolkit being used on the Enron Corpus. Very cool stuff.

March 21, 2005

Free DMTA Tools

One of my favorite free tools dealing with visualization is Treemap.

If you were to put together a data portal to constantly evaluate the trends of data in the agency this would be the type of thing that gives a nice overall snapshot of the data and the direction it is going. It also serves as a nice starting point for performing a specific analysis on a set of data. Here is a great example of what Treemaps can do.

The one thing to know about this example is that they have taken the free concept and initial source code of Treemap and extended it beyond its original capability to be specifically geared to financial analysis. They charge a license fee if you want to use their extended version.

Traditional statistical/analytic techniques are provided for free by the R project. The R project is modeled after the S project and contains most of the same functionality. R can perform linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and present the results graphically.

The next few free tools are as much source code APIs as they are tools. They are powerful but difficult to use, and using them effectively requires some grounding in text and data mining theory.

The first tool is Kea. It performs text key phrase extraction. Think of it as a way to figure out key concepts in unstructured text.

The next tool is Weka. Weka is by far the best free (open source) software package I have seen for data mining. It contains the majority of the data mining methods discussed in the workshop (data pre-processing, classification, regression, clustering, association rules, and visualization).

Here is an interesting article about an example of someone using Weka and Kea to mine, organize and analyze an internet mailing list's archives.
Note: it's been translated from German, so the wording is a bit off.
The first chapter of their results is available on line.

One more worth mentioning in the free tool category is JFreeChart, a free Java class library for graphing.

As you can see not all DMTA has to be expensive.

March 20, 2005

Statistical Data Mining Tutorials

Data Mining Tutorials

March 10, 2005

Starlight Information Visualization System

From Battelle Corporation (the people who created the CD) comes a software product that, as they say on their website,

couples advanced information modeling and management functionality with a visualization-oriented user interface
