GIS Tips – How to Find the EPSG Code of your Shapefile

RTFM is the best phrase I can think of here. Having spent the last 10 years checking through the .prj parameters of shapefiles to make sure I am using the correct EPSG code, someone nudged me and said that they had used something they found on the Boundless site.

The answer? A really simple tool called PRJ2EPSG, which lets you load or paste the .prj info from your shapefile. If only I’d read the Boundless manual (and not been such a know-it-all!) I’d have been using this months ago!

Below is the text from the Boundless page

A common problem for people getting started with PostGIS is figuring out what SRID number to use for their data. All they have is a .prj file. But how do humans translate a .prj file into the correct SRID number?

The easy answer is to use a computer. Plug the contents of the .prj file into http://prj2epsg.org. This will give you the number (or a list of numbers) that most closely match your projection definition. There aren’t numbers for every map projection in the world, but most common ones are contained within the prj2epsg database of standard numbers.

[Screenshot: prj2epsg.org]
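If you would rather do the lookup offline, here is a minimal sketch (my addition, not part of the Boundless text) using the pyproj library to read the .prj and ask for the closest matching EPSG code. The file name is just a placeholder:

# Minimal sketch (assumes pyproj is installed); the .prj path is hypothetical
from pyproj import CRS

with open("my_data.prj") as f:
    wkt = f.read()                 # the WKT definition stored in the .prj file

crs = CRS.from_wkt(wkt)
print(crs.to_epsg())               # best-matching EPSG code, or None if no confident match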

GIS Tip of the day? RTFM!

Nick D

What is Bayes analysis and how can it benefit your business? (The GIS bit)

Okay, so in the last blog I shamefully shared a great article from Butler Analytics on Bayes analysis and WHAT it is. The question is, how does that have implications for my GIS? In the second article from Butler Analytics, they describe some of the applications of the analysis. I would dearly love to see some of these start to make their way into the more prominent GIS packages, as at the moment this kind of analysis is only possible with MapWindow (SMILE).

Shamefully Stolen from Butler Analytics

Predictive Analytics Techniques

A number of predictive analytics techniques (applications of data mining technologies) that are frequently used in predictive modeling are explained below.

Supervised Learning Techniques

The techniques shown below are used in a supervised learning scenario. This is where a data set is provided for the tools to learn from, so that new data can be classified or a value predicted through regression.

Bayes Classifiers

Bayesian classifiers use a probabilistic approach to classifying data. Unlike many data mining algorithms, Bayesian classifiers often work well with a large number of input attributes without hitting what is called the dimensionality problem. Naive Bayes is the technique most often employed – the term ‘naive’ coming from the assumption that input attributes are independent of each other (i.e. there are no correlations between them). Despite the fact that this is often not true, naive Bayes still gives good results. Unfortunately it is often overlooked in favour of more esoteric methods, whereas it should actually be a first port of call where it is relevant to the problem and most attributes are categorical (i.e. categorised).

Bayes works by combining what are called conditional probabilities into an overall probability of an event being true. Explaining Bayes is difficult (as evidenced by the large number of explanatory videos on YouTube), but if you want to learn more an introductory article can be found here.
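As a concrete illustration of the technique, here is a minimal naive Bayes sketch using scikit-learn. The library and the toy data are my own choices, not something the Butler Analytics article prescribes:

from sklearn.naive_bayes import CategoricalNB
import numpy as np

# Hypothetical categorical attributes, already encoded as integers
# (e.g. land cover class, soil type, slope band)
X = np.array([[0, 1, 2],
              [1, 1, 0],
              [0, 2, 1],
              [2, 0, 2]])
y = np.array([0, 1, 0, 1])              # the class of each training record

model = CategoricalNB().fit(X, y)
print(model.predict([[0, 1, 1]]))       # most probable class for a new record
print(model.predict_proba([[0, 1, 1]])) # the posterior probabilities behind it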

Decision Trees

Decision trees are a favorite tool to use in data mining simply because they are so easy to understand. A decision tree is literally a tree of decisions and it conveniently creates rules which are easy to understand and code. We start with all the data in our training data set and apply a decision. If the data contains demographics then the first decision may be to segment the data based on age. In practice the decision may contain several categories for segmentation – young/middle age/old. Having done this we might then create the next level of the tree by segmenting on salary – and so on. In the context of data mining we normally want the tree to categorise a target variable for us – whether someone is a good candidate for a loan for example.

The clever bit is how we order the decisions, or more accurately the order in which we apply attributes to create the tree. Should we use age first and then salary – or would the converse produce a better tree? To this end decision trees in data mining use a number of algorithms to create the best tree. The most popular algorithms are Gini (which uses probability calculations to determine tree quality) and information gain (which uses entropy calculations).

When large data sets are used there is the very real possibility that the leaf nodes (the very last nodes where the target variable is categorised) become sparsely populated, with just a few entries in each leaf. This is not useful because the generalisation is poor. It is also the case that the predictive capability drops off when the leaves contain only a few records. To this end most data mining tools support pruning, where we can specify a minimum number of records to be included in a leaf. There is no magical formula that will say what the level of pruning should be; it is just a matter of trial and error to see what gives the best predictive capability.
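For illustration only, here is a minimal scikit-learn sketch (my addition) showing the two split-quality measures mentioned above and pruning via a minimum number of records per leaf:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# A synthetic training set stands in for real data
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# criterion="gini" uses probability-based impurity; "entropy" gives information gain
tree = DecisionTreeClassifier(criterion="entropy",
                              min_samples_leaf=50,   # prune sparsely populated leaves
                              random_state=0)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())         # how big the pruned tree is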

Virtually all data mining tools implement decision trees and some offer elaborations on the basic concept – regression trees for example where the tree is used to predict a value, rather than categorise.

Decision trees are often used to get a feel for data even if they are not part of the resulting model, although good results can be obtained from decision trees in many business applications.

Nearest Neighbors (k-NN)

Entities can often be classified by the neighborhood they live in. Simply ask whether your own neighborhood gives a fair representation of you, in terms of income, education, aspiration, values and so on. It doesn’t always work – but usually it does – birds of a feather and all that. A similar mechanism has been developed to classify data – by establishing which neighborhood a particular record lives in. The official name for this algorithm is k-Nearest Neighbor, or k-NN for short.

The essential idea is this. Imagine you are interested in buying a second hand car. Mileage, fuel efficiency, service history and a number of other attributes will typically be of interest. Someone you know has a database of used cars which includes these details, and each car is categorised as a peach or a lemon. When you enter the details of the car you are interested in, the k-NN algorithm will find the five cars (so k=5 in this instance) that most closely match yours. If more are peaches than lemons then you might have a good car – and that’s it.

Obviously it gets a bit more involved with large commercial data sets – but the idea is simple enough. It works best where most of the attributes are numbers that measure some sort of magnitude, so that the algorithm can establish where the nearest neighbors are. Attributes that represent classifications can be a problem and so k-NN may not be suitable. Even so this simple algorithm is widely used and can deliver good results.
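A minimal sketch of the used-car idea with scikit-learn’s k-NN (my own illustration; the numbers and columns are invented):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# columns: mileage (thousands), miles per gallon, years of service history
cars = np.array([[30, 45, 5], [120, 30, 1], [60, 40, 4],
                 [150, 28, 0], [45, 42, 6], [90, 35, 2]])
labels = np.array([1, 0, 1, 0, 1, 0])     # 1 = peach, 0 = lemon

# k-NN is distance based, so scale the magnitudes first
scaler = StandardScaler().fit(cars)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(cars), labels)

candidate = scaler.transform([[70, 38, 3]])   # the car you are thinking of buying
print(knn.predict(candidate))                 # majority vote of the 5 nearest cars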

Neural Networks

If decision trees represent transparency and good behaviour then neural networks epitomise opaqueness and temperamental behaviour. But what else would you expect from a sometimes brilliant and other times obstinate technology? Neural networks are used for prediction and classification, and through the development of self-organising maps (SOM), for clustering. They are called neural networks because they supposedly mimic the behaviour of neurons in the nervous system, taking inputs from the environment, processing them and creating an output. And just in the same way that neurons are linked together, so are nodes in a neural network. As with other data mining techniques neural networks demand that a good selection of relevant inputs are available, that the target output is well understood and that copious amounts of data are available for training.

The most commonly used type of neural network is called a feed forward network. As the name suggests it works by feeding the outputs from each node forward to the next node as its inputs. The flow is essentially one direction, although something called back propagation is used to tune the network by comparing the network’s estimate of a value against the actual value. Nodes in a network do two things. They combine the inputs by multiplying each input by a weight (to simulate its importance) and summing the products – this is called the combination function. Other functions are used, but this is the most common. Secondly, the output from the combination function (a single number) is fed into a transfer function which usually takes the form of a sigmoid (an S shaped curve) or a hyperbolic tangent. These curves allow the network to deal with non-linear behaviour. In essence they create a linear relationship for small values, but flatten out for large values. This form of non-linearity is an assumption – but it often works well. The output from the transfer function is then fed to the next node in the network.

Most neural networks have three layers – the input layer, a hidden layer, and the output layer. The hidden layer is so named because it is invisible, with no direct contact to inputs or outputs. Knowing how large to make the hidden layer is one of the crucial issues in using a neural network. Make it too large and the network will simply memorise the training set with absolutely no predictive capability at all. Make it too small and useful patterns will be missed.

Using a neural network requires a considerable amount of skill and the results can range from the sublime to the ridiculous simply by modifying any one of a number of parameters. The most important parameters include:

  • The size of the training set.
  • The number of hidden layers and the number of nodes in each hidden layer.
  • Parameters affecting how quickly the network learns.
  • The features to use as input.
  • The combination functions and transfer functions.

This list is by no means exhaustive, and issues such as normalising inputs, converting categorical inputs and so on all have a profound effect on the quality of the network produced. Some of the plug and play analytics tools omit neural networks altogether, and for good reason. Other methods produce equally good results without the temperamental behaviour. Having said this, neural networks can detect patterns that evade detection by other means, and they are very good at picking up some non-linear behaviours.
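To make the parameter list above more concrete, here is a minimal three-layer feed-forward sketch using scikit-learn’s MLPClassifier (my own illustration, on synthetic data):

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)          # normalise the inputs

mlp = MLPClassifier(hidden_layer_sizes=(8,),   # one hidden layer of 8 nodes
                    activation="tanh",         # hyperbolic tangent transfer function
                    learning_rate_init=0.01,   # how quickly the network learns
                    max_iter=1000,
                    random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))                         # accuracy on the training set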

Support Vector Machines

Support Vector Machines (SVMs) are one of the most powerful classes of predictive analytics technologies. They work by separating out data into regions (by hyperplanes in multi-dimensional spaces, for those interested), and as such classify the data. Oracle, for example, has a predictive analytics Excel add-on that uses SVMs exclusively. Having said this, they are not a relevant tool for all analytics problems and can over-fit the data in the same way as neural networks – although there are mechanisms for minimizing this effect.

SVMs are an essential component in any analytics toolkit and virtually all suppliers include an implementation.
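Again purely as an illustration (the article names no particular library), a minimal scikit-learn SVM sketch on synthetic data:

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# The regularisation parameter C is one of the mechanisms for limiting over-fitting
svm = SVC(kernel="rbf", C=1.0)
print(cross_val_score(svm, X, y, cv=5).mean())   # cross-validated accuracy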

Unsupervised Learning Techniques

These techniques are used to find relationships within data without being offered a data set to learn from. As such there is no special nominated attribute in a data set that is to be categorized or calculated (or scored, in the lingo of predictive analytics). Despite this, these techniques do allow new data to be allocated to a cluster or associated with a rule. The two dominant techniques here are called clustering and association.

Clustering

Clustering is very similar to the k-NN technique mentioned above but without specifying a particular attribute that is to be classified. Data are simply presented to the clustering algorithm, which then creates clusters using any one of a number of techniques. This is as much an exploratory technique as a predictive one. A typical example might be clustering patients with similar symptoms.
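A minimal clustering sketch with scikit-learn’s k-means (my own illustration; the article does not prescribe an algorithm, and the patient data are invented):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical patient records: age, temperature, symptom-severity score
patients = np.array([[25, 37.0, 2], [67, 38.5, 7], [30, 36.8, 1],
                     [72, 39.1, 8], [45, 37.2, 3], [60, 38.9, 6]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(patients)
print(kmeans.labels_)                     # the cluster assigned to each patient
print(kmeans.predict([[50, 38.7, 7]]))    # allocate a new patient to a cluster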

Association Rule Mining

Summary – Use Association Rule Mining (ARM) when you want to discover rules relating attribute values to each other. The general form of a rule is ‘IF this THEN that’, and in a supermarket shopping habits analysis an example might be IF milk THEN bread. Establishing the usefulness of rules is a major part of most ARM projects.

On the face of it association rule mining (ARM) is a simple enough activity. We simply let loose the ARM algorithms on the data of our choice and a plethora of rules will be found. The general format for these rules is as follows:

IF this THEN that

To add some meat to the bones let’s consider the habits of shoppers at a supermarket. A specific rule might say:

IF (bread and milk) THEN (Jam)

The ARM algorithm will have plowed through millions of combinations of products to identify this particular rule and thousands of others. In fact this is one of the problems associated with ARM: it finds an excess of rules, many of which are simply not useful.

ARM is generally categorized as an unsupervised learning method. Supervised learning involves giving the data mining algorithms a target variable to classify (a good or bad loan prospect for example), and a data set to learn from. Unsupervised learning does not involve a target variable and is more exploratory in nature.

In the supermarket example the ARM algorithm will typically produce thousands of rules of the form mentioned above. Another one might be IF (cheese and apples) THEN (beef and oranges). When we look at the data we might find that the number of people buying cheese and apples together is quite small, and that only 100 cases out of millions of shopping baskets obeyed this rule – clearly it doesn’t mean very much.

So that we are delivered a meaningful set of rules, there are a number of measures that can be used to filter out the more trivial ones. The use of such measures is a major part of using ARM, so we will give it some attention. This involves the use of a Venn diagram, just to make things easier to understand.

[Venn diagram: all shopping baskets (blue rectangle, Nt), baskets with the ‘this’ combination (yellow circle), baskets with the ‘that’ combination (green circle), and their intersection (the rule)]

In the diagram above we have three basic shapes. The blue rectangle represents all the shopping baskets we wish to consider, and this has a total of Nt shopping baskets. The yellow circle represents all the shopping baskets with the ‘this’ combination of goods. The green circle represents all the baskets with the ‘that’ combination of goods, and the intersection of the two circles represents our rule. Going back to our basic rule formulation we can rephrase it as:

IF left THEN right

Both ‘left’ and ‘right’ represent combinations of items, and are labeled as such because they appear on the left and right hand side of the rule. So Nl is the number of instances (shopping baskets) that satisfy the ‘left’ combination of items (cheese and apples, say). Nr is the number of instances that satisfy the ‘right’ combination of items (beef and oranges). The instances which satisfy the actual rule are represented by the intersection of the two circles and number Nb. So having got the terminology out of the way, we can define three central measures used to filter useful rules.

The first is Confidence and this is calculated as Nb/Nl. This is a measure of the number of baskets that satisfy the rule as a fraction of baskets that satisfy the left hand combination. To make this meaningful we will go back to our first rule:

IF (bread and milk) THEN (Jam)

Let’s say there are 500,000 shopping baskets that contained bread and milk (Nl) out of a total of 1,000,000 shopping baskets (Nt). And there are 300,000 shopping baskets that contained jam (Nr). The number of shopping baskets which contained bread, milk and jam is 200,000 (Nb). Confidence for this rule would be 200,000 / 500,000 = 0.4 or 40%. Confidence gives a measure of how exclusive a rule is. In this case 60% of the bread and milk purchases were associated with other items. Confidence should be much higher than this – around 75%. So knowing someone purchased bread and milk does not really tell us all that much.

Support is specified as Nb/Nt. This gives us some measure of how ‘big’ this rule is. Typically users of ARM methods want a support of greater than 1%. A rule really doesn’t mean very much if there are only five instances of it out of 1,000,000. In our example support is calculated as 200,000 / 1,000,000 = 0.2 or 20% – a very respectable number.

Completeness is specified by Nb/Nr. This tells us whether the rule predicts a substantial number of the instances where the ‘right’ combination is present. If it only predicts a small fraction then it would not be much use in promoting those products on the ‘right’. The completeness of our rule is 200,000 / 300,000 or 0.67 or 67% – once again very acceptable. There are many other measures of a rule, but these are the most common. We will look at others in subsequent articles.
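Putting the three measures together with the numbers from the example above (a plain Python restatement of the same arithmetic, nothing new):

# The bread-and-milk example from the text
Nt = 1_000_000   # all shopping baskets
Nl = 500_000     # baskets containing bread and milk (the 'left' side)
Nr = 300_000     # baskets containing jam (the 'right' side)
Nb = 200_000     # baskets containing bread, milk and jam (the rule itself)

confidence = Nb / Nl     # 0.4  -> 40%
support = Nb / Nt        # 0.2  -> 20%
completeness = Nb / Nr   # 0.67 -> 67%
print(confidence, support, completeness)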

ARM is used for many purposes other than analyzing shopping baskets. It can be used to analyze credit card purchases, medical symptoms and much more. In the next article we will get more into the algorithms used for ARM with their relative strengths and weaknesses.

Bayes predictive analysis and what it can do for your business

I sometimes work with an environmental team doing analysis of environmental impacts and cumulative issues, and one thing that has been required of late is Bayes predictive analysis. Not heard of it? Me neither, until I came across this really insightful article on the Butler Analytics website. In essence, Bayes analysis is a rather accurate way of making predictions. Before I post the GIS side of the analysis, I thought I’d post the info about what it is, so you can start to understand how useful this would be within a GIS.

Shamefully stolen from Butler Analytics

Bayes for Business Part 1

The basic idea behind the Bayesian approach to probability and statistics is fairly straightforward. In fact it mimics the way we operate in daily life quite accurately. Bayes provides a means of updating the probability some event will occur as we acquire new information. This sounds like a fairly uncontentious way of going about business, and yet for over two hundred years Bayes was rejected by the mainstream community of statisticians – although the military has always used it secretly, and left the egg-heads to debate the philosophical issues (Bayes was used in the Second World War to help crack the Enigma Code). The basic philosophy behind Bayes is well expressed in the following quote:

When the facts change, I change my opinion. What do you do, sir?
John Maynard Keynes

The Bayes Rule for computing probabilities was formulated by the Rev Thomas Bayes, an eighteenth-century clergyman. It was published posthumously and didn’t really receive any attention until the great mathematician Laplace reformulated and extended the idea. Even then statisticians remained in denial for over two hundred years, until the overwhelming success of some high profile applications meant they could not hide their heads in the sand any longer – but believe it or not there is still a statisticians’ equivalent of the flat earth society.

Classical statistics is far more rigid. It assumes a population of objects we are interested in and that some unknown parameters describe this population perfectly. This is flawed from the start – but let’s not get distracted by philosophical issues. We then sample the population and derive statistics which, through a whole lot of jiggery-pokery, are assumed to approximate to the parameters of the whole population. Bayes is far more direct. Nothing is fixed and probabilities vary as new information becomes available. Imagine it is summer and rain is a rarity. On watching the weather forecast we see a high probability of rain predicted for the following day. Do we choose to ignore this information, or do we make sure we go out with an umbrella the following day? The latter would seem to make sense – a perfect example of Bayesian thinking.

If we are going to get any further than anecdotes and analogies we need a smattering of math unfortunately. It isn’t onerous, and as long as we remember the meaning behind the math the end results should make sense. Although I should warn that the results from Bayesian analysis are often counter-intuitive.

Conditional Probabilities

A central notion used in Bayes is that of conditional probability. Please do not fast forward at this point; conditional probabilities are fairly easy to understand. Going back to the rain example, we might say that the probability of rain on a given day in summer is just two percent – or a probability of 0.02. So given it is summer we assume that the probability of rain tomorrow is fairly remote. This could be expressed as P(Rain|Summer) – read as the probability of rain given it is summer – the vertical bar can be read as ‘given that’. We know the answer to this: it is P(Rain|Summer) = 0.02. This is the conditional probability of rain given that it is summer.

Conditional probabilities can be expressed in terms of unconditional probabilities via a simple formula. To get to this formula we’ll use a Venn diagram.

[Venn diagram: all days in the year (blue rectangle), summer days (yellow circle), rainy days (brown circle) and their intersection]

The blue rectangle represents all the days in the year – 365 of them. The yellow circle represents all the summer days – say 90 of them. The large brown circle represents all the days it rains – 150, say. The bit we are really interested in is the small orange area, the days when it is summer and it rains. The probability of rain on any particular day in the year is the number of days it rains divided by the number of days in the year. We’ve already said it rains 150 days a year, and so the probability of rain on any given day is 150/365, and this equals 0.411, or forty-one percent. This takes no account of the season, but is just the chance of rain on any random day of the year. However we know that it is summer. The minute we say ‘given that it is summer’ we lose interest in all the other days – the yellow circle becomes our universe.
To calculate P(Rain|Summer) we need to know the number of days it rains when it is summer, and divide by the total number of days in summer. In the Venn diagram it means dividing the number of days in the small orange area by the number of days in the yellow area.

Now for some more notation (and mathematics is nothing more than notation). When two areas in a Venn diagram overlap they form an intersection – because they intersect. Mathematicians use an inverted ‘U’ to symbolize this, but we’ll use the ‘&’ sign. In other words ‘rain and summer’ is represented by ‘rain&summer’ – the intersection in the diagram above. We’ve already stated that the probability of rain given that it is summer can be calculated by dividing the number of days in the summer when it rains by the number of days in summer. This can be expressed mathematically as:

P(Rain|Summer) = P(Rain&Summer)/P(Summer) – (i)

Now there is no reason why we cannot reverse the reasoning here and calculate the probability of summer given that it is raining – or P(Summer|Rain). Swapping rain and summer around we get:

P(Summer|Rain) = P(Summer&Rain)/P(Rain) – (ii)

Bear with it – we are almost there.

Now Summer&Rain is just the same as Rain&Summer. Since both the above equations contain P(Summer&Rain) we can isolate this term in both equations by multiplying through by P(Summer) in (i) and P(Rain) in (ii). Having done this we get:

P(Rain|Summer) x P(Summer) = P(Summer|Rain) x P(Rain)

By rearranging we get a simplified form of Bayes rule:

P(Summer|Rain) = (P(Rain|Summer) x P(Summer))/P(Rain)

So what, you might ask. Well, what we have just done is permit the calculation of one conditional probability given that we know its opposite. In this trivial example we can ask what is the probability that it is summer given that it is raining. Well, we know all the probabilities on the right hand side: P(Rain|Summer) = 0.02 as stated earlier, P(Summer) = 90/365 = 0.247, and P(Rain) = 150/365 = 0.411.

So P(Summer|Rain) = (0.02 x 0.247) / 0.411 = 0.012

The probability it is summer given a rainy day is just over one per cent, whereas the probability of a rainy day given summer is two percent. It is often difficult to see what this really means, but if you look at the Venn diagram it can be seen that rainy days are a much larger proportion of all days than summer days, and so the proportion of summer days given it is raining is smaller.
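The same calculation as a couple of lines of Python, just to make the arithmetic explicit:

p_rain_given_summer = 0.02
p_summer = 90 / 365               # approximately 0.247
p_rain = 150 / 365                # approximately 0.411

# Bayes rule: P(Summer|Rain) = P(Rain|Summer) x P(Summer) / P(Rain)
p_summer_given_rain = p_rain_given_summer * p_summer / p_rain
print(round(p_summer_given_rain, 3))   # approximately 0.012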

I’m going to end this piece by introducing two terms – prior and posterior probabilities. As the name suggests a prior probability is one prior to any conditions, and so P(Summer) is a prior because it is totally unconditioned. Whereas P(Summer|Rain) is posterior because it is post a condition – namely that it is raining. This is what Bayes is really all about and why at the beginning I talked about updating probabilities based on evidence. So our example can be formulated as:

P(Summer|Rain) = P(Summer) x P(Rain|Summer)/P(Rain)

In other words, how does the knowledge that it is raining modify our belief that this might be a summer day? Well P(Summer), as we have shown before, is 0.247 or around 25%. As soon as we are given the evidence that it is raining, the probability that it is summer drops to just 0.012, or around 1%.

Now this is a trivial example, and of course most sane people would know the season of the year. Bayes is typically applied when we cannot observe the outcome and the best we can do is use evidence to establish the probability of a particular event happening. It is used in gambling, stock market forecasting, identifying spam, and very heavily by the military. It was even used to predict the likelihood of a space shuttle accident, before the disaster struck Columbia. Bayes gave an estimate of about one in thirty, whereas traditional statistics put the probability at one in a thousand.

Bayes for Business – Part 2

If you are unfamiliar with Bayes please read part 1 of this series.
In the article previous to this one we established a simplified form of Bayes rule for the very specific example involving the weather and the season of the year. The equation looked like this:

P(Summer|Rain) = P(Summer) x P(Rain|Summer)/P(Rain)

Obviously we would want to generalise this and so we can replace ‘summer’ and ‘rain’ with variable events A and B. For those not familiar with probability notation a slight detour is needed. Please read this to get the basic vocabulary.

So our expression for Bayes becomes the celebrated formula:

P(A|B) = P(A) x P(B|A) / P(B)

The way it is expressed here emphasises the relationship between the posterior and the prior probabilities, P(A|B) and P(A) respectively. This formula shows how knowledge that event B has occurred changes the probability that event A has occurred. We need to know all three terms on the right to compute the value on the left. Critics of Bayes quite rightly point out that this information is not always available – but that is not the point. Often it is, and when it is we can perform a little magic that almost always surprises.

The first application to a business scenario we are going to make is that of recruitment. I’m afraid there is a little more mathematical manipulation we have to plough through, but it shouldn’t be too arduous.

The term P(B) is often not known directly, and to create the full blown expression for Bayes we need to consider how P(B) might be broken down to something which is more general. To get to grips with this we need to consider the diagram below.

[Diagram: the sample space (rectangle) partitioned into E1, E2, E3, E4, with an event A overlapping the partitions]

The rectangle represents our sample space and the lines divide it into a number of events (sets) E1, E2, E3, … , En. Of course in the diagram there are just four segments, but there could be any number n. The way the sample space has been divided up is known as a partition – because it partitions the sample space with no overlapping areas. Or to use the jargon, E1, E2, … ,En are pairwise disjoint. Without getting too bogged down in the math it should be clear that A can be reconstructed from its intersections with each of these partitioned areas. More specifically we can state that (and remembering we use the & symbol for an intersection):

A = (A&E1) U (A&E2) U … U (A&En)

The ‘U’ is being used as the symbol for union – similar to addition when numbers are considered instead of sets. Because the partition is pairwise disjoint we can represent the probability of A as the sum of the probabilities of the intersections.

P(A) = P(A&E1) + P(A&E2) + … + P(A&En)

Just one more trick. We saw in the previous article that P(A&E1) is equal to P(A|E1)*P(E1). So substituting this back into the equation above we get:

P(A) = P(A|E1) x P(E1) + P(A|E2) x P(E2) + … + P(A|En) x P(En)

This pain is necessary unfortunately to get to the most useful form of Bayes rule. We are going to move a few things around. We will now consider P(Ei|A) as the posterior probability we wish to calculate. There is a rabbit hole here and I don’t want to go down it right now. But when we are updating a probability using Bayes rule we need to know all the possible scenarios that might have happened. In our ‘summer’, ‘rain’ example we were looking to calculate the probability it is summer given that it has rained. The other options are that it might be winter, spring or autumn. The four seasons together make up the whole sample space of days in the year, and we are using the knowledge that it has rained to establish the most likely season. We only calculated the value for summer and found it was pretty unlikely. The important point here is that we are using the evidence given to us to update the probabilities of a number of alternatives, which together should cover all possibilities. So without further ado let’s restate Bayes:

P(Ei|A) = P(Ei) x P(A|Ei) / (P(A|E1) x P(E1) + P(A|E2) x P(E2) + … + P(A|En) x P(En))

This is the general, and most useful formulation for Bayes rule, and it will become apparent why this is so very soon.

Probably a good time to take a break.

OK – here is the business problem we are going to consider, and not only will it provide a surprising insight into the recruitment process, but it should help us put the pieces together. Your organisation is looking to recruit someone with very specific IT skills – C++, SAP, SOA, Insurance Industry, Linux – not too many of those around. Prior experience shows you that in a situation like this only one out of every twenty applicants is likely to be suitable. But to compensate for this, the recruitment process in your organisation is top notch and gets it right nine times out of ten – 90% of the time.
We are going to use a decision tree to represent this process, and to clarify the reasoning behind the Bayes rule.

[Decision tree: candidates split into suitable and unsuitable, each branch then splitting into accepted and rejected]

The first split on the tree is whether a candidate is suitable. Of course we don’t actually know which candidates are suitable, just that 19 out of 20 are usually unsuitable. What we do know is that given a suitable candidate there is a 90% chance she or he will be selected. Similarly we know there is a 10% chance they will be rejected. The situation reverses for unsuitable candidates – with an acceptance rate of 10% and a rejection rate of 90%. If you have followed so far some bells should be ringing. We’ve just talked about the probability of accepting a candidate given he or she is suitable.

What we want to know is the probability that a candidate is suitable given that they have been selected – this is the only thing we are really interested in. So now I’m going to list the things we know and plug them into Bayes rule – remember we want P(S|A).

We know (writing S for suitable, U for unsuitable, A for accepted and R for rejected):

P(A|S) = 0.9
P(R|S) = 0.1
P(A|U) = 0.1
P(R|U) = 0.9
P(S) = 0.05
P(U) = 0.95

Expressed in terms of Bayes formula we transform

P(Ei|A) = P(Ei) x P(A|Ei) / (P(A|E1) x P(E1) + P(A|E2) x P(E2) + … + P(A|En) x P(En))

to

P(S|A) = P(S) x P(A|S) / (P(A|S) x P(S) + P(A|U) x P(U))

Note that S and U partition our sample space.

When we put the numbers in this becomes:

P(S|A) = (0.05 x 0.9)/(0.05 x 0.9 + 0.1 x 0.95) = 0.32

The probability we recruit a suitable candidate is less than one third at 32%. On the face of it this does not seem reasonable in light of the excellent recruitment skills of the organisation. The probability is skewed by the fact there is such a large number of unsuitable candidates, and even though 90% of them get rejected, the 10% that are accepted is a large slice of the acceptance pie. The solution to this problem is simple enough – do a second round of interviews. We know that for this second round there is a 32% chance that a candidate is suitable. Plug the numbers in and the probability of getting a suitable candidate is 81% – somewhat better.
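For anyone who wants to check the arithmetic, here are both rounds as a few lines of Python (the second round simply reuses the first-round posterior as the new prior):

p_a_given_s, p_a_given_u = 0.9, 0.1       # acceptance rates for suitable/unsuitable

def p_suitable_given_accepted(p_s):
    # Bayes rule with the two-way partition {suitable, unsuitable}
    p_u = 1 - p_s
    return p_s * p_a_given_s / (p_a_given_s * p_s + p_a_given_u * p_u)

first_round = p_suitable_given_accepted(0.05)          # prior: 1 in 20 suitable
print(round(first_round, 2))                           # approximately 0.32
second_round = p_suitable_given_accepted(first_round)  # posterior becomes the new prior
print(round(second_round, 2))                          # approximately 0.81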

In the next article I’m going to go over this example with a fine tooth comb to map the features of this problem to the Bayes rule.

Phew.

GIS Tips – Creating ArcGIS figures with “Arrow” distance markers

It’s always the simple requests from people which make you scratch your head. In this case, it was a simple question: “How do I add distance arrows in ArcGIS?” – this coming from a GIS guy who has used ArcGIS for over six years and was previously employed at Ordnance Survey…

ArcGIS is known for being both simple and technical in equal measure, but when it came to adding a “draughtsman” labelling style I couldn’t find one out of the box at all. So I walked away saying that I just needed to get a coffee, and during those five minutes worked out the solution:

In this run through we are going to show the measurement arrows between our points and the woodland. There are two ways of getting the measurement (three, but one depends on your level of ArcGIS licence). I am going to keep this high-level, and we will draw lines for the measurements. You could easily use the Spatial Join tool or the Near tool for a more accurate measure (see the sketch below); this demo is more about how to create the symbol and label.
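If you do have the licence level for the Near tool, a rough arcpy sketch along these lines would add the distances as a field instead of drawing lines by hand (my addition; the workspace and layer names are placeholders):

import arcpy

arcpy.env.workspace = r"C:\data\demo.gdb"        # hypothetical geodatabase
# Adds NEAR_FID and NEAR_DIST fields to the point feature class
arcpy.Near_analysis("survey_points", "woodland")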

1. Create the lines

Go to the drawing toolbar, select the line tool and draw your lines between the features you wish to measure.

Drawing the Lines

2. Convert the lines to polylines (features)

To create the “length” field which we will use later on in the procedure, we need to convert the drawn lines to features. Select all the lines you wish to use and use the “Convert graphics to features” option on the Drawing toolbar.

Converting the lines to features

3. Start symbolising

Now that you have some line features, we need to turn them into arrows and label them. First click on symbol, then scroll through the symbol selection menu (2) and find the arrow symbol.

 

Symbolise the features
Select the arrow symbol

4. Enable the labelling.

To make the labelling more elegant, we are going to enable Maplex labelling, which ESRI has made available for free in ArcGIS 10 and above. First of all, turn on the Labelling toolbar.

Adding the Labelling toolbar

 

5. Add and position the labels

First of all go to the properties of the line features and enable labels (1). Then, in the bottom left of the menu, select “Placement Properties”, then select “Position” (2). I am going to assume that you aren’t using straight lines; if you are using straight lines then it won’t matter if you select this option anyway. So, with this in mind, select the “Offset Curved” option (3).

This is where the polyline is useful, as you set the label field to the “length” field. If you are using a local coordinate system such as OSGB36 or NAD83, this will be in metres.

Setting up the label position

6. Add the unit of measure to the label.

Now, go to the expression tab (I assume you are using metres but feel free to change the “m” to “km” or “Degrees” or whatever the unit of measure is).

In the label expression box, the format for adding text at the end of the field is – [Fieldname]&”TEXT”

Therefore if the fieldname is [Length] and the distance is in metres, then the expression you need is [Length]&”m”
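If you prefer the Python parser in the Label Expression dialog, something along these lines should do the same job. This is a rough, untested sketch (my addition); ArcGIS substitutes the [Length] token with the field value as a string, hence the conversion:

# Python parser version of the expression, rounding the length to one decimal place
str(round(float([Length]), 1)) + "m"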

Add the expression

 

7. Feel smug that it’s done

Having followed the steps above, the final symbols and labels should look something like the ones below. If you want to get really cool, you can go into the symbol properties and adjust the cartographic line arrow styles, or update the expression to show multiple distance formats ([Length]&”m “&[Length]/1000&”km”).

My final map, what does yours look like?

 

Nick D

MXD to QGIS

Anyone who went to the inaugural Scottish QGIS UK user group meet would not only have had a great time and enjoyed the great company of Ross & thinkWhere, but would also have seen Neil Benny’s presentation “QGIS Evangelism”, which I only just caught a couple of days ago. Neil is a geoninja of the highest order; he sneaked in a little slide showing a tool for ArcGIS called Mxd2Qgs…

Over the last six years of using ArcGIS and QGIS alongside one another, the thought of a tool that could translate MXDs has been a godsend. I have been asked so many times by clients and other consultants whether this holy grail existed (in a working format) that I was ACTUALLY thinking of building one… so after seeing the presentation I got straight onto Lutra Consulting… then realised I’d got it completely wrong and got onto thinkWhere!

Although the website I was given was in Italian, it wasn’t too much effort to translate; below is the blog post in English:

There is a little tool for ArcGIS, created by Allan Maungu, that lets you convert a project file from ESRI’s proprietary .mxd format to the .qgs format used by Quantum GIS: https://sites.google.com/site/lumtegis/files/Mxd2Qgs.zip. As is well known, the .mxd project file contains no data itself, only the links to any Shapefile, GeoTIFF and other datasets together with their display settings. The installation of the tool is pretty simple.
MXD to QGS step 1

Once you have extracted the contents of the .zip file, you need to install the Python library by launching the .exe file inside the unzipped folder.

MXD to QGS step 2

Then we start ArcMap (compatibility is guaranteed with ArcGIS 10 and later), right-click with the mouse and select “Add Toolbox…”.

MXD to QGS step 3

From the next window we browse to the path of the freshly unzipped file “Z_ArcMap to Qgs.tbx” and click on “Open” to add it to our ArcToolbox.

MXD to QGS step 4

Finally, right-click on the new tool “Z_ArcMap to QGS > ArcMap to Quantum GIS” and select “Properties” to point it at the location of the “mxd2qgs.py” script, also found in the downloaded .zip.

MXD to QGS window

Its use is immediate: there is no need to set any parameters except, of course, the destination of the output file.

MXD to QGS error

In a test carried out on a large project with considerable complexity and a lot of processing on the data, the toolbox did not complete the operation.

With a simpler project there were no problems.

Nick D

Don’t be such a geomagician!

I was dealing with a client the other day and they said to me: “The reason I like you so much is that you make this GIS stuff so easy to understand…”
When I enquired further, the client compared GIS to a magic box which was only known to a select few… they even pointed out that the name “GIS” was confusing and Google really didn’t help. I then looked in the newspapers and saw the headline:

Missing flight MH370: Robotic submarine to begin search

Referring, of course, to the Autonomous Underwater Vehicle (AUV) which will be used to collect multibeam and sidescan sonar (SSS) data of the seabed. This is, though, a true reflection of how far we specialists can be from the real world when it comes to conversation. So where does the problem lie? Is it that people need education, or is it that we need to carry geogeek dictionaries around with us? And what is it that I do differently?

Geomagician

Is it me?

Here’s a little secret (Shh! don’t tell anyone!)… I didn’t leave school and go to university to study GIS; in fact I studied electrical and electronic engineering, and after that I managed a pub/nightclub before moving into mapping and GIS (it seemed like an obvious choice). What I discovered upon entering the GIS industry was quite scary: lots of very clever people working in their own little boxes, not sharing, each thriving off the credit for their work.

I’ve got to be honest, the open source community scared me. My first steps were hard: not knowing who to go to with questions, and searching for answers online, was tough… obviously, 15 years on, a lot has changed, but it can still be a baptism of fire. That’s not to say that the proprietary software community didn’t scare me either… there was a lot of time where I would be asking “Why can’t I do this?” and spending hours reading manuals which were bloated with text but light on content… and let’s not mention the training courses!

So, enough about me… how can this help you to be a better geogeek? Well, here are my top tips for getting on with the normal people on this planet (I by no means count myself as one any more either!)

  1. Just because someone is nodding, don’t expect them to really understand you – to the average person, longitude and latitude are stuff geeks deal with, and “z factor” sounds like a poor cousin of a Simon Cowell show. Break it down: assume you are talking to a 12-year-old… or your mother!
  2. Help your fellow geogeeks – spending 2 mins helping a noob might repay you further on. Within this business we all specialise and go niche, so one day you’ll find those people you helped being valuable pools of knowledge.
  3. Get involved – I spent years working in my little bubble, being quite successful doing what I do. The day I started talking with people on Twitter and joining the little GIS groups was the day I REALLY started to learn.
  4. Be honest, if you don’t know, ask – Nothing is worse than assuming you know and baffling a client with bull. Be honest, ask other geopeople, and not only will you get a better understanding, but you may just earn a little trust and respect.
  5. Google is your friend – I sound stupid right? Not really…I often Google terms & if they are translated or have a more “normal” term – like, for example, the “Robot Submarine” term. This makes sure that when you speak with people you are using terms they will understand.
  6. Share – Don’t be afraid to lose your “specialism” if you share some of your skills. What I found is that by sharing some of my geodetic know-how, I got to see a better system for dealing with coordinates in GIS software & the inclusion of OSTN02 (a personal mission of mine) into all the major systems.
  7. Get to know at least One GIS Developer, One Surveyor, One Manager & One person in the same field as you – This will keep you and the others relevant. The developer will help you put things into perspective from the software side (you can help by feeding back detail on how it SHOULD work). The surveyor will keep things in perspective by making sure the correct references are used and everything is grounded in the “real world” (you can help by showing how the data can be utilised and analysed). The manager will help you to understand the business needs and thinking (you can help by keeping them up to date on technology).
  8. Stop using the word “GIS” – What is it? Is it REALLY what you do, or do you really work as a data manager or as a geospatial system operator? Nothing turns people off more than beginning with the word GIS! (Yes, I am still employed as a GIS Consultant, BTW)
Group Hug

Please feel free to suggest more do’s and don’ts – the more the merrier. I have definitely found the above has helped me to understand the geo-world around me and kept me from sitting in a corner measuring different vertical datums.

Below are some more links to groups worth getting involved in (mostly UK centric)

Links to groups:

UK QGIS User Group

British Cartographic Society

Association of Geographic Information

Twitter people worth following

@Dragons8mycat, @UnderdarkGIS, @Timlinux, @QGIS, @ESRIUK, @OrdnanceSurvey, @Geospatialnews, @Directionsmag, @ElliotHartley, @Boundlessgeo, @Nordpil, @Geosmiles

Data sources:

The Spatial Blog Data Sources

Nick D